1.0 + 2.0 = +inf
I have seen this kind of math before from Valve. They were years ahead in number formats. This is why we can't have Portal 3!
There is a really interesting paper from Microsoft on 1-bit/1.58 transformers using only 1, 0, -1.
It's titled 'BitNet: Scaling 1-bit Transformers for Large Language Models'.
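If anyone's curious what that looks like in practice, here's a minimal numpy sketch of the absmean ternary quantization described in the follow-up b1.58 paper (my reading of it, not the authors' code):

    import numpy as np

    # Ternary (1.58-bit) weight quantization, absmean style: scale by the
    # mean absolute weight, then round-and-clip every weight to {-1, 0, +1}.
    def ternarize(w, eps=1e-5):
        scale = np.mean(np.abs(w)) + eps
        return np.clip(np.round(w / scale), -1, 1), scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = ternarize(w)
    print(q)      # entries are only -1, 0, or 1
    print(q * s)  # coarse approximation of w

The point is that a matmul against q needs no multiplies at all, just adds, subtracts and skips.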
Geez .. this is almost like .. if-then-else-elseif .. or switch:case-0:case-1:else
Makes me wonder now ..
I'm going to keep watching these in the hope that I one day understand what the hell you're actually saying. Starting to recognize words.
If you play the video on your phone and put the phone under your pillow, you will get smarter in your sleep. At least that's my approach.
You are censoring your poor t-shirt. Let it speak!
He is just demonstrating the low precision AI/ML.
It's not about the print. It's to prevent groupies from hitting on him. 😏
I saw a red logo on the left and the letter D in white. Some additional letters on the right side.
@@jrherita It's an AMD (RDNA) T-shirt, and he's talking about Nvidia, so I guess I'm not totally surprised here... though you'd think he'd just wear a different shirt instead, the flickering is rather distracting.
Be proud of wearing that RDNA shirt! I wish I had one lol
Trying to hide it just brings more attention to it. Should not have bothered imo.
What is the problem with the RDNA2 shirt?
The adoption of reduced/mixed-precision compute is mostly about constraining memory footprint and register allocation; the latter determines the effective thread parallelism in the GPU execution model. Reduced precision is particularly suited to ML workloads, where large data types are just overkill. Every byte and joule matters in this AI hype race.
FP4: Let's have 6 NaNs instead of ±[ 0, 0.5, 1.0, *1.5,* 2, *2.5* ].
NVidia: BRILLIANT!
LET THE TSHIRT SPEAK!
I love your die-shots background!!
Ian is hinting that Blackwell's inside secret is RDNA2 😅
Any reason microscaling doesn't work with an integer on top of an FP32? Seems much less wasteful.
Where are the sources? Specifically the website with the Dojo format sheets; they look interesting.
What's going on with your shirt?
He's clearly being overtaken by ChatGPT.
that's the Blackwell tshirt
AMD😂.
When we finally figure out the binary-level process, are we going to drop all these wonky intermediate number formats?
If you can set up your inference engine such that it only needs to distinguish a and not-a, you only need 1 bit
But it's a big "if"
Why would they waste a bit on infinity?
It's more like, if this bit is a 1, then it's an infinity.
I've thought about how I would convert Int8 into FP8. It has 256 possible values, -128 to +127.
So just divide by 100 to get -1.28 to +1.27. It seems like a good range because NNs like numbers between -1 and +1 in general. But if you need more range, just multiply by 2 to get -2.56 to +2.54 (step value 0.02 instead of 0.01).
how important is accuracy there? going from 7 bits to 3 bits significand will result in quite a bit of loss
@@defeqel6537 My example is a fixed decimal point with 3 digits. Very easy to interpret. Seems useful in NNs. In LLM matrix operations they may need NaNs and +/-0 for some reason I don't know; it could be adjusted in various ways to allow for them and maintain simplicity. fp8_emulation.py on GitHub and float_table_generator.py seem pretty complex when working with FP8.
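A minimal sketch of that divide-by-100 mapping above (fixed-point with a hypothetical scale of 1/100, not real FP8):

    import numpy as np

    # int8 as fixed-point: value = stored_int / 100, giving -1.28..+1.27
    # in steps of 0.01; halve SCALE for -2.56..+2.54 in steps of 0.02.
    SCALE = 100.0  # hypothetical scale from the comment above

    def encode(x):
        return np.clip(np.round(x * SCALE), -128, 127).astype(np.int8)

    def decode(q):
        return q.astype(np.float32) / SCALE

    x = np.array([-1.0, 0.123, 0.9876], dtype=np.float32)
    q = encode(x)
    print(q, decode(q))  # [-100 12 99] -> [-1.0, 0.12, 0.99]

As defeqel says though, real FP8 trades that uniform step for more resolution near zero, which is why the emulation scripts look so much more complex.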
Looks like some version of floating point compression.
I had a crazy thought. I've heard of homologous encryption, so why not homologous compression? I think the answer is that homologous encryption is processor-intensive, and if it's the same for homologous compression, it would be pointless. That's if it's even possible. So, is homologous compression so crazy it will never work, or just completely mad unless an absolute genius has a brilliant idea for making it work? (It might make a good April 1st tech story.)
@@riffsoffov9291 theoretically compression would work on any data that has duplicate data; beyond that, it's tough for me to visualize
Sorry, I got the term wrong, I was thinking of Homomorphic (not homologous) Encryption. It's like, if you want software in the cloud to make a calculation but you want to keep the numbers secret, HE is a way of encrypting the numbers in such a way that the calculation can be done without decrypting them. The result is naturally in encrypted form, and isn't decrypted until it comes back from the cloud. It's more secure than if the numbers were decrypted before adding them. So my crazy idea is homomorphic compression, i.e. compressing the numbers and only decompressing the end result. (BTW homomorphic encryption can be split into FHE ("F" for Fully) or PHE ("P" for Partial), and I don't know if either is used much or at all in practice, because of the overhead.)
I'm back. When I googled "homologous encryption" I got no results, because I'd remembered the term wrong. I've just googled "homomorphic compression" and it's at least a concept already. Here's an extract from one result: "Accelerating Distributed Deep Learning Using Tensor ...
The author introduces Tensor Homomorphic Compression (THC) as a novel bi-directional compression framework to address communication overhead in distributed "
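For the curious, the "compute without decrypting" part is real: textbook RSA is already multiplicatively homomorphic, so multiplying two ciphertexts multiplies the hidden plaintexts. A toy demo (tiny, insecure parameters, illustration only):

    # Textbook RSA: E(a) * E(b) mod n == E(a * b), so the server can
    # multiply numbers it cannot read. Needs Python 3.8+ for pow(e, -1, m).
    p, q = 61, 53
    n = p * q                # 3233
    phi = (p - 1) * (q - 1)
    e = 17                   # public exponent
    d = pow(e, -1, phi)      # private exponent (modular inverse)

    def encrypt(m): return pow(m, e, n)
    def decrypt(c): return pow(c, d, n)

    a, b = 7, 6
    c = (encrypt(a) * encrypt(b)) % n   # done without the private key
    assert decrypt(c) == a * b          # 42

Fully homomorphic schemes extend this to arbitrary additions and multiplications, which is where the huge overhead comes from.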
FP4 green screen. haha jk great video
I want to see a fully programmable floating-point unit that can handle a wide array of formats: signed, unsigned, different exponent/mantissa sizes, the offset mentioned, vector operations with shared components, huge formats (fp128), as well as fixed-point formats (float is really just fixed point + size data).
try building it in an FPGA and you'll see why people don't do that
@@qbqbqdbq it's known that FPGAs are slow, but I'm talking more like an instruction set extension.
@@LordOfNihil multipliers have quadratic gate complexity on ASICs too, dude. You need to learn some digital design.
Jensen's jelly beans, I don't know what this means.
When you're determined to look like a potato😂
wasting 4/16 values on NaN seems idiotic
Wait till you hear about fp32 which has 16,777,214 NaNs
Also, having +0 and -0 seems a bit decadent when that's a quarter of your total numbers.
@@unvergebeneid it makes the hardware implementation of some things easier. You're trying to save hardware power expenditure, not convenience to humans; fp16/32/64 are the convenient-for-humans formats :)
@@andytroo sure, but you're wasting bits so it's somewhat surprising to me using an integer type isn't more efficient. I'm not doubting the engineers working on this mind you, I'm just voicing my surprise.
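For anyone wanting to verify that fp32 NaN count from above: a binary32 NaN is any pattern with an all-ones exponent and a nonzero mantissa, over both signs:

    # 2 signs * (2^23 - 1) nonzero mantissas under the all-ones exponent;
    # the 2 remaining all-ones-exponent patterns are +Inf and -Inf.
    print(2 * (2**23 - 1))  # 16777214

So the waste is proportionally tiny in fp32, but in FP4 the same special values would eat a quarter of all encodings.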
next step: -inf 0 1 +inf
new bool-logic: "may be", "i know" and "you do not know"
Marketing cannot be defeated. Nvidia marketing doubly cannot be defeated.
I wonder what impact the earthquake in Taiwan will have on TSMC's ability to deliver for Nvidia and others. Has there been much said about damage to fabs? It was quite a large one.
Hear me out... FP1
Negative infinity vs positive infinity. I like it.
The RDNA2 T-shirt flickering in and out is insanely distracting
Why +0, -0, and multiple NaNs (Not a Number) ❓
Nice AMD RDNA2 t-shirt
Great review. The scaling factor helps reduce the hit to accuracy while maintaining the memory and compute savings of the lower-precision formats.
Hmm, it doesn't sound like microscaling saves die space.
It will save working memory, however, and that is a factor too.
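It saves memory and bandwidth rather than die area, yes. For anyone wondering what the shared scale actually does, here's a rough numpy sketch of the idea (my reading of MX-style microscaling: hypothetical block size 32, power-of-two shared scale, E2M1 element grid):

    import numpy as np

    # One block of 32 values shares a single power-of-two scale; each
    # element is then snapped to the nearest FP4 (E2M1) magnitude.
    FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])  # positive E2M1 values

    def quantize_block(block):
        amax = np.abs(block).max()
        # smallest power of two that brings the block max within the grid
        scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
        scaled = block / scale
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        return np.sign(scaled) * FP4_GRID[idx] * scale

    x = np.random.randn(32).astype(np.float32)
    print(np.abs(x - quantize_block(x)).max())  # error for one block

The scale costs one extra byte per 32 elements, so the footprint stays close to 4 bits per value while the representable range follows each block's data.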
4-bit quantized LLMs are delivering amazingly accurate inference results at high speed.
They are INT4 though, not FP4
Going to mixed/low precision requires to know exactly what the used number range is, and spacing the few(er) available numbers on the number line such that they cover the most frequently used range the best. This is why we see so many custom formats now for different applications - I've even created my own custom FP16 format for lattice Boltzmann.
But FP4 can hold only 16 different number states, of which at least 5 are unusable for ±NaN, ±Inf and -0; there are even more NaNs when going with the IEEE-754 definition. You can't do any sensible math with that, not even with a shared exponent (which makes it not a simple code change at all).

Calling FP4 "floating-point" operations or Flops is very far-fetched. The point doesn't really float anymore with so few states, and fixed-point/integer would be more appropriate for this kind of bit-mashing. IMO it's just marketing nonsense to claim better "Flops" performance. Nvidia even goes as far as to compare FP4 performance with FP8 and FP16, which makes no sense at all, as you can't simply replace FP16 with FP8 or FP4.

And I thought it was peak marketing already when they named their proprietary 19-bit floating-point format "TF32". I'm waiting for the day when Nvidia invents the FP1 format, with only 2 possible states of ±NaN, and sells it to you as XX PFlops of performance.
I have no idea about the ML field, but it seems to me that INT4/INT6 would be much more usable. I guess NaNs and Infs can be useful in the field
@degenerate_mercenary9898 Yes, and FP4 behaves very much like an integer here, but with very poor granularity on top. Any sort of arithmetic between two FP4 numbers is going to fail.
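To make the "16 states" point concrete, here's a sketch that decodes every bit pattern of a hypothetical IEEE-style E2M1 layout with bias 1 (the actual grid depends on which bias and special-value rules a spec picks, so treat this as illustrative):

    # 1 sign, 2 exponent, 1 mantissa bit; all-ones exponent reserved for
    # Inf/NaN as in IEEE-754. That leaves ±{0, 0.5, 1, 1.5, 2, 3} usable.
    def decode_fp4(bits):
        s, e, m = (bits >> 3) & 1, (bits >> 1) & 3, bits & 1
        sign = -1 if s else 1
        if e == 3:                       # reserved: Inf / NaN
            return sign * float("inf") if m == 0 else float("nan")
        if e == 0:                       # subnormal: 0.m * 2^(1-bias)
            return sign * m * 0.5
        return sign * (1 + m * 0.5) * 2 ** (e - 1)  # normal: 1.m * 2^(e-bias)

    for i in range(16):
        print(f"{i:04b} -> {decode_fp4(i)}")

Counting it up: 2 NaNs, ±Inf and -0 burn 5 of the 16 encodings, exactly the waste described above.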
all the real ones know trinary is the ultimate precision reduction
Lmao RDNA2 shirt
A video about artesia labs would be cool
NGL that tee shirt starting to mess with me
GOPs, GFLOPs, TOPs, POPs
Why bother with a bit for infinities?
It's not really a bit for infinities, but any time there's a 1, it's an infinity
you are terrible at explaining bro, sorry
Good to know. It's why Intel, AMD, IBM, Qualcomm all pay me to help them explain stuff.
Dr. Cutress: Can you please do a "preview" of the Blackwell "Prosumer" GPU for the masses? As far as A.I. is concerned. Like LLM training using local data only on my own PC. For ALL video/text/graphics. THANK U. I DON'T CARE about the Internet. I'm *ONLY* concerned with my own data.
I am assuming that CUDA will understand how to use FP4 & FP6. Given that CDNA4 will also have microscaling, PyTorch & ONNX will need to handle FP4 & FP6.
Interesting, now it makes more sense. Simply going to FP4 wouldn't work; there had to be more to it, and as you showed, there is: microscaling.
Why is your shirt a green screen
I just checked the OCP MX spec, and the values look different from 1:45. Not sure how to interpret that. Is this a different spec?
So... the real question is... what's the inscription on the t-shirt?
When forgetting to change t-shirt has real consequences ;)
Nvidia abstraction scam
Thank you for this video!
Should've changed the T-shirt before you started this, Ian.
Yeah, I know. Only saw it a couple days later after filming
I must be missing something. Couldn't you do the calculation in higher precision, compare it to the same calculation in lower precision automatically, and let the program set its own precision? It would take a little more computation than the 100% optimal approach, but after a while the program would find the optimal precision and only check once in a while to make sure the results are still good enough.
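Something like this can work; here's a sketch of the idea (the tolerance and check interval are made-up knobs): do the work in low precision, periodically redo it in high precision, and fall back when the error grows.

    import numpy as np

    # Compute in fp16, but every `every` calls redo the result in fp64 and
    # compare; drop back to fp64 permanently if the error exceeds `tol`.
    class AdaptiveDot:
        def __init__(self, tol=1e-3, every=100):
            self.tol, self.every, self.step, self.use_low = tol, every, 0, True

        def __call__(self, a, b):
            self.step += 1
            if not self.use_low:
                return float(np.dot(a.astype(np.float64), b.astype(np.float64)))
            low = float(np.dot(a.astype(np.float16), b.astype(np.float16)))
            if self.step % self.every:   # skip the reference check most steps
                return low
            ref = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
            self.use_low = abs(low - ref) <= self.tol * max(abs(ref), 1.0)
            return low if self.use_low else ref

    dot = AdaptiveDot()
    x, y = np.random.randn(256), np.random.randn(256)
    print(dot(x, y))

The catch in training is that "good enough" drifts with the data, so the check interval itself becomes a hyperparameter.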
AI SUCKS!
Take THAT AI!
*After 10 years when AI takes over Humanity*
AI:- So how should we execute him for speaking against us
@@aidanm5578 wait, there’s more. AI Sucks!
@@computerscience1101 you suck