1.0 + 2.0 = +inf
I have seen this kind of math before from Valve. They were years ahead in number formats. This is why we can't have Portal 3!
There is a really interesting paper from Microsoft on 1-bit/1.58 transformers using only 1, 0, -1.
It's titled 'BitNet: Scaling 1-bit Transformers for Large Language Models'.
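If anyone's curious what that looks like in practice, here's a minimal numpy sketch of the absmean ternary quantization described in the follow-up b1.58 paper (my reading of it, not the authors' code):

    import numpy as np

    # Ternary (1.58-bit) weight quantization, absmean style: scale by the
    # mean absolute weight, then round-and-clip every weight to {-1, 0, +1}.
    def ternarize(w, eps=1e-5):
        scale = np.mean(np.abs(w)) + eps
        return np.clip(np.round(w / scale), -1, 1), scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = ternarize(w)
    print(q)      # entries are only -1, 0, or 1
    print(q * s)  # coarse approximation of w

The point is that a matmul against q needs no multiplies at all, just adds, subtracts and skips.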
Geez .. this is almost like .. if-then-else-elseif .. or switch:case-0:case-1:else
Makes me wonder now ..
I'm going to keep watching these in the hope that I one day understand what the hell you're actually saying. Starting to recognize words.
If you play the video on your phone and put the phone under your pillow, you will get smarter in your sleep. At least that's my approach.
You are censoring your poor t-shirt. Let it speak!
He is just demonstrating the low precision AI/ML.
It's not about the print. It's to prevent groupies from hitting on him. 😏
I saw a red logo on the left and the letter D in white. Some additional letters on the right side.
@@jrherita It's an AMD (RDNA) T-shirt, and he's talking about Nvidia, so I guess I'm not totally surprised here... though you'd think he'd just wear a different shirt instead, the flickering is rather distracting.
Be proud of wearing that RDNA shirt! I wish I had one lol
Trying to hide it just brings more attention to it. Should not have bothered imo.
What is the problem with the RDNA2 shirt?
The adoption of reduced/mixed-precision compute is mostly about constraining memory footprint and register allocation; the latter determines the effective thread parallelism in the GPU execution model. Reduced precision is particularly suited to ML workloads, where large data types are just overkill. Every byte and joule matters in this AI hype race.
FP4: Let's have 6 NaNs instead of ±[ 0, 0.5, 1.0, *1.5,* 2, *2.5* ].
NVidia: BRILLIANT!
LET THE TSHIRT SPEAK!
I love your die-shots background!!
Ian is hinting that Blackwell's inside secret is RDNA2 😅
Any reason microscaling doesn't work with an integer on top of an FP32? Seems much less wasteful.
Where are the sources? Specifically the website with the Dojo format sheets; they look interesting.
What's going on with your shirt?
He's clearly being overtaken by ChatGPT.
that's the Blackwell tshirt
AMD😂.
When we finally figure out the binary-level process, are we going to drop all these wonky intermediate number formats?
If you can set up your inference engine such that it only needs to distinguish a and not-a, you only need 1 bit
But it's a big "if"
Why would they waste a bit on infinity?
It's more like, if this bit is a 1, then it's an infinity.
I've thought about how I would convert Int8 into FP8. It has 256 possible values, -128 to +127.
So just divide by 100 to get -1.28 to +1.27. It seems like a good range because NNs like numbers between -1 and +1 in general. But if you need more range, just multiply by 2 to get -2.56 to +2.54 (step value 0.02 instead of 0.01).
how important is accuracy there? going from 7 bits to 3 bits significand will result in quite a bit of loss
@@defeqel6537 My example is a fixed decimal point with 3 digits. Very easy to interpret. Seems useful in NNs. In LLM matrix operations they may need NaNs and +/-0 for some reason I don't know; it could be adjusted in various ways to allow for them and maintain simplicity. fp8_emulation.py on GitHub and float_table_generator.py seem pretty complex when working with FP8.
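A minimal sketch of that divide-by-100 mapping above (fixed-point with a hypothetical scale of 1/100, not real FP8):

    import numpy as np

    # int8 as fixed-point: value = stored_int / 100, giving -1.28..+1.27
    # in steps of 0.01; halve SCALE for -2.56..+2.54 in steps of 0.02.
    SCALE = 100.0  # hypothetical scale from the comment above

    def encode(x):
        return np.clip(np.round(x * SCALE), -128, 127).astype(np.int8)

    def decode(q):
        return q.astype(np.float32) / SCALE

    x = np.array([-1.0, 0.123, 0.9876], dtype=np.float32)
    q = encode(x)
    print(q, decode(q))  # [-100 12 99] -> [-1.0, 0.12, 0.99]

As defeqel says though, real FP8 trades that uniform step for more resolution near zero, which is why the emulation scripts look so much more complex.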
Looks like some version of floating point compression.
I had a crazy thought. I've heard of homologous encryption, so why not homologous compression? I think the answer is that homologous encryption is processor-intensive, and if it's the same for homologous compression, it would be pointless. That's if it's even possible. So, is homologous compression so crazy it will never work, or just completely mad unless an absolute genius has a brilliant idea for making it work? (It might make a good April 1st tech story.)
@@riffsoffov9291 theoretically compression would work on any data that has duplicate data; beyond that, it's tough for me to visualize
Sorry, I got the term wrong, I was thinking of Homomorphic (not homologous) Encryption. It's like, if you want software in the cloud to make a calculation but you want to keep the numbers secret, HE is a way of encrypting the numbers in such a way that the calculation can be done without decrypting them. The result is naturally in encrypted form, and isn't decrypted until it comes back from the cloud. It's more secure than if the numbers were decrypted before adding them. So my crazy idea is homomorphic compression, i.e. compressing the numbers and only decompressing the end result. (BTW homomorphic encryption can be split into FHE ("F" for Fully) or PHE ("P" for Partial), and I don't know if either is used much or at all in practice, because of the overhead.)
I'm back. When I googled "homologous encryption" I got no results, because I'd remembered the term wrong. I've just googled "homomorphic compression" and it's at least a concept already. Here's an extract from one result: "Accelerating Distributed Deep Learning Using Tensor ...
The author introduces Tensor Homomorphic Compression (THC) as a novel bi-directional compression framework to address communication overhead in distributed "
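For the curious, the "compute without decrypting" part is real: textbook RSA is already multiplicatively homomorphic, so multiplying two ciphertexts multiplies the hidden plaintexts. A toy demo (tiny, insecure parameters, illustration only):

    # Textbook RSA: E(a) * E(b) mod n == E(a * b), so the server can
    # multiply numbers it cannot read. Needs Python 3.8+ for pow(e, -1, m).
    p, q = 61, 53
    n = p * q                # 3233
    phi = (p - 1) * (q - 1)
    e = 17                   # public exponent
    d = pow(e, -1, phi)      # private exponent (modular inverse)

    def encrypt(m): return pow(m, e, n)
    def decrypt(c): return pow(c, d, n)

    a, b = 7, 6
    c = (encrypt(a) * encrypt(b)) % n   # done without the private key
    assert decrypt(c) == a * b          # 42

Fully homomorphic schemes extend this to arbitrary additions and multiplications, which is where the huge overhead comes from.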
FP4 green screen. haha jk great video
I want to see a fully programmable floating-point unit that can handle a wide array of formats: signed, unsigned, different exponent/mantissa sizes, the offset mentioned, vector operations with shared components, huge formats (fp128), as well as fixed-point formats (float is really just fixed point + size data).
try building it in an FPGA and you'll see why people don't do that
@@qbqbqdbq it's known that FPGAs are slow, but I'm talking more like an instruction set extension.
@@LordOfNihil multipliers have quadratic gate complexity on ASICs too, dude. You need to learn some digital design.
Jensen's jelly beans, I don't know what this means.
When you're determined to look like a potato😂
wasting 4/16 values on NaN seems idiotic
Wait till you hear about fp32 which has 16,777,214 NaNs
Also, having +0 and -0 seems a bit decadent when that's a quarter of your total numbers.
@@unvergebeneid it makes the hardware implementation of some things easier. You're trying to save hardware power expenditure, not convenience to humans; fp16/32/64 are the convenient-for-humans formats :)
@@andytroo sure, but you're wasting bits so it's somewhat surprising to me using an integer type isn't more efficient. I'm not doubting the engineers working on this mind you, I'm just voicing my surprise.
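For anyone wanting to verify that fp32 NaN count from above: a binary32 NaN is any pattern with an all-ones exponent and a nonzero mantissa, over both signs:

    # 2 signs * (2^23 - 1) nonzero mantissas under the all-ones exponent;
    # the 2 remaining all-ones-exponent patterns are +Inf and -Inf.
    print(2 * (2**23 - 1))  # 16777214

So the waste is proportionally tiny in fp32, but in FP4 the same special values would eat a quarter of all encodings.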
next step: -inf 0 1 +inf
new bool-logic: "may be", "i know" and "you do not know"
Marketing cannot be defeated. Nvidia marketing doubly cannot be defeated.
I wonder what impact the earthquake in Taiwan will have on TSMC's ability to deliver for Nvidia and others. Has there been much said about damage to fabs? It was quite a large one.
Hear me out... FP1
Negative infinity vs positive infinity. I like it.
The RDNA2 T-shirt flickering in and out is insanely distracting
Why +0, -0, and multiple NaNs (Not a Number) ❓
Nice AMD RDNA2 t-shirt
Great review. The scaling factor helps reduce the hit to accuracy while maintaining the memory and compute savings of the lower-precision formats.
Hmm, it doesn't sound like microscaling saves die space.
It will save working memory, however, and that is a factor too.
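It saves memory and bandwidth rather than die area, yes. For anyone wondering what the shared scale actually does, here's a rough numpy sketch of the idea (my reading of MX-style microscaling: hypothetical block size 32, power-of-two shared scale, E2M1 element grid):

    import numpy as np

    # One block of 32 values shares a single power-of-two scale; each
    # element is then snapped to the nearest FP4 (E2M1) magnitude.
    FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])  # positive E2M1 values

    def quantize_block(block):
        amax = np.abs(block).max()
        # smallest power of two that brings the block max within the grid
        scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
        scaled = block / scale
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        return np.sign(scaled) * FP4_GRID[idx] * scale

    x = np.random.randn(32).astype(np.float32)
    print(np.abs(x - quantize_block(x)).max())  # error for one block

The scale costs one extra byte per 32 elements, so the footprint stays close to 4 bits per value while the representable range follows each block's data.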
4-bit quantized LLMs are delivering amazingly accurate inference results at high speed.
They are INT4 though, not FP4
Going to mixed/low precision requires to know exactly what the used number range is, and spacing the few(er) available numbers on the number line such that they cover the most frequently used range the best. This is why we see so many custom formats now for different applications - I've even created my own custom FP16 format for lattice Boltzmann.
But FP4 can hold only 16 different number states, of which at least 5 are unusable for ±NaN, ±Inf and -0; there are even more NaNs when going with the IEEE-754 definition. You can't do any sensible math with that, not even with a shared exponent (which makes it not a simple code change at all).

Calling FP4 "floating-point" operations or Flops is very far-fetched. The point doesn't really float anymore with so few states, and fixed-point/integer would be more appropriate for this kind of bit-mashing. IMO it's just marketing nonsense to claim better "Flops" performance. Nvidia even goes as far as to compare FP4 performance with FP8 and FP16, which makes no sense at all, as you can't simply replace FP16 with FP8 or FP4.

And I thought it was peak marketing already when they named their proprietary 19-bit floating-point format "TF32". I'm waiting for the day when Nvidia invents the FP1 format, with only 2 possible states of ±NaN, and sells it to you as XX PFlops of performance.
I have no idea about the ML field, but it seems to me that INT4/INT6 would be much more usable. I guess NaNs and Infs can be useful in the field
@degenerate_mercenary9898 Yes, and FP4 behaves very much like an integer here, but with very poor granularity on top. Any sort of arithmetic between two FP4 numbers is going to fail.
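To make the "16 states" point concrete, here's a sketch that decodes every bit pattern of a hypothetical IEEE-style E2M1 layout with bias 1 (the actual grid depends on which bias and special-value rules a spec picks, so treat this as illustrative):

    # 1 sign, 2 exponent, 1 mantissa bit; all-ones exponent reserved for
    # Inf/NaN as in IEEE-754. That leaves ±{0, 0.5, 1, 1.5, 2, 3} usable.
    def decode_fp4(bits):
        s, e, m = (bits >> 3) & 1, (bits >> 1) & 3, bits & 1
        sign = -1 if s else 1
        if e == 3:                       # reserved: Inf / NaN
            return sign * float("inf") if m == 0 else float("nan")
        if e == 0:                       # subnormal: 0.m * 2^(1-bias)
            return sign * m * 0.5
        return sign * (1 + m * 0.5) * 2 ** (e - 1)  # normal: 1.m * 2^(e-bias)

    for i in range(16):
        print(f"{i:04b} -> {decode_fp4(i)}")

Counting it up: 2 NaNs, ±Inf and -0 burn 5 of the 16 encodings, exactly the waste described above.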
all the real ones know trinary is the ultimate precision reduction
Lmao RDNA2 shirt
A video about artesia labs would be cool
NGL that tee shirt starting to mess with me
GOPs, GFLOPs, TOPs, POPs
Why bother with a bit for infinities?
It's not really a bit for infinities, but any time there's a 1, it's an infinity
you are terrible at explaining bro, sorry
Good to know. It's why Intel, AMD, IBM, Qualcomm all pay me to help them explain stuff.
Dr. Cutress: Can you please do a "preview" of the Blackwell "Prosumer" GPU for the masses? As far as A.I. is concerned. Like LLM training using local data only on my own PC. For ALL video/text/graphics. THANK U. I DON'T CARE about the Internet. I'm *ONLY* concerned with my own data.
I am assuming that CUDA will understand how to use FP4 & FP6. Given that CDNA4 will also have microscaling, PyTorch & ONNX will need to handle FP4 & FP6.
Interesting, now it makes more sense. Simply going to FP4 wouldn't work; there had to be more to it, and as you showed, there is: microscaling.
Why is your shirt a green screen
I just checked the OCP MX spec, and the values look different from 1:45. Not sure how to interpret that. Is this a different spec?
So... the real question is... what's the inscription on the t-shirt?
When forgetting to change t-shirt has real consequences ;)
Nvidia abstraction scam
Thank you for this video!
Should've changed the T-shirt before you started this, Ian.
Yeah, I know. Only saw it a couple days later after filming
I must be missing something. Couldn't you do the calculation in higher precision, compare it to the same calculation in lower precision automatically, and let the program set its own precision? It would take a little more computation than the 100% optimal approach, but after a while the program would find the optimal precision and only check once in a while to make sure the results are still good enough.
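Something like this can work; here's a sketch of the idea (the tolerance and check interval are made-up knobs): do the work in low precision, periodically redo it in high precision, and fall back when the error grows.

    import numpy as np

    # Compute in fp16, but every `every` calls redo the result in fp64 and
    # compare; drop back to fp64 permanently if the error exceeds `tol`.
    class AdaptiveDot:
        def __init__(self, tol=1e-3, every=100):
            self.tol, self.every, self.step, self.use_low = tol, every, 0, True

        def __call__(self, a, b):
            self.step += 1
            if not self.use_low:
                return float(np.dot(a.astype(np.float64), b.astype(np.float64)))
            low = float(np.dot(a.astype(np.float16), b.astype(np.float16)))
            if self.step % self.every:   # skip the reference check most steps
                return low
            ref = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
            self.use_low = abs(low - ref) <= self.tol * max(abs(ref), 1.0)
            return low if self.use_low else ref

    dot = AdaptiveDot()
    x, y = np.random.randn(256), np.random.randn(256)
    print(dot(x, y))

The catch in training is that "good enough" drifts with the data, so the check interval itself becomes a hyperparameter.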
AI SUCKS!
Take THAT AI!
*After 10 years when AI takes over Humanity*
AI:- So how should we execute him for speaking against us
@@aidanm5578 wait, there’s more. AI Sucks!
@@computerscience1101 you suck