Transformers Explained From The Atom Up (Many Inaccuracies! Revised Version Is Out Now!)

  • Published: 1 Oct 2024

Comments • 29

  • @sophiawisdom3429
    @sophiawisdom3429 3 months ago +22

    Some thoughts as I watched the video:
    Tensor cores don't do FMAs, they do MMAs (matrix multiply-accumulate). FMA is a different operation they can also do, and it typically refers to a *single* fused multiply-add (see the FMA-vs-MMA sketch at the end of this comment). Kudos for mentioning they do the add, though; most people skip over this. At 12:58 you have a slide with Register/Sram/L1$/SRAM/L2$/DRAM. The registers, L1$, and L2$ are all made of SRAM, so listing SRAM as its own separate level is confusing.
    Under ISA you mention the ISA for a tensor core, which I don't think makes sense. The tensor core sits inside the SM and is invoked just like any other unit on the chip, such as the MUFU. All of the stuff you put on the slide at 14:24 is also not part of the ISA as most people would understand it. Tensor core outputs also can't be written directly to memory (though as of Hopper the inputs can be read from shared memory!).
    You're correct that CUDA is compiled to PTX and then to SASS, but SASS probably doesn't stand for "Source And Assembly" (it probably stands for "Shader Assembly", though NVIDIA never specifies), and CUBIN is a container format for storing compiled SASS. What you're saying is equivalent to saying "C gets optimized to LLVM IR, then to armv9-a AArch64, then to ELF" on CPU.
    Ignoring Inductor, Torch does not compile PyTorch code into CUDA -- this is an important distinction that matters for both Torch's strengths and its weaknesses. It dispatches to pre-existing CUDA kernels that correspond to the ops you call (the profiler sketch at the end of this comment is one way to see this).
    For transformers, I find it somewhat confusing that you're teaching encoder-decoder instead of decoder-only, but whatever. The dot product of things that are close would not be close to 1 -- the *softmax* of the dot products of things that are close would be close to 1. MHA is also not based on comparing the embeddings directly, but on comparing a "query" for each token to the "keys" of the other tokens; the network *learns* specific things to look for (see the attention sketch at the end of this comment). The residual addition is *not* about adding little numbers; it's about adding *the previous value*. The intuition is that attention etc. compute some small *update* to the previous value rather than totally transforming it. I think your explanation of the MLP also leaves something to be desired -- there are already nonlinearities in the network you described (layer norm and softmax). It also doesn't do an FMA, but a matrix multiply. Your explanation of the linear embedding at the end is confusing: typically the unembedding layer *increases* the number of values per token, because the vocabulary size is larger than d_model.
    You say all the matrix multiplication and addition happens in the tensor cores, inside the SM, whereas the intermediate stuff happens in the registers. All of it "happens in the registers" in the sense that the data starts and ends there, but more correctly the computation happens in the ALUs or the tensor cores.
    When you say that DRAM hasn't kept up as much: DRAM is made of the same stuff as the SMs -- it's all transistors. You mention you would have to redesign your ISA -- the ISA is redesigned every year; see e.g. docs.nvidia.com/cuda/cuda-binary-utilities/index.html.
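
    To make the FMA-vs-MMA distinction concrete, here's a minimal PyTorch sketch (my own illustration, not code from the video); torch.addmm just stands in for the D = A @ B + C shape of a tensor-core MMA, which the hardware performs on small tiles in one instruction:

        import torch

        # A *single* fused multiply-add: one multiply and one add on scalars
        # (real FMA hardware does both with a single rounding step).
        a, b, c = 2.0, 3.0, 4.0
        fma_result = a * b + c                      # 10.0

        # A matrix multiply-accumulate: D = A @ B + C over whole tiles.
        A = torch.randn(16, 16)
        B = torch.randn(16, 16)
        C = torch.randn(16, 16)
        D = torch.addmm(C, A, B)                    # same math as A @ B + C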
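
    And a quick way to see that eager-mode PyTorch launches pre-built CUDA kernels rather than compiling anything on the fly (a sketch; it assumes a CUDA-capable machine):

        import torch
        from torch.profiler import profile, ProfilerActivity

        a = torch.randn(1024, 1024, device="cuda")
        b = torch.randn(1024, 1024, device="cuda")

        # The profiler report lists the pre-compiled GPU kernel (typically a
        # cuBLAS/CUTLASS GEMM) that this single matmul call dispatched to.
        with profile(activities=[ProfilerActivity.CUDA]) as prof:
            torch.matmul(a, b)
            torch.cuda.synchronize()

        print(prof.key_averages().table(sort_by="cuda_time_total"))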
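
    Finally, a minimal single-head attention sketch (my own toy sizes, no causal mask) showing the learned query/key projections, the softmax over dot products, the residual add of the previous value, and an unembedding that goes from d_model up to vocab_size:

        import torch
        import torch.nn.functional as F

        seq_len, d_model, vocab_size = 8, 64, 50257   # toy sizes; vocab >> d_model
        x = torch.randn(seq_len, d_model)             # one embedding per token

        # Learned projections: the network learns what each token asks for (Q)
        # and what it advertises (K); raw embeddings aren't compared directly.
        W_q = torch.randn(d_model, d_model)
        W_k = torch.randn(d_model, d_model)
        W_v = torch.randn(d_model, d_model)
        Q, K, V = x @ W_q, x @ W_k, x @ W_v

        # Softmax of the scaled dot products -> attention weights that sum to 1.
        scores = Q @ K.T / d_model ** 0.5
        weights = F.softmax(scores, dim=-1)
        update = weights @ V

        # Residual connection: add the *previous value*, not "little numbers".
        x = x + update

        # Unembedding: d_model -> vocab_size, i.e. *more* values per token.
        W_unembed = torch.randn(d_model, vocab_size)
        logits = x @ W_unembed                        # shape (seq_len, vocab_size)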

    • @d_polymorpha
      @d_polymorpha 3 months ago

      Hello, do you know of any resources to dive deeper than this high-level intro video? Especially towards CUDA/PyTorch/the actual transformer?

    • @maximumwal
      @maximumwal 3 months ago +1

      Very good post, but Jacob's right about DRAM. DRAM also uses capacitors to store the bits, and then transistors for reading, writing, and refreshing each bit. In addition, the manufacturing process is quite different. Moore's law for DRAM has been consistently slower than logic scaling, which is why NVIDIA pays 5x as much for HBM as for the main die, and still the compute : bandwidth ratio keeps getting more skewed towards compute every generation. Even SRAM, which is purely made of transistors, can't keep up, because leakage gets worse and worse, and if you're refreshing it all the time it's unusable. Logic is scaling faster both due to 1. physics and 2. better/larger tensor cores.

    • @sophiawisdom3429
      @sophiawisdom3429 3 months ago

      @@maximumwal Ah true, though I thought DRAM uses a transistor and a capacitor (?).
      I feel like you should expect them to pay more for HBM than for the die, because the main die is 80B transistors but the HBM is 80 GB * 8 bits/byte = 640B cells, i.e. 640B transistors + 640B capacitors (quick arithmetic check at the end of this comment). HBM is also much more expensive than regular DRAM, I believe, like $30 vs $5 per GB.
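
      Quick arithmetic check on those cell counts (my own back-of-the-envelope, using the numbers above):

          # 80 GB of HBM, one transistor + one capacitor per DRAM bit cell (1T1C),
          # vs. roughly 80 billion logic transistors on the main die.
          hbm_gigabytes = 80
          bit_cells = hbm_gigabytes * 8   # gigabytes -> gigabits = billions of 1T1C cells
          print(f"{bit_cells}B transistors + {bit_cells}B capacitors in HBM")
          print("~80B transistors in the GPU die")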

    • @maximumwal
      @maximumwal 3 months ago +2

      @@sophiawisdom3429 Yes, there's 1 transistor per capacitor, whose channel and gate connect to the bit and word lines. Branch Education has a great video on this. As for HBM being roughly the same transistors/$: true, but it used to be much cheaper, because logic has tens of layers of wires/vias on top of the transistors at the bottom, vs just 2 simple layers of wires on DRAM. With B100 and beyond, HBM will be more expensive than logic on a per-transistor basis. There are many reasons for this, including the fact that smaller capacitors have to be refreshed more often, and the hard limits of memory frequency + bits per pulse (A100 -> H100 doubled bits per pulse but lowered frequency, probably since it's harder to parse the signal at low power, or possibly because of greater resistance with thinner bitlines), which were previously leaned on to improve GB/s/pin. On the die, by contrast, you can just build a larger systolic array/tensor core and get more flops/(transistors * clock cycles), and increase clock frequency more easily; you just have to manage power. Right now we're stacking HBM with even more layers (8 -> 12 -> 16) and using more stacks (5 -> 8). Nvidia will eat the cost and lower their margins. The normalizations + activations are soon going to use more GPU-seconds than the matmuls. Everyone knows this, so tricks on the algorithm, scheduling, and hardware sides are being aggressively pursued to provide life support to Huang's law.

  • @jacobrintamaki
    @jacobrintamaki  3 months ago +11

    Time-Stamps:
    0:00 Optimus Prime
    0:12 Overview
    0:33 Atoms
    0:48 Semiconductors
    3:51 Transistors
    5:53 Logic Gates
    6:47 Flip-Flops
    7:39 Registers
    8:35 ALUs/Tensor Cores
    10:34 SMs
    12:08 GPU Architecture
    13:44 ISA
    14:42 CUDA
    16:25 PyTorch
    17:27 Transformers
    21:41 Transformers (Hardware)
    22:29 Final Thought

  • @En1Gm4A
    @En1Gm4A 3 months ago +14

    Highest Signal to noise ever observed

  • @milos_radovanovic
    @milos_radovanovic 3 months ago +1

    you skipped quarks and gluons

  • @baby-maegu28
    @baby-maegu28 3 months ago +1

    I appreciate it. AAAAA make me down here.

  • @nicholasdominici
    @nicholasdominici 3 months ago +2

    This video is my comp sci degree

  • @codenocode
    @codenocode 3 months ago +1

    Great timing for me personally (I was just dipping my toes into A.I.)

  • @isaac10231
    @isaac10231 3 months ago +1

    So in theory this is possible in Minecraft

  • @scottneels2628
    @scottneels2628 2 months ago

    Bloody Hell!

  • @baby-maegu28
    @baby-maegu28 3 months ago

    14:50

  • @Nurof3n_
    @Nurof3n_ 3 months ago +1

    you just got 339 subscribers 👍 great video

  • @logan4565
    @logan4565 3 months ago +1

    This is awesome. Keep it up

  • @ramanShariati
    @ramanShariati 3 months ago +1

    LEGENDARY 🏆

  • @tsugmakumo2064
    @tsugmakumo2064 3 months ago

    I was talking with GPT-4o about exactly this:
    abstraction layers from the atom up to a compiler. So this video will be a great refresher.

  • @sudhamjayanthi
    @sudhamjayanthi 3 months ago

    damn super underrated channel - i'm the 299th sub! keep posting more vids like this :)

  • @pragyanur2657
    @pragyanur2657 3 months ago +1

    Nice

  • @Barc0d3
    @Barc0d3 3 months ago

    This was a great and comprehensive high-level intro. Oh wow 😮 can we hope to get a continuation of these lectures?

  • @rahultewari7016
    @rahultewari7016 3 months ago

    Dudee this is so fucking cool 🤩kudos!!

  • @boymiyagi
    @boymiyagi 3 months ago

    Thanks

  • @CheeYuYang
    @CheeYuYang 3 months ago

    Amazing

  • @prfkct
    @prfkct 3 months ago

    wait Nvidia invented GPUs? wtf

    • @d_polymorpha
      @d_polymorpha 3 months ago

      GPUs have only existed for about 25 years!🙂

    • @enticey
      @enticey 3 months ago

      they weren't the first ones, no