Our thanks to David Kanter for joining us. It's old now, but we think you might really like his article explaining tile-based rasterization on NVIDIA GPUs: www.realworldtech.com/tile-based-rasterization-nvidia-gpus/
Check out our DDR5 memory news: ruclips.net/video/evBKIfBeGJk/видео.html
Back-order a GN Modmat here for the next round! We sold out already. Again. store.gamersnexus.net/
Thank you, David Kanter! I'm very interested in machine learning and have been messing with TensorFlow recently. It's amazing to see stuff like random forests, which were invented way ahead of their time, finally being used alongside things like neural networks. Good stuff, would love to see more videos like this.
And the GPU is throttling for sure, because the draw distance is dynamically reducing.
The FPS are rock solid though, good job.
Draw distance is a bit low, did you forget to set it to high in settings? Tho hair physics are spot on, are you using TressFX?
Error989 It's the exaggerated depth of field.
Rendered using N64
They had to turn the view distance down to reach 60FPS at 4K.
They had to reduce draw distance to get better out of the box thermals.
Steve HairWorks
David sounds like the kind of guy who could take a few empty soda cans, some plastic spoons + duct tape and produce a 1080 Ti in the garage.
someone get that man a cape!
OR" rebuilt Radeon AMD Vega to... R.A.V.2.
Ggchb Gig ghb David needs help wiping his own ass. Show me something he has done in the last year? 2 years? 5 years? Ok, lets make it easy, EVER?
Tan Tan he's done more than you
joshcogaming proof?
THIS! Steve, this is the type of content I love. Pretty please, keep this kind of content coming!
Frank R If you're smart enough to keep up with this, you're smart enough to know why this cannot be the regular content.
Sorry to say but it's views that pay the bills and thus making it accessible to the majority is financially the best option. Leaning either way too often kills revenue.
I discovered David Kanter through Ars Technica (one of my favorite tech/science websites) and learned a lot about chip engineering. While that website tends to cater to a rather technical demographic, Gamers Nexus has come close to matching their attention to detail. There is a lot of science behind Steve's content, and his audience (based on his current subscription total) is far from the majority that visits, say, LTT.
So that could imply that the Corsair Dark Core SE does not contain an actual core either? /mindblown
Brought to you by the Corsair Dark Lane on a Vector Unit SE!
Doesn't quite have the same ring to it.
It really has stream processors.
So if Bob owns an orchard with apple trees and it's over 400 apples per tree... (does math) omg, Bob has over a million cores on his farm; hope it's a crypto farm
no, it's an apple farm... xD
Some computer mice actually have an embedded microcontroller. These run firmware, which is a tiny bit of software that probably does some work to smooth your mouse movements and to detect when you lift the mouse. So this mouse could actually have a working processor in it: nothing powerful, but it executes commands and processes sensor data.
Really loving the technical topics GN has been presenting lately
The inner machinations of my mind are an enigma!
-Patrick Star
Atilolzz poetic
**Bill Fagerbakke
As a Computer Science PhD student I find this channel to be an amazing source of technical information that, at least for bleeding edge hardware, can't be found anywhere else. Keep up the great work!
This video aged really well. Thanks for making these Videos, Steve.
The most interesting thing about CUDA cores and SMs (at least in my opinion) is that one SM can render multiple pixels at once, by having its different CUDA cores calculate with different values. But since there is only one instruction fetch and decode unit, all CUDA cores must run in lockstep (meaning they calculate the same instructions). This may result in weird behavior when running shaders with many branches, since all CUDA cores may run all branches and discard the results they don't need. This is why it can be faster to calculate everything the shader could need and then multiply it by one or zero instead of using an if, since both parts of the if statement may get executed if different pixels need to take different branches. This is something every graphics programmer should know, but many don't. Also, that's why it's actually not that bad to market with the number of CUDA cores, since that directly correlates with the number of pixels that can be drawn simultaneously. (Please note that this is simplified; you don't need to tell me that some details are not perfectly correct :p )
i think you left out too much detail but yeah
A lot of graphics programmers are familiar with branchless programming. It's somewhat relevant to CPU optimisation as well, where mispredicted branches cost so much performance.
How is always calculating what each branch needs first more efficient than letting one branch calculate (while the others are on hold), and then the others doing their thing? It seems equal to me, except that in the case where an entire warp takes a single branch, using the second method, you save the time of doing the calculations for the other branch. Given that a warp is only 32 'threads', the odds of everyone taking the same branch sometimes aren't that bad.
listen man, you left out some details. I want to speak to your manager.
@sjh7132 Because splitting the work across cores isn't free. When the GPU is executing code inside of an if statement, it needs to enable and disable its execution flag for the cores that aren't going to execute this cycle. Enabling/disabling the execution flag takes time. Additionally, doing so breaks the pipelining process, giving you less than 1 instruction per cycle before and after the execution flag is changed.
(Instructions usually take several cycles to execute, the reason why cores achieve 1 instruction per cycle is because they're pipelined)
If you have short if statements, the enable/disable issues are a significant fraction of the time spent executing the whole if statement.
If the if and the else statement both have very long bodies but are doing something similar, branchless programming can often mean combining them into one body. For a CPU it doesn't matter, for a GPU it's twice as fast in warps that execute both sides of the if.
Of course, if you're just wrapping a texture access with an if-statement, it's better to keep it an if than access the texture every single time (Unless you expect the if to happen a very high percentage of the time, but that's up to you)
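To make the trade-off in this thread concrete, here is a minimal CUDA sketch of the "multiply by one or zero" idea versus a plain if. The kernel names, inputs, and the specific condition are made up for illustration; real shader compilers often apply this kind of predication automatically, so treat it as a sketch rather than a recipe.

```cuda
// Minimal sketch: divergent if vs. branchless blend. Both kernels compute the
// same per-thread result; names and values are invented for illustration.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Threads of one warp that disagree here are serialized: each side of the
    // if runs in turn, with the other threads' execution mask disabled.
    if (in[i] > 0.5f) out[i] = in[i] * 2.0f;
    else              out[i] = in[i] * 0.5f;
}

__global__ void branchless(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Compute both candidates and blend with a 0/1 mask; every thread runs the
    // same instruction stream, so the warp never diverges.
    float m = (in[i] > 0.5f) ? 1.0f : 0.0f;
    out[i] = m * (in[i] * 2.0f) + (1.0f - m) * (in[i] * 0.5f);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (i % 100) / 100.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    divergent<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    branchless<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[1] = %f\n", h[1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Whether the branchless version actually wins depends on how long each side of the if is, exactly as the reply above explains.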
Steve's hair flowing majestically in the wind.
This guy knows what the hell he's talking about!
Everyone likes to forget about the GeForce 8 Series where 'CUDA Cores' had yet to receive their name and they were just called Streaming Processors. In the same way that Hyperthreading is often called an Intel thing but SMT is an AMD thing, yet they are one-and-the-same. Hyperthreading is just marketing.
i think i am to stupid for this type of content......will watch anyway :)
Sebastian wait until the discussion about CGI real time DXR gaming technology.
it's two
Same, but it excites me to have a lane to further my understanding of computers and a direction to look for it also.
Basically, a core is a self-sufficient & independent structure that can access memory and other things on its own. What the guy is saying is that he doesn't consider CUDA cores to be true cores because they are not independent from the rest of the structure. A vector execution unit (which is what a CUDA "core" really is) is the part of the chip that works with 3D vectors, angles, magnitudes etc... it does the math to move & transform 3D objects. 3D graphics is mostly vector math. A vector is essentially a position in 3D space with a direction & a magnitude. Think of it as an arrow pointing in some direction. It's the fundamental structure of all 3D graphics.
In shader programming, in order to create the illusion of lighting (lit/unlit objects), you have to multiply, or in simpler terms blend, two or more colors together. These colors are stored as floating point numbers, i.e. decimal numbers. Once all the math is done and the final color is defined, the pixel is drawn on the screen. Now, before it actually becomes a pixel, during the math process it's actually a vector stored inside a matrix, not a pixel.
That's about as simple as I can explain his multiplication process.
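For anyone who wants to see that multiplication spelled out, here is a toy CUDA sketch of the idea: colors held as floating-point triples, with the lit color produced by a component-wise multiply scaled by a simple diffuse term. The kernel, helper names, and sample values are invented for illustration and are not from the video.

```cuda
// Toy lighting sketch: "joining" two colors is a component-wise multiply.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float3 mul(float3 a, float3 b) {
    return make_float3(a.x * b.x, a.y * b.y, a.z * b.z);
}

__device__ float3 scale(float3 a, float s) {
    return make_float3(a.x * s, a.y * s, a.z * s);
}

__global__ void shade(float3 surface, float3 light, float3 normal,
                      float3 lightDir, float3* out) {
    // Lambertian diffuse term: how directly the light hits the surface.
    float ndotl = fmaxf(0.0f,
        normal.x * lightDir.x + normal.y * lightDir.y + normal.z * lightDir.z);
    // Surface color times (light color scaled by the diffuse term).
    *out = mul(surface, scale(light, ndotl));
}

int main() {
    float3* d_out;
    cudaMalloc(&d_out, sizeof(float3));
    shade<<<1, 1>>>(make_float3(0.8f, 0.2f, 0.2f),   // surface color
                    make_float3(1.0f, 1.0f, 0.9f),   // light color
                    make_float3(0.0f, 1.0f, 0.0f),   // surface normal (up)
                    make_float3(0.0f, 1.0f, 0.0f),   // direction to the light
                    d_out);
    float3 c;
    cudaMemcpy(&c, d_out, sizeof(float3), cudaMemcpyDeviceToHost);
    printf("lit color = (%f, %f, %f)\n", c.x, c.y, c.z);
    cudaFree(d_out);
    return 0;
}
```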
More David plz, always a treat. I really miss the old TechReport podcast days.
Need to watch it again, I'm damn lost
This was immensely interesting. I love the series of interviews you have been doing lately, it's great.
Videos like this are why you guys are my favorite tech channel! Keep them up! Educating the community is so valuable, I'm glad you guys take the time to get into the real details!
You really Hammered that answer out !!
Sorry, but I had to do it !! lol !!
Videos like these are the reason I come to you. Also because you are honest about your reviews. All hail tech Jesus.
Steve, can you do technical analysis just like his some time in the future? It's always great to get sources from multiple people, as shown at the beginning. I love in-depth info on how computers work at the transistor level.
This is good content. These technical videos are a great change of pace compared to the usual content GN produces. Love it.
Just found this video, and for someone who has a bachelor's degree in computer science, this makes a hell of a lot more sense than the jargon that is usually thrown around.
In short a "core" is usually made up of:
1. A set of circuits that handle integer instructions (full number arithmetic/logic) or "Arithmetic Logic Unit".
2. A set of circuits that handle floating point instructions (decimal place arithmetic) or "Floating Point Unit".
3. An "Instruction Decoder" which directs the instruction to the right set of circuits (ALU or FPU) and converts the instruction into the electrical signals that go into that set of circuits.
4. Lastly, we have the "clock", which turns "on" and "off" at a set speed to differentiate between one instruction and the next.
In this video he says GPUs use a core setup where there is still only one "Instruction Decoder" and "clock" but a large number of FPUs designed for specific instructions.
Thus what Nvidia calls the shader core is not the whole package but just the FPU that actually computes the decoded instructions. By limiting the core to only handle a small set of instructions and only use floating point numbers they can make these FPUs much smaller and cram many more of them into the same area.
roax206 you forgot the fetch unit and the store unit; also, modern cores have other units, like branch predictors
I like this (minus the hammering in the background). Let's have more expert-interviews where they explain something in detail!
Awesome video, I love this discussion format. This is a really interesting topic, explained simply enough to make sense to me, but just far enough over my head to get me to do some additional reading. I love it! I would really enjoy more videos like this. o/
I really like all those interviews you are putting out lately
Thank you! So many questions answered! And then some! I appreciate videos like this Steve, keep up the great work, love your channel!
I loved this talk, so much great information packed into 17min!
HAMMER TIME! Awesome video btw, keep it coming!
Best content GN has ever made! Thank you so much! Need more of this.
Love these highly detailed videos!
I really like this type of content; in depth technical descriptions of how GPUs are excellent at linear algebra!
I like the use of "knocking on heavens door" as your backtrack throughout the video.
Neat. I still don't get it, but I'm not a microprocessor engineer so what are you gonna do.
In short, what GPUs mostly do is the fused multiply-add (FMA) instruction: take two values in binary, multiply them together, then add the result to the value in the accumulator it got the data from, over and over until the FMA stream is done. It does this very fast because it needs nothing but the FMA instruction for that work. The other instructions are handled by fewer (or more) units, and nVidia and AMD do not tell the public how many of those are in that block of their GPUs' block diagram.
@Jordan Rodrigues how does that relate to the fused multiply-add instruction in the instruction set of the GPU?
I agree they are more open about what they are doing, but you must have misread or not understood what I meant. nVidia, AMD, ARM and Intel all use some version of this instruction. The programmer never gets to see the binary most of the time, as programming through APIs is a lot quicker and a lot more compatible with more hardware than programming in binary code. Now, how it is done differs between the APIs, a GPU and a CPU because of what is being "called", but it is the same operation being performed. It is all math, so you only have add, subtract, and store. That is all any microprocessor can do once the instructions and data are on the microprocessor.
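As a concrete illustration of the fused multiply-add pattern discussed in this thread, here is a small CUDA sketch: a toy dot product where every loop step is one fmaf (a*b plus an accumulator). The kernel name and the single-thread layout are simplifications made up for this example.

```cuda
// Toy dot product built entirely from fused multiply-adds.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dot(const float* a, const float* b, int n, float* out) {
    // Single-thread version to keep the FMA pattern visible; a real kernel
    // would split the loop across many threads and reduce the partial sums.
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc = fmaf(a[i], b[i], acc);   // one fused multiply-add per element
    }
    *out = acc;
}

int main() {
    const int n = 8;
    float ha[n], hb[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = float(i); }

    float *da, *db, *dout;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dout, sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    dot<<<1, 1>>>(da, db, n, dout);
    float result;
    cudaMemcpy(&result, dout, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot = %f (expect 28)\n", result);

    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```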
Finally a video that goes into detail. Would have loved some more insight into Bulldozer.
Kanter's deep-dive article on Bulldozer is on his site: www.realworldtech.com/bulldozer/. Lots of block diagrams and detailed analysis comparing BD to previous AMD and Intel microarchitectures. Bulldozer is an interesting design, but several of its features were mistakes in hindsight that they "fixed" for Zen. e.g. Zen has normal write-back L1d cache vs. Bulldozer's weird 16k write-through L1d with a 4k write-combining buffer.
Having 2 weak integer cores means that single-threaded integer performance is not great on Bulldozer-family. Zen uses SMT (the generic name for what Hyperthreading is) to let two threads share one wide core, or let one thread have all of it if there isn't another thread.
Zen also fixed the higher latency of instructions like `movd xmm, eax` and the other direction, which is like 10 cycles on Bulldozer (agner.org/optimize/ ) vs. about 3 on Intel and Zen. Steamroller improved this some but it's still not great. Kinda makes sense that 2 cores sharing an FP/SIMD unit would have higher latency to talk to it, and it's not something that normal code has to do very often. Although general-purpose integer XMM does happen for integer FP conversions if the integer part doesn't vectorize.
i always thought it was strange that a GPU could have thousands of cores while CPUs are lucky to reach double digits
also, knock knock
Valfaun who's there?
Daniel Bryant Me a nugger
Daniel Bryant someone who wants them nerds off their roof, perhaps
Best analogy I heard is that a CPU is a room with a few geniuses that solve complex problems while GPUs are like gymnasium filled with teens trying to solve for x.
More like a gym-ful of teens told to go paint one picket each on a fence.
Could this be a series please? I like this guy. Also, I love the hair effects you added. Way to use them floating wave points there bud!
Interesting chat. I love these little talks you have with people now and then, please keep that up!
No matter if it's random cool people or industry insiders, it's always interesting to hear people who know what they're talking about, well, talk! About stuff they know! :)
Love this this kind of technical discussion. Keep them coming.
Someone's trying real hard to tell us a *knock knock* joke
Thanks! This is exactly the video I needed. This CUDA core thing was driving me nuts.
1. So that's THE David Kanter, amazing
2. As far as I know (before watching this video) the CUDA "cores" are more like ALU than complete core. Nice deeper explanation here.
The ray tracing illustration at 3:09 is a bit inaccurate. Most ray tracing programs send the rays in the opposite direction: from the camera position into the scene, and from the objects hit by those rays toward the light sources. The rays usually only bounce off objects that have a reflective surface. One program that works this way is Cinebench.
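A rough CUDA sketch of that camera-to-scene ray casting, for anyone curious: primary rays start at the eye and are shot through each pixel into the scene. The pinhole camera model, field of view, and names here are assumptions made for illustration, not how any particular renderer (or Cinebench) is implemented.

```cuda
// Generate one primary ray per pixel for a pinhole camera looking down -z.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void primary_rays(int width, int height, float3* dirs) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map the pixel center to the image plane at z = -1, keeping aspect ratio.
    float u = (2.0f * (x + 0.5f) / width - 1.0f) * (float(width) / height);
    float v = 1.0f - 2.0f * (y + 0.5f) / height;

    // Normalize to get the ray direction from the eye through this pixel.
    float len = sqrtf(u * u + v * v + 1.0f);
    dirs[y * width + x] = make_float3(u / len, v / len, -1.0f / len);
    // A real tracer would now intersect this ray with the scene and, on a hit,
    // shoot secondary rays toward the lights (and reflection rays off shiny
    // surfaces), as described above.
}

int main() {
    const int w = 4, h = 4;
    float3* d;
    cudaMalloc(&d, w * h * sizeof(float3));
    dim3 block(8, 8), grid((w + 7) / 8, (h + 7) / 8);
    primary_rays<<<grid, block>>>(w, h, d);

    float3 host[w * h];
    cudaMemcpy(host, d, sizeof(host), cudaMemcpyDeviceToHost);
    printf("ray through pixel (0,0): (%f, %f, %f)\n",
           host[0].x, host[0].y, host[0].z);
    cudaFree(d);
    return 0;
}
```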
I love these intelligent discussions/interviews/explanations. Please continue doing these, GN. 😁
Wow this video is so excellent, David is amazing!
very nice, I really enjoyed it; I wish for more talks with this guy
Best. Channel! This content is amazing!
Huge thanks for publishing technical material like this! It's really nice to get behind all the marketing nonsense to get a better understanding of the underlying technology.
I think a little clearer explanation is to point out that all the threads in a warp / workgroup share a single program counter. If two sides of a branch are taken by different threads then you wind up with branch divergence and the threads have to serially execute the two branches individually. Following that serial execution the thread scheduler can then reconverge the threads so they all run in parallel again. But basically if you're not branching then the performance of all the cuda "cores" behaves the same as if they were all full-fledged "cores"
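A tiny CUDA experiment along these lines, for the curious: when a warp splits across a branch, __activemask() inside each side shows only part of the warp executing, and after the reconvergence point the whole warp can run together again. The kernel name and branch condition are arbitrary; this is a sketch, not a profiling tool.

```cuda
// Observe warp divergence and reconvergence with __activemask().
#include <cstdio>
#include <cuda_runtime.h>

__global__ void show_divergence() {
    int lane = threadIdx.x % 32;
    if (lane < 16) {
        // Only lanes 0-15 can be on this side of the branch.
        if (lane == 0)
            printf("then-branch active mask: 0x%08x\n", __activemask());
    } else {
        // The scheduler then runs the other side with lanes 16-31.
        if (lane == 16)
            printf("else-branch active mask: 0x%08x\n", __activemask());
    }
    __syncwarp();  // reconvergence point for the warp
    if (lane == 0)
        printf("after reconvergence:     0x%08x\n", __activemask());
}

int main() {
    show_divergence<<<1, 32>>>();   // launch exactly one full warp
    cudaDeviceSynchronize();
    return 0;
}
```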
As a C++ Programmer (dabbled in GPGPU) I found this video very interesting!
That guy with the hammer is really annoying 😠
Steve looking over the edge to check if the guy is within spitting range before realizing he's on camera and proceeding to nod in agreement.
Omg. I thought maybe there was something wrong with my receiver. I know it's not their fault but holy crap is that distracting.
At least he had parents who gave a damn unlike you who acts like an asshole for no reason.
That's not a hammer, it's the audience banging their heads against the wall trying to follow the explanation!
Great video! Loved the depth of it! Want more!
Great and informative as usual. Thanks for the knowledge!
I like this type of content. Same as the one with primitive shaders. Keep it up!
Niceeeee :D
Also, some points that were not mentioned:
Programming on a GPU is much harder than multithreading (using every core of a CPU) due to their physical properties.
And even for multi-core CPUs, most of the time parallelising a workload is not so easy.
I'd like Steve and GN to do a multi-episode detailed analysis of graphics architectures since the beginning of PCs. I've been PC gaming since the final days of vertex and pixel shaders (2005 & 2006), and the switch to unified shaders rightfully took the PC gaming world by storm. Interestingly enough, the first unified shader GPU - Xenos in the 360 - was more or less directly related to the pixel shaders found in the Radeon X1000 series.
Easy to follow, great interview!
hahahahaahah
Super interesting conversation!
During filming a construction worker is performing many hammer to nail operations
Bold content. Much appreciated. 10/10.
Please make these interviews much longer. When you are already interviewing someone who knows about that low-level stuff, you should try to get even more information out of them.
Good interview. Keep going!
Thank you for a great piece of content. This is exactly why I subscribed in the first place, keep it up!
Very cool, knowledge is power. This video opens up a whole different perspective on how GPU's and CPU's differ but yet the same. Very interesting. Thumbs up!
Thanks for this one! Learned a lot
A very interesting piece of content. Good stuff, Steve. Keep em coming :)
I want more in depth content like this.
We are on the brink of an advancement in computing technology. The only problem is making the computing technology smaller without affecting temps and transfer speeds. If we ever start using gallium nitride as a core component and a replacement for silicon, then we might be looking at the next jump in hardware evolution. So exciting.
Are the FPU lanes in each SM individually addressable or asynchronous? Or does each SM/CU have to run all the lanes in a single pass?
Good interview technique!!!
I wish everyone listened to their guests this well!!!
👏👏😁😁🖒🇺🇸☯♾
Please do more technical content like this.
love these kinds of videos
Very informative. Thanks!!
The legendary Kanter! Whooooooo!
So, for example, Vega 10 has 4096 SPs inside 64 CUs, divided among 4 Geometry Engines. In AMD's case, would the Geometry Engines be considered a 'Core', or is it a matter of the number of ACEs per Hardware Scheduler?
im dumb
This video teaches more about technology than an entire semester of tech class.
Half of me loves the complicated content because I can learn from it, the other half hates it because it's hard to understand.
great info in this video. Thanks!
This is the exact piece of information that i was looking for
2:47 steve mesmerized by his dream coming true
Cool stuff, thanks for the content!
So, is there a memory controller that’s doing the fetching and delivering for all of the clusters, some of the clusters or is that part of the CPUs job? (Not likely but I had to ask)
I strongly disagree with the title / thesis of the argument. CPUs execute a standard 1 operation per cycle, so Hz is a measurement of FLOPS. CUDA cores / shader cores are identical: they execute 1 FLOP per cycle.
Now, in 1997, Intel came out with MMX (and soon after, SSE), allowing a CPU to set up a wide register with multiple packed values and execute several operations in a single clock cycle. Note that video cards predate MMX, and MMX didn't become commonplace until into the 21st century; software support was minimal for a long time.
MMX doesn't make any sense in GPUs, because GPUs are already designed to be parallel instead of serial. Hence, they use the same phrase "core" which meant exactly what core has always meant: something that executed one operation per cycle.
======
Shader cores are true, honest to goodness, "cores" and meet the proper definition. Both CPU and GPUs have their own unique performance penalties when used improperly.
~ GPUs have the requirement that the entire shader must be executing the same code. As a penalty, if two cores in the same 32-core warp take different branches of an if statement, both must be executed in serial. They run at 1-2GHz.
~ CPUs have the penalty that if you access memory from two cores, they must wait a few clock cycles to reach consensus. If an if statement is mispredicted, many clock cycles are wasted as it un-does the work it did. They have special SIMD instructions that act on wide registers. They run at 4-5GHz.
Both are true honest cores. BOTH have severe penalties for using if statements improperly. GPUs' defining characteristic is that it runs one code on all of the cores. CPUs' defining characteristic is high GHz.
Obviously, because a GPU is only ever designed to run one shader code, no, they didn't give every CUDA core its own instruction processing unit. Not because they couldn't, but because it's already forced that they're all running the same code anyway! It doesn't make it any less of a core.
Is that talk he gave at UC Davis online somewhere?
I feel this is a rather subjective definition based problem.
For a more "layman" example, think about an apartment complex versus a mansion.
Both are a single building and can be equal size.
But they have different number of "homes" within them.
The question then is, is "core" a building or a home.
One can further equate each processing lane (or whatever the proper term is) to a room, where in theory both building and home are collections of rooms, and therefore a home qualifies as a "core" as a collection of "rooms" or processing lanes. But then we have studios, which are in effect homes of a single room.
Edit: perhaps it is "fair" to say the building is a CPU/GPU, a home is a core, while rooms are processing lanes or whatever. But then, we once again arrive at the fact that there ARE homes of a single room. There is no reason, as far as definitions go, why we can't have a "core" of a single processing lane. I mean, all else aside, if I had a "stupid" calculator that can ONLY do the fused multiply add function on the single "CUDA" core that it has, does that mean it has no core at all, even if all its work is done on a single unit as opposed to being distributed over a whole PCB etc?
On the one hand, you can argue that each CUDA is part of a core, as only a processing lane or whatever the correct term is.
On the other, you can argue that each CUDA is a stripped-down core, down to a single processing lane (or whatever), because a GPU "core" doesn't need all the functionality of a CPU core, so over time they "devolved" from full-fledged cores to what they are now.
Personally, I am much more annoyed by Nvidia's misappropriation of the term "Max-Q", which has NOTHING to do with the actual meaning of the word in its original context, as well as the "standard" "4K", which is entirely a marketing term that has no proper definition (2160p is what people should use for what is commonly considered "4K". But 4K sounds "twice as good" as 2160p, so here we are). Those are more clear-cut. Something like this... I don't think it's clear-cut at all, because a proper analysis would require considering the development history of both the components and the word. So no, I'm not saying you are wrong. But you are also not "right", not unless you write a 200-page study on the historical evolution of the term and of GPUs and note down the correlation every step of the way.
Good content, love the more in-depth stuff. This really separates your channel from the other tech channels!! Everyone talking about the same stuff in short videos is kinda boring. While I don't understand everything, this pushes me to do some learning and read more.
Anyone know why the Corsair mouse linked in the description is sold out everywhere? This video isn't that old
Was that smog in the background or just fog?
Love such content!
I'm an ML guy and ended up here; now I'm core audience. Thanks for the great video
Nice! I suggest the next video should be about FPGAs and why they perform better for ray tracing.
Is double precision important?
It's like, twice as important
OK, so what is a streaming multiprocessor (SM)? It sounded like a recursive definition.
Great video!
Gah! I know the video was about CUDA. But I wanted more, like how NVIDIA's and AMD's designs affect the difference in performance between them when using DX12/Vulkan vs older DX.
As for AMD's Bulldozer, I've always said they are "modules" with a form of hyperthreading instead of "cores". So I see it as: if AMD can call their "incomplete" cores "cores", then Nvidia can call the CUDAs "cores". Although they should call the whole SM block a "core", which would be a LOT fewer.
But really nice video. Love this kind of content.
My brain went back to the tape archive memory and found the matrix rotation operations from the linear algebra classes in the first year of CS. How this stuck is beyond my understanding
Loved the video, very informative
I guess CUDA Floating Point Registers just didn't have the right ring to the marketing guys.
sick no-scope @ 8:04
Can you talk about how the fundamental "core" design changed when going from Fermi to Maxwell? I believe this was one of the paradigm shifts for CUDA.
Well, not sure I'm inside the core audience, but I really like this type of content. Other tech channels most of the time try to dumb down the topics to appeal to a broader audience, making the content kinda boring for people who love deep, complex tech topics. I think the geekiness of Gamers Nexus is one of the reasons this channel is unique. (y)