@@rattlehead999 The clock speeds are typically determined after the design based on the parasitics within the manufactured die. This is also where binning comes into play, chips with higher parasitics are binned to lower clock speeds. The choice of CU/SM size is more of a tradeoff between area (to reduce die size) and pipeline latency (the shorter the pipeline, the more efficient warp switching becomes). Clock speed estimates are taken into consideration at that stage, but in a much more heuristic way that doesn't really consider parasitics (those end up affecting the design as you get closer to tapeout and perform advanced electrical simulations).
I think it's useful to mention that GPUs will frequently and deliberately block on memory, as the memory subsystem is geared towards throughput with little caching in the way of reducing access latency. Hence, an SM may theoretically switch context after every warp instruction.
You could have simplified this video so much, with less name dropping, while supporting it with a tiny speck of math to make it more understandable overall. Even still, this was a good explanation overall. If anyone wants a more compact explanation, read the following. The CPU has a certain number of threads (these are comparable to hands: the more hands you have, the more you can do at once). Each thread is capable of handling a single operation at a time. Usually the CPU has a number of cores, and they might or might not be split in two so that there are twice as many threads as cores (two threads per core), doubling the amount of things that can be done at once. A CPU can handle almost any operation, but only a certain amount at a time (= the number of threads). A GPU is perfect for linear algebra because it has many, many tiny "threads", each solving a small piece of a greater problem together. Linear algebra is an extremely simple and easy form of math; it just takes a really long time for one person or thread to calculate. For simplification's sake, linear algebra is just the math revolving around matrices. These are effectively 'packages' containing data: a matrix could, for example, give you a 3D coordinate on Earth with its corresponding temperature and so on, instead of having to express each property individually. Summary: a CPU is like Einstein, you have very few but they can do a wide range of things. A GPU is like peasants: there are many, they can do very few things, but together they topple empires. If my English is broken please reply with where and what so I can correct myself.
One thing to note: starting with Ampere, Nvidia's CUDA cores have one FPU and one combined INT32/FPU. They aren't completely split anymore. They did this to increase the max theoretical FP performance; however, there's almost never a case where no integer calculations are being run alongside, so you rarely see the full doubled rate. It's really quite interesting actually. AMD has done the same thing with Navi 31, in the Radeon 7000 series. I guess it's a way to squeeze out some extra performance without increasing die size.
And also starting from Ada Lovelace (40 series), NVIDIA's SMs support on-the-fly thread reordering which basically lets them reorganise threads across multiple warps such that each thread in a warp is actively executing the same instructions on neighbouring sets of data. Where past architectures may have some threads block execution when the warp enters a branch, Ada Lovelace can detect this and reorganise these threads so that all threads in a warp are taking the same branch, while the other threads that wouldn't have taken that branch are organised into a different warp populated with other threads from other warps that _also_ wouldn't have taken that branch. Caveat with this is it seems like this is only supported with callable shader execution where one shader can invoke another on-the-fly, which I assume basically gives the GPU time to do the reordering. NVIDIA also has another optimisation up their sleeve called subwarp interleaving, which basically lets a warp dynamically switch between two branches of code whenever one of the branches encounters an instruction that would block execution anyways (such as an instruction depending on data being read from VRAM). With current architectures the entire warp would just end up stalling since the first branch is waiting on a blocking instruction to complete and the second branch is waiting on the first branch to complete, but with this optimisation the warp would be able to switch to the second branch from the first branch, execute the second branch until it also encounters a blocking instruction, then switch back to the first branch where the instruction would hopefully have access to the data it was depending on. This isn't currently available in any publicly available GPUs, but NVIDIA was testing this out on a custom Turing (20 series) GPU and saw anywhere from a 6% to a 20% improvement in raytracing workloads, which is one of the most branch-heavy workloads you could give the GPU.
@@jcm2606 some insider information? or is it available to the general public? Take care bro, I hope this doesn't count as casually breaching classified info on a random comment on YT.
It has always been slower. The advantage of a GPU is not in speed (as some might think, given how it can render amazing graphics through its pipelines), but in concurrency: the sheer number of cores with specialized functions does wonders for its specific needs.
Thanks for this correct and honest answer at the very end :) Ultimately it depends on what should be done and how it's implemented. Certain tasks can't be done efficiently on a GPU. GPUs are great for pipelining. So if you have 10 operations that have to be done in succession, the GPU actually sets up several parallel pipelines you can pump data through. You essentially get a fixed delay which is about equal to the pipeline length, but other than that the data can be pumped through at the clock speed. The worst thing that can happen to a GPU is to stall a pipeline. That's why using tons of different shaders is much worse than a few well designed ones. Though with actual CUDA cores they implement more and more CPU concepts into the GPU, in the end it's still a piece of specialized hardware for certain tasks. Yes, it can do some tasks a CPU can do as well, but it is often slower at them. For pure data crunching, though, GPUs are great, at least when you can adjust the algorithm for the architecture.
I love this channel. You have done a really good job explaining concepts that are rather esoteric at times. I found the snippets of code examples overlaid for context rather interesting as well. Code examples for small processes are really helpful for people like me who are beginners, to understand how simple tasks are performed and then how they can be implemented en masse to perform complex tasks. I wish someone would talk a little bit about the beginnings of embedded programming, how to get into it, and working with PLDs and understanding the building blocks of creating your first projects. Like, how to scope a programming project, how to attack it, etc. For instance, yesterday while driving home from work, my brain was occupied with the idea "Ok, assume I did not have a cmath library and only had addition. How would I perform exponent math?" That was a fun little task... first asking "Well, Brian... what does it even mean to 'square' something?" and then the question was "Well wait... how do I do multiplication with only addition?! I know how to do it as a person, but how does a machine do it with only addition and loops?" That was a fun mental exercise. 🙂
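For fun, here's a minimal sketch of that exact mental exercise in CUDA-flavoured C++ (the function names and the non-negative-integer assumption are mine): multiplication as repeated addition, and integer powers as repeated multiplication.

```cuda
#include <cstdio>

// Multiply using only addition: add 'a' to an accumulator 'b' times.
// Assumes b >= 0; marked __host__ __device__ so it could run on either processor.
__host__ __device__ long long mul_by_add(long long a, long long b) {
    long long acc = 0;
    for (long long i = 0; i < b; ++i) acc += a;   // nothing but addition and a loop
    return acc;
}

// Raise base to a non-negative integer power using only the multiplication above.
__host__ __device__ long long pow_by_mul(long long base, long long exp) {
    long long acc = 1;
    for (long long i = 0; i < exp; ++i) acc = mul_by_add(acc, base);
    return acc;
}

int main() {
    printf("7 * 6 = %lld\n", mul_by_add(7, 6));   // 42
    printf("3 ^ 4 = %lld\n", pow_by_mul(3, 4));   // 81: "squaring" is just exp == 2
    return 0;
}
```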
There's also the interesting difference between SIMD vs SIMT, and how most modern GPUs actually implement both (typically only for packed vectors of int4/8 or for 2xFP16).
5:38 This specific design of GPU cores is also why modern GPUs tend to have multiple types of cores as well - each core type designed to handle one specific type of task.
Amazing explanation! :) This explains why Half-Life 1 and Counter-Strike 1.5 were the last games of the "software renderer" era, and why we can have cool graphics nowadays, even with "low end" GPUs
No idea how I found this video but I'm glad I watched it! You explained the difference between GPU and CPU cores very well and I could actually follow quite easily. You earned yourself a new sub!
My favorite way to explain a CPU to people is with a restaurant analogy. The cores are people, the customers are incoming data, and each core/person is given a task that it has to execute. It's the best analogy I used to understand CPUs.
I do a lot of work in numerical weather prediction models, and much is now being run on GPUs. Running a simulation on a GPU is much faster than a CPU. The same goes for machine learning, such as a convolutional neural network - which is a popular way to work with large meteorological weather data to improve the accuracy of weather forecasts. Now the numerical weather prediction models are designed to work on Supercomputers, such as the Cheyenne supercomputer that the National Weather Service uses, but they can also run on your desktop computer. In fact, I run the WRF model on my I9-10850K and I'm also doing work with the Unified Forecast System (UFS), which is the next generation of numerical weather prediction model in the US. And for that I'm using my 3080. You have a very informative channel, and I will be subscribing.
Currently, the combined processing power of NWS supercomputers is 8.4 petaflops, which is more than 10,000 times faster than the average desktop computer.
GPU shader cores, from my understanding, are a little more complex than the first Pentium CPU cores, but given modern architectural design, the emphasis on massively parallel design, and newer process nodes, you now have a massive number of relatively slow processors but with a huge number of 'threads'.
It should be noted that there are tasks other than graphics that lend themselves to SIMD processing - more or less anything that involves vector or matrix operations (signal processing, solving massive systems of linear equations...). That's where Nvidia's CUDA API comes in.
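To make that concrete, here's a minimal CUDA sketch of the classic SAXPY operation (y = a*x + y), the kind of vector math CUDA was built for. It uses managed memory to keep the example short; the sizes and launch dimensions are just illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i]  -- one output element per GPU thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n) y[i] = a * x[i] + y[i];               // guard against overshoot
}

int main() {
    const int n = 1 << 20;                           // ~1M elements
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));        // unified memory keeps the demo short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int block = 256;
    int grid  = (n + block - 1) / block;             // enough blocks to cover all n elements
    saxpy<<<grid, block>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expect 5.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```

Each GPU thread computes exactly one element; the loop you would write on a CPU is replaced by the grid of threads.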
@1:56 I really thought that'd happen someday. Back in the day I had a Socket 939 motherboard with a slot to take an AM2 daughterboard. When GPUs started to do compute workloads and high-end SSDs came on PCIe cards, I thought it wouldn't be long before we get whole systems running on PCIe slots.
Btw you can run any DirectX graphics on your processor. A 13900K is about equal to a midrange card from 10 years ago; you can play CS:GO on your CPU even without an APU.
Nowadays it's a bit more complex than triangles since modern GPUs are designed to support some level of general compute (see compute shaders in most graphics APIs, as well as general compute APIs like CUDA or OpenCL), but yeah, that's basically it.
The GPU cores are almost like a hardware version of specialised subroutines. In fact, that is part of what the CUDA cores are for: calculation of the triangles for 3D graphics. Meanwhile, a CPU is a general purpose processor that can execute any of the instruction sets within its CPU family. Think of how Amiga computers were capable of arcade-like graphics and sound because they had specialised sprite and sound processors in addition to the Motorola 68000 CPUs they used. IBM PCs had only Hercules/CGA/EGA graphics cards, which were really RAM/register buffers with DAC circuits to flush the data from c800 to the VGA analog output. Sound-wise, the IBM PC's beeper speaker was limited until the Yamaha OPL3 FM chip gave it MIDI and a few voice channels. This offloaded specialised MIDI, wave, DSP and other related sound processing to sound cards.
Each NVidia warp is basically equivalent to a single CPU thread capable of executing vector instructions of size 32. A simplistic way to estimate the sequential performance could be to divide the number of NVIDIA cores by 32. That does not quite work because of other major design differences with CPUs. The GPU cores are themselves grouped into SMs (Streaming Multiprocessors) designed to execute multiple warps in a way that is roughly similar to CPU hyperthreading but at a larger scale. Also, the individual warps are not optimized for speed. A CPU can execute simple instructions from a single thread at a rate of 1 per cycle, but a GPU SM can only process the instructions of a single warp with a significant latency; maybe 1 every 10 cycles for the simple ones, to a few hundred cycles for memory accesses. To make things worse, GPUs have very little cache memory because caches would be inefficient (or very expensive) when running tens of thousands of threads in parallel. Instead, GPUs rely on the principle that the memory latencies between instructions are hidden by the fact that each SM is running up to 64 warps in "hyperthreading" mode (and the whole GPU may contain up to ~100 SMs).
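As a back-of-the-envelope sketch of that "divide by 32" idea (the specific part and the resident-warp count are assumptions on my part, roughly matching a 4090-class GPU):

```cuda
#include <cstdio>

int main() {
    // Assumed figures for illustration only.
    const int cuda_cores     = 16384;                    // marketing "cores" = FP32 lanes
    const int warp_width     = 32;                       // lanes that execute in lockstep
    const int vector_threads = cuda_cores / warp_width;  // ~512 warp-wide "threads" issuing per cycle
    const int sms            = 128;                      // streaming multiprocessors
    const int resident_warps = 48;                       // warps an SM can keep parked to hide latency (arch-dependent)

    printf("CPU-thread-like units (warps issuing per cycle): %d\n", vector_threads);
    printf("Warps resident across the GPU for latency hiding: %d\n", sms * resident_warps);
    return 0;
}
```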
The best summary i have heard is GPUs are optimized for high throughput while CPUs are optimized for low latency. This generally applies if the problem can benefit from parallelism.
You mean "warp" as in warping material into a form, related to the verb? Old English weorpan "to throw, throw away, hit with a missile," from Proto-Germanic *werpanan "to fling by turning the arm" (source also of Old Saxon werpan, Old Norse verpa "to throw," Swedish värpa "to lay eggs," Old Frisian werpa, Middle Low German and Dutch werpen, German werfen, Gothic wairpan "to throw"). Also Old High German warf "warp," Old Norse varp "cast of a net" - most likely where we get the word 'wharf', where ships are moored.
CPUs actually include SIMD execution units and perform tasks in a similar way to GPUs. What sets GPUs apart is their fixed-function hardware used for special tasks like running raster algorithms, sampling textures and, in more recent times, neural network inference and ray intersection testing. We have also seen that under the right circumstances compute shaders are competitive with fixed-function rasterization, and some CPUs now include machine learning instructions. This makes me wonder whether someday we might go back to a unified processor that forgoes the overhead and complicated programming model that comes with using a co-processor.
I'd say comparing the two is like comparing a big commercial ship to a ship-building machine. Very different use cases, but one definitely benefits the other :)
I think it's also important to highlight the fact that FLOPS only measures the performance with floating point numbers. 80+% of your everyday tasks don't even involve floating point numbers at all
If you're looking for a layman's-terms distinction, you could use "general purpose" for CPUs, as you did, and then "specialized" on the other end for GPU cores. CUDA cores, and whatever AMD's equivalent is, are specialized for a very specific kind of math in a very specific way and are thus made to be incredibly fast and efficient at that, whereas a CPU has to be able to do everything at roughly the same effectiveness, so it just doesn't do as well at those specific tasks.
One CPU core can execute several instructions in parallel, this is why SMT/HyperThreading is possible. While it's a bit tricky to define boundaries of cores, one could say that a x86 core with 6 ALUs and 4 FPUs would """equal""" 4 GPU "cores" as far as throughput is considered. Otherwise x86 CPUs would be limited to "number of cores * frequency * 2" FLOPs, but instead the number is much higher (and I think CPU FLOPs are calculated by "number of cores * number of FPUs per core * frequency * 2") And then there is rabbit hole with "double purpose FPU/ALU" + FPU combo in the newer Nvidia GPUs, which essentially means that theoretical performance is anywhere between half of what you'd expect (equal number of INT and FP instructions) and what's advertised (pure FP).
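A quick sketch of that FLOPS arithmetic (core counts, FPU width and clock are made-up example numbers, and the x2 assumes one fused multiply-add per lane per cycle):

```cuda
#include <cstdio>

int main() {
    // Example numbers only - not any specific CPU.
    const double cores          = 16;     // physical cores
    const double simd_fpus      = 2;      // vector FPUs per core
    const double lanes_per_fpu  = 8;      // e.g. 8 x FP32 lanes in a 256-bit unit
    const double ghz            = 4.0;    // sustained clock in GHz
    const double flops_per_lane = 2;      // FMA counts as a multiply plus an add

    double gflops = cores * simd_fpus * lanes_per_fpu * flops_per_lane * ghz;
    printf("Theoretical peak: %.0f GFLOPS (FP32)\n", gflops);  // 16*2*8*2*4 = 2048
    return 0;
}
```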
Some older GPU architectures actually didn't support dynamic branches at all. And even today the result can be a severe cost in performance: the GPU actually computes both if/else code paths and selects the result via a boolean (the condition), i.e. the shader if-else is likely compiled into branchless code.
This highly depends on the compiler (both the IL and driver ones). If it decides that the difference between both paths and the mere execution of a branch will be negligible, then it most likely will result in a removed branch (so both paths execute). You can directly specify and hint the compiler what you want to generate with the [branch] or [flatten] attributes in HLSL; don't know about other high level languages (GLSL has no support for those iirc).
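The same idea shows up in CUDA (the attributes above are HLSL; this is just an illustrative CUDA analogue): a divergent branch makes the warp run both sides one after the other, while a select/ternary keeps every lane doing the same work.

```cuda
// Divergent version: within one warp, some lanes take the 'if', others the 'else',
// so the hardware executes both paths serially with parts of the warp masked off.
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;       // path X
    else
        out[i] = in[i] * -0.5f;      // path Y
}

// Branchless version: every lane evaluates the same instructions and a select
// (often a predicated move) picks the result - roughly what the compiler may
// emit for the version above anyway when both paths are this cheap.
__global__ void branchless(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    out[i] = (x > 0.0f) ? x * 2.0f : x * -0.5f;
}
```

Launch either one like the SAXPY example earlier; for paths this cheap the two usually perform the same, because the compiler flattens the branch anyway.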
Another reason is that even if you cram many cores into a CPU like in a GPU, most of the time it wouldn't even be faster. Most programs aren't even able to take advantage of the 8 or 16 cores of today's CPUs, so the main bottleneck is currently the clock speed per core. And the clock speed equivalent of a GPU is very low.
Clock speed isn't the limitation... IPC is the limitation. If clock speed were the limitation, CPUs wouldn't have been improving much in the last decade.
As Alucard said, clock speed doesn't matter; IPC is the main thing that makes a CPU fast. You have 8-core CPUs that run at 1.3 GHz that give you performance close to a Ryzen 7 1700X. The biggest bottleneck is the unwillingness of Intel and AMD to switch from x86-64 and CISC to RISC or VLIW.
@@xfy123 Yes, clock speed is not as direct a measurement as IPC. But performance is basically IPC times clock speed, and higher clock speeds are better, so there is no way you can say they don't matter
And it's not only the core, but the bus that connects the relevant data source and sinks. A GPU would be something like a 64-lane highway full of trucks (VRAM) with only a handful of individual destinations it can reach. That's because a GPU has a reduced instruction set with copy-pasted circuits per lane. A CPU on the other hand would be a large city with maybe a 4 or 8 lane highway (Cache). From any lane you take into the CPU you can reach more or less any destination register in that CPU.
Also, GPUs are fast with 32-bit float numbers, but using 64-bit floats performance can be 20x slower, so if you need 64-bit precision a CPU can be more performant
From your description, one Alder Lake core (with AVX-512 enabled, so really Sapphire Rapids) would be equivalent to 32 CUDA cores, since it can initiate 32 floating-point math ops per clock cycle. Except that they can be two sets of 16-way SIMD, and it can simultaneously perform another 4 logical ops and more memory ops at the same time, twisting and winding its way through 2 arbitrary threads at the same time with far greater flexibility, and at a higher clock rate. So top Xeon CPUs can theoretically manage >7 teraflops, as long as the memory can keep up. Still 7 times slower, but much easier to design diverse code for.
This is a nice overview of the differences between CPU and GPU. It might be better not to use the term CUDA core so often, as that doesn't match the terminology of what a CPU core is, and probably confuses the watcher. Granted, NVIDIA made the terminology confusing. The video also makes it sound like only GPUs perform SIMD operations and CPUs don't, which is a misrepresentation because SSE, AVX, etc instruction sets exist, which would multiply "CPU core" count by 8 or more. Another difference that wasn't mentioned that allows GPUs to have high throughput is that latency can be hidden by switching contexts between several warps on an SM, akin to hyperthreading on a CPU.
The internet really needs a well-researched video on capability hardware. There's a good book on the subject but it only covers mainframes, and nothing after the 80s. I get that CHERI (including ARM Morello) is the big thing now, but the intel i960 was absolute perfection.
This reminds me of Acerola's video where he tries to make a pixel sorting shader, and he found that GPUs are miserably bad at sorting, and he could not get it to run at acceptable framerates without high end hardware. He explained these same concepts as "CPUs are smart slowly, GPUs are stupid faster."
I'd put things a bit differently. CPU cores are designed to be versatile, to be able to perform many kinds of operations, but generally do one thing at a time very fast; that's because programs need to execute in a precise order, like add A to B, then divide the result by C. GPU cores on the other side are designed to do only certain, limited things and are super small, so you can fit thousands of them next to each other, meaning they can do multiple tiny things in parallel, all at once. Then you can just glue together what they produced to compose an image.
A GPU core is smaller and more stupid, and cannot act alone; everyone in the group executes the same instructions (program). A CPU core is much bigger and more independent, and each core runs one (or sometimes several) programs at the same time. That was what the SIMD / warp discussion was about :)
That's a very good takeaway actually. The essence is that a GPU "core" is just a very small math unit; it can do a couple of math operations and that's it. It's also not an independent unit: it needs to be organised into bigger groups with other "cores" and run the same operations together, just on different data. That's very useful when, for example, you want to apply a shader to the entire image, i.e. multiply all the pixels on the screen by some number to affect their lighting levels. But when e.g. you want to display a webpage that has all kinds of different elements, you would have to do it element by element, i.e. an entire GPU thread group calculates this button, then another calculates the position of the other image, etc.; it becomes ridiculous. A CPU core on the other hand is essentially like an autonomous computer: you can run hundreds of different instructions, math, logic, weird jumps, memory operations, etc. (it's getting really hard to keep this comment high level and not go into details :P).
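That "multiply every pixel by a number" case is about the simplest thing you can hand a GPU. A minimal CUDA sketch (the image layout and brightness factor are just example choices):

```cuda
// Scale the brightness of a WxH greyscale image: one thread per pixel.
__global__ void brighten(float *pixels, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;        // threads outside the image do nothing
    pixels[y * width + x] *= factor;              // same tiny operation, millions of times
}

// Host-side launch for a 1920x1080 image whose pixels are already on the GPU:
//   dim3 block(16, 16);
//   dim3 grid((1920 + 15) / 16, (1080 + 15) / 16);
//   brighten<<<grid, block>>>(d_pixels, 1920, 1080, 1.2f);
```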
Note that the Nvidia Cuda cores are probably closer to Intel threads. The best Nvidia equivalent for the core is the Streaming Multiprocessor (SM). For the Ada Lovelace Architecture (RTX 40 series), each SM has 128 Cuda cores. 128 * 128 = 16384. CPUs also tend to handle a larger number of data types with additional instructions (and more hardware) needed for both conversions between each, as well as all different operations performed for each different data type. The GPU core, on the other hand, can be a lot simpler and thus smaller.
Another way to put it: A CPU core: A single jack-of-all trades genius who gets things done ASAP. He even tries to predict what will happen next so he is never waiting on someone else. His desk space takes up an entire floor of the office building. He has multiple assistants that organize things and try to keep whatever he needs within reach of this genius. Trying to manage lots of these people gets… unmanageable. A GPU core: A factory worker who needs to be told what to do all of the time. Give the managers megaphones and they all get lots of stuff done, as long as the manager can utilize the factory workers efficiently. To do that, the factory workers have to all be following the same instructions. See? Totally different!
The funny thing is, if I'm not mistaken, the CPU is the component that feeds instructions to the GPU, since it can work independently while the OS is still running. It's also more dynamic in terms of the operations it can handle, so it can be faster or more efficient than the GPU for certain math operations.
The FPUs in CUDA cores are optimized for particular floating point operations. They're really good at addition, subtraction, and multiplication. They're much worse at division and most other functions like trig and exponents. You're also going to suffer from launching GPU kernels and from data transfers between GPU RAM and main RAM/CPU cache. And the largest float they can handle is double; they can't do long doubles. Doubles are usually enough, but double precision is also slower than single precision, though that may be true of CPUs as well. I haven't checked.
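For what it's worth, CUDA exposes faster, lower-precision intrinsics for exactly those "expensive" operations; a small sketch (the accuracy trade-off is real, and the actual speed difference depends on the GPU):

```cuda
__global__ void fast_vs_precise(const float *in, float *out_fast, float *out_precise, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    // Hardware special-function approximations: fast, reduced precision.
    out_fast[i]    = __fdividef(__sinf(x), x + 1.0f);
    // Library versions: more accurate, more instructions under the hood.
    out_precise[i] = sinf(x) / (x + 1.0f);
}
```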
We should be doing something "like" using GPUs where we now use CPUs. Not exactly - GPU cores are quite simple in terms of the suite of operations they can perform. But we screwed the pooch decades ago in CPU design. We decided that we would let legacy C software drive the development of new CPUs - we took as our goal to stay with a single core design but make that core run the old code faster through whatever trickery we could come up with. And while the results have been impressive, they've also turned our CPUs into complex monstrosities that no one can understand completely. And that's exactly how we wound up with things like Spectre / Meltdown and so on. And in the end we had to swallow the multi-core pill ANYWAY. We should have swallowed it decades sooner and moved toward more simple and easy to understand (and easy to secure) cores, and just re-written software as necessary to make use of them. These days there's more logic in a CPU core to do all this trickery than there is actual logic that does the rubber-meets-the-road computation. It was a mistake, and now it's too late to fix it.

Just as an example, take instruction re-ordering. That was just a STUPID thing to do in hardware. The programmer or the compiler can handle that part of the problem, but that wasn't good enough because they wanted the chip to run EXISTING CODE BINARIES faster. So we spend logic on that which could have been spent offering us more controllable computing. Come on, it's obvious: just get the instructions in the right order to START WITH, instead of forcing the logic to re-sequence them for you every time. Out of all of the things we "shouldn't have done," this is the one that galls me the most.

Speculative execution is another example. Instead of spending logic on that, we could have spent it on "more cores." I'm not even entirely sure about cache memory logic. Having a fast scratchpad on the processor is certainly a good idea, but I suspect that there was a better way to do it that involved the programmer / compiler being conscious of what was going to go into it. Then all that "control logic" could have been used for... you got it - more cores.

The general principle here is that we tried to put logic into the chips to make them "think for the programmer." Instead we should have demanded that our programmers LEARN TO THINK, and focused on putting as many simple, easy to understand cores as possible into the programmer's hands. Now, I have no problem with the compiler being smart enough to do some of these things for the programmer - that's still just a one-time cost, and then you reap the benefits forever. But having the hardware do it, at RUNTIME, EVERY TIME? That's just poor resource management.
The SIMT limitation in GPUs is no longer strictly true in more recent architectures; however, it is true that they still perform best when they are used in a SIMT manner.
Comparing CPU cores to GPU cores is like comparing a FedEx truck to a freight train. The freight train/GPU has a lot of limitations that allow it to be WAY more efficient compared to a FedEx truck/CPU. Sometimes it's worth it, sometimes it isn't.
I would have gone with a semi truck (GPU) and a scooter (CPU): when you want to move a lot of packages simultaneously, all from the same initial point to the same destination point, nothing beats a semi truck, but if you want to deliver one package quickly, with agility and the ability to change your mind, a scooter is the best.
GPUs may have many ALUs, but the way they're organized matters. I.e. AD102 has 18432 ALUs. These are organized into 144 SMs, with each (simplified) being able to simultaneously run 4 different FP32 instructions, each on a group of 32 variables. The SM is what logically and in terms of functionality comes closest to a CPU core, and 144 SMs definitely is in the same ballpark as the 128 CPU cores that are now common in servers. On such a GPU, however, you need 4 independent threads (with 32 work items ea.) per SM in order to reach full utilization. On a modern CPU that mark is somewhere around 1 up to 1.2 (on average) threads being required to fully utilize a core. When not having enough different and independent work, a GPU may not be able to flex its muscles. Besides having to execute instructions on groups of 32 (some: 64) items, GPUs usually also suck at control flow (if-else, conditional branching and jumps), as well as special (transcendental) math operations, such as sin(), cos(), tan(), sqrt(), … This leads to not every workload actually being able to benefit from GPU execution.
"Bad for general purpose computing" that's a bit too harsh. What really makes the difference is whether your algorithm supports high data parallelism without needing separate control threads, or at least lends itself easily to massive SIMD
Well, it makes sense to me, because a CPU is doing a bunch of work while the GPU has ONE task: to generate video. The CPU has to do a lot - run your system, load programs, etc. Overall it would appear slower because, if you think about it, a CPU is giving out a bunch of data while the GPU is only doing the video task. Now get into APUs that have to do both...
At the end of the day, the difference is that CPUs are optimized for single-threaded tasks or tasks with a mix, while GPUs are beasts of multithreading and only multithreading.
CPU vs GPU cores: for me, CPUs are extremely smart. One huge difference is the scheduler and preemption mechanism. If an instruction is waiting on I/O (a memory access), the CPU can decide to do something else and then come back to it. An individual GPU thread/core cannot; at best the SM swaps in a different warp. This is also because there aren't many kinds of instructions a GPU core can run (mostly just arithmetic operations). A GPU core also cannot load memory arbitrarily on its own; it mostly just receives inputs and writes outputs.
Another difference, on a higher level, is that the instruction sets for GPUs are not public, so you cannot directly program a GPU core. That's why you need the GPU drivers, which are the only way a program tells the GPU what to do. It is almost like the functional programming paradigm, where the programmer tells the computer what needs to be done and the machine decides how to do it; whereas in other paradigms (using languages like C/C++) you tell it exactly what and how an operation is done, down to the CPU instructions.
Versatile abacus vs highly parallelized calculator. BUT don't forget that GPU design lends itself perfectly to data processing. Best example: the MCP7A, also known as the GeForce 9300. A wonderful hybrid chip consisting of an nForce 730i and a GeForce 9300 with working drivers. And everything YEARS before CPUs with crappy iGPUs. Out of that DNA, DPUs were born.
With all that being said, you can harness the power of your GPU for things other than graphics, if it happens to coincide with what the GPU is good at.
So basically, if we were to count CPU cores like Nvidia counts CUDA """cores""" (and AMD counts stream units), then the i7-11800H would have more than 128 cores? (8 cores capable of AVX-512, each ALU capable of doing 16 FP32 operations at once in SIMD, where I imagine each lane would be equivalent to one CUDA "core". And each Tiger Lake-H core most likely has multiple ALUs for superscalarity.)
GPUs are great at processing consecutive chunks of data. CPUs are great at processing arbitrary bits of data in arbitrary order. Car vs semi. I would hate to do Uber with a semi. I'd hate to move many pallets of product with a car.
In short: think of your GPU as a big truck that only reaches about 100 mph but can carry 2000 units at once, and your CPU as a sports car that reaches 250 mph but can only carry 100 units, and let's use 'sand' as the data we need to process. The sports car is about 2.5x faster than the truck, so if your job is 100 units of sand or less, or needs lots of quick back-and-forth trips, the CPU handles it better. But if you have 2000 units of identical sand to move, the truck (GPU) gets it done sooner even though each trip is slower. That's why in games we use the GPU to render textures, effects, shadows and millions of pixels, and the CPU to handle physics, AI, and the other branchy, latency-sensitive stuff.
Not quite. SER basically just allows the GPU to determine how much the threads in a warp are diverging from each other (ie executing different instructions to each other or operating on data from different areas of memory to each other), typically by asking the shader developer to provide it some data to base this off of so that the GPU can look at how different the data is between threads. This then lets the GPU attempt to optimise the case happening at 4:15 where the warp enters, say, an if/else statement and half the threads want to execute the X branch while the other threads want to execute the Y branch. By looking at the data that the shader developer provided, the GPU can estimate how badly the threads will diverge from each other and can reorganise them on-the-fly, basically ripping apart the entire warp and creating a new warp by combining non-divergent/divergence-minimised threads from different warps (this is my understanding of it, might be wrong, I'd really recommend reading the paper if you want to know more).
It's not weird at all. It's also because GPUs are technically in-order processors (GPUs capable of out-of-order execution now exist too - ARM Valhalla and Snapdragon Adreno are examples of out-of-order GPUs now in use), whereas CPUs are already superscalar out-of-order monsters. However, how you program either or both processors matters a whole lot too, because GPUs are meant to chew through vector math in parallel (i.e. you want as many ALUs busy as possible). EDITED: Forgot to add, SIMD vector processors, especially some superscalar varieties (such as AMD GCN shaders), tend to only work on exactly the same item at a time in parallel, unless they explicitly allow dual-issuing of two different SIMD vector math instructions at a time. FPUs are weird in a way.
Don't forget that a CPU core also implements the entire x86/x64 instruction set while a shader core is only going to implement a much smaller and simpler instruction set. This is how they fit so many more cores on a GPU die in the first place.
The next iteration on that discussion, I believe, is ASICs.
@@freedom_aint_free Basically, the more specific an integrated circuit is, the more efficient it is at doing that set of specific tasks or singular task. So most ASICs will be super fast at one thing but won't be able to do anything else. The opposite is an FPGA (Field-Programmable Gate Array), which is even more general purpose than the CPUs we have. You can make it do almost anything.
Unless it is a non-x86 CPU like an Arm core. x86 shouldn't always be seen as the only CPU type; Arm is on the rise.
But if a GPU has sufficient instructions it could be Turing complete and accomplish whatever a CPU can
GPUs probably are Turing complete; it's just that we would see more complex programming, be it in the source or in the compiler.
@@pokemettilp8872 RISC-V is the most interesting.
I remember when NVIDIA did this Tegra presentation and I had to cringe when they claimed they had the first 200 core (or something like that) mobile processor. They really just had a generic arm design and a GPU and added those cores up like they were equivalent.
That's nvidia shield there
Nvidia makes cpus?
@@sparkyispog They made SOCs that had ARM CPUs for various phones, tablets, car infotainment systems and the Nintendo Switch.
@@CjqNslXUcM and they still do SoCs for their own computers, like the Nvidia Orin which has 12 Cortex-A78AE cores (and an Ampere-based GPU with 2048 CUDA cores and 64 tensor cores)
@@sparkyispog Yes, they're all ARM based CPUs. That's why nVidia wanted to buy ARM. The Nintendo Switch uses an nVidia CPU. GPUs are usually paired with their CPUs, though. I think the AMD term for that is APU (Accelerated Processing Unit), which means "a chip with a CPU and a GPU in it."
Basically, CPUs are optimized to minimize latency, GPUs are optimized to maximize throughput (bandwidth)
While at first glance they seem to imply the same thing, they do not. You could get a result from the CPU in 1ms but only process 10 items, while a GPU can process 10,000 items in 100ms. You would expect this to mean 10,000/100 = 100 items in 1ms, but that's not how GPUs work. You pay for the high bandwidth in latency.
It is nuanced, but once you understand it, the difference is actually night and day.
GPUs also aren't as flexible. The programs you write are "inherently" parallel - no std::thread kind of stuff. You write a scalar-like program that is "automagically" parallelized, so you have to think about parallel access from the get-go.
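A sketch of what that "scalar-looking but automagically parallel" style means in CUDA (the function and variable names are mine): the body reads like a plain loop body, the runtime supplies the indices, and there is no thread object to create or join.

```cuda
// Grid-stride loop: the same scalar-looking body is executed by however many
// threads the launch provides, each striding over its share of the array.
__global__ void square_all(float *data, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] = data[i] * data[i];   // written as if for one element at a time
    }
}

// Host side: no std::thread, no joins - just pick a launch shape.
//   square_all<<<256, 256>>>(d_data, n);
//   cudaDeviceSynchronize();
```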
This is also why GPU's can be saddled with RAM that has massive bandwidth but not so great latency. GPU work load is fairly predictable in that sense and thus the latency can be worked around.
Is that the reason why a PCI-E slot with 16 lanes is way faster than one with only 4 or 8 lanes?
@@3333927 Each lane provides x bandwidth so more lanes is more bandwidth. Latency is still the same however.
@@Jabjabs GPUs*
Better analogy - CPUs are surface streets, GPUs are the express lane. You can travel much faster in the express lane but it only goes one direction. You can head anywhere in the city on surface streets or turn around at any point, but you can't go as fast.
You should have shown the physical layout of a CPU core vs a GPU core. The difference is clear that way: way more parts in the CPU core. They are very different and not even in the same realm.
4090 has 83 TFLOPs. It’s the 4080 that has the 49.
And doesn't the 13900K have 1 TFLOPs?
@@Kynareth6 that's a thing?
TFLOPS don't mean anything for performance in games, because that is what we're talking about here. There are many GPUs like the Titan V, which has 110 TFLOPS but is no better than a 2080 Ti which has 14, and many GPUs with less than that still outperform lower-TFLOP GPUs. I don't even get the point of using it as a metric to prove how powerful something is, especially in the context of game performance. Sure, it might be better for editing or other applications, but otherwise, no.
@@marmite2956 doesn’t matter my dude. The info presented in the video was factually incorrect in regards to the specs of the 4090.
A factual error was made, corrected in the comments, and acknowledged by the OP. That's as far as the conversation went and had to go. Functionally, the specs of the GPU don't make any meaningful impact on the content of the video.
@@redcat7121 not only is it out, it has already been replaced by the 13900KS
A great explanation. Thank you!
In a basic analogy, it can be taken like this: CPUs are like 3 architects/builders. They can do a lot of complex stuff well, but they're limited in number, and therefore in efficiency when complexity isn't required.
GPUs are like ant colonies; not smart enough to build wonders, but there are many enough to work fast and efficiently on singular tasks.
Good video. I'd like to add that programming GPUs in a way which approaches the advertised performance is rather difficult. You have mentioned the branches, but also, they are lacking features (no call stack, no dynamic memory allocation), ideally need specific memory access patterns (search for memory coalescing, and bank conflicts), have manually managed in-core caches, and important technical info is a trade secret (like their instruction sets).
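To illustrate the "manually managed in-core caches" point: in CUDA that's __shared__ memory, a small per-SM scratchpad you fill and index yourself. A minimal sketch (a block-wide sum; the names and the 256-thread block size are my choices):

```cuda
// Each block of 256 threads sums 256 elements using shared memory as an
// explicitly managed cache, then writes one partial sum per block.
__global__ void block_sum(const float *in, float *block_results, int n) {
    __shared__ float tile[256];                    // on-chip scratchpad, managed by hand
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // coalesced load: neighbouring threads
    __syncthreads();                               // read neighbouring addresses

    for (int step = blockDim.x / 2; step > 0; step /= 2) {
        if (threadIdx.x < step)
            tile[threadIdx.x] += tile[threadIdx.x + step];
        __syncthreads();                           // everyone must finish before the next step
    }
    if (threadIdx.x == 0)
        block_results[blockIdx.x] = tile[0];       // one partial sum per block
}
```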
My sole wish is that dedicated GPU boards get modular VRAM one day. It would be pretty great and would let older models last longer.
@@FMHikari One problem with that is that it would have to be physically further from the GPU itself which would slow things down
@@circuit10 Understandable. I still wish it was somewhat possible, or at least easier than reballing the memory chips.
@@FMHikari VRAM bandwidth is about an order of magnitude higher than system RAM. Due to the lower bandwidth, system RAM modules only consume a couple of watts, and they don't even really need passive heatsinks - although hardware vendors are happy to sell modules with heatsinks to get a few extra bucks.
However, due to the much higher bandwidth, VRAM chips in modern GPUs don't just need passive heatsinks, they actually need active cooling. While it's technically possible to engineer modular VRAM which would support active cooling, I'm afraid the costs of that are going to be prohibitive. CPU sockets do that, but CPUs cost hundreds of dollars, and there's normally just a single CPU in the computer.
@@FMHikari It would be good if it was possible though, especially for AI things you tend to need a lot of VRAM but Nvidia would rather sell you an expensive datacenter GPU for that
This kind of parallelism is actually called Single Instruction Multiple Threads, as it is slightly different from Single Instruction Multiple Data. In fact a Warp can be Single Instruction Multiple Threads like explained in the video and process multiple pixels at once (and following every branch in unison), while every core can be Single Instruction Multiple Data and process a vec4 at a time, not just a float.
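As a small illustration of that "SIMD inside SIMT" idea: CUDA has vector types like float4, so a single thread can handle four values at a time (on NVIDIA hardware this mostly buys you wider, vectorized memory accesses rather than a true 4-wide ALU, so treat this as a sketch of the concept, not a claim about any specific architecture):

```cuda
// Each thread processes one float4 (four pixels' worth of data) while the warp
// of 32 such threads still marches through the same instructions together.
__global__ void scale_vec4(float4 *data, int n_vec4, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_vec4) return;
    float4 v = data[i];          // one 128-bit vectorized load
    v.x *= factor;  v.y *= factor;
    v.z *= factor;  v.w *= factor;
    data[i] = v;                 // one 128-bit vectorized store
}
```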
4:38 Former maintainer of Intel's OpenCL driver for Linux here. On Intel, the Y branch threads would execute after the X branch has finished (reached the "else" statement) and block the X branch threads until the end of the if/else. I'm not familiar with Nvidia but I think they do the same.
Also with AVX512 the line seems to be blurring somewhat, AVX512 has the same lane masking capability just like Intel's GPU ISA.
Is SIMD on CPUs, like AVX-512, comparable to GPGPU these days, or are they completely different beasts?
@@hansisbrucker813 compilers have been pretty good at hiding the architectural differences between CPU with SIMD and GPU, even before the execution mask introduced with AVX512 the difference from an application programmer's perspective should be very small.
There are still differences, like with Intel's iGPU you can do some weird stuff with its registers, like "treat r2 and r3 like one register, take every second element, add the first element of r4, widen the result, and store to r5/r6 as if it's one single register" which with AVX512 you need multiple instructions to align the lanes before doing the addition.
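On the GPU side, the per-lane execution mask being compared here is even exposed to the programmer; a hedged CUDA sketch using the warp-level primitives from CUDA 9+ (the 100-element threshold is just an example predicate):

```cuda
// Inside a divergent branch, each thread can ask which of its 32 warp lanes are
// currently active, and which lanes agree on a predicate - this is the hardware
// lane mask that AVX-512's mask registers resemble on the CPU side.
__global__ void mask_demo(const int *in, unsigned *masks, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0) {
        unsigned active = __activemask();                      // lanes that took this branch
        unsigned big    = __ballot_sync(active, in[i] > 100);  // lanes where the predicate holds
        masks[i] = active & big;
    } else {
        masks[i] = 0u;
    }
}
```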
@@linnaea_lavia neat
Thanks for help on Linux
Man, I'm still waiting for the Linux drivers for the Intel N2600's GPU, the GMA 3600 (PowerVR SGX545). Those drivers were only made for Win7 x86, nothing else has hardware acceleration for video or 3D. It's like the GPU does not exist for the OS. No driver ever came. What happened? This laptop is a paperweight without them, as I can't even watch a locally stored sub-VGA res video on it.
Would really appreciate more videos on this style explaining these kinds of concepts
Great presentation!
Size / yield / energy: a big clever CPU core is harder to manufacture. Scaling out a CPU to the same number of cores as in a GPU, but keeping the complexity of the CPU… you'll get insane power draw, low yield and insane prices like supercomputers.
One thing cut for time here: the memory architecture is very different because they are built for such different purposes. GPUs have specialized memory that shuffles a lot of data over very wide buses; they read a lot of closely aligned memory. CPUs have very narrow buses (in DDR5, 2x32 bit per stick). So a CPU chip can shuffle a lot of different data at the same time while GPUs are good at shuffling a lot more of the same data. So the GPU memory model is bad for running multiple different programs at the same time. The literal hardware interfaces of the chips are built for extremely different purposes - an entirely different programming idea :)
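That "read a lot of closely aligned memory" point is exactly what memory coalescing means in CUDA; a sketch of the good and bad access patterns (the stride value is just an example):

```cuda
// Coalesced: consecutive threads read consecutive addresses, so the 32 loads of a
// warp merge into a few wide memory transactions - this is what the wide bus wants.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads hit addresses 'stride' floats apart, so each load
// lands in a different memory segment and far more transactions are needed.
__global__ void copy_strided(const float *in, float *out, int n, int stride /* e.g. 32 */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = (long long)i * stride;
    if (j < n) out[j] = in[j];
}
```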
What about a CPU and GPU using the same memory? Like the Apple M2 Max with a 32-core GPU, which uses the same memory that is used as RAM.
@@iamraihan3203 Sorry for the late reply on something I am not an expert on, the M2. I imagine that being bandwidth limited is an issue for some applications, because from what I read it is based on LPDDR5 DRAM. So maybe extreme gaming at max resolution and max fidelity cannot produce FPS similar to a "real" GPU with monster buses. If there is anything in the M2 architecture that makes this less of an issue, I don't know about it. Just being on the same chip as the CPU shouldn't make bandwidth larger than the physical RAM bandwidth. The CPU having shorter latency to the GPU might give some wins in specific workloads where bandwidth isn't the limiting factor. But max graphics with tons of high resolution textures ought to break it. Similar to consoles, I imagine game devs might optimize games to avoid the limits when there are only a few M chips with known performance.
So based on my limited understanding I don’t see how PC games requiring large VRAM + bandwidth requirements could ever be ported to m2 without bottle necking earlier. As long as the graphics problem is bandwidth limited speed between RAM/VRAM to GPU chip is the limiting factor.
So with a bunch of caveats; for extreme graphics you want a very fast bus between GPU chip and RAM.
Apple is known for good hardware though, such as their video encoder/decoder that was leagues ahead of what NVIDIA shipped. That made a huge difference for Canon R5 footage that that Apple could accelerate but NVIDIA could not. So not all problems are about bandwidth limitations.
Caveats caveats caveats :-)
You just shouldn't call GPU FP units cores; that's just a marketing term from NVIDIA. Shaders or FP32 units would be better names for what they are. The closest thing in an NVIDIA GPU to a CPU core would be something like an SM. And there are only like 128 SMs even in the highest-end GPUs.
Ok then what is a core? Because I'm betting you'll argue the exact opposite for AMD's Bulldozer architecture
@@jennalove6755 First of all, there is no exact definition for what a core is. Typical properties of cores are:
1. Memory is shared across cores
2. Cores can process data and instructions independently (MIMD)
3. (at least for symmetric multi processing) all cores are hierarchically equal (there is no master core that manages all other cores).
4. Each core has its own front end, execution engine and memory interface
I don't know why you're mentioning Bulldozer?
@@jennalove6755 I guess a core is the same as a thread? Except with shared resources like hyperthreading that can give you two threads per core.
@@jennalove6755 It would be more appropriate to consider the CUDA Cores as FP-ALUs. Making a full comparison is tricky due to FP vs Int and FMA vs FADD. But if we compare FMA only, then one of the dual-core Bulldozer complexes would have "4 Cores", and that goes up to "8 Cores" if you include the FADD units. So really, AMD could have argued that they have at least 2x as many cores as they did (2x FMA per Int core). And yes, I would claim that each complex was 2 cores as a core is typically defined as the number of front-ends (parts that fetch instructions). This is why the appropriate comparison with a GPU is the SM == Core, and the CUDA Cores are SIMD ALUs similar to the SIMD ALU cores from SSE on x86 (or NEON on ARM).
@@jennalove6755 Windows defines my Bulldozer as a 4-core CPU with a total of 8 logical cores.
Love the video. Really interesting and pretty simple to understand.
Awesome, thank you!
GPU cores also run at a lower clock speed, which allows stacking more of them in a small chip
And each core (CU/SM) is very small too since it's very simple.
Clocks don't have to do with density directly. It has more to do with how deep the pipeline stages are and transistor yield and quality. Faster clocks demand that the entire pipeline stage, which can be multiple transistor "layers" deep and very wide, be completed by the next clock. You can either shorten or lengthen, narrow or widen these stages for various benefits. GPUs tend to have very wide, very deep pipeline stages which take a long time to process and guarantee data is fully through all the transistors, thus pushing clocks down on GPUs. But it gives them amazing bandwidth and energy efficiency at the expense of a bit of latency.
More current gets this happening faster but produces more heat, which is another consideration, as modern computers are far more limited by heat production than anything else. Heat cannot escape the dies fast enough at the currents necessary for the speeds we demand, and parts of the die have to be turned off 80-90% of the time. Dark silicon is still a big problem. It's also part of why low power processors aren't that much weaker yet consume 1/10th the power.
@@KaiserTom Clock speeds are also affected by the parasitic capacitance of the transistor itself.
@@rattlehead999 The clock speeds are typically determined after the design based on the parasitics within the manufactured die. This is also where binning comes into play, chips with higher parasitics are binned to lower clock speeds.
The choice of CU/SM size is more of a tradeoff between area (to reduce die size) and pipeline latency (the shorter the pipeline, the more efficient warp switching becomes). Clock speed estimates are taken into consideration at that stage, but in a much more heuristic way that doesn't really consider parasitics (those end up affecting the design as you get closer to tapeout and perform advanced electrical simulations).
I think it's useful to mention that GPUs will frequently and deliberately block on memory, as the memory subsystem is geared towards throughput with little caching in the way of reducing access latency. Hence, an SM may theoretically switch context after every warp instruction.
Thumbnail is 🔥
MY MAN
You could have simplified this video so much, with less name dropping, while supporting it with a tiny speck of math to make it overall more understandable. Even still, this was an overall good explanation. If anyone desires a more compact explanation, read the following.
The CPU has a certain number of threads (these are comparable to hands: the more hands you have, the more you can do at once). Each thread is capable of handling a single operation at a time. Usually the CPU has a number of cores, and they might or might not be split in two so that there are twice as many threads as cores (two threads per core), thus doubling the amount of things that can be done at once. A CPU can handle almost any operation, but only a certain amount at a time (= amount of threads).
A GPU is perfect for Linear Algebra because it has many many tiny "threads*" each solving a small problem of a greater problem toghetter. Which is perfect for Lineair algebra. As Lineair Algebra is an extremely simple and easy form of math. Just takes a really long time for one person or thread to perform or calculate. For simplifications sake Lineair Algebra is just the math revolving around matrices. These are effectively 'packages' containing data. A matrix could for example give you a 3D coordinate on earth with its corresponding temperature and so on instead of having to express each property individually.
Summary:
CPU is like einstein, you have very few but they can do a wide range of things.
GPU is like a peasant, there are many, they can do very few things. But toghetter they topple empires.
If my English is broken please reply with where and what so i can correct myself.
Your English is fine! But you misspelled "toghetter" and "Lineair", it is supposed to be "together" and "Linear".
That branch blocking fits a GPU perfectly, because it short-circuits the computation path if the view of the object is blocked, or not in view.
One thing to note: starting with Ampere, Nvidia's CUDA cores have an FPU and a combined INT32/FP unit; they aren't completely split anymore. They did this to increase the max theoretical performance; however, that max is rarely reached, since there's almost never a case where no integer calculations are being run alongside the floating-point work. It's really quite interesting actually. AMD has done the same thing with Navi 31, in the Radeon 7000 series. I guess it's a way to squeeze out some extra performance without increasing die size
And also starting from Ada Lovelace (40 series), NVIDIA's SMs support on-the-fly thread reordering which basically lets them reorganise threads across multiple warps such that each thread in a warp is actively executing the same instructions on neighbouring sets of data. Where past architectures may have some threads block execution when the warp enters a branch, Ada Lovelace can detect this and reorganise these threads so that all threads in a warp are taking the same branch, while the other threads that wouldn't have taken that branch are organised into a different warp populated with other threads from other warps that _also_ wouldn't have taken that branch. Caveat with this is it seems like this is only supported with callable shader execution where one shader can invoke another on-the-fly, which I assume basically gives the GPU time to do the reordering.
NVIDIA also has another optimisation up their sleeve called subwarp interleaving, which basically lets a warp dynamically switch between two branches of code whenever one of the branches encounters an instruction that would block execution anyways (such as an instruction depending on data being read from VRAM). With current architectures the entire warp would just end up stalling since the first branch is waiting on a blocking instruction to complete and the second branch is waiting on the first branch to complete, but with this optimisation the warp would be able to switch to the second branch from the first branch, execute the second branch until it also encounters a blocking instruction, then switch back to the first branch where the instruction would hopefully have access to the data it was depending on. This isn't currently available in any publicly available GPUs, but NVIDIA was testing this out on a custom Turing (20 series) GPU and saw anywhere from a 6% to a 20% improvement in raytracing workloads, which is one of the most branch-heavy workloads you could give the GPU.
@@jcm2606 Nice info, I was interested in knowing more about this gen
@@jcm2606 some insider information? or is it available to the general public? Take care bro, I hope this doesn't count as casually breaching classified info on a random comment on YT.
@@nymbusDeveloper86 This information is literally on 4000 series webpage
@@flintfrommother3gaming OK
It has always been slower; the advantage of a GPU is not in speed (as some might think, given how it can render amazing graphics through some pipelines), but in concurrency: the sheer number of cores with specialized functions does wonders for its specific needs.
Thanks for this correct and honest answer at the very end :) Ultimately it depends on what should be done and how it's implemented. Certain tasks can't be done efficiently on a GPU.
GPUs are great for pipelining. So if you have 10 operations that have to be done in succession, it actually sets up several parallel pipelines you can pump data through. You essentially get a fixed delay which is about equal to the pipeline length, but other than that the data can be pumped through at the clock speed. The worst thing that can happen to a GPU is to stall a pipeline. That's why using tons of different shaders is much worse than a few well-designed ones. Though with actual CUDA cores they implement more and more concepts of a CPU into the GPU. Though in the end it's still a piece of specialized hardware for certain tasks. Yes, they can do some tasks a CPU can do as well, but are oftentimes slower. Though for pure data crunching they are great, at least when you can adjust the algorithm for the architecture.
I love this channel. You have done a really good job with explaining concepts that are rather esoteric at times.
I found the snippets of code examples overlaid for context rather interesting as well. Code examples for small processes are really helpful for people like me who are beginners, to understand how simple tasks are performed and then how they can be implemented en masse to perform complex tasks. I wish someone would talk a little bit about the beginnings of embedded programming, how to get into it, and working with PLDs and understanding the building blocks of creating your first projects. Like, how to scope a programming project, how to attack it, etc.
For instance, yesterday while driving home from work, my brain was occupied with the idea "Ok, assume I did not have a cmath library and only had addition. How would I perform exponent math?" That was a fun little task.....first asking "Well, Brian.....what does it even mean to 'square' something?" and then the question was "Well wait.....how do I do multiplication with only addition?! I know how to do it as a person, but how does a machine do it with only addition and loops?" That was a fun mental exercise. 🙂
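For fun, here's one way that mental exercise comes out in code: a toy sketch (integer-only, no cmath) that builds multiplication from repeated addition and exponentiation from that multiplication. Real hardware uses shift-and-add instead, so treat this purely as the thought experiment:

#include <cstdio>

// Multiply two non-negative integers using nothing but addition and a loop.
unsigned long mul_by_add(unsigned long a, unsigned long b) {
    unsigned long result = 0;
    for (unsigned long i = 0; i < b; i++)
        result = result + a;               // repeated addition
    return result;
}

// Raise base to a non-negative integer power using only the multiply above,
// which itself only uses addition.
unsigned long pow_by_add(unsigned long base, unsigned long exp) {
    unsigned long result = 1;
    for (unsigned long i = 0; i < exp; i++)
        result = mul_by_add(result, base); // repeated "multiplication"
    return result;
}

int main() {
    printf("3^4 = %lu\n", pow_by_add(3, 4)); // prints 81
    return 0;
}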
There's also the interesting differences between SIMD vs SIMT, and how most modern GPU's actually implement both. (Typically only for packed vectors of int4/8 or for 2*FP16)
GPUs*
5:38 This specific design of GPU cores is also why modern GPUs tend to have multiple types of cores as well - each core type designed to handle one specific type of task.
Amazing Explanation! :)
This explains why Half-Life 1 and Counter-Strike 1.5 were the last games of the "software renderer" era, and why we can have cool graphics nowadays, even with "low end" GPUs
Half-Life also supports OpenGL.
@@eleventy-seven I've read somewhere that Half-life 1 looks better with software rendering than with OpenGL.
No idea how I found this video but I'm glad I watched it!
You explained the difference between GPU and CPU cores very well and I could actually follow quite easily. You earned yourself a new sub!
Welcome!
He's pretty good at his stuff. You won't be disappointed by your decision.
My favorite way to tell people what a CPU is, is by using a restaurant analogy. The cores are people, the customers are incoming data, and each core/person has a task it has to execute. It's the best analogy that I used to understand CPUs
It’s moments like these you wonder why GPU’s are more expensive than CPU’s 😢
I do a lot of work in numerical weather prediction models, and much is now being run on GPUs. Running a simulation on a GPU is much faster than a CPU. The same goes for machine learning, such as a convolutional neural network - which is a popular way to work with large meteorological weather data to improve the accuracy of weather forecasts. Now the numerical weather prediction models are designed to work on Supercomputers, such as the Cheyenne supercomputer that the National Weather Service uses, but they can also run on your desktop computer. In fact, I run the WRF model on my I9-10850K and I'm also doing work with the Unified Forecast System (UFS), which is the next generation of numerical weather prediction model in the US. And for that I'm using my 3080. You have a very informative channel, and I will be subscribing.
The Intel Core i9-10850K is a desktop processor with 10 cores, launched in July 2020.
Currently, the combined processing power of NWS supercomputers is 8.4 petaflops, which is more than 10,000 times faster than the average desktop computer.
GPU shader cores, from my understanding, are little more complex than the first Pentium CPU cores, but given modern architectural design, the emphasis on massively parallel design, and newer process nodes, you now have a massive number of relatively slow processors but with a huge number of 'threads'
good video, CPUs also have instruction sets for SIMD, currently mainly AVX2 and AVX-512
That's some deep level of knowledge
Thank you sir for the infos ❤️
Finally! a clear explanation of GPU architecture! Thanks!
It should be noted that there are tasks other than graphics that lend themselves to SIMD processing - more or less anything that involves vector or matrix operations (signal processing, solving massive systems of linear equations...). That's where Nvidia's CUDA API comes in.
@1:56 I really thought that'd happen someday. Back in the day I had a socket 939 motherboard with a slot to take an AM2 daughterboard. When GPUs started to do compute workloads and high-end SSDs came on PCIe cards, I thought it wouldn't be long before we got whole systems running on PCIe slots.
Btw you can run any DirectX graphics on your processor. A 13900K is about equal to a midrange card from 10 years ago; you can play CS:GO on your CPU even without an APU
The way I remember it being explained once: a CPU runs single, intense calculations extremely fast, whereas a GPU crunches triangles.
Magic word: parallelism.
Nowadays it's a bit more complex than triangles since modern GPUs are designed to support some level of general compute (see compute shaders in most graphics APIs, as well as general compute APIs like CUDA or OpenCL), but yeah, that's basically it.
The GPU cores are almost like hardware versions of specialised subroutines. In fact, that is part of what CUDA cores are for: calculating the triangles for 3D graphics. Meanwhile, the CPU is a general-purpose processor that can execute any of the instruction sets within its CPU family.
Think of how the Amiga was capable of arcade-like graphics and sound because it had specialised sprite and sound processors in addition to the Motorola 68000 CPU it used.
IBM PCs had only Hercules/CGA/EGA graphics cards, which were really RAM/register buffers with DAC circuits to flush the data from c800 to the VGA analog output. Sound-wise, the IBM PC's beeper speaker was limited until the Yamaha OPL3 FM chip gave it MIDI and a few voice channels. This offloaded specialised MIDI, wave, DSP and other related sound processing to sound cards.
Nice explanation of warp scheduling and stuff, I used those ideas a lot in my path tracer
Each NVIDIA warp is basically equivalent to a single CPU thread capable of executing vector instructions of size 32. A simplistic way to estimate the sequential performance could be to divide the number of NVIDIA cores by 32. That does not work because of other major design differences with CPUs. The GPU cores are themselves grouped into SMs (Streaming Multiprocessors) designed to execute dozens of warps in a way that is roughly similar to CPU hyperthreading but at a larger scale. Also, the individual warps are not optimized for speed. A CPU can execute simple instructions from a single thread at a rate of 1 per cycle, but a GPU SM can only process the instructions of a single warp with significant latency; maybe 1 every 10 cycles for the simple ones, up to a few hundred cycles for memory accesses. To make things worse, GPUs have very small memory caches because they would be inefficient (or very expensive) when running tens of thousands of threads in parallel. Instead, GPUs rely on the principle that the memory latencies between instructions are hidden by the fact that each SM is running up to 64 warps in "hyperthreading" mode (and the whole GPU may contain up to 100 SMs).
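To make the warp / latency-hiding point concrete, here's a minimal CUDA sketch (array size and launch configuration are arbitrary example values): each warp of 32 threads runs the kernel in lockstep, and the SM hides the memory latency by switching between the many resident warps.

#include <cstdio>
#include <cuda_runtime.h>

// Every thread computes one element. A warp of 32 threads executes this in
// lockstep, and the SM hides the global-memory latency by switching between
// the many warps it has resident.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // one FMA per thread, 32 per warp
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block = 8 warps per block; thousands of warps in total,
    // which is exactly what gives the scheduler something to switch to on a stall.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}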
The best summary I have heard is: GPUs are optimized for high throughput while CPUs are optimized for low latency.
This generally applies if the problem can benefit from parallelism.
Warp does not come from warp-speed, but rather warp and weft in making fabric from thread.
You mean warping material into a form as in related to wrap
Old English weorpan "to throw, throw away, hit with a missile," from Proto-Germanic *werpanan "to fling by turning the arm" (source also of Old Saxon werpan, Old Norse verpa "to throw," Swedish värpa "to lay eggs," Old Frisian werpa, Middle Low German and Dutch werpen, German werfen, Gothic wairpan "to throw"
Also Old High German warf "warp," Old Norse varp "cast of a net". Most likely where we get the word 'wharf' where ships are moored.
CPUs actually include SIMD execution units and perform tasks in a similar way to GPUs.
What sets GPUs apart is their fixed function hardware used for special tasks like performing raster algorithms, sampling textures and in more recent times neural network inference and ray intersection testing.
We have also seen that under the right circumstances compute shaders are competitive with fixed-function rasterization, and some CPUs now include machine learning instructions.
This makes me wonder whether someday we might go back to a unified processor, that forgoes that overhead and complicated programming model that comes with using a co-processor.
More content like this I love it. Keep up the great work man!
I'd say comparing the two is like comparing a big commercial ship to a shipbuilding machine.
Very different use cases, but one definitely benefits the other :)
I think it's also important to highlight the fact that FLOPS only measures the performance with floating point numbers. 80+% of your everyday tasks don't even involve floating point numbers at all
Absolutely amazing video, all the other channels massively oversimplify it.
Wow, thanks!
I would really enjoy some hands-on Vulkan videos in Rust or C if that is something you would be interested in doing :)
If you're looking for a layman's terms distinction you could use general purpose for CPUs, as you did, and then specialized on the other end for GPU cores. CUDA cores, and whatever AMD's equivalent is, are specialized for a very specific kind of math in a very specific way and are thus made to be incredibly fast and efficient at that, whereas a CPU has to be able to do everything at roughly the same effectiveness and so just doesn't do as well at those specific tasks.
Back when I studied CPUs (around 2000) I was wondering why they didn't use GPUs, but GPUs are better at numbers than instructions.
2:39 that was it, the FPU.
One CPU core can execute several instructions in parallel; this is why SMT/HyperThreading is possible. While it's a bit tricky to define the boundaries of cores, one could say that an x86 core with 6 ALUs and 4 FPUs would """equal""" 4 GPU "cores" as far as throughput is concerned.
Otherwise x86 CPUs would be limited to "number of cores * frequency * 2" FLOPs, but instead the number is much higher (and I think CPU FLOPs are calculated by "number of cores * number of FPUs per core * frequency * 2")
And then there is rabbit hole with "double purpose FPU/ALU" + FPU combo in the newer Nvidia GPUs, which essentially means that theoretical performance is anywhere between half of what you'd expect (equal number of INT and FP instructions) and what's advertised (pure FP).
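If anyone wants to play with that peak-FLOPS formula, here's a trivial back-of-the-envelope sketch; the numbers are hypothetical placeholders, not any specific CPU:

#include <cstdio>

int main() {
    // All numbers below are made-up placeholders, not a specific SKU.
    double cores         = 16;   // physical cores
    double fma_units     = 2;    // FMA units per core
    double lanes         = 8;    // FP32 lanes per FMA unit (256-bit wide)
    double flops_per_fma = 2;    // an FMA counts as a multiply plus an add
    double ghz           = 4.0;  // sustained clock in GHz

    double gflops = cores * fma_units * lanes * flops_per_fma * ghz;
    printf("theoretical FP32 peak: %.0f GFLOPS\n", gflops); // 2048
    return 0;
}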
Some older GPU arches actually didn't support dynamic branches at all. And even today the result is a severe cost in performance:
the GPU actually computes both if/else code paths and selects the result via boolean (condition)
I.e. the shader if-else is likely compiled into branchless code.
I don't think that is the case anymore though.
This highly depends on the compiler (both the IL and driver ones). If it decides that the difference between both paths and the mere execution of a branch will be negligible, then it will most likely result in a removed branch (so both paths execute). You can directly specify and hint to the compiler what you want to generate with the [branch] or [flatten] attributes in HLSL; I don't know about other high-level languages (GLSL has no support for those iirc).
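A rough sketch of what that flattening means, written as CUDA device code (path_a and path_b are made-up stand-ins for the two sides of the branch; a real compiler does this on the generated ISA, not in the source):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical "expensive" paths, just stand-ins for the two sides of a branch.
__device__ float path_a(float x) { return x * x + 1.0f; }
__device__ float path_b(float x) { return 2.0f * x - 1.0f; }

// What you write: a data-dependent branch per thread.
__device__ float shade_branchy(float x) {
    if (x > 0.5f) return path_a(x);
    else          return path_b(x);
}

// Roughly what a flattened/branchless compilation amounts to: every lane
// evaluates both sides, then a per-lane select picks the result. No
// divergence, but every lane pays for both paths.
__device__ float shade_flattened(float x) {
    float a = path_a(x);
    float b = path_b(x);
    return (x > 0.5f) ? a : b;
}

__global__ void demo(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = shade_flattened(in[i]) - shade_branchy(in[i]); // same value, so 0
}

int main() {
    const int n = 8;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = i / 8.0f;
    demo<<<1, n>>>(in, out, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++) printf("%f\n", out[i]); // all zeros
    cudaFree(in);
    cudaFree(out);
    return 0;
}

The tradeoff is exactly what the [branch]/[flatten] hints let you steer: flattening is a win when both sides are cheap, and a loss when one side is expensive and rarely taken.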
Another reason is that even if you cram many cores into a CPU like in a GPU, most of the time it wouldn't even be faster. Most programs aren't even able to take advantage of the 8 or 16 cores of today's CPUs, so the main bottleneck is currently the clock speed per core. And the clock speed equivalent of a GPU is very low.
Clock speed isn't the limitation..... IPC is the limitation
If clock speed was the limitation, CPUs wouldn't have been improving much in the last decade
As Alucard said, clock speed doesn't matter; IPC is the main thing that makes a CPU fast. You have 8-core CPUs that run at 1.3GHz that give you performance close to a Ryzen 7 1700X. The biggest bottleneck is the unwillingness of Intel and AMD to switch from x86-64 and CISC to RISC or VLIW.
That's a really good point.
@@xfy123 Yes, clock speed is not as direct a measurement as IPC. But performance is roughly IPC times clock speed; higher clock speeds are better, so there is no way you can say they don't matter
And it's not only the core, but the bus that connects the relevant data sources and sinks. A GPU would be something like a 64-lane highway full of trucks (VRAM) with only a handful of individual destinations it can reach. That's because a GPU has a reduced instruction set with copy-pasted circuits per lane. A CPU on the other hand would be a large city with maybe a 4 or 8 lane highway (Cache). From any lane you take into the CPU you can reach more or less any destination register in that CPU.
Also, GPUs are fast with 32-bit float numbers, but with 64-bit floats performance can be 20x slower, so if you need 64-bit precision, a CPU can be more performant
Saying that it isn't fair doesn't go far enough... GPUs and CPUs are just two totally different things; it's like comparing a car with a truck...
I just wanted to thank you for these excellent videos!
My pleasure!
From your description, one Alder Lake core (with AVX-512 enabled, so Sapphire Rapids) would be equivalent to 32 CUDA cores, since it can initiate 32 floating-point math ops per clock cycle. Except that they can be two sets of 16-way SIMD, and it can simultaneously perform another 4 logical ops and more memory ops at the same time, twisting and winding its way through 2 arbitrary threads at the same time with far greater flexibility, and at a higher clock rate. So top Xeon CPUs can theoretically manage >7 teraflops, as long as the memory can keep up. Still 7 times slower, but much easier to design diverse code for.
Of course it’s fair to compare so we know what processor is best for what, as you have done.
Modern GPUs also have branch prediction engines just like CPU, so the threads blocking warp doesn't happen.
This is a nice overview of the differences between CPU and GPU. It might be better not to use the term CUDA core so often, as that doesn't match the terminology of what a CPU core is, and probably confuses the watcher. Granted, NVIDIA made the terminology confusing. The video also makes it sound like only GPUs perform SIMD operations and CPUs don't, which is a misrepresentation because SSE, AVX, etc instruction sets exist, which would multiply "CPU core" count by 8 or more. Another difference that wasn't mentioned that allows GPUs to have high throughput is that latency can be hidden by switching contexts between several warps on an SM, akin to hyperthreading on a CPU.
Awesome work. Can you do a video on CPU security features like pointer authentication?
The internet really needs a well-researched video on capability hardware. There's a good book on the subject but it only covers mainframes, and nothing after the 80s. I get that CHERI (including ARM Morello) is the big thing now, but the intel i960 was absolute perfection.
Warp, as in weaving. Look up "warp and weft." The warp is a bunch of parallel threads.....it's pretty easy to see why the name was chosen.
A channel that answers the questions my curiosity hungers for!
This reminds me of Acerola's video where he tries to make a pixel sorting shader, and he found that GPUs are miserably bad at sorting, and he could not get it to run at acceptable framerates without high end hardware. He explained these same concepts as "CPUs are smart slowly, GPUs are stupid faster."
I'd put things a bit differently.
CPU cores are designed to be versatile, to be able to perform many kinds of operations, but generally do one thing at a time very fast; that's because programs need to execute in a precise order, like add A to B, then divide the result by C.
GPU cores on the other hand are designed to do only certain, limited things and are super small, so you can fit thousands of them next to each other, meaning they can do multiple tiny things in parallel, all at once. Then you can just glue together what they produced to compose an image.
I can’t understand quite many things. All my takeaways are CPU is more generalized, while GPU is very scary on task-specific computing. 😂
A GPU core is smaller and more stupid and cannot act alone; everyone in the group executes the same instructions (program).
A CPU core is much bigger and more independent, and each core runs one (or sometimes several) programs at the same time.
That was what the SIMD / WARP discussion was about :)
That's a very good takeaway actually. The essence is that a GPU "core" is just a very small math unit, it can do a couple of math operations and that's it, but also it's not an independent unit, it needs to be organised in bigger groups with other "cores" and run the same operations together but just on different data.
That's very useful when, for example, you want to apply a shader to the entire image, i.e. multiply all the pixels on the screen with a different number to affect their lighting levels. But when e.g. you want to display a webpage that has all kinds of different elements, you would have to do that element by element, i.e. the entire GPU thread calculates this button, then another entire GPU thread calculates the position of the other image, etc. It becomes ridiculous.
A CPU core on the other hand is essentially like an autonomous computer, you can run hundreds of different instructions, math, logic, weird jumps, memory operations, etc (it's getting really hard to keep this comment high level and not go into details :P).
@@InTimeTraveller Lol thanks bro. I understand it much better actually. x))
@@randomgeocacher :)))) thanks a lot mann that helps! 😄😄
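The "apply one operation to every pixel" case above is about the friendliest thing you can hand a GPU. A minimal CUDA sketch of that kind of brightness pass could look like this (resolution and gain are arbitrary example values):

#include <cstdio>
#include <cuda_runtime.h>

// Every thread handles one pixel, and every thread runs the exact same
// instruction stream, just on different data: the happy path for a GPU.
__global__ void brighten(float* pixels, int count, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        pixels[i] = pixels[i] * gain;
}

int main() {
    const int width = 1920, height = 1080;
    const int count = width * height;
    float* pixels;
    cudaMallocManaged(&pixels, count * sizeof(float));
    for (int i = 0; i < count; i++) pixels[i] = 0.25f;  // dummy grey image

    brighten<<<(count + 255) / 256, 256>>>(pixels, count, 1.5f);
    cudaDeviceSynchronize();

    printf("first pixel after gain: %f\n", pixels[0]);  // expect 0.375
    cudaFree(pixels);
    return 0;
}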
Note that the Nvidia Cuda cores are probably closer to Intel threads. The best Nvidia equivalent for the core is the Streaming Multiprocessor (SM).
For the Ada Lovelace Architecture (RTX 40 series), each SM has 128 Cuda cores. 128 * 128 = 16384.
CPUs also tend to handle a larger number of data types with additional instructions (and more hardware) needed for both conversions between each, as well as all different operations performed for each different data type. The GPU core, on the other hand, can be a lot simpler and thus smaller.
Another way to put it:
A CPU core: A single jack-of-all trades genius who gets things done ASAP. He even tries to predict what will happen next so he is never waiting on someone else. His desk space takes up an entire floor of the office building. He has multiple assistants that organize things and try to keep whatever he needs within reach of this genius. Trying to manage lots of these people gets… unmanageable.
A GPU core: A factory worker who needs to be told what to do all of the time. Give the managers megaphones and they all get lots of stuff done, as long as the manager can utilize the factory workers efficiently. To do that, the factory workers have to all be following the same instructions.
See? Totally different!
Warp is fabric terminology. A fabric (cloth) consists of warp threads and weft threads.
Great video! However, CUDA uses the SIMT terminology instead of SIMD ;)
"Teraflops per second"...
"Trillion floating point operations per second per second"
The funny thing is, if I'm not mistaken, that the CPU is the component that feeds instructions to the GPU, since it can run independently while the OS is still running. It's also more dynamic in terms of the operations it can handle, so it could be faster or more efficient at a different math operation than the GPU.
The FPU in a CUDA core is optimized for particular floating point operations. They're really good at addition, subtraction, and multiplication. They suck at division and most any other function like trig and exponents. You're also going to suffer overhead from launching GPU kernels and from data transfers between the GPU RAM and main RAM/CPU cache. And the largest float they can handle is a double; they can't do long doubles. Doubles are usually enough, but double precision is also slower than single precision, though that may be in common with CPUs. I haven't checked.
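A tiny device-side sketch of that tradeoff, using CUDA's approximate fast-math intrinsics next to the standard library calls (input values are arbitrary; the fast versions trade accuracy for far fewer instructions):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void precise_vs_fast(const float* x, float* precise, float* fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Standard single-precision calls: more accurate, but division and trig
    // expand into multi-instruction sequences on the GPU.
    precise[i] = sinf(x[i]) / (x[i] + 1.0f);

    // Hardware-approximate intrinsics: far cheaper, lower accuracy.
    fast[i] = __fdividef(__sinf(x[i]), x[i] + 1.0f);
}

int main() {
    const int n = 4;
    float *x, *p, *f;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&p, n * sizeof(float));
    cudaMallocManaged(&f, n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = 0.5f * i;

    precise_vs_fast<<<1, n>>>(x, p, f, n);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; i++)
        printf("precise %.7f   fast %.7f\n", p[i], f[i]);
    cudaFree(x);
    cudaFree(p);
    cudaFree(f);
    return 0;
}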
I'm a simple man. See there's a new LLL video, click into it and like it straight up
We should be doing something "like" using GPUs where we now use CPUs. Not exactly - GPU cores are quite simple in terms of the suite of operations they can perform. But, we screwed the pooch decades ago in CPU design. We decided that we would let legacy C software drive the development of new CPUs - we took as our goal to stay with a single core design but make that core run the old code faster through whatever trickery we could come up with. And while the results have been impressive, they've also turned our CPUs into complex monstrosities that no one can understand completely. And that's exactly how we wound up with things like Spectre / Meltdown and so on. And in the end we had to swallow the multi-core pill ANYWAY. We should have swallowed it decades sooner and moved toward more simple and easy to understand (and easy to secure) cores, and just re-written software as necessary to make use of them. These days there's more logic in a CPU core to do all this trickery than there is actual logic that does the rubber-meets-the-road computation. It was a mistake, and now it's too late to fix it.
Just as an example, take instruction re-ordering. That was just a STUPID thing to do in hardware. The programmer or the compiler can handle that part of the problem, but that wasn't good enough because they wanted the chip to run EXISTING CODE BINARIES faster. So we spend logic on that that could have been spent offering us more controllable computing. Come on, it's obvious: just get the instructions in the right order to START WITH, instead of forcing the logic to re-sequence them for you every time. Out of all of the things we "shouldn't have done," this is the one that galls me the most. Speculative execution is another example. Instead of spending logic on that, we could have spent it on "more cores."
I'm not even entirely sure about cache memory logic. Having a fast scratchpad on the processor is certainly a good idea, but I suspect that there was a better way to do it that involved the programmer / compiler being conscious of what was going to go into it. Then all that "control logic" could have been used for... you got it - more cores.
The general principle here is that we tried to put logic into the chips to make them "think for the programmer." Instead we should have demanded that our programmers LEARN TO THINK. And focused on putting as many simple, easy to understand cores as possible into the programmer's hands. Now, I have no problem with the compiler being smart enough to do some of these things for the programmer - that's still just a one-time cost, and then you reap the benefits forever. But having the hardware do it, at RUNTIME, EVERY TIME? That's just poor resource management.
The SIMT limitation in GPUs is no longer strictly true in more recent architectures; however, it is true that they still perform best when they are used in a SIMT manner.
Comparing CPU cores to GPU cores is like comparing a FedEx truck to a freight train.
The freight train/GPU has a lot of limitations that allow it to be WAY more efficient compared to a FedEx truck/CPU. Sometimes it's worth it, sometimes it isn't.
I would have gone with a semi truck (GPU) and a scooter (CPU): when you want to move a lot of packages simultaneously, all from the same initial point to the same destination point, nothing beats a semi truck, but if you want to deliver one package quickly with agility and the ability to change your mind, a scooter is the best.
The CPU is the manufacturing hub, and the GPU is the logistics chain.
Good content. YouTube recommendations working. Make more videos like these.
Would it be possible to design an (inefficient and clunky) operating system that runs on a GPU? Or make Doom run on it?
GPUs may have many ALUs, but the way they're organized matters. I.e. AD102 has 18432 ALUs. These are organized into 144 SMs, with each (simplified) being able to simultaneously run 4 different FP32 instructions, each on a group of 32 variables. The SM is what logically and in terms of functionality comes closest to a CPU core.
144 SMs is definitely in the same ballpark as the 128 CPU cores that are now common in servers. On such a GPU, however, you need 4 independent threads (with 32 work items ea.) per SM in order to reach full utilization. On a modern CPU that mark is somewhere around 1 up to 1.2 (on average) threads being required to fully utilize a core.
When not having enough different and independent work, a GPU may not be able to flex its muscles. Besides having to execute instructions on groups of 32 (some: 64) items, GPUs usually also suck at control flow (if-else, conditional branching and jumps), as well as special (transcendental) math operations, such as sin(), cos(), tan(), sqrt(), … This leads to not every workload actually benefiting from GPU execution.
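Putting rough numbers on that utilization point (values taken from the comment above, treated as approximate):

#include <cstdio>

int main() {
    // Rough utilization arithmetic for an AD102-class GPU.
    int sms             = 144;  // streaming multiprocessors
    int fp32_per_sm     = 128;  // FP32 ALUs ("CUDA cores") per SM
    int warp_width      = 32;   // threads per warp
    int warps_per_issue = 4;    // independent warps an SM wants in flight

    printf("total FP32 ALUs: %d\n", sms * fp32_per_sm);            // 18432
    printf("threads needed just to fill the machine: %d\n",
           sms * warps_per_issue * warp_width);                     // 18432
    return 0;
}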
"Bad for general purpose computing" that's a bit too harsh. What really makes the difference is whether your algorithm supports high data parallelism without needing separate control threads, or at least lends itself easily to massive SIMD
Well it makes sense to me because a CPU is doing a bunch of work while the GPU has ONE task: to generate video. The CPU has to do a lot: run your system, load programs, etc. Overall it would appear slower because, if you think about it, a CPU is giving out a bunch of data while the GPU is only doing the video task.
Now get into APUs that have to do both...
At the end of the day, the difference is that CPUs are optimized for single-threaded tasks or tasks with a mix, while GPUs are beasts of multithreading and only multithreading.
Great video! Just wanted to point out that the RTX 4090 can execute 83 Teraflops FP32, not 49.
My bad
@@LowLevelTV Also 330 teraflops of tensor operations
CPU vs GPU cores: for me, CPUs are extremely smart. One huge difference is the scheduler and preemption mechanism. If an instruction is waiting on IO (memory access), the CPU scheduler will make a decision to do something else and then come back to it. However, GPU threads/cores cannot. This is also because there aren't many instructions a GPU core can run (only a couple of types of arithmetic operations). A GPU core also cannot load memory itself; instead it just receives and outputs.
Another difference, on a higher level, is that the instruction sets for GPUs are not open, so you cannot directly program a GPU core. That's why you need to download the GPU drivers, which are the only way a program tells the GPU what to do. It is almost like the functional programming paradigm, where the programmer tells the computer what needs to be done and the computer decides how to do it; whereas in other programming paradigms (using languages like C/C++) you tell it exactly what and how an operation is done, down to the CPU instructions.
Versatile abacus vs highly parallelized calculator. BUT don't forget that GPU design lends itself perfectly to data processing. Best example: the MCP7A, also known as the GeForce 9300. A wonderful hybrid chip consisting of an nForce 730i and a GeForce 9300 with working drivers. And everything YEARS before CPUs with crappy iGPUs.
Out of that DNA, DPUs were born.
With all that being said, you can harness the power of your GPU for things other than graphics, if it happens to coincide with what the GPU is good at.
Mouse Simulator 2D: Written in javascript, with no hardware acceleration, all on a pentium M, and the latest version of firefox:
Runs ~30fps
So basically, if we were to count CPU cores like Nvidia counts CUDA """cores""" (and AMD counts stream units), then the i7-11800H would have more than 128 cores? (8 cores capable of AVX-512, each ALU capable of doing 16 FP32 operations at once in SIMD, where I imagine each lane would be equivalent to one CUDA "core". And each Tiger Lake-H core most likely has multiple ALUs for superscalarity.)
GPUs are great at processing consecutive chunks of data. CPUs are great at processing arbitrary bits of data in arbitrary order. Car vs semi. I would hate to do Uber with a semi. I'd hate to move many pallets of product with a car.
4:20 as far as I remember, it's not guaranteed that cores executing X will finish before Y starts. What's more, Y can be executed before X.
Thank you for explaining this.
No worries!
In short: think of your GPU as a really fast sports car that reaches 250mph but can carry only 100 units inside, and your CPU as a not-so-fast truck that reaches about 100mph but can carry 2000 units, and let's use 'sand' as the data we need to process.
The sports car is about 2.5x faster than the truck, so if your data is 100 units of sand or less, the GPU can just handle it better than the CPU.
But if you have 2000 units of sand, it isn't effective to use your GPU to transport that sand, because it will take more time than the CPU.
So, the CPU is for 'heavy' data processing, and the GPU is for small data that it can send faster than the CPU.
That's why in games we use the GPU to render textures, effects, shadows, and other small things, and the CPU to handle physics, AI, and some other big stuff.
I wonder how powerful and efficient CPUs would be today if ARM had been the prevailing architecture from the beginning
does shader execution reordering allow gpu cores to act more like cpu cores in certain cases?
Not quite. SER basically just allows the GPU to determine how much the threads in a warp are diverging from each other (ie executing different instructions to each other or operating on data from different areas of memory to each other), typically by asking the shader developer to provide it some data to base this off of so that the GPU can look at how different the data is between threads. This then lets the GPU attempt to optimise the case happening at 4:15 where the warp enters, say, an if/else statement and half the threads want to execute the X branch while the other threads want to execute the Y branch. By looking at the data that the shader developer provided, the GPU can estimate how badly the threads will diverge from each other and can reorganise them on-the-fly, basically ripping apart the entire warp and creating a new warp by combining non-divergent/divergence-minimised threads from different warps (this is my understanding of it, might be wrong, I'd really recommend reading the paper if you want to know more).
I was using AMD Athlon II X2 235e and GT730 for half of my life... I recently switched to i10300 and 1650ti... Damn i can sense the change
It's not weird at all. It's also because GPUs are technically in-order processors (GPUs capable of out-of-order execution now also exist - ARM Valhalla and Snapdragon Adreno are examples of out-of-order GPUs now in use), whereas CPUs are already superscalar out-of-order monsters. However, how you program either or both processors matters a whole lot too, because GPUs are meant to chew through vector math in parallel (i.e. you want as many ALUs busy as possible).
EDITED: Forgot to add, SIMD vector processors, especially some superscalar varieties (such as AMD GCN shaders), tend to only work on exactly the same item at a time in parallel, unless they explicitly allow dual-issuing of two different SIMD vector math instructions at a time. FPUs are weird in a way.
I also realized that GPU’s are kinda slower and laggy too, I think CPU are much better when it comes to better performance