Seems you made an error around 2:33... You completely confused FLOPS and FP32/FP16.
FP16 and FP32 stand for the floating-point precision, basically how many bits the value is stored in, not FLOPS, the theoretical max floating-point operations per second.
+Science Studio, you should put a message in the video with the correction.
Also consumer grade graphics cards do come with tensor cores now :P
This+1.
Worse yet, he made something up to cover for it.
He makes a lot of these types of mistakes in his videos.
I noticed that too; he doesn't know what single- and double-precision floating points are. Aren't graphics cards measured in giga- or teraflops? 16 floating-point operations per second would be awful.
4:38 Oh, that didn't age well. We have tensor cores in all GeForce RTX cards now, lol.
Don't count on it 😂😂
2:35. FP16 and FP32 refer to 16bit Floating Point number and 32bit Floating Point number. They are often called "half" and "single" respectively (there is also "double" which is 64bit, but is not really useful in AI). The reason for the inverted acronym is because nearly all programming languages typically require that a name start with a letter (but can contain numbers), so 32FP would be an error in most languages. It is not a measure of operations/FLOP because that would measure nothing interesting, it is a measure of precision. They determine the number of unique values that can be represented. FP32 is actually *slower,* because more transistors are involved in the calculation. You'd typically use FP16 because of the improved performance, only using FP32 if you needed the precision (which is extremely rare). This would be the exact opposite of what you'd expect if they stood for 16 FLOP and 32 FLOP, the latter would be more operations (ignoring that there is no time unit, again, a strange unit of measurement in this context). en.wikipedia.org/wiki/IEEE_754
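To illustrate the point above, here's a minimal sketch (assuming NumPy is installed) showing that the 16/32 is storage width, and therefore rounding error, not a rate of operations:

import numpy as np

x = 1.0 / 3.0                       # a value that cannot be stored exactly
half = np.float16(x)                # 16 bits of storage
single = np.float32(x)              # 32 bits of storage

print(np.finfo(np.float16).bits, half, abs(float(half) - x))      # larger rounding error
print(np.finfo(np.float32).bits, single, abs(float(single) - x))  # smaller rounding error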
"Should we expect to see tensor cores in consumer grade graphics cards? Dont count on it."
Who else is watching this after RTX reveal? xd
Lol...I guess no one saw this coming.
@Ashwin Mouton: No one saw this coming ? AI is the next big thing, and the iPhone has got a similar hardware “Neural Engine” since the iPhone X released at the end of 2017.
The RTX 2060 has 240 Tensor cores
And they found a potential application to boost gaming performance with them as well: DLSS 2.0.
@@dans.8198 another way to say "has got" is "has."
Google is really going too far. Usually when I start googling something new to learn about it, I get dozens of ads or suggestions thrown my way for the next couple of days. But this time, Google actually commissioned a guy on RUclips I've been watching to make a VIDEO explaining the concept I've been trying to understand. Wow, that's just freaky.
Google then followed up with a call to Hugo's cell phone leaving a message asking him why he stepped away from his computer after watching the video. He didn't call them back so they reached out to his parents followed by his best friend from elementary school.
Let's talk about cross-device tracking. They really go too far. If I watch bikinis (I am human, after all), those videos should not show up on my other device, where I focus on science and tech.
Well, RUclips is a monopolistic service. What else is there? Vimeo? Google is a monster.
I don’t think it’s because you searched, or that he was even commissioned in the first place to make the video, but yeah we never know xD
Greg: Tensor cores are not likely to be in consumer grade GPU's any time soon
NVidia: Hold my drink
So it's mostly a different way to solve problems, using better-suited resources for a specific kind of operation. I watched the video two times to see if I got it right. Great one, Greg.
Video liked, as always.
They need to start making relaxor cores. They might chill out at Nvidia and drop some new GPUs.
that's a joke worthy of a laugh track
rtv190 are we talking about a laugh track that's used in moderation like in old TV shows? Or an overused laugh track in current TV shows?
A few corrections/clarifications: the matrices, as the text says, are 4x4x4, i.e. 3-dimensional (not 4x4, i.e. 2-dimensional).
And without knowing for sure, I will bet that FP16 and FP32 refer to 16-bit (two bytes) and 32-bit (4 bytes) precision rather than anything to do with speed.
Simon Als Nielsen you are correct
A matrix is 2D by definition. There is no such thing as a 3-dimensional matrix. Tensors can be 3D. But tensor cores are dealing with 2D matrices, and should really be called matrix cores, not tensor cores; the latter just sounds better for marketing. The FP part is indeed a mistake.
Tbh I don't really understand how they can say that the "processing array" is 4x4x4. Multiplying two 4x4 matrices produces a 4x4 matrix, and adding two 4x4 matrices produces a 4x4 matrix. Where does the extra dimension come in?
Actually, when I think about it, each node in the 4x4x4 space probably represents one multiplication result. So for each row*column you need 4 multiply-accumulates to complete the calculation. Still, we are only dealing with matrices (2D arrays); it's just that you need 4x the matrix dimensions in multiply-accumulate operations. Imo visualizing the multiplication in 3D is just confusing, as it's all happening in 2D. At least it confused me.
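To make that concrete, here's a small sketch (NumPy assumed; the 4x4 shapes mirror the Volta tensor core's D = A*B + C operation) that counts the multiply-accumulates; the inputs and output stay 2-D, and the third "4" is just the inner accumulation:

import numpy as np

A, B, C = (np.random.rand(4, 4).astype(np.float32) for _ in range(3))

D = np.zeros((4, 4), dtype=np.float32)
macs = 0
for i in range(4):              # output row
    for j in range(4):          # output column
        acc = C[i, j]
        for k in range(4):      # the extra dimension: 4 MACs per output element
            acc += A[i, k] * B[k, j]
            macs += 1
        D[i, j] = acc

print(macs)                      # 64 = 4*4*4 multiply-accumulates
print(np.allclose(D, A @ B + C)) # True -- same 2-D result either way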
I really appreciate the way you’re able to reduce these ridiculously complex descriptions into moderately complex examples. I still have no idea what a Tensor Core is/ does because I was just staring at the RGB in the PC behind you...
Not gonna lie, this video was looking spot on as hell; the camera is just making everything look so clean and crisp.
Thanks mah dude.
Science Studio you're welcome my man
Really like the techie videos. Found this one exceptionally interesting. Would love to see a video on why GPUs would not make good CPUs and vice versa, and the differences between them on an architectural level.
You sir are a worthy adversary.
I absolutely love watching your videos thank you so much for explaining this term
FLOP is derived from FP, not the other way round.
FP16 is a 16 bit number, where the decimal point can be at any position of the number - hence floating point.
FLOP/s is just the measurement of how many operations with this kind of number can be done per second - but the single-precision FLOP count refers to FP32 numbers - 32-bit floating point numbers.
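For what it's worth, a tiny sketch of that distinction (NumPy assumed; 2*n^3 is the usual naive operation count for a dense matrix multiply):

import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b                              # the floating-point work
elapsed = time.perf_counter() - start

flop = 2 * n**3                        # operation count: no time unit by itself
print(f"{flop / elapsed / 1e9:.1f} GFLOP/s on FP32 data")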
The only YouTuber who takes his time to actually explain the engineering behind this stuff. Thanks Greg :)
Tensor cores sound like this saying from Bruce Lee. "I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times."
Hearing dot product got me triggered about cross products.
FP is not short for FLOPS, it's short for Floating Point. FP16 means Half Precision floating point number. FP32 means Single Precision floating point number. FP64 means Double Precision floating point number. The number 16, 32, and 64 refers to the number of bits that it takes in the memory to store floating point numbers with respect to their precision !!!
I love to learn... more Q&A if you get time. I listen to it while I'm at work; it keeps me motivated. Sounds crazy, but it really does.
Thanks for the support, Paul.
Science Studio you're one of very few tech tubers who keep me motivated to learn too
I love the minute science videos. It's one of the main aspects of differentiation you have compared to other tech tubers. I love the reviews and builds as much as the next guy but learning about what goes into the hardware or software is really interesting and not many do it.
Your video on nanometers in regard to CPUs was really informative.
I use my 1080 ti for tensorflow ML on a daily basis so I was excited to see you had a video related to ML. Great video !
-UL alum
You must be reading my mind because I've been looking for videos on tensor cores lately
Timeline 2:38: FP16 and 32 are not FLOP-related. They refer to the floating-point bit representation: half precision = 16 bits and single precision = 32 bits.
Matrix math was super easy for me, and I have number dyslexia, oddly. I had no idea a Tensor Core was pretty much the same as Google's 8-bit matrix single-instruction ASIC processor. Good video. Short and informative.
Hey thanks for the video! I work in ML so its kind of fun to see videos like these pop up on my feed.
Manatee Licking?!?
Excellent video! I watched the Google I/O on Tensor Cores and didn't quite get it (or at least recall it) beyond that it's good for machine learning. This video made it super easy to understand! Excellent job!!!
I'm an electrical engineer, and tensor is a term we never use; that term is for civil and mechanical engineering. But it seems similar to a state-space representation matrix used in dynamical systems and control.
Man I miss your channel so much!
Thanks for Teaching us..!
Good Job Sir
This is why each AI matrix model is 8 x 8 x 8 (512 CUDA cores) and why Nvidia includes accuracy for 32-bit, 16, and 8 (the simplest and fastest, especially for high-speed language models, vision, or audio).
In which case, it can do multiple models and correct for itself.
This is also why the Hopper architecture has over 16,000 GPU cores and 72 CPU cores.
You have 132 matrix instances, with 2 CPU cores per matrix model, and 6 left over for overhead.
Besides AI, this is also an excellent data center CPU and GPU combined, along with the memory.
It greatly simplifies data center deployment.
Great explanation, love these informative videos man 👍
Thanks!
This makes me want a titan V.
Not because I'd utilise its full potential with my dumb cuda programs, just because it's cool af
4:39 Well... the RTX 2000 and RTX 3000 series have them. Used for DLSS and other AI.
Great video, Greg!
5 minutes video translated into one sentence: they are fixed function units
Thanks for making this video; getting a bit fed up with all the rumor channels that keep banging on about tensor cores being in next-gen gaming cards.
One correction, at 2:35 you refer to the "FP16 or FP32" as being an abbreviation for FLOPS - this is incorrect. In the document you are referencing, the FP in FP16 and FP32 stands for "Floating Point" and the 16 and 32 correspond to bits of precision.
More vids like this please
Ooh just got on my break!!
There will be tensor cores in gaming GPUs, I can bet on that. Remember that real-time ray tracing demo? Nvidia developed these technologies to take advantage of tensor cores in games. CUDA cores alone just can't handle ray tracing. Why would they invest their money in something no one can use?
Because the world is more than gaming; for example, AI, as I mentioned.
You were right on the money! 👍🏿
Yep. There are now.
good explanation
Tell us about CUDA cores and how they differ from Tensor Cores. As far as I know, CUDA cores are also used for parallel processing in ML/DL work.
Thanks! Wanted to know this :P
Thank you, great video!
excellent video :D
Love these videos please do moarrrrrrrrrr
Yessir!
Great video... I guess. Let's just say I remember why I dropped out of engineering school. I'm sure this made sense to someone. Wooooosh
Thanks for this video and explaining things dude now i can brag about these things with my friends 😂😂😂
So much for that "prediction"... Tensor cores are in consumer cards. 20xx-RTX, which was oddly out before this video was made. Though, honestly, Titan-V was a consumer card too. The non-consumer cards being Quadro cards. (Or whatever the Quadro-variant is actually named, to identify it as the VOLTA/TENSOR version.)
The processing power is never enough (because a multi-million-dollar market keeps pushing it). But for how long? We may never know... And BTW, the picture reminds me of the movie "Terminator 2"!
@sciencestudio I was hoping you were gonna mention real-time ray tracing in games, which is supposedly being handled by the tensor cores; that's the reason I, and I'm sure many others, were interested in them from a gaming aspect. It was shown in a video called "The State of Unreal" during the developer conference.
INSAINT the new Metro game has ray-traced lighting in it, maybe shadows too, not sure, but that will be the first game to use RTX. Not on consoles though, PC only.
Very well. And now over to the Gamer's Nexus to learn about memory subtimings.
You could mention that neural networks make extensive use of matrices
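Indeed; as a rough sketch (NumPy assumed, shapes arbitrary), a fully connected layer is just a matrix multiply plus a bias, which is exactly the kind of work tensor cores accelerate:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 784)).astype(np.float32)   # weights: 784 inputs -> 128 outputs
b = np.zeros(128, dtype=np.float32)                       # biases
x = rng.standard_normal(784).astype(np.float32)           # one input sample

y = np.maximum(W @ x + b, 0.0)   # matrix-vector multiply + bias, then ReLU
print(y.shape)                   # (128,)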
SVP transcoding with NVIDIA TensorRT and RIFE AI 60fps interpolation makes the footage look really good.
This didn't age well. Welcome to the world of Nvidia RTX. 😳
Thanks for simplifying all this for us "Dummies"!
The Engineering Explained of tech.
Nate wu Love the Senna in the profile pic, mate.
Nate wu Are they related?
Got curious... and looked this up... came across this video... and then I heard him say @4:37... Seemed silly since all we get now are Nvidia cards with tensor cores. Cue the SUPER series... the follow-up to the 20-series cards, all mainstream cards with tensor cores... hehe! Found this comment silly!
It's amazing how RTX is just becoming normal today, while only 2 years ago, when this video came out, it still seemed like something totally unreachable.
Hmm... when you pointed that finger multiple times, it reminded me of Supreme Leader Aladeen talking about enriched uranium... sorry, I dunno why... lol
K, will subscribe now... even though I don't understand much of it... a vid put together well... and you know your thing... looking forward to learning more from you... :)
FP16 and FP32 are not FLOPs. They are floating-point formats that tell you how precise your calculation will be. What are you talking about?
So the tensor core is software that combines CUDA cores to do the matrix math?
I believe 1180 will have Tensor cores, for the same reason CUDA cores were introduced before all games used them. Machine learning is starting to become quite mainstream, maybe not in games but who knows, that could also happen!
Aaaah yeah, the good old 2000 series; a lot has changed and it's amazing.
That’s a clean ass shirt bro 😍🙏🏽
So did we get tensor cores in consumer gpus?
But what about the new ray tracing that Nvidia unveiled? Isn't that meant to run on tensor cores?
+Science Studio so if an application that was previously written in cuda had functions in it that would be more efficiently run on tensor cores, would the driver automatically run those operations on the tensor cores, or would the developer need to rewrite their application to take advantage of them?
Say a certain application is written in CUDA 7 SDK (tensor cores 100% not supported, as there was no NVIDIA GPU utilizing them) and it uses tensors and related (mathematical) functions, even if you would upgrade the driver to the latest versions, it would not be able to take advantage of tensor cores running on such a tensor cores-equipped GPU.
That would be because the latest CUDA SDK (with tensor core support) might be backwards compatible with (programming) functions called in older SDKs, but newer (programming) functions which take advantage of both the driver's tensor core support and GPU's tensor cores are never called.
If the driver could automatically detect that certain (mathematical) functions can be accelerated using tensor cores instead of traditional CUDA cores, that would be a case where an older application could get an automatic performance boost on a tensor-core-equipped GPU, but I doubt it's the way NVIDIA did this, as it would make the driver too large and complex
(edit: humble opinion from a programmer, definitely not a CUDA engineer, but still handled similar things between different SDK versions)
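To echo the opt-in point with a present-day framework example rather than the CUDA SDK itself: a hedged sketch (assuming PyTorch and a tensor-core-capable CUDA GPU, which goes beyond what the thread above covers) of how the application has to choose the half-precision path itself; nothing routes an old FP32 code path onto tensor cores automatically:

import torch

# The application explicitly picks FP16 storage; on a tensor-core-capable GPU
# the library may then route this matmul onto tensor cores.
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

c = a @ b                      # eligible for tensor cores because of the dtype
print(c.dtype, c.shape)        # torch.float16 torch.Size([1024, 1024])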
So is there any software (not games) where I can use these tensor cores? Some AI that organizes my files and desktop, answers my emails, and more?
I hope Nvidia puts tensor cores in a GT 1130 or 2030 or another lower-segment but newer product, although it would not be as fast as the high-end ones.
Btw, if CUDA is so good, why are there no programs that can do simple HEVC encoding using CUDA? Mainly for older GPUs that don't have an NVENC that supports HEVC encoding?
Did the mathematics behind your petroleum engineering degree help you in understanding concepts like these within computer hardware technologies?
This is a very interesting video, I wonder if NVIDIA or AMD will put R&D into a new kind of core for gaming?
Great video! One thing though. If the next-gen cards (GTX 11 or 20, whatever they will call it) are supposed to support RTX/ray tracing, doesn't that tech use the tensor cores in order to function in real time (AI calculates most of the light rays and where they go, as I understood it, though I could be wrong)? So wouldn't that mean they'd probably have some tensor cores? I know we know little about the tech and the next-gen cards, but if I understood those RTX demos correctly, I believe these new cards would actually have to have tensor cores in order to do it. Your thoughts? Just wanna know if I am understanding RTX correctly or not.
RJ Santiago you can use CUDA for ray-traced effects; tensor cores are just better at it.
So, you mean the efficiency of software can impact the efficiency of hardware? What a unique and interesting idea. Also, I guess what you're saying is: read Nvidia's blog on what a tensor core actually is. Thanks.
3:21 : " Tensor cores can handle 64 floating-point mixed precision operations per second ".
How do you not realize that's clearly false? It would be incredibly slow.
five years later, yes tensor cores are available in consumer graphics cards, BEHOLD.... DLSS!!!
If only Greg knew Nvidia was going to drop a bomb (DLSS) with these tensor cores a year later; then he would have emphasized this more for sure XD
Thank you for the tensor core video. Could you do a benchmark video of TensorFlow on an RTX 2080, RTX 2080 Ti, or 2070?
"larry" :-)
What about ray tracing for that sweet real-time dynamic lighting?
laughs in rtx 4090
You pronounced "veil" like "veal" and created the word "matricy," presumably referring to a single *matrix among many matrices.
But for rendering and modeling, tensor cores should be better? Is it some kind of ASIC?
So what cores are likely to be used in the future for handling ray tracing in games?
Can the api/code libraries for CUDA be used for tensor cores?
So you can replace tensor cores with CUDA cores when needed, although it's not as efficient.
yep, there is a major difference in speed
www.nvidia.com/en-us/data-center/tensorcore/
You need to add a correction for your mistake at 2:33 +Science Studio
"...where FP16 and FP32 are used -- don't worry these just stem from the acronym FLOP, which stands for floating point operations per second..."
Sorry, but that is patently wrong/incorrect.
Floating point is actually a data structure type which is declared PRIOR to declaring the variable when you are programming.
Much like how you might declare a variable as an integer (and even then, it can be int4 and/or int8, for single and double precision integers), floating point (FP) is a type of number that includes decimals.
(In C, single-precision floating-point variables are declared as float, while double-precision floating points are declared as double. In FORTRAN (which is still used in a LOT of HPC and scientific computing), single precision is declared using real*4, and double precision using real*8 or double precision.)
Conversely, the acronym FLOP (floating point operation) refers to the number of floating-point operations.
FLOPS ("plural") is the correct acronym for FLoating point OPerations per Second, i.e. how many operations it is performing per unit of time (i.e. computational throughput), most notably, per second.
FP32, BTW, is a data type of floating point, 32-bits in width/length, or 4 bytes, which is single precision. FP16 is a data type of floating point, 16-bits in width/length, or 2 bytes, which is half precision.
Depending on implementation and compiler, along with OS and hardware architecture, a single-precision floating-point number carries roughly 7 significant decimal digits, double precision carries roughly 15-16 significant decimal digits, and half precision typically carries roughly 3-4 significant decimal digits.
Again, FP16/32 is NOT and DOES NOT come from the acronym (which you got wrong anyway) FLOP(S), but rather from the data type of your programming language of choice (which in this case is likely to be C/C++, because CUDA is in C; you can call CUDA from Fortran, but CUDA isn't natively in FORTRAN).
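A quick check of those widths and approximate digit counts (a sketch assuming NumPy; exact digit figures vary with the value being represented):

import numpy as np

for t in (np.float16, np.float32, np.float64):       # half, single, double
    info = np.finfo(t)
    print(t.__name__, info.bits, "bits =", info.bits // 8, "bytes,",
          "~", info.precision, "significant decimal digits")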
Nice video! You might find it interesting that about a fifth of the Top 500 supercomputers in the world use Nvidia based Tesla GPUs, as I guess you would call them "math co-processors" for the supercomputers. Two of the systems use the new Volta cards. What *I* find really interesting is that they have Tesla Voltas running on IBM POWER CPU based systems, not Intel or AMD. Except for 26 systems, all of today's top 500 supercomputers run on Intel Xeon CPUs. Of the non-Intel CPUS, 22 are based on IBM POWER CPUs, 6 on the Sparc64 CPUs, two AMD Opterons, and two weird ones, named ShenWei, made by China, and said to originally be based upon the DEC Alpha CPU.
I love digging around the Top 500 Supercomputer list's spreadsheet each time they are released...and the most recent one is notable for the total dominance of Linux as the operating system of Supercomputing. All of the Top 500 run Linux....No AIX, No Solaris, no HP-UX, and certainly no Windows!
Farrell McGovern interesting digging. Anything else which you found?
Well, there is another co-processor that is starting to be used: the Intel Xeon Phi. It is basically a version of the Xeon CPU that has math units designed for GPU use added around an x86 core, giving it superior specialized math capability while (in theory) being easier to program. Only 7 of the Top 500 list use Xeon Phi co-processors, while 97 use Nvidia Tesla, 2 use Nvidia Volta, and 2 use a combined Tesla/Xeon Phi setup. There is one other co-processor, a proprietary accelerator made by a Japanese company, PEZY Computing / Exascaler Inc., called the PEZY-SC2 and PEZY-SCnp, which I can find very little about in a language I can read.
Farrell McGovern gotta love those corrupt companies
What about Ray Tracing? From what I've heard it can only be done with tensor cores.
The newer tensor cores are different now and support 8-bit and 4-bit numbers.
So can a normal graphics card with a driver/firmware modification create tensor cores? Or are they physically different?
Physically different. The transistors that form their logic are optimized for certain types of calculations.
Nice shirt Greg. It suits you. Lovely Colour :)
We need games with great AI that will utilize tensor cores, so these cores may appear in consumer GTX video cards later.
I want to game, but I also want to train my models. My CNNs and RNNs will appreciate every tensor core, so, yep, bring more tensor cores to regular GPUs, not only to $2k+ GPUs.
What about ray tracing? Doesn't that need tensor cores?
Excuse me while I dash off to get my PhD in computer science. THEN I might be able to understand you!!
Battery is filling up
Doesn't stop me from wanting tensor cores to enhance AIs for shooters and sims, while the reduced workload on the CPU can provide more room to do other things.
Don't count on mainstream devs jumping on board any time soon.
The new Metro game is the first to use ray traced lighting, maybe they are using tensor cores as they are perfect for it.
Great technology ...
So all in all, CUDA cores can do what tensor cores do, just slower.
When I got stuck in tensor analysis, I would play hangman on the Christoffel symbols. Downvote if you feel my pain.
A metric ton of derivatives...pun intended..😂