Please keep having Casey on even if its more "eat your vegetables" than "JavaScript junk food" content. I learn so much every time I listen to this guy talk
+
The right amount of brussel sprouts bowls to burgers is 5 to 1
@@monsieuralexandergulbu3678 So 5 bowls of brussel sprouts for every 1 burger, got it.
it's*
I agree
simdeeznuts
the CrowdStrike joke was lit
real
It was shit
@@saltstillwaters7506 crowdstrike shareholder spotted :p
You and Casey have such good chemistry, please consider turning these videos to a podcast series!
Someone doesn't know about the Jeff and Casey show.
@@braincruser This show was so good, Casey on his unleashed mode.
@@braincruser Yeah, this will also end up with the hosts to the punches.
Please consider sewerslide.
A 400-part series called "Handmade BFFs"
Execution on the crowdstrike joke was really on point
SIMDeez nuts
give em the ol swizzle
@@DavidM_603 Oh my god. 😂
@@DavidM_603 the ol shuffle
SIMDeez nuts
Maximizing the throughput of deeznuts
Casey coming in strong with "I don't even know what all this tech slop is, what tf is a fireship and a lavarel?" That's my boy right there! 😂
Another Casey video count me in! I don't care how many hours that guy talks I'm always learning so much from him.
Simdeeznutz, time to learn bud.
Loved the explanation of how L1 cache works.
Prime's perspective as someone who isn't knowledgeable about this topic helped me better understand Casey's explanation.
Would totally watch a regular show or podcast where Casey explains to Prime how things work down at the hardware level.
It was anything but boring!
Thanks to you two for doing this one
Love how Casey is 2x the size of Prime, just towering over him as a disembodied head. 😆
Love Casey! He actually knows what he's talking about. Great resource
Please invite Casey again! And give him a whiteboard!
i know lengthy, deep (that's what she said) explanations might be boring for a lot of people and not great to do on stream, but I want to say I really enjoy those. it takes the edge off of all the abstraction we're submerged in every day and it actually feels like computer science. I can't apply anything of what Casey said, but I loved every second. I wouldn't like for Prime or Casey to feel weird about these, since they might hurt the stream's numbers a bit.
I just wanted to say that I really appreciate those in depth explanations
Very good video! Keep having Casey on stream, it's really interesting and entertaining.
I've been listening to Casey since his Handmade Hero series. It was such a formative experience and glad to see him on the channel. Thank you
26:42 Hahaha the chat message “BEAM = berry easy artificial machine” was very under-appreciated
True
Casey is great. Dude is so chill
this was great Prime, I know this type of content is not the best for viewership... but it's deeply appreciated by some of us who want to learn from people like Casey. He is a national treasure.
2 hour long Casey discussion. Sick.
Always great to see Casey on the show -- love these interviews!
I care. It's important stuff. Casey Muratori is a fantastic brain that is so enthusiastic. Love that guy.
I really enjoy in-depth talks like these with Casey. Please keep it going, its really incredible.
Learning how SIMDeez nuts code is generated from regular C/C++ code by the compiler would be great. Nobody wants to rewrite everything in SIMD by hand, but having the compiler do it for them, with maybe some minor tweaks and mental-model shifts, would be great
I wanna say mojo is working on something like this if I recall
The compiler is quite limited in what it can vectorize, no? You need to write your program in a vector friendly manner to even hope the compiler will auto vectorize it.
@@TapetBart It depends on the semantics of the language and tons of other things, but yes, compilers can't do everything. Especially compilers that weren't designed for vectorizing from scratch
Efficient SIMD code is more about data layout and memory access patterns than particular instructions. The compiler typically can't do anything about your data layout so there are serious limits to what auto vectorization can achieve.
@@Bestmann3n It goes both ways: without SIMD instructions you can't really take full advantage of a good memory layout, and without a good memory layout you can't get the best out of SIMD.
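As a concrete illustration of the "vector-friendly manner" this thread is talking about (a sketch, not code from the video): independent iterations, unit-stride access, and a promise of no aliasing are what give an auto-vectorizer a chance.

```c
#include <stddef.h>

/* A loop shaped the way auto-vectorizers like: every iteration is
   independent, memory is walked with unit stride, and `restrict`
   promises the arrays don't alias. Compilers such as GCC/Clang at
   -O2/-O3 can typically turn this into SIMD adds; a loop-carried
   dependency or possible aliasing would usually block that. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   /* each iteration independent -> vectorizable */
}
```

Compiling with `-O3 -fopt-info-vec` (GCC) or `-Rpass=loop-vectorize` (Clang) will report whether the loop was actually vectorized.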
I just want to say this was fascinating. we need more of this.
34:48 A handy conversion to remember is that light travels ~1 foot in 1 nanosecond (in a vacuum). Electricity in silicon is about 20% of that
Exactly, wild! So to get to L1 cache (and back) in 3 cycles at 5GHz, the cache could be at most 3/10 foot = 3.6 inches away from the core. And that's the best case, assuming the signal travels at the speed of light; at ~20% of that for electricity in silicon, it's closer to 0.7 inches. The cache has to actually do something, too. In practice, L1 cache is separate for instructions and data, and is physically located right next to the associated pieces of the CPU core.
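The arithmetic above can be checked in a few lines (the ~1 ft/ns and 20%-of-c figures are the thread's approximations, not datasheet values):

```c
/* Back-of-envelope bound on L1 distance from the comment above:
   a round trip of `cycles` at `ghz` GHz limits the one-way distance,
   using ~1 foot per nanosecond for light in a vacuum, scaled by
   `frac_of_c` for the actual signal speed. */
double l1_max_distance_inches(double cycles, double ghz, double frac_of_c)
{
    double round_trip_ns = cycles / ghz;         /* 3 / 5 = 0.6 ns      */
    double one_way_ft    = (round_trip_ns / 2.0) /* half each direction */
                           * 1.0 * frac_of_c;    /* ~1 ft per ns at c   */
    return one_way_ft * 12.0;                    /* feet -> inches      */
}
```

At the speed of light this gives the comment's 3.6 inches; at 20% of c it drops to about 0.72 inches.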
Please make this a monthly or biweekly podcast. Love y'all's interactions, you really bring out the best in each other.
Love Casey and these deep dives. He's incredibly interesting to listen to!
You should post this as a podcast, so I can listen to it while walking my dogs in the forest.
YES
please do
I would LOVE to listen to this while walking my dogs in the forest.
IF I HAD ANY
This is actually one of the primary reasons why I bought YouTube Premium. No ads and offline background videos. Most of the YouTube content that I consume is basically podcasts. I primarily listen to videos, and having them downloaded is great when you need to drive/be somewhere without good internet access
what prevents you from... I don't know... playing the YouTube video and listening to it just like a podcast?
@@OBEYTHEPYRAMID I’m assuming there’s no internet service on his dog walk in the forest
For people who are still confused about the L1/TLB address checking explanation: this is just a cache invalidation scheme. Instead of sending the virtual address to the TLB, then finding the physical address in the L1 cache, the TLB and L1 are accessed concurrently, they both produce a physical address, and the cache is valid iff the TLB and L1 cache contain the same physical address. It's important to do this for speed because the TLB is large and slow, because it needs to be in order to support 4KB virtual memory page granularity. The only way to convert from virtual to physical address in the L1 cache is to compare the offset within the page (because the address outside the page is what is modified by the TLB) but not within the cache line (because every byte in the cache line is accessed at the same time). This presents a hard limit on how the L1 cache can be structured without breaking or rewriting operating systems; it can only have as many buckets as can be referred to by this mapping.
Bringing it all together: cache lines are 64B == 2^6B, so 6 bits of the address refer to the cache line offset; pages are 4096B == 2^12B, so 12 bits refer to the page offset; 12 - 6 == 6, so 6 bits refer to the offset within a page but not within a cache line; 2^6 == 64, so there can be at most 64 sets (buckets) for the cache; the old L1 cache stored 8 items per bucket, so its total size is 64*8*64B == 32,768B == 32KiB; the new L1 cache stored 12 items per bucket, so its total size is 64*12*64 == 49,152B == 48KiB.
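The set/way arithmetic above reduces to a few lines (a sketch of the same numbers, not vendor documentation):

```c
/* The L1 sizing constraint from the comment above: with 64B lines
   (6 offset bits) and 4KB pages (12 page-offset bits), a virtually
   indexed L1 can have at most 2^(12-6) = 64 sets, so capacity can
   only grow by adding ways. */
unsigned l1_size_bytes(unsigned ways)
{
    unsigned line_bytes = 64;                 /* 2^6           */
    unsigned page_bytes = 4096;               /* 2^12          */
    unsigned sets = page_bytes / line_bytes;  /* 2^(12-6) = 64 */
    return sets * ways * line_bytes;
}
```

With 8 ways this gives the old 32 KiB L1; with 12 ways, the new 48 KiB one — exactly the Intel vs. newer-core sizes the video discusses.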
This is great, I just started the Coursera course "Nand to Tetris" so I can actually understand how a computer works. Then boom, the same week this gem shows up
That's a great course and very fun too
But can you do NaN to Tetris?
@@jwr6796 I can't even spell NaN....
His website is gold for this stuff. Almost too much information but its all good
@@jwr6796 If I remember right, all float ops involving NaN spit out NaN, so I don't think it would work... Now if you could build a logic table where you can get more than one result... (well, there are 2 types of NaNs, signaling vs non-signaling, and there are probably some bits left...)
Amazing video. As someone that is studying compilers for hopefully a career switch one day, I would really love to watch that SIMD talk.
To make this comment more productive, I would like to add that the majority of cache misses in VMs like the JVM happen because of the additional tag field added to each object, not because objects are scattered all over memory.
The JVM, for example, uses a class of GC algorithms known as mark-compact collectors. During the compact phase, the GC will place all the objects that reference each other as close together as possible. This is something that even a C++ programmer has to actively think about and doesn't get for "free".
Before the collection happens, objects are also allocated in something called a TLAB, a Thread Local Allocation Buffer. These buffers are large memory spaces exclusive to one thread, so objects allocated by that thread can always be placed next to each other without any interference from the outside world.
If anyone is more interested in this stuff, I suggest a lesser-known book:
The Garbage Collection Handbook: The Art of Automatic Memory Management
This book is basically the CLRS of memory management algorithms.
I'd be interested to know how the tag hurts cache performance. Is this just because that extra memory dilutes the cache, or is there some level of indirection going on?
Wow that is really interesting, I didn't know that. Gonna check up on that book
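The TLAB bump allocation described in this thread can be sketched roughly like this (illustrative C, not the JVM's actual implementation; all names here are made up):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of TLAB-style allocation: each thread owns a private buffer,
   and allocating is just bumping a pointer, so consecutive allocations
   from one thread land next to each other in memory. */
typedef struct {
    uint8_t *base, *top, *end;
} Tlab;

void tlab_init(Tlab *t, uint8_t *buf, size_t size)
{
    t->base = t->top = buf;
    t->end = buf + size;
}

void *tlab_alloc(Tlab *t, size_t bytes)
{
    bytes = (bytes + 7) & ~(size_t)7;  /* round up to 8-byte alignment */
    if (t->top + bytes > t->end)
        return NULL;                   /* a real VM would refill the TLAB or GC */
    void *p = t->top;
    t->top += bytes;                   /* the "bump" -- no locks, no free list */
    return p;
}
```

Because the buffer is thread-exclusive, no synchronization is needed on the hot path, which is what makes this scheme so cheap.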
Casey is by far my favorite guest! I learn a ton every time he’s speaking. Also he’s great at simplifying and explaining things!
This may actually be my favourite discussion so far, I thought I already understood a lot of this but I was missing some key concepts.
Really good show, watched the whole thing and would love to see another one. Great vibes, learned a lot, what more can you ask for. Keep it up, guys!
Best content in my feed for weeks. You're both great!
SIMD nuts all the way. The opmask registers that came with AVX-512 are the true GOAT of that extension. New opmask instructions were added for operating on all the vector sizes: 128, 256, and 512-bit.
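For readers who haven't met opmasks: their merge-masking behavior can be emulated in scalar C like this (an illustrative sketch, not AVX-512 intrinsics):

```c
/* What an AVX-512 opmask does, one lane per mask bit: enabled lanes
   get the operation's result, masked-off lanes keep their old value
   ("merge masking"). The real hardware does all lanes in parallel. */
void masked_add(float *dst, const float *a, const float *b,
                unsigned mask, int lanes)
{
    for (int i = 0; i < lanes; i++) {
        if (mask & (1u << i))
            dst[i] = a[i] + b[i];   /* lane enabled by mask bit i */
        /* else: dst[i] unchanged (merge masking) */
    }
}
```

This is what makes branchy loops vectorizable: the `if` becomes a mask register instead of a branch.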
Man, we need more Casey on the channel. Love hearing his expertise. He is a great teacher. I found the CPU deep dive chat very fascinating. Would love to hear more things like it.
Casey is amazing. Please bring him back!
I think we should go even deeper with Casey in the future.
When I started programming I watched around 100 episodes of Handmade Hero.
I think a lot of people don't have that context.
I know enough of the basics of virtual memory and cache associativity to follow this, but I think a lot of people, even experienced ones, don't have this context
That point Casey made about it feeling positive was spot on. I always feel really excited about my job after listening to these. The point of building things for the joy of building things really hit home as well. I've been struggling to figure out why I don't enjoy programming any longer, and it is literally because of "get it out now!".
This video is so good, I am listening to it twice! Casey is such a good communicator. He could have just told you, "Intel can't increase the cache because 4096 is a small number," but instead he took us through a constructive and instructive journey of the entire system so we could reach that conclusion with him. Before he mentioned the memory size limit, I had already intuitively gotten there from the scaffolding he had built in my mind. Brav-f'n-O! This is my Brav-f'n-O face.
- How do you know Casey is dropping tech bars?
- His mouth is open
btw, simdeeznuts
Thank you Prime! Casey is awesome! This is just such an interesting subject; now searching for the HW engineer's perspective, as Casey mentions @1:05 :D
Him not knowing fireship is funny 🤣
he even said "idk what fireship is" instead of "who" lol
Not bored at all dude. Stayed till the end.... 👍
SIMDeezNUTZ
I love this kind of stuff. Now I can watch the same video all week! (which is exactly what I'm going to have to do if I want any chance at understanding what these guys are talking about)
Edit: I might be exposing myself as a noob but if you hear all this doesn't it make you respect the devices we use everyday that much more.
SIMDeez NUTs
Not so many popular channels go this deep, explained so well. Prime content right here.
These things just go above my head, there is so much more to learn
Do you know what Beam is? If so, please enlighten me.
Every time I see a video that has Casey in it makes me smile.
Bro I love Casey sooo much!! Please bring him on more
Oh definitely keep these coming, these are a goldmine.
SIMDeezNuts
Please do more with this guy, it reminds me of the time when we had to know our hardware well if we wanted to write code for it
This was really interesting, bring Casey on more
Casey seems very knowledgeable, love to hear his thoughts
49:13 missed opportunity to make cache hit joke right there
simdeeznuts!
I was here for a little of this conversation and it was great.
CASEY IS ON THE CASE!!!!
more Casey please! Amazing knowledge and content.
the crowdstrike joke.... beautiful!
Props to Mr eagen for following through with esoteric questions that were apparently "spot on". That's not an easy feat to follow Casey's beautiful in depth explanation.
1:04:00 or so was a lightbulb for me and I suddenly understood it once he tied it to a cache miss. I can't believe an hour already passed watching this, it just flew right by
You guys should do a semi-regular segment, call it "Prime Lesson Time w/ Uncle Casey"
I want to call him uncle because of a friend of my dad's who I called uncle, who was like Casey: very smart tech-wise but with that strong Dad energy and the ability to explain things as simply as possible.
Alt: "Prime Lesson Time w/ Mr Muratori" if you want to be fancy.
The web industry is getting laughed at - but we deserve it.
Amazing video, i heard people say AMD made improvements but i didn't understand terms. Finally someone is talking about what the improvements mean, thank you
SIMDEEZNUTS
Casey is awesome, his course opened my mind to new things after 24 years of professional (yeah right...) programming.
I could listen to these deep-dives for ages
Simdeeznuts for more Casey interviews
Damn you and your working of the algorithm. Also, SIMDeezNuts.
wow this is taking me back to the days of cpu designs :) physical & virtual addressing. page aligns, cache flushing. oh the memories.
This is such great entertainment. I already knew most of this but
1) I feel so smart
2) This is not efficiently put together but entertainingly put together
I have nothing but love for this. SIMD would be great, and I would probably orgasm if you'd discuss long-word instruction pipelining, so don't do that.
Simply put: This was awesome 🎉
Incredible stuff here, will cross-reference in 5 years when I finally understand everything Casey said 😅 that translation buffer was kind of a crazy concept.
In the discussion of modulo vs masking for hashing: masking by a power of 2 minus 1 is modulo that power of 2, for unsigned integers anyway. i & 255 == i % 256 where i is an unsigned integer.
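The identity is easy to check (a minimal sketch; `bucket_index` is an illustrative name):

```c
/* For unsigned i and a power of two m: i & (m - 1) == i % m.
   Hash tables exploit this to replace a divide with a single AND
   when the bucket count is a power of two. */
unsigned bucket_index(unsigned hash, unsigned pow2_buckets)
{
    return hash & (pow2_buckets - 1u);  /* same result as hash % pow2_buckets */
}
```

This is exactly why hash-table capacities are so often powers of two: the modulo on the hot lookup path becomes a one-cycle mask.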
It's like damn Casey why do you apologize for explaining L1 caches ? This is the most interesting thing I have heard this month it was just super great !
Bring Casey more. He is such a delight
Casey just single handedly elaborated the best JavaScript defense argument EVER
As a normie with no programming/coding anything, I actually understand this. Cheat Engine vaguely works based off of "bits that don't change" and "bits changing less often" gaming experience ftw
SIMDEEZ NUTS
Well today I truly feel like a nerd. Sadly I understand exactly what Casey is explaining.
Lots of interesting thoughts on the vertical potential of LLMs. IMO they are and continue to be used as blunt instruments: the techniques are brand new, and we're still learning incredible amounts about how to use and combine the components. I think regardless of the hypothetical vertical potential in the future, there are going to be huge amounts of lateral expansion as every industry and niche finds its own special use cases and refined designs.
I loved this, the whole L1 cache thing was super interesting
Interesting interview. Great deep dive about 8 ways etc. Didn't know that. At all.
simdeez nuts
Casey is so knowledgeable on this stuff but -- and I don't mean this in a bad way -- speaks in such a dense fashion that I had to rewind at several points to re-listen to what he said, just to follow what he was saying in that first hour. I think it'd be virtually impossible to follow what he's saying live, since I had to go through what he said at my own pace. It's all good, but it's akin to a scientific journal that has to be read over and over again to grasp what is being said, instead of being focused on giving a wide perspective or a 'top-down' view of the situation. I think his brain just works like that. He's built to walk you through something, not to summarize what something is.
He's better when giving a prepared speech - and it helps greatly if he knows his audience well
@@Muskar2Indeed. :)
This giant head is disturbing
yeah i agree
his brain is too big to fit in normal screen size
true but its funny too :D
The ego must fit somewhere
the giant talking head of wisdom and knowledge
😂😂😂 all I needed was the crowd strike joke.
Casey is basically explaining the Hennessy & Patterson book. Although he's good at doing so :)
It's very easy: A and B are isomorphic if you can define a bijection between A and B.
Not sure if you were trying to be funny, but I think 'bijection' is a _less_ known word than isomorphic
@@Muskar2 What do you mean less known, it’s just injection and surjection happening at the same time.
I like your funny words, magic man.
Not sure if you were trying to be funny, but I think 'injection' and ‘surjection’ are less known words than bijection
This was a great conversation!
Learned a lot by listening in 😊
Where can I find more stuff like this, on this level!?
As a last-year EEE student, this guy helps me understand many things
For anyone wanting to understand exactly what Casey is referring to when he talks about associative caching:
[Virtual Memory: 13 TLBs and Caches] ruclips.net/video/3sX5obQCHNA/видео.html
So many of these absolute gems of channels buried all over the place, thank you for sharing
Great vid, thanks Casey and Prime👍🏻
I have to rewatch the "32 KiB 8-way -> 48 KiB 12-way" explanation again; I need to take notes and draw some diagrams to understand this.
CPUs are so fascinating dude!
One of the best episodes I’ve watched.
This looks like a podcast I'll watch/listen to once a week.