That's awesome! That's right, caches typically reside on the CPU. They're also built with a different kind of memory (SRAM), which is faster. The northbridge is the interface between the CPU & main memory and resides on the motherboard. That's my understanding anyway.
@@simondev758 I hope you grow so much that you'll be able to afford one of those Macs without feeling the price. I've been watching your videos for quite some time now and you deserve it.
"let's head to apple store. I don't actually own one, because they are super expensive and I'm super cheap, but I like to look at them" 😂🤣😆 Man this was soooo relatable on so many levels 😂
Youness covered it well, but I left it open as sort of an "exercise for the viewer". Kinda wanted the video to be in depth enough that you could answer it yourself.
@@simondev758 Leaving as an exercise to the viewer is fine, but as a suggestion for the format it would be good to circle back to the question at the end of the video and call it out. The way it's structured now it seems like the question is just forgotten about.
One is accessing the array in row-major order, and the other is accessing it in column-major order. The top one is faster because the memory is being accessed in a contiguous way, which means you'll get a lot of cache hits (very fast, going to main memory much less). The bottom one is NOT accessing memory in a contiguous way, meaning every access is a cache MISS. en.wikipedia.org/wiki/Row-_and_column-major_order
@@bonehelm I still don't get it, both for loops iterate on x and then y, so both should have the same speed. The only difference here is that in the first one you do the multiplication before the addition... and in the second one it's the other way around... and it doesn't seem to be a contiguity problem here...
For all those of you who are wondering why he accessed a 1d array as if it were 2d, here is a real world example. Assume that in C++ you write a class that represents a matrix. You can go with a 1d array or something like vector< vector< int > >, which you can think of as a 2d array. So why the heck use a 1d array then? There are many pros to this approach. One of them is much faster construction and destruction. Assume that you need to allocate memory for an NxM matrix m. With a 1d array it is one call of malloc (which asks the system to allocate a chunk of memory for you). With a 2d array you would call malloc N times, once for each row. The same logic applies when you want to free the memory. So now, once we chose to represent our matrix as a 1d array, assume that the user of our class wants to sum all of its elements. The user has no idea how we chose to represent the matrix under the hood, he only knows that he can access an element of the matrix by using operator() as m(x, y). Your operator() in turn will have to calculate the position of an element by evaluating y * M + x, and in essence this will be exactly the same code as he showed in the video.
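The comment above is C++-flavored, but the flattening trick looks the same in the video's JavaScript; here's a minimal sketch (the class and method names are mine, not from the video):

```javascript
// A matrix stored as one flat contiguous array (names are illustrative).
class Matrix {
  constructor(rows, cols) {
    this.rows = rows;
    this.cols = cols;
    this.data = new Float64Array(rows * cols);  // one allocation, not rows-many
  }
  get(r, c)    { return this.data[r * this.cols + c]; }
  set(r, c, v) { this.data[r * this.cols + c] = v; }
  sum() {  // walks the flat array sequentially, which is the cache-friendly order
    let s = 0;
    for (let i = 0; i < this.data.length; i++) s += this.data[i];
    return s;
  }
}
```

Callers never see the flattening; get/set compute the r * cols + c index internally, exactly as described above.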
Awesome! Looking forward to this series :) It's really not easy teaching about memory and data structures without the audience falling asleep, but you've done a great job! Unfortunately JS arrays are messy and not exactly contiguous... :(
Hah yeah, I actually had a small section on JS arrays in there just to clarify about them, but cut it due to time (and laziness). You're totally right, they're non-contiguous "array-like" objects. The TypedArray is closer to what you'd get in C++ or something.
I have a final tomorrow over this stuff and this video just happened to pop into my recommended. Probably cause I was googling virtual memory, caching, etc. Super informational video, and I was able to study and be entertained at the same time. I’m a fan 👍🏼
I'm glad I'm not the only one who noticed he didn't answer the original question. I already knew most of what he discussed (how memory and the caches work) but he didn't explain how that applied to the question he posed. The reasonable answer (as I suspected before I watched the video) is that one piece of code accesses the data by rows and the other by columns. The cache would load the data by rows which would make the code that accesses the data by rows faster since there would be a lot of cache misses for the code that accessed the data by columns.
Yes, it's because in the first one, for the whole time the inner loop executes, the starting address remains the same (x*4000), and then we just move contiguously with the +y, which equates to +1. In the second for-loop, the inner loop's starting point is also fixed, but the access isn't contiguous; the y*4000 term means we move in steps of 4000 every single iteration, which is a huge step to get to the next element.
I didn't remember that I'd watched this video before until I saw the L1/2/3 cache example. The drawing makes the video so interesting. Now I'm back for the new videos after finishing my algorithm final exam!
Arrays are also slow when you have multiple threads, each bound to its own CPU, trying to access a global array. Suppose you have an array of 4 int32_t values and 4 threads labeled 0 to 3, where the i-th thread repeatedly modifies the i-th index in the array. This causes major slowdowns, because each modification invalidates the cached copy held by the other 3 threads/CPUs, and each thread has to do extra work to refresh its cache, even though the value that the thread cares about isn't being modified. This essentially makes the parallel code sequential. A linked list would be much better to use here since the memory locations will not be adjacent to each other, or each thread can modify a local variable and only write to the array at the end.
I have code that uses multiple threads to generate large arrays using perlin noise, and I get near perfect scaling with threads. If you were doing something where multiple threads made tons of modifications, the L3 cache would still be valid, as that is shared between cores, so it wouldn't be that big of an issue. The bigger issue would be multiple threads constantly updating the same data causing race conditions, or bizarre behavior due to how memory reads and writes work.
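The slowdown described two comments up is usually called false sharing. A back-of-the-envelope sketch of why it happens and how padding fixes it, assuming a 64-byte cache line (typical on x86, but hardware-dependent):

```javascript
// Which 64-byte cache line does element i of an int32 array fall on?
// (64 bytes is a typical line size, not a guarantee.)
const CACHE_LINE = 64;
const BYTES_PER_INT32 = 4;

const lineOf = (i) => Math.floor((i * BYTES_PER_INT32) / CACHE_LINE);

// Four threads writing counters[0..3]: all four counters share line 0,
// so every write invalidates that line in the other cores' caches.
const packed = [0, 1, 2, 3].map(lineOf);                // → [0, 0, 0, 0]

// Padding each counter out to its own 64-byte slot avoids the contention:
const PAD = CACHE_LINE / BYTES_PER_INT32;               // 16 int32 slots
const padded = [0, 1, 2, 3].map(t => lineOf(t * PAD));  // → [0, 1, 2, 3]
```

With padding, each thread's writes stay on a line no other thread touches, so the coherence traffic disappears.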
Amazing video Simon, never thought about why arrays are fast, in spite of using them more than a hundred times. Now, since I am working on my own project, I feel the need for them.
An amazing series start, Simon! Thank you for this dive into memory allocation/access details, very helpful! Surely looking forward to a (contiguous?) continuation and your other quick projects & cool tips! Very kind of you to put in so much work and share it. Again, thanks!
Loving this series man. Please keep going. Would love to see more real world optimization like you did previously. But yeah, fundamentals need to be cleared up first. Btw I remember there was one CPU-level hack which exploited this eager loading behavior. A private part of memory was located by measuring access speed: if the access was faster, that meant the memory location just after the current one had already been cached by the CPU, even though actually using it was restricted due to other privacy constraints.
A simple and interesting explanation of arrays. I came here to watch this video just to make sure I knew exactly how everything works. Having hardware classes alongside programming courses really does help a lot to understand computers, on both the hardware and software level.
Posting this for anyone else out there like me. As someone who knows JUST enough code to be dangerous, I feel dense for not getting it sooner. My brain absolutely looked at the two for loops, saw them moving through the same numbers, and even though I recognized there was a difference in the last line, my thought process collapsed to "alright so you move through all the same array positions eventually," and I spent ten minutes trying to figure out, on my own, why changing the order of operations would make it faster while ignoring, y'know, the actual result of iteratively running the loops. The first comment I saw mentioned row vs column major ordering, and even then, I thought "that's silly, it's a 1D array!" and failed to make the connection. It took me seeing someone list the resulting indices in order for it to finally click. Despite so much of the video focusing on contiguousness. So for anyone else that did the same, we're a special kind of stupid, but at least not a unique kind.
This is a great high-level view of arrays and CPU caching! I wish I'd had this video as an introduction when I was taking my Computer Systems class in college!
why does the code on the top run 10 times faster? I thought you were going to answer it in this video. By looking at it I can deduce that it has to do with the nested loop. The first code goes through 4000 contiguous data points, then moves on to another batch of 4000, which should already be cached. The second skips between batches, grabbing the first element from each of them, then the second, etc. I don't see why it wouldn't be prefetched if it's repeatable.
Yup, the bottom code block is leapfrogging by 4000 positions in every iteration of the nested loop, but the top code block accesses the array sequentially. It's basically the difference between grabbing one sandwich from the plate and eating it completely before moving on to the next VS taking a bite from a different sandwich every time until your plate is clean.
I optimized some existing numerical C++ code (for signal processing) from 420ms per iteration to 45ms, but the goal was 30ms per iteration. There was a giant nested for loop taking 44ms to run, so I applied the change shown in the video to reduce cache misses. Now it's taking 22ms. Maybe I could further optimize it by writing a special data structure (probably just a linked list), but that's good enough for the manager for now and he has other priorities for me.
There are pros and cons to this worth noting. It can be a valid form of redundancy to use an array as a look-up to speed up data coming from a query in a relational database (rather than hitting the query over and over). That helps performance sometimes. However, the cache vulnerabilities can be exploited. An array also doesn't give you the same benefits as a linked list in terms of being able to pass pointers around by address versus by reference. On the other hand, in some languages a linked list can be a problem since it needs objects, which means garbage collection (Java) can happen at any time, making performance hard to predict and sometimes rather stuttery.
Are we working our way towards ECS? ...I'm getting those vibes haha. This also reminded me of a Channel 9 talk I recently saw, "Native Code Performance and Memory: The Elephant in the CPU". Blew my mind how important cache misses are, and structuring your data in contiguous chunks.
Oh yeah definitely want to cover ecs at some point, why major companies are going that route, etc. Lot of ground to cover between beginner and that level though heh.
Took a course on advanced computer architecture; this video covered several weeks of it in 10 minutes. Wish I'd had this kind of resource back then. Also had to make a simulator with several cache sizes and strategies, good times 😂
My only formal training in programming is an elective course in C in college. I've loved it so much I end up writing device drivers for baremetal code running on top of ARM Cortex M processors as a hobby. There is no substitute for learning the fundamentals. It pains me to see some universities starting Computer Science students with a course in Java...
if you aren't constrained by the requirement that the array must keep its ordering, you could also just take the value and then swap in the last item in the array when deleting, right?
Takeaway: Memory caching magic allows faster access of indexes that are close. What constitutes 'close' is arbitrary, the optimization aspect is accessing indexes in order of proximity. Also curious as to whether 'x * 4000' is cached behind the scenes. Otherwise, pulling it into an outer loop variable should be better.
How does the processor look up data in the L-caches? I thought it was looking not for specific data, but for whatever data is in a specific data address. Do the L-caches keep the data alongside the address of that data in memory, where it came from?
I think it uses the reachability idea. If there is no way a part of memory can be reached, it will be garbage collected. They didn't use the reference counting approach because that doesn't handle two pieces referencing each other that are completely isolated and not reachable.
8:26 it's actually only true if you want to keep the order of the elements. If not, you just move the element you'll overwrite to the end, and place the new element where the old one used to be. That's two copies every time, instead of an increasing number of copies as the array grows larger and you insert close to the beginning. Same thing for removing.
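The removal half of that trick in JavaScript, often called swap-remove (the function name is mine):

```javascript
// O(1) removal when element order doesn't matter: overwrite the removed
// slot with the last element, then shrink the array by one.
function swapRemove(arr, i) {
  arr[i] = arr[arr.length - 1];
  arr.pop();
  return arr;
}

swapRemove([10, 20, 30, 40], 1);  // → [10, 40, 30]
```

Two copies at most, regardless of where in the array you delete, versus shifting everything over when order has to be preserved.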
The video is good and correct, but I expected to hear something about arrays in JavaScript, which are actually objects. I was looking for details on how they behave regarding speed. What I also missed was information on how the built-in bounds checking in many languages affects array access speed.
I actually had a section about JS arrays in the original script, recorded it and everything. But decided that this series wasn't JS specific and cut it.
Considering MicroSD cards can now store 2TB... I very much suspect there will soon be an ultra-array in which the cache has almost every popular combination of 255 bytes each intermixed of read-only-memory in crystal and through lasers, whereas the memory pointers will have its own buffer... sort of like how we download a RUclips app for our smart devices to load a whole lot faster than without in requiring significantly less bandwidth plus the reloading.
Great content once again. Thanks. I would like to see a three.js tutorial about creating model outlines / silhouettes. I know about outlineEffect and outlinePass but I'd like to see any other techniques (I've heard about stencil buffer but I have no idea if it's possible in three.js). My problem is that the outlineEffect is crappy for boxes and the outlinePass is too expensive (for my needs). I'm wondering if there is a cheaper method to draw an outline.
No, unfortunately using an array does not fix the underlying problem. If you are not accessing the data in an array in a linear fashion, then you will produce excess cache misses no matter what. Any algorithm that does random data accesses basically throws 99% of your CPU performance away on a modern machine. So you bought this 4.8GHz beast of a CPU for $500 and what you are getting out of it is the performance of a $5 microcontroller, because both are limited by basically the same memory access time. The only way to prevent that is to be very aware at the algorithm level of what actually happens when the machine runs through the data.
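A quick way to see this in Node: visit the same array once with sequential indices and once with shuffled ones (the array size here is arbitrary, and timings will vary by machine):

```javascript
// Same array, same work — only the visiting order differs.
const n = 1 << 22;                       // ~4M elements
const vals = new Float64Array(n).fill(1);

// Sequential index order, and a shuffled copy of the same indices.
const order = Uint32Array.from({ length: n }, (_, i) => i);
const shuffled = order.slice();
for (let i = n - 1; i > 0; i--) {        // Fisher-Yates shuffle
  const j = Math.floor(Math.random() * (i + 1));
  [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
}

const walk = (idx) => { let s = 0; for (const i of idx) s += vals[i]; return s; };

console.time('sequential'); const a = walk(order);    console.timeEnd('sequential');
console.time('random');     const b = walk(shuffled); console.timeEnd('random');
// a === b: identical work, very different cache behaviour.
```

Both walks touch every element exactly once and produce the same sum; the gap in the timings is the cost of the cache misses.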
Hur hur hur, I actually understand this stuff now. =P. I've never used Hash before though. In my use case (targeting enemies near the player), I followed Unity's design for this; they have a new array each frame that grabs all the colliders on a specific layer around the player. What I did was change that to a list and cached the list, also giving it a size of 50 since I didn't see the game having more than 20 enemies in the scene at once within a detectable range. Lists are slower than arrays, about 10x to 100x slower depending on what I'm doing (or so I read). I was hoping that initializing the list in memory with 50 spaces would make it faster... The result is it did: changing over enemies and the player to this system gave a marked increase in performance. =D Happy ending. Why use List over the built-in Unity array? Well, Lists are easier to manipulate, and later on I want to put things in order of their Vector3 magnitude relative to the player. Plucking things from one list and adding them to another, while possible with arrays, allows for a bit more backwards checking that things are alright as I'm developing.
thank you simon! does this cpu caching also apply to javascript? i created a test script in node and only get a 2x performance improvement. i guess this only applies to compiled / strongly typed languages? i would really appreciate an answer.
It also applies in Javascript, since js also runs on the cpu and arrays are mostly contiguous in js too. The performance boost can differ based on your cpu and what you do.
Yeah it applies to JS too, that code snippet I had at the beginning was in JS :) To add to leomelki's response, which was great and covered this, remember that JavaScript arrays are more like "array-like" objects, see the docs: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array They're sparse and potentially non-contiguous, so yes these optimizations will apply but they'll also be partially cancelled out by just overall noise from JavaScript. If you use a TypedArray, you're getting much closer to what a C++ array is, and that's actually what I used in the example.
@@simondev758 thank you for answering my question. i know that js also uses the cpu and has arrays :) but since js is an interpreted language and not a compiled one, i was thinking that the js engine may do some things in the background that could get in the way of efficient cpu caching. im pretty solid in js but i have no idea what the js engine does in the background. i tried using a normal array and not a typed one, so maybe that was the reason i couldnt get the 10x improvement. anyway thanks a lot for all your great content.
Np, also keep in mind it'll be super dependent on what cpu you have, etc. You'll get a win with perfect caching vs all cache misses, the magnitude will vary from cpu to cpu.
for all who are interested in this, i did some more tests and could also get a 10x improvement in performance. my tests used a more real world example for my ecs system (array of structs/objects instead of numbers) and you can see very well how the performance changes simply by adding / removing properties. i can now add some performance improvements to my ecs system just by watching your video ☺️
Love the videos, but watching using my speakers makes my table tremble. I'd love it if you would run the sound through audacity or some other sound editing software to reduce the lower sound frequencies because my subwoofer deafens your voice. It would make the sound a lot crisper, too!
So please clarify this for me: the code at the top is faster because every array access in the loop is contiguous, so it can use the CPU's prefetch and hit the L1/L2/L3 cache? But in the code at the bottom, since the index is jumping around, the CPU can't prefetch, so it has to access main memory every time?
you only have to shift everything over in the array on an erase if you expect it to all stay in the same order though. in a lot of situations you don't care and can just fill in the hole with the last element and then decrease the size by 1 :D
I'm still confused as to why the top code in the beginning is 10x faster. I see what everyone is saying about row-major and column-major order, but that seems to only apply to the layout of multidimensional arrays, and the array in the code isn't multidimensional. It seems like the real reason is that the bottom code has to do 4000^2 multiplication operations, whereas the top code only has to do 4000 multiplications.
It's subtle, but it's there if you look closely. I'll put the answer below in case you want to look again first. Look at the 2 loops: they iterate exactly the same number of times, 4000 * 4000, or 16000000 iterations in total. Each accesses an array using the x and y values, and if you look at how they calculate the index of the array, there's exactly the same amount of work per index, a multiply and an add. You can confirm this for yourself by just using a smaller loop, say 3 x 3. The index calculation would be [x * 3 + y] for the first loop, and [x + 3 * y] for the second. Then the indices you'd get for the first set of loops would be [0, 1, 2, 3, 4, 5, 6, 7, 8], and the 2nd would give you [0, 3, 6, 1, 4, 7, 2, 5, 8]. Notice both loops touch every array element exactly once. The only difference is the order they access it in.
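You can verify those index sequences with a few lines of JavaScript:

```javascript
// Generate the index order of both loops for a 3x3 example.
const W = 3;
const rowMajor = [], colMajor = [];
for (let x = 0; x < W; x++)
  for (let y = 0; y < W; y++) {
    rowMajor.push(x * W + y);   // first loop's index
    colMajor.push(x + W * y);   // second loop's index
  }
// rowMajor → [0, 1, 2, 3, 4, 5, 6, 7, 8]
// colMajor → [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

Same elements, same count of multiplies and adds; only the visiting order changes.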
@@simondev758 Ok, I see how the bottom code could lead to less efficient caching, but I don't think that's the only difference. In the top code, x is being multiplied by 4000 on every iteration of that inner loop, but since x doesn't change in that inner loop, x * 4000 can be precomputed before executing the inner loop to avoid doing the same multiplication operation over and over. So in the top code, that multiplication only has to be done the 4000 times that x changes as a result of the outer loop. In the bottom code, however, y is what's being multiplied by 4000, and since y changes on every iteration of that inner loop, y * 4000 can never be precomputed, so it has to be calculated all 4000^2 times. So it looks like the top code only has to do 4000 multiplications, whereas the bottom code has to do 4000^2 multiplications, which is a difference of almost 16 million multiplications. It seems like that would be the main reason the top code is so much faster.
@@GamingBlake2002 Very possible, got me wondering, so I redid this quickly by writing the test again, but this time with only a single Test function containing the 2 loops, and the index calculated as i = x * p1 + y * p2; p1 and p2 being parameters to the function, fed from the command line. In other words, Test(buf1, 1, 4000) vs Test(buf2, 4000, 1). Compiled with g++ -O3. End result: same, so you might be right about the optimization on the multiplications, it seems to have little effect on the end result. I like the angle that you're looking at though, so in the future I'll make sure that if I'm going to show off the effects of something, I'll see about making a single unified code path with just mutated parameters.
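For reference, here's a JavaScript sketch of that unified code path (the original was C++ and its exact code isn't posted here, so the names and the 4000 size are approximations):

```javascript
// One code path, two parameterizations: the multiply+add work per index
// is identical either way, so only the access order can differ.
const SIZE = 4000;

function test(buf, p1, p2) {
  let sum = 0;
  for (let x = 0; x < SIZE; x++)
    for (let y = 0; y < SIZE; y++)
      sum += buf[x * p1 + y * p2];
  return sum;
}

const buf = new Float64Array(SIZE * SIZE).fill(1);
// test(buf, SIZE, 1) walks sequentially; test(buf, 1, SIZE) strides by SIZE.
```

Timing the two calls separates the memory-access effect from any loop-hoisting the optimizer might do, since both calls run the exact same instructions.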
But did you forget to answer the question at the beginning of this video? The real question is whether a Javascript array remembers its access pointer. Does it behave like C, where an array is basically a pointer? I copied your code and tested it; the first loop is indeed significantly faster, 4x faster on my machine. The only difference is that in the first loop the array was accessed sequentially, and in the second loop the access is kind of jumpy. So may I conclude that the implementation of a Javascript array stores the state of a pointer position, so moving the pointer to a closer position/index is faster than moving it to a more distant one?
If you enjoyed this, help support me: www.patreon.com/simondevyt
As for the question at the beginning, as a first hint, look at how the indices for the arrays are generated in both loops. Write out a few, and think about the pattern in memory. Actual answer below.
Both loops touch all memory exactly once.
First loop generates indices sequentially (ie. 0, 1, 2, 3, 4, ...), meaning memory accesses will also be sequential.
Second loop uses a large stride (0, 4000, 8000, 12000, ...), so cache misses will be common.
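For anyone wanting to try it, here's a rough reconstruction of the two loops in Node (the exact code from the video isn't posted here, so the names are assumptions; it did use a TypedArray):

```javascript
// Rough reconstruction of the video's two loops (names/sizes assumed).
const N = 4000;
const data = new Float64Array(N * N).fill(1);

function sequential(a) {          // indices 0, 1, 2, 3, ...
  let sum = 0;
  for (let x = 0; x < N; x++)
    for (let y = 0; y < N; y++)
      sum += a[x * N + y];
  return sum;
}

function strided(a) {             // indices 0, 4000, 8000, ...
  let sum = 0;
  for (let x = 0; x < N; x++)
    for (let y = 0; y < N; y++)
      sum += a[x + N * y];
  return sum;
}

console.time('sequential'); const s1 = sequential(data); console.timeEnd('sequential');
console.time('strided');    const s2 = strided(data);    console.timeEnd('strided');
// Both visit every element exactly once, so the sums match; only the
// access order differs, and with it the cache behaviour.
```

The timings you get will depend on your CPU and cache sizes, but the sequential version should come out well ahead.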
@@heinzerbrew It's the indexing, just unroll the loop a few times to confirm. Sequential, predictable memory access vs jumping around, main topic of video. Php? Don't know much about it, other than it's an interpreted language. Guessing the overhead of that drowns out other things.
@@heinzerbrew Again, you can't use php to test this. And this has nothing to do with memory allocations. Additionally, please don't use the pinned comment for this.
would a modern c/c++ compiler be smart enough to detect and optimize it?
This is a pretty contrived example. I tried to think of the absolute most basic thing that would illustrate the point without needing to build a lot of extra boilerplate code around it. But in this example, using "gcc -O3" doesn't optimize it, while gcc -Ofast does, because -Ofast disables strict standards compliance. So the compiler does sorta catch it, but is held back because you have to explicitly tell it that you don't care much about the order. Remember that floating point operations are NOT associative, so the compiler reordering them without explicit permission from the programmer is a no-no. V8 doesn't seem to catch it, but I can't say for sure why.
So the difference in timings between the loops should still largely come down to the difference in memory access patterns.
We could make the setup a little more complex with some structs with useless info to pad them out, or an array vs linked list, or something to that effect. Maybe it would have been a better choice? Would be more legit, but harder to parse for beginners. I dunno, I'm learning how to present this as I go heh.
@@hww3136 what if you intend to iterate that way? You'll get a bug
Best Director Award goes to Simon, I have never seen such a beautiful explanation without answering the question, amazing!!!!
hah
This channel is pure gold man. Although I already know most of these concepts, you explain them pretty clearly and in a very understandable way ! Well done friend, subbed!
This cleared up so many things that my professor failed to teach me (or even understand it himself)
We truly need more videos explaining how the cpu works when we code
Thank you so much!
Nice, lemme know if there's other subjects I should cover too.
“Why does the code on the top run faster? Watch the video and find out” ...and no mention of the code again. Otherwise nice video.
Hah yeah, I feel kinda stupid for not looping around, live and learn. I'll continue the series and make sure I actually answer it next time!
@@simondev758 Nah you managed to explain without relying on code. If anything cut the intro.
@@bonehelm maybe I am blind but the only thing he switched was the + and *
@@tagg1080 That's exactly the only change. He's using a formula to treat a 1D array as if it were a 2D array. But the order of + and * determines whether it's accessed row by row, or column by column.
You didn't explain why the original code posted was 10x slower. The reason is that the array was effectively traversed in an order that was neither sequential nor easily predictable, causing a lot of cache misses.
I dutifully paused the video at the beginning and took _way_ too long to spot the differences between the code samples. (I've never been much of a programmer.) After getting it I still learned a lot more from your discussion on the caches and all that. Thanks.
I feel like this is going to be an awesome series, especially for people like me who learnt programming by themselves and never really had a formal education teaching them what their code does under the hood.
I did study computer science, but none of this was covered. This is mostly from years of experience learning at companies and applying these optimizations.
Me too!
@@simondev758 i share this experience. in computer science, we also never covered that. only when i had to do projects and faced performance optimization problems was i motivated to tackle the fundamentals of data structures and how to squeeze more and more efficiency out of my code.
@@Ainigma wait really? I’m in the first semester of computer engineering and i’ve already learned about this, Idk if the way they teach is different here in Brazil or not
Different universities teach different curricula. I went to a university that's considered pretty good (ie. often included in top lists), the focus was much more on theory and mathematics. I mentored a few Stanford grads while I was at Google, came off with the same impression. Super strong bases in theory/math.
Also, computer engineering != computer science. I can totally believe that an engineering-oriented curriculum would cover way more about the underlying hardware.
Wow this video is almost 10 minutes but it felt like 1 minute! I really enjoyed it (Y) will definitely rewatch this later today!
Awesome!
The video became x10 faster for you
I was jumping around randomly and it felt like 10 minutes
Good Stuff, my man! The explanation is very clear and the dry humor is appreciated. I went into this video not knowing what a contiguous array was and now I have an idea of what it is.
Question: I understand that the L1, L2, and L3 caches reside on the CPU itself. Is there typically a controller on the CPU like a northbridge? Or is it the CPU itself that accesses the caches, hence the low latency; no middle man?
That's awesome! That's right, caches typically reside on the CPU. They're also built with a different, faster kind of memory (SRAM). The northbridge is the interface between the CPU & main memory and resides on the motherboard. That's my understanding anyway.
4:20 you ain't cheap bruh, your content is way better than expensive internet courses.
ty! If you have suggestions for future vids, let me know.
@@simondev758 I'd love this series, just keep doing it
hahaha weed number
@@simondev758 I hope you will grow so much that you will be able to afford one of those macs without feeling its price, i watch your videos for quite some time now and you deserve it.
"let's head to apple store. I don't actually own one, because they are super expensive and I'm super cheap, but I like to look at them" 😂🤣😆
Man this was soooo relatable on so many levels 😂
This has got to be the best educational video on youtube, fucking amazing
Awesome, glad you enjoyed it!
But why is the code example from the very start of the video faster than the other example? I feel like that wasn't covered or I missed the point.
Youness covered it well, but I left it open as sort of an "exercise for the viewer". Kinda wanted the video to be in depth enough that you could answer it yourself.
@@simondev758 Leaving as an exercise to the viewer is fine, but as a suggestion for the format it would be good to circle back to the question at the end of the video and call it out. The way it's structured now it seems like the question is just forgotten about.
Great feedback Mike, 100% agree! Tried it out but next time I'll make sure to close the circle.
One is accessing the array in row-major order, and the other is accessing it in column-major order. The top one is faster because the memory is being accessed in a contiguous way, which means you'll get a lot of cache hits (very fast, going to main memory much less). The bottom one is NOT accessing memory in a contiguous way, meaning almost every access is a cache MISS. en.wikipedia.org/wiki/Row-_and_column-major_order
@@bonehelm I still don't get it, both FOR loops start on x and then y. So both should have the same speed. The only difference here is that in the first one you do the multiplication before the addition, and in the second one the other way around... And it doesn't seem to be a contiguity problem here....
Hey, thanks for giving me a good explanation on what cache hits/misses were. I've seen it being tossed around but never knew what it meant until now.
For all those of you who are wondering why he accessed a 1d array as if it were 2d, here is a real-world example. Assume that in C++ you write a class that represents a matrix. You can go with a 1d array or something like vector< vector< int > >, which you can think of as a 2d array. So why the heck use a 1d array then? There are many pros to this approach. One of them is much faster work for the constructor and destructor. Assume that you need to allocate memory for an NxM matrix m. With a 1d array it is one call to malloc (which asks the system to allocate a chunk of memory for you). With a 2d array you would call malloc N times, once for each row. The same logic applies when you want to free the memory. So now, once we chose to represent our matrix as a 1d array, assume that the user of our class wants to sum all of its elements. The user has no idea how we chose to represent the matrix under the hood, he only knows that he can access an element of the matrix by using operator() as m(x, y). Your operator() in turn will have to calculate the position of an element by evaluating y * M + x, and in essence this will be exactly the same code as he showed in the video.
Awesome explanation, exactly what I had in mind when making the example.
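A minimal JS sketch of the flat-matrix idea from the comment above (the class and method names are made up for illustration; a Float64Array gives us the single contiguous allocation):

```javascript
// One contiguous block instead of an array-of-arrays; callers see (x, y),
// the class does the y * cols + x index math under the hood.
class Matrix {
  constructor(rows, cols) {
    this.rows = rows;
    this.cols = cols;
    // Single allocation for the whole matrix.
    this.data = new Float64Array(rows * cols);
  }
  get(x, y) {
    return this.data[y * this.cols + x];
  }
  set(x, y, v) {
    this.data[y * this.cols + x] = v;
  }
}

const m = new Matrix(3, 3);
m.set(2, 1, 42);
console.log(m.get(2, 1)); // 42
```

Summing every element of `m.data` with a plain sequential loop then gets the cache-friendly access pattern for free.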
Awesome! Looking forward to this series :)
It's really not easy teaching about memory and data structures without the audience falling asleep, but you've done a great job!
Unfortunately JS arrays are messy and not exactly contiguous... :(
Hah yeah, I actually had a small section on JS arrays in there just to clarify about them, but cut it due to time (and laziness).
You're totally right, they're non-contiguous "array-like" objects. The TypedArray is closer to what you'd get in C++ or something.
I have a final tomorrow over this stuff and this video just happened to pop into my recommended. Probably cause I was googling virtual memory, caching, etc. Super informational video, and I was able to study and be entertained at the same time. I’m a fan 👍🏼
Good luck on your final!
I'm glad I'm not the only one who noticed he didn't answer the original question. I already knew most of what he discussed (how memory and the caches work) but he didn't explain how that applied to the question he posed. The reasonable answer (as I suspected before I watched the video) is that one piece of code accesses the data by rows and the other by columns. The cache would load the data by rows which would make the code that accesses the data by rows faster since there would be a lot of cache misses for the code that accessed the data by columns.
Yep, that is exactly what is going on.
Yes, It's because in the first one, for the whole time the inner loop executes, the starting address remains the same (x*4000), then we just move contiguously with the +y, which equates to +1.
In the second for-loop, the inner loop starting address is the same, but it's not contiguous, we move in steps of (4000*y) every single iteration, which is a huge step to get to the next element.
I don't remember I watched this video before until I saw the L1/2/3 Cache example. The drawing is making the video so interesting. Now I come back for the new videos after finishing my algorithm final exam!
Arrays are also slow when you have multiple threads, each bound to its own CPU, trying to access a global array. Suppose you have an array of 4 int32_t values, and 4 threads labeled 0 to 3. The i-th thread accesses the i-th index in the array and modifies it repeatedly. This will cause major slowdowns, as each modification invalidates the cache line in the other 3 threads/CPUs, and each thread will have to do extra work to update its cache, even though the value that thread cares about isn't being modified (this is known as false sharing). It essentially makes the parallel code sequential. A linked list would be much better to use here since the memory locations will not be adjacent to each other, or each thread can modify a local variable and only write to the array at the end.
I have code that uses multiple threads to generate large arrays using perlin noise, and get near perfect scaling with threads. If you were doing something that multiple threads had tons of modifications, the L3 cache would still be valid, as that is shared between cores, so it wouldn't be that big of an issue. The bigger issue would be multiple threads constantly updating the same data causing race conditions, or bizarre behavior due to how memory reads and writes work.
AMAZING!!!
Waiting to watch the rest of this series
thx!
Amazing Video Simon never thought about why arrays are fast, in spite of using them more than a hundred times. Now, since I am working on my own project, I feel the need for them.
An amazing series start, Simon! Thank you for this dive into memory allocation/access details, very helpful! Surely looking forward to a (contiguous?) continuation and your other quick projects & cool tips! Very kind of you to put in so much work and share it. Again, thanks!
np, glad you enjoyed it!
Absolutely loved it. Just the right way to explain a concept top to bottom. Keep it up
Holy f. This is so good, youtube finally recommended me something that's worth my time. Please, release more episodes!!!
Second video in the series is the linked list one.
Man. Your channel (even though I don't use js so much) is freaking awesome!! I would buy your course or book without hesitation! Thank you!
No courses at the moment, I just put it all online for free :)
This is explained so clearly, I'm now going to binge watch your videos 😃 Thanks to YT for recommending me your channel. Subscribed! 🌟
you are made for creating videos like these, I could watch this for hours, thanks so much for the content
Even if I know this topic, your video was so interesting that time has passed unnoticed and I watched whole video. Wow
Loving this series man. Please keep going. Would love to see more of real world optimization like you did previously. But yeah fundamentals need to be cleared up first.
Btw I remember there was one cpu-level hack which exploited this eager loading behavior. Some private part of memory was located by measuring access speed: if the access was faster, that meant the memory location just after the current one was already cached by the cpu, even though actually reading it was forbidden by access protections.
More real world optimization? Sure, I can probably swing that.
Sounds a lot like Meltdown: en.wikipedia.org/wiki/Meltdown_(security_vulnerability)
@@simondev758 ah yeah, its Meltdown.
Thanks man. Love your videos.
A simple and interesting explanation of arrays, I came here to watch this video just to make sure I knew exactly how everything works. Having hardware classes alongside programming courses really does help out a lot to understand computers, both on the hardware and software level.
Thank you so much for creating this series, this is what ( even now ) I think more programmers ( and sometimes myself ) should pay attention to.
Probably the best video I have ever watched about how arrays work
I wasn’t ready for it to be over!
Loved this video! Will enjoy watching the whole series definitely.
This was amazing. THANK YOU. More like this for Javascript. Keep it coming and don't stop!
Posting this for anyone else out there like me.
As someone who knows JUST enough code to be dangerous, I feel dense for not getting it sooner.
My brain absolutely looked at the two for loops, saw them moving through the same numbers, and even though I recognized there was a difference in the last line, my thought process collapsed to "alright so you move through all the same array positions eventually," and I spent ten minutes trying to figure out, on my own, why changing the order of operations would make it faster while ignoring, y'know, the actual result of iteratively running the loops.
The first comment I saw mentioned row vs column major ordering, and even then, I thought "that's silly, it's a 1D array!" and failed to make the connection.
Took me seeing someone list the resulting indices in order to finally click. Despite so much of the video focusing on contiguousness.
So for anyone else that did the same, we're a special kind of stupid, but at least not a unique kind.
It's hard, which is why optimization is often a specialty.
I love how he goes into depth of things, please continue your great efforts sir 🔥
If you'd have been my tuition teacher I'd have been a topper in my class
Your way of explaining is so cool man👌
I'm inspired
Very helpful video, you explained everything clearly. Reminded me of my university teacher back in the day. Thank you for the upload!
Awesome, if you have suggestions on topics to cover too, let me know
This is a very great high-level view of arrays and CPU caching! I wish I had this video as an introduction when I was taking my Computer Systems class in college!
Arrays are fast, except when they aren’t ;)
Great video!
Well said!
Got it as a recommendation, instantly subscribed.
Great explanation. Clear and concise. Thanks!
why does the code on the top run 10 times faster? I thought you were going to answer it in this video. By looking at it I can deduce that it has to do with the nested loop. The first code goes through 4000 contiguous data points, then moves onto another batch of 4000, which should already be cached. The second skips between batches, grabbing the first element from all of them, then the second, etc. I don't see why it wouldn't be prefetched if it's repeatable.
Yup, the bottom code block is leapfrogging by 4000 positions in every iteration of the nested loop but the top code block accesses the array sequentially. It's basically the difference between grabbing one sandwich from the plate and eating it completely before moving onto the next VS taking a bite from a different sandwich every time until your plate is clean.
Already knew these concepts but still an amazing video, definitely helpful to cs students!
your videos deserve a way bigger audience
Sir, your channel is a gem! thx for these videos
8:24 the animation made me smile
Great video. You may have forgotten to revisit and explain in detail the code at the start though...
Yes... yes I did heh. Answer is pinned.
Wow, man. I mean literally you just made it simple; not like other people who go through so much technical stuff and use buzz words.
I mentored a junior guy recently who did that, spoke entirely in jargon and buzzwords. I always wondered if he specifically practiced doing that.
Superb series! thank you for the content!
Here king, you dropped this. . .
Thanks! I couldn't understand cache misses or contiguous arrays before your video. It felt like 3Blue1Brown of Computer Science
I just applied this on my internship and I solved a serious bottleneck!!! Thanks Simon
That's really cool! What kind of problem were you solving?
I optimized some existent numerical C++ code (for signal processing) from 420ms per iteration to 45ms, but the goal was 30ms per iteration. There is a giant nested for loop that was taking 44ms to run. So I applied the change specified in the video to reduce cache misses. Now it's taking 22ms. Maybe I could further optimize it by writing a special data structure (probably just a linked list), but that's good enough for the manager by now and he has other priorities for me.
This video was amazing, you have my subscription
There are pros and cons to this worth noting. It can be a valid form of redundancy to have an array as a 'look-up' to speed up data coming from a query in a relational database (rather than hitting the query over and over). That helps performance sometimes. However, the cache vulnerabilities can be exploited. An array does not have the same benefits of a linked-list in terms of being able to specify pointers for pass by address versus pass by reference. Also though, in some languages a linked list can be a problem as it needs objects, and that means garbage collection (Java) can happen at anytime, thereby making performance hard to predict and sometimes rather stuttering.
Are we working our way towards ecs? ...I'm getting those vibes haha.
This also reminded me of a channel 9 talk I recently saw. "Native Code Performance and Memory: The Elephant in the CPU"
Blew my mind the importance of cache misses and structuring your data in continuous chunks
Oh yeah definitely want to cover ecs at some point, why major companies are going that route, etc. Lot of ground to cover between beginner and that level though heh.
Esc stands for what?
@@abdelrahman5094 Entity-Component-System
Had a course on advanced computer architecture; this covered several weeks of it in a 10 min video. Wish I had this kind of resource back then. Also had to make a simulator with several caching sizes and strategies, good times 😂
Ooh that simulator sounds kinda fun to play with.
My only formal training in programming is an elective course in C in college. I've loved it so much I end up writing device drivers for baremetal code running on top of ARM Cortex M processors as a hobby. There is no substitute for learning the fundamentals. It pains me to see some universities starting Computer Science students with a course in Java...
My first exposure to programming was Java in my intro to computer science class heh :P
Some people are just naturally good at logic and programming. We can count ourselves in that group
if you aren't constrained by the requirement that the array must keep its ordering, you could also just take the value then swap in the last item in the array when deleting, right?
yep
Took me a good minute to find the difference between the 2 codes.
6:02 Snipin's a good job, mate
Hi Man. Very nice explanation. I was curious how you made the animations in your video.
Takeaway: Memory caching magic allows faster access of indexes that are close. What constitutes 'close' is arbitrary, the optimization aspect is accessing indexes in order of proximity.
Also curious as to whether 'x * 4000' is cached behind the scenes. Otherwise, pulling it into an outer loop variable should be better.
basically every optimizer will pull constants out of loops
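Here's what that hoist looks like if you do it by hand (a sketch I wrote, not the video's code; optimizers typically do this for you):

```javascript
// x * N doesn't change inside the inner loop, so compute it once per row
// instead of on every single iteration.
const N = 100;
const grid = new Float64Array(N * N).fill(1);

let total = 0;
for (let x = 0; x < N; x++) {
  const rowStart = x * N; // hoisted loop-invariant
  for (let y = 0; y < N; y++) {
    total += grid[rowStart + y];
  }
}
console.log(total); // 10000 — every element visited exactly once
```

Whether you write the multiply inside the loop or hoist it yourself, the generated code usually ends up the same; the memory access order is what actually matters here.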
i found this channel and its briliant
How does the processor look up data in the L-caches? I thought it was looking not for specific data, but for whatever data is at a specific address. Do the L-caches keep the data alongside the memory address it came from?
Great explanation, thanks for your effort
I hope to see future videos on how javascript garbage collection work behind the scenes
Ooh that'd be an interesting subject.
I think it uses the reachability idea: if there is no way a piece of memory can be reached, it will be garbage collected.
They didn't use the reference-counting approach because that doesn't handle two pieces that reference each other but are completely isolated and unreachable.
8:26 it's actually only true if you want to keep the order of the elements. If not, you just move the element you'll overwrite at the end, and place the new element where the old one used to be. This is two copies every time instead of an increasing number of copies as the array grows larger and you insert close to the beginning. Same thing for removing.
Good point, I mention the same trick in the next video I think.
Great job, I like your style. Easy like and sub 💯
Great video. I’m your fan for now
this was very helpful. thank you
np!
more js content sir love it❤️ thank you
Thanks for sharing this amazing tutorial!
The video is good and correct, but I expected to hear something about arrays in JavaScript, which are indeed objects. I was looking for info on how they behave in detail regarding speed. What I also missed was information on how the built-in bounds checks in many languages affect array access speed.
I actually had a section about JS arrays in the original script, recorded it and everything. But decided that this series wasn't JS specific and cut it.
Considering MicroSD cards can now store 2TB... I very much suspect there will soon be an ultra-array in which the cache has almost every popular combination of 255 bytes each intermixed of read-only-memory in crystal and through lasers, whereas the memory pointers will have its own buffer... sort of like how we download a RUclips app for our smart devices to load a whole lot faster than without in requiring significantly less bandwidth plus the reloading.
Your content PLUS the video production = 🙌 SUBSCRIBED
Great video! But I still do not understand the Intro example, is the array in the bottom a dynamic array or just the top one a hash table?
array
Great content once again. Thanks. I would like to see a three.js tutorial about creating model outlines / silhouettes. I know about outlineEffect and outlinePass but I'd like to see any other techniques (I've heard about stencil buffer but I have no idea if it's possible in three.js). My problem is that the outlineEffect is crappy for boxes and the outlinePass is too expensive (for my needs). I'm wondering if there is a cheaper method to draw an outline.
Great suggestion, I'll see what I can do!
I want more!! Can't wait
thx!
Your videos are amazing! Thank you for this content
Learned a lot. Thank you
I love this. Thank you for sharing.
thx
Okey, you got me :) awesome videos!
For a cache boost, does the program need to be designed in a certain way? Does using arrays when possible automatically boost performance?
No, unfortunately using an array does not fix the underlying problem. If you are not accessing the data in an array in a linear fashion, then you will produce excess cache misses no matter what. Any algorithm that does random data accesses basically throws 99% of your CPU performance away on a modern machine. So you bought this 4.8GHz beast of a CPU for $500 and what you are getting out of it is the performance of a $5 microcontroller, because both are limited by basically the same memory access time. The only way to prevent that is to be very aware on the algorithm level of what actually happens when the machine runs through the data.
Was hoping to see some architectural variations in c++ on heap and stack that optimize cache hits.
Great vid tho.
Hur hur hur, I actually understand this stuff now. =P.
I've never used Hash before though. In my use case (targeting enemies near the player), I followed Unity's design for this; they have a new array each frame that grabs all the colliders on a specific layer around the player.
What I did was change that to a list and cached the list, also giving it a size of 50 since I didn't see the game having more than 20 enemies in the scene at once within a detectable range.
Lists are slower than arrays, about 10x to 100x slower depending on what I'm doing (Or so I read). I'm hoping doing it the way I did it by initializing the list in memory with 50 spaces for things makes it faster... The result is it did, changing over enemies and the player to this system gave a marked increase in performance. =D Happy Ending.
Why use List over the built in Unity array? Well Lists are easier to manipulate, and later on I want to put things in order of their vector 3 magnitude with the player. So plucking things from one list and adding it to the other while possible with Array to List allows for a bit more backwards checking if things are alright as I'm developing.
thank you simon! does this cpu caching also apply to javascript? i created a test script in node and only got a 2x performance improvement. i guess this only applies to compiled / strongly typed languages? i would really appreciate an answer.
It also appears in JavaScript, since js also uses the cpu and arrays are mostly contiguous in js too.
The performance boost can differ based on your cpu and what you do
Yeah it applies to JS too, that code snippet I had at the beginning was in JS :)
To add to leomelki's response, which was great and covered this, remember that JavaScript arrays are more like "array-like" objects, see the docs: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array
They're sparse and potentially non-contiguous, so yes these optimizations will apply but they'll also be partially cancelled out by just overall noise from JavaScript. If you use a TypedArray, you're getting much closer to what a C++ array is, and that's actually what I used in the example.
@@simondev758 thank you for answering my question. i know that js also uses the cpu and has arrays :) but since js is an interpreted language and not a compiled one, i was thinking the js engine may do some things in the background that could get in the way of efficient cpu caching. im pretty solid in js but i have no idea what the js engine does in the background. i tried using a normal array and not a typed one, maybe that was the reason i couldnt get the 10x improvement. anyway thanks a lot for all your great content.
Np, also keep in mind it'll be super dependent on what cpu you have, etc. You'll get a win with perfect caching vs all cache misses, the magnitude will vary from cpu to cpu.
for all who are interested in this, i did some more tests and could also get a 10x improvement in performance. my tests used a more real-world example for my ecs system (array of structs/objects instead of numbers) and you can see very well how the performance changes simply by adding / removing properties. i can now add some performance improvements to my ecs system just by watching your video ☺️
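To illustrate the sparse-array vs TypedArray point from the thread above, a quick demo (my own, run under Node):

```javascript
// Regular arrays can silently become sparse "array-like" objects:
const sparse = [];
sparse[0] = 1;
sparse[1000000] = 2; // no million elements allocated in between
console.log(sparse.length); // 1000001, but indices 1..999999 are holes

// A TypedArray is a fixed-size, contiguous block of typed memory,
// much closer to a C++ array:
const dense = new Float64Array(8); // 8 doubles, zero-initialized
dense[3] = 1.5;
console.log(dense.length, dense[0], dense[3]); // 8 0 1.5
```

The engine can't guarantee contiguous storage for the sparse case, which is part of why plain-array benchmarks are noisier than TypedArray ones.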
Love the videos, but watching using my speakers makes my table tremble. I'd love it if you would run the sound through audacity or some other sound editing software to reduce the lower sound frequencies because my subwoofer deafens your voice. It would make the sound a lot crisper, too!
Interesting, I can try it out. Already process it through Audacity to clean it up/edit.
What font do you use for VSCode? I like it.
Whatever the default is, literally did no customization heh
@@simondev758 I found it out, its Cascadia Code.
So please clarify this for me: the code at the top is faster because every array access in the loop is contiguous, so it can use the CPU's prefetching and hit the L1/L2/L3 caches? But in the code at the bottom, since the index is jumping around, the CPU can't prefetch and has to access main memory every time?
In C++ std::vector is like you said an array bigger than you need, in c# List is a binary tree
Didn't know C#'s underlying data structure, that's interesting.
you only have to shift everything over in the array on an erase if you expect it to all stay in the same order though. in a lot of situations you don't care and can just fill in the hole with the last element and then decrease the size by 1 :D
I'm still confused as to why the top code in the beginning is 10x faster. I see what everyone is saying about row-major and column-major order, but that seems to only apply to the layout of multidimensional arrays, and the array in the code isn't multidimensional. It seems like the real reason is that the bottom code has to do 4000^2 multiplication operations, whereas the top code only has to do 4000 multiplications.
It's subtle but it's there if you look closely. I'll put the answer below in case you want to look again first.
Look at the 2 loops: they loop exactly the same number of times. Each does 4000 * 4000 iterations, or 16000000 in total. Each accesses an array using the x and y values, and if you look at how they calculate the index of the array, there's exactly the same number of index calculations, each a multiply and an add, so exactly the same amount of work.
You can confirm this for yourself but just using a smaller loop, say 3 x 3. So the index calculation would be [x * 3 + y] for the first loop, and [x + 3 * y] for the second.
Then the indices you'd get for the first set of loops would be [0, 1, 2, 3, 4, 5, 6, 7, 8]
The 2nd would give you [0, 3, 6, 1, 4, 7, 2, 5, 8]
Notice both loops touch every array element exactly once. The only difference is the order they access it in.
@@simondev758
Ok, I see how the bottom code could lead to less efficient caching, but I don't think that's the only difference.
In the top code, x is being multiplied by 4000 on every iteration of that inner loop, but since x doesn't change in that inner loop, x * 4000 can be precomputed before executing the inner loop to avoid doing the same multiplication operation over and over. So in the top code, that multiplication only has to be done the 4000 times that x changes as a result of the outer loop.
In the bottom code, however, y is what's being multiplied by 4000, and since y changes on every iteration of that inner loop, y * 4000 can never be precomputed, so it has to be calculated all 4000^2 times.
So it looks like the top code only has to do 4000 multiplications, whereas the bottom code has to do 4000^2 multiplications, which is a difference of almost 16 million multiplications. It seems like that would be the main reason the top code is so much faster.
@@GamingBlake2002 Very possible, got me wondering, so I redid this quickly by writing the test again, but this time with only a single Test function containing the 2 loops, and the index was calculated as i = x * p1 + y * p2;
p1 and p2 being parameters to the function, fed from the commandline.
In other words, Test(buf1, 1, 4000) vs Test(buf2, 4000, 1)
Compiled with g++ -O3
End result: same. So even if you're right about the optimization on the multiplications, it seems to have little effect on the end result.
I like the angle that you're looking at though, so in the future I'll make sure that if I'm going to show off the effects of something, I'll see about making a single unified code path with just mutated parameters.
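A JS version of that unified single-code-path test, as I understand it from the description above (the function and parameter names are my guesses, and I've shrunk the array so it runs quickly):

```javascript
// One code path for both access patterns: the stride is a parameter,
// so both calls do the identical number of multiplies and adds.
function sumWithStrides(buf, n, p1, p2) {
  let total = 0;
  for (let x = 0; x < n; x++) {
    for (let y = 0; y < n; y++) {
      total += buf[x * p1 + y * p2];
    }
  }
  return total;
}

const n = 500;
const buf = new Float64Array(n * n).fill(1);
console.log(sumWithStrides(buf, n, n, 1)); // 250000 — sequential access
console.log(sumWithStrides(buf, n, 1, n)); // 250000 — strided, same sum
```

Same arithmetic, same result; any timing gap between the two calls is down to the memory access order.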
Great series!
This was really clear and easy to understand, thanks
But did you forget to answer the question at the beginning of this video? The real question is whether a Javascript array remembers its access pointer? Does it behave like C that an array is basically a pointer?
I copied your code and tested; the first loop is indeed significantly faster, 4x faster on my machine. The only difference is that in the first loop the array was accessed sequentially, and in the second loop the access is kind of jumpy. So may I conclude that the implementation of a JavaScript array stores a pointer position state, so that moving the pointer to a closer position/index is faster than moving it to a more distant position/index?
Yeah I didn't answer it explicitly, figured the video did, but next time I'll have an explicit answer
This video kicks ass.
Well explained 👏👌👍
A Rick and Morty reference followed by the Apple store page. I think our friendship is over, have a good day sir.
Great tutorial outside that tho lol
heh