The smallest "unit" that can be loaded or evicted in an Intel cache is 64-bytes, or a 16 pixel span of a 32-bit RGBA image row. When you write from the top-left to bottom-left of the destination image, you are accessing only one pixel of this 16-pixel chunk of the row. The typical L1 cache can only contain at the very most 512 of these 16-pixel chunks, best case (32Kb data cache). The performance increase comes from stopping before you get to the 512'th row and instead return to the top to fill out more of the 16-pixels in each cache line before they get evicted (stopping at some point before reaching 512 would be better because the cache replacement policy won't be perfect, and accessing the source image is also causing cache evictions). The above explains tile sizes above the 256-512 range perform badly in your graph, and why you would expect a ~10 speed up (it's worst case 15 more cache fetches without tiling). You might even get the same/better results from processing columns in the source image instead of tiles.
So bottleneck is not the number of CPU, rather the knowledge of amount of L1 cache, allowing developer to create properly sized jobs. Is there any JS function that gives the amount of L1 cache ?
@@DenisTRUFFAUT Not that I'm aware of, but for desktops the L1 cache specifications don't vary too much. There are also techniques that exploit caches with different specifications without knowing the details of the cache ("cache oblivious" algorithms).
I love watching this channel, even though I don't fully understand everything they're talking about the majority of the time. Keep up the good work! I'm learning more and more as I go :)
It doesn't run on a Mac... Safari runs on a Mac. Chrome runs on a Mac. Firefox runs on a Mac. Opera runs on a Mac. Internet Explorer and Edge don't. IE doesn't support WASM. IT MUST BE EDGE! Who would have thought?! :-P
As soon as they diplomatically said "But it's a very good browser", we knew it was Edge ;) Especially after Microsoft said it was forced to move to Chromium because Google kept intentionally breaking RUclips performance on Edge.
9:52 - Is that bug in the first line of the inner loop? says [newX, newY] = [newWidth - 1 - y, newY]. Shouldn't it be [newX, newY] = [newWidth - 1 - y, x]?
Not sure if I'm missing something, but couldn't this be explained by the rotation? Reading the pixels a row at a time is much faster, but this means you're writing a column at a time for the result (so there are large gaps between the locations you're writing to). You want both fairly sequential reads and writes, so using tiling means that the write cache can cache a whole tile before writing it, rather than having to write single values spread out through memory. (edit): After re-listening through the bit about the two explanations, this might have been what you were saying actually, just in a fairly roundabout way.
I agree and I think it's a better answer to the problem than the one presented in the video. I think, what they're talking about is a little bit different. I remember when I was learning C++, it was emphasised that accessing array by index is very expensive, so that's why it's better to use pointers. Every time you do array[index] it has to go the beginning of the array in memory and move an array pointer to the given index. It's very expensive if you're jumping between rows. On the other hand, if you use pointer you're just responsible for moving it and just by having 2 pointers you can copy memory efficiently. However, it might also be a case that V8 does some optimisations when accessing array by index, so that problem would not apply here. It's just my intuition but I might be wrong about that.
@@lukejagodzinski Accessing an array by index is not more or less expensive than doing the pointer arithmetic manually. You might be thinking about vectors from the stdlib, which overload the index operator and do a lot more under the hood. Luke’s got it right. And that is what I meant, but wasn’t as coherent as I could have been. What I sadly only learned only after the video was recorded is that writes to memory go through the cache as well and can also cause the cache to get evicted (I was under the impression that only reads do that). So yes, keeping both reads and writes limited to fairly small memory area (for example by using tiling) reduces the chance of cache eviction which means a speedup.
@@dassurma so you say that V8 doesn't do extra jumps between memory cells when executing? array[100]; array[1000]; array[50]; I would assume that it would do something like: 1. Go to memory point where array starts 2. Move array pointer by 100 3. Go to memory point where array starts 4. Move array pointer by 1000 5. Go to memory point where array starts 6. Move array pointer by 50. With pointers it would just be: 1. Go to memory point where array starts 2. Move array pointer by 100 3. Move array pointer by 900 4. Move array pointer by -850 Also, as I think about that, tiling would actually not help with array access... So probably I'm wrong that it was the source of the problem. But still I think pointers would improve speed here
@@lukejagodzinski I was referring to C++ (because you said that array indexing is expensive). For JavaScript, it’s hard to tell what kind of assembler V8 would generate for that code as it depends on the types and what behavior the optimizer observers. But the optimization you outline is definitely within the capabilities of V8.
It is called cache blocking, I felt like you first time I was introduced to it. Here is a really good article on this technique software.intel.com/en-us/articles/cache-blocking-techniques and to be honest I never thought that any web developer would find himself in a situation where he needs to implement this technique in his work :D
without tiling each write has to evict because you are rotating 90 degrees (x become y) when you do tiling, you don't need to evict foreach write and this why it's much faster
Can I try an explanation? The reading is nice sequential but the writing isn't. Let's take a 4x4 pixel image which is just a single dimensional 16 item array/block of memory. When you rotate clockwise, then you copy 1 to 4, then you copy 2 to 8, then 3 to 12 and then 4 to 16. If you tile the 4x4 image into 2x2 images, then you copy 1 to 4, then 2 to 8 _but then_ 5 to 3 and 6 to 7. So the first 4 pixels in the tiled rotate used the read-buffer 1-6 and write buffer 4-8 (6+4=10 pixels in cache total). The first 4 pixels in the full scan used read-buffer 1-4 but then write buffer 4-16 (4+12=16 pixels in cache total, basically the entire image).
Oh my god I loved this. JavaScript and low level cpu functionality are two of my favorite topics :D Thanks for making this. PS: Please tell your colleagues that work on GMail to finally make it a PWA so I don't have to use another program anymore :P
If you open your google chrome apps you can right click gmail and set it to open in a new window. This is almost the same as a pwa :) You can even create a shortcut for your desktop or start menu.
Cache locality? Cache locality! It's actually a pretty popular concept in performance critical code (e.g. game engines). I keep seeing it mentioned in a ton of GDC talks on architecture/optimization/related.
If you wanted to write a parallel algorithm for rotation would you want to tile the image in the same way? that would seem to be useful for per-core caches, but then once separate cores are reading and writing from different locations it would seem to make shared cache performance worse. That's pure speculation on my part though. Perhaps a new side learning project is in order.
I disagree that this is a "tools not rules" case. That this would be a performance issue is obvious from looking at the code, knowing the rules would let you catch it immediately. You're accessing a 2D array column by column. This *always* causes cache issues for large arrays, and images are very large. These are also obviously the instructions that will be run the most, by a significant margin, so it's worth thinking about from the start. In fact in C++ there is a whole paradigm dedicated to designing with the cache in mind. Look up data-oriented design.
Confirming isn't needed in obvious cases, it's just time wasted on unnecessary profiling. I mean do you profile before using binary search over linear search on an array of 2 million values? This is the same situation, just more obscure information.
@@emilemil1 If you had shown me the code before I learned about loop tiling, I’d have bet my ass that the new code was slower, even if maybe only marginally so. At some point you’ll have enough expertise in a field to be able to make these kind of calls without profiling and testing. For example: By now I know when using `will-change` will help and when it won’t. I don’t need to test, because _to me_ it’s obvious. But that doesn’t mean it’s obvious to everyone else and - more importantly - that others should just follow the advice “use will-change”. Even if you “know the rules” around `will-change`, there are a lot of scenarios where it will make things worse. So whenever I give advice around `will-change` I’d add “but measure and see if it actually helps”. It’s great if this case is obvious to you, but it wasn’t for me. We all have different backgrounds with different levels of knowledge. “Tools not rules” is a phrase that help us DevRels give advice in a way that everyone can use it while reducing the adverse effects of just listing rules.
Let's hope the speed gain is not based on speculative execution. If it is it might go away with a OS patch and leave everyone wandering what happened. Or worse, someone might find a exploit. This performance gains look really close to the metal. One would expect many layers of abstraction for something that runs in a browser.
I've guided some clients that don't have photo optimization software to Squoosh and even I find myself using it for time to time because it does a better job than Photoshop. Any chance we'll see batch processing?
A great magical trick. Sadly, I still don't quite get it: some say in comments it's because the way the reading caching happens (better pattern prediction/cache flow through levels/cache eviction), others say that the write cache is also involved. I'll have to learn more on the topic, but at least knowing that such thing exists should help. Thanks! :)
Great video, very interesting stuff! I would have thought that tiling improves performance because of some sort of concurrency (many tiles being processed at the same time), but that's not the case, right?
The smallest "unit" that can be loaded or evicted in an Intel cache is 64-bytes, or a 16 pixel span of a 32-bit RGBA image row. When you write from the top-left to bottom-left of the destination image, you are accessing only one pixel of this 16-pixel chunk of the row. The typical L1 cache can only contain at the very most 512 of these 16-pixel chunks, best case (32Kb data cache). The performance increase comes from stopping before you get to the 512'th row and instead return to the top to fill out more of the 16-pixels in each cache line before they get evicted (stopping at some point before reaching 512 would be better because the cache replacement policy won't be perfect, and accessing the source image is also causing cache evictions).
The above explains why tile sizes above the 256-512 range perform badly in your graph, and why you would expect a ~10× speedup (in the worst case it's 15 more cache fetches per cache line without tiling). You might even get the same or better results from processing columns in the source image instead of tiles.
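To make that concrete, here's a minimal sketch of a tiled 90° clockwise rotate, assuming a row-major Uint32Array with one 32-bit RGBA pixel per element (all the names are mine, not the video's actual code):

function rotateClockwiseTiled(src, width, height, tileSize = 16) {
  // src is a row-major Uint32Array, one 32-bit RGBA pixel per element.
  const newWidth = height; // rotating 90° swaps the dimensions
  const dst = new Uint32Array(width * height);
  for (let tileY = 0; tileY < height; tileY += tileSize) {
    for (let tileX = 0; tileX < width; tileX += tileSize) {
      // Process one small tile at a time so both the reads and the
      // writes stay inside a cache-friendly region before moving on.
      const yEnd = Math.min(tileY + tileSize, height);
      const xEnd = Math.min(tileX + tileSize, width);
      for (let y = tileY; y < yEnd; y++) {
        for (let x = tileX; x < xEnd; x++) {
          const newX = newWidth - 1 - y;
          const newY = x;
          dst[newY * newWidth + newX] = src[y * width + x];
        }
      }
    }
  }
  return dst;
}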
Thanks for that detailed write-up! I learned more on the topic after we recorded that video, but not as detailed as you wrote it down :D
So the bottleneck is not the number of CPUs, but rather knowledge of the amount of L1 cache, allowing the developer to create properly sized jobs. Is there any JS function that gives the amount of L1 cache?
@@DenisTRUFFAUT Not that I'm aware of, but for desktops the L1 cache specifications don't vary too much. There are also techniques that exploit caches with different specifications without knowing the details of the cache ("cache oblivious" algorithms).
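To give a flavour of the cache-oblivious approach: instead of picking a tile size you recursively split the image in half until the pieces are trivially small, and the recursion naturally ends up fitting whatever caches the machine has. A rough sketch (all names are illustrative, not from any library):

// Cache-oblivious rotate: recursively split the region into halves
// until the pieces are trivially small; no explicit tile size needed.
function rotateRecursive(src, dst, width, height, x0, y0, w, h) {
  const newWidth = height;
  if (w <= 8 && h <= 8) {
    // Small enough: copy the block pixel by pixel.
    for (let y = y0; y < y0 + h; y++) {
      for (let x = x0; x < x0 + w; x++) {
        dst[x * newWidth + (newWidth - 1 - y)] = src[y * width + x];
      }
    }
    return;
  }
  // Split the longer side and recurse into both halves.
  if (w >= h) {
    const half = w >> 1;
    rotateRecursive(src, dst, width, height, x0, y0, half, h);
    rotateRecursive(src, dst, width, height, x0 + half, y0, w - half, h);
  } else {
    const half = h >> 1;
    rotateRecursive(src, dst, width, height, x0, y0, w, half);
    rotateRecursive(src, dst, width, height, x0, y0 + half, w, h - half);
  }
}

You'd kick it off with rotateRecursive(src, new Uint32Array(width * height), width, height, 0, 0, width, height).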
Isn’t tiling supposed to leverage SIMD instructions? Or is even a tile too big to be transformed in a single instruction?
Damn thanks for this explanation!
That browser must've been on the cutting edge of technology.
It must be cutting the edges, but not cutting edge.
I dunno, they said it's usually a very good browser.
DO YOU GUYS MEAN INTERNET EXPLORER? HUH GUYS? CAN I PLAY WITH YOU?
come on, they just hit an edge case with that browser. i love how he says 'they hit corner case' instead
I see what you did there...
I have never seen Jake this bamboozled.
The most detailed video I have watched on here so far. The topic is super interesting. Great work, keep it up guys!
5:17 corner case... you mean Edge case?
Both work.
@@dassurma it was a pun, i.e. a joke LOL
love the technicality of these explanations by Surma.
This is by far your best video so far. Good job. Keep them coming.
I love watching this channel, even though I don't fully understand everything they're talking about the majority of the time. Keep up the good work! I'm learning more and more as I go :)
That's the reason you love it: because you don't fully understand it.
It doesn't run on a Mac...
Safari runs on a Mac. Chrome runs on a Mac. Firefox runs on a Mac. Opera runs on a Mac. Internet Explorer and Edge don't.
IE doesn't support WASM. IT MUST BE EDGE! Who would have thought?! :-P
Do not reveal the other browsers' identity.
Well yes, but actually no.
Is he using a Mac? I thought maybe if he was using Windows, it would've been Safari.
Oh, I got to the point in the video where he specifically said Mac. GG Edge
Khải Hoàng We are allowed because we don’t represent Google, whereas they do. It’s Edge.
As soon as they diplomatically said "But it's a very good browser", we knew it was Edge ;)
Especially after Microsoft said it was forced to move to Chromium because Google kept intentionally breaking YouTube performance on Edge.
9:52 - Is that bug in the first line of the inner loop? It says [newX, newY] = [newWidth - 1 - y, newY]. Shouldn't it be [newX, newY] = [newWidth - 1 - y, x]?
Not sure if I'm missing something, but couldn't this be explained by the rotation? Reading the pixels a row at a time is much faster, but this means you're writing a column at a time for the result (so there are large gaps between the locations you're writing to). You want both fairly sequential reads and writes, so using tiling means that the write cache can cache a whole tile before writing it, rather than having to write single values spread out through memory.
(edit): After re-listening through the bit about the two explanations, this might have been what you were saying actually, just in a fairly roundabout way.
yes, this is how i understand it too
I agree and I think it's a better answer to the problem than the one presented in the video. I think what they're talking about is a little bit different. I remember when I was learning C++, it was emphasised that accessing an array by index is very expensive, so that's why it's better to use pointers. Every time you do array[index] it has to go to the beginning of the array in memory and move a pointer to the given index. It's very expensive if you're jumping between rows. On the other hand, if you use a pointer you're just responsible for moving it, and just by having 2 pointers you can copy memory efficiently. However, it might also be the case that V8 does some optimisations when accessing an array by index, so that problem would not apply here. It's just my intuition but I might be wrong about that.
@@lukejagodzinski Accessing an array by index is not more or less expensive than doing the pointer arithmetic manually. You might be thinking about vectors from the stdlib, which overload the index operator and do a lot more under the hood.
Luke’s got it right. And that is what I meant, but wasn’t as coherent as I could have been. What I sadly only learned after the video was recorded is that writes to memory go through the cache as well and can also cause cache lines to get evicted (I was under the impression that only reads do that). So yes, keeping both reads and writes limited to a fairly small memory area (for example by using tiling) reduces the chance of cache eviction, which means a speedup.
@@dassurma so you say that V8 doesn't do extra jumps between memory cells when executing?
array[100];
array[1000];
array[50];
I would assume that it would do something like:
1. Go to memory point where array starts
2. Move array pointer by 100
3. Go to memory point where array starts
4. Move array pointer by 1000
5. Go to memory point where array starts
6. Move array pointer by 50.
With pointers it would just be:
1. Go to memory point where array starts
2. Move array pointer by 100
3. Move array pointer by 900
4. Move array pointer by -950
Also, as I think about that, tiling would actually not help with array access... So probably I'm wrong that it was the source of the problem. But still I think pointers would improve speed here
@@lukejagodzinski I was referring to C++ (because you said that array indexing is expensive).
For JavaScript, it’s hard to tell what kind of assembly V8 would generate for that code, as it depends on the types and what behavior the optimizer observes. But the optimization you outline is definitely within the capabilities of V8.
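To make that concrete, here's the kind of code I mean. This is just a sketch, not a benchmarked claim; whether one version beats the other depends entirely on what the optimizer does with it:

const size = 1000;
const data = new Float64Array(size * size);

function sumByIndex(arr) {
  let total = 0;
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      total += arr[y * size + x]; // index recomputed every iteration
    }
  }
  return total;
}

function sumByOffset(arr) {
  let total = 0;
  let offset = 0; // advanced like a pointer instead of being recomputed
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      total += arr[offset++];
    }
  }
  return total;
}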
It is called cache blocking. I felt like you the first time I was introduced to it. Here is a really good article on this technique: software.intel.com/en-us/articles/cache-blocking-techniques
and to be honest I never thought that any web developer would find himself in a situation where he needs to implement this technique in his work :D
Thanks for the link, it's a good read
Without tiling, each write has to evict a cache line because you are rotating 90 degrees (x becomes y). When you do tiling, you don't need to evict for each write, and that's why it's much faster.
Can I try an explanation?
The reading is nicely sequential but the writing isn't.
Let's take a 4x4 pixel image which is just a single dimensional 16 item array/block of memory.
When you rotate clockwise, then you copy 1 to 4, then you copy 2 to 8, then 3 to 12 and then 4 to 16.
If you tile the 4x4 image into 2x2 images, then you copy 1 to 4, then 2 to 8 _but then_ 5 to 3 and 6 to 7.
So the first 4 pixels in the tiled rotate used the read-buffer span 1-6 and the write-buffer span 3-8 (roughly 12 pixels in cache total).
The first 4 pixels in the full scan used the read-buffer span 1-4 but then the write-buffer span 4-16 (pixels 1-16, basically the entire image in cache).
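If you want to check those numbers yourself, here's a throwaway sketch (1-based pixel numbers as above; dest() is just my own helper):

const W = 4, H = 4;
// 1-based destination pixel number for a source pixel at (x, y),
// using the same 90° clockwise rotation as above.
const dest = (x, y) => x * H + (H - 1 - y) + 1;

// Full row-by-row scan: the first four writes land at 4, 8, 12, 16.
console.log([0, 1, 2, 3].map((x) => dest(x, 0)));

// 2x2 tiling: the first tile's writes land at 4, 8, 3, 7, a much
// smaller span of the destination buffer.
console.log([[0, 0], [1, 0], [0, 1], [1, 1]].map(([x, y]) => dest(x, y)));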
Btw, in your Progressively Loading Images episode we can see that images are loaded tile by tile in some really progressive formats...
This is the first video that I've seen from you guys, Really interesting. Thank you.
16:40, you forgot to link the podcast in the description. Very interesting video, btw!
I fixed it now! Thanks for letting me know
That's an amazing thing you guys shared. Even though this is engineering (logical reasoning), it sometimes works more like intuition 😊
Would a Hilbert curve be useful here for a more general solution, perhaps to avoid picking an explicit tile size?
Oh my god I loved this. JavaScript and low level cpu functionality are two of my favorite topics :D
Thanks for making this.
PS: Please tell your colleagues that work on GMail to finally make it a PWA so I don't have to use another program anymore :P
If you open your Google Chrome apps you can right-click Gmail and set it to open in a new window. This is almost the same as a PWA :) You can even create a shortcut for your desktop or start menu.
Fascinating video, guys! This was a great education, and entertaining at the same time. :)
Favorite episode to date! Really good stuff
I was on the edge of my seat hoping you would name the browser.
Could you share the link to the Hacker News discussion?
Actually, found it myself: news.ycombinator.com/item?id=19166233
Who doesn’t like a good browser war! The 90s are back, guys
Cache locality?
Cache locality!
It's actually a pretty popular concept in performance critical code (e.g. game engines). I keep seeing it mentioned in a ton of GDC talks on architecture/optimization/related.
Don’t tell me you had to reshoot this just to remove the browser name
If you wanted to write a parallel algorithm for rotation, would you want to tile the image in the same way? That would seem to be useful for per-core caches, but then once separate cores are reading and writing from different locations it would seem to make shared cache performance worse. That's pure speculation on my part though. Perhaps a new side learning project is in order.
I disagree that this is a "tools not rules" case. That this would be a performance issue is obvious from looking at the code; knowing the rules would let you catch it immediately.
You're accessing a 2D array column by column. This *always* causes cache issues for large arrays, and images are very large. These are also obviously the instructions that will be run the most, by a significant margin, so it's worth thinking about from the start.
In fact in C++ there is a whole paradigm dedicated to designing with the cache in mind. Look up data-oriented design.
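For example (sizes are hypothetical; the image is row-major, so the second loop strides a whole row per step):

const width = 4096, height = 4096;
const img = new Uint32Array(width * height); // row-major image

// Row by row: consecutive elements, each cache line fully used.
for (let y = 0; y < height; y++) {
  for (let x = 0; x < width; x++) {
    img[y * width + x] = 0xff000000;
  }
}

// Column by column: every step jumps a whole row (16 KB here),
// so almost every access touches a different cache line.
for (let x = 0; x < width; x++) {
  for (let y = 0; y < height; y++) {
    img[y * width + x] = 0xff000000;
  }
}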
Without tools, we wouldn't be able to confirm the result.
Confirming isn't needed in obvious cases, it's just time wasted on unnecessary profiling. I mean do you profile before using binary search over linear search on an array of 2 million values? This is the same situation, just more obscure information.
@@emilemil1 If you had shown me the code before I learned about loop tiling, I’d have bet my ass that the new code was slower, even if maybe only marginally so.
At some point you’ll have enough expertise in a field to be able to make these kinds of calls without profiling and testing. For example: by now I know when using `will-change` will help and when it won’t. I don’t need to test, because _to me_ it’s obvious. But that doesn’t mean it’s obvious to everyone else and - more importantly - that others should just follow the advice “use will-change”. Even if you “know the rules” around `will-change`, there are a lot of scenarios where it will make things worse. So whenever I give advice around `will-change` I’d add “but measure and see if it actually helps”.
It’s great if this case is obvious to you, but it wasn’t for me. We all have different backgrounds with different levels of knowledge. “Tools not rules” is a phrase that helps us DevRels give advice in a way that everyone can use while reducing the adverse effects of just listing rules.
Sur-matram. Optimization levelled up.
I expected this to be about code being parallelized by the compiler, and the outer tiling loops are better for that.
Actually drew my breath in with Jake at 21:10
Let's hope the speed gain is not based on speculative execution. If it is, it might go away with an OS patch and leave everyone wondering what happened. Or worse, someone might find an exploit.
These performance gains look really close to the metal. One would expect many layers of abstraction for something that runs in a browser.
I've guided some clients that don't have photo optimization software to Squoosh, and even I find myself using it from time to time because it does a better job than Photoshop. Any chance we'll see batch processing?
We are working on batch processing! (Currently implementing some lower-level building blocks that will pave the way for batch processing to land)
Very interesting! Thank you for doing your best explaining it.
A great magical trick.
Sadly, I still don't quite get it: some say in the comments it's because of the way the read caching happens (better pattern prediction/cache flow through levels/cache eviction), others say that the write cache is also involved. I'll have to learn more on the topic, but at least knowing that such a thing exists should help. Thanks! :)
Can the optimal tile size be dynamically determined at runtime (with a small internal test of generated data) to avoid having to profile CPUs at all?
You could run a small benchmark on your target device, but I think that’s not worth the effort. 16 seems like a pretty safe number :)
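If someone did want to try it anyway, a rough sketch of such a runtime probe might look like this (all names and sizes are made up, and rotateTiled stands for whatever tiled rotate you're measuring):

// Rough sketch: rotate a small generated image with a few candidate
// tile sizes and keep whichever one is fastest on this device.
function pickTileSize(rotateTiled, candidates = [8, 16, 32, 64, 128]) {
  const size = 1024;
  const testImage = new Uint32Array(size * size).fill(0xdeadbeef);
  let best = candidates[0];
  let bestTime = Infinity;
  for (const tile of candidates) {
    const start = performance.now();
    rotateTiled(testImage, size, size, tile);
    const elapsed = performance.now() - start;
    if (elapsed < bestTime) {
      bestTime = elapsed;
      best = tile;
    }
  }
  return best;
}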
16:03 inaudible = AssemblyScript perhaps?
Is there a general rule for tiling? Or a JS function that gives the number of cores?
There is navigator.hardwareConcurrency
@@dassurma but the number of cores doesn't matter here, it's the cache size
*head explodes* great episode
Super interesting video, great work! Entirely uncharted territory for me.
10:15 Hey Jake, it's called an index :D
Podcast not linked in the description 😂
I am not a smart man.
what was the second theory?
now I'm intrigued... WHAT BROWSER WAS IT??
Edge
Happy to hear that even Google engineers don't fully understand why their code is working 😅
Conclusion: It works! Great.
Does it rhyme with "kitchenette restorer"?
That one doesn't run wasm so no.
Great video, very interesting stuff! I would have thought that tiling improves performance because of some sort of concurrency (many tiles being processed at the same time), but that's not the case, right?
The JavaScript code in the video is single-threaded, so no.
Thank you Google
nice video about wasm :-)
Best video EVER
Tiling improves performance because it helps keep the destination matrix in the cache.
Actually useful info this time, thanks :)
Do I sense shade?
Thank you
Any sufficiently advanced technology...
simply woah ^^
nice how they just described microsoft edge
Awesome
https 203 0:30
My head hurts now. lol :)
What happens here is simple: you did the most basic optimization possible, chunking your operations :)
Nobody knows you're talking about IE..
Name and shame!
what a horrible direction web development is going in
DOD comes to web ;)
Could be better if it was nicer and more decent. No need to be a clown.
Where’d be the fun in that?