Loop Tiling - HTTP 203

Поделиться
HTML-код
  • Опубликовано: 11 дек 2024

Комментарии • 114

  • @lartme
    @lartme 5 лет назад +134

    The smallest "unit" that can be loaded or evicted in an Intel cache is 64-bytes, or a 16 pixel span of a 32-bit RGBA image row. When you write from the top-left to bottom-left of the destination image, you are accessing only one pixel of this 16-pixel chunk of the row. The typical L1 cache can only contain at the very most 512 of these 16-pixel chunks, best case (32Kb data cache). The performance increase comes from stopping before you get to the 512'th row and instead return to the top to fill out more of the 16-pixels in each cache line before they get evicted (stopping at some point before reaching 512 would be better because the cache replacement policy won't be perfect, and accessing the source image is also causing cache evictions).
    The above explains tile sizes above the 256-512 range perform badly in your graph, and why you would expect a ~10 speed up (it's worst case 15 more cache fetches without tiling). You might even get the same/better results from processing columns in the source image instead of tiles.

    • @dassurma
      @dassurma 5 лет назад +33

      Thanks for that detailed write-up! I learned more on the topic after we recorded that video, but not as detailed as you wrote it down :D

    • @DenisTRUFFAUT
      @DenisTRUFFAUT 5 лет назад +3

      So bottleneck is not the number of CPU, rather the knowledge of amount of L1 cache, allowing developer to create properly sized jobs. Is there any JS function that gives the amount of L1 cache ?

    • @lartme
      @lartme 5 лет назад +6

      @@DenisTRUFFAUT Not that I'm aware of, but for desktops the L1 cache specifications don't vary too much. There are also techniques that exploit caches with different specifications without knowing the details of the cache ("cache oblivious" algorithms).

    • @victornpb
      @victornpb 5 лет назад +1

      Isn’t tiling supposed to leverage SIMD instructions? Or even a tile is too big to be able to be transformed in a single instruction?

    • @youknowhu
      @youknowhu 5 лет назад

      Damn thanks for this explanation!

  • @CyberAcidPlanet
    @CyberAcidPlanet 5 лет назад +105

    That browser must've been on the cutting edge of technology.

    • @sahilsatishkumar
      @sahilsatishkumar 5 лет назад +6

      It must be cutting the edges, but not cutting edge.

    • @JimCullen
      @JimCullen 5 лет назад +7

      I dunno, they said it's usually a very good browser.

    • @omg_look_behind_you
      @omg_look_behind_you 5 лет назад +2

      DO YOU GUYS MEAN INTERNET EXPLORER? HUH GUYS? CAN I PLAY WITH YOU?

    • @barcicle
      @barcicle 5 лет назад +8

      come on, they just hit an edge case with that browser. i love how he says 'they hit corner case' instead

    • @pardal_bs
      @pardal_bs 5 лет назад +1

      I see what you did there...

  • @sahilsatishkumar
    @sahilsatishkumar 5 лет назад +22

    I have never seen Jake this bamboozled.

  • @hyyanabofakher6229
    @hyyanabofakher6229 5 лет назад +17

    The most detailed video I have watched in here so far , The topic is super interesting, Great work , keep it up guys

  • @AmxCsifier
    @AmxCsifier 5 лет назад +15

    5:17 corner case... you mean Edge case?

    • @dassurma
      @dassurma 5 лет назад +3

      Both work.

    • @AmxCsifier
      @AmxCsifier 5 лет назад +4

      @@dassurma it was a pun, i.e a joke LOL

  • @JoeWong81
    @JoeWong81 5 лет назад +1

    love the technicality of these explanations by Surma.

  • @iuliuvisovan7307
    @iuliuvisovan7307 5 лет назад +7

    This is by far your best video so far. Good job. Keep them coming.

  • @marshal7591
    @marshal7591 5 лет назад +2

    I love watching this channel, even though I don't fully understand everything they're talking about the majority of the time. Keep up the good work! I'm learning more and more as I go :)

    • @Abhishek-dp5tc
      @Abhishek-dp5tc 3 года назад

      That's the reason you love it because you don't fully understand

  • @lucacasonato
    @lucacasonato 5 лет назад +32

    It doesn't run on a Mac...
    Safari runs on a Mac. Chrome runs on a Mac. Firefox runs on a Mac. Opera runs on a Mac. Internet Explorer and Edge don't.
    IE doesn't support WASM. IT MUST BE EDGE! Who would have thought?! :-P

    • @khai96x
      @khai96x 5 лет назад +2

      Do not reveal the other browsers' identity.
      Well yes, but actually no.

    • @iuliuvisovan7307
      @iuliuvisovan7307 5 лет назад +1

      Is he using a Mac? I thought maybe if he was using Windows, it would've been Safari.

    • @iuliuvisovan7307
      @iuliuvisovan7307 5 лет назад +1

      Oh, I got the point in the video where he specifically said Mac. GG Edge

    • @dan-garden
      @dan-garden 5 лет назад

      Khải Hoàng We are allowed because we don’t represent Google, whether as they do. It’s Edge.

    • @youknowhu
      @youknowhu 5 лет назад +1

      As soon as they diplomatically said "But it's a very good browser", we knew it was Edge ;)
      Especially after Microsoft said it was forced to move to Chromium because Google kept intentionally breaking RUclips performance on Edge.

  • @sampathrg1
    @sampathrg1 4 года назад +2

    9:52 - Is that bug in the first line of the inner loop? says [newX, newY] = [newWidth - 1 - y, newY]. Shouldn't it be [newX, newY] = [newWidth - 1 - y, x]?

  • @lukemiles6813
    @lukemiles6813 5 лет назад +6

    Not sure if I'm missing something, but couldn't this be explained by the rotation? Reading the pixels a row at a time is much faster, but this means you're writing a column at a time for the result (so there are large gaps between the locations you're writing to). You want both fairly sequential reads and writes, so using tiling means that the write cache can cache a whole tile before writing it, rather than having to write single values spread out through memory.
    (edit): After re-listening through the bit about the two explanations, this might have been what you were saying actually, just in a fairly roundabout way.

    • @meldafert7619
      @meldafert7619 5 лет назад

      yes, this is how i understand it too

    • @lukejagodzinski
      @lukejagodzinski 5 лет назад

      I agree and I think it's a better answer to the problem than the one presented in the video. I think, what they're talking about is a little bit different. I remember when I was learning C++, it was emphasised that accessing array by index is very expensive, so that's why it's better to use pointers. Every time you do array[index] it has to go the beginning of the array in memory and move an array pointer to the given index. It's very expensive if you're jumping between rows. On the other hand, if you use pointer you're just responsible for moving it and just by having 2 pointers you can copy memory efficiently. However, it might also be a case that V8 does some optimisations when accessing array by index, so that problem would not apply here. It's just my intuition but I might be wrong about that.

    • @dassurma
      @dassurma 5 лет назад +3

      @@lukejagodzinski Accessing an array by index is not more or less expensive than doing the pointer arithmetic manually. You might be thinking about vectors from the stdlib, which overload the index operator and do a lot more under the hood.
      Luke’s got it right. And that is what I meant, but wasn’t as coherent as I could have been. What I sadly only learned only after the video was recorded is that writes to memory go through the cache as well and can also cause the cache to get evicted (I was under the impression that only reads do that). So yes, keeping both reads and writes limited to fairly small memory area (for example by using tiling) reduces the chance of cache eviction which means a speedup.

    • @lukejagodzinski
      @lukejagodzinski 5 лет назад

      @@dassurma so you say that V8 doesn't do extra jumps between memory cells when executing?
      array[100];
      array[1000];
      array[50];
      I would assume that it would do something like:
      1. Go to memory point where array starts
      2. Move array pointer by 100
      3. Go to memory point where array starts
      4. Move array pointer by 1000
      5. Go to memory point where array starts
      6. Move array pointer by 50.
      With pointers it would just be:
      1. Go to memory point where array starts
      2. Move array pointer by 100
      3. Move array pointer by 900
      4. Move array pointer by -850
      Also, as I think about that, tiling would actually not help with array access... So probably I'm wrong that it was the source of the problem. But still I think pointers would improve speed here

    • @dassurma
      @dassurma 5 лет назад

      @@lukejagodzinski I was referring to C++ (because you said that array indexing is expensive).
      For JavaScript, it’s hard to tell what kind of assembler V8 would generate for that code as it depends on the types and what behavior the optimizer observers. But the optimization you outline is definitely within the capabilities of V8.

  • @abdullahtayel5405
    @abdullahtayel5405 5 лет назад +10

    It is called cache blocking, I felt like you first time I was introduced to it. Here is a really good article on this technique software.intel.com/en-us/articles/cache-blocking-techniques
    and to be honest I never thought that any web developer would find himself in a situation where he needs to implement this technique in his work :D

    • @stepans2167
      @stepans2167 3 года назад

      Thanks for the link, it's a good read

  • @YoussefELHOUTI
    @YoussefELHOUTI 5 лет назад +2

    without tiling each write has to evict because you are rotating 90 degrees (x become y) when you do tiling, you don't need to evict foreach write and this why it's much faster

  • @marcopfeiffer3032
    @marcopfeiffer3032 3 года назад +1

    Can I try an explanation?
    The reading is nice sequential but the writing isn't.
    Let's take a 4x4 pixel image which is just a single dimensional 16 item array/block of memory.
    When you rotate clockwise, then you copy 1 to 4, then you copy 2 to 8, then 3 to 12 and then 4 to 16.
    If you tile the 4x4 image into 2x2 images, then you copy 1 to 4, then 2 to 8 _but then_ 5 to 3 and 6 to 7.
    So the first 4 pixels in the tiled rotate used the read-buffer 1-6 and write buffer 4-8 (6+4=10 pixels in cache total).
    The first 4 pixels in the full scan used read-buffer 1-4 but then write buffer 4-16 (4+12=16 pixels in cache total, basically the entire image).

  • @frutiboy1
    @frutiboy1 3 года назад +1

    Btw, In your Progressively Loading Images episode we can see that images are loaded tile by tile in some really progressive formats...

  • @sinaseirafi9445
    @sinaseirafi9445 5 лет назад +1

    This is the first video that I've seen from you guys, Really interesting. Thank you.

  • @niklashigi
    @niklashigi 5 лет назад +5

    16:40, you forgot to link the podcast in the description. Very interesting video, btw!

    • @dassurma
      @dassurma 5 лет назад +2

      I fixed it now! Thanks for letting me now

  • @lkri7951
    @lkri7951 5 лет назад +3

    Thats amazing thing you guys shared, Even though this is Engineering (logical reasoning) it sometimes works more like intuition 😊

  • @mpoisot
    @mpoisot 2 года назад

    Would a Hilbert curve be useful here for a more general solution, perhaps to avoid picking an explicit tile size?

  • @k1ngjulien_
    @k1ngjulien_ 5 лет назад +3

    Oh my god I loved this. JavaScript and low level cpu functionality are two of my favorite topics :D
    Thanks for making this.
    PS: Please tell your colleagues that work on GMail to finally make it a PWA so I don't have to use another program anymore :P

    • @jurgentreep
      @jurgentreep 5 лет назад

      If you open your google chrome apps you can right click gmail and set it to open in a new window. This is almost the same as a pwa :) You can even create a shortcut for your desktop or start menu.

  • @hypersonic12
    @hypersonic12 5 лет назад +1

    Fascinating video, guys! This was a great education, and entertaining at the same time. :)

  • @ben_lacy
    @ben_lacy 5 лет назад +1

    Favorite episode to date! Really good stuff

  • @NuncNuncNuncNunc
    @NuncNuncNuncNunc 4 года назад +2

    I was on the edge of my seat hoping you would name the browser.

  • @ZacharyBerger
    @ZacharyBerger 5 лет назад +2

    Could you share the link to the Hacker News discussion?

    • @ZacharyBerger
      @ZacharyBerger 5 лет назад +3

      Actually, found it myself: news.ycombinator.com/item?id=19166233

  • @victornpb
    @victornpb 5 лет назад

    Who doesn’t like a good browser war! The 90s is back guys

  • @martixy2
    @martixy2 5 лет назад +1

    Cache locality?
    Cache locality!
    It's actually a pretty popular concept in performance critical code (e.g. game engines). I keep seeing it mentioned in a ton of GDC talks on architecture/optimization/related.

  • @victornpb
    @victornpb 5 лет назад +3

    Don’t tell me you had to reshoot this just to remove the browser name

  • @aidanbrumsickle
    @aidanbrumsickle 5 лет назад

    If you wanted to write a parallel algorithm for rotation would you want to tile the image in the same way? that would seem to be useful for per-core caches, but then once separate cores are reading and writing from different locations it would seem to make shared cache performance worse. That's pure speculation on my part though. Perhaps a new side learning project is in order.

  • @emilemil1
    @emilemil1 5 лет назад +1

    I disagree that this is a "tools not rules" case. That this would be a performance issue is obvious from looking at the code, knowing the rules would let you catch it immediately.
    You're accessing a 2D array column by column. This *always* causes cache issues for large arrays, and images are very large. These are also obviously the instructions that will be run the most, by a significant margin, so it's worth thinking about from the start.
    In fact in C++ there is a whole paradigm dedicated to designing with the cache in mind. Look up data-oriented design.

    • @jakearchibald
      @jakearchibald 5 лет назад

      Without tools, we wouldn't be able to confirm the result.

    • @emilemil1
      @emilemil1 5 лет назад

      Confirming isn't needed in obvious cases, it's just time wasted on unnecessary profiling. I mean do you profile before using binary search over linear search on an array of 2 million values? This is the same situation, just more obscure information.

    • @dassurma
      @dassurma 5 лет назад +2

      @@emilemil1 If you had shown me the code before I learned about loop tiling, I’d have bet my ass that the new code was slower, even if maybe only marginally so.
      At some point you’ll have enough expertise in a field to be able to make these kind of calls without profiling and testing. For example: By now I know when using `will-change` will help and when it won’t. I don’t need to test, because _to me_ it’s obvious. But that doesn’t mean it’s obvious to everyone else and - more importantly - that others should just follow the advice “use will-change”. Even if you “know the rules” around `will-change`, there are a lot of scenarios where it will make things worse. So whenever I give advice around `will-change` I’d add “but measure and see if it actually helps”.
      It’s great if this case is obvious to you, but it wasn’t for me. We all have different backgrounds with different levels of knowledge. “Tools not rules” is a phrase that help us DevRels give advice in a way that everyone can use it while reducing the adverse effects of just listing rules.

  • @dum.briyani
    @dum.briyani 4 года назад

    Sur-matram. Optimization levelled up.

  • @pflasterstrips7254
    @pflasterstrips7254 2 года назад

    I expected this to be about code being parallelized by the compiler, and the outer tiling loops are better for that.

  • @akinwunmioluwaseun3772
    @akinwunmioluwaseun3772 5 лет назад

    Actually drew my breathe in with Jake at 21:10

  • @MacoveiVlad
    @MacoveiVlad 5 лет назад +1

    Let's hope the speed gain is not based on speculative execution. If it is it might go away with a OS patch and leave everyone wandering what happened. Or worse, someone might find a exploit.
    This performance gains look really close to the metal. One would expect many layers of abstraction for something that runs in a browser.

  • @JustJohnny
    @JustJohnny 5 лет назад +4

    I've guided some clients that don't have photo optimization software to Squoosh and even I find myself using it for time to time because it does a better job than Photoshop. Any chance we'll see batch processing?

    • @dassurma
      @dassurma 5 лет назад +2

      We are working on batch processing! (Currently implementing some lower-level building blocks that will pave the way for batch processing to land)

  • @OlleHellman
    @OlleHellman 5 лет назад +1

    Very interesting! Thank you for doing your best explaining it.

  • @kosnk
    @kosnk 3 года назад

    A great magical trick.
    Sadly, I still don't quite get it: some say in comments it's because the way the reading caching happens (better pattern prediction/cache flow through levels/cache eviction), others say that the write cache is also involved. I'll have to learn more on the topic, but at least knowing that such thing exists should help. Thanks! :)

  • @Morgan_Davis
    @Morgan_Davis 5 лет назад

    Can the optimal tile size be dynamically determined at runtime (with a small internal test of generated data) to avoid having to profile CPUs at all?

    • @dassurma
      @dassurma 5 лет назад +2

      You could run a small benchmark on your target device, but I think that’s not worth the effort. 16 seems like a pretty safe number :)

  • @furetosan
    @furetosan 2 года назад

    16:03 inaudible = AssemblyScript perhaps?

  • @DenisTRUFFAUT
    @DenisTRUFFAUT 5 лет назад

    Is there a general rule for tiling ? Or a JS function that gives the number of cores ?

    • @dassurma
      @dassurma 5 лет назад +1

      There is navigator.hardwareConcurrency

    • @ZacharyBerger
      @ZacharyBerger 5 лет назад +2

      @@dassurma but number of cores doesn't matter here, its the cache size

  • @MrZyclops
    @MrZyclops 5 лет назад

    *head explodes* great episode

  • @LeviBuzolic
    @LeviBuzolic 5 лет назад

    Super interesting video, great work! Entirely uncharted territory for me.

  • @iuliuvisovan7307
    @iuliuvisovan7307 5 лет назад +1

    10:15 Hey Jake, it's called an index :D

  • @anonymouse7074
    @anonymouse7074 5 лет назад

    Podcast not linked in the description 😂

    • @dassurma
      @dassurma 5 лет назад +3

      I am not a smart man.

  • @Marseleon
    @Marseleon 5 лет назад

    what was the second theory?

  • @dmitrisheley1998
    @dmitrisheley1998 5 лет назад

    now I'm intrigued... WHAT BROWSER WAS IT??

  • @lastdecember2496
    @lastdecember2496 5 лет назад +1

    Happy to hear that even google engineer not fully understand why their code working 😅

  • @NanobyteOnline
    @NanobyteOnline 5 лет назад +1

    Conclusion: It works! Great.

  • @arkanciscan
    @arkanciscan 5 лет назад

    Does it rhyme with "kitchenette restorer"?

    • @recursiv
      @recursiv 5 лет назад +1

      That one doesn't run wasm so no.

  • @cristianberroeta9866
    @cristianberroeta9866 5 лет назад

    Great video, very interesting stuff! I would have thought that tiling improves performance because of some sort of concurrency (many tiles being processed at the same time), but that's not the case, right?

    • @justintaddei
      @justintaddei 5 лет назад

      The JavaScript code in the video is single-threaded, so no.

  • @b3rakesh11
    @b3rakesh11 5 лет назад +1

    Thank you Google

  • @lakandoor1007
    @lakandoor1007 5 лет назад

    nice video about wasm :-)

  • @ColinRichardson
    @ColinRichardson 5 лет назад

    Best video EVER

  • @nicolas.stucki
    @nicolas.stucki 3 года назад

    Tiling improves performance because it helps keep the destination matrix in the cache.

  • @FlowertwigDotOrg
    @FlowertwigDotOrg 5 лет назад

    Actually useful info this time, thanks :)

    • @dassurma
      @dassurma 5 лет назад +4

      Do I sense shade?

  • @b3rakesh11
    @b3rakesh11 5 лет назад

    Thank you

  • @wmhilton-old
    @wmhilton-old 5 лет назад +1

    Any sufficiently advanced technology...

  • @apo34apo34
    @apo34apo34 5 лет назад +1

    simply woah ^^

  • @GottZ
    @GottZ 5 лет назад

    nice how they just described microsoft edge

  • @b3rakesh11
    @b3rakesh11 5 лет назад

    Awesome

  • @hobbyturystaSEO
    @hobbyturystaSEO 4 года назад

    https 203 0:30

  • @GifCoDigital
    @GifCoDigital 5 лет назад

    My head hurts now. lol :)

  • @franklemanschik_de
    @franklemanschik_de 4 года назад

    What happens here is simple you did the most basic optimization possible chunk your operations :)

  • @Benimation
    @Benimation 5 лет назад

    Nobody knows you're talking about IE..

  • @anonymouse7074
    @anonymouse7074 5 лет назад

    Name and shame!

  • @pharmokan
    @pharmokan 4 года назад

    what a horrible direction web development is going in

  • @panstromek
    @panstromek 5 лет назад

    DOD comes to web ;)

  • @zaffarmughal5478
    @zaffarmughal5478 5 лет назад

    could be better with nice and decent. no need to be clown.

    • @dassurma
      @dassurma 5 лет назад +2

      Where’d be the fun in that?