@@blarghblargh I was trying to make a program go faster the other day with SIMD. My program ended up telling me that 2-0=0. I need to do more debugging on the SIMD lol
@@blarghblargh In my program I tried a lot of tricks and none were as fast as just the first unoptimized version. Even SIMD with 32 values per calculation was only half as fast. I guess the cache works in mysterious ways.
I'd be interested to see how thoroughly and reliably Rust can auto-vectorize code. C++ compilers often don't do such a great job, and Intel ISPC demos the benefits of a language specifically designed around vector code. It's not so much whether the compiler can and does vectorize code, but whether the vectorization is stable and doesn't silently degrade to scalar code when you modify the code slightly, or compile it with a different compiler version or build mode, etc. If you've bothered to ensure code is vectorized once, you usually want it to stay that way.
@@327efrain dunno if it's still the case, but at least up until a few years ago, game engine devs would use compiler intrinsics instead of trying to coax things to auto-vectorize. Then they could be guaranteed it would emit the code they were looking for.
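To make the intrinsics-vs-auto-vectorization point concrete, here is a minimal sketch (not from the video; it assumes an x86-64 target built with AVX2 and FMA enabled, e.g. -O3 -mavx2 -mfma, and the function names are made up). The plain loop may or may not vectorize, since strict float ordering often blocks reductions unless fast-math is allowed, whereas the intrinsics version pins the codegen down:

```cpp
// Sketch only: the same dot product written twice, once relying on
// auto-vectorization and once with explicit AVX2/FMA intrinsics.
// Assumes an x86-64 target compiled with something like: g++ -O3 -mavx2 -mfma
#include <immintrin.h>
#include <cstddef>

// Hopefully auto-vectorized -- but that depends on compiler, flags, and
// whether float reassociation is permitted.
float dot_auto(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Explicit AVX2 + FMA version: the codegen no longer depends on the optimizer.
float dot_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb, 8 lanes at once
    }
    float lanes[8];                            // horizontal reduction of partial sums
    _mm256_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3]
              + lanes[4] + lanes[5] + lanes[6] + lanes[7];
    for (; i < n; ++i)                         // scalar tail
        sum += a[i] * b[i];
    return sum;
}
```

The tradeoff is the one the reply describes: the intrinsics version is guaranteed, but it ties you to one ISA and has to be maintained by hand.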
How hard would it be to get NumPy or other programs doing LLM inference to take advantage of this? I'm not under the impression that it's being used already.
I'd like to add some details that I've noticed are either misleading or missing, since I optimized the GEMM algorithm as a university project and I'm familiar with it. That project probably ended up being the one I enjoyed the most in my entire degree. They'll need to be replies to this comment because of the length, though. It's great to see that there are videos explaining this stuff to the general public (well, maybe not that general in this case, but you know what I mean) and the reactions derived from it. Keep up the good work. Also, if you're short on content sometime, I'd be willing to go more in depth with this algorithm.
Let's start with the hardware comparisons. While it's true that a Xeon E5-2630 v3 peaks at 8 FLOP/cycle using avx2, that's not necessarily true of other processors. Some processors may execute several operations per cycle, depending on the hardware and the data dependencies between the instructions. For example, the Zen 4 architecture is theoretically capable of running 2x avx512 float additions in parallel with 2x avx512 float multiplications, or in other words: multiplying 32 floats while adding another 32 floats at the same time, reaching a mind-numbing 144 GFLOP/s per core for a Ryzen 9 7950X (2304 GFLOP/s across all its cores, which is actually not that far from the GPU in the video). Also, especially on desktop processors, the performance of single-threaded algorithms may be way higher than what the base clock of the processor would suggest, because of their insane boost speed on short-running single-core loads. For example, that processor has a base clock of 4.5GHz but a max boost clock of up to 5.7GHz without overclocking, depending on the conditions, which is a whopping 27% extra (or a burst at 182.4 GFLOP/s). If we take into consideration that said processor supports avx512, depending on the algorithm it may not even be worth it to parallelize, just due to the loss of the single-core boost combined with the parallelization overhead (which is something the video doesn't mention).

As for the GPUs, comparing CPUs vs GPUs is a bit like comparing apples to oranges. That is to say, it's plainly unfair. CPUs are more optimized for branch-oriented algorithms, while GPUs may have some problems with those kinds of algorithms. I tend to think of GPUs as an independent multi-core SIMD coprocessor on steroids. That is because of the way a GPU executes instructions: there's a "scheduler" per group of 32 cores (64 in older RADEON cards, both modes are supported in newer ones) which dispatches the same operation to each and every one of them, potentially masking the result out. It may be a bit fairer to say that the CPU side has 2 sockets * 8 cores * 8 floats per cycle = 128 "cores", or that the GPU mentioned in the video has 3072 cuda cores / 32 per scheduler group = 96 "cores", which is still a pretty significant difference, but not by as much as before.

There are also differences in how the caches are organized, which is especially important when optimizing CPU algorithms since you have more control over how the code is executed: for example, you can bind each thread to a specific CPU core so that it never changes and take advantage of the data locality this provides, which is something that (as far as I know) can't be done on a GPU. The GPU also has some architectural advantages like local memory, which is like a fast cache shared between every core in each group of 32, but it requires manual usage (aka writing and reading to it with explicit code rather than relying on hardware-implemented caching algorithms like the CPU caches do). GPUs also require you to copy data between the CPU's and the GPU's RAM since they are separate (unless it's an integrated graphics card, mobile, etc...) and (usually) built differently. It depends a lot on the algorithm itself.
On to memories: memory speed depends on a lot of factors too, such as the channel width (or girth ;) and speed, but also on the number of channels and where the data is located physically. For example, say your data is on a single RAM stick (therefore probably a single channel) but your machine supports two independent channels; then you'll be using only half your bandwidth. Of course, we can't choose which RAM stick our data is stored on (as far as I'm aware) and the hardware probably already handles that for us.

Something the video fails to mention is latencies, which exist both in the CPU instructions and the memory accesses. When the CPU requests some data from the memory, the memory takes some time to reply. I'm not talking about tens of cycles as the video shows, but rather hundreds to thousands of cycles. This is because what we call RAM is actually DRAM (dynamic random access memory). The dynamic part means that it uses capacitors instead of bistable circuits, which are much more storage dense. The downside is that those capacitors need to be refreshed periodically, since otherwise they'll lose the data (hence the volatility of the RAM). This refresh cycle is done every few milliseconds (or whatever the manufacturer decides), but that's not where all the latency comes from. For efficiency purposes, the RAM isn't byte-accessible but rather row-accessible. That is, for every data read or written there's a more than likely chance that the active row has to be changed, further delaying the operations. But it's even worse: if my memory isn't failing me, those row reads are destructive and need to be written back to keep the data where it was.

Although latencies aren't mentioned, they aren't all that relevant for some algorithms because of the caches. The video mentions that algorithms which use the data exactly once can't possibly benefit from them, but in reality they do benefit. When the CPU requests data from the RAM, it'll be returned as a block of (most likely) 64 bytes. That is the size of a single cache line on x86 hardware. That transfer may take a lot more cycles than the ones mentioned in the video, but remember that once the block has been brought into the L1 cache we'll basically have the entire 64 bytes accessible within a single cycle. So, even though it seems the algorithm isn't profiting from the cache, it actually is, and quite a lot. The point the video makes is that the algorithm can't profit from a cache more than it already is; or in other words: no matter how much we torture our remaining brain cells on that kind of algorithm, it won't make the cache hit rate better.

Cache management becomes especially important when dealing with SIMD, because you're no longer accessing 4 or 8 bytes per cycle (assuming a dispatch width of 1 instruction per cycle), but rather 16 bytes (sse/neon...), 32 bytes (avx/avx2) or even 64 bytes on newer hardware (avx512). In other words, you may potentially end up requesting multiple whole cache lines per core from your (comparatively) slow RAM. It is also important to take into account in multithreaded algorithms: the video mentions that multiple cores accessing the same cache line is potentially catastrophic, but that only applies when writing data, not when reading it. The problem with writing data is that (under normal circumstances) the data needs to be pulled into the L1 cache, then modified, then marked as dirty (aka telling the neighbouring cores that the shared cache's copy of this block is not up to date).
If multiple threads are writing data within the same cache line simultaneously, then for every write the block has to go from thread A's L1 to thread A's L2 to the shared L3 (hopefully; otherwise into RAM and back through thread B's L3), then into thread B's L2 to finally reach thread B's L1, just to be modified, marked as dirty, and have the process repeated in reverse. There are some special instructions to avoid this problem (when avoiding it the obvious way is not possible) which are called non-temporal reads and writes. Those read and write directly to RAM (or the shared cache, I'm not sure which one it was, since it should work with both depending on the hardware configuration). The non-temporal instructions have abysmal latencies, so the compiler sometimes generates them for write-only data (aka data that you won't need later on; e.g. for i in ... do a[i] = b[i] + c[i], where a is write-only and doesn't really need to be cached within L1 or L2; but a[i] += b[i] does need to be cached and will perform poorly regardless if, for example, thread A computes the even indices and thread B the odd ones; the correct way to fix that would be to make thread A compute the first half and thread B the second half).

As mentioned in the video, it is true that it is sometimes faster to copy the data into a different memory layout (including but not limited to transposing) and then operate over the new data than to use the original one; I can vouch for that. Using the divide-and-conquer method of splitting the computation into blocks is also ridiculously effective with caches. Caches also have their own quirks that need to be addressed, for example cache associativity. In some cases where data is (unfortunately) aligned, it may completely destroy the cache usage rate, leaving your program using only 3.125% of your L1 (I can prove it since it happened to me). As a side note, the video mentions that loading aligned data is potentially faster than non-aligned loads, but that is no longer the case in new architectures (at least that I know of).
Alright, let's transition into compute stuff. It is true that compute speed (not just CPU clock speed) has grown at a much faster rate than memory bus speeds, but that's what caches are supposed to make up for. There are quite a lot of algorithms that could improve a lot in cache misses but don't, because of how they are coded. The matrix multiplication shown there is just one of them. When the video is showcasing the matrix multiplication, it briefly mentions the divide-and-conquer algorithm (albeit with a different name), but let's leave that for later.

The video says that matrix multiplication has a computational complexity of O(n³) and a spatial complexity of O(n²). This means that the number of steps the algorithm needs grows as n³ as n keeps growing, while the space used grows as n². However, the truth is that n² has nothing to do with memory transfers, which is what we're interested in. Each of the operations related to O(n³) will involve 3 reads, then a multiplication of two of those followed by an addition with the third, and finally storing the result back into memory. This means that the amount of data that needs to be transferred will also be O(n³). Spatial complexity just means how fast the space you need grows as the problem size grows, not how many data transfers you need. We can all imagine how badly the CPU would perform without caches if every cycle you're not only requesting multiple cache lines per core, but your complexity on the memory accesses is O(n³) instead of O(n²).

This is where caches shine: you can copy the data you're working with into the cache, work with only that data locally since it will be available practically instantly, and after you're done with it you forget about it and move on to the next batch. This is probably what the next video in the series will address. I'm not sure how in-depth that video will be, but I'm looking forward to it, since there's quite a lot of cool stuff you can do with it to make it ludicrously efficient (yes, not joking, I'm talking about improvements of orders of magnitude in cycles, cache misses and number of instructions at the same time when done correctly).
Finally, some notes on the GEMM algorithm itself. I like to think of matrix multiplication as taking two bidimensional arrays, placing them perpendicularly in a 3D space, generating the product of each intersection and collapsing it by adding everything together across the third axis. Or something like that; I haven't checked if that works on paper, but I'm pretty sure it does. It's probably a lot easier to visualize than it is to understand what I just wrote. If you manage to get an actual mental representation of what is going on there, you can instantly understand how tiling works and how to optimize the tiling itself. For example, when using caches, the order in which the tiles are computed does matter, and depending on the tiling configuration it can reduce cache misses by orders of magnitude on all cache levels. Having that representation available also makes it really easy to think up a really nice and compact piece of vectorized assembly to multiply those tiles. I managed to get between 10 and 11 operations (of those in O(n³)) per cycle per core on my machine, which is pretty impressive for a CPU. Granted, that was only compute with zero cache misses, but still.
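For anyone curious what the tiling described in this comment chain looks like in code, here is a minimal cache-blocking sketch. It is not the commenter's project code; it assumes row-major N x N float matrices with C zero-initialized, and the BLOCK size is an arbitrary placeholder (a real GEMM adds packing, vectorization and threading on top):

```cpp
// Minimal cache-blocking (tiling) sketch for C = A * B with row-major N x N
// float matrices -- illustrative only, nowhere near a tuned GEMM.
// BLOCK is a made-up tile size; in practice you pick it so the working tiles
// fit comfortably in L1/L2.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK = 64;

void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t N) {
    for (std::size_t ii = 0; ii < N; ii += BLOCK)
      for (std::size_t kk = 0; kk < N; kk += BLOCK)
        for (std::size_t jj = 0; jj < N; jj += BLOCK)
          // Work on one BLOCK x BLOCK tile at a time so the data being
          // touched stays resident in cache while it is reused.
          for (std::size_t i = ii; i < std::min(ii + BLOCK, N); ++i)
            for (std::size_t k = kk; k < std::min(kk + BLOCK, N); ++k) {
              const float a = A[i * N + k];          // reused across the j loop
              for (std::size_t j = jj; j < std::min(jj + BLOCK, N); ++j)
                C[i * N + j] += a * B[k * N + j];    // contiguous in both B and C
            }
}
```

The only point of the sketch is the one the comment makes: each tile gets reused many times while it is still sitting in cache, instead of being evicted and refetched.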
With modern advancements in software design principles, even greater achievements begin to come into view. A new milestone, a CPU slower than you can think. It's almost poetic. It's all come full circle. The circle of life.
That's what I've been saying for so long. Even if you use an "easy" language like go and implement a semi-efficient solution to a problem, it'll be incredibly fast because we have supercomputers in our pockets. However, software is getting so bad on average that it cancels out our hardware progress. No, you probably shouldn't use a web framework on top of electron to build a simple gui app. Start simple, and get more complex if you need it. Use tools that are actually performant.
There is one more important thing with heavy AVX use: the CPU lowers its real frequency when multiple cores run many AVX instructions, so it doesn't burn out from high temperature (hello Intel).
The video is self-explanatory when it stresses "cache line" regarding the optimization. A cache miss pulls in a whole cache line, not a single element. So, per miss, you want to pull in as much data as possible that will be relevant to the following computations. Transposing the matrix achieves that, and as the matrix grows bigger, the cost to transpose is smaller than the cost of all those cache misses. But... and here comes the Clean Code killer... if your matrices are small enough that they can fit wholly within the cache, then all the data will be in the cache eventually, and the cost of "random access" to the cache might be smaller than the cost of transposing the matrix. So yeah, optimal performance will require a bit of RY (repeating yourself) to handle all the mixes optimally, with a "generic fallback".
Sometimes I fully agree with you even if I get offended, and sometimes I politely beg to differ by hitting the dislike button (in 5% of cases). But I like the fact that you're completely honest with your viewpoints without trying to play a hide-and-seek game on social media, unlike some... Okay! Javascript is not easy. Compiled languages have their own quirks and interpreted languages have theirs. In the end, nothing is easy. Certain programming languages indeed produce output that can be faster than similar programs written in another language. Nonetheless, in the hands of overqualified people things will get magically overwhelming; the programs that are supposed to be faster will feel overwhelmingly immobile when operated by the end users. Thus, charisma matters, and so does FLOPS.
IBM's NorthPole is 8x faster than NVidia's A100 and 6x faster than their top-flight H100. It has been in development for 20 years, is more efficient as well, and is on the 12nm process node, so it has plenty of headroom to be shrunk. "It has 256 cores and can perform 2,048 operations per core per cycle". What IBM and AMD are doing is embedding compute into memory on a single chip.
What if the blow struck against humanity by the misguided and overhyped attempts at AI isn't so much 'robot overlords destroy us in nuclear fire' as it is 'Copilot ensures novice programmers spend 532x more time making the same mistakes, and even expert programmers spend as much as 52x as much time recoding blunders'...?
Really complicated way of saying the path between the CPU and GPU is a potential bottleneck. You don't use the GPU where you are doing a small amount of work on a large dataset and then throwing it away. If the problem is inherently like that, it's just not a good fit. But many, many problems are not. I think a lot of web developers are likely just forcing what they know instead of taking the time to completely restructure for the hardware. You really do need to rethink your flows and data structures from the ground up. GPU hardware is really that different.
dude you are hijacking my YT viewing. every single video I watched lately you also react to, meaning I might not actually look for any new videos and watch your reactions right away instead D:
The original video is a test on how neurotypical vs atypical you are in society. I personally loved it and the reaction, it was very informative, dense but not patronising. 🎉
ugh. he uses discord in light mode. i have coded in c++, swift, c#, many languages, but i stayed away from javascript as soon as i touched it. I AM BIG BRAIN BABY.
I want to acknowledge at 18:44 how dope the animations are for the matrix multiply. Stepping through the algorithm side-by-side with the highlighted code is a subtle by awesome touch. The comment even gets updated to show how many times the memory was accessed.
I switched my Windows 11 machine to Ubuntu last evening for the first time in my life - and yes, I can confirm that my CPU is actually much faster than I thought, and I should've stopped gaslighting myself long ago into thinking my PC is laggy because of my CPU and not because of Windows.
At some point you'll find out, like I did, that Ubuntu is slower than other distros. Still using it for gaming. Fedora is faster and Arch is very fast, but I'm too dumb to use it.
@@yeahaddigirl If Arch was too hard for you, I recommend NixOS. You can use the graphical installer to install a graphical or barebones version. Most of the drivers get set up for you and any change you make can be rolled back easily. Vimjoyer here on youtube has some nice short videos explaining Nix.
@@yeahaddigirl I have a pretty beefy computer and used WSL for dev, but recently switched to Arch and I'm blown away by how fast and instant everything happens.
And yet Ubuntu is one of the most bloated distros in the unix realm.
But the lighter ones are NOT user-friendly. (Arch, Void, or god forbid Gentoo)
I recommend trying EndeavourOS for anyone interested in getting into Arch but too scared of its complexity. And stay away from Manjaro; it might be based on Arch, but it's surprisingly bad.
The video being reacted to also has plenty of deeper CPU/GPU performance details and clarifications in the comments.
For matrix multiplication you need one row from matrix A (contiguous in memory) and one column from matrix B (not contiguous). The CPU is very good at fetching contiguous blocks from memory (prefetching, cache, etc). So if you transpose matrix B beforehand the cache hit rate is much higher and therefore the computation is faster in total. The transpose is O(N^2), and the matrix multiplication is O(N^3), so it is worth it for large matrices.
or even more so if you are reusing the transposed matrix several times
except for large matrices you'd never multiply them in O(N^3) on cpu, but yeah. strassen can be optimized as well with the transposition, but I don't know to what extent.
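A bare-bones sketch of the transpose idea from this thread, assuming row-major square N x N float matrices (illustrative only, not code from the video):

```cpp
// Transpose B once (O(N^2)), then do the O(N^3) multiply over two row-wise
// streams. Assumes row-major storage; purely illustrative.
#include <cstddef>
#include <vector>

void matmul_transposed(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C, std::size_t N) {
    std::vector<float> Bt(N * N);
    for (std::size_t i = 0; i < N; ++i)          // O(N^2) transpose
        for (std::size_t j = 0; j < N; ++j)
            Bt[j * N + i] = B[i * N + j];

    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                // Both operands now stream through memory sequentially,
                // which is what the prefetcher and cache lines are built for.
                sum += A[i * N + k] * Bt[j * N + k];
            C[i * N + j] = sum;
        }
}
```

The extra O(N^2) pass and the extra buffer are the price; the payoff is that the O(N^3) inner loop walks both operands sequentially.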
OS courses from college were probably my favorite. the depth in which you start understanding linux and os design is amazing. everything higher level just suddenly clicks in your mind after all those hours of talking about processor caches, threading and interrupts
I have Discord sound notifications disabled, you can't fool me
I haven’t had discord sound notifications enabled for years. It still got me
I don't use discord. Checkmate!
imagine using that pos (beyond 2017) 🤡
imagine not having a discord community for your projects (in 2023)
@@blarghblargh matrix, irc, git discussion exists
16:20 Well, there is a reason why the people who are hired to work on these types of optimizations are well paid.
Operating system design and implementation was the funnest class after digital technology for me.
The book by Prof. Tanenbaum that we used was really good. It was a pity, though, that we didn't have the time in college to write a proper operating system on top of the Z80 microcontroller board we designed. Not least because we didn't have a Z80 C compiler, and the book doesn't really go into assembly a lot; it is all C. But I managed to make a cooperative multitasking scheduler on my Z80 board, which made programming the washer-dryer combination that was the final test really easy for me.
Also because I took the time to write a Z80 assembler and didn't do machine code like all the other students. I just couldn't deal with hand-assembling machine code instructions after the second lesson :D
Unfortunately our OS professor was a substitute and it wasn't that great. Luckily we also had an RTOS course, and we at least learned how to implement custom real-time scheduler policies using the FreeRTOS API on a Tiva LaunchPad. But I feel like I will have to do some self-teaching on OS if I really go the embedded systems route.
@@Alkis05 VxWorks and QNX are also Unix-like, fully POSIX-compliant, real-time operating systems, awesome for embedded. RTOSes are incredible; unfortunately that course didn't go into developing those. We also had to do a very rudimentary scheduler, just cooperative multitasking. But I guess in assembly that's all you can do in 6 weeks 🤣
Did you ever dive into OS internals later on?
I had the same regret as you not taking OS in my undergrad, I ended up taking it in my master's degree 7 years later.
I really found it valuable, it was one of my favorite courses by far.
That being said, you seem to have a lot of experience at lower levels (far more than me). And these courses can only go into so much detail, they teach you how threads, processes, filesystems, scheduling, and system calls work, but if you look at the data structures they use, I think you'd find that you have probably mastered all of them already.
Now that I think about it, if you went and took an OS course, now that you have all this experience, I think you would get WAY more out of it than if you had taken it earlier on.
You would be able to dive a lot deeper than the average student, and pull off some really cool stuff.
Optimising computing is a complex job because CPUs and computer architectures are complex as well. 😊
I did my undergrad in electronics, so I also have a gaping hole when it comes to operating systems. Would really appreciate it if you did a stream or series on it.
6:00 "uhm ackkshually" it's not "equalize", but just "mix". He didn't mix the sound levels correctly. That refers to simple volume level control on the different audio tracks. "Equalizing" is a different thing that is done to audio, where you tweak specific frequencies on the tracks (whereas mixing lowers or raises ALL frequencies on a given track).
The reason transposing the second matrix is faster is that we need to fetch the elements column-wise from the second matrix while multiplying. Since matrices are stored row-wise in memory, the cache contains the data row-wise as well. So we'd actually be jumping around in memory a lot to fetch elements column-wise, and prolly some of the last rows for large matrices won't even be cached.
The reason why transposing the second matrix is faster is because you can SIMD two 4 element contiguous arrays.
In other words, the elements of a column are no longer a whole row apart in the second matrix but contiguous.
That is because in C/C++ (and some other languages) a two dimensional Matrix like float[4][4] is internally float[4*4].
Also, for large matrices that don't fit in the cache, a single row will still fit. So you won't have to load the second matrix from main memory over and over again.
Edit: sorry, I was wrong, C/C++ seem to be even more amusingly obtuse in this regard... [][] is usually jagged, not multidim, but it seems C/C++ decided it made more sense to have [][] = [,] = [m*n] and then "obviously" use pointers to construct jagged arrays. Sigh...
@@ErazerPT "C/C++ seem to be even more amusingly obtuse in this regard... [][] is usually jagged not multidim"
It's not that C/C++ is being obtuse here, it's just reusing the same array indexing syntax for multi-indices. In C++ in particular that syntax is "just" an operator, which means whether it's jagged or contiguous is up to the implementation. A vector of vectors will be jagged, but a custom matrix type can be implemented contiguously (most library authors will overload the function call operator for this purpose though, so A(i, j) instead of A[i, j], at least until C++23, when the latter syntax becomes legal).
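A toy example of the pattern the reply describes, contiguous storage with 2D indexing through an overloaded call operator (the Matrix class here is purely illustrative, pre-C++23 style):

```cpp
// A matrix type that stores its data in one contiguous buffer but still offers
// 2D indexing via operator(). The class name and layout are just for illustration.
#include <cstddef>
#include <vector>

class Matrix {
    std::size_t rows_, cols_;
    std::vector<float> data_;          // one flat allocation, row-major
public:
    Matrix(std::size_t r, std::size_t c) : rows_(r), cols_(c), data_(r * c) {}
    float&       operator()(std::size_t i, std::size_t j)       { return data_[i * cols_ + j]; }
    const float& operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }
};

// Usage: Matrix A(4, 4); A(1, 2) = 3.0f;  -- same contiguous layout as float[4*4],
// unlike std::vector<std::vector<float>>, whose rows are separate allocations.
```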
This man is the king of keyboard shortcuts yet he uses his mouse to pause the video and to go back 😂
YouTube shortcuts don't have a leader key so it's not worth it. 🤣
True lol. I wish the browser was more keyboard-friendly. The best we can do is the Surfingkeys extension, or a text browser.
One of the best courses at my university was the Operating Systems course; truly amazing professor, I still remember it to this day.
yeah, it's all about getting the input values in contiguous memory, ideally paragraph-aligned so the instructions for aligned moves can be used. that's for assembly only though, i doubt rust can do that as it cannot know the alignment of your data. in fact, this contiguous memory layout is what the matrix transposition is good for. you can also get away with just duplicating the data (in both layouts) when loading the matrices. this really goes through the roof if you have too much data to fit into L3.
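For reference, this is roughly what explicitly aligned data looks like in C-family code; a small sketch assuming x86 with AVX (32-byte vectors). As another comment notes, the aligned/unaligned load penalty has mostly disappeared on recent cores, so this matters less than it used to:

```cpp
// Sketch of explicitly aligned storage for SIMD loads (assumes x86 with AVX,
// 32-byte vectors). alignas guarantees the buffer's alignment, which lets you
// use the aligned-load intrinsic instead of the unaligned one.
#include <immintrin.h>

alignas(32) float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

__m256 load_aligned_example() {
    return _mm256_load_ps(buf);     // requires 32-byte alignment
    // _mm256_loadu_ps(buf) would work regardless of alignment; on older
    // microarchitectures it could be slower, on recent ones it's mostly a wash.
}
```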
13:22 This brings back a memory during my apprenticeship (meaning, no Uni or higher mathematics, including calculus):
senior: Hey, implement this, please.
me: ok
also me a few minutes later: wtf are complex numbers?
senior: just take your time to learn it, the Internet should be good enough
So, ehm yeah, pretty much Wikipedia taught me what and how to work with complex numbers.
It's funny how I watched some of those videos before only to then later find out ThePrimeagen has a reaction video of it
The article missed something major. GPUs/TPUs have their own memory, so the matrices can be loaded directly onto the GPU with CUDA, and then there's not the same bus bottleneck. That's how ML is done in neural nets. Saying that GPUs are x times better comes from actual benchmarking, not theory.
20:14 Remember kids, always add a few more Nasty Loops to break the orgasm!
The (original) video was ...pretty bad. Sorry, it lacked structure and direction.
It made up with those hilarious subtitles, though. 😂
Original video is an absolute 100% certified A grade banger
One of the new Fireship videos has the greatest visualization of matrix multiplication; I wish they had shown us that in college.
The epic CPU background music sounds like some chill version of Killer Instinct. Hope your wife doesn't mind your teraflops.
The comment about the website was so on point too. Engineers and academics do 90's websites when they have the most important and thorough information to communicate. There's this one dude in Finland who used to do some university job and, just for fun, has any and every single language formatting and writing rule, style and advice (verbal and relating to typing it in computer documents) on his website, goes through programming languages, and covers general information about Linux and IT on a rather deep level. He has a pure math bachelor's, worked in an academic IT center for decades, and used to sit on a computer politics committee as well. And his website is just a sort of green-tinted turquoise background with a basic font, pages and pages of text. Well written and paragraphed, but absolutely nothing extra. And he still updates it; 2023 updates have been "mathematical notation 2nd edition" and "pronouncing Portuguese". The dude apparently just loves communication-related stuff and teaching it. Pages like that are golden and don't look like much.
In university, a friend of mine worked on a project where they tried to optimize CPU die design/placement algorithms for GPUs. The result, afaik, was not that good because of cache. While the CPUs had large local caches all cores could easily access, the individual cores on the GPU had a hard time fetching the data, which for the given problem was much more dynamic than what e.g. games require. While that was some years ago, I can imagine this still being a problem for algorithms that need to access large chunks of data from less predictable locations.
Who else wishes that GPUs would finally start optimizing for integer and data operations as well? The 6000 CUDA cores that are each as complex as a Pentium processor could help with other data processing too. But it took until the NVidia 1080 release to get even the basics for this on the GPU.
The transpose optimization is quite simple, really. Since one of the matrices in the multiply is accessed column-wise, it increases the chances of the next item being accessed not being in cache. If the matrix is large enough to not fit completely into cache, you may have a situation where element [0, 0] is in cache but element [1, 0] is not. And in GEMM one of the matrices is accessed in that pattern. If you transpose that matrix, your access pattern changes to row-major and then your odds of a cache miss go way down. Now you'll be accessing [0, 1] after [0, 0] instead of [1, 0].
What I would still like to see a demonstration / proof of is how, in practical terms, the time taken for that transpose gets more than offset by the speedup from the cache hits.
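One rough way to answer that question yourself is a crude side-by-side timing like the sketch below (assumptions: row-major 1024x1024 floats, wall-clock timing via <chrono>, no warm-up or statistics, and the transpose time is counted inside the second measurement):

```cpp
// Crude benchmark sketch: time the naive column-walking multiply against
// transpose-then-multiply and compare. Illustrative only -- no warm-up runs,
// no averaging; printing a result keeps the compiler from discarding the work.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

static double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const std::size_t N = 1024;
    std::vector<float> A(N * N, 1.0f), B(N * N, 2.0f), C(N * N, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float s = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                s += A[i * N + k] * B[k * N + j];   // strided walk through B
            C[i * N + j] = s;
        }
    std::printf("naive:      %.3f s\n", seconds_since(t0));

    t0 = std::chrono::steady_clock::now();
    std::vector<float> Bt(N * N);
    for (std::size_t i = 0; i < N; ++i)             // transpose included in the timing
        for (std::size_t j = 0; j < N; ++j)
            Bt[j * N + i] = B[i * N + j];
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float s = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                s += A[i * N + k] * Bt[j * N + k];  // sequential walk through Bt
            C[i * N + j] = s;
        }
    std::printf("transposed: %.3f s\n", seconds_since(t0));
    std::printf("check: %f\n", static_cast<double>(C[0]));
    return 0;
}
```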
There's an awesome computerphile video comparing linked lists with arrays.
imagine you did a search in a web based product catalog, and every page only contained one result.
that's a linked list (as often implemented, with a single value per node).
I experienced about a 10x speedup with real code doing reinforcement learning. It's the difference between losing an hour versus losing the whole day if you started the job wrong. Of course, you must be doing SIMD code via the GPU to get it. I watched a bunch of interns spend all day running RL jobs. I got my personal faptop accelerated and just gave it to them for the summer. My CEO saw how huge the difference was and got 3 monster boxes with a pair of GPUs in each one. It very much matters for machine learning work.
But this is the performance of vectorized code, which has nothing to do with CPU-only code.
SIMD (256-bit AVX and AVX-512) is being embedded into the JVM, so matrix computations can be performed with a simple API. No external C/C++ libraries needed. Java, Kotlin, Groovy and Scala will all benefit. Just imagine doing Spark DataSet processing with SIMD instructions. Nice.
It would have been nice if the video ended with something about how the comparison between CPU and GPU ends up, answering what the actual difference was from the initial question. But other than that it wasn’t bad. Sure a lot of it was way over my head, but so was the SIMD rust video prime mentioned. It’s a gateway drug tho.
I love it when the video goes from “who’s this goofy goober” to “actually this is good shit”
Regular linked lists are slower: they don't have a hash table, so they need to do a traversal to find stuff.
Any random index in an array can be retrieved in O(1), insertion/deletion are O(1)+O(n), looping is O(n).
Whereas any random index in a linked list is O(n), looping is O(n), *and insertion/deletion* is O(1).... + O(n) to find the index.
Linked lists need custom made optimization to be fast at searching.
I've come up with one type of linked structure designed to speed up insertion/deletion compared to arrays, but which will also have a much better random index access time of O(√(n)/2/ap), where ap means access points, and those can be added manually in O(√(n)/2)*8 time. Naturally, all the big-O notations here show the worst-case scenario for any n elements in a list/array.
I'm writing it in javascript.
The only problem is that an object reference takes 8 bytes and my nodes will have a lot of object references; thankfully not as many as skip lists, which have technically faster random index access time (traversal) but randomly clone nodes several times over and over. The best language for this would be something low-level that can heavily optimise memory per object and memory per pointer.
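For readers following along, a tiny illustration of the indexing asymmetry described above, using the standard std::vector and std::list rather than the commenter's custom structure:

```cpp
// Indexing a vector is one address computation; "indexing" a list is a pointer
// chase. Insertion in the middle is the mirror image. Illustrative only.
#include <iterator>
#include <list>
#include <vector>

int main() {
    std::vector<int> vec(1'000'000, 1);
    std::list<int>   lst(1'000'000, 1);

    int a = vec[750'000];                              // O(1): base + offset
    int b = *std::next(lst.begin(), 750'000);          // O(n): walks 750k nodes

    lst.insert(std::next(lst.begin(), 500'000), 42);   // O(n) to find + O(1) to splice
    vec.insert(vec.begin() + 500'000, 42);             // O(1) to find + O(n) to shift
    return a + b;
}
```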
the worse the blog/forum post looks the more useful the contents are. i was researching something about some heat resistant tape for a project i was doing and the trash-looking forum i ended up on literally had people working on nuclear reactors talking on it
it's hard. one of the things i loved about coding ARM assembler was the simplicity of the instruction set. Optimising code for cycles was fairly easy. The bummer about x86 is the number of extensions (x87, IA-32, x86-64, x86-S, MMX, 3DNow!, SSE, MCA, ACPI, SSE2, NX bit, SMT, SSE3, SSSE3, SSE4, SSE4.2, AES-NI, CLMUL, RDRAND, SHA, MPX, SME, SGX, XOP, F16C, ADX, BMI, FMA, AVX, AVX2, AVX-VNNI, AVX512, AVX10, VT-x, VT-d, AMD-V, AMD-Vi, TSX, ASF, TXT, APX). There are over 2000 CPU instructions for x86-64.
That's why I love the RCA 1802 from the early 70s. Sure, it's the slowest processor you'll ever use, but its instruction set is like Zen perfection and you can learn to program the thing in an afternoon. Intel's 4004 is surprisingly beautiful too.
This guy basically said "Python is really fucking slow" in a 12 minute video.
Transposing is just simple swapping - but it caches better in L2 or something, I think. That is why you get cache gains. Anyway, can anyone explain?
From what I understood, it's more that when you transpose, your matrix multiplication gets data next to each other for both matrices instead of one having data next to each other and the other jumping along by N
So, instead of adding up a row and a column, you are adding up 2 rows, where the data for a row is all close in cache meaning you get a lot more cache hits. Caches hate data that is spatially distant
cache access always happens in line-sized blocks. even if you are reading a single float value, the cpu will load the entire block from memory into a cache line.
given that, let's say you are multiplying big enough matrices that a single matrix doesn't fit entirely in the cache. for a single value in the result matrix, you have to do A[i,j] * B[j,k], where j ranges from 0 to the number of rows in B. Without transposing B, the CPU will load the whole line containing B[j,k] on every inner-loop iteration, which will fill up the cache and eventually evict previously loaded lines. so the next time you try to compute the value for C[i,k+1], it will encounter more cache misses. Transposing B before the actual multiplication just reduces the chance of cache misses by making the B matrix more cache friendly (essentially making matrix B's indexing use a B[k,j] scheme so that cache misses are reduced for adjacent j values).
This is regardless of L1, L2 or L3 cache. the more evictions you have at your lower-level cache, the slower the operation.
Another way to think of it is to just flip the 2 inner loops. When you are doing {i, j, k} loop order and your inner loop looks like C_{i, j} += A_{i, k} * B_{k, j}, you are constantly changing A_{i, k} and you are always fetching another array when accessing B_{k, j}. But if you flip the loops to be {i, k, j}, B_{k, j} will just be consecutive values in an array, and you also get the added benefit that A_{i, k} is effectively a constant for the entirety of the J loop. This means that you are effectively only fetching values from B in the innermost loop. Another added benefit is that you can check if A_{i, k} is 0. If it is, you can just skip the entirety of the J loop, because you would just be accumulating 0 * something, which is 0.
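Sketched out, the {i, k, j} reordering looks like this (assuming row-major N x N float arrays with C zero-initialized; illustrative only):

```cpp
// Loop-order trick from the comment above: no transpose needed. With {i, k, j}
// ordering, the inner loop streams through B and C row-wise.
#include <cstddef>

void matmul_ikj(const float* A, const float* B, float* C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const float a = A[i * N + k];   // constant for the whole j loop
            if (a == 0.0f) continue;        // the "skip the J loop" bonus
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];   // consecutive addresses
        }
}
```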
This has to do with memory locality, which increases the number of cache hits by taking advantage of the processor's prefetcher. I had to do this in college on matrices up to 32k x 32k, but I can't recall the math behind A x B^(-1) and why it works.
Transposing is just the simplest thing you can do to improve the cache access pattern. You can do far more complex scrambling and descrambling to optimise caches, or GPU local memory. E.g., for a lot of image processing algorithms, representing the image planes as Z-curves (or other space-filling curves) results in much better locality.
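As a sketch of the Z-curve idea, here is the usual bit-interleaving trick in C for turning a 2D coordinate into a Morton index (16-bit coordinates assumed); storing pixels in this order keeps spatial neighbours close in memory.

```c
#include <stdint.h>

/* Spread the bits of a 16-bit value so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Morton (Z-curve) index: interleave the bits of x and y, so 2D neighbours
 * tend to land in nearby memory locations. */
static uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```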
Check out the book "Developing your own 32 bit operating system". It is a bit old, but it's quick to go through and you learn a lot. You can probably read through it in about a weekend.
I love old-school, crappy-looking GeoCities-esque websites.
You learn this stuff in OS courses? BRB, going to enroll in an OS course RN. I'm going to SIMDeez nuts all over that teraChad CPU I just _know_ is lurking in the uni basement, waiting to entrap a poor, naive programmer such as myself.
Also, whoever decided that the constants in O(n) notation get dropped for algorithms that real humans use on real hardware ought to be Old-Yellered. The fact of the matter is that those constants are _not_ zero, never will be, and contribute to an awful lot of wasted CPU cycles.
Love the way the dude is all like (Kim Jong Un in Team America) "look guys, it so fukking simple, u just need 190 IQ like me, dont u know how fucking busy I am!?"
Wearing my flip-FLOPS for this for max immersion 😲
I hope your feet are large enough for an upgrade to GIGA-flip-FLOPS!
You could've capitalised FLIP as well, since flip-flops are also an electronic component (studied them last semester, already forgot about it lol)
@VivekYadav-ds8oz yes, I know :D
9:06 Arrays are a single list in contiguous memory. Specifically great for SIMD operations. You can't use SIMD with linked lists.
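A tiny C illustration of the difference: summing a contiguous array is a unit-stride loop the compiler can prefetch and auto-vectorise, while summing a linked list is serial pointer chasing.

```c
#include <stddef.h>

/* Contiguous data: unit-stride loop, prefetch-friendly, auto-vectorisable. */
float sum_array(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: every node may sit in a different cache line, and the next
 * address isn't known until the current node has been loaded. */
struct node { float value; struct node *next; };

float sum_list(const struct node *head) {
    float s = 0.0f;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```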
You totally reminded me....
It was back in 2013 that now-Blizzard President J. Allen Brack issued his infamously condescending line of "You think you do, but you don't" in response to fan requests for a World of Warcraft vanilla server. (Feb 24, 2020)
What you were saying about pretty websites being useless is right. I think you could codify it as "The usefulness of a website is inversely proportional to its JavaScript-to-HTML ratio."
It's the same as how the bigger someone's television set is, the poorer they are.
@@neodonkey The more suitable the TV size is to the room you want to put it in, the harder it is to find a good one. Most TVs are huge and most rooms are too small for it to look right. People who are poor on money are also poor on time and will give up on finding a good fit sooner.
My regret too is not having taken my Operating System course. I wanted to graduate early so I didn't do it 😭
Most of the time you just have to use a compiled language with no extra overhead (no GC, no unnecessary heap allocations because of objects, etc.), do nothing obviously bad, and your program ends up being very fast without you even having to optimize it.
Indeed. Just days ago I switched from Python + numpy to a couple lines of Rust and the same task immediately ran 650x faster. Seeing a modern CPU do its thing without brakes on is a thing to behold.
SIMD can give a 4x speedup (or more). It's often worth exploring for anything where 4x the throughput would actually make a difference to the users, which can be more often than you'd expect: whenever they have to wait for the results or it blocks the UI, i.e. whenever it's not just some nightly batch job.
depending on your application the latency jitter of a GC can make certain languages a no-go too
@@blarghblargh I was trying to make a program go faster the other day with SIMD. My program ended up telling me that 2-0=0. I need to do more debugging on the SIMD lol
@@blarghblargh In my program I tried a lot of tricks and none were as fast as just the first unoptimized version. Even SIMD with 32 values per calculation was only half as fast. I guess the cache works in mysterious ways.
short and wide bus gang
I'd be interested to see how thoroughly and reliably Rust can auto-vectorize code.
C++ compilers often don't do such a great job, and Intel ISPC demos the benefits of a language specifically designed around vector code.
It's not so much that the compiler can and does vectorize code, but that the vectorization is stable, and doesn't degrade to scalar code silently when you modify the code slightly, or compile it on a different compiler version or build mode, etc. It's often the case that if you're bothering to ensure code is vectorized once, that you want it to stay that way.
I've experienced this as well with C++. I have sometimes had to switch compilers to get it to vectorize certain code snippets.
@@327efrain dunno if it's still the case, but at least up until a few years ago, game engine devs would use compiler intrinsics instead of trying to coax things to auto-vectorize. Then they could be guaranteed it would emit the code they were looking for.
@@327efrain you could've used compiler intrinsics to manually vectorise
In my experience, GCC and Clang on Unix-like systems tend to be better than Rust at auto-vectorising code.
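For reference, a minimal sketch of the intrinsics approach mentioned in this thread: explicit AVX in C (compile with -mavx; assumes the length is a multiple of 8), which emits vector instructions regardless of what the auto-vectoriser decides. The function name is made up.

```c
#include <immintrin.h>
#include <stddef.h>

/* c[i] = a[i] + b[i] using explicit 256-bit AVX intrinsics.
 * Assumes n is a multiple of 8; unaligned loads/stores for simplicity. */
void add_f32_avx(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
}
```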
How hard would it be to improve NumPy or other programs doing LLM inference to take advantage of this? I'm not under the impression that it is being used already.
Arrays got that Gigaflop energy. Just a tight, consolidated group of Chads. Linked lists got that Karenflop energy. Just holding everyone up.
I'd like to add some details that I've noticed are either misleading or missing, since I did optimize the GEMM algorithm as a university project and I'm familiar with it. That project probably ended up being the one I enjoyed the most in my entire degree. They'll need to be replies to this comment because of the length, though.
It's great to see that there are videos explaining this stuff to the general public (well maybe not that general in this case but you know what I mean) and the reactions derived from it. Keep up the good work. Also, if you're short on content sometime I'd be willing to go more in depth with this algorithm.
Let's start with the hardware comparisons. While it's true that a Xeon E5-2630 v3 peaks at 8 FLOP/cycle using AVX2, that's not necessarily true of other processors. Some processors may execute several operations per cycle, depending on the hardware and the data dependencies between the instructions. For example, the Zen 4 architecture is theoretically capable of running 2x AVX-512 float additions in parallel with 2x AVX-512 float multiplications, or in other words multiplying 32 floats while adding another 32 floats at the same time, reaching a mind-numbing 144 GFLOP/s per core for a Ryzen 9 7950X (2304 GFLOP/s across all its cores, which is actually not that far from the GPU in the video). Also, especially on desktop processors, the performance of single-threaded algorithms may be way higher than what the base clock would suggest, because of their insane boost speeds on short-running single-core loads. For example, that processor has a base clock of 4.5 GHz but a max boost clock of up to 5.7 GHz without overclocking, depending on conditions, which is a whopping 27% extra (or a burst at 182.4 GFLOP/s). Taking into consideration that said processor supports AVX-512, depending on the algorithm it may not even be worth parallelizing, just because of the loss of the single-core boost combined with the parallelization overhead (which is something the video doesn't mention).
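To make the arithmetic above easy to check, here is a throwaway C snippet with the figures from this comment plugged in; the 32 FLOP/cycle value is simply the rate implied by the quoted 144 GFLOP/s per core at 4.5 GHz, not a measurement.

```c
#include <stdio.h>

int main(void) {
    /* Figures taken from the comment above; treat them as rough examples. */
    double flop_per_cycle = 32.0;           /* FP32 FLOP per cycle per core */
    double base_ghz = 4.5, boost_ghz = 5.7; /* Ryzen 9 7950X clocks quoted  */
    int    cores = 16;

    printf("per core @ base : %6.1f GFLOP/s\n", flop_per_cycle * base_ghz);
    printf("per core @ boost: %6.1f GFLOP/s\n", flop_per_cycle * boost_ghz);
    printf("all cores @ base: %6.1f GFLOP/s\n", flop_per_cycle * base_ghz * cores);
    return 0;
}
```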
As for the GPUs, comparing CPUs vs GPUs is a bit like comparing apples to oranges. That is to say, it's plainly unfair. CPUs are more optimized for branch-heavy algorithms, while GPUs may struggle with those kinds of algorithms. I tend to think of GPUs as an independent multi-core SIMD coprocessor on steroids. That is because of the way a GPU executes instructions: there's a "scheduler" for each group of 32 cores (64 in older Radeon cards; both modes are supported in newer ones) which dispatches the same operation to each and every one of them, potentially masking the result out. It may be a bit fairer to say that the CPU side has 2 sockets * 8 cores * 8 floats per cycle = 128 "cores", or that the GPU mentioned in the video has 3072 CUDA cores / 32 per scheduler group = 96 "cores", which is still a pretty significant difference, but not by as much as before. There are also differences in how the caches are organized, which is especially important when optimizing CPU algorithms, since you have more control over how the code is executed: for example, you can pin each thread to a specific CPU core so that it never migrates and take advantage of the data locality this provides, which is something that (as far as I know) can't be done on a GPU. The GPU also has some architectural advantages like local memory, which is like a fast cache shared between every core in each group of 32, but it requires manual usage (i.e. writing and reading it with explicit code rather than relying on hardware-implemented caching like the CPU caches do). GPUs also require you to copy data between the CPU's and the GPU's RAM, since they are separate (unless it's an integrated graphics card, mobile, etc.) and (usually) built differently. It depends a lot on the algorithm itself.
On to memory: memory speed depends on a lot of factors too, such as the channel width (or girth ;) and speed, but also on the number of channels and where the data is physically located. For example, say your data sits on a single RAM stick (therefore probably a single channel) but your machine supports two independent channels; then you'll be using only half your bandwidth. Of course, we can't choose which RAM stick our data is stored on (as far as I'm aware), and the hardware probably already handles that for us.
Something the video fails to mention is latency, which exists both in CPU instructions and in memory accesses. When the CPU requests some data from memory, the memory takes some time to reply. I'm not talking about tens of cycles as the video shows, but rather hundreds to thousands of cycles. This is because what we call RAM is actually DRAM (dynamic random access memory). The dynamic part means that it uses capacitors instead of bistable circuits, which are much more storage dense. The downside is that those capacitors need to be refreshed periodically, since otherwise they'll lose the data (hence the volatility of RAM). This refresh cycle happens every few milliseconds (or whatever the manufacturer decides), but that's not where all the latency comes from. For efficiency purposes, RAM isn't byte-accessible, but rather row-accessible. That is, for every read or write there's a more-than-likely chance that the active row has to be changed, further delaying the operation. And it's even worse: if my memory isn't failing me, those row reads are destructive and need to be written back to keep the data where it was.
Although latencies aren't mentioned, they aren't all that relevant for some algorithms because of the caches. The video mentions that algorithms which use the data exactly once can't possibly benefit from them, but in reality they do benefit. When the CPU requests data from RAM, it'll be returned as a block of (most likely) 64 bytes. That is the size of a single cache line on x86 hardware. That transfer may take a lot more cycles than the ones mentioned in the video, but remember that once the block has been brought into the L1 cache, we'll basically have the entire 64 bytes accessible within a single cycle. So even though it seems the algorithm isn't profiting from the cache, it actually is, and quite a lot. The point the video makes is that the algorithm can't profit from the cache more than it already does; in other words, no matter how much we torture our remaining brain cells on that kind of algorithm, it won't make the cache hit rate any better.
Cache management becomes especially important when dealing with SIMD, because you're no longer accessing 4 or 8 bytes per cycle (assuming a dispatch width of 1 instruction per cycle), but rather 16 bytes (SSE/AVX/NEON...), 32 bytes (AVX2) or even 64 bytes on newer hardware (AVX-512). In other words, you may potentially end up requesting multiple whole cache lines per core from your (comparatively) slow RAM. It is also important to take this into account in multithreaded algorithms: the video mentions that multiple cores accessing the same cache line is potentially catastrophic, but that only applies when writing data, not when reading it. The problem with writing data is that (under normal circumstances) the data needs to be pulled into the L1 cache, then modified, then marked as dirty (i.e. the neighbouring cores are told that the shared cache's copy of this block is out of date). If multiple threads are writing data within the same cache line simultaneously, then for every write they'll need to send the block from thread A's L1 to thread A's L2 to the shared L3 (hopefully; otherwise into RAM and back through thread B's L3), then into thread B's L2 to finally reach thread B's L1, just to be modified, marked as dirty, and have the whole process repeated in reverse. There are some special instructions to avoid this problem (when avoiding it the obvious way is not possible) called non-temporal reads and writes. Those read and write directly to RAM (or the shared cache; I'm not sure which one it was, since it should work with both depending on the hardware configuration). The non-temporal instructions have abysmal latencies, so the compiler only generates them sometimes, for write-only data (i.e. data that you won't need later on; e.g. in "for i in ... do a[i] = b[i] + c[i]", a is write-only and doesn't really need to be cached in L1 or L2, but "a[i] += b[i]" does need to be cached, and will have poor performance regardless if, for example, thread A computes the even indices and thread B computes the odd ones; the correct fix is to make thread A compute the first half and thread B the second half).
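To illustrate the cache-line ping-pong described above, a minimal C sketch (assuming 64-byte cache lines; the struct names are made up): two per-thread counters that share a line versus a padded layout that gives each its own line.

```c
/* Two threads incrementing adjacent counters: both counters live in the same
 * 64-byte cache line, so every write forces that line to bounce between the
 * cores' private caches (false sharing). */
struct counters_bad {
    long a;   /* written by thread A */
    long b;   /* written by thread B, same cache line as a */
};

/* Forcing 64-byte alignment gives each counter its own cache line, removing
 * the false sharing at the cost of some wasted memory. Assumes 64-byte lines. */
struct counters_padded {
    _Alignas(64) long a;   /* starts a fresh cache line */
    _Alignas(64) long b;   /* forced onto its own line  */
};
```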
As mentioned in the video, it is true that it is sometimes faster to copy the data into a different memory layout (including but not limited to transposing) and then operate over the new data than to use the original; I can vouch for that. Using the divide-and-conquer approach of splitting the computation into blocks is also ridiculously effective with caches.
Caches also have their own quirks that need to be addressed, for example cache associativity. In some cases where data is (unfortunately) aligned, it can completely destroy the cache usage rate, leaving your program using only 3.125% of your L1 (I can prove it, since it happened to me). As a side note, the video mentions that loading aligned data is potentially faster than unaligned loads, but that is no longer the case on newer architectures (at least as far as I know).
Alright, let's transition into the compute stuff. It is true that compute speed (not just CPU clock speed) has grown at a much faster rate than memory bus speeds, but that's what caches are supposed to make up for. There are quite a lot of algorithms that could improve a lot in cache misses but don't, because of how they are coded. The matrix multiplication shown there is just one of them. When the video showcases the matrix multiplication it briefly mentions the divide-and-conquer algorithm (albeit with a different name), but let's leave that for later.
The video says that matrix multiplication has a computational complexity of O(n³) and a spatial complexity of O(n²). This means that the number of steps the algorithm needs grows as n³ as n grows, while the space used grows as n². However, the truth is that n² has nothing to do with memory transfers, which is what we're interested in. Each of the operations counted by O(n³) involves 3 reads, then a multiplication of two of those values followed by an addition with the third, and finally storing the result back into memory. This means that the amount of data that needs to be transferred is also O(n³). Spatial complexity just tells you how fast the space you need grows as the problem size grows, not how many data transfers you do.
We can all imagine how badly the CPU would perform without caches if every cycle you're not only requesting multiple cache lines per core but your memory-access complexity is also O(n³) instead of O(n²). This is where caches shine: you can copy the data you're working with into the cache, work with only that data locally since it will be available practically instantly, and once you're done with it you forget about it and move on to the next batch. This is probably what the next video in the series will address. I'm not sure how in-depth that video will be, but I'm looking forward to it, since there's quite a lot of cool stuff you can do to make it ludicrously efficient (not joking: I'm talking about improvements of orders of magnitude in cycles, cache misses and number of instructions at the same time when done correctly).
Finally, some notes on the GEMM algorithm itself. I like to think of matrix multiplication as taking two two-dimensional arrays, placing them perpendicularly in a 3D space, generating the product at each intersection and collapsing everything by adding along the third axis. Or something like that; I haven't checked that it works on paper, but I'm pretty sure it does. It's probably a lot easier to visualize than it is to understand what I just wrote. If you manage to build an actual mental representation of what is going on there, you can instantly understand how tiling works and how to optimize the tiling itself. For example, when using caches, the order in which the tiles are computed does matter, and depending on the tiling configuration it can reduce cache misses by orders of magnitude at all cache levels.
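A bare-bones sketch of the tiling idea in C, assuming square row-major matrices with n a multiple of the tile size and a zero-initialised C; a real GEMM would also pack the tiles and vectorise the inner kernel, so this is only the skeleton.

```c
#define TILE 64   /* tile edge; tune so a few TILE x TILE blocks fit in cache */

/* Blocked matmul: work on TILE x TILE sub-blocks so the data touched by the
 * three inner loops stays resident in cache. Assumes n % TILE == 0 and that
 * C is zero-initialised. */
void matmul_tiled(const float *A, const float *B, float *C, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```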
Also, having that representation available also makes it really easy to think up a really nice and compact piece of vectorized assembly to multiply those tiles. I managed to get between 10 and 11 operations (of those in O(n³)) per cycle per core on my machine, which is pretty impressive for a CPU. Granted that was only compute with zero cache misses, but still.
9:00 I love me a nice piece of tight memory.
That background music is from some video game, but I can't remember which. Nvm, it said Cities: Skylines at the end.
Very educational! Hard stuff but good.
You make a literal X when describing Twitter. I see where they got the logo 😂
With modern advancements in software design principles, even greater achievements begin to come into view. A new milestone, a CPU slower than you can think. It's almost poetic. It's all come full circle. The circle of life.
SIMDeez nuts
Prime, we're talking ML programmers...
We optimize the shit out of everything, obsessively scaling.
That's what I've been saying for so long. Even if you use an "easy" language like go and implement a semi-efficient solution to a problem, it'll be incredibly fast because we have supercomputers in our pockets. However, software is getting so bad on average that it cancels out our hardware progress. No, you probably shouldn't use a web framework on top of electron to build a simple gui app. Start simple, and get more complex if you need it. Use tools that are actually performant.
What you're referring to is known as Wirth's law :)
One more important thing with heavy AVX: the CPU lowers its real frequency when multiple cores run many AVX instructions, so it doesn't burn out from high temperature (hello Intel).
The video is self-explanatory when it stresses "cache line" regarding the optimization. A cache miss pulls in a whole cache line, not a single element. So, per miss, you want to pull in as much data as possible that will be relevant to the following computations. Transposing the matrix achieves that, and as the matrix grows bigger, the cost of transposing becomes smaller than the cost of all those cache misses. But... and here comes the Clean Code killer... if your matrices are small enough that they fit wholly within the cache, then all the data will be in the cache eventually, and the cost of "random access" to the cache might be smaller than the cost of transposing the matrix. So yeah, optimal performance will require a bit of RY (repeating yourself) to handle all the mixes optimally, with a "generic fallback".
Nice to see that Bill Burr grew hair and got into programming. Good for him.
18:33 do programs get enough info to calculate whether an algorithm will be compute- or memory-bound?
This boy gonna be teraflopping at any point 😂😂😂
0:20 yep, can confirm i checked my discord 💀
That music slaps tho
Sometimes I fully agree with you even if I get offended, and sometimes I politely beg to differ by hitting the dislike button (in 5% of cases). But I like the fact that you're completely honest with your viewpoints without trying to play a hide-and-seek game on social media, unlike some... Okay! JavaScript is not easy. Compiled languages have their own quirks and interpreted languages have theirs. In the end, nothing is easy. Certain programming languages indeed produce output that can be faster than similar programs written in another language. Nonetheless, in the hands of overqualified people things get magically overwhelming; the programs that are supposed to be faster will feel overwhelmingly immobile when operated by the end users. Thus, charisma matters, and so does FLOPS.
IBM's NorthPole is 8x faster than Nvidia's A100 and 6x faster than their top-flight H100. It has been in development for 20 years, is more efficient as well, and is on the 12nm process node, so it has plenty of headroom to be shrunk. "It has 256 cores and can perform 2,048 operations per core per cycle." What IBM and AMD are doing is embedding compute into memory on a single chip.
My braincells have shrunk so much that i can fit 5 times more in my skull! JavaScript makes you a genius!
UR SO RIGHT, I LOOKED AT MY DISCORD FOR A SEC. WHY DOES EVERYONE KEEP DOING THIS FFS
kilochad
megachad
gigachad
terachad
petachad
exachad
zettachad
yottachad
brontochad
Now I can GIGAFLOP with awareness 😌
What if the blow stricken against humanity by the misguided and overhyped attempts at ai isn’t so much ‘robot overlords destroy us in nuclear fire’ as it is ‘copilot ensures novice programmers spend 532x more time making the same mistakes, and even expert programmers spend as much as 52x as much time recoding blunders’…?
Damn I'm definitely going to that channel and binging some videos
recently moved from windows to linux, haven't installed discord yet, easy method to not get triggered by discord notification sound
Well, you should have seen what I had to work on for 2 years. Lucee 5 created by Indians. You want 300 line functions, we got em!
You missed out on "Dr. Bandwidth" =))))
Not rotate. Transpose. Agen.
What cursor theme is Prime using? I may be hallucinating but it feels like his cursor isn't as skewed as the default ones.
pop os default
If just rotating your matrix works, what are compiler engineers doing?
If you use a math package like BLAS/LAPACK/NumPy, I believe they will look for those sorts of optimizations, and more complicated ones.
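For the curious, handing the problem to a BLAS library from C looks roughly like this (a sketch assuming a CBLAS implementation such as OpenBLAS is linked; square row-major single-precision matrices, computing C = A * B). The wrapper name is made up.

```c
#include <cblas.h>

/* C = 1.0 * A * B + 0.0 * C, all n x n row-major single-precision matrices.
 * The library chooses the blocking, packing and SIMD kernels for us. */
void matmul_blas(const float *A, const float *B, float *C, int n) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0f, A, n,
                      B, n,
                0.0f, C, n);
}
```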
As a wgpu-py contributor... It's time to update
Jonathan Blow has an amazing rant on this.
Here for the baby-agen
Can you give a link?
@@akmubilin4363 can't rn, maybe some other passerby knows
What is it called?
@@thewhitefalcon8539"Programmers don't know how fast their computers are"
Implying that programmers can own more than one computer 😂
Really complicated way of saying the path between the CPU and GPU is a potential bottleneck. You don't use the GPU where you're doing a small amount of work on a large dataset and then throwing it away. If the problem is inherently like that, it's just not a good fit. But many, many problems are not like that.
I think a lot of web developers are just forcing what they already know instead of taking the time to completely restructure for the hardware. You really do need to rethink your flows and data structures from the ground up. GPU hardware is really that different.
i have no idea what was going on..... brain hurt
Array vs linked list... the problem is the lack of understanding of big-O notation, and also how it fails to make the nuance explicit.
dude you are hijacking my YT viewing. every single video I watched lately you also react to, meaning I might not actually look for any new videos and watch your reactions right away instead D:
Am I the only one who only uses high level languages and now feel dumb?
The solution is,... using cash..
Just throw more money at it
My CPU is Hotter tha-
You remind me a lot of LobosJr. He's an OG justintv/twitch streamer, codes, and stares at s**t the same way you do. You're a tad more life-experienced
just went and checked my discord😂
JS programmers when they discover a real language like c/c++ be like:
dam mfer just teasing us with dem GEMM algorithms.
The original video is a test on how neurotypical vs atypical you are in society.
I personally loved it and the reaction, it was very informative, dense but not patronising. 🎉
I get recommended your programming videos. I WANT TO SEE IT. LET ME SEE IT. Maybe have a 30 minute segment so the normies don't get bored?
I GOT TRIGGERED AND CHECKED MY DISCORD!
what do you call a quintillion files?
ugh. he uses discord in light mode
i have coded in c++, swift, c# many languages, but i stayed away from javascript as soon as i touched it. I AM BIG BRAIN BABY.
Why do you have your hood under the headsets!
Swag points
Cash fixes a lot of performance problems.