The fastest matrix multiplication algorithm

  • Published: Feb 8, 2025

Comments • 245

  • @planaritytheory
    @planaritytheory 2 years ago +641

    It's important to note that, although additions _are_ considered cheaper than multiplications, what we _aren't_ doing is making a tradeoff between the two operations (accepting many more additions to save a few multiplications). You didn't explicitly say that's what's going on, but someone could get the impression from this video that that's the deal---I myself used to think that. Instead, the number of additions, multiplications, and total number of operations are all lower for Strassen's algorithm, and it's because the number of "multiplications" at each step is counting the number of recursive calls that are made.
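
    (To make that concrete, here is a rough Python sketch of my own, not from the video, that counts scalar multiplications and additions for the classic 8-product block recursion versus Strassen's 7-product recursion, recursing all the way down to 1x1. The multiplication count drops immediately; the addition count and the total only win for quite large n, but both grow like n^log2(7) instead of n^3.)

    def classic_counts(n):
        """Scalar (multiplications, additions) for the 8-product block recursion."""
        if n == 1:
            return 1, 0
        m, a = classic_counts(n // 2)
        # 8 recursive products, plus 4 additions of (n/2) x (n/2) blocks
        return 8 * m, 8 * a + 4 * (n // 2) ** 2

    def strassen_counts(n):
        """Scalar (multiplications, additions) for Strassen's 7-product recursion."""
        if n == 1:
            return 1, 0
        m, a = strassen_counts(n // 2)
        # 7 recursive products, plus 18 additions/subtractions of (n/2) x (n/2) blocks
        return 7 * m, 7 * a + 18 * (n // 2) ** 2

    for n in (2 ** 6, 2 ** 10, 2 ** 14):
        print(n, classic_counts(n), strassen_counts(n))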

    • @DrTrefor
      @DrTrefor 2 years ago +173

      Thanks for the clarification. Indeed, I left the induction proof for additions as an "exercise" so this got glossed over.

    • @samueljehanno
      @samueljehanno 2 years ago +1

      69th

    • @gregorymorse8423
      @gregorymorse8423 2 years ago +4

      Considering that addition is O(n) and multiplication is O(n log n) in the best case, given a two's-complement number system, treating multiplication and addition as like operations is completely incorrect, especially as the values in the matrix grow larger. So it's not just the recursion that is relevant.

    • @destiny_02
      @destiny_02 2 years ago +6

      If I'm not wrong, multiplication is cheaper than addition in the case of floating-point numbers on a modern CPU.

    • @gregorymorse8423
      @gregorymorse8423 2 years ago +6

      @Jotaro Kujo just because it is simpler to implement floating-point multiplication doesn't make it cheaper. It's not faster even if addition requires resources to shift the smaller argument and the result; there is still an integer multiply or add in the middle. My point was only about unbounded big integers anyway. Treating addition and multiplication as like ops when using a finite bit width like 32 or 64, be they integers or floats, does make sense, even if their clock cycles are slightly different. The story just gets more complicated given that a fused multiply-add (FMA) is used in modern hardware to improve accuracy and speed, and that applies to complex-number and matrix multiplication very specifically in fact. But unbounded data types must treat addition and multiplication separately.

  • @Mattes_H
    @Mattes_H 2 years ago +310

    I think it should be noted that all of these faster algorithms are so-called galactic algorithms, meaning the constant that is hidden by the phrase "order of" is so big that they become useless for any practical application.
    In practice the challenge is more about efficient and parallel implementation. In fact GEMM, the subroutine most often used for matrix-matrix multiplication, usually implements either the naive or a Strassen-based algorithm, but uses all the hardware tricks available on modern processors to get its high performance.
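
    (As a rough illustration of the blocking/tiling that tuned GEMM implementations rely on, here is a toy Python sketch; it is nothing like an optimized BLAS kernel, and the block size is an arbitrary choice.)

    import numpy as np

    def blocked_matmul(A, B, block=64):
        """Plain O(n^3) multiply, but walked in cache-friendly tiles."""
        n = A.shape[0]
        C = np.zeros((n, n))
        for i0 in range(0, n, block):
            for k0 in range(0, n, block):
                for j0 in range(0, n, block):
                    # Each tile of A and B is reused many times while it is hot in cache.
                    C[i0:i0+block, j0:j0+block] += (
                        A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                    )
        return C

    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    assert np.allclose(blocked_matmul(A, B), A @ B)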

    • @CollectorWorth
      @CollectorWorth 2 years ago +1

      nice icon lol

    • @vaff69420
      @vaff69420 2 years ago +1

      I don't exactly understand your comment. Why aren't library developers implementing these algorithms if they are simply faster than the naive or other slower algorithms?

    • @pablote325
      @pablote325 2 years ago +30

      @@vaff69420 Because of the constants. Take for example the last algorithm, the one that does O(n^2.37285) multiplications. What this means is that the number of multiplications is going to be roughly:
      Mult = A * n^2.37285 + B, for some constants A and B, usually unknown.
      That's what the big-O notation means...
      So like the previous comment said, these weird galactic algorithms always have HUGE constants, so it's not even practical to use them to multiply, say, a 64x64 matrix (I'm just putting numbers here, but that's usually what happens; the real benefit of using these algorithms shows up when n is literally an astronomical number xd)
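
      (A tiny Python sketch of that break-even point, with a made-up constant; the real constants for these galactic algorithms aren't something I can quote, so C below is purely hypothetical and only there to show how a huge constant pushes the crossover out of reach.)

      def naive_cost(n):
          return n ** 3

      def galactic_cost(n, C=1e8):   # C is a hypothetical constant for illustration
          return C * n ** 2.37285

      for n in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
          print(n, "naive cheaper:", naive_cost(n) < galactic_cost(n))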

    • @porschepanamera92
      @porschepanamera92 2 years ago +2

      @@pablote325 ... and in that case it might be much faster but still extremely slow using current hardware?

    • @luckyw4ss4bi
      @luckyw4ss4bi 2 years ago +33

      @@porschepanamera92 I think the idea he's hinting at in general here is that the size of matrices you'd have to be multiplying to gain a performance boost from the current fastest algorithm is too large for modern computers to handle, anyway.

  • @frfr1022
    @frfr1022 2 years ago +73

    It reminds me of this algorithm for natural-number multiplication, which is theoretically a lot faster than any other algorithm we know of, but only becomes worth using at ridiculously huge numbers. I can't remember what it's called, but I think it was mentioned in 3b1b's latest video on convolutions.

    • @DrTrefor
      @DrTrefor 2 years ago +28

      If I recall correctly, some of the ideas of large number multiplications are related to ideas from this very Strassen algorithm.

    • @adamel-sawaf4045
      @adamel-sawaf4045 2 years ago +9

      @@DrTrefor Strassen also worked on a multiplication algorithm for numbers (the Schönhage-Strassen algorithm), so that would make sense!

    • @oyaoya2468
      @oyaoya2468 2 years ago +4

      The Karatsuba algorithm, is that what you mean? Because it utilizes the same technique as the Strassen algorithm.

  • @michaelsaba6769
    @michaelsaba6769 2 years ago +42

    Just read about this algorithm in CLRS. It's a perfect teaching tool to show why constant factors might not matter in asymptotic notation, but how they often do matter when working with recurrences. Like in this case, where 7 multiplications are preferred over 8, even if we're only multiplying the runtime by a constant factor: Strassen's yields a Theta(n^log2(7)) time complexity, while the standard divide-and-conquer approach yields Theta(n^3) time.
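
    (For reference, the two recurrences side by side, written in LaTeX; this is just the standard divide-and-conquer bookkeeping.)

    T_{\mathrm{classic}}(n) = 8\,T(n/2) + \Theta(n^2) = \Theta(n^3)
    T_{\mathrm{Strassen}}(n) = 7\,T(n/2) + \Theta(n^2) = \Theta(n^{\log_2 7}) \approx \Theta(n^{2.807})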

  • @cbunix23
    @cbunix23 2 years ago +21

    I wrote a cache friendly matrix multiplication algorithm that improves performance by 40% when matrices are very large; basically, the multiplications and additions are the same, but the order is done in a way that respects cache memory.
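
    (A common way to get that kind of ordering win is simply to swap the two inner loops so the innermost loop streams along rows of both B and C. Just a sketch; the 40% figure above is the commenter's, not something this toy code claims to reproduce.)

    def matmul_ikj(A, B):
        """Same multiplications and additions as the naive version,
        but the i-k-j loop order walks B and C row by row (cache friendly)."""
        n = len(A)
        C = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for k in range(n):
                a = A[i][k]            # loaded once, reused across a whole row of B
                for j in range(n):
                    C[i][j] += a * B[k][j]
        return C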

  • @magicflour
    @magicflour 2 years ago +42

    I had a class where a whole third of it was dedicated to these linear algebra algorithms. The most I learned really was "optimize the hardware and optimize with hardware" to get good results lol. This was a computer engineering class so it made sense.

    • @DrTrefor
      @DrTrefor 2 years ago +13

      Ha, yes that has been my sense too even as mathematicians have fun with asymptotic limits:D

    • @marcotroster8247
      @marcotroster8247 1 year ago

      Finally someone said it so that I don't need to comment it myself 😂

  • @alphalunamare
    @alphalunamare 2 years ago +17

    Fascinating! It brings back old memories. In 1977 I was mulling over the problem and managed, over a few days, to create an algorithm for 3x3 multiplications requiring only 24 multiplications and 66 additions. It reduced to Strassen's method in the 2x2 case but was no improvement, as that would have required 21 multiplications. I did this longhand over yards of line printer paper. Unfortunately a Canadian published an algorithm with 23 multiplications and 96 additions. It pissed me off no end, especially as it had no information regarding its derivation. (If I could trade 1 multiplication for 29 additions I could improve, but I left it on the shelf.) Sulking, I guess. Over the years I wondered if it was just a novelty; then I realised that in an object-oriented sense the savings are manifestly more than for just simple arithmetic multiplications. An intriguing thought is whether the intermediate Pn multiplications of 'objects' have a real meaning in the physics of the model under analysis that perhaps could shed some light? Perhaps in the geometric algebra treatment of spacetime?

  • @kruksog
    @kruksog 2 years ago +37

    I remember learning the Karatsuba algorithm for multiplication. It is also computationally cheaper than the standard way of multiplying numbers. It had a very similar feel to this.
    Cool stuff. Thanks!
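
    (For anyone curious, a bare-bones integer Karatsuba in Python; the split-and-recombine trick with 3 recursive products instead of 4 has the same flavour as Strassen's 7 instead of 8.)

    def karatsuba(x, y):
        """Multiply non-negative integers with 3 recursive products instead of 4."""
        if x < 10 or y < 10:
            return x * y
        m = max(len(str(x)), len(str(y))) // 2
        high_x, low_x = divmod(x, 10 ** m)
        high_y, low_y = divmod(y, 10 ** m)
        z0 = karatsuba(low_x, low_y)
        z2 = karatsuba(high_x, high_y)
        z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2
        return z2 * 10 ** (2 * m) + z1 * 10 ** m + z0

    assert karatsuba(123456789, 987654321) == 123456789 * 987654321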

    • @adamel-sawaf4045
      @adamel-sawaf4045 2 years ago +8

      This also reminded me of Karatsuba! I learned it in my algorithms class. I implemented it in a C++ program for extra credit.

  • @eti-iniER
    @eti-iniER 2 years ago +25

    As a first year CS student, I'm completely blown away by the fact that humans know and understand concepts like this. Amazing video, btw
    Cue the mild depression

  • @wcdeich4
    @wcdeich4 2 years ago +8

    Interesting! I previously knew some algorithms could do matrix multiplication faster for large sparse matrices, but those algorithms are actually slower for large dense matrices.

  • @discreet_boson
    @discreet_boson 2 years ago +43

    Please make more computational math videos like this!
    Programming documentation never explains the math like you do

  • @adamel-sawaf4045
    @adamel-sawaf4045 2 years ago +10

    Great vid! Just as informative as it needed to be, while sparking inspiration to learn more. One thing I think you should’ve mentioned was how much the space complexity grows for the trade-off in speed.

  • @L4ZY2137
    @L4ZY2137 2 years ago +5

    There is a thing called a "hierarchical matrix" that allows you to convert a matrix into a more "compact" form in which many operations like inversion or multiplication can be computed in almost O(n) time. If I remember correctly, if we drop the constant (k) that controls the approximation, the complexity is about O(n * (log n)^2).

  • @rezamiau
    @rezamiau 2 years ago +4

    Absolutely brilliant!
    One of the major problems that engineers like me are dealing with is time-consuming eigenvalue calculations, because of the need to increase precision due to ill-conditioning of a matrix. Please let us know if there is a good solution to it.
    Thank you so much.

  • @airsquid8532
    @airsquid8532 2 years ago +17

    The amount of power I wield is unrivaled with this knowledge, thank you for yet another interesting video full of passion!!

  • @prashantsingh6370
    @prashantsingh6370 2 years ago +17

    As someone who absolutely loved solving matrices & determinants problems in higher secondary mathematics, way back in 2011-12, a huge thumbs up for the excellent video. 👍

    • @DrTrefor
      @DrTrefor 2 years ago +3

      Thanks for sharing!

    • @freshrockpapa-e7799
      @freshrockpapa-e7799 1 year ago

      enjoying computations doesn't make you special in any way, there was no reason to mention that

    • @w花b
      @w花b 3 months ago +1

      ​@@freshrockpapa-e7799 Inferiority complex? It's fine, someone will love you and accept you for who you are...one day.

    • @freshrockpapa-e7799
      @freshrockpapa-e7799 2 months ago

      @@w花b I have plenty of people that love me, sorry you have to project your emotions like that, one day you'll be loved too friend. Try working on loving yourself first though.

  • @noot3316
    @noot3316 1 year ago +1

    I just finished my first term of Engineering at UVic (you were an amazing calc prof by the way!) and I can FINALLY understand how cool this video is!

    • @DrTrefor
      @DrTrefor 1 year ago +1

      Haha that’s awesome, thank you!

  • @cmilkau
    @cmilkau 2 years ago +6

    One important question with this algorithm: how many additions does the recursive algorithm use? Because in the end, we want a faster algorithm, not just a reduction in elementary multiplications. It's also a bit counterintuitive that this strategy *reduces* the total number of additions asymptotically.
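
    (For what it's worth, the additions satisfy a recurrence of the same shape, 18 block additions/subtractions of size n/2 at each level, so they end up dominated by the same exponent; in LaTeX:)

    A(n) = 7\,A(n/2) + 18\left(\tfrac{n}{2}\right)^{2} = \Theta\!\left(n^{\log_2 7}\right)

    which is asymptotically fewer than the n^2(n-1) = Theta(n^3) additions of the naive method.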

  • @ddognine
    @ddognine 1 year ago +3

    Interesting video of an area of active research of which I was unaware. I would be interested in a similar video covering the state of the art in matrix inversion which I believe is much more problematic for large matrices than multiplication and is typically the first operation before any multiplication.

  • @quarkonium3795
    @quarkonium3795 2 years ago +5

    One thing to keep in mind is that most of these improvements from the Strassen algorithm are theoretical. Most only improve on the Strassen algorithm for extremely large matrices-so large that these algorithms will probably never be practical

  • @christossymA3A2
    @christossymA3A2 2 years ago +4

    Man, it feels almost illegal that such great knowledge is available for free

  • @guigazalu
    @guigazalu 2 years ago +2

    While watching, I asked myself whether you would cite the AlphaTensor article and results, as I had read it in recent months.
    It turned out you did! It is a very cool article, even though it's a "heavy read" for me.

  • @U.Inferno
    @U.Inferno 2 years ago +2

    "This doesn't sound like a big improvement."
    Me, a computer scientist: "No. This does."

    • @GreatJO
      @GreatJO 1 year ago

      Is SIMD faster?

  • @garrettkajmowicz
    @garrettkajmowicz 1 year ago

    One area I found missing from your video was in showing that using block matrices got us the correct result. That is, that treating a 4x4 matrix as a 2x2 matrix of 2x2 matrices would, when multiplied, get us the same result as we would otherwise expect.
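
    (It is at least quick to sanity-check the block identity numerically; a throwaway numpy check, not a proof.)

    import numpy as np

    A = np.random.rand(4, 4)
    B = np.random.rand(4, 4)

    # Split each 4x4 matrix into a 2x2 grid of 2x2 blocks.
    A11, A12, A21, A22 = A[:2, :2], A[:2, 2:], A[2:, :2], A[2:, 2:]
    B11, B12, B21, B22 = B[:2, :2], B[:2, 2:], B[2:, :2], B[2:, 2:]

    # Multiply the blocks as if they were scalars, then reassemble.
    C = np.block([[A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
                  [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22]])

    assert np.allclose(C, A @ B)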

  • @mohammadhomsee8640
    @mohammadhomsee8640 2 years ago +3

    That's a great video, I really love hearing your lessons. Good job, professor!

  • @gooball2005
    @gooball2005 2 years ago +2

    I do enjoy a good video on complexity theory! Also, your thumbnail game is on point!

  • @changjeffreysinto3872
    @changjeffreysinto3872 1 year ago

    Solution to 7:17
    Let f(n) be the number of additions in the product of two 2^n x 2^n matrices. We compute the products of 7 smaller matrices of size 2^(n-1) x 2^(n-1), contributing 7f(n-1) additions. We also form sums and differences of these 2^(n-1) x 2^(n-1) blocks 18 times in total, amounting to 18 * (2^(n-1))^2 = (9/2) 4^n additions overall. Therefore the recurrence is f(n) = 7f(n-1) + (9/2) 4^n, a first-order linear difference equation. Substituting f(n) = 7^n * f_n gives f_n = f_(n-1) + (9/2) (4/7)^n, so f_n = (9/2) * sum_{k=1..n} (4/7)^k, which is less than (9/2) * sum_{k=1..inf} (4/7)^k = (9/2)(4/3) = 6. Finally, f(n) = 7^n * f_n < 6 * 7^n, so the additions are O(7^n) = O(N^log2(7)) for N = 2^n.

  • @MrHaggyy
    @MrHaggyy 2 years ago +2

    A huge deal in practical matrix multiplication is taking the naive approach: look for redundant operations, look for dependencies in the computation, group up independent operations and parallelize them.
    A 3x3 matrix product on a GPU would schedule every single one of those 27 multiplies on one 32-core chunk. Then you would copy the data to another 32-core chunk doing the addition. You could also do the additions of three matrix multiplies in one execution of 32 additions.
    That's the reason why Strassen is such a big deal. 8 multiplies and 4 additions fit beautifully into 32-core chunks. And once you have set up your execution pipeline you can feed it any matrix. If a rank is odd, the CPU will make it even by adding zeros before passing it to the GPU.

  • @Yupppi
    @Yupppi 1 year ago

    Since this is specifically computation-related, it might be worth talking not only about the algorithm for calculating sub-matrices, but also about cache memory (its layout and use): how transposing a matrix gives better performance because the elements are laid out in memory more efficiently, and how the relevant ones stay in the cache over the time you need them, which avoids cache misses. Furthermore, copying the matrices (to a local buffer) interestingly enough helps as well. And lastly, splitting the work across multiple threads. All of which leads to performance making such a leap that you can, for example, go from 2000 ms to 20 ms execution time. I feel like n^3 and n^2.4 aren't quite as concrete demonstrations compared to that. But of course Intel has a Basic Linear Algebra Subprograms implementation that does it in roughly 2 ms. These are, to my understanding, matrices of size around 1000.

  • @kristas-not
    @kristas-not 1 year ago +1

    ok, you officially kick ass in my book.

  • @johnniefujita
    @johnniefujita 2 years ago +3

    Single operations are not really what is expensive and increases algorithm complexity (the "O"s); complexity is actually directly related to nested computations. Usually we want an algorithm to be linear or linearithmic. Polynomial can be tractable. But never exponential or factorial; those cannot scale. Also, algorithms may have different worst, average and best case scenarios for different implementations, and sometimes, as in the case of sorting algorithms, the best choice can vary based on how well organized the data already is.

  • @pyropulseIXXI
    @pyropulseIXXI 1 year ago +1

    I know the *proof by induction* works absolutely, but the 'step' of _finding_ the formula in question is often skipped. I know it is not strictly necessary for the 'formal proof,' but it is still very nice to get this info via intuitive 'derivations.'

  • @аоаулорвылов
    @аоаулорвылов 1 year ago +2

    Congratulations on 100pi subscribers!)

    • @DrTrefor
      @DrTrefor 1 year ago +2

      Oh cool, thank you!

  • @Dawg1561
    @Dawg1561 2 years ago +1

    Didn't know Dr. Strange left sorcery and started teaching mathematics. Btw, this was a great explanation

  • @flexeos
    @flexeos 2 years ago +4

    I am not sure that the pure optimization of operations is very relevant. The limiting factor is often moving the data in and out of memory. What you have to optimize is how many operations you can do for each load.

    • @Miguel_Noether
      @Miguel_Noether 2 years ago +1

      For the same amount of memory and the same allocation speed, doing faster operations is going to be the most efficient way to optimize everything else (that's all the hype of quantum computing)

  • @jebarijihed
    @jebarijihed 1 year ago +1

    This reminds me a lot of the FFT and DFT algorithms!

  • @lebdesmath2510
    @lebdesmath2510 2 years ago +5

    nice vid, keep up the good work

    • @DrTrefor
      @DrTrefor 2 years ago +2

      Thanks, will do!

  • @gofaonepaul
    @gofaonepaul 1 year ago +1

    This is so cool. ❤

  • @streampunksheep
    @streampunksheep 2 years ago +1

    ok this is cool. it makes me want to study matrices again

  • @gzhang207
    @gzhang207 2 years ago

    One specialization of matrix multiplication is the matrix multiplication with a vector (or filter) that is widely used in convolution based ML. Winograd convolution seems to follow the same optimization idea.

  • @taotaotan5671
    @taotaotan5671 2 years ago +2

    Very very interesting topic! Thanks for sharing this, Trefor!
    Wondering if the algorithm applies to a more general case of matrix multiplication, e.g. (m x n) * (n x p)
    Another thing I found interesting is, after some research, matrix inversion and matrix multiplication have the same time complexity. The optimization method you mentioned in this video also applies to matrix inversion. That contradicts my impression that matrix inversion causes more headaches than multiplication.

    • @DrTrefor
      @DrTrefor 2 years ago +4

      For non-square matrices you can just imagine they are square with extra zeros so the same upper bound applies here.
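
      (A quick numpy sanity check of that zero-padding trick; the padded size of 8 here is picked by hand for the example.)

      import numpy as np

      A = np.random.rand(3, 5)          # (m x n)
      B = np.random.rand(5, 2)          # (n x p)

      # Pad both factors with zeros up to the same power-of-two square size.
      P = np.zeros((8, 8)); P[:3, :5] = A
      Q = np.zeros((8, 8)); Q[:5, :2] = B

      # The true product sits in the top-left (m x p) corner of the padded product.
      assert np.allclose((P @ Q)[:3, :2], A @ B)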

  • @realdragon
    @realdragon 1 year ago

    Speaking from a hardware perspective: if possible, try to avoid jumping between columns. Even a 2D array is really one very long row in memory, so jumping between columns means jumping across huge chunks of the array, while fetching the next number in the same row is very fast.

  • @dvir-ross
    @dvir-ross 2 years ago +5

    According to Wikipedia, there is already an improvement from 2022.

    • @DrTrefor
      @DrTrefor 2 years ago +5

      Oh nice! Good to be obsolete already!

  • @dr.rahulgupta7573
    @dr.rahulgupta7573 2 years ago +1

    Excellent presentation 👌

  • @rob876
    @rob876 1 year ago

    Here's a better version of Strassen's algorithm due to Winograd for A.B = C:
    a = A21 + A22
    b = a - A11
    c = B12 - B11
    d = B22 - c
    s = A11.B11
    t = a.c
    u = s + b.d
    v = u + (A11 - A21).(B22 - B12)
    C11 = s + A12.B21
    C12 = u + t + (A12 - b).B22
    C21 = v - A22.(d - B21)
    C22 = v + t
    This has 7 multiplications and 15 additions/subtractions.
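
    (Those formulas are easy to check numerically on 2x2 blocks; the snippet below is just a numpy transcription of the comment above, as a sanity check.)

    import numpy as np

    M = np.random.rand(4, 4)
    N = np.random.rand(4, 4)
    A11, A12, A21, A22 = M[:2, :2], M[:2, 2:], M[2:, :2], M[2:, 2:]
    B11, B12, B21, B22 = N[:2, :2], N[:2, 2:], N[2:, :2], N[2:, 2:]

    a = A21 + A22
    b = a - A11
    c = B12 - B11
    d = B22 - c
    s = A11 @ B11           # 7 block products in total
    t = a @ c
    u = s + b @ d
    v = u + (A11 - A21) @ (B22 - B12)
    C11 = s + A12 @ B21
    C12 = u + t + (A12 - b) @ B22
    C21 = v - A22 @ (d - B21)
    C22 = v + t

    assert np.allclose(np.block([[C11, C12], [C21, C22]]), M @ N)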

  • @yutubl
    @yutubl 2 years ago

    Thanks for this very interesting information.

  • @swagatochatterjee7104
    @swagatochatterjee7104 2 years ago

    Hey, the AlphaTensor record at 9:50 has been beaten by two researchers at the University of Linz, who showed that you can multiply 5x5 matrices in 95 multiplications instead of the 96 multiplications the paper showed.

  • @jcsjcs2
    @jcsjcs2 1 year ago

    The reason why you don't care about additions is not that additions are "much faster" than multiplications. An FPU will usually take the same amount of time for either and will probably even feature combined "multiply and accumulate" instructions.
    The reason is that the number of operations of both kinds is of the order n^3, so adding them together is still of the order n^3.

  • @soonts
    @soonts 1 year ago

    Modern computers don't necessarily do additions much faster than multiplications. The computer I use to watch your videos has an AMD Zen 3 CPU. For 32-bit floating point numbers, on my computer the performance is absolutely equal: both vaddps (adds 8 numbers) and vmulps (multiplies 8 numbers) AVX instructions have 3 cycles of latency and 0.5 cycles of throughput.

  • @ericphantri96734
    @ericphantri96734 1 year ago

    Use divider into 2×2 or 3×3 the do all the way to use the grid point inverter and addition to perform by electronic across the board mean reducing to the determinant of each small matrix then move them together into smaller matrix with all the determinant and repeat until final determinant each is only single number then multiple ply out. And find the angle between the two number by trigonometric value then we had final vector so if n = 50 numbers each matrix then use the 2× 2 because even so far total operations about 50×49 estimate 2450 calculation by electronic of 20 mhz then it take 0.00000085 second and if use slower CPU like 1 mhz then it take 0.005 second for 50×50

  • @qqec
    @qqec 1 year ago

    title: fastest matrix multiplication algorithm. Video explains: The first improvement of matrix multiplication 😆. Oh and on top of clickbait a fat sponsorship. I love quality content

  • @DiabloQFDB
    @DiabloQFDB 1 year ago

    I need to research and see if these methods can be implemented with AVX.

  • @jimwinchester339
    @jimwinchester339 1 year ago

    Interesting! I was class of 1980, so I learned only of the Strassen algorithm.

  • @notgabby604
    @notgabby604 1 year ago

    The Walsh-Hadamard transform is kind of the fastest dense matrix multiply you can do.
    It's a fixed transform, but you can somewhat un-fix it with cheap operations like sign-flipping the input beforehand, prior permutations, etc.
    The number of multiplies required is then zero.
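
    (For reference, the fast Walsh-Hadamard transform really is additions and subtractions only; a minimal in-place sketch, assuming the input length is a power of two.)

    def fwht(a):
        """In-place fast Walsh-Hadamard transform (no normalization)."""
        n = len(a)
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                for j in range(i, i + h):
                    x, y = a[j], a[j + h]
                    a[j], a[j + h] = x + y, x - y   # butterfly: only adds and subs
            h *= 2
        return a

    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))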

  • @jaimeduncan6167
    @jaimeduncan6167 2 years ago +1

    If one is working with not-so-big numbers (in terms of the number of relevant digits), like 64-bit floating point numbers, modern computers will produce multiplications in the same time as additions for this kind of problem (fully pipelined); in fact, many will produce an Ax + y result at 1 per cycle per pipeline. That means that additions can't be ignored.

    • @ultimatedude5686
      @ultimatedude5686 2 years ago +1

      I think the reason reducing multiplications is so important is that multiplying matrices is way more computationally expensive than adding them. That means when you’re working with block matrices, you want the minimum number of multiplications of the components; extra additions are fine and won’t really affect the asymptotic complexity.

    • @jaimeduncan6167
      @jaimeduncan6167 7 months ago

      @@ultimatedude5686 Yes, but that is not his point. His point is "when multiplying matrices I can ignore the scalar additions because they are much faster", which is wrong. If it was ever true, it's no longer the case. Almost all modern architectures since POWER1 have had some version of fused multiply-add, fully pipelined; again, this is when dealing with floats and big matrices (latency is a different story).

    • @ultimatedude5686
      @ultimatedude5686 7 months ago

      @@jaimeduncan6167 Addition of numbers is about the same speed as multiplication of numbers, yes. But the matrices might actually be block matrices, in which case the components themselves are matrices, in which case adding the components is much faster than multiplying them.

  • @dougpark1025
    @dougpark1025 1 year ago

    An interesting set of algorithms. However, as it turns out getting the numbers from memory into the CPU is a bottleneck that is easy to forget. If you can arrange to get data to load into the cache of the processor in contiguous blocks there is a great deal of performance gain to be had over just pulling the numbers from memory without any consideration. This was a technique used decades ago to do matrix multiplies efficiently when computer memory wasn't large enough to hold all of the matrices. Blocks would be loaded from disk, multiplied and then blocks written back out to disk. Same idea, but memory is a lot faster than writing to disk. Today we usually don't run into anything that big, but the concept still works.
    Next step, enter the multi-core processor. You can take the same idea and subdivide the work of multiplication into parts that run independently on separate cores. In the end you can approach an N-times speedup where N is the number of cores (subject to Amdahl's law, of course). The nice thing about the matrix multiply is that it falls into the category of being embarrassingly parallel, which is to say it doesn't require much in the way of synchronization constructs.
    I have not tried any of these faster algorithms in a parallel implementation. It might be interesting to benchmark that sort of implementation.

  • @D_VAULTZ
    @D_VAULTZ 2 years ago +2

    does this only apply to large ints? I'm wondering if there is an improvement available for much smaller values, such as 8-bit integers.

    • @jonatanlind5408
      @jonatanlind5408 2 years ago +2

      These algorithms only tend to make sense for large ints. The algorithms presented in this video were developed to speed up computation; your 8-bit example can already be calculated incredibly quickly, and even if you have a large number of calculations it is possible to precompute every single possible 8-bit multiply and simply look up the answer when you need it. I am unaware of any algorithm that can beat such a method.
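
      (A toy version of that lookup-table idea for unsigned 8-bit values; the names are hypothetical and it's only meant to illustrate the point, not to be fast in Python.)

      # Precompute all 256*256 possible products once...
      TABLE = [[a * b for b in range(256)] for a in range(256)]

      # ...then every later "multiplication" is just an index into the table.
      def mul8(a, b):
          return TABLE[a][b]

      assert mul8(200, 17) == 3400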

  • @aliuncu2815
    @aliuncu2815 2 years ago +1

    You forgot to mention that not a week after the AlphaTensor group's paper, Kauers and Moosbauer improved their results on 5x5 matrices.

  • @saitougin7210
    @saitougin7210 2 years ago +1

    7:05 "using some log rules": starting from base change formula (log_b c) / (log_b a) = (ln c) / (ln a)
    (log_b c) * (ln a) = (log_b a) * (ln c)
    ln[a^(log_b c)] = ln[c^(log_b a)]
    a^(log_b c) = c^(log_b a) q.e.d.
    You're welcome. Hm, maybe someone should add this formula on Wikipedia or something. It might come in handy, if one just knew it.

  • @rumasc9854
    @rumasc9854 1 year ago

    dude, I was waiting for the cool dance that the thumbnail portrayed was about to happen

  • @lunalildragon241
    @lunalildragon241 1 year ago

    Well, calculations in themselves don't really take a long time; the biggest problem is the memory read time you incur if you read and write more data than the CPU can store.
    Meaning that if you are doing such huge operations you have to care about splitting the data into chunks that fit in that area without having to re-read.

  • @deeveshjain4246
    @deeveshjain4246 10 months ago

    is this the same thing as batch matrix multiplication?

  • @joshualickey3426
    @joshualickey3426 2 years ago +1

    I think I found a mistake at 2:44. In the bottom right it might be p1-p2+p3+p6.

  • @chxyike
    @chxyike 1 year ago

    You can improve from n^3 to n^2.807, which is not significant, but the complexity increases a lot. The gain is small with this algorithm, in my opinion.

    • @MrHamof
      @MrHamof 1 year ago

      Say you have a matrix that's 100x100. With n^3, you need 1,000,000 multiplications to solve it. With n^2.807 you need 411,150 multiplications, and with the current record holder of 2.3728596 we're down to 55,683 multiplications. I don't know enough to speak to the real-world performance difference, but it definitely has a huge impact on the number of multiplications needed for large matrix multiplication.
      Depending on how much harder implementation is, I would agree that it's probably not worth it for smaller matrices; as with most of these, the difference between O(n^3), O(n^2) and O(n) isn't that big if n is a small number.
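
      (To reproduce numbers in that ballpark yourself, counting only the exponent term and ignoring constants:)

      for p in (3, 2.807, 2.3728596):
          print(p, round(100 ** p))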

    • @chxyike
      @chxyike 1 year ago

      @@MrHamof you are right. It makes sense for large matrices.

  • @wilurbean
    @wilurbean 2 years ago +1

    I feel like I've made it in my physics education when that slightly extended decimal exponential got me excited

    • @DrTrefor
      @DrTrefor 2 years ago +2

      lol

    • @wilurbean
      @wilurbean 2 years ago

      @@DrTrefor fr tho
      I went back to school at 30. Started at a tiny community college in rural northern Michigan a few years ago and got accepted to a top-tier engineering physics program (#1 in the US in my particular sub-field).
      My current physics classes are Modern Physics and lab, where we're doing a lot of modeling/data analysis with big matrices, and E&M based around Griffiths' Electrodynamics.
      Wouldn't have passed Vector Calc or Linear Algebra without your videos. It's coming full circle now with Modern and Linear Algebra, and E&M with the vector calc.
      Ty!

  • @Oscar1618033
    @Oscar1618033 2 years ago +1

    I have the feeling that such algorithms might involve some kind of matrix log and Fourier transforms. Could it be?

  • @iiiiii-w8h
    @iiiiii-w8h 2 years ago +2

    Which algorithm is implemented in MATLAB?

    • @DrTrefor
      @DrTrefor 2 years ago +3

      The normal one by default I would presume. It takes truly gigantic matrices for this kind of asymptotic difference to make a measurable advantage.

  • @frankyin8509
    @frankyin8509 1 year ago

    Love his T-shirt, 👕 topology meme🎉

  • @unlucky-777
    @unlucky-777 1 year ago

    As a computer engineer, this didn't help me at all because my coworkers are still using the most basic methods in our codes

  • @jessstuart7495
    @jessstuart7495 2 years ago +1

    Do modern CPUs detect when you are trying to multiply by zero and return a result (zero) faster than when multiplying two non-zero numbers?

    • @DrTrefor
      @DrTrefor 2 years ago

      Yes I believe so, but likely depends on exact algorithm one uses.

    • @jonatanlind5408
      @jonatanlind5408 2 years ago

      Yes, the speed of computation is dependent on the data itself. Floating point numbers actually have 5 distinct data representations they can assume, each of which must be handled separately: zero, denormal, normal, infinity and NaN. I believe computations involving denormal numbers are usually slower than normal or zero, and significantly so.

  • @holyshit922
    @holyshit922 2 years ago

    Standard multiplication maybe isn't the fastest, but it is still not bad because it has polynomial growth.
    Compare it with algorithms like calculating a determinant by cofactor expansion, or by summing products over all permutations, which have factorial growth.
    If we want to calculate the coefficients of the characteristic polynomial by expanding the determinant via minors to avoid symbolic computation, we get exponential growth.
    2:16 and here I catch you: we accelerate the algorithm, but at the cost of memory, and both RAM and disk space (or paper and ink) cost dollars.
    Was it worth it? The acceleration isn't significant; you won't be able to accelerate the algorithm to O(n^2).

  • @David-gu8hv
    @David-gu8hv 1 year ago

    No wonder donuts and coffee go so well together...

  • @artus9022
    @artus9022 2 years ago

    But what about SIMD? Isn't this n²? How would the algorithms perform in this case?

  • @lucavogels
    @lucavogels 2 years ago

    Can you also explain the Alman-Williams algorithm please! 🤓

  • @anaunaga5471
    @anaunaga5471 2 months ago

    I think your equation for the bottom right of the matrix is wrong.
    I believe it should be p1-p2+p3+p6.
    I was using this video as help, and I couldn't understand why I was getting an incorrect answer.

  • @pierreabbat6157
    @pierreabbat6157 2 years ago

    What happens to the accuracy of computation with these matrix multiplication algorithms? I use the standard n³ algorithm, with pairwise addition for accuracy.

    • @calvindang7291
      @calvindang7291 2 years ago +2

      It's assumed that you want an exact result (and so won't get hit by rounding errors) when doing this - if you only need approximate answers and so are using an approximate data type, that's something else entirely.

  • @Moinsdeuxcat
    @Moinsdeuxcat 2 years ago +1

    8:40 THIS HAS TO BECOME A SUMMONING SALT VIDEO

  • @iwersonsch5131
    @iwersonsch5131 2 years ago +3

    At 5:32, if I look at matrices of size n=2k+1 for k approaching infinity, the complexity with this naive extension approaches (2n-1)^log2(7) = (n-0.5)^log2(49) > (n-1)^5, doesn't it? Of course there are ways to reconcile this but this argument appears to be insufficient for the claim

  • @Petch85
    @Petch85 1 year ago

    What if you have a matrix times a vector? Are there similar tricks?

  • @ancient1844
    @ancient1844 18 days ago

    what about systolic arrays?

  • @ankurraj6560
    @ankurraj6560 2 years ago +3

    Can we somehow use the concept of a Matrix as a linear transformation of a space to simplify our Matrix multiplication problems!? 🤔🤔

    • @DrTrefor
      @DrTrefor 2 years ago +4

      Well, composition of linear transformations is just matrix multiplication, so they are definitely related.

  • @TheViperZed
    @TheViperZed 1 year ago

    Multiplication and addition having different impacts from a computational-complexity standpoint is incorrect from the perspective of modern hardware. If you're not doing this on embedded hardware, the multiply, int or fp, is going to take one cycle, sometimes even just fractional cycles; most often it's a fractional-cycle multiply-accumulate doing both the mult and the add. Computational complexity really needs to be combined with an analysis of memory throughput and memory footprint to be a viable analytical strategy these days; on its own it's an anachronism from a time when the main performance bound wasn't IO throughput into the CPU. It's really worth mentioning these facts if you're going to approach a topic from a computational complexity standpoint.

  • @abdulshabazz8597
    @abdulshabazz8597 10 months ago

    I would think the computational lower bound is less than O(N^2) if any digit repeats in more than one quadrant...

  • @subhodeepmondal7937
    @subhodeepmondal7937 2 years ago +1

    Let me tell you a story. I just started learning CUDA, and matrix multiplication is a basic algorithm that you have to learn. But they were using an algorithm which launches an nXn thread grid to calculate the sum(a^i * b^j) terms, which is virtually O(n) complexity if you have more CUDA cores than the order of the matrix. But I thought, can I do better? So what I did was calculate all the a^i * b^j terms and then do a rolling sum, which has a complexity of log(n). So the point is, if we can somehow parallelize our problem we can process a lot faster.

  • @farfa2937
    @farfa2937 2 years ago

    If you hardcode a matrix, then you can get the result in constant time, and you'll be right as long as you only try to multiply matrices that result in your chosen one.

    • @tissuepaper9962
      @tissuepaper9962 1 year ago

      This is the well-known "divine intervention" algorithm which is also employed in "miracle sort" and the "birthday attack" on hashing algorithms.

  • @jimjacobs2817
    @jimjacobs2817 2 years ago

    Is there a quicker way to calculate a 4 x 4 matrix? That is, is there a similar approach to the one for increasing the speed of 2 x 2 matrices?

  • @cmilkau
    @cmilkau 2 years ago +6

    I don't like title cheating. I somehow expected this would present a state-of-the-art algorithm, not just explain Strassen and then continue with "there are better ones this fast".

    • @Miguel_Noether
      @Miguel_Noether 2 years ago

      It's the fastest, where is the cheating part?

    • @SimGunther
      @SimGunther 2 years ago +2

      @@Miguel_Noether An explanation on the latest Williams equation would've sufficed for this video, but he chooses not to elaborate on its mechanics whatsoever.

  • @cantkeepitin
    @cantkeepitin 2 years ago

    What about rounding errors?

  • @jawadshinwari7365
    @jawadshinwari7365 2 years ago

    Hi sir! Please, sir, can this algorithm be explained with an example? 🙏🙏🙏

  • @jimnewton4534
    @jimnewton4534 2 years ago

    I'm surprised that if the number of rows is not a power of 2, then you can simply add rows and columns of zeros. I had always thought I needed to adjoin rows and columns of zeros but with 1's on the main diagonal.

  • @dmitrykazakov2829
    @dmitrykazakov2829 1 year ago

    This misses the question of accuracy. Additions (and subtractions) suffer precision loss in floating-point arithmetic. If you have accumulating additions in your algorithm you are risking catastrophic precision loss for some inputs. So shuffling a multiplication into a bunch of additions might not be such a bright idea as it appears.
    In general floating-point operations are not associative, so all optimization methods based on reshuffling operations, like this one, must be applied with great care.

  • @telotawa
    @telotawa 2 years ago

    9:09 but you can do integer multiplication in n log n tho

  • @gnomeba12
    @gnomeba12 2 years ago +1

    Ok now show us the best algorithm for SVD!

    • @DrTrefor
      @DrTrefor 2 years ago +1

      ha actually I've been wanting to do a SVD video

  • @GradientAscent_
    @GradientAscent_ 2 years ago

    Can you show a proof of n^2 being the theoretical minimum?

    • @L2M2K2
      @L2M2K2 1 year ago +1

      Depends on what you mean with the minimum. The trivial proof: There are n^2 unique elements to the input and they all are used at least once to compute the output. As such, one cannot possibly compute all n^2 output elements with less than n^2 operations in total (this would require a fixed number of operations for each input element to generate all n^2 output elements, a very tough task). Note, this does not say anything about the ability to get anywhere near this limit of n^2 operations. It is simply a strict lower limit one cannot possibly beat. (Improving the theoretical lower limits, as in a proof that it must be worse than here the trivial n^2 limit, is its own field of mathematics. And, once the theoretical lower limit and the best known algorithm meet, we complete a problem. Well, kind of, one can still improve the practical applications a lot. And, not all problems are even completable.)

  • @Rin-qj7zt
    @Rin-qj7zt 1 year ago +1

    apparently the link for brilliant is still active cuz it gave me the discount lol

    • @DrTrefor
      @DrTrefor 1 year ago +1

      Awesome! Hope you enjoy:)

  • @mathematicia
    @mathematicia 2 years ago

    This could be an easier method, but learning, memorising and using it in practical scenarios is really painful. As far as calculation is concerned it could be useful somewhere, but when we do any type of analysis with matrices, it couldn't replace ordinary multiplication.

  • @alekthpariso8606
    @alekthpariso8606 2 years ago +1

    still trying to figure out where I am gonna need this.

    • @DrTrefor
      @DrTrefor 2 years ago +2

      haha probably never tbh:D

  • @drwizard2875
    @drwizard2875 2 years ago

    Any advancement regarding the non-commutative behavior of matrix multiplication?

  • @muskyoxes
    @muskyoxes 2 years ago +2

    I'm not believing that a few multiplications are worse than a huge pile of additions. Multiplying two 10,000 digit numbers is a pain, but a 10,000-dimension matrix doesn't imply 10,000 digit numbers in the entries.
    Has anyone found which of these algorithms is _actually_ fastest when used for real problems on real computers?

    • @DrTrefor
      @DrTrefor 2 years ago +2

      Maybe one way to think about it is that multiplication IS just iterated addition, so every multiplication itself is hiding a huge pile of additions.

    • @muskyoxes
      @muskyoxes 2 years ago +2

      @@DrTrefor But those additions are parallel and hardware makes mincemeat of them. (Unlike division, which needs sequential subtraction and is thus slow.)
      Multiplication often takes only one clock cycle. Sometimes three. Rarely much more than that as far as I've seen.

    • @calvindang7291
      @calvindang7291 2 years ago

      You don't actually calculate based on the number of multiplications. You calculate asymptotic complexity the normal way, which gives the same result as just counting the number of multiplications.
      Of course, which one is actually fastest depends on how large your matrices are. Multiplying matrices that are 100k in each dimension is probably enough that the extra overhead of the fast methods is worth it, but if they're significantly smaller than that it might not be needed.

    • @muskyoxes
      @muskyoxes 2 years ago +1

      @@calvindang7291 If you're replacing a 3-clock-cycle operation with a dozen 1-clock-cycle operations, aren't you losing at all sizes? It's not that simple (what about vector operations, pipelining, or GPUs?), but all the more reason to know the _practically_ best algorithm.
      I'd also only be interested in large matrix sizes that are used in real problems, and not report that some algorithm is best on matrices orders of magnitude bigger than any application has ever worked with.

    • @calvindang7291
      @calvindang7291 2 years ago

      @@muskyoxes As I said, you count the operations the normal way. The way I learned it in school, that includes O(1) multiplication, since we aren't assuming big numbers.
      Nobody cares about the exact number of operations. If a twice-as-large matrix always takes only approximately 7 times more operations, that's going to be treated as better than one where it's 8 times more (algorithm analysis doesn't care about any other overhead, either - just whichever part of the algorithm is slowest, which you can usually figure out by looking at it for 5 seconds)

  • @dabbler1
    @dabbler1 1 year ago

    Correction: it is p1 + p3 - p2 + p6, not p1 + p2 - p3 + p6.