This is a game changer! (AlphaTensor by DeepMind explained)

  • Published: 7 Nov 2024

Comments • 271

  • @YannicKilcher
    @YannicKilcher  2 года назад +33

    OUTLINE:
    0:00 - Intro
    1:50 - Sponsor: Assembly AI (link in description)
    3:25 - What even is Matrix Multiplication?
    6:10 - A very astounding fact
    8:45 - Trading multiplications for additions
    12:35 - Matrix Multiplication as a Tensor
    17:30 - Tensor Decompositions
    20:30 - A formal way of finding multiplication algorithms
    31:00 - How to formulate this as a game?
    39:30 - A brief primer on AlphaZero / MCTS
    45:40 - The Results
    48:15 - Optimizing for different hardware
    52:40 - Expanding fundamental math
    53:45 - Summary & Final Comments

    • @CosmiaNebula
      @CosmiaNebula 2 года назад +7

      TLDR:
      ## The problem
      Given two matrices A, B of size mxn, nxl, find a way to compute AB, such that we do as few scalar multiplications as possible (add and subtract are free).
      Reformulate the problem to make it amenable to AI: given the 3D "matrix multiplication tensor" T of size (mn)x(nl)x(ml), find a decomposition into T = ∑i ui⊗vi⊗wi, where each ui, vi, wi is a vector with entries from {-1, 0, 1}.
      If this decomposition has r terms, then we have a corresponding way to compute AB using only r scalar multiplications (a small worked example of this correspondence appears at the end of this comment).
      ### TensorGame
      Now create the 1-player TensorGame: given a 3D tensor T, ask the agent to make moves of the form ui⊗vi⊗wi, where each ui, vi, wi is a vector with entries from {-1, 0, 1}. The state of the game after the move is (T - ui⊗vi⊗wi). The game ends either when the game state reaches 0, or when the player has played K rounds and thus "runs out of time" (where K is an upper bound on the rank of T). The reward at each step is (-1), to encourage the agent to finish as fast as possible. If the agent runs out of time, the reward for the last step is (-r), where r is an upper bound on the rank of the remaining (terminal) tensor.
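      As a rough illustration of one game step (a minimal NumPy sketch of my own, not DeepMind's code):
      ```python
      import numpy as np

      def play_move(state, u, v, w):
          """One TensorGame move: subtract the chosen rank-1 term u⊗v⊗w from the state tensor."""
          return state - np.einsum("i,j,k->ijk", u, v, w)

      def is_won(state):
          """The game is won once the remaining tensor is exactly zero."""
          return not state.any()
      ```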
      ### Agent
      The agent is an RL agent, "AlphaTensor". Like AlphaZero, it uses a deep neural network to guide a Monte Carlo tree search (MCTS) planning procedure. It is trained on randomly generated 3D tensors constructed as sums of randomly generated rank-1 terms ∑i ui⊗vi⊗wi, so synthetic TensorGames with known decompositions are easy to generate.
      ## Results
      After some training, the resulting agent is able to play the TensorGame very well. When applied to matrix multiplication tensors, it found some ways to do better than previous human-found techniques.
      For example, when A, B, are both 2x2, the previous human-found technique is Strassen algorithm, taking 7 multiplications. The agent replicated it. When A, B are both 4x4, the previous best is 49 multiplications (recursive Strassen algorithm). The agent found a 47 multiplication method.
      By recursively applying the Strassen algorithm, one can compute (nxn) times (nxn) matrix multiplication in O(n^{log_2(7)}) = O(n^{2.8074}) time. I went through their table of results, and found that the best exponent they have found is the "47 scalar multiplications for 4x4 matrix multiplication". This one achieves asymptotic complexity of O(n^{log_4(47)}) = O(n^{2.777}).
      Consider what they achieved: it took humans 10 years to go
      from O(n^{2.8074}) in Strassen (1969)
      to O(n^{2.780}) in Bini, Capovani, Romani (1979),
      and they managed to do it with an entirely general RL search. I have no doubt that if they can find a way to cast the Coppersmith-Winograd algorithm as a game, the RL will outperform it as well.
      I scoured their papers and couldn't find how much time it took to train the agent, other than that it was somewhere around 1 million steps. By Google standards, I guess that means it took them a month. So, 1 month vs 10 years.
      BUT WAIT THERE'S MORE
      They also found the optimal algorithm for multiplying an nxn skew-symmetric matrix by a vector. The previous SOTA (2018) was known to take at most ~n^2 multiplications (see "Fast structured matrix computations: tensor rank and Cohn-Umans method"). They found a method taking only 0.5(n-1)(n+2) multiplications. This is asymptotically optimal, since an nxn skew-symmetric matrix has only ~0.5 n^2 distinct entries.
      BUT WAIT THERE'S MORE
      They ran the same algorithm but this time scoring its strategy based not on how many scalar multiplications it takes, but how much time it takes to run on a particular hardware (GPU or TPU). It found improvements as well, and these improvements are specific to GPU or TPU (using the optimized algorithm for GPU on TPU would yield a much lower improvement, and vice versa).
      > AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for.
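      To make the "r rank-1 terms ⇒ r scalar multiplications" correspondence concrete, here is a small sketch (assuming NumPy; these are Strassen's classic rank-7 factors for 2x2 matrices, not AlphaTensor's output):
      ```python
      import numpy as np

      # Rows r of U, V, W give one rank-1 term each: one scalar multiplication
      # m_r = (±1 combination of A's entries) * (±1 combination of B's entries),
      # and every entry of C is then just a ±1 combination of the m_r.
      U = np.array([[1,0,0,1],[0,0,1,1],[1,0,0,0],[0,0,0,1],[1,1,0,0],[-1,0,1,0],[0,1,0,-1]])
      V = np.array([[1,0,0,1],[1,0,0,0],[0,1,0,-1],[-1,0,1,0],[0,0,0,1],[1,1,0,0],[0,0,1,1]])
      W = np.array([[1,0,0,1],[0,0,1,-1],[0,1,0,1],[1,0,1,0],[-1,1,0,0],[0,0,0,1],[1,0,0,0]])

      def strassen_2x2(A, B):
          a, b = A.reshape(4), B.reshape(4)   # row-major: (a11, a12, a21, a22)
          m = (U @ a) * (V @ b)               # the only 7 scalar multiplications
          return (m @ W).reshape(2, 2)        # everything else is additions/subtractions

      A, B = np.random.randn(2, 2), np.random.randn(2, 2)
      assert np.allclose(strassen_2x2(A, B), A @ B)
      ```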

    • @tlf4354
      @tlf4354 2 года назад +1

      @@CosmiaNebula Thanks for this TL;DR~

    • @tlf4354
      @tlf4354 2 года назад +1

      @@CosmiaNebula TL;DR
      🖥 🧠 🔢 ⏫ 😱

    • @pouet4608
      @pouet4608 2 года назад

      @@CosmiaNebula But are the found algorithms proven afterwards? Because finding an algorithm is half the work; proving it works, not only on some test results but conceptually, is the rest. I don't understand how they can use discovered algorithms if they cannot prove they are correct.

    • @CosmiaNebula
      @CosmiaNebula 2 года назад

      @@pouet4608 verifying the algorithms is trivial algebra

  • @ztangent3188
    @ztangent3188 2 года назад +422

    Actually, when the fast matrix multiplication algorithm is applied recursively (to a big matrix), not only does the number of multiplications decrease, so does the number of additions. This is because each sub-matrix multiplication requires super-quadratic O(m^(2+eps)) operations, while sub-matrix addition still takes only O(m^2). So we are not just trading multiplications for additions - both operation counts decrease.
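    A quick back-of-the-envelope sketch of the counts (mine; full recursion down to 1x1 blocks, 18 block additions per Strassen step):
    ```python
    def strassen_ops(n):
        """(multiplications, additions) for Strassen applied recursively to n x n, n a power of 2."""
        if n == 1:
            return 1, 0
        m, a = strassen_ops(n // 2)
        return 7 * m, 7 * a + 18 * (n // 2) ** 2   # additions also grow as ~n^2.81, not ~n^3

    def naive_ops(n):
        return n ** 3, n ** 2 * (n - 1)

    for n in (2 ** 14, 2 ** 16):   # for large enough n BOTH counts beat the naive algorithm
        print(n, strassen_ops(n), naive_ops(n))
    ```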

    • @Neptutron
      @Neptutron 2 года назад +47

      Yes - this!! Please upvote the OP's comment! It should be emphasized that these improvements are not just constant factors - they improve the asymptotic complexity of matrix multiplication for certain matrix sizes in a practical way!

    • @nonamehere9658
      @nonamehere9658 2 года назад +15

      +1, Yannic should've at least mentioned Strassen's recursive matrix multiplication algorithm.

    • @dr.mikeybee
      @dr.mikeybee 2 года назад +8

      @@nonamehere9658 Yannick shows it in the paper.

    • @CosmiaNebula
      @CosmiaNebula 2 года назад +15

      For example, when A, B, are both 2x2, the previous human-found technique is Strassen algorithm, taking 7 multiplications. The agent replicated it. When A, B are both 4x4, the previous best is 49 multiplications (recursive Strassen algorithm). The agent found a 47 multiplication method.
      By recursively applying the Strassen algorithm, one can compute (nxn) times (nxn) matrix multiplication in O(n^{log_2(7)}) = O(n^{2.8074}) time. I went through their table of results, and found that the best exponent they have found is the "47 scalar multiplications for 4x4 matrix multiplication". This one achieves asymptotic complexity of O(n^{log_4(47)}) = O(n^{2.777}).
      Consider what they achieved: It took humans 10 years to go
      from O(n^{2.8074}) in Strassen (1969)
      to O(n^{2.780}) in Bini, Capovani, Romani (1979).
      and they managed to do it by an entirely general RL search. I have no doubt that if they can find a way to cast the Coppersmith-Winograd algorithm as a game, the RL will out-perform it as well.

    • @blackpepper2610
      @blackpepper2610 2 года назад

      no

  • @antopolskiy
    @antopolskiy 2 года назад +99

    thank you, Yannic. I love these paper explaining videos. it is really hard for me to focus on reading papers. You explain the main points very clearly in a fraction of time it would take me to read it.

    • @LarryPanozzo
      @LarryPanozzo 2 года назад +17

      We literally don’t have time to read every great paper these days. Summaries like these do save the day.

  • @apothum
    @apothum 2 года назад +17

    Production quality on this video was fantastic. For instance, the simple things like the blue background and circle are nice touches.

  • @cmilkau
    @cmilkau 2 года назад +31

    The reason that addition is "cheap" is not so much about CPUs - these can multiply numbers in a single cycle, too. It's because the entries are usually matrices themselves (instead of numbers), and adding matrices is just adding their entries.
    Full story: it's when you use these results to multiply huge matrices. You do this using "block matrix multiplication". For instance, to multiply two 1000x1000 matrices, you can reinterpret each of them as a 2x2 matrix whose four entries are 500x500 matrices ("blocks"). If you then use a 2x2 multiplication formula, each multiplication in that formula actually corresponds to multiplying two 500x500 matrices (which you can do by applying your formula recursively). The naive formula takes n×n×n multiplications to multiply two n×n matrices, but only n×n additions to add two such matrices. As you can see, this is a huge difference for large n, and it stays huge even for "clever" versions of multiplication.
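    A minimal block-recursive sketch of this (my own NumPy illustration, power-of-two sizes only), where every "multiplication" in the 2x2 Strassen formula is itself a block product:
    ```python
    import numpy as np

    def strassen(A, B, cutoff=64):
        """Block-recursive Strassen; a simplified sketch assuming n is a power of two."""
        n = A.shape[0]
        if n <= cutoff:                                # below the cutoff, plain matmul wins
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)    # 7 block products instead of 8
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C11 = M1 + M4 - M5 + M7
        C12 = M3 + M5
        C21 = M2 + M4
        C22 = M1 - M2 + M3 + M6
        return np.block([[C11, C12], [C21, C22]])

    A, B = np.random.randn(256, 256), np.random.randn(256, 256)
    assert np.allclose(strassen(A, B), A @ B)
    ```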

    • @cmilkau
      @cmilkau 2 года назад +1

      @@leeroyjenkins0 Ok, maybe, but I don't like the emphasis put on that because it really isn't as relevant. The matrices that actually matter in state-of-the-art ANNs have n in the thousands to tens of thousands; a direct tensor decomposition would be on the order of millions to billions of actions. So the block multiplication case is the really relevant one.
      Of course, if you want to do actual graphics, 4x4 is your most common size, and directly improving that can vastly improve the characteristics of your graphics chip or driver.

  • @DRealHammer
    @DRealHammer 2 года назад +54

    -2 and 2 make sense as you can use a bitshift for the multiplication, so it is basically free and gives a bigger space of possible solutions.
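    A tiny sketch of why a factor of ±2 is essentially free (integers shift by one bit; floats just bump the exponent, as a reply below also notes):
    ```python
    import math

    x = 12345
    assert (x << 1) == 2 * x           # integer: multiplying by 2 is a 1-bit left shift
    y = 3.14159
    assert math.ldexp(y, 1) == 2 * y   # float: add 1 to the exponent
    ```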

    • @DRealHammer
      @DRealHammer 2 года назад

      @@h..h can you reformulate the question?

    • @AlexanderShamov
      @AlexanderShamov 2 года назад +1

      ​@@h..h Why not? Just add 1 to the exponent, it's still much faster than general floating point multiplication.

    • @vladimirtchuiev2218
      @vladimirtchuiev2218 2 года назад +1

      I guess so are -4, 4, -8, 8, or more generally any ±2^n. But I guess the capacity to search in this larger space is more limited.

    • @DRealHammer
      @DRealHammer 2 года назад

      @@vladimirtchuiev2218 I think so too! However, my intuition also is, that the benefits of these high factors are not as great as adding the factors 2/-2.

  • @rxphi5382
    @rxphi5382 2 года назад +26

    As a new CS student this kind of content is very interesting, even though I don't understand everything. It motivates me to keep learning.
    Thank you Yannic!

    • @w花b
      @w花b Год назад +1

      @Greg LeJacques w-what..?

  • @adityay525125
    @adityay525125 2 года назад +14

    Thank you Yannic, I missed these older style paper explainers

  • @cmilkau
    @cmilkau 2 года назад +22

    "A matrix is a square box of numbers" - AlphaTensor can also multiply rectangular matrices, and the paper gives an example. Of course 2x2 and maybe 4x4 matrices are of specific interest because they can be easily generalized and implemented in algorithms for large matrices.

  • @mgostIH
    @mgostIH 2 года назад +63

    I think this is important beyond the specific domain they applied it to (matrix multiplication algorithms), since it shows that reinforcement learning via deep learning is strong enough to explore and optimize over extremely large discrete spaces. I expect that in the future many NP-hard problems we want to tackle will be attacked by these approaches rather than hand-crafted heuristics.

    • @Jiftoo
      @Jiftoo 2 года назад +6

      real

    • @AufBerghofNAM
      @AufBerghofNAM 2 года назад +1

      nphard has incomplete m-sets, but solutions are provable in post

  • @BjarkeEbert
    @BjarkeEbert 2 года назад +12

    One thing I have wondered: we use multiplications because it's mathematically convenient. But what if we use some cheaper operations on the layer inputs, and "just learn" to parametrize them correctly. Who cares if it's exact multiplication - what matters is that the NN gets the job done. The learning could maybe work around the "incorrect but fast" multiplication

    • @JurekOK
      @JurekOK 2 года назад +9

      "Analogue computers" do that -- they exploit the non-linearity of silicon diode in the 0..1V region of conduction.
      As long as you have some nonlinear operation, you can have a neural network. Multiplication happens to be a convenient one for **digital** computers.

    • @Fordance100
      @Fordance100 2 года назад +2

      No, you would have to go through a NN to estimate it, and the NN is itself composed of a bunch of matrix multiplications - you would end up doing much more work than just doing the matrix multiplication directly.

    • @appletree6741
      @appletree6741 2 года назад +6

      You would need an "incorrect but fast" backprop as well; it's not clear to me that backprop still works if you break multiplication.

  • @minimath5882
    @minimath5882 2 года назад +3

    Karatsuba's algorithm traded multiplications for additions on numbers back in 1963,
    and now DeepMind has done it for matrices with AlphaTensor in 2022.
    Amazing
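    For reference, Karatsuba's trick in a few lines (a minimal sketch, not production code) - one multiplication is traded for a few extra additions and subtractions, the same kind of trade AlphaTensor searches for on matrices:
    ```python
    def karatsuba(x, y):
        """Multiply non-negative integers with 3 recursive products instead of 4 (a simple sketch)."""
        if x < 10 or y < 10:
            return x * y
        m = max(len(str(x)), len(str(y))) // 2
        xh, xl = divmod(x, 10 ** m)
        yh, yl = divmod(y, 10 ** m)
        z2 = karatsuba(xh, yh)
        z0 = karatsuba(xl, yl)
        z1 = karatsuba(xh + xl, yh + yl) - z2 - z0   # the saved multiplication
        return z2 * 10 ** (2 * m) + z1 * 10 ** m + z0

    assert karatsuba(123456789, 987654321) == 123456789 * 987654321
    ```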

    • @NoNameAtAll2
      @NoNameAtAll2 2 года назад

      An analog of Karatsuba's algorithm for matrices has existed for decades already (Strassen's);
      DeepMind found an even better one.

  • @andrewdunbar828
    @andrewdunbar828 2 года назад +1

    Hey thanks for explaining this in a way I can almost understand. English tip for you in appreciation: "long-winded" is not about winding it's about the wind blowing. Dunno why.

  • @mujtabaalam5907
    @mujtabaalam5907 2 года назад +4

    27:40 I think there's a mistake: the modification comes from the 5th column, not the 7th, because we're looking at the tensor where the top row of u is non-zero

  • @GandalfTheBrown117
    @GandalfTheBrown117 2 года назад +1

    Thank you for your videos. You're the Agadmator of ML/AI.

  • @Derrekito
    @Derrekito 2 года назад +7

    This is a fantastic coincidence! Several months ago I was trying to work out generalized forms similar to the ones you mention at the beginning, but from the perspective of a computer-architecture data-dependency issue. I shelved the idea because after a few weeks I found it difficult to justify the time spent on it. This solution might be a much better fit if it really works! I'm so excited to read this paper and start working towards implementing it.

    • @maloxi1472
      @maloxi1472 2 года назад +1

      Are you a descendant of the great Rolf Landauer ?!

    • @Derrekito
      @Derrekito 2 года назад +2

      @@maloxi1472 I sometimes fantasize that I have some connection to the Landauer limit. Sadly, I'm unaware of any familial connection. 😅

  • @iestynne
    @iestynne 2 года назад +1

    This kind of fine-grained explanation is such a great service - thank you!!

  • @Anders01
    @Anders01 2 года назад +1

    That's a great result. I have been thinking about machine learning training other machine learning algorithms, but using it to improve other kinds of algorithms is also an interesting approach, which has now been shown to be practically possible.

  • @Handelsbilanzdefizit
    @Handelsbilanzdefizit 2 года назад +3

    I think, with the dawn of quantum computers, such problems will be blown away,
    because matrix multiplication consists of many scalar products (or dot products):
    row * column.
    With the help of superposition and quantum parallelism, it will deliver results almost instantly.

  • @mildpass
    @mildpass 2 года назад +5

    I would have liked to hear your comments on the structured matrix multiplication part of the paper. This is potentially the biggest speedup to linear algebra presented in the paper. Their skew-symmetric matrix algorithm beats the state of the art by a factor of 2 - not by 2 multiply operations, but half as many as the previous best! Unfortunately the structured matrix section of the paper read a bit vague to me. In particular, the "extrapolation" of the skew-symmetric algorithm to larger matrices looked like it was done manually by a human. It would be really cool to see an ML algorithm automate the extrapolation of those algorithms; to me that would be the holy grail of AI-optimized linear algebra.
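    Plugging in the counts quoted in the paper makes the factor-of-two claim concrete (a tiny sketch):
    ```python
    # n x n skew-symmetric matrix times a vector:
    # previously known bound ~ n^2 multiplications, the discovered algorithm uses (n-1)(n+2)/2.
    for n in (10, 100, 1000):
        old, new = n * n, (n - 1) * (n + 2) // 2
        print(n, old, new, round(old / new, 2))   # the ratio approaches 2 as n grows
    ```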

  • @videowatching9576
    @videowatching9576 2 года назад +4

    I appreciate the content, so when I feel that I am not having time well spent, then watching and learning about AI on this channel is super useful and interesting.

  • @stevey7997
    @stevey7997 2 года назад +1

    Really great video, thanks! Just a small mistake at 27:48. You look at the wrong entry in U (but I see how that easily happens of course).

  • @vanderkarl3927
    @vanderkarl3927 2 года назад +6

    This almost sounds too good to be true. Improving the speed of matrix multiplication... that basically speeds up EVERYTHING nowadays.

  • @Wobbothe3rd
    @Wobbothe3rd 2 года назад +1

    Thanks for this video! Had just heard of this development but didn't understand it, this explains it well.

  • @carlosrivadulla8903
    @carlosrivadulla8903 2 года назад

    I asked about this on Two Minute Papers yesterday but no one answered me. Thanks for making a video. Such a change to fundamentals should have a cascading impact on all fields of tech and science.

  • @robrobbins
    @robrobbins 2 года назад +5

    I tried to write a Python program to implement their algorithm for multiplying 4x4 matrices in 47 steps, but I suspect they left out the subtractions in their algorithm. There was nothing but additions and multiplications, which is unusual. The numbers I got were way too high. You can find two of the algorithms they found in the PDF accompanying the article.

    • @PrzemyslawDolata
      @PrzemyslawDolata 2 года назад +6

      The 47-step algorithm is for *modular arithmetic* - specifically arithmetic modulo 2 (F2), where numbers "wrap around": e.g. in modulo 5 arithmetic, 4+4 = 3 (it would be 8, but 8 mod 5 is 3). In mod 2, -1 equals +1, so subtraction is literally the same operation as addition - that's why no subtractions appear. And that's why your numbers blew up when you ran it in ordinary arithmetic.
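      A tiny check of why no subtractions are needed in the mod-2 setting (a sketch):
      ```python
      # Modulo 2, -1 is the same as +1, so subtraction coincides with addition.
      p = 2
      for a in range(p):
          for b in range(p):
              assert (a - b) % p == (a + b) % p
      # Running the published mod-2 coefficients in ordinary integer arithmetic
      # is what makes the intermediate values blow up.
      ```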

    • @NoNameAtAll2
      @NoNameAtAll2 2 года назад

      @@PrzemyslawDolata can't one just take really big p?

    • @MagicGonads
      @MagicGonads 2 года назад +4

      @@NoNameAtAll2 I take a really big p sometimes if I drink too much water

  • @GantryG
    @GantryG 2 года назад

    Awesome, thanks for breaking that down for us! 😀

  • @alefratat4018
    @alefratat4018 2 года назад +2

    That's great, but I would still bet that clever product quantization algorithms for approximated matmul are the way to go for faster deep learning.

  • @cmilkau
    @cmilkau 2 года назад +2

    The relationship between matrix multiplication and the tensor T can be written as C = AB = A:T:B. Here, we use double contractions because the "axis" of T corresponding to A (the one indexing its entries a1, a2, a3, a4) is actually two axes (a1=a11, a2=a12, a3=a21, a4=a22), indexing the rows and columns of A, respectively. Same for B. The result has two remaining axes, indexing the rows and columns of the product C.
    Decomposition means A:T:B = A:(u1 ⊗ v1 ⊗ w1 + u2 ⊗ v2 ⊗ w2 + ...):B = A:(u1 ⊗ v1 ⊗ w1):B + A:(u2 ⊗ v2 ⊗ w2):B + ... Here u, v, w are actually matrices the same size as A, B, C, respectively.
    Now A:(u ⊗ v ⊗ w):B can actually be computed using a single multiplication. First, compute A:u, which adds/subtracts entries of A (u only has entries 1, -1 and 0). Second, compute v:B, acting the same way. Multiply the two numbers to obtain the number m = (A:u)(v:B) = A:(u ⊗ v):B. Now you only have to multiply m and w to get m ⊗ w = A:(u ⊗ v ⊗ w):B. Because w also only has entries +1, -1 and 0, m ⊗ w essentially means replacing the +1's in w by m and the -1's by -m.
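    The same statement in NumPy (a small sketch of the double contraction C = A:T:B, assuming row-major vectorization of A, B and C):
    ```python
    import numpy as np

    n = 2
    T = np.zeros((n * n, n * n, n * n))
    for r in range(n):
        for c in range(n):
            for s in range(n):
                T[r * n + s, s * n + c, r * n + c] = 1   # A[r,s] * B[s,c] contributes to C[r,c]

    A, B = np.random.randn(n, n), np.random.randn(n, n)
    C = np.einsum("i,ijk,j->k", A.reshape(-1), T, B.reshape(-1)).reshape(n, n)
    assert np.allclose(C, A @ B)                          # the double contraction A:T:B
    ```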

  • @nohrocks
    @nohrocks 2 года назад +1

    Thank you for making this easy to understand

  • @MrJaggy123
    @MrJaggy123 2 года назад

    I love the home shopping network tunes in the sponsor part ❤️

  • @StoianAtanasov
    @StoianAtanasov 2 года назад +4

    Does this mean that the small speedups in matrix multiplications compound to huge speedups when running a Deep Network model with hundreds of matrix multiplications?

  • @oflasch
    @oflasch 2 года назад

    Nice to see another excellent paper explanation video. Thank you! 😊

  • @russelldicken9930
    @russelldicken9930 2 года назад +1

    Er . . . you lost me partway through, but I think I got the gist of it. I'm getting too old for this. The implications are huge. I just can't wait to see the compiler implementations!

  • @sndengale
    @sndengale 2 года назад +1

    Was waiting for this!

  • @MrKrtek00
    @MrKrtek00 2 года назад

    Very good and simple demonstration of what is happening. so algebra is just a game...

  • @MenkoDany
    @MenkoDany 2 года назад +2

    What a time to be alive!

    • @AlexanderBukh
      @AlexanderBukh 2 года назад +1

      Hold on to your GPUs!

    • @MenkoDany
      @MenkoDany 2 года назад +2

      @@AlexanderBukh Funny you say that, yesterday I sold my 3080 to get a 4090

  • @isaacandrewdixon
    @isaacandrewdixon 2 года назад +3

    AI accelerating AI. Every day we are closer to the singularity.

    • @michaelnurse9089
      @michaelnurse9089 2 года назад +1

      Strictly speaking we get closer but the whole idea of singularity is that every step gets easier on the path. Rather the real situation is that we have made 27 steps on a 100 step path and each step gets harder with costs ballooning exponentially in some way. In other words, there will be no singularity with evolution of neural nets as after 32 steps any further steps costs more than nation states can afford.

    • @isaacandrewdixon
      @isaacandrewdixon 2 года назад

      @@michaelnurse9089 We're so close. The rate of progress is rapidly increasing.

  • @johanlarsson9805
    @johanlarsson9805 2 года назад

    16:00 Glad that you caught your mistake... I was thinking you are in the second "slice", so it should be a2.

  • @-E42-
    @-E42- 2 года назад +2

    amazing how they find out about this now, 1000 trillion matrix multiplications and thousands of math lib implementations later :D

  • @serta5727
    @serta5727 2 года назад +4

    This adds up 😉

  • @AlericResident
    @AlericResident 2 года назад

    Maybe they should use a diffusion-style algorithm to get the tensor decompositions instead of finding them column by column; that gave a tremendous improvement for generating drawings too.

  • @Zed_Oud
    @Zed_Oud 2 года назад +2

    Received October 2021
    Published October 2022
    A year and 3 days.

  • @khoda81
    @khoda81 2 года назад +2

    I can already see articles titled "DeepMind made an ai that can make itself faster!".

  • @i4m3rr0r
    @i4m3rr0r 2 года назад

    I was wondering a lot why multiplication is actually so important, since in hardware multiplying numbers isn't that much worse than adding them (in comparison to, for example, division). The magic seems to happen when we start thinking about big matrices as blocks of smaller matrices: a 20x20 matrix can be thought of as a 4x4 matrix where each entry is a 5x5 matrix. You can then apply your matrix multiplication algorithms in the same way, but now every multiplication that was a scalar multiplication becomes a MATRIX multiplication, whereas an addition just becomes a matrix addition. And since matrix addition is so much cheaper than matrix multiplication, it is really important to have as few multiplications as possible. So these savings matter in the asymptotic case where we multiply two very large matrices and, instead of using the n^3 algorithm, split the matrices smartly into blocks to apply the more optimized algorithms.

  • @derasor
    @derasor 2 года назад +1

    Could an even more efficient method be discovered by a next iteration of AlphaTensor? That would be truly mind-blowing, but if not (meaning there is a hard limit on the most basic operation of machine learning), then there are IMO 2 takeaways from this achievement.
    1- There definitely won't be a fast takeoff when AGI is implemented, and all the fear-mongering about AI being an eXisTEnTiAL thREat was just the overblown conclusion of a very deficient thought experiment.
    2- Hardware engineers have a golden opportunity to design machines that implement this algorithm at the lowest levels.

  • @janasandeep
    @janasandeep 2 года назад +2

    "These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware" as per deepmind blog.

  • @cmilkau
    @cmilkau 2 года назад

    We're using a 3-dimensional tensor here, but it effectively has 6 indices, because each axis corresponds to the entries of one of the 3 matrices (factor A, factor B, product C), and an entry is identified by a row and a column.

  • @AlericResident
    @AlericResident 2 года назад

    Error at 27:45 ... it's the top row of U, not the bottom row, so column 5 causes the cancellation, not column 7.

  • @gaggix7095
    @gaggix7095 2 года назад +11

    Multiplication is usually not more expensive than addition in modern hardware. Hardware floating-point now commonly implements fused multiply-add (FMA) instructions; for example, the 2×2 matrix multiplication can be implemented using 4 regular multiplications, 4 fused multiply-adds, and no separate additions at all. Also, numerical stability can be an issue with these faster algorithms. So at the end of the day the standard O(N^3) approach will remain the way matrix multiplication is implemented, at least for a long time.
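    For illustration, the 4-multiply + 4-FMA arrangement mentioned above (a sketch; math.fma needs Python 3.13+, and on hardware each such call maps to one FMA instruction):
    ```python
    import math  # math.fma is available from Python 3.13

    def matmul2x2_fma(a, b):
        """Naive 2x2 product written as 4 multiplies plus 4 fused multiply-adds, no separate adds."""
        (a11, a12), (a21, a22) = a
        (b11, b12), (b21, b22) = b
        return ((math.fma(a12, b21, a11 * b11), math.fma(a12, b22, a11 * b12)),
                (math.fma(a22, b21, a21 * b11), math.fma(a22, b22, a21 * b12)))
    ```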

    • @nullptrRL
      @nullptrRL 2 года назад +9

      Really cool info, and I think the paper does a good job of addressing that reducing multiplication is only one possible training goal. Another possible training goal is matrix multiplication compute time, so that you can optimize for what the hardware can do fastest.

    • @drdca8263
      @drdca8263 2 года назад +1

      For large integers, I would think that multiplication would have to be slower?
      Like, adding two n digit numbers can be done in O(n) time, and, if you allow two different input tapes for the two inputs, or interleave the digits of the two numbers, can even be done by a finite state machine, but surely this can’t be done for multiplication, right?
      Maybe for the numbers which we practically will compute with on the hardware we have, they might take the same time?
      Would that be true even on eg GPU?

    • @gaggix7095
      @gaggix7095 2 года назад +2

      @@drdca8263 Computers use a fixed number of bits to represent numbers because the ALUs are built to perform addition, multiplication etc. in constant time for n-bit inputs. So yes, if the numbers grow beyond that, more time is needed to perform these operations.

    • @JensDoll
      @JensDoll 2 года назад +4

      To be precise, it depends on the definition of "expensive".
      You can speed up a multiplication for a fixed format like IEEE 754 simply by unrolling the algorithm for each bit directly into a circuit.
      But the complexity of the circuit, and therefore the energy consumption, may be higher.

    • @zyansheep
      @zyansheep 2 года назад +1

      @@gaggix7095 doesn't multiplication still take more clock cycles than addition? And there are definitely more transistors... Would a custom silicon implementation of matrix multiplication (like you find in tensor chips) really not benefit from this research?

  • @muskyoxes
    @muskyoxes 2 года назад +1

    Let's train on 3SAT just in case P=NP (although, if it does, seems like we'd have to train a super long time)

  • @airazure2050
    @airazure2050 2 года назад +2

    Will the new factorizations found in this paper actually make their way into existing optimized BLAS (Basic Linear Algebra Subprograms) libraries?

    • @mareko8940
      @mareko8940 2 года назад +1

      I don't think so. Matrix multiplication is mainly limited due to data loading as far as I'm aware, but please correct me if I'm wrong

    • @himanshupalve3454
      @himanshupalve3454 2 года назад +1

      I don't think so since BLAS libraries don't use strassen algorithm for Matrix multiplication

  • @hola-kx1gn
    @hola-kx1gn 2 года назад

    Love these videos, thank you so much.

  • @hichamalaoui8024
    @hichamalaoui8024 2 года назад

    Instead of multiplying two numbers by converting them to binary digits and doing a long addition operation, we can define a lookup table for single-digit multiplications from 1x1 to 9x9.

  • @TylerThomas
    @TylerThomas 2 года назад

    "That was long winded" about explained how well I understand it now lol

  • @hellokingston-z7g
    @hellokingston-z7g 2 года назад +1

    The progress is really far smaller than the hype. Those in the theory community are not that excited, because the improvement happens not over the real numbers but over a field called F2, which only contains 0 and 1 and is not what most people compute with. Putting this restriction aside, the modest improvement is relative to a result from more than 50 years ago... so who cares.
    From a practical perspective, the value is even smaller. If you really want to do matrix multiplication in real life, you have to consider numerical stability, memory allocation, and a bunch of other issues, which is why the Strassen algorithm is not used in the first place. So the experiment is really unconvincing to me and to those who have experience in high-performance computing.
    It is definitely a cool result, but really not as big as most people think.

  • @yilei1051
    @yilei1051 2 года назад

    You wanted to explain a1*b1*c4 would be cancelled out, and started from a1*b4*c1, then later changed to a4*b4*c1... and it still worked out, good for you XD

  • @eafindme
    @eafindme 2 года назад

    Sometimes, what we need is to just get back to the basic and rediscover.

  • @Obe_Qwaet
    @Obe_Qwaet 2 года назад +10

    I'm actually a little bit mad right now because I proposed doing this during high school and college and was told that's not the correct way to multiply matrices. Seeing an academic paper on it... I'm not sure how to feel atm.

    • @HUEHUEUHEPony
      @HUEHUEUHEPony 2 года назад

      You had low IQ instructors

    • @AlexanderBukh
      @AlexanderBukh 2 года назад +1

      Some faster ways were found decades back, now they only automated finding new ones.

    • @AlexanderBukh
      @AlexanderBukh 2 года назад +2

      Also, isn't it always "the wrong way to do it" until someone comes forward and _proves_ them(sayers) wrong?

    • @kikc
      @kikc 2 года назад

      If I were you I would show this to every human that ever wronged you and then kill them brutally for costing your future

    • @rmajdodin
      @rmajdodin 2 года назад +1

      Next time you have a great idea, just implement it. People love to see sth working.

  • @blanamaxima
    @blanamaxima 2 года назад

    These tricks typically affect performance only marginally. There are other factors in the dataflow that have a huge impact. The 8-bit multiplier is not what consumes energy or area, and these techniques make the hardware less uniform and kill utilization.

  • @thcoura
    @thcoura 2 года назад

    Interesting approach. I haven't read the article yet, so I'm making my point based on the face value of your video.
    Here we go: I have concerns about the total number of operations (+, *) when dealing with float values, and the residual error produced.

    • @peterszilvasi752
      @peterszilvasi752 2 года назад +1

      They did not use float values for the operations. They constrained the entries to "discrete set of coefficients F (for example, F = {−2, −1, 0, 1, 2})". See the DRL for algorithm discovery section on page 48.

  • @NoNameAtAll2
    @NoNameAtAll2 2 года назад +3

    so you got sponsored by voice-to-text company...
    and your video STILL has no subtitles???

  • @siquod
    @siquod 2 года назад +1

    How much energy would have been saved if these multiplication algorithms had been discovered and universally implemented 3 years ago?

  • @sator666666
    @sator666666 2 года назад

    Great explanation. Thank you.

  • @kinngrimm
    @kinngrimm 2 года назад

    I would have been a bit more interested in actual real-world use cases later on
    (or speculations about them, as these may not have been established yet).
    The things that came to mind foremost are prediction models:
    flight-path calculations to save fuel, including temperature and density of the current flight-altitude environment, maybe combined with weather forecast data, plus the shape, weight and engines used in a plane/copter and many other factors which may play a role, but which I suppose could overall be improved by this AlphaTensor approach.
    Then there is spaceflight with orbital dynamics.
    Maybe tensile-strength tests of objects made of different materials or combinations of materials?
    Pressure and dynamics tests within fluids.
    ... the real-life applications seem next to endless, as tensors can be used to address so many things.

  • @ScottSummerill
    @ScottSummerill 2 года назад +2

    Received October 2, 2021?! Published a year later?

  • @miniminerx
    @miniminerx 2 года назад

    And now alpha tensor can run faster to find more improvements. That part is crazy

  • @Ou8y2k2
    @Ou8y2k2 2 года назад +1

    It's been so many decades that I don't even remember what a matrix is. Is it that movie with John Wick?

  • @kaeptnbaloo
    @kaeptnbaloo 2 года назад

    Getting taught about AlphaTensor by Niko Bellic is quite exciting.

  • @simdimdim
    @simdimdim Год назад

    Looking forward to compilers development in 2023

  • @TheJysN
    @TheJysN 2 года назад

    Amazing thanks for the explanation.

  • @paxdriver
    @paxdriver 2 года назад +2

    Why the hell didn't those geniuses at numpy figure this out years ago?! 😭

    • @AlexanderBukh
      @AlexanderBukh 2 года назад +1

      They did, see the table towards the end of the video, the smaller matrices were optimized by people decades ago. This improves larger ones, and by only a few steps.
      52:10

    • @paxdriver
      @paxdriver 2 года назад +1

      @@AlexanderBukh They did, but they didn't find the optimal ones 😜 47 steps vs 49 is a big difference when working with hundreds of thousands of parameters across many layers, repeatedly. I'm surprised they didn't use numpy itself to find more optimizations for numpy. There are only what, like 7 or 11 of them or something, right?

  • @CristianGarcia
    @CristianGarcia 2 года назад

    Click bait title... yeah, why not :)
    Thanks for the video!

  • @drewduncan5774
    @drewduncan5774 2 года назад +2

    Is there any mention of numerical stability of the resulting algorithms in the paper? I've only skimmed it, but I couldn't find anything.

    • @TheSinTube
      @TheSinTube 2 года назад +1

      As far as I understand, the resulting algorithm is just additions and multiplications of the elements of the matrices, and therefore there shouldn't be any stability issues.

    • @drewduncan5774
      @drewduncan5774 2 года назад +1

      @@TheSinTube There definitely can be. Strassen's two-step algorithm is less stable than the naive algorithm, which is why it is often not used in practice despite being asymptotically faster.

    • @MagicGonads
      @MagicGonads 2 года назад

      he mentioned only using coefficients in the set {-2,-1,0,1,2} for numerical stability

    • @drewduncan5774
      @drewduncan5774 2 года назад

      @@MagicGonads Strassen's algorithm in fact only uses {-1, 0 ,1}.

  • @Drakmo79
    @Drakmo79 2 года назад

    I didn't get all of the math but I LOVE the drawing

  • @martinhsl68hw
    @martinhsl68hw 2 года назад

    Thank you for doing that for us

  • @RamRachum
    @RamRachum 2 года назад

    Bro had so much fun multiplying matrices, he almost forgot to talk about the research.

  • @bloomp7999
    @bloomp7999 2 года назад +1

    It's great; I thought mathematicians had already figured out the general rules for the limits of matrix multiplication, but it seems they haven't.

    • @farrael004
      @farrael004 2 года назад +2

      I guess this is more in the realm of computer science than just math (even though both are closely related). There are many tricks computer programs do to get the same mathematical results while being much more computationally efficient.

    • @bloomp7999
      @bloomp7999 2 года назад +1

      @@farrael004 Yes, but in math there is no way around it - things have to be mathematically right, so it's cool that AI can get math right. In comparison, AI-generated art is the opposite of anything mathematically exact; the outputs are always wobbly and wacky, and we can't find mathematical truth in them.

    • @farrael004
      @farrael004 2 года назад +2

      @@bloomp7999 Something can be mathematically right in many different ways. Some ways are just more efficient. One of the goals of computer science is finding these efficient ways to make calculations.

  • @IAMschizoaffective
    @IAMschizoaffective 2 года назад

    Nice mohawk I'm rocking one too bro turok for life🧬

  • @stevengill1736
    @stevengill1736 2 года назад

    Very cool indeed - cheers!

  • @nadahlberg
    @nadahlberg 2 года назад +3

    Can I just say that 3d image during the tensor decomposition bit would have been SO much less confusing if they made the individual slices the C values and used each slice to show the A, B pairs you need to multiply and add to get that C

  • @mailoisback
    @mailoisback 2 года назад +1

    So we trade multiplications for additions? But why are additions faster than multiplications? I think on modern CPUs this is no longer true.

    • @mohaofa1544
      @mohaofa1544 2 года назад

      Try multiplying two numbers in binary and adding two numbers in binary - which one takes more operations to get the desired result? Remember, computers work in binary.

  • @fcasadome
    @fcasadome 2 года назад +1

    Wish I could spend time just "thinking" on stuff like this at my job... and get paid for it :)

    • @michaelnurse9089
      @michaelnurse9089 2 года назад +2

      There are likely jobs that get paid for this type of work: Speeding up 3D rendering with AI (e.g. DLSS) or speeding up compiler e.g. for Microsoft C# come to mind. Speeding up cloud servers at AWS. Very narrow applications but they exist.

  • @nadavmihov
    @nadavmihov 2 года назад +2

    At 28:00, isn't it supposed to be the other column, because you check the first row of U?

    • @mattiasfagerlund
      @mattiasfagerlund 2 года назад +2

      I thought so too, but by pure luck it worked out.

  • @cedricvillani8502
    @cedricvillani8502 2 года назад +1

    The first solver for partial differential equations was a matrix function developed by the founder of MathWorks; in fact the MathWorks logo is that very matrix 😮😮 (better looking 👀 now, but go back and look at the original logo). And yeah, fun Snapple fact 🎉 whatever.

  • @balooleffe
    @balooleffe 2 года назад

    thanks yannic

  • @drdca8263
    @drdca8263 2 года назад +2

    The action space is larger than for go?
    Ok, so if each number entry is in {-2,-1,0,1,2} and each vector has n^2 entries,
    ah, ok, I was initially thinking only n entries, so I was thinking 4 because I was thinking 4x4 matrices.
    But the vectors representing coefficients of each matrix entry would have n^2 entries.
    Ok.
    So, for 2x2 matrices, then there would be (5^(2^2))^3 = (5^4)^3 = 5^12 possible moves, which... is more than 2^24 , which, ok, yeah that’s many more possible moves than there are in Go.
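    Checking that arithmetic (a tiny sketch):
    ```python
    # 2x2 case: u, v, w each have 2*2 = 4 entries, each from {-2, -1, 0, 1, 2}.
    moves = (5 ** 4) ** 3
    print(moves, 2 ** 24)   # 244140625 vs 16777216 -- the per-step action space is far larger
    ```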

    • @MagicGonads
      @MagicGonads 2 года назад +1

      The general action space per move is of size 5^(nm + mp + np) when multiplying an n x m matrix by an m x p matrix (u, v, w have nm, mp and np entries, respectively).

  • @sayamqazi
    @sayamqazi 2 года назад +6

    It actually used to bother me when I first learned about a2 - b2 = (a+b)*(a-b). I had this idea of the weight of operations like multiplication is a heavier operation than addition. So a and b both being possibly unique terms I was amazed that this relationship exists.

    • @andrewferguson6901
      @andrewferguson6901 2 года назад +2

      a small formatting tip, a^2 -b^2 is an expression that i find more easily understood. not to say you're wrong or anything, but it took me a second to figure out what you meant

  • @cmilkau
    @cmilkau 2 года назад

    16:00 dude, it's c1 = a1b1 + a2b3, it's written literally under the picture...

  • @90benj
    @90benj 2 года назад

    I am a bit confused when you talk about a rank-1 tensor, because I thought rank-1 tensors are just vectors. Time to brush up on my linear algebra 101.
    30:27 The tensor shown on the left is not a 3D tensor, it is a 4D rank-3 tensor, as far as I understand it. Imagine each row is one vector, i.e. a rank-1 tensor. If a vector has 2 numbers it's a 2D vector, if it has 3 it's a 3D vector, so a 3D rank-1 tensor. Just because it is visually represented as a 3D object doesn't make it mathematically 3D, I think.

  • @dreamphoenix
    @dreamphoenix 2 года назад

    Thank you.

  • @Myrslokstok
    @Myrslokstok Год назад

    Cool with a matrix cube - when you see a 2D matrix, you start to wonder what a 3D matrix would be.

  • @darkmatter9583
    @darkmatter9583 2 года назад +1

    Would it be possible to show an example of calculating this method manually? This is a game changer also for students in linear algebra. Also, is the code published?

  • @peterszilvasi752
    @peterszilvasi752 2 года назад

    "Neural network architecture that they analyze things with here. It is TRANSFORMER based... who would have thought that?"😅

  • @DamBlairFam
    @DamBlairFam 3 месяца назад

    This feels like it should be hardware-based.

  • @carlosquinones7560
    @carlosquinones7560 2 года назад

    Man Google really loves tensors.

  • @kylepena8908
    @kylepena8908 2 года назад +4

    Do you think we can use this to discover O(N^2) algos for matrix multiplies? I sure hope so.

    • @drdca8263
      @drdca8263 2 года назад +3

      I suspect the minimal possible exponent there is greater than 2.

    • @swordwaker7749
      @swordwaker7749 2 года назад +4

      There is an n^2*log(n) lower bound for matrix multiplication.
      This assumes you use an arithmetic circuit, of course, which only supports multiplication and addition. This still can't rule out weird stuff like logarithms and so on.

    • @Dron008
      @Dron008 2 года назад

      I guess it can be O(1) if you pre-calculate results of multiplications of all possible matrices and store them somewhere. Concatenation of 2 matrices will be an index in this huge array.

    • @kylepena8908
      @kylepena8908 2 года назад +1

      @@Dron008 well erm I guess. But what's your space complexity

    • @ruroruro
      @ruroruro 2 года назад +4

      @@Dron008 what you just described is still at least O(N^2) in the best case. You still have to read both N^2 sized matrices to find your precomputed result.

  • @cbuchner1
    @cbuchner1 2 года назад

    Well great! Now the rabbits will be multiplying even faster.

  • @Frankthegravelrider
    @Frankthegravelrider 2 года назад

    Did you ever explain what you use to make your videos?

  • @karpfenboy
    @karpfenboy 2 года назад

    it actually seems like magic
    great feat