OUTLINE:
0:00 - Intro
1:50 - Sponsor: Assembly AI (link in description)
3:25 - What even is Matrix Multiplication?
6:10 - A very astounding fact
8:45 - Trading multiplications for additions
12:35 - Matrix Multiplication as a Tensor
17:30 - Tensor Decompositions
20:30 - A formal way of finding multiplication algorithms
31:00 - How to formulate this as a game?
39:30 - A brief primer on AlphaZero / MCTS
45:40 - The Results
48:15 - Optimizing for different hardware
52:40 - Expanding fundamental math
53:45 - Summary & Final Comments
TLDR:
## The problem
Given two matrices A, B of size mxn, nxl, find a way to compute AB, such that we do as few scalar multiplications as possible (add and subtract are free).
Reformulate the problem to make it amenable to AI: given the 3D "matrix multiplication tensor" T of size (mn)x(nl)x(ml), find a decomposition into T = ∑i ui⊗vi⊗wi, where each ui, vi, wi is a vector with entries from {-1, 0, 1}.
If this decomposition has r terms, then we have a corresponding way to compute AB using only r scalar multiplications.
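For concreteness, here is a minimal numpy sketch (my own construction, so the index convention may differ from the paper's figures) of the 2x2 matrix multiplication tensor and the contraction that recovers AB:

```python
import numpy as np

n = 2
T = np.zeros((n * n, n * n, n * n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            # the entry A[i, j] * B[j, k] contributes to C[i, k]
            T[i * n + j, j * n + k, i * n + k] = 1

A, B = np.random.randn(n, n), np.random.randn(n, n)
C = np.einsum('a,b,abc->c', A.ravel(), B.ravel(), T).reshape(n, n)
assert np.allclose(C, A @ B)  # contracting T with (flattened) A and B gives AB
```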
### TensorGame
Now create the 1-player TensorGame: given a 3D tensor T, the agent makes moves of the form ui⊗vi⊗wi, where each ui, vi, wi is a vector with entries from {-1, 0, 1}. The state of the game after the move is (T - ui⊗vi⊗wi). The game ends either when the game state is 0, or when the player has played K rounds and thus "ran out of time" (where K is an upper bound on the rank of T). The reward at each step is (-1), to encourage the agent to finish in as few moves as possible. If the agent runs out of time, the final step receives an additional reward of (-r), where r is an upper bound on the rank of the remaining game state.
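A toy sketch of such an environment, just to make the state/reward mechanics concrete (class and method names are my own, and the terminal penalty uses the sum of slice ranks as one possible upper bound; this is not DeepMind's code):

```python
import numpy as np

class TensorGame:
    def __init__(self, T, max_moves):
        self.T0, self.K = np.asarray(T, dtype=float), max_moves

    def reset(self):
        self.T, self.t = self.T0.copy(), 0
        return self.T

    def step(self, u, v, w):
        # play the rank-1 move u⊗v⊗w and subtract it from the current state
        self.T = self.T - np.einsum('i,j,k->ijk', u, v, w)
        self.t += 1
        done = (not self.T.any()) or (self.t >= self.K)
        reward = -1.0  # every move costs one step
        if done and self.T.any():
            # ran out of moves: extra penalty -r, with r an upper bound on the rank
            # of the residual tensor (sum of slice ranks is one possible bound)
            reward -= float(sum(np.linalg.matrix_rank(S) for S in self.T))
        return self.T, reward, done
```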
### Agent
The agent is an RL agent, "AlphaTensor". Like AlphaZero, it uses a deep neural network to guide a Monte Carlo tree search (MCTS) planning procedure. It is trained on randomly generated 3D tensors constructed as sums of random rank-1 terms ∑i ui⊗vi⊗wi; synthetic TensorGames with a known rank bound are easy to generate that way.
## Results
After some training, the resulting agent is able to play the TensorGame very well. When applied to matrix multiplication tensors, it found some ways to do better than previous human-found techniques.
For example, when A, B, are both 2x2, the previous human-found technique is Strassen algorithm, taking 7 multiplications. The agent replicated it. When A, B are both 4x4, the previous best is 49 multiplications (recursive Strassen algorithm). The agent found a 47 multiplication method.
By recursively applying the Strassen algorithm, one can compute (nxn) times (nxn) matrix multiplication in O(n^{log_2(7)}) = O(n^{2.8074}) time. I went through their table of results, and found that the best exponent they have found is the "47 scalar multiplications for 4x4 matrix multiplication". This one achieves asymptotic complexity of O(n^{log_4(47)}) = O(n^{2.777}).
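The exponent arithmetic behind those numbers is just the standard recursion argument: r multiplications for k x k blocks gives O(n^{log_k(r)}). Quick check:

```python
import math

print(math.log(7, 2))   # ≈ 2.8074  Strassen: 7 multiplications for 2x2
print(math.log(49, 4))  # ≈ 2.8074  Strassen applied recursively: 49 for 4x4
print(math.log(47, 4))  # ≈ 2.7773  AlphaTensor: 47 for 4x4 (over modular arithmetic)
```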
Consider what they achieved: It took humans 10 years to go
from O(n^{2.8074}) in Strassen (1969)
to O(n^{2.780}) in Bini, Capovani, Romani (1979).
and they managed to do it by an entirely general RL search. I have no doubt that if they can find a way to cast the Coppersmith-Winograd algorithm as a game, the RL will out-perform it as well.
I scoured their papers and couldn't find out how much time it took to train the agent, other than that it was somewhere around 1 million steps. By Google standards, I guess that means it took them a month. So: 1 month vs 10 years.
BUT WAIT THERE'S MORE
They also found an asymptotically optimal algorithm for multiplying an nxn skew-symmetric matrix by a vector. The previous SOTA (2018) was known to take at most ~n^2 multiplications (see "Fast structured matrix computations: tensor rank and Cohn-Umans method"). They found a method taking only 0.5(n-1)(n+2) multiplications, which is asymptotically optimal, since an nxn skew-symmetric matrix has only n(n-1)/2 ≈ 0.5 n^2 independent entries.
BUT WAIT THERE'S MORE
They ran the same algorithm, but this time scoring its strategy not by how many scalar multiplications it takes, but by how much time it takes to run on particular hardware (GPU or TPU). It found improvements here as well, and these improvements are specific to the GPU or the TPU (using the algorithm optimized for GPU on a TPU yields a much smaller improvement, and vice versa).
> AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for.
@@CosmiaNebula Thanks for this TL;DR~
@@CosmiaNebula TL;DR
🖥 🧠 🔢 ⏫ 😱
@@CosmiaNebula but are the found algorithms proven afterwards? Because finding an algorithm is half the work. Proving it works, not just on some examples but conceptually, is the rest. I don't understand how they can use discovered algorithms if they cannot prove they are correct.
@@pouet4608 verifying the algorithms is trivial algebra
Actually when the fast matrix multiplication algorithm is applied recursively (to big matrix), not just the multiplication number reduces, so is the addition. This is because each sub-matrix multiplication requires super-linear O(m^(2+eps)) operations, while sub-matrix addition still takes O(m^2). So we are not trading multiplications for faster additions - both operations decrease.
Yes - this!! Please upvote the OP's comment! It should be emphasized that these improvements are not just constant factors - they improve the asymptotic complexity of matrix multiplication for certain matrix sizes in a practical way!
+1, Yannic should've at least mentioned Strassen's recursive matrix multiplications algorithm.
@@nonamehere9658 Yannick shows it in the paper.
no
thank you, Yannic. I love these paper explaining videos. it is really hard for me to focus on reading papers. You explain the main points very clearly in a fraction of time it would take me to read it.
We literally don’t have time to read every great paper these days. Summaries like these do save the day.
Production quality on this video was fantastic. For instance, the simple things like the blue background and circle are nice touches.
The reason that addition is "cheap" is not so much about CPUs; those can multiply numbers in a single cycle, too. It's because the entries are usually matrices themselves (instead of numbers), and adding matrices is just adding their entries.
Full story: it's when you use these results to multiply huge matrices. You do this using "block matrix multiplication". For instance, to multiply two 1000x1000 matrices, you can reinterpret them as 2x2 matrices whose entries are 500x500 matrices ("blocks") themselves. Obviously, if you use a 2x2 multiplication formula, each multiplication in that formula actually corresponds to multiplying two 500x500 matrices (which you can do by recursively applying your formula). The naive formula takes n×n×n multiplications to multiply two n×n matrices, but only n×n additions to add two such matrices. As you can see, this is a huge difference for large n. This difference stays huge even for "clever" versions of multiplication.
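In code, the block-recursive idea looks roughly like this (a minimal Strassen sketch of my own, assuming square matrices with power-of-two size and a naive fallback for small blocks; not tuned for real use):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:          # small blocks: just use the naive product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 block multiplications instead of 8; all other work is block addition
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.randn(256, 256), np.random.randn(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```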
@@leeroyjenkins0 Ok, maybe, but I don't like emphasis put on that because it really isn't as relevant. The matrices that are actually relevant in state-of-the-art ANN have n in the thousands to tens of thousands. A direct Tensor decomposition would be in the order of millions to billions actions. So, the block multiplication case is the real relevant case.
Of course, if you want to do actual graphics, 4x4 is your most common size and directly improving that can vastly improve the characteristics of your graphics chip or driver.
-2 and 2 make sense as you can use a bitshift for the multiplication, so it is basically free and gives a bigger space of possible solutions.
@@h..h can you reformulate the question?
@@h..h Why not? Just add 1 to the exponent, it's still much faster than general floating point multiplication.
I guess so are -4, 4, -8, 8, or any ±2^n. But I guess the capacity to search in this larger space is more limited.
@@vladimirtchuiev2218 I think so too! However, my intuition also is, that the benefits of these high factors are not as great as adding the factors 2/-2.
As a new CS student this kind of content is very interesting, even though I don't understand everything. It motivates me to keep learning.
Thank you Yannic!
@Greg LeJacques w-what..?
Thank you Yannic, I missed these older style paper explainers
"A matrix is a square box of numbers" - AlphaTensor can also multiply rectangular matrices, and the paper gives an example. Of course 2x2 and maybe 4x4 matrices are of specific interest because they can be easily generalized and implemented in algorithms for large matrices.
I think it's also important beyond the domain they applied it to (matrix multiplication algorithms), since it shows that reinforcement learning via deep learning is strong enough to explore and optimize over extremely large discrete spaces. I expect that in the future most NP-hard problems we want to tackle will be attacked with these approaches rather than with hand-crafted heuristics.
real
nphard has incomplete m-sets, but solutions are provable in post
One thing I have wondered: we use multiplications because it's mathematically convenient. But what if we use some cheaper operations on the layer inputs, and "just learn" to parametrize them correctly. Who cares if it's exact multiplication - what matters is that the NN gets the job done. The learning could maybe work around the "incorrect but fast" multiplication
"Analogue computers" do that -- they exploit the non-linearity of silicon diode in the 0..1V region of conduction.
As long as you have some nonlinear operation, you can have a neural network. Multiplication happens to be a convenient one for **digital** computers.
No, you would have to go through a NN to estimate it, which is itself composed of a bunch of matrix multiplications; you'd end up doing much more work than just doing the matrix multiplication directly.
You would need a “incorrect but fast” backprop as well, it’s not clear to me that backprop still works if you break multiplication
Karatsuba's algorithm traded multiplications for additions on numbers back in 1963,
and now DeepMind has done the same for matrices with AlphaTensor in 2022.
Amazing
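For reference, a compact sketch of Karatsuba's trick (my own illustration, assuming nonnegative integers): one of the four sub-products is traded for a few extra additions and subtractions.

```python
def karatsuba(x, y):
    if x < 10 or y < 10:          # single digit: multiply directly
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    hi_x, lo_x = divmod(x, 10 ** m)
    hi_y, lo_y = divmod(y, 10 ** m)
    z0 = karatsuba(lo_x, lo_y)
    z2 = karatsuba(hi_x, hi_y)
    # the third recursive product replaces the two cross terms
    z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2
    return z2 * 10 ** (2 * m) + z1 * 10 ** m + z0

assert karatsuba(1234, 5678) == 1234 * 5678
```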
An analogue of Karatsuba's algorithm for matrices has existed for decades already;
DeepMind just found an even better one.
Hey thanks for explaining this in a way I can almost understand. English tip for you in appreciation: "long-winded" is not about winding it's about the wind blowing. Dunno why.
27:40 I think there's a mistake: the modification comes from the 5th column, not the 7th, because we're looking at the tensor where the top row of u is non-zero
Thank you for your videos. You're the Agadmator of ML/AI.
This is a fantastic coincidence! I was trying to figure out similar generalized forms you mentioned in the beginning, but from the perspective of a computer architecture data dependency issue several months ago. I shelved the idea because after a few weeks I found it difficult to justify the time to spend on it. This solution might be a much better fit if it really works! I'm so excited to read this paper and start work towards implementing it.
Are you a descendant of the great Rolf Landauer ?!
@@maloxi1472 I sometimes fantasize that I have some connection to the Landauer limit. Sadly, I'm unaware of any familial connection. 😅
This kind of fine-grained explanation is such a great service - thank you!!
That's a great result. I have been thinking about machine learning training other machine learning algorithms, but using machine learning to improve other (non-ML) algorithms is also an interesting approach, and it has now been shown to be practically possible.
I think, with the dawn of quantum computers, such problems will be blown away. Matrix multiplication consists of many scalar products (dot products of a row with a column). With the help of superposition and quantum parallelism, it will deliver results almost instantly.
I would have liked to hear your comments on the structured matrix multiplication part of the paper. This is potentially the biggest speedup to linear algebra presented in the paper. Their skew-symmetric matrix algorithm beats the state of the art by a factor of 2. Not by 2 multiply operations: half as many as the previous best! Unfortunately, the structured matrix section of the paper read as a bit vague to me. In particular, the "extrapolation" of the skew-symmetric algorithm to larger matrices looked like it was done manually by a human. It would be really cool to see an ML algorithm automate the extrapolation of those algorithms. To me that would be the holy grail of AI-optimized linear algebra.
I appreciate the content, so when I feel that I am not having time well spent, then watching and learning about AI on this channel is super useful and interesting.
Really great video, thanks! Just a small mistake at 27:48. You look at the wrong entry in U (but I see how that easily happens of course).
This almost sounds too good to be true. Improving the speed of matrix multiplication... that basically speeds up EVERYTHING nowadays.
Thanks for this video! Had just heard of this development but didn't understand it, this explains it well.
I asked about this on Two Minute Papers yesterday but no one answered me. Thanks for making a video. Such a change in fundamentals should have a cascade impact on all fields of tech and science.
I tried to write a Python program to implement their algorithm for multiplying 4x4 matrices in 47 steps, but I suspect they left out the subtractions in their algorithm. There was nothing but additions and multiplications, which is unusual. The numbers I got were way too high. You can find two of the algorithms they found in the PDF accompanying the article.
The 47-step algorithm is for *modular arithmetic*, specifically arithmetic modulo 2 (the field Z_2). In modular arithmetic all numbers are taken "modulo p" for some number p, i.e. the numbers "wrap around": in modulo-5 arithmetic, 4+4 = 3 (it would be 8, but 8 mod 5 is 3). Modulo 2, subtraction is the same operation as addition, which is why no subtractions are necessary, and it's also why the numbers blew up when you ran the algorithm in ordinary arithmetic.
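A two-line sanity check of that point (plain modular arithmetic, nothing from the paper): modulo 2, subtraction and addition coincide, so a formula written with additions only can still be exact over Z_2.

```python
for x in range(2):
    for y in range(2):
        assert (x - y) % 2 == (x + y) % 2  # subtraction == addition mod 2
```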
@@PrzemyslawDolata can't one just take really big p?
@@NoNameAtAll2 I take a really big p sometimes if I drink too much water
Awesome, thanks for breaking that down for us! 😀
That's great, but I would still bet that clever product quantization algorithms for approximated matmul are the way to go for faster deep learning.
The relationship between matrix multiplication and the tensor T can be written as C = AB = A:T:B. Here we use double contractions because the "axis" of T corresponding to A (the one indexing its entries a1,a2,a3,a4) is actually two axes (a1=a11, a2=a12, a3=a21, a4=a22), indexing the rows and columns of A, respectively. Same for B. The result has two remaining axes, indexing the rows and columns of the product C.
Decomposition means A:T:B = A:(u1 ⊗ v1 ⊗ w1 + u2 ⊗ v2 ⊗ w2 + ...):B = A:(u1 ⊗ v1 ⊗ w1):B + A:(u2 ⊗ v2 ⊗ w2):B + ... Here u,v,w are actually matrices the same size as A,B,C, respectively.
Now A:(u ⊗ v ⊗ w):B can actually be computed using a single multiplication. First, compute A:u, which adds/subtracts entries of A (u only has entries 1,-1 and 0). Second, compute v:B, acting in the same way. Multiply the two numbers to obtain the number m = (A:u)(v:B) = A:(u ⊗ v):B. Now you only have to multiply m and w to get m ⊗ w = A:(u ⊗ v ⊗ w):B. Because w also only has entries +1,-1 and 0, m ⊗ w essentially means replacing the +1's in w by m and the -1's by -m.
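Here is that recipe spelled out in numpy for Strassen's rank-7 decomposition of the 2x2 case. The factor matrices below are my own transcription of Strassen's formulas (so treat them as illustrative rather than copied from the paper's tables):

```python
import numpy as np

# rows of U, V are the u_i, v_i acting on flattened A, B; rows of W give the c entries
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]])
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
W = np.array([[1, 0, 0, 1, -1, 0, 1],
              [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0],
              [1, -1, 1, 0, 0, 1, 0]])

A, B = np.random.randn(2, 2), np.random.randn(2, 2)
m = (U @ A.ravel()) * (V @ B.ravel())   # the 7 scalar multiplications
C = (W @ m).reshape(2, 2)               # everything else is addition/subtraction
assert np.allclose(C, A @ B)
```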
Thank you for making this easy to understand
I love the home shopping network tunes in the sponsor part ❤️
Does this mean that the small speedups in matrix multiplications compound to huge speedups when running a Deep Network model with hundreds of matrix multiplications?
Oh shiiii…
Nice to see another excellent paper explanation video. Thank you! 😊
Er . . . you lost me partway through, but I think I got the gist of it. I'm getting too old for this. The implications are huge. I just can't wait to see the compiler implementations!
Was waiting for this!
Very good and simple demonstration of what is happening. so algebra is just a game...
What a time to be alive!
Hold on to your GPUs!
@@AlexanderBukh Funny you say that, yesterday I sold my 3080 to get a 4090
AI accelerating AI. Every day we are closer to the singularity.
Strictly speaking we get closer but the whole idea of singularity is that every step gets easier on the path. Rather the real situation is that we have made 27 steps on a 100 step path and each step gets harder with costs ballooning exponentially in some way. In other words, there will be no singularity with evolution of neural nets as after 32 steps any further steps costs more than nation states can afford.
@@michaelnurse9089 We're so close. The rate of progress is rapidly increasing.
16:00 Glad that you caught your mistake. I was thinking you were in the second "slice", so it should be a2.
amazing how they find out about this now, 1000 trillion matrix multiplications and thousands of math lib implementations later :D
This adds up 😉
Maybe they should use a diffusion-style algorithm to get the tensor decompositions instead of finding them column by column; that gave a tremendous improvement for generating drawings too.
Received October 2021
Published October 2022
A year and 3 days.
I can already see articles titled "DeepMind made an ai that can make itself faster!".
I was wondering a lot why multiplication is actually so important, since in hardware multiplying numbers isn't that much worse than adding them (compared to, for example, division). The magic seems to happen when we start thinking about big matrices as blocks of smaller matrices: a 20x20 matrix can be thought of as a 4x4 matrix where each entry is a 5x5 matrix. You can then apply your matrix multiplication algorithms in the same way, but now every multiplication that was a scalar multiplication becomes a MATRIX multiplication, whereas an addition is just a matrix addition. And since matrix addition is so much cheaper than matrix multiplication, it is really important to have as few multiplications as possible. So these savings matter in the asymptotic case, where we multiply two very large matrices and, instead of using the n^3 algorithm, split the matrices smartly into blocks to apply the more optimized algorithms.
Could an even more efficient method be discovered by a next iteration of AlphaTensor? That would be truly mind-blowing, but if not (meaning there is a hard limit on the most basic operation of machine learning), then there are IMO 2 takeaways from this achievement.
1- There definitely won't be a fast take-off when AGI is implemented, and all the fear-mongering about AI being an eXisTEnTiAL thREat was just the overblown conclusion of a very deficient thought experiment.
2- Hardware engineers have a golden opportunity to design machines that implement this algorithm at the lowest levels.
"These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware" as per deepmind blog.
We're using a 3-dimensional tensor here but it's actually 6 dimensions because each axis corresponds to an entry in one of the 3 matrices (factor A, factor B, product C), and an entry is identified by a row and a column.
Error at 27:45 ... its the top row of U, not the bottom row, so column 5 causes the cancel, not column 7.
Multiplication is usually not more expensive than addition in modern hardware. Hardware floating-point now commonly implements fused multiply-add (FMA) instructions; for example, the 2×2 matrix multiplication can be implemented using 4 regular multiplications, 4 fused multiply-adds, and no separate additions at all. Also, numerical stability can be an issue with these faster algorithms. So at the end of the day the standard O(N^3) method will remain the way matrix multiplication is implemented, at least for a long time.
Really cool info, and I think the paper does a good job of addressing that reducing multiplication is only one possible training goal. Another possible training goal is matrix multiplication compute time, so that you can optimize for what the hardware can do fastest.
For large integers, I would think that multiplication would have to be slower?
Like, adding two n digit numbers can be done in O(n) time, and, if you allow two different input tapes for the two inputs, or interleave the digits of the two numbers, can even be done by a finite state machine, but surely this can’t be done for multiplication, right?
Maybe for the numbers which we practically will compute with on the hardware we have, they might take the same time?
Would that be true even on eg GPU?
@@drdca8263 computers use a fixed number of bits to represent numbers because the ALUs are built to perform addition, multiplication, etc. in constant time on inputs of that width. If the numbers grow beyond the n bits the hardware accepts, then yes, more time is needed to perform these operations.
To be precise, it depends on the definition of "expensive".
You can speedup a multiplication for a fixed format like IEEE754. Simply by unrolling the algorithm for each bit directly into a circuit.
But the complexity of the circuit and therefore the energy consumption may be higher.
@@gaggix7095 doesn't multiplication still take more clock cycles than addition? And there are definitely more transistors... Would a custom silicon implementation of matrix multiplication (like you find in tensor chips) really not benefit from this research?
Let's train on 3SAT just in case P=NP (although, if it does, seems like we'd have to train a super long time)
Will new factorizations found in this paper actually make their own ways into existing optimized BLAS (Basic Linear Algebra Subprograms) libraries?
I don't think so. Matrix multiplication is mainly limited due to data loading as far as I'm aware, but please correct me if I'm wrong
I don't think so, since BLAS libraries don't use the Strassen algorithm for matrix multiplication.
Love these videos, thank you so much.
Instead of multiplying two numbers by converting them to binary digits and doing a long addition procedure, we can define a lookup table for single-digit multiplications from 1x1 to 9x9.
"That was long winded" about explained how well I understand it now lol
The progress is really far smaller than the hype. Those in the theory community are not that excited, because the improvement happens not over the real numbers but over a field called F2, which only contains 0 and 1 and is not what most people use. Putting this restriction aside, the modest improvement is with respect to a result from more than 50 years ago. So who cares.
From a practical perspective, the value is even smaller. If you really want to do matrix multiplication in real life, you have to consider numerical stability, memory allocation, and a bunch of other issues, and this is why the Strassen algorithm is not used in the first place. So the experiment is really unconvincing to me and to those who have experience in high-performance computing.
It is definitely a cool result, but really not as big as most people think.
You wanted to explain a1*b1*c4 would be cancelled out, and started from a1*b4*c1, then later changed to a4*b4*c1... and it still worked out, good for you XD
Sometimes, what we need is to just get back to the basic and rediscover.
I'm actually a little bit mad right now because I proposed doing this during high school and college and was told that's not the correct way to multiply matrices. Seeing an academic paper on it... I'm not sure how to feel atm.
You had low IQ instructors
Some faster ways were found decades back, now they only automated finding new ones.
Also, isn't it always "the wrong way to do it" until someone comes forward and _proves_ them(sayers) wrong?
If I were you I would show this to every human that ever wronged you and then kill them brutally for costing your future
Next time you have a great idea, just implement it. People love to see sth working.
These tricks typically affect performance only marginally. There are other factors in the data flow that have a huge impact. The 8-bit multiplier is not what consumes energy or area, and these techniques make the hardware less uniform and kill utilization.
Interesting approach. I haven't read the article yet, so I'm making my point based on the face value of your video.
Here we go: I have a concern about the total number of operations (+, *) when dealing with float values, and the residual error produced.
They did not use float values for the operations. They constrained the entries to "discrete set of coefficients F (for example, F = {−2, −1, 0, 1, 2})". See the DRL for algorithm discovery section on page 48.
so you got sponsored by voice-to-text company...
and your video STILL has no subtitles???
How much energy would have been saved if these multiplication algorithms had been discovered and universally implemented 3 years ago?
Great explanation. Thank you.
I would have been more interested in actual use cases later on in the real world
(or speculations about them, as these may not have been established yet).
The things that came to mind foremost are prediction models:
flight path calculations to save fuel, including temperature and density of the current flight-height environment, maybe combined with weather forecast data; then the shape, weight and engines used in a plane/copter, and many other factors which may play a role, but which I would suppose could overall be improved by this AlphaTensor approach.
Then there is space flight with orbital dynamics.
Maybe tensile strength tests of objects with different materials or combinations of materials?
Pressure and dynamic tests within fluids.
... the real-life applications seem next to endless, since tensors can be used to address so many things.
Received October 2, 2021?! Published a year later?
And now alpha tensor can run faster to find more improvements. That part is crazy
It's been so many decades that I don't even remember what a matrix is. Is it that movie with John Wick?
Getting teached from Niko Bellic about Alpha Tensor is quite exciting.
Looking forward to compilers development in 2023
Amazing thanks for the explanation.
Why the hell didn't those geniuses at numpy figure this out years ago?! 😭
They did, see the table towards the end of the video, the smaller matrices were optimized by people decades ago. This improves larger ones, and by only a few steps.
52:10
@@AlexanderBukh they did but they didn't find the most optimal ones 😜 47 steps vs 48 is a huge difference when working with hundreds of thousands of parameters across many layers and repeating. I'm surprised they didn't use numpy itself to find more optimizations for numpy. There are only what, like 7 or 11 of them or something, right?
Click bait title... yeah, why not :)
Thanks for the video!
Is there any mention of numerical stability of the resulting algorithms in the paper? I've only skimmed it, but I couldn't find anything.
As far as I understand, the resulting algorithm is just simple addition and multiplication of the elements in the matrices, and therefore there shouldn't be any stability issues.
@@TheSinTube There definitely can be. Strassen's two-step algorithm is less stable than the naive algorithm, which is why it is often not used in practice despite being asymptotically faster.
he mentioned only using coefficients in the set {-2,-1,0,1,2} for numerical stability
@@MagicGonads Strassen's algorithm in fact only uses {-1, 0 ,1}.
I didn't get all of the math but I LOVE the drawing
Thank you for doing that for us
Bro had so much fun multiplying matrices, he almost forgot to talk about the research.
It's great. I thought mathematicians had already figured out the general rules for the limits of matrix multiplication, but it seems they haven't.
I guess this is more in the realm of computer science than just math (even though both are closely related). There are many tricks computer programs do to get the same mathematical results while being much more computationally efficient.
@@farrael004 yes but in math there are no way around things have to be mathematically right, so that's cool ai can do math things right. because in comparison, ai generated art is the opposite of anything mathematically right, the outputs are always wobbely and whafky, and we can't find mathematical truth in it
@@bloomp7999 Something can be mathematically right in many different ways. Some ways are just more efficient. One of the goals of computer science is finding these efficient ways to make calculations.
Nice mohawk I'm rocking one too bro turok for life🧬
Very cool indeed - cheers!
Can I just say that the 3D image during the tensor decomposition bit would have been SO much less confusing if they had made the individual slices the C values and used each slice to show the A, B pairs you need to multiply and add to get that C.
So we trade multiplications with additions? But why are additions faster than multiplications? I think on modern CPUs this is no longer true.
Try to multiply two numbers in binary and then add two numbers in binary: which one takes more operations to get the desired result? Remember, computers work in binary.
Wish I could spend time just "thinking" on stuff like this at my job... and get paid for it :)
There are likely jobs that get paid for this type of work: Speeding up 3D rendering with AI (e.g. DLSS) or speeding up compiler e.g. for Microsoft C# come to mind. Speeding up cloud servers at AWS. Very narrow applications but they exist.
At 28:00, isn't it supposed to be the other column, because you check the first row in U?
I thought so too, but by pure luck it worked out.
The first solution for partial differential equations was a matrix function developed by the founder of MathWorks; in fact the MathWorks logo is that very matrix 😮😮 better looking 👀 now, but go back to the original logo.. and ya, fun Snapple Fact 🎉 whatever
thanks yannic
The action space is larger than for go?
Ok, so if each number entry is in {-2,-1,0,1,2} and each vector has n^2 entries,
ah, ok, I was initially thinking only n entries, so I was thinking 4 because I was thinking 4x4 matrices.
But the vectors representing coefficients of each matrix entry would have n^2 entries.
Ok.
So, for 2x2 matrices, then there would be (5^(2^2))^3 = (5^4)^3 = 5^12 possible moves, which... is more than 2^24 , which, ok, yeah that’s many more possible moves than there are in Go.
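Back-of-the-envelope check of that count (my own arithmetic, not from the paper):

```python
moves = (5 ** 4) ** 3   # 5 coefficient choices, 4 entries per vector, 3 vectors
print(moves)            # 244140625, roughly 2.4e8
print(moves > 2 ** 24)  # True
```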
the general action space has size 5^(nm + mp + np) when multiplying an n x m matrix by an m x p matrix (so 5^(3n^2) in the square case), since u, v, w index the entries of A, B, and C respectively
It actually used to bother me when I first learned about a2 - b2 = (a+b)*(a-b). I had this idea of the weight of operations like multiplication is a heavier operation than addition. So a and b both being possibly unique terms I was amazed that this relationship exists.
a small formatting tip, a^2 -b^2 is an expression that i find more easily understood. not to say you're wrong or anything, but it took me a second to figure out what you meant
16:00 dude, it's c1 = a1b1 + a2b3, it's written literally under the picture...
I am a bit confused when you talk about a rank-1 tensor, because I thought rank-1 tensors are just vectors. Time to brush up on my linear algebra 101.
30:27 The tensor shown on the left is not a 3D tensor, it is a 4D rank-3 tensor, as far as I understand it. Imagine one row is one vector, so a rank-1 tensor. If a vector has 2 numbers, it's a 2D vector; if it has 3, it's a 3D vector, so a 3D rank-1 tensor. Just because it is visually represented as a 3D object doesn't make it mathematically 3D, I think.
Thank you.
Cool, a matrix cube. When you see a 2D matrix, you start thinking: what about a 3D matrix?
Would it be possible to see an example of calculating this method manually? This is a game changer for linear algebra students as well. Also, is the code published?
"Neural network architecture that they analyze things with here. It is TRANSFORMER based... who would have thought that?"😅
This feels like it should be hardware-based.
Man Google really loves tensors.
Do you think we can use this to discover O(N^2) algos for matrix multiplies? I sure hope so.
I suspect the minimal possible exponent there is greater than 2.
There is an n^2 * log(n) lower bound for n x n matrix multiplication.
This assumes you use an arithmetic circuit, of course, which only supports multiplication and addition. This still doesn't rule out weird stuff like logarithms and so on.
I guess it can be O(1) if you pre-calculate results of multiplications of all possible matrices and store them somewhere. Concatenation of 2 matrices will be an index in this huge array.
@@Dron008 well erm I guess. But what's your space complexity
@@Dron008 what you just described is still at least O(N^2) in the best case. You still have to read both N^2 sized matrices to find your precomputed result.
Well great! Now the rabbits will be multiplying even faster.
Did you ever explain what you use to make your videos?
it actually seems like magic
great feat