To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/PurpleMind/ . You’ll also get 20% off an annual premium subscription.
Cunningham's Law: "The best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer." Kolmogorov demonstrated Cunningham's Law before the internet existed lol.
On the graph, you can see that the bend for the python code curve starts at 2^6, which is incidentally the same as 64. This is because on your computer - assuming you use a 64 bit computer - there is a chunk of silicon that has the multiplication algorithm burnt into it. Burning it in the silicon has the advantage that it can do all the additions in parallel, so the time is no longer dependent on the number of simple operations. When python has higher numbers and needs to use Karatsuba algorithm, it only breaks down numbers until they are 64 bit, and then sends that to the cpu, as the cpu always does its multiplication at the same speed, no matter how many digits you need.
I'm dumb af, I'm not yet ready to instinctively assume that 64 "digits" can be binary digits lmao, good catch! But then he's wrong when he says it's just the basic algorithm being applicated here, it just uses an hardware instruction, and any "big number" uses karatsuba's algorithm anyway
so after thinking more about it, there's no way python uses a simple mul instruction because that wouldn't handle overflow, but maybe it makes multiple 32 bits multiplications in a 64 bits register so that it can't overflow?
@thomquiri9860It probably depends on the architecture, but x86_64 does have a 64-bit multiplication instruction. It stores the full result in 2 registers.
I'm surprised he didn't mention it since it corresponds to the Word model and "doing some basic arithmetic on single digit numbers" that he mentioned in the introduction at 1:30. As you described it, the 64-bit CPU architecture has physical components that ensure constant time operation on 64-bit numbers. So in this architecture, 64-bit numbers (which we call a word) are the actual "single digit" numbers that cost exactly one operation. And this graph is a direct confirmation that the 64-bit version of Python performs as it should by delegating all 64-bit operations to the CPU and cutting bigger numbers into 64-bit blocks.
8:22 I didn't need to pause the video. a*c and b*d can be calculated first, then stored in memory and used for the "middle step", saving computation time. The proof of turning 100(a*c + b*d) into 100[ (a+b)*(c+d) - a*c - b*d] isn't nearly as obvious though, unless you know the mathematical trick for it I think 🤔. I haven't tried to do it yet 🤷♂️.
@@louisrobitaille5810 yeah, It felt counter intuitive to add b and c to the terms, and then subtract the products of the addition again. That to me is the genius, it was a flashback to wonderfully simple math learned 30 years ago ;)
However this .bit_length() method is really good but it gives you 1+floor(log(n)) and couldn’t give you rational number It will be good for discrete cases
@@mohamedahmed-rk5ez Why? There is an exact number of bits. Even if it rounded up to the next whole byte, it would still be an integer, ie. rational. Seems like python would be faster just to count the number of bytes or whatever int-class is used. Right off, I could just take the max_index of the array that holds the number, multiply that by the int_size, then subtract off each zero bit at the top. Always an integer answer, no slow FP functions needed. Am I smarter than the python programmers? Probably not.
btw, the schonhage-streissen algorithm uses FFT / NTT which is another genius very practical O(n log n) algorithm for multiplying polynomials, but when applying it to number multiplication, we run into precision issues or the NTT modulo becomes too small beyond some point, thats why there is an extra log log n factor.
@@anon_y_mousse being funny is an extrinsic property (a relationship between an audience and the joke), not an intrinsic one. So, you can't define what is a joke or not based on how funny it is to you.
@@anon_y_mousse Jokes can simply be unfunny to a demographic. The universality claim can also be easily disproven. For any arbitrary joke, there's at least one person who thinks it is funny- the author if it.
I wish you would have mentioned that a+b and c+d have one more digit (or bit) than each of the halves, and how that affects (or actually doesn't affect) the scaling. And surprisingly it doesn't even mess up the "powers of two" scheme that hardware is typically aimed at. Your most elementary unit that you eventually fall back to just has to be able to multiply numbers that are 1 longer than you'd think from dividing your number's length by 2 to the power of the number of "Karatsuba divisions" that you do.
That is correct! I did actually mention this in a red blurb on the screen at that time but for the sake of concision I decided to leave it as a footnote of sorts.
I mean as much as this detail is interesting, about anyone can figure it out by themselve while watching the video. When there's 'one more digit' due to addition, this digit is always a 1 no matter the base, and this is especially easy to take into account in base 2.
This video is so well made! The flow of information in this is perfect, when you were talking about time complexity I was in fact recalling about galactic algorithms and how the constant multiple could sometimes grow so large seemingly faster functions could be outperformed by "slower" ones for data used in practice, and then you mentioned about galactic algorithms at the end and I was utterly delighted. The flow of info was exceedingly good going from introducing about algorithms and mathematics and then going into a story form talking about history of its discovery for the breakthrough idea, deriving everything from scratch and explaining the reason behind everything, showing the practical implementation in Python and testing it, and even after that not stopping there and explaining the scenario of development and improvement on algorithms beyond that and showing how useful research is. Amazing video, I'm impressed!
@20:04 - "efficient multiplications with thousands of digits are now essential to public Key Cryptographic systems like RSA that keep your information secure." This is not accurate, because RSA is NOT used to encrypt user data. RSA is only used to encrypt the user's session key for a symmetric algorithm such as AES, Triple DES, DES, Blowfish, etc. The *average* session key used on the Internet is 200 bits, which is about 25 decimal digits. That's not a big number at all, by this video's definition of "big." Having to multiply a 25 digit number by another 25 digit number using any algorithm just once for a single SSL session (eg, one web page) is not going to make any difference to the user's experience. That's my opinion.
@@seheyt That's true, but in order to get a 4096-bit RSA key you have to multiply two 1024-bit prime numbers, p and q, then append the product of two other 1024-bit numbers, p-1 and q-1. And my point was that you have to do this process only once per session, not once per block of user text. Moreover, in this video, he said that these faster algorithms are better only when you have thousands of decimal digits, and 1024 bits is only 128 digits. So there's no performance benefit to the newest algorithms for something that's done only once every web page.
@@TheNameOfJesus Fair analysis. It was rather irrelevant to point out the session key though :) I wonder what the role of FFT/convolution based algorithms is in this picture. I assume they may not be suited for exact integers (rather for numerical computation)
I figured out the two digit at a time multiplication and a few things not mentioned in the video when I was 14 y/o in 1982. I was rotating points of a tie fighter on an Apple II+ 6502 CPU. I think I got it down to 20 clock cycles to do a 16 bit multiplication on an 8 bit CPU that took 1 to 5 clocks to do a single operation on the CPU. Reached 50,000 mults per second. Then rotating 3D points required sine and cosine so I had to optimize that as well plus square roots. Lots of work and experimenting. I could rotate 35 or so 3D points on Darth’s Tie Fighter at 3 frames a second including time to clear the screen and draw the lines one pixel at a time in assembly code. This was 1 frame per second faster than Woz who built a 3D rotation utility program. I didn’t realize the importance of what I was doing, was just trying to rotate a tie fighter to show my geeky friends.
How funny! At 14, I figured out how to manually calculate square roots. My aha moment came when i did 9/3=3. That can be seen as an equilibrium. Divide by something smaller than the square root and you get something larger. So 9/2=4,5. Just calculate the average and repeat (2+4.5)/2=3,25 which is better. Similar to the babylonian method but more intuitive. Best of all is that it can be extended to do cube roots. Or fourth roots. I did nothing, not even telling my lousy math teacher who I thought would say that someone had told me this.
In university we had a method that involved: reducing the inputs modulo a series of primes, where a large power of 2 divided p-1. Then do a finite field Fourier transform, add the results, and reconstruct using Chinese remainder theorem. I can't remember the cost, but I think the reason had to do with the maximizing the computational benefit of the word size on the computer.
Saying "it's impossible to improve on this" is the cheat code to get people to find a better solution, just so they can prove you wrong. Reminds me of posting a question and then giving yourself a wrong answer on purpose, because people would rather prove someone wrong than help someone else 😆
13:12 If you want a concrete interpretation of why O(n^log2(3)) shows up, look into the Master Theorem. Basically, for any recursive divide-and-conquer algorithm that's leaf-heavy, all you need to find the complexity class is to know the number of problems it splits into at each level and the new problem size. So every recursion of Karatsuba calls Karatsuba 3 times, each with a problem size of n/2. That's where the 2 and 3 come from, and that's also why naive multiplication which calls itself 4 times becomes O(n^log2(4))=O(n^2).
A simple divide and conquer algorithm has a log(n) tree with n leafs. Karatsuba has the lo-hi vs hi-lo mul making it not a simple case. The Master Theorem formally proves it but it was already obvious
1:44 allocating is not a neutral example of "accessing memory", and is usually several orders of magnitude more costly than the other examples of prototypical "operations"
GPUs use the naive O(n^3) for matmul, because that one can be done in parallel. Recursive (divide and conquer) does not work great with specialized circuits.
Wrong. Papers are published showing Strassen and Winograd outperform naive algorithms for as little as 2048×2048 matrices. Either you've been living under a rock for 20 years or rely only on free public tools which aren't providing state of the art.
They use special parallell algorithms for matrix multiplication. Which is the clue of their speed. As all hardware implementations I know, yes GPUS also use some sort of O(n^2) algorithm for plain multiplication.
The recursive splitting is something I came up with independently while trying to figure out how to build a multiplier that could be used to either do two n-bit multiplications in parallel or a single 2n-bit multiplication. I ended up on that track after noticing that a lot of GPU specs on Wikipedia listed half-precision performance being double that of single-precision, and wondering how Nvidia and AMD did it. That factor of two difference suggested the possibility of parallel operations, and while it was easy to figure out for addition, multiplication had me stumped for quite a while before I realised you could do a 2-digit multiplication as 4 single-digit ones with appropriate adding and scaling of the results, and that the same thing could be translated into circuitry by treating blocks of bits like digits for smaller multipliers. I eventually came up with something in Logisim Evolution that would do a single 32-bit multiplication, or two 16-bit ones, or four 8-bit ones.
Man, how in the world do you not have more views and subscribers???? Like seriously, the animations beautiful, the math interesting, and the explanations are chopped down into more digestible chunks.
The last term in the sum is n*1.5^k where k=log_2(n) =n*(2^log_2(1.5))^k =n*2^(log_2(1.5)*k) =n*(2^k)^log_2(1.5) =n*n^log_2(1.5) =n^(1+log_2(1.5)) =n^(log_2(1.5*2)) =n^log_2(3) So the last term in the sum, the bottom row of the recursion, takes asymptotically the same amount of compute as the whole thing. (up to a factor of 3.)
This video beautifully illustrates how a mathematical breakthrough can change the course of technology. Karatsuba’s Algorithm remains a cornerstone in computational theory. Pairing these insights with SolutionInn’s study tools is a fantastic way to deepen understanding.
Quick question: At 12:14, shouldn't it be from 1 to log2(N)? I don't think I can summarize well my line of thought, but I will give it a try. Suppose we have two numbers of two digits (N = 2) being multiplied. To calculate it, we divide them into 3 subproblems of size N=1, so the total work would be 3 * c * 1. As the stated summation goes from K=0 to K = 1, there would be two steps (3 * c * 1 + c * 2), while in reality only one is actually made, at K = 0 we already have the result, so no work is done in that bit. Well, this is my counter example. Now as for logically speaking, the algorithm's last step is summing up the calculated subproducts obtained from the step before, said last step is of size 3^1*N/2, after that we have the result and there is no step further to add for the work. If we made one more step, its work would be c * N, as if we had to do some calculations with the result. I know it is just semantics, and in the end everything will be eaten up by the bigO notation, but I just want to make sure I got everything right.
Indeed, the master theorem is a very powerful tool. I chose to do the analysis from scratch so it's more transparent where the result is coming from, without an additional generalization added in.
Most Python implementations use Comba multiplication rather than the schoolboy method for numbers too small to resort to the Karatsuba method. It has the same big-O time as the schoolboy method, but takes an aggressive approach to carry propagation that reduces the constant factor significantly. (The recursion in the Karatsuba logic also bottoms out into Comba, rather than going all the way down to single precision.) Lovely explanation! Subscribed.
Time Complexity is such a difficult topic. This video doesn't do it justice, but it's an entire field of math that specializes in optimizing math. It's weird how a field of math based on "being practical" could become so theoretical, it stops being practical...
17:05 But why does the blue line's slope not change instantly at 2^7 digits? If Python's algorithm switches at that point, then shouldn't we see an instant snap in slope to match Karatsuba's algorithm? Why does it curve for a little bit? Edit: Ok so using background knowledge, Python has a similar case with sorting algorithms. I know there is a naive O(n^2) sorting approach: insertion sort, and a refined O(n*log(n)) sorting approach: merge sort. However, insertion sort is significantly faster at small array lengths compared to merge sort. To take advantage of this, Tim Peters implemented Timsort into Python, an ingenious algorithm that essentially executes insertion and merge sorts in parallel. This implementation is, interestingly, almost identical to Python's built-in multiplication. I want to ask a question regarding this, and while I know it is 100% answered somewhere online, I have other things to do. I can't take another rabbit hole. This might be a terrible question, but here goes. In that curved region after 2^7 digits, does Python use the naive and Karatsuba algorithms _simultaneously?_
OH MY GOD YOU WRITE A SET INCLUSION SYMBOL INSTEAD OF EQUALITY WHEN TALKING ABOUT TIME COMPLEXITY, FINALLY SOMEONE THAT DOES IT PROPERLY!!!!!!!!!! THANK YOU!!!!!!!!
One minor nitpick though: big O notation is an upper bound, so not all algorithms that are O(n) look like a line. For example, any algorithm that is O(log n) is also O(n), but it doesn't look like a line
This video reminds me how I got good at multiplying by 9. Just add a 0 at the end and then subtract the original number! Somehow just trying to multiply by 9 without that was too much headache.
Zero at the end of what? Youboeft it kind of vague sorry..or isnt itneasier or maybe as easy and as clear as ypur method is just replace 9nwith 10 minus 1..that's what i think of and Inthink as effective as your method..
I am not sure about the initial behaviour of the builtin multiplication at 17:00. 2**-22 s is about 10**-7 s, which is horribly slow for a simple multiplication, which should be done in ns (10**-9 s). Probably some function calling overhead is the bottle neck at that point, where it also is important how you benchmark functions (measuring around or in loop? which time function? threading? etc) In Python I definitely would compile the functions with numba or alike, to reduce the calling overhead. The math is presented really well though :)
My guess is this measurement is biased because of incompressible time of API calls handling datastructure transfert or transformation from the interpretive language to the machine code.
10-7 is 100 ns, which is to be expected when you're working at arbitrary precision, even at less than a machine word of precision, an arbitrary-precision multiplication can't just do a hardware multiply and nothing else: you have to follow a pointer to each number, determine the length of each, decide what algorithm to use, retrieve the actual value from memory (if it indeed fits in a machine word and can thereby be loaded into a register), and only then can you perform the hardware multiplication, and then you have to store it back into memory. Even without any function call overhead, the bookkeeping will be expensive, as will the memory accesses if they miss in cache (especially if they get punted all the way down to L3 or RAM. A single access that misses in cache can easily eat 100 cycles).
@@trueriver1950 I am not doubting the validity of the O notation, but people might draw wrong conclusions from the example shown. For example the location of the "kink" will be shifted due to the offset, leading to wrong conclusions about where the transition happens, which actually does impact the presented explanation. I did not expect an in-depth performance tutorial, but hinting at caveats would have been nice. I am just trying to provide a clearer picture, point out pit-falls and what should be expected when people try to do similar studies out of curiosity.
@@JonBrase As the presented code does not show the data type used, it is absolutely possible more book keeping is done than I expected, plus I am not familiar with performance levels of arbitrary precision modules in Python. However, Python does have a very heavy footprint when it comes to calling functions, I would not even look at different cache levels for this issue. Also keep in mind, that even for a single-digit multiply the cost is allegedly 100ns, which to me is absurd, so I still doubt this without seeing more details. However, it would be important to know how the performance is measured, otherwise we are just guessing. I shot myself into the foot once, because the rng for the data was part of the loop, and as the algorithm got faster (rewriting and compiling) that was the limiting factor at some point. The amount of input data pre-calculated and stored then also affects the cache levels used during the loop, assuming we measure the sequential throughput.
This makes me more interested in division algorithms. I wrote my own division algorithm for arbitrarily large integers decades ago, but it was not efficient at all. It would have been quicker to subtract logarithms and then ant-log the result.
@19:30 It is like sorting. Qsort and mergesort have a complexity of O(n log n) but sleepsort has a complexity of O(n). The latter is a lot slower than the former.
Amazingly, in five decades of professionally writing code, i never needed to know any of this. Interestingly, low-level math works differently on numbers you can express using binary bits rather than strings of digits.
In the eighties there was a considerable (though minority) interest in doing everything in interest arithmetic. The language Forth is probably the best place to start if you want to understand how that worked back in the day. Numbers like pi and e would be represented as rationals, that is as ordered pairs of integers, rather than as floats. The introduction of the floating point co processor, and the later incorporation of floating point arithmetic into the standard CPU was more programmer-friendly so soon won the battle. However with sufficient programmer skill it would be possible even now to get better performance using only integer-rational arithmetic. That would entail scientists, engineers, and accountants all learning more integer number theory than they seem to want to...
Python's built-in multiplication just uses the CPU for integers that fit in 64bits. The CPU multiplies all bits in the two registers in parallel and process the result in o(1) time. See booth's algorithm.
Even if we implemented long multiplication in a CPU, I am surprised that would be O(N*N). Because all it has to do, is a max of N additions for N bits. And addition itself doesn't do addition in serial as that would be way too slow. It does fast addition, with look ahead carry, to parallelize addition. That being said, CPU's still don't use long multiplication I think, as there's faster methods like this video talks about. I just don't think a CPU doing shift adds in binary with fast adders is O(N*N). Feel free to correct me.
Big-O notation describes how an algorithm behaves as N trends toward infinity. Modern CPUs do have fast multiplication and addition ***for specific finite-sized inputs*** (say, 64-bit integers). On arbitrarily large inputs, long multiplication is still O(N*N). Let's assume that some CPU can do a 64-bit multiply-shift-add in 1 clock cycle. Long multiplying 128-bit still requires 4 mult-shift-adds, 256-bit requires 16, etc. The inputs double in size, time complexity quadruples. Still O(N*N).
Hey! When is the voronoi diagram puzzle solution coming? I have been thinking about it for a while now and I would love to see the answer. Amazing videos btw!
2:57 you should say it looks *evermore* like a straight line *as* you put in arbitrarily larger numbers. n+cbrt(n) is O(n) but it very much isn't a line.
There will always be steps in the line, due to the points where one or both numbers cross the boundary and need an extra "digit". That means that in formal maths the slope does not come to a limit, as you can't shrink delta indefinitely small. What you can do is to shrink (delta/n) arbitrarily small, so you can find the value of slope/n.
ohhh, ya. u remember building my own LargeInt and LargeFloat types in c++ and js. my js arbitrary math system is still open source. multiplication is actually pretty well solved, it's division that has specific potential for improvement depending on specific inputs. iirc I think I came up with a multiplication system that used powers of 2 and an accumulator to reach O(n) speed technically... real world testing showed that chunking multiplication was many times faster despite being O(n^2) I originally wrote my when I wanted a fully homebrew rsa implementation for my fully home brew irc esque chat system so I could have end to end encryption. tested everything with phps gmp class
@@PurpleMindCSCan you PLEWSE share how or why Laratsuba would come up with thia and hiw can I come upmwith something like thst orngresterand be a math genius. I won't settle for anything less. Thanks for sharing and hope to hear frlm you.
It is astonishing that Kolmogorov believed that multiplication would not be possible faster than quadratic time, since fast convolution via fast Fourier transform was already known. Ok, Gauss' fast Fourier method was not widely known and Good-Thomas maybe was also not widely known and Schönhage-Strassen came up only few years after Cooley-Tukey.
When messing around with how to mathmaticaly represent the number of single digit multiplications you have to do in the unoptimized method, i discovered that x^log2(n) is the same as n^log2(x) (it also works with other logs) and now I'm going to spend the next hour trying to figure out why that works
do note the O(nlog(n)) of Harvey Hoven algorithm appears to have a huge constant, such that it becomes optimal only for large n on the order of 10^^2 (10^100)
I would expect the standard form of multiplication would be the fastest in binary given a number with a small bitcount (total 1s), so that might be taken into consideration when choosing Karatsuba or not. With a low enough bitcount it would end up being just a few additions in binary.
7:15 if i am understanding how this works it like separating each digits to their 10^n places? Like 1234 is broken down to 1000 200 30 4 and then individually multiplied to 5000 600 70 8?
For explanatory purposes yes. In a computer the break down would be into powers of two rather than ten. (Exception below) Or on some machines that are operated for 8bit manipulation the smallest growing might turn out to be in groups of 8 bits. However, some machines also implement BCD arithmetic (or at least used to: for example the IBM 360 and 370 series, where fixed point integers of arbitrary length were stored in decimal, using 4 bits for each digit: hence the name "binary coded decimal"). Those machines could natively multiply arbitrarily long BCD integers in a single machine instruction, a non-interruptible instruction that typically took huge numbers of CPU cycles to complete. These were much favoured in the COBOL era, as there's a direct mapping from the COBOL data types that don't tend to exist in science oriented languages. Whether the microcode implemented the naïve algorithm or something better is an interesting question, which I never bothered to ask back in the day. Anyone with a preserved 370 is welcome to do the timing tests ...
A look up table in base 2 is trivial, and in effect already used in all binary multipliers which are hardwired that 1×1=1, and any other inputs give zero. A look up table in base 10 is plausible, and even in base 100 world only need 10k bytes (5.5k if optimized for commutativity) However that would still be slower for 8 digit numbers than a CPU-native 64bit multiply, giving a result of 64 bits and a carry of another 64bits. Same for other subsets of the 64 bits Can we store ALL the possible results of multiplying arbitrary 64 bit numbers? To store these results in a look up table would take 32 billion billion bytes, which is not practical, even with attached storage that would slow the thing down hugely. So for bases higher than base 2 your idea, however ingenious, is out performed by crunching at the hardware level. The CPU designers will have optimised that to some extent by pipelining and parallel execution, but the time taken is typically linear.
would it make sense to hash the sub-results instead of calculating them over and over again? the longer the numbers get, the higher the changes to find sub-multiplications, you've already did - is checking a dict/hash table more expensive then just calculating the result again?
Isn’t there also a hardware development coupled with the theory which serves to bring the constant down, such as parallelizing all the simple multiplications?
In the real implementation, the third line on the final graph, that has been done by the folks who designed the CPU. Python almost certainly uses native CPU for "small" numbers, and also for computing the partial products of the larger numbers. So that's kind of bundled into the almost linear time part of the third line.
YEah, but those are usually trivial implementation details. No big gains just trying to optimize that. Reducing constant time thru Parallelization only help in very very few cases, you need especific parallel algorithms.
I enjoyed atumbling upon your channel! Great video. I got a question for you. Are there any similar smart "shortcuts" for exponentials? I.e. x^y where x and y are both large? What if only y is large? What about x=y, or in other words x^x?
That's a thornier problem yet :) And CPUs don't just use fast algorithms, e.g. SRT, but use lookup tables. See the infamous Pentium division bug for how that went wrong quite a long time ago.
Curious. I thought this video was gonna be about how does the hardware multiply numbers. I remember having heard somewhere that it just does a loop where it adds the bigger of the 2 numbers to itself n times, equal to the lower number. But this sounds rather inefficient. I will look it up, but I guess that at the very least you could do a bit shift left with the most significant bit of the smaller number prior to doing the loop (less times). I actually made some calculations on paper to see if it would work, and it does. The algorithm I came up with was: Bit shift left the bigger number amount of times, equal to position of the most significant bit of the smaller number, and turn that bit to 0 in the smaller one. Then repeat the operation and keep adding the results together until the smaller number becomes 0. For example: 21 (10101) * 23 (10111) = 483 (111100011) 1) result = 10111 111001100 3) result = result + 10111
Hi, I just realized that I omitted a word in the title which I had meant to include. Now the title of the video is: The Genius Way Computers Multiply **Big** Numbers. So yes, at a low level, hardware can multiply numbers up to a certain size in parallel. But the topic of the video is algorithmic approaches for multiplying big numbers with thousands of digits.
Yes they do things like that, in logic, superfast, on a single CPU clock. But that's still O(n*n) complexity. THe operation still depends on the size of both numbers.
What would interest me now is what performs better on the task of multiplying a large number with a small number lets say a number with more than 1000 digits with one with less than 2. Is the school way better or karatsubas method?
I'm pretty sure the python slope for numbers below 9-10 digits changes because it changes from hardware multiplication to software multiplication. The hardware one probably uses the naive method, but it may be something else depending on hardware.
Doing something B or B-1 times means that the number of operations is proportional to B, so this method would actually be O(n). In addition, addition is linear time, so multiplying those time complexities together yields O(n^2) time complexity.
Below 64-bits for each number, on modern CPUs, it is faster to allow the silicon to multiply (however it does it), than to implement any sort of algorithm. Likely, that is what python is doing.
To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/PurpleMind/ . You’ll also get 20% off an annual premium subscription.
NOOOOOOOOOOOOOO!!!!!!!!!
STOP!!
How to have the offer free for 30 days?
Cunningham's Law: "The best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer." Kolmogorov demonstrated Cunningham's Law before the internet existed lol.
Lol
No , the best way to get a flood of unwanted email ads is to post an answer on Reddit .
@@edwardmacnab354 Cunningham's Law in action right here
@edwardmacnab354 wait what?? could you explain more
On the graph, you can see that the bend for the python code curve starts at 2^6, which is incidentally the same as 64. This is because on your computer - assuming you use a 64 bit computer - there is a chunk of silicon that has the multiplication algorithm burnt into it. Burning it in the silicon has the advantage that it can do all the additions in parallel, so the time is no longer dependent on the number of simple operations.
When python has higher numbers and needs to use Karatsuba algorithm, it only breaks down numbers until they are 64 bit, and then sends that to the cpu, as the cpu always does its multiplication at the same speed, no matter how many digits you need.
I'm dumb af, I'm not yet ready to instinctively assume that 64 "digits" can be binary digits lmao, good catch! But then he's wrong when he says it's just the basic algorithm being applicated here, it just uses an hardware instruction, and any "big number" uses karatsuba's algorithm anyway
so after thinking more about it, there's no way python uses a simple mul instruction because that wouldn't handle overflow, but maybe it makes multiple 32 bits multiplications in a 64 bits register so that it can't overflow?
@thomquiri9860It probably depends on the architecture, but x86_64 does have a 64-bit multiplication instruction. It stores the full result in 2 registers.
I'm surprised he didn't mention it since it corresponds to the Word model and "doing some basic arithmetic on single digit numbers" that he mentioned in the introduction at 1:30.
As you described it, the 64-bit CPU architecture has physical components that ensure constant time operation on 64-bit numbers. So in this architecture, 64-bit numbers (which we call a word) are the actual "single digit" numbers that cost exactly one operation. And this graph is a direct confirmation that the 64-bit version of Python performs as it should by delegating all 64-bit operations to the CPU and cutting bigger numbers into 64-bit blocks.
8:18 Speaker: I'd encourage you to pause the video and see if you can figure that out.
Me: No, no, please go on.
mood!
8:22 I didn't need to pause the video. a*c and b*d can be calculated first, then stored in memory and used for the "middle step", saving computation time. The proof of turning 100(a*c + b*d) into 100[ (a+b)*(c+d) - a*c - b*d] isn't nearly as obvious though, unless you know the mathematical trick for it I think 🤔. I haven't tried to do it yet 🤷♂️.
I actually paused the video, figured it out, and continued the video only to hear him suggesting I would pause the video.
@@projekcja Been there! I didn't with this one, but I've totally done that. :)
@@louisrobitaille5810 yeah, It felt counter intuitive to add b and c to the terms, and then subtract the products of the addition again. That to me is the genius, it was a flashback to wonderfully simple math learned 30 years ago ;)
In Knuth's "The Art of computer Programming" Volume 2 has a Chapter 4.3.3. "How Fast Can We Multiply?". It is a very interesting read.
FYI Python integers have a `.bit_length()` method you could have used instead of converting them to a string and taking their length
Ah! I did not realize that. Was looking for something like this. Thanks :)
Thanks I'll use this
However this .bit_length() method is really good but it gives you 1+floor(log(n)) and couldn’t give you rational number
It will be good for discrete cases
Damn! Nice!
@@mohamedahmed-rk5ez Why? There is an exact number of bits. Even if it rounded up to the next whole byte, it would still be an integer, ie. rational. Seems like python would be faster just to count the number of bytes or whatever int-class is used.
Right off, I could just take the max_index of the array that holds the number, multiply that by the int_size, then subtract off each zero bit at the top. Always an integer answer, no slow FP functions needed. Am I smarter than the python programmers? Probably not.
This video is ridiculously well made! holy crap! The animations and scripts are beautiful and incredible!
Thanks so much! Glad you enjoyed.
It probably uses panim
I know your type. I bet you're one of those pancake-eaters the machine seems to like so much.
It looks similar, if not identical, to 3Blue1Brown’s open source animation tool.
@@JordanMetroidManiac It is VERY similar.
btw, the schonhage-streissen algorithm uses FFT / NTT which is another genius very practical O(n log n) algorithm for multiplying polynomials, but when applying it to number multiplication, we run into precision issues or the NTT modulo becomes too small beyond some point, thats why there is an extra log log n factor.
Based pfp btw
Nice of you to Grant exposure to a small channel like 3Blue1Brown❤
P.S. this is a joke
@@kaustubhpandey1395 I think what you meant to say was that you *intended* for it to be a joke, but jokes have to be funny.
@@anon_y_mousse being funny is an extrinsic property (a relationship between an audience and the joke), not an intrinsic one. So, you can't define what is a joke or not based on how funny it is to you.
@@denki2558 There's enough universality that you often can, and let's face it, the majority of dad "jokes" are universally reviled.
@@anon_y_mousse Jokes can simply be unfunny to a demographic. The universality claim can also be easily disproven. For any arbitrary joke, there's at least one person who thinks it is funny- the author if it.
I wish you would have mentioned that a+b and c+d have one more digit (or bit) than each of the halves, and how that affects (or actually doesn't affect) the scaling.
And surprisingly it doesn't even mess up the "powers of two" scheme that hardware is typically aimed at. Your most elementary unit that you eventually fall back to just has to be able to multiply numbers that are 1 longer than you'd think from dividing your number's length by 2 to the power of the number of "Karatsuba divisions" that you do.
That is correct! I did actually mention this in a red blurb on the screen at that time but for the sake of concision I decided to leave it as a footnote of sorts.
I mean as much as this detail is interesting, about anyone can figure it out by themselve while watching the video. When there's 'one more digit' due to addition, this digit is always a 1 no matter the base, and this is especially easy to take into account in base 2.
This video is so well made! The flow of information in this is perfect, when you were talking about time complexity I was in fact recalling about galactic algorithms and how the constant multiple could sometimes grow so large seemingly faster functions could be outperformed by "slower" ones for data used in practice, and then you mentioned about galactic algorithms at the end and I was utterly delighted. The flow of info was exceedingly good going from introducing about algorithms and mathematics and then going into a story form talking about history of its discovery for the breakthrough idea, deriving everything from scratch and explaining the reason behind everything, showing the practical implementation in Python and testing it, and even after that not stopping there and explaining the scenario of development and improvement on algorithms beyond that and showing how useful research is. Amazing video, I'm impressed!
@20:04 - "efficient multiplications with thousands of digits are now essential to public Key Cryptographic systems like RSA that keep your information secure." This is not accurate, because RSA is NOT used to encrypt user data. RSA is only used to encrypt the user's session key for a symmetric algorithm such as AES, Triple DES, DES, Blowfish, etc. The *average* session key used on the Internet is 200 bits, which is about 25 decimal digits. That's not a big number at all, by this video's definition of "big." Having to multiply a 25 digit number by another 25 digit number using any algorithm just once for a single SSL session (eg, one web page) is not going to make any difference to the user's experience. That's my opinion.
You still need the RSA keys which are routinely 4096 bits
@@seheyt That's true, but in order to get a 4096-bit RSA key you have to multiply two 1024-bit prime numbers, p and q, then append the product of two other 1024-bit numbers, p-1 and q-1. And my point was that you have to do this process only once per session, not once per block of user text. Moreover, in this video, he said that these faster algorithms are better only when you have thousands of decimal digits, and 1024 bits is only 128 digits. So there's no performance benefit to the newest algorithms for something that's done only once every web page.
@@TheNameOfJesus Fair analysis. It was rather irrelevant to point out the session key though :) I wonder what the role of FFT/convolution based algorithms is in this picture. I assume they may not be suited for exact integers (rather for numerical computation)
10 bits is 3 decimal digits (2^10 = 1024) so 200 bits is about 60 decimal digits.
@@TheNameOfJesus No, 1024 bits corresponds to about 308 decimal digits.
I figured out the two digit at a time multiplication and a few things not mentioned in the video when I was 14 y/o in 1982. I was rotating points of a tie fighter on an Apple II+ 6502 CPU. I think I got it down to 20 clock cycles to do a 16 bit multiplication on an 8 bit CPU that took 1 to 5 clocks to do a single operation on the CPU. Reached 50,000 mults per second. Then rotating 3D points required sine and cosine so I had to optimize that as well plus square roots. Lots of work and experimenting. I could rotate 35 or so 3D points on Darth’s Tie Fighter at 3 frames a second including time to clear the screen and draw the lines one pixel at a time in assembly code. This was 1 frame per second faster than Woz who built a 3D rotation utility program. I didn’t realize the importance of what I was doing, was just trying to rotate a tie fighter to show my geeky friends.
How funny! At 14, I figured out how to manually calculate square roots. My aha moment came when i did 9/3=3. That can be seen as an equilibrium. Divide by something smaller than the square root and you get something larger. So 9/2=4,5. Just calculate the average and repeat (2+4.5)/2=3,25 which is better. Similar to the babylonian method but more intuitive. Best of all is that it can be extended to do cube roots. Or fourth roots. I did nothing, not even telling my lousy math teacher who I thought would say that someone had told me this.
That does sound like a lot of work..was it.modtly fun and enjoyable? Thanks for sharing.
@@TheFrewah And then there's the guy who can just get the answer in his head with zero calculation
In university we had a method that involved: reducing the inputs modulo a series of primes, where a large power of 2 divided p-1. Then do a finite field Fourier transform, add the results, and reconstruct using Chinese remainder theorem. I can't remember the cost, but I think the reason had to do with the maximizing the computational benefit of the word size on the computer.
I could feel the joy you have in your explanation. Thanks.
Saying "it's impossible to improve on this" is the cheat code to get people to find a better solution, just so they can prove you wrong.
Reminds me of posting a question and then giving yourself a wrong answer on purpose, because people would rather prove someone wrong than help someone else 😆
13:12 If you want a concrete interpretation of why O(n^log2(3)) shows up, look into the Master Theorem. Basically, for any recursive divide-and-conquer algorithm that's leaf-heavy, all you need to find the complexity class is to know the number of problems it splits into at each level and the new problem size. So every recursion of Karatsuba calls Karatsuba 3 times, each with a problem size of n/2. That's where the 2 and 3 come from, and that's also why naive multiplication which calls itself 4 times becomes O(n^log2(4))=O(n^2).
A simple divide and conquer algorithm has a log(n) tree with n leafs. Karatsuba has the lo-hi vs hi-lo mul making it not a simple case. The Master Theorem formally proves it but it was already obvious
1:44 allocating is not a neutral example of "accessing memory", and is usually several orders of magnitude more costly than the other examples of prototypical "operations"
It depends on the memory allocator being used.
GPUs use the naive O(n^3) for matmul, because that one can be done in parallel. Recursive (divide and conquer) does not work great with specialized circuits.
Wrong. Papers are published showing Strassen and Winograd outperform naive algorithms for as little as 2048×2048 matrices. Either you've been living under a rock for 20 years or rely only on free public tools which aren't providing state of the art.
They use special parallell algorithms for matrix multiplication. Which is the clue of their speed.
As all hardware implementations I know, yes GPUS also use some sort of O(n^2) algorithm for plain multiplication.
The recursive splitting is something I came up with independently while trying to figure out how to build a multiplier that could be used to either do two n-bit multiplications in parallel or a single 2n-bit multiplication.
I ended up on that track after noticing that a lot of GPU specs on Wikipedia listed half-precision performance being double that of single-precision, and wondering how Nvidia and AMD did it.
That factor of two difference suggested the possibility of parallel operations, and while it was easy to figure out for addition, multiplication had me stumped for quite a while before I realised you could do a 2-digit multiplication as 4 single-digit ones with appropriate adding and scaling of the results, and that the same thing could be translated into circuitry by treating blocks of bits like digits for smaller multipliers.
I eventually came up with something in Logisim Evolution that would do a single 32-bit multiplication, or two 16-bit ones, or four 8-bit ones.
woah! as i was watching i didn't realize just how underrated this channel was! subbed, and great work on the video!!
Never expected this type of great explanation. One of the greatest channel I found on computation.
Thanks so much! Glad you liked it.
Man, how in the world do you not have more views and subscribers???? Like seriously, the animations beautiful, the math interesting, and the explanations are chopped down into more digestible chunks.
Thanks so much!
I dunno exactly what font MANIM defaults to (TNR offshoot?) but it always makes me feel warm inside. 😌
This is such a great video. Can't wait to see the CS content you will come up with next.
Thanks so much! I'm looking forward to it as well :)
1:59 the numbers being added are the first 6 digits of pi and e
Multiplied by 100,000 of course. 😁
The last term in the sum is
n*1.5^k where k=log_2(n)
=n*(2^log_2(1.5))^k
=n*2^(log_2(1.5)*k)
=n*(2^k)^log_2(1.5)
=n*n^log_2(1.5)
=n^(1+log_2(1.5))
=n^(log_2(1.5*2))
=n^log_2(3)
So the last term in the sum, the bottom row of the recursion, takes asymptotically the same amount of compute as the whole thing. (up to a factor of 3.)
Indeed! There's a special name for that (it's called a leaf-dominated recurrence).
This video beautifully illustrates how a mathematical breakthrough can change the course of technology. Karatsuba’s Algorithm remains a cornerstone in computational theory. Pairing these insights with SolutionInn’s study tools is a fantastic way to deepen understanding.
Quick question: At 12:14, shouldn't it be from 1 to log2(N)? I don't think I can summarize well my line of thought, but I will give it a try.
Suppose we have two numbers of two digits (N = 2) being multiplied. To calculate it, we divide them into 3 subproblems of size N=1, so the total work would be 3 * c * 1. As the stated summation goes from K=0 to K = 1, there would be two steps (3 * c * 1 + c * 2), while in reality only one is actually made, at K = 0 we already have the result, so no work is done in that bit.
Well, this is my counter example. Now as for logically speaking, the algorithm's last step is summing up the calculated subproducts obtained from the step before, said last step is of size 3^1*N/2, after that we have the result and there is no step further to add for the work. If we made one more step, its work would be c * N, as if we had to do some calculations with the result.
I know it is just semantics, and in the end everything will be eaten up by the bigO notation, but I just want to make sure I got everything right.
13:14 you can also use the master theorem.
It's T(n) = 3 T(n / 2) + cn.
And I c is just 3n, or maybe 5n with inefficient caries.
Indeed, the master theorem is a very powerful tool. I chose to do the analysis from scratch so it's more transparent where the result is coming from, without an additional generalization added in.
Most Python implementations use Comba multiplication rather than the schoolboy method for numbers too small to resort to the Karatsuba method. It has the same big-O time as the schoolboy method, but takes an aggressive approach to carry propagation that reduces the constant factor significantly. (The recursion in the Karatsuba logic also bottoms out into Comba, rather than going all the way down to single precision.)
Lovely explanation! Subscribed.
The more official name for "the schoolboy method" is "long multiplication".
Time Complexity is such a difficult topic. This video doesn't do it justice, but it's an entire field of math that specializes in optimizing math.
It's weird how a field of math based on "being practical" could become so theoretical, it stops being practical...
17:05
But why does the blue line's slope not change instantly at 2^7 digits? If Python's algorithm switches at that point, then shouldn't we see an instant snap in slope to match Karatsuba's algorithm? Why does it curve for a little bit?
Edit: Ok so using background knowledge, Python has a similar case with sorting algorithms. I know there is a naive O(n^2) sorting approach: insertion sort, and a refined O(n*log(n)) sorting approach: merge sort. However, insertion sort is significantly faster at small array lengths compared to merge sort. To take advantage of this, Tim Peters implemented Timsort into Python, an ingenious algorithm that essentially executes insertion and merge sorts in parallel.
This implementation is, interestingly, almost identical to Python's built-in multiplication. I want to ask a question regarding this, and while I know it is 100% answered somewhere online, I have other things to do. I can't take another rabbit hole. This might be a terrible question, but here goes.
In that curved region after 2^7 digits, does Python use the naive and Karatsuba algorithms _simultaneously?_
OH MY GOD YOU WRITE A SET INCLUSION SYMBOL INSTEAD OF EQUALITY WHEN TALKING ABOUT TIME COMPLEXITY, FINALLY SOMEONE THAT DOES IT PROPERLY!!!!!!!!!! THANK YOU!!!!!!!!
🤓
😃😃😃😃
🎉🎉
One minor nitpick though: big O notation is an upper bound, so not all algorithms that are O(n) look like a line. For example, any algorithm that is O(log n) is also O(n), but it doesn't look like a line
@@spicybaguette7706wrong. It's worst case does look like a line which is exactly what big O is addressing. You completely missed the point.
18:27 "And _this_ is where the record stands today." 🎵
This video reminds me how I got good at multiplying by 9. Just add a 0 at the end and then subtract the original number! Somehow just trying to multiply by 9 without that was too much headache.
Zero at the end of what? Youboeft it kind of vague sorry..or isnt itneasier or maybe as easy and as clear as ypur method is just replace 9nwith 10 minus 1..that's what i think of and Inthink as effective as your method..
@@leif1075 they meant the same thing
9 * x = (10 - 1) * x = 10x - x
5:45 Count Dracula! :D
Great video btw. I really enjoyed it.
Nice improvement on an older video someone else made, just like what karatsuba did.
I am not sure about the initial behaviour of the builtin multiplication at 17:00. 2**-22 s is about 10**-7 s, which is horribly slow for a simple multiplication, which should be done in ns (10**-9 s).
Probably some function calling overhead is the bottle neck at that point, where it also is important how you benchmark functions (measuring around or in loop? which time function? threading? etc)
In Python I definitely would compile the functions with numba or alike, to reduce the calling overhead.
The math is presented really well though :)
My guess is this measurement is biased because of incompressible time of API calls handling datastructure transfert or transformation from the interpretive language to the machine code.
10-7 is 100 ns, which is to be expected when you're working at arbitrary precision, even at less than a machine word of precision, an arbitrary-precision multiplication can't just do a hardware multiply and nothing else: you have to follow a pointer to each number, determine the length of each, decide what algorithm to use, retrieve the actual value from memory (if it indeed fits in a machine word and can thereby be loaded into a register), and only then can you perform the hardware multiplication, and then you have to store it back into memory.
Even without any function call overhead, the bookkeeping will be expensive, as will the memory accesses if they miss in cache (especially if they get punted all the way down to L3 or RAM. A single access that misses in cache can easily eat 100 cycles).
This doesn't matter, as the discrepancy will be constant. The demo is only looking at orders of magnitude here, hence the validity of the O() notation
@@trueriver1950 I am not doubting the validity of the O notation, but people might draw wrong conclusions from the example shown. For example the location of the "kink" will be shifted due to the offset, leading to wrong conclusions about where the transition happens, which actually does impact the presented explanation.
I did not expect an in-depth performance tutorial, but hinting at caveats would have been nice. I am just trying to provide a clearer picture, point out pit-falls and what should be expected when people try to do similar studies out of curiosity.
@@JonBrase As the presented code does not show the data type used, it is absolutely possible more book keeping is done than I expected, plus I am not familiar with performance levels of arbitrary precision modules in Python.
However, Python does have a very heavy footprint when it comes to calling functions, I would not even look at different cache levels for this issue. Also keep in mind, that even for a single-digit multiply the cost is allegedly 100ns, which to me is absurd, so I still doubt this without seeing more details.
However, it would be important to know how the performance is measured, otherwise we are just guessing. I shot myself into the foot once, because the rng for the data was part of the loop, and as the algorithm got faster (rewriting and compiling) that was the limiting factor at some point. The amount of input data pre-calculated and stored then also affects the cache levels used during the loop, assuming we measure the sequential throughput.
This makes me more interested in division algorithms. I wrote my own division algorithm for arbitrarily large integers decades ago, but it was not efficient at all. It would have been quicker to subtract logarithms and then ant-log the result.
@19:30 It is like sorting. Qsort and mergesort have a complexity of O(n log n) but sleepsort has a complexity of O(n). The latter is a lot slower than the former.
2:16 pi and e
Amazingly, in five decades of professionally writing code, i never needed to know any of this. Interestingly, low-level math works differently on numbers you can express using binary bits rather than strings of digits.
Maybe you could be interested in the paper "addition is all you need" which describes an optimized multiplication for floats.
In the eighties there was a considerable (though minority) interest in doing everything in interest arithmetic. The language Forth is probably the best place to start if you want to understand how that worked back in the day.
Numbers like pi and e would be represented as rationals, that is as ordered pairs of integers, rather than as floats.
The introduction of the floating point co processor, and the later incorporation of floating point arithmetic into the standard CPU was more programmer-friendly so soon won the battle. However with sufficient programmer skill it would be possible even now to get better performance using only integer-rational arithmetic. That would entail scientists, engineers, and accountants all learning more integer number theory than they seem to want to...
This here is where real magic exists
This reminds me how Fast Fourier Transform magically is faster than a tradition approach.
17:46 toom really cooked 🍚
2:07 nice touch with pi and e
Thanks!
Python's built-in multiplication just uses the CPU for integers that fit in 64bits. The CPU multiplies all bits in the two registers in parallel and process the result in o(1) time. See booth's algorithm.
I'd love to see a video talking about rng's and prngs, like Mersenne Twister 19937
Good to see you're getting attention
Incredible Video!! You make Algorithmic Mathematics look like art!
Thanks so much! Glad you enjoyed.
19:37 do you think they might see use with the sort of astronomical things quantum computers theoretically would work with?
That is a very surprising and interesting result. Great video
Even if we implemented long multiplication in a CPU, I am surprised that would be O(N*N). Because all it has to do, is a max of N additions for N bits. And addition itself doesn't do addition in serial as that would be way too slow. It does fast addition, with look ahead carry, to parallelize addition. That being said, CPU's still don't use long multiplication I think, as there's faster methods like this video talks about. I just don't think a CPU doing shift adds in binary with fast adders is O(N*N). Feel free to correct me.
Big-O notation describes how an algorithm behaves as N trends toward infinity. Modern CPUs do have fast multiplication and addition ***for specific finite-sized inputs*** (say, 64-bit integers). On arbitrarily large inputs, long multiplication is still O(N*N). Let's assume that some CPU can do a 64-bit multiply-shift-add in 1 clock cycle. Long multiplying 128-bit still requires 4 mult-shift-adds, 256-bit requires 16, etc. The inputs double in size, time complexity quadruples. Still O(N*N).
They have lots of improvements reducing the constant time to the minimum, by using fast multipliers and such. But the complexity is still O(N*N)
Hey. This presentation is written on Manlim right?
Yes, the math animations are made using Manim, by 3Blue1Brown.
Great channel! Please cover p vs np!
I've been wanting to cover that for a long time! I think it's likely that I'll include it as part of a future video on Algorithms and Complexity.
This was a really good explanation. Thank you!
What a dedication for this video, im definitely supporting you...
1:56 the first 6 digits of π and the first 6 digits of e
:D
please use a monospace font for code :(
omg i didn't even notice that. thanks, now the video is ruined :(
lol is it that much of a big deal? it seems about as legible as any other monospaced font ive used
Thank you for the video. Quite helpful!
This is beautifully explained. Congrats!
Amazing video. Thank you for this information.
Sir, could you please expand on how adding (a+b) and (c+d) works? 8:25
Hey! When is the voronoi diagram puzzle solution coming? I have been thinking about it for a while now and I would love to see the answer. Amazing videos btw!
Thanks so much! The Voronoi diagram puzzle is next up. Working on it now :) Did you make a submission?
2:57 you should say it looks *evermore* like a straight line *as* you put in arbitrarily larger numbers. n+cbrt(n) is O(n) but it very much isn't a line.
Yeah, that's a good point. "Proportional to n as n gets large" doesn't necessarily mean looks like an actual straight line for the whole graph.
Wrong. As n->infinity it will have its slope->1.
@@gregorymorse8423Yeah the slope approaches a constant but the function never is _actually_ a straight line for any real value.
There will always be steps in the line, due to the points where one or both numbers cross the boundary and need an extra "digit".
That means that in formal maths the slope does not come to a limit, as you can't shrink delta indefinitely small.
What you can do is to shrink (delta/n) arbitrarily small, so you can find the value of slope/n.
6:20 in. What if x and y are uneven numbers, or even prime?
Amazing channel
Thanks so much!
ohhh, ya. u remember building my own LargeInt and LargeFloat types in c++ and js. my js arbitrary math system is still open source.
multiplication is actually pretty well solved, it's division that has specific potential for improvement depending on specific inputs.
iirc I think I came up with a multiplication system that used powers of 2 and an accumulator to reach O(n) speed technically... real world testing showed that chunking multiplication was many times faster despite being O(n^2)
I originally wrote my when I wanted a fully homebrew rsa implementation for my fully home brew irc esque chat system so I could have end to end encryption. tested everything with phps gmp class
16:04 morale here, your algorithm might suck, but doing any algorithm in python sucks more
1:23 So in this case, myProgram() has a O(infinity) complexity 👀. (P vs NP is yet to be solved 😅. It's also a Millenim problem iirc 🤔?)
None of the examples shown there is O(Infinity) (That doesn't exists, by the way). They all are 'n' or 'n^k' or 'n log n'.
Lol. Maybe I should've put sub_to_PurpleMind() first :)
amazing video. very well animated and explained!
Thanks so much!
Amazing!
Insane you only have 6.4k subs with this production quality!
Well, now you have 6.4k+1 I guess :)
honestly fire video, and your explanation is actually phenomenal for.. literally everything. im subbing
9:54 oh no, math?? In a math vídeo??!
It's also got maths, but that's ok because it's also a maths video 😅
XD
@@PurpleMindCSCan you PLEWSE share how or why Laratsuba would come up with thia and hiw can I come upmwith something like thst orngresterand be a math genius. I won't settle for anything less. Thanks for sharing and hope to hear frlm you.
It is astonishing that Kolmogorov believed that multiplication would not be possible faster than quadratic time, since fast convolution via fast Fourier transform was already known. Ok, Gauss' fast Fourier method was not widely known and Good-Thomas maybe was also not widely known and Schönhage-Strassen came up only few years after Cooley-Tukey.
When messing around with how to mathmaticaly represent the number of single digit multiplications you have to do in the unoptimized method, i discovered that x^log2(n) is the same as n^log2(x) (it also works with other logs) and now I'm going to spend the next hour trying to figure out why that works
a^ln(b) = b^ln(a)
ln(b) = log_a(b^ln(a)) = ln(a) * log_a(b)
b = e^(ln(a)*log_a(b))
b = a^log_a(b)
b = b
or ln(b) = ln(a) * ln(b)/ln(a)
Well done! You've discovered the power log!
@@tiedänkö thank you!
Yeah, it's a weird-looking fact when you first see it.
@@PurpleMindCSwhat's weird about it and why??
multiplication traumatized me as a kid, im glad we have computers to do it for us
what
Me too
Love your animations man ! Coming from a Manim RUclipsr :)
Thank you so much! I remember checking out your channel too. Keep up the good work :)
do note the O(nlog(n)) of Harvey Hoven algorithm appears to have a huge constant, such that it becomes optimal only for large n on the order of 10^^2 (10^100)
I would expect the standard form of multiplication would be the fastest in binary given a number with a small bitcount (total 1s), so that might be taken into consideration when choosing Karatsuba or not. With a low enough bitcount it would end up being just a few additions in binary.
Really pleasing vid. Subbed.
Thanks so much!
This has a 3blue1brown vibe.
Very well done.
7:15 if i am understanding how this works it like separating each digits to their 10^n places? Like 1234 is broken down to 1000 200 30 4 and then individually multiplied to 5000 600 70 8?
For explanatory purposes yes.
In a computer the break down would be into powers of two rather than ten. (Exception below)
Or on some machines that are operated for 8bit manipulation the smallest growing might turn out to be in groups of 8 bits.
However, some machines also implement BCD arithmetic (or at least used to: for example the IBM 360 and 370 series, where fixed point integers of arbitrary length were stored in decimal, using 4 bits for each digit: hence the name "binary coded decimal").
Those machines could natively multiply arbitrarily long BCD integers in a single machine instruction, a non-interruptible instruction that typically took huge numbers of CPU cycles to complete.
These were much favoured in the COBOL era, as there's a direct mapping from the COBOL data types that don't tend to exist in science oriented languages.
Whether the microcode implemented the naïve algorithm or something better is an interesting question, which I never bothered to ask back in the day. Anyone with a preserved 370 is welcome to do the timing tests ...
Is there any technique that uses smaller partial products held as multiplied pairs in a look up table held in PROM?
A look up table in base 2 is trivial, and in effect already used in all binary multipliers which are hardwired that 1×1=1, and any other inputs give zero.
A look up table in base 10 is plausible, and even in base 100 world only need 10k bytes (5.5k if optimized for commutativity)
However that would still be slower for 8 digit numbers than a CPU-native 64bit multiply, giving a result of 64 bits and a carry of another 64bits.
Same for other subsets of the 64 bits
Can we store ALL the possible results of multiplying arbitrary 64 bit numbers?
To store these results in a look up table would take 32 billion billion bytes, which is not practical, even with attached storage that would slow the thing down hugely.
So for bases higher than base 2 your idea, however ingenious, is out performed by crunching at the hardware level. The CPU designers will have optimised that to some extent by pipelining and parallel execution, but the time taken is typically linear.
It may, but that changes nothing about the complexity. It just makes it faster.
i need more from this channel!
Fast adders are really interesting too!
would it make sense to hash the sub-results instead of calculating them over and over again? the longer the numbers get, the higher the changes to find sub-multiplications, you've already did - is checking a dict/hash table more expensive then just calculating the result again?
how much l1 cache you got?
Isn’t there also a hardware development coupled with the theory which serves to bring the constant down, such as parallelizing all the simple multiplications?
In the real implementation, the third line on the final graph, that has been done by the folks who designed the CPU. Python almost certainly uses native CPU for "small" numbers, and also for computing the partial products of the larger numbers.
So that's kind of bundled into the almost linear time part of the third line.
YEah, but those are usually trivial implementation details. No big gains just trying to optimize that.
Reducing constant time thru Parallelization only help in very very few cases, you need especific parallel algorithms.
can you make a video on the harvey hoeven algorithm? i tried reading its article but couldnt understand a word 😅 it would be a good one!
I enjoyed atumbling upon your channel! Great video.
I got a question for you. Are there any similar smart "shortcuts" for exponentials? I.e. x^y where x and y are both large? What if only y is large? What about x=y, or in other words x^x?
now we just need a vid about division
That's a thornier problem yet :) And CPUs don't just use fast algorithms, e.g. SRT, but use lookup tables. See the infamous Pentium division bug for how that went wrong quite a long time ago.
Curious. I thought this video was gonna be about how does the hardware multiply numbers. I remember having heard somewhere that it just does a loop where it adds the bigger of the 2 numbers to itself n times, equal to the lower number. But this sounds rather inefficient. I will look it up, but I guess that at the very least you could do a bit shift left with the most significant bit of the smaller number prior to doing the loop (less times).
I actually made some calculations on paper to see if it would work, and it does. The algorithm I came up with was:
Bit shift left the bigger number amount of times, equal to position of the most significant bit of the smaller number, and turn that bit to 0 in the smaller one. Then repeat the operation and keep adding the results together until the smaller number becomes 0. For example:
21 (10101) * 23 (10111) = 483 (111100011)
1) result = 10111 111001100
3) result = result + 10111
yes that would be horribly inefficient, you can do much better by shifting which--in essence--is a kind of long multiplication
Hi, I just realized that I omitted a word in the title which I had meant to include. Now the title of the video is: The Genius Way Computers Multiply **Big** Numbers.
So yes, at a low level, hardware can multiply numbers up to a certain size in parallel. But the topic of the video is algorithmic approaches for multiplying big numbers with thousands of digits.
Yes they do things like that, in logic, superfast, on a single CPU clock.
But that's still O(n*n) complexity. THe operation still depends on the size of both numbers.
What would interest me now is what performs better on the task of multiplying a large number with a small number lets say a number with more than 1000 digits with one with less than 2. Is the school way better or karatsubas method?
It becomes an O(n) complexity problem, so use the faster for low values of n.
I'm pretty sure the python slope for numbers below 9-10 digits changes because it changes from hardware multiplication to software multiplication. The hardware one probably uses the naive method, but it may be something else depending on hardware.
20:59 purplemind: TODAY’S SPONSORIER
me: no.
Increase A by A B-1 times and then just keep adding 0s forever, there you have it, O(1)
Doing something B or B-1 times means that the number of operations is proportional to B, so this method would actually be O(n). In addition, addition is linear time, so multiplying those time complexities together yields O(n^2) time complexity.
Below 64-bits for each number, on modern CPUs, it is faster to allow the silicon to multiply (however it does it), than to implement any sort of algorithm. Likely, that is what python is doing.