*My takeaways:* *1. What is optimizing work **0:44* *2. Bentley optimization rules **2:42* *3. Data structures* 3.1 Packing and encoding: reduce memory access 4:57 3.2 Argumentation: make common operations faster 10:50 3.3 Precomputation: perform calculations in advance 12:48 3.4 Caching: store results that have been accessed recently 23:30 3.5 Sparsity: avoid storing and computing zeros 26:32 *4. Logic* 4.1 Constant folding and propagation: evaluate and substitute constant expressions during compilation 38:55 4.2 Common-subexpression elimination: avoid computing the same expression multiple times by evaluating them once and storing them 41:55 4.3 Algebraic identities: replace expensive algebraic expressions with algebraic equivalents that require less work 44:17 4.4 Short-Circuiting: when performing a series of tests, stop evaluating as soon as you know the answer 48:03 4.5 Ordering tests: when ordering a series of tests, put the more often "successful" ones or computationally inexpensive ones first 52:00 4.6 Creating a fast path: creating computationally inexpensive alternatives 54:33 4.7 Combining tests: replace s series of tests with one test 57:38 *5. Loops* 5.1 Hoisting: avoid computing loop-invariant code 1:00:32 5.2 Sentinels: are special dummy values placed in a data structure to simplify the logic of boundary conditions, and in particular, the handling of loop-exit tests 1:02:45 5.3 Loop unrolling: combining several consecutive iterations of a loop into a single iterations 1:07:38 5.4 Loop fusion: combining multiple loops over the same index range into a single loop body, thereby saving the overhead of loop control 1:11:55 5.5 Eliminating wasted iterations: avoiding essentially empty loop bodies 1:13:45 *6. Functions* 6.1 Inlining: avoid the overhead of a function call by replacing a call to the function with the body of the function itself 1:16:30 6.2 Tail-recursion elimination: replace a recursive call that occurs as the last step of a function with a branch, saving function-call overhead (not covered) 6.3 Coarsening recursion: increase the size of the base case and handle it with more efficient code that avoids function-call overhead (not covered) *7. Closing advice **1:19:15*
The binomial coefficient table (18:25) is very naive with respect to integer size. To represent those rows up to 100, one needs at least 96 bits for the largest ones, which is 6 times as much as C guarantees for an 'int', and3 times as much as one actually gets nowadays. The int type is good only up to and including row 33 of Pascal's triangle (unsigned int would give you one more row)..
At 30:20, in Compressed Spare Row format (CSR) , matrix has few elements wrong. cols[11] = 4 (instead of 0 ] cols[12] = 3 (instead of 4) cols[13] = 4 (instead of 3) cols[14] = 5 (instead of 4) I discovered the issue when I coded up the CSR representation and wrote unit test based on classmaterial.
Uhh many thanks for sharing, I also noticed it but thought it should be intentional assuming they would fix it in the video, and have been trying to find a reasoning that would make that part correct for 10 mins :D
I came here to say the same thing, my mind exploded knowing it was wrong, but not knowing the algorithm I thought I had miscalculated; thanks for clarifying
1:04:00 this is broken. Signed integer overflow is undefined behavior. Some optimizing compilers could remove the ( sum < A[i] ) check, because they’re allowed to assume undefined behavior doesn’t happen
Well spotted! I sure hope students were not confused by this if/when they ran the code on their own computer because anything could happen! Kind of feel the fact that the this innocent-looking defensive programming is actually undefined behaviour that made it into an MIT lecture pretty much proves that the rules around overflow in C/C++ are ridiculous….
There seems to be an error in the example @ 31:18 for Row 4, Its shown at Col 0 and value 5, should be Col 4 and value 5. and for the next one too ? 8,9,7
In embedded system where execution must halt while the next data set is pending, I compute the partial computations instead of waiting example computing a 3D vector - sqrt ( x*x + y*y + z*z) Start data reception initialise output value to 0 read x start data reception calculate result = x*x by table lookup Read y start data reception calculate result += y*y by table lookup read z calculate result += z*z by table lookup calculate square root by custom routine tailored to the in-bound limits of 'result' I eliminate all the wast of checking for result ready and insert useful code.
1:07:00 what if A has only negative numbers? Shouldnt you rather add twice INT64_MAX? Even then we could get unlucky, if the sum at the end was equal to INT64_MIN
1:05:00 But what if A is like { 0, 1, 1000, 10 }? After sum += 0, sum += 1, sum will be 1 and A[i] = 1000 so the loop breaks, also i is less than n and function return true which means it does overflow, but it doesn’t!
sum += A[i++] The A[i++] grabs the next elem in A. This is equivalent to sum += A[i+1]. So the first iteration it to will the 1, then the second it will at the 1000, then the third it will add the the Maxint, overflow at i=n, then the check if i < n will return false.
B/c processors also have instructions cache that will be polluted and could eviction of useful entries could make the code slower, not faster. In C++ there exist an issue called "template bloat" - all templated code is implicitly inline, so each template instantiation generates a totally new instance of a code. If a library author isn't careful to design their templates class hierarchy to move as much template parameter independent code into base classes, there'll be too much unnecessary code generated, which could slow down the program.
Also it should be noted, that in C++ (just don't know about C) inline keyword doesn't even make compiler to inline the code anymore 😁 Instead it's more a tool to fix ODR violations 😁
At 57:00 , can we expect the compiler to always pick up on those common sub-expression eliminations, or at what point are we expected to do that ourselves?
If it's something you'd expect to make a significant performance difference, then it's probably worth checking the assembly output of that section during debugging to make sure it's doing what you expect, then gaining a sense of what it can recognize and optimize and what it needs some help with
In general, no, because a potential aliasing could require a compiler to issue re-reads/re-computations if an object of the same type was modified and the compiler wasn't able to prove it's a different object. So doing CSE manually is still a good habit to have
It's not maths. Assume b = 1 and c = 2. Then what this will do is : c = b + c or c = 3. This happens like this : (Note &a : address of variable a in memory. ) A new register pops up goes to the memory location of b using &b and copies the 64 bit value of b in itself. (Say this register is X). Similarly a second register pops up goes to the memory location of 0c using &c and copies the 64 bit value of c in itself. (Say this register is Y). Now these two register go to the ALU or Arithmetic and Logic Unit that knows how to do addition. The ALU takes the contents of X and Y and does Y
These optimizations are absolutely relevant regardless. Fundamentally, compilers can only do so much optimization by itself - so many edge cases and possibilities to consider for the compiler.
I'm learning some assembly as well as computer organization recently, and it greatly helped me to appreciate these effort, because, regardless how advanced the compiler is, our code is compiled to a sequence of instructions(add, multiply, move stuff from memory to register), and the optimization mentioned here are basically reducing the unnecessary / expensive instructions that we will feed to CPU. (like the test mentioned in the two dots example, your compiler is not going to be that smart to put these checks before the sqrt)
This is good stuff BUT it is Low-Level. Programmers will think all they have to do is rummage through their code looking for things like this and then apply fixes like these. Again, this is good stuff, but when you're looking at a mega-line app, first you need to _find out_ what needs to be fixed. For example, you could see some O(n^2) code and try to replace it with something O(n log n). But if the PC is never there, and n never gets bigger than 3, the problem is elsewhere. The method I've used for a long time is _random pausing._ It sounds like the "poor man's profiler", but it's different. All it needs is a way to get a small number of stack traces at times uncorrelated with what the program is doing. There are tools to do this, but there is always a debugger that can be manually paused at random. Assume there is a problem that needs to be fixed. (There always is!) Assume that, when fixed, it will save X% of time. Then the probability that a pause will land in it is _at least_ X%. It could be anything - low-level algorithm, memory allocation, reading/writing a file, etc. The reason you look at the stack is to see _why_ it's being done. If the reason why is a poor reason, if it could be improved or eliminated, you're looking at a speedup. And it's always a surprise! It's never something you could see by just eyeballing the code. Typically X is in the range of 10% to 90%. So you take pauses and stack traces until you see the problem, and 10 is usually enough. I've never done more than 20. Remember, the object is to _find_ the problem, not to _measure_ it with precision, which is why a large number of samples is not needed, especially when the large number of samples causes not finding the problem. There are almost always multiple problems, in a range of sizes. If problem A is 50%, B is 25%, and C is 12.5%, fixing A doubles the speed. After fixing A, now B and C are doubled, to 50% and 25%, so if now you find and fix B, speed is doubled again. Now, of course, C is 50%, and fixing it doubles the speed yet again! But, if any one of them is not found, you see that a heavy price will be paid. This is a very old trick. I just fell into it some 50 years ago, and I met various old-timers who've used it and thought nothing of it. At the same time, programmers are being taught about various profiling tools that, honestly, can't hold a candle to the random pausing method. Thanks again for good stuff. By the way, I'm a big fan of Jon Bentley. P.S. Granted, there are profilers that take thousands of stack traces. But then they _don't show them to you,_ because there are too many. Instead they try to summarize into things like hot paths, flame graphs, etc. Those summaries have the effect of _hiding_ what would be obvious if you simply examined a small random selection of them and matched them to the corresponding lines of source code. Then, to make matters worse, they suppress taking samples during blockage like I/O (making it invisible), and pretend that's an advantage by calling them "CPU profilers". P.P.S. What about multi-threading? Then what you do is grab the stack traces of all relevant threads at the same time. If the app could be faster, it's because some thread is being held up waiting for some other thread. So you find the other thread and see what it's doing, and so on. This process is very much like debugging. You have to apply your mental faculties to the moment in time.
MIT OCW. Best thing that ever happened on the internet.
I know right? How is he so smart? 👍👍
true
Church!
*My takeaways:*
*1. What is optimizing work **0:44*
*2. Bentley optimization rules **2:42*
*3. Data structures*
3.1 Packing and encoding: reduce memory access 4:57
3.2 Argumentation: make common operations faster 10:50
3.3 Precomputation: perform calculations in advance 12:48
3.4 Caching: store results that have been accessed recently 23:30
3.5 Sparsity: avoid storing and computing zeros 26:32
*4. Logic*
4.1 Constant folding and propagation: evaluate and substitute constant expressions during compilation 38:55
4.2 Common-subexpression elimination: avoid computing the same expression multiple times by evaluating them once and storing them 41:55
4.3 Algebraic identities: replace expensive algebraic expressions with algebraic equivalents that require less work 44:17
4.4 Short-Circuiting: when performing a series of tests, stop evaluating as soon as you know the answer 48:03
4.5 Ordering tests: when ordering a series of tests, put the more often "successful" ones or computationally inexpensive ones first 52:00
4.6 Creating a fast path: creating computationally inexpensive alternatives 54:33
4.7 Combining tests: replace s series of tests with one test 57:38
*5. Loops*
5.1 Hoisting: avoid computing loop-invariant code 1:00:32
5.2 Sentinels: are special dummy values placed in a data structure to simplify the logic of boundary conditions, and in particular, the handling of loop-exit tests 1:02:45
5.3 Loop unrolling: combining several consecutive iterations of a loop into a single iterations 1:07:38
5.4 Loop fusion: combining multiple loops over the same index range into a single loop body, thereby saving the overhead of loop control 1:11:55
5.5 Eliminating wasted iterations: avoiding essentially empty loop bodies 1:13:45
*6. Functions*
6.1 Inlining: avoid the overhead of a function call by replacing a call to the function with the body of the function itself 1:16:30
6.2 Tail-recursion elimination: replace a recursive call that occurs as the last step of a function with a branch, saving function-call overhead (not covered)
6.3 Coarsening recursion: increase the size of the base case and handle it with more efficient code that avoids function-call overhead (not covered)
*7. Closing advice **1:19:15*
My takeaways: *guuulp* 💦🥤
can i start this course with only basic python programming skill as a prerequisite??
Thanks a lot to MIT. Forever Grateful to you guys.
The binomial coefficient table (18:25) is very naive with respect to integer size. To represent those rows up to 100, one needs at least 96 bits for the largest ones, which is 6 times as much as C guarantees for an 'int', and3 times as much as one actually gets nowadays. The int type is good only up to and including row 33 of Pascal's triangle (unsigned int would give you one more row)..
At 30:20, in Compressed Spare Row format (CSR) , matrix has few elements wrong.
cols[11] = 4 (instead of 0 ]
cols[12] = 3 (instead of 4)
cols[13] = 4 (instead of 3)
cols[14] = 5 (instead of 4)
I discovered the issue when I coded up the CSR representation and wrote unit test based on classmaterial.
Uhh many thanks for sharing, I also noticed it but thought it should be intentional assuming they would fix it in the video, and have been trying to find a reasoning that would make that part correct for 10 mins :D
I came here to say the same thing, my mind exploded knowing it was wrong, but not knowing the algorithm I thought I had miscalculated; thanks for clarifying
1:04:00 this is broken. Signed integer overflow is undefined behavior. Some optimizing compilers could remove the ( sum < A[i] ) check, because they’re allowed to assume undefined behavior doesn’t happen
Well spotted! I sure hope students were not confused by this if/when they ran the code on their own computer because anything could happen!
Kind of feel the fact that the this innocent-looking defensive programming is actually undefined behaviour that made it into an MIT lecture pretty much proves that the rules around overflow in C/C++ are ridiculous….
@@constantijndekker8343 ridiculous or not, these are the rules.
Didn't expect to see such a mistake in a course made by MIT...
There seems to be an error in the example @ 31:18 for Row 4, Its shown at Col 0 and value 5, should be Col 4 and value 5. and for the next one too ? 8,9,7
this lecture is so cool it hurts
In embedded system where execution must halt while the next data set is pending, I compute the partial computations instead of waiting
example computing a 3D vector - sqrt ( x*x + y*y + z*z)
Start data reception
initialise output value to 0
read x
start data reception
calculate result = x*x by table lookup
Read y
start data reception
calculate result += y*y by table lookup
read z
calculate result += z*z by table lookup
calculate square root by custom routine tailored to the in-bound limits of 'result'
I eliminate all the wast of checking for result ready and insert useful code.
1:07:00 what if A has only negative numbers? Shouldnt you rather add twice INT64_MAX? Even then we could get unlucky, if the sum at the end was equal to INT64_MIN
Never wait too long to prematurely optimize
23:55 reminded me of that sketch of key & peele
edit: just search key and peele hypotenuse totally worth the five minutes and twenty six seconds
1:05:00
But what if A is like { 0, 1, 1000, 10 }? After sum += 0, sum += 1, sum will be 1 and A[i] = 1000 so the loop breaks, also i is less than n and function return true which means it does overflow, but it doesn’t!
Note the prefixed ++i in the body.
It would add A[2] then compare sum >= A[2]
sum += A[i++]
The A[i++] grabs the next elem in A. This is equivalent to sum += A[i+1]. So the first iteration it to will the 1, then the second it will at the 1000, then the third it will add the the Maxint, overflow at i=n, then the check if i < n will return false.
1:17:40 so why not make all functions static inline? Does it only work for one-liner functions?
@@jean-claudearbaut7322interesting, thanks!
B/c processors also have instructions cache that will be polluted and could eviction of useful entries could make the code slower, not faster.
In C++ there exist an issue called "template bloat" - all templated code is implicitly inline, so each template instantiation generates a totally new instance of a code. If a library author isn't careful to design their templates class hierarchy to move as much template parameter independent code into base classes, there'll be too much unnecessary code generated, which could slow down the program.
Also it should be noted, that in C++ (just don't know about C) inline keyword doesn't even make compiler to inline the code anymore 😁 Instead it's more a tool to fix ODR violations 😁
At 57:00 , can we expect the compiler to always pick up on those common sub-expression eliminations, or at what point are we expected to do that ourselves?
If it's something you'd expect to make a significant performance difference, then it's probably worth checking the assembly output of that section during debugging to make sure it's doing what you expect, then gaining a sense of what it can recognize and optimize and what it needs some help with
In general, no, because a potential aliasing could require a compiler to issue re-reads/re-computations if an object of the same type was modified and the compiler wasn't able to prove it's a different object. So doing CSE manually is still a good habit to have
I foud out it is really good course.
thank you mit
For the full adder, or more specifically for any truth table: isn't it better to reduce it to POS or SOP form and test against those?
22:30 or just use constexpr from modern C++ compilers
Yeah I also thought, there should be a better wat to do this. But maybe this constexpr stuff is only in C++ and that’s why he didn’t mention it?
Amazing work, thanks
24:00 guys phone alarm goes off
'if (expr) return true; return false;'
@44:02
Why c=b+c
Is not equals to b=0?
It's not maths.
Assume b = 1 and c = 2.
Then what this will do is : c = b + c or c = 3.
This happens like this : (Note &a : address of variable a in memory. )
A new register pops up goes to the memory location of b using &b and copies the 64 bit value of b in itself. (Say this register is X).
Similarly a second register pops up goes to the memory location of 0c using &c and copies the 64 bit value of c in itself. (Say this register is Y).
Now these two register go to the ALU or Arithmetic and Logic Unit that knows how to do addition. The ALU takes the contents of X and Y and does Y
Thanks MIT 🙏
00:32:34 rest in peace teabag paper
Lol the instructor rushing to finish at the end because the class is leaving.
thanks MIT
I like this course
How relevant are these optimizations when writing code in modern compiled languages such as C# or Rust?
These optimizations are absolutely relevant regardless. Fundamentally, compilers can only do so much optimization by itself - so many edge cases and possibilities to consider for the compiler.
I'm learning some assembly as well as computer organization recently, and it greatly helped me to appreciate these effort, because, regardless how advanced the compiler is, our code is compiled to a sequence of instructions(add, multiply, move stuff from memory to register), and the optimization mentioned here are basically reducing the unnecessary / expensive instructions that we will feed to CPU. (like the test mentioned in the two dots example, your compiler is not going to be that smart to put these checks before the sqrt)
51:29
He shouldn’t use single letter variable names, makes it unnecessary confusing
This is good stuff BUT it is Low-Level. Programmers will think all they have to do is rummage through their code looking for things like this and then apply fixes like these. Again, this is good stuff, but when you're looking at a mega-line app, first you need to _find out_ what needs to be fixed. For example, you could see some O(n^2) code and try to replace it with something O(n log n). But if the PC is never there, and n never gets bigger than 3, the problem is elsewhere.
The method I've used for a long time is _random pausing._ It sounds like the "poor man's profiler", but it's different. All it needs is a way to get a small number of stack traces at times uncorrelated with what the program is doing. There are tools to do this, but there is always a debugger that can be manually paused at random.
Assume there is a problem that needs to be fixed. (There always is!) Assume that, when fixed, it will save X% of time. Then the probability that a pause will land in it is _at least_ X%. It could be anything - low-level algorithm, memory allocation, reading/writing a file, etc. The reason you look at the stack is to see _why_ it's being done. If the reason why is a poor reason, if it could be improved or eliminated, you're looking at a speedup. And it's always a surprise! It's never something you could see by just eyeballing the code.
Typically X is in the range of 10% to 90%. So you take pauses and stack traces until you see the problem, and 10 is usually enough. I've never done more than 20. Remember, the object is to _find_ the problem, not to _measure_ it with precision, which is why a large number of samples is not needed, especially when the large number of samples causes not finding the problem.
There are almost always multiple problems, in a range of sizes. If problem A is 50%, B is 25%, and C is 12.5%, fixing A doubles the speed. After fixing A, now B and C are doubled, to 50% and 25%, so if now you find and fix B, speed is doubled again. Now, of course, C is 50%, and fixing it doubles the speed yet again! But, if any one of them is not found, you see that a heavy price will be paid.
This is a very old trick. I just fell into it some 50 years ago, and I met various old-timers who've used it and thought nothing of it. At the same time, programmers are being taught about various profiling tools that, honestly, can't hold a candle to the random pausing method.
Thanks again for good stuff. By the way, I'm a big fan of Jon Bentley.
P.S. Granted, there are profilers that take thousands of stack traces. But then they _don't show them to you,_ because there are too many. Instead they try to summarize into things like hot paths, flame graphs, etc. Those summaries have the effect of _hiding_ what would be obvious if you simply examined a small random selection of them and matched them to the corresponding lines of source code. Then, to make matters worse, they suppress taking samples during blockage like I/O (making it invisible), and pretend that's an advantage by calling them "CPU profilers".
P.P.S. What about multi-threading? Then what you do is grab the stack traces of all relevant threads at the same time. If the app could be faster, it's because some thread is being held up waiting for some other thread. So you find the other thread and see what it's doing, and so on. This process is very much like debugging. You have to apply your mental faculties to the moment in time.
Great lecture, but... Oh my, C is such an awful and error prone language. It should have never been invented.