*My takeaways:* *1. Why do we study performance engineering **0:57* 1.1 Why we can't improve the frequency of hardware anymore 10:00, because static power consumption dominates the power consumption in modern hardware, and it is hard to reduce it. 1.2 Parallelism is used to increase performance 11:49, but the performance is no longer free since we have to write programs that can run in parallel. *2. An example of performance engineering: Matrix multiplication **15:35* 2.1 Hyperthreading can improve performance a little bit, but it makes performance measurement difficult 17:17 2.2 Python code running time: *~6 hours* 19:38 2.2 Java code running time: *~46 minutes* 23:44 2.3 C code running time: *~19 minutes* 24:23 2.4 Why Python is slow 25:30, because Python is interpreted, C is compiled and Java is in between. 2.5 Further improvement on the C code. - Change the order of loops: *~3 minutes* running time, because cache locality is exploited 29:50, and we can use measurement tool $ cachegrind 33:54 *-- In C, a matrix are stored row by row in memory, therefore access data column by column can exploit cache locality.* *-- In C[i][j] += A[i][k] * B[k][j], we want to maximise the access to the column index which are [j], [k] and [j], and because there are two [j], put [j] in the inner loop could maximise cache locality.* *-- We also want to minimise the access to row index because it can break cache locality, and since there are two[i] and one [k], put [i] in the outer loop could maximise cache locality.* *-- Put [k] in the middle loop.* - Change the compiler flag: *~54s* running time 34:59 - Use more cores: *~3s* running time 36:24, rule of thumb is parallel outer loops rather than inner loops. - Manage cache for data reuse, tiled matrix multiplication with parallelism: *~1.74s* running time 40:10 - Tiling for two-level cache, recursive matrix multiplication with parallelism: *~94s* running time 45:45, because the overhead of recursion is large, but we can coarsening the recursion to get *~1.3s* running time 49:00 - Using vector hardware, SIMD with compiler: *~0.7s* running time 51:35 - Using vector instruction directly: *~0.4s* running time 55:55 *3. Matrix multiplication is a good example, but we generally won't see such magnitude of improvement in other cases **58:58*
@@awsumpersun321 Only trivial an an absurdly high level abstraction. Just try to implement either. If you take those two out as not everything can be parrallel(or alternately the task is already parallel), and if you are not totally daft(regarding high demand computation) and assumed a compiled language from the start; then you are still looking at a 92x improvement. Now consider super compute machines that consume $10'000 per day and many scientific processes that need many days or even months of runtime. Then even 2% is a substantial savings, let alone cutting it to 1/92 (~1%) of the original.
I have not been expecting it to be a performance optimization class (i didn't read the description before, i was looking for game dev related stuff), but i have found very interesting. I never had this kind of advanced concept before about how computers REALLY works.
As to why we cannot cross 40% peak, is the reason memory accesses? Since the whole matrix of such size cannot be cached, I am curious if there are more reasons as well
Great class. I had to laugh at how much just two years have given over to specs as he describes the AWS large machine as a "honkin' big machine". Enterprise VD environments are really pushing what constitutes a honkin' big machine upward.
There were bigger AWS options even in 2018, the point is that most folks are using less than 1% of it and think they are really cooking or that they just need a faster machine. Seeing what 1980s programmers could push through an 8bit console with 2k of memory (like the NES) with various optimization tricks, now those machines had a good utilization ratio.(Yes, they were simpler than modern workstation architectures and thus easier to optimize.)
Because it's the A[i][k] * B[k][j] so k is common between A and B and needs to be executed sequentially for obtaining the result C, this is my understanding of the reason
Was this video unlisted until just 30-ish minutes ago? I saw the title and was excited to see another updated 6.172, but then I noticed that the thumbnail showed that I'd already watched it before.
I went through this very engaging lecture and thought I understood it. Maybe I'm missing something, because I ran the matrix multiplication on my PC and my results came back very fast. Without optimization, C performed the operation in just 0.03 seconds. Please am I getting something wrong as to why it took longer to perform in this lecture
These optimizations are amazing, python code isn't great though. I managed to write a pure python version that runs in 300s(using pypy). No optimizations other than loop reordering. Still slow, but it's just 5 loc that were written in 3 minutes.
@@miguelescutia5556 yes, it's an amazing lecture. Just wanted to highlight that barely optimized python is less than 2x slower than C at the same level of optimization.
In Performance Engineering all aspects of Computer Science come together: Programming, Operating Systems, Memory models, (Parallel) Algorithm designed, Computer/Chip Architecture, Infrastructure Architecture etc. Performance optimization is about going from one layer (to optimize) to the next layer of the architecture of your application and continue cycling through the layers till goals are met. Always measure, qualify, quantify before change something. Ok, sometimes you do not have enough understanding or some parts might be a black box, that allows for experiencing.
what is absolutely bonkers is python's numpy.matmul holds it's own up until #9 on my 5 year old macbook pro, and it looks like, C = np.matmul(A,B). Lol. So yes, plain python is slow but it can be fast. Good Lecture.
@Aaron, The core logic of NumPy that actually does the multiplication is written in C and not Python. github.com/numpy/numpy/tree/master/numpy/core/src
Did he jus say "well i took this 1.3hr class to prove you don't need performance, just write your god damn code in such a way that it makes some fucking sense" 😂
If I wanted to make an Amish-like computer, using very simple and primitive parts, and if my goal was to produce all of it in the United States instead of in China, it might make sense to build computers and appliances that had less power, less memory, a frugal approach to programming. You might make a goal of building a computer that doesn't use certain kinds of rare earth elements or something, using only elements that could be mined locally, so you would give up a lot of the advanced, modern hardware.
*My takeaways:*
*1. Why do we study performance engineering **0:57*
1.1 Why we can't improve the frequency of hardware anymore 10:00, because static power consumption dominates the power consumption in modern hardware, and it is hard to reduce it.
1.2 Parallelism is used to increase performance 11:49, but the performance is no longer free since we have to write programs that can run in parallel.
*2. An example of performance engineering: Matrix multiplication **15:35*
2.1 Hyperthreading can improve performance a little bit, but it makes performance measurement difficult 17:17
2.2 Python code running time: *~6 hours* 19:38
2.2 Java code running time: *~46 minutes* 23:44
2.3 C code running time: *~19 minutes* 24:23
2.4 Why Python is slow 25:30, because Python is interpreted, C is compiled and Java is in between.
2.5 Further improvement on the C code.
- Change the order of loops: *~3 minutes* running time, because cache locality is exploited 29:50, and we can use measurement tool $ cachegrind 33:54
*-- In C, a matrix are stored row by row in memory, therefore access data column by column can exploit cache locality.*
*-- In C[i][j] += A[i][k] * B[k][j], we want to maximise the access to the column index which are [j], [k] and [j], and because there are two [j], put [j] in the inner loop could maximise cache locality.*
*-- We also want to minimise the access to row index because it can break cache locality, and since there are two[i] and one [k], put [i] in the outer loop could maximise cache locality.*
*-- Put [k] in the middle loop.*
- Change the compiler flag: *~54s* running time 34:59
- Use more cores: *~3s* running time 36:24, rule of thumb is parallel outer loops rather than inner loops.
- Manage cache for data reuse, tiled matrix multiplication with parallelism: *~1.74s* running time 40:10
- Tiling for two-level cache, recursive matrix multiplication with parallelism: *~94s* running time 45:45, because the overhead of recursion is large, but we can coarsening the recursion to get *~1.3s* running time 49:00
- Using vector hardware, SIMD with compiler: *~0.7s* running time 51:35
- Using vector instruction directly: *~0.4s* running time 55:55
*3. Matrix multiplication is a good example, but we generally won't see such magnitude of improvement in other cases **58:58*
You're God's gift to humankind. Thanks a ton for this Lei Xun! Much appreciated!
@@vishnusingh4118 You are welcome
Can anyone reply if all these videos are to be watched in order or are they all independent of eachother?
@@Kingpin.7 Mostly in order
Great takeaways!
From 6 hrs in python down to less than half a second. Indeed impressive optimization!
And you said that exactly 6 months back from today.
spoiler warning zzz
spoiler warning zzz
most of it attributed to parallelization and language though (trivial changes)
@@awsumpersun321 Only trivial an an absurdly high level abstraction. Just try to implement either. If you take those two out as not everything can be parrallel(or alternately the task is already parallel), and if you are not totally daft(regarding high demand computation) and assumed a compiled language from the start; then you are still looking at a 92x improvement. Now consider super compute machines that consume $10'000 per day and many scientific processes that need many days or even months of runtime. Then even 2% is a substantial savings, let alone cutting it to 1/92 (~1%) of the original.
The material taught here is gold. Zero dislikes, just how it's supposed to be!
I think Cuz of your comment it got 2 dislikes🥲🥲
@@merlinarthur4661 plausible
still no dislikes a year later
@@larry_the RUclips turned off dislikes a couple months ago, its rainbows and unicorns forever....
u fucking jinxed it
This is a truly amazing class.
Glad to see Prof Leiserson again.
Btw, CLRS is always awesome!
hha invoker
Thank you MIT! Once again a great video and lecture.
I have not been expecting it to be a performance optimization class (i didn't read the description before, i was looking for game dev related stuff), but i have found very interesting. I never had this kind of advanced concept before about how computers REALLY works.
One of the best classes I have attended. Absolutely fabulous. Thank you for the upload.
2:59 Changed my whole view of SE
38:16 because possible racing condition? since the use of +=
Wow! Thank you… a lot of the question marks that I’ve encountered before was answered here…
I really appreciate this open source courses
This is incredible. Thank you so much for this course MIT.
26:12 can it be faster if you run it on assembly? And if yes, how much?
As to why we cannot cross 40% peak, is the reason memory accesses? Since the whole matrix of such size cannot be cached, I am curious if there are more reasons as well
Thanks MIT, amazing lecture
A course in System Software and Programming is must for CS / CpE majors
This guy wrote Intro to Algos, OMG😱
Wow that's legendary
Great class. I had to laugh at how much just two years have given over to specs as he describes the AWS large machine as a "honkin' big machine". Enterprise VD environments are really pushing what constitutes a honkin' big machine upward.
There were bigger AWS options even in 2018, the point is that most folks are using less than 1% of it and think they are really cooking or that they just need a faster machine.
Seeing what 1980s programmers could push through an 8bit console with 2k of memory (like the NES) with various optimization tricks, now those machines had a good utilization ratio.(Yes, they were simpler than modern workstation architectures and thus easier to optimize.)
@14:30 - 2000 "OH OHS" = oughts
Can someone explain why the k loop can't be parallelized?
Because it's the A[i][k] * B[k][j] so k is common between A and B and needs to be executed sequentially for obtaining the result C, this is my understanding of the reason
Amazing lecture! Why does it go only up to only 41.6% of peak?
what a great lecture!!
What is the source/paper/study behind the sldies @ 14:00?
Visit ocw.mit.edu/6-172F18 for the course materials. Best wishes on your studies!
Was this video unlisted until just 30-ish minutes ago?
I saw the title and was excited to see another updated 6.172, but then I noticed that the thumbnail showed that I'd already watched it before.
Good catch! Yes it was. We're glad you got to enjoy it anyways :D
I went through this very engaging lecture and thought I understood it. Maybe I'm missing something, because I ran the matrix multiplication on my PC and my results came back very fast. Without optimization, C performed the operation in just 0.03 seconds. Please am I getting something wrong as to why it took longer to perform in this lecture
Did you use 4096x4096 matrix?
Thanks MIT!!! keep up good work...
Thanks for sharing this excelent lesson!
Marking my attendance 🙋🏻♀️
Anyone else?
Thank you MIT
how long does a numpy matrix multiplication take?
Thank you for sharing great lessons.
These optimizations are amazing, python code isn't great though. I managed to write a pure python version that runs in 300s(using pypy). No optimizations other than loop reordering. Still slow, but it's just 5 loc that were written in 3 minutes.
You used concepts from his lesson anyway.
@@miguelescutia5556 yes, it's an amazing lecture. Just wanted to highlight that barely optimized python is less than 2x slower than C at the same level of optimization.
what is the pescribed reading for this course ?.
6.172 has no required readings. There is a list of supplemental readings on MIT OpenCourseWare at: ocw.mit.edu/6-172F18. Best wishes on your studies!
Thanks @@mitocw
In Performance Engineering all aspects of Computer Science come together: Programming, Operating Systems, Memory models, (Parallel) Algorithm designed, Computer/Chip Architecture, Infrastructure Architecture etc.
Performance optimization is about going from one layer (to optimize) to the next layer of the architecture of your application and continue cycling through the layers till goals are met. Always measure, qualify, quantify before change something. Ok, sometimes you do not have enough understanding or some parts might be a black box, that allows for experiencing.
It's a cool lecture!
7:00 so michael jackson is a computer scientist huh
Great resource.
Can anyone reply if all these videos are to be watched in order or are they all independent of eachother?
It is recorded class lectures, as it clearly says in the description. They are a series, not independent.
That was amazing
what was changed in this re-upload?
This was not a re-upload. We just forgot to make this lecture public. Everything else was public, except this one lecture. 🤦♂️
Sehr gut
Chalkboard is so cool
what is absolutely bonkers is python's numpy.matmul holds it's own up until #9 on my 5 year old macbook pro, and it looks like, C = np.matmul(A,B). Lol. So yes, plain python is slow but it can be fast. Good Lecture.
@Aaron, The core logic of NumPy that actually does the multiplication is written in C and not Python. github.com/numpy/numpy/tree/master/numpy/core/src
@@sauravtiwary8513 Yep, can't get around the laws of physics. (Same algorithms pre-compiled vs interpreted in real time.)
Why not try Forth programing language; supper fast.
Thankyou
VAHAB BEYAAZ AHMET ÇAKAAR
Do someone knows some book of software? I wanna study an other engineering but I would like to learn software too.
program vs software vs application 🤔
Bro why does this have so many views??
Did he jus say "well i took this 1.3hr class to prove you don't need performance, just write your god damn code in such a way that it makes some fucking sense" 😂
If I wanted to make an Amish-like computer, using very simple and primitive parts, and if my goal was to produce all of it in the United States instead of in China, it might make sense to build computers and appliances that had less power, less memory, a frugal approach to programming. You might make a goal of building a computer that doesn't use certain kinds of rare earth elements or something, using only elements that could be mined locally, so you would give up a lot of the advanced, modern hardware.
You can make computers with water and mechanical valves. So what?
Run it in Rust though
check
Web monkeys mentioned
i wonder what he's drinking?
TIL: web monkey concept 😂😅
Rank(0)(1,0)matrix permutix.
Basic stuff I learned early on in my first year.
This is a 100 level class ...so your point is?
@@mytech6779 I think he thought of the 3×3 matrix multiplication in C taught in mostly 1st year of most University 😂😂