1. Introduction and Matrix Multiplication

  • Published: 9 Jan 2025

Comments • 99

  • @leixun
    @leixun 4 years ago +200

    *My takeaways:*
    *1. Why we study performance engineering **0:57*
    1.1 Why we can't raise hardware clock frequency anymore 10:00: static power consumption dominates power consumption in modern hardware, and it is hard to reduce.
    1.2 Parallelism is used to increase performance 11:49, but the performance is no longer free, since we have to write programs that can run in parallel.
    *2. An example of performance engineering: matrix multiplication **15:35*
    2.1 Hyperthreading can improve performance a little, but it makes performance measurement difficult 17:17
    2.2 Python code running time: *~6 hours* 19:38
    2.3 Java code running time: *~46 minutes* 23:44
    2.4 C code running time: *~19 minutes* 24:23
    2.5 Why Python is slow 25:30: Python is interpreted, C is compiled, and Java is in between.
    2.6 Further improvements to the C code (see the sketch after this list):
    - Change the order of the loops: *~3 minutes* running time, because cache locality is exploited 29:50; cache behavior can be measured with the Cachegrind tool (valgrind --tool=cachegrind) 33:54
    *-- In C, a matrix is stored row by row (row-major order), so varying the column index fastest scans memory sequentially and exploits cache locality.*
    *-- In C[i][j] += A[i][k] * B[k][j], the column indices are [j], [k], and [j]; since [j] appears twice, putting the j loop innermost maximises cache locality.*
    *-- We also want the row indices to change as rarely as possible, because changing rows jumps in memory and breaks cache locality; the row indices are [i], [i], and [k], and since [i] appears twice, the i loop goes outermost.*
    *-- Put the k loop in the middle.*
    - Change the compiler flags: *~54s* running time 34:59
    - Use more cores: *~3s* running time 36:24; the rule of thumb is to parallelize outer loops rather than inner loops.
    - Manage the cache for data reuse, tiled matrix multiplication with parallelism: *~1.74s* running time 40:10
    - Tiling for a two-level cache, recursive matrix multiplication with parallelism: *~94s* running time 45:45, because the overhead of recursion is large; but coarsening the recursion gets *~1.3s* running time 49:00
    - Using vector hardware, SIMD via the compiler: *~0.7s* running time 51:35
    - Using vector instructions directly: *~0.4s* running time 55:55
    *3. Matrix multiplication is a good example, but we generally won't see improvements of this magnitude in other cases **58:58*
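
    A minimal C sketch of the loop-order, parallelization, and tiling steps above (my own illustration, not the lecture's exact code: the class uses Cilk, while this sketch assumes OpenMP, N = 1024, and a guessed tile size S = 64):

    ```c
    // Compile with something like: cc -O3 -fopenmp matmul.c
    #define N 1024
    static double A[N][N], B[N][N], C[N][N];

    // Cache-friendly order: i outermost, k middle, j innermost, so the
    // inner loop walks C[i][...] and B[k][...] sequentially in memory.
    void matmul_ikj(void) {
        #pragma omp parallel for           // parallelize the OUTER loop only
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < N; ++k) {
                double a = A[i][k];        // reused across the whole j loop
                for (int j = 0; j < N; ++j)
                    C[i][j] += a * B[k][j];
            }
    }

    // Tiled variant in the spirit of the "manage cache for data reuse"
    // step: each S x S block of C is updated while it is hot in cache.
    #define S 64
    void matmul_tiled(void) {
        #pragma omp parallel for           // threads own disjoint row blocks of C
        for (int ih = 0; ih < N; ih += S)
            for (int kh = 0; kh < N; kh += S)
                for (int jh = 0; jh < N; jh += S)
                    for (int i = ih; i < ih + S; ++i)
                        for (int k = kh; k < kh + S; ++k) {
                            double a = A[i][k];
                            for (int j = jh; j < jh + S; ++j)
                                C[i][j] += a * B[k][j];
                        }
    }
    ```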

    • @vishnusingh4118
      @vishnusingh4118 4 years ago +10

      You're God's gift to humankind. Thanks a ton for this Lei Xun! Much appreciated!

    • @leixun
      @leixun 4 years ago +2

      @@vishnusingh4118 You are welcome

    • @Kingpin.7
      @Kingpin.7 3 years ago +1

      Can anyone say whether these videos are meant to be watched in order, or are they all independent of each other?

    • @leixun
      @leixun 3 years ago +2

      @@Kingpin.7 Mostly in order

    • @user-mr-m12312
      @user-mr-m12312 3 years ago +2

      Great takeaways!

  • @hengchunsong301
    @hengchunsong301 5 years ago +100

    From 6 hours in Python down to less than half a second. Impressive optimization indeed!

    • @takasurazeem
      @takasurazeem 4 years ago +6

      And you said that exactly 6 months back from today.

    • @jiaming5269
      @jiaming5269 4 years ago +2

      spoiler warning zzz

    • @awsumpersun321
      @awsumpersun321 4 years ago +2

      most of it is attributable to parallelization and language choice, though (trivial changes)

    • @mytech6779
      @mytech6779 2 years ago +4

      @@awsumpersun321 Only trivial at an absurdly high level of abstraction. Just try to implement either. If you take those two out, since not everything can be parallelized (or, alternately, the task is already parallel), and if you are not totally daft (regarding high-demand computation) and assume a compiled language from the start, then you are still looking at a 92x improvement. Now consider supercomputers that cost $10,000 per day and the many scientific processes that need days or even months of runtime. Then even 2% is a substantial saving, let alone cutting runtime to 1/92 (~1%) of the original.

  • @ajayrajan8882
    @ajayrajan8882 4 years ago +33

    The material taught here is gold. Zero dislikes, just how it's supposed to be!

    • @merlinarthur4661
      @merlinarthur4661 3 years ago +8

      I think because of your comment it got 2 dislikes 🥲🥲

    • @ajayrajan8882
      @ajayrajan8882 3 years ago +1

      @@merlinarthur4661 plausible

    • @larry_the
      @larry_the 2 years ago

      still no dislikes a year later

    • @mytech6779
      @mytech6779 2 years ago +4

      @@larry_the YouTube turned off dislikes a couple of months ago; it's rainbows and unicorns forever....

    • @sarantsogtmunkhbaatar6974
      @sarantsogtmunkhbaatar6974 2 years ago

      u fucking jinxed it

  • @windmaple
    @windmaple 5 years ago +50

    This is a truly amazing class.

  • @MIT60346
    @MIT60346 5 years ago +31

    Glad to see Prof Leiserson again.
    Btw, CLRS is always awesome!

    • @奇l
      @奇l 3 years ago +1

      hha invoker

  • @josephgibson2548
    @josephgibson2548 5 years ago +27

    Thank you MIT! Once again a great video and lecture.

  • @OTadashi162
    @OTadashi162 2 years ago +1

    I wasn't expecting this to be a performance optimization class (I didn't read the description; I was looking for game-dev-related stuff), but I found it very interesting. I'd never seen these kinds of advanced concepts about how computers REALLY work.

  • @BinuVargheseamd
    @BinuVargheseamd 2 years ago

    One of the best classes I have attended. Absolutely fabulous. Thank you for the upload.

  • @npe_exception
    @npe_exception 3 years ago +3

    2:59 Changed my whole view of SE

  • @Originalimoc
    @Originalimoc 3 years ago +1

    38:16 Because of a possible race condition, since += is used?

  • @mandcali7613
    @mandcali7613 1 year ago

    Wow! Thank you… a lot of the question marks that I’ve encountered before were answered here…

  • @takshpatel8109
    @takshpatel8109 2 years ago

    I really appreciate these open-source courses

  • @kythrathesuntamer9715
    @kythrathesuntamer9715 2 years ago +1

    This is incredible. Thank you so much for this course MIT.

  • @backpackmusician
    @backpackmusician 5 months ago

    26:12 Could it be faster if you wrote it in assembly? And if so, by how much?

  • @thehardikraheja
    @thehardikraheja 2 years ago +1

    As to why we cannot cross 40% of peak: is the reason memory accesses? Since a whole matrix of this size cannot be cached, I am curious whether there are more reasons as well.

  • @gabrielagalindo6339
    @gabrielagalindo6339 4 years ago +2

    Thanks MIT, amazing lecture

  • @Er.Sunil.Pedgaonkar
    @Er.Sunil.Pedgaonkar 1 year ago

    A course in System Software and Programming is a must for CS / CpE majors

  • @hanw544
    @hanw544 4 years ago +21

    This guy wrote Intro to Algos, OMG😱

    • @tear728
      @tear728 3 years ago

      Wow that's legendary

  • @MysterySemicolon
    @MysterySemicolon 3 years ago +5

    Great class. I had to laugh at how much specs have shifted in just two years, given that he describes the AWS large machine as a "honkin' big machine". Enterprise VDI environments are really pushing what constitutes a honkin' big machine upward.

    • @mytech6779
      @mytech6779 2 years ago +3

      There were bigger AWS options even in 2018; the point is that most folks are using less than 1% of such a machine and think they are really cooking, or that they just need a faster machine.
      Seeing what 1980s programmers could push through an 8-bit console with 2 KB of memory (like the NES) using various optimization tricks, those machines had a good utilization ratio. (Yes, they were simpler than modern workstation architectures and thus easier to optimize.)

  • @Adit86
    @Adit86 5 years ago +2

    @14:30 - 2000 "OH OHS" = oughts

  • @asdfghjkl36958
    @asdfghjkl36958 2 years ago +3

    Can someone explain why the k loop can't be parallelized?

    • @thehardikraheja
      @thehardikraheja 2 years ago +2

      Because in C[i][j] += A[i][k] * B[k][j], every iteration of the k loop accumulates into the same element C[i][j]. Running those iterations in parallel creates a data race on that += update, whereas different i or j iterations write to different elements of C. This is my understanding of the reason (see the sketch below).
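
      A tiny C sketch of the race (my own illustration, assuming OpenMP; the lecture itself uses Cilk). All iterations of the k loop accumulate into one shared value, so running them in parallel loses updates:

      ```c
      #include <stdio.h>
      #define N 100000

      int main(void) {
          static double a[N], b[N];
          for (int k = 0; k < N; ++k) { a[k] = 1.0; b[k] = 1.0; }

          double c = 0.0;
          // BUG: c += ... is an unsynchronized read-modify-write of the
          // shared accumulator; two threads can read the same old value
          // of c, and one of the two updates is then lost.
          #pragma omp parallel for
          for (int k = 0; k < N; ++k)
              c += a[k] * b[k];           // data race on c
          // A fix here would be reduction(+:c), which gives each thread a
          // private partial sum. The i and j loops need no such fix, since
          // different iterations write to different elements of C.
          printf("c = %.0f (expected %d)\n", c, N);
      }
      ```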

  • @codefast93
    @codefast93 1 year ago

    Amazing lecture! Why does it go up to only 41.6% of peak?

  • @veramentegina
    @veramentegina 5 years ago +4

    what a great lecture!!

  • @lange22
    @lange22 3 years ago

    What is the source/paper/study behind the slides @ 14:00?

    • @mitocw
      @mitocw  3 years ago

      Visit ocw.mit.edu/6-172F18 for the course materials. Best wishes on your studies!

  • @MathNerdGamer
    @MathNerdGamer 3 years ago +1

    Was this video unlisted until just 30-ish minutes ago?
    I saw the title and was excited to see another updated 6.172, but then I noticed that the thumbnail showed that I'd already watched it before.

    • @mitocw
      @mitocw  3 years ago +1

      Good catch! Yes it was. We're glad you got to enjoy it anyways :D

  • @stephenodogwu7359
    @stephenodogwu7359 9 months ago

    I went through this very engaging lecture and thought I understood it. Maybe I'm missing something, because I ran the matrix multiplication on my PC and my results came back very fast: without optimization, C performed the operation in just 0.03 seconds. Am I getting something wrong about why it took so much longer in this lecture?

    • @ankushm3t
      @ankushm3t 5 months ago +1

      Did you use a 4096x4096 matrix?

  • @pravinjoshi8978
    @pravinjoshi8978 5 years ago +2

    Thanks MIT!!! Keep up the good work...

  • @FernandoJavierSosa
    @FernandoJavierSosa 5 years ago +2

    Thanks for sharing this excellent lesson!

  • @ayahdilara10
    @ayahdilara10 1 month ago

    Marking my attendance 🙋🏻‍♀️
    Anyone else?

  • @bob-the-constructors9912
    @bob-the-constructors9912 3 years ago

    Thank you MIT

  • @FreedomIsNeverEverFree
    @FreedomIsNeverEverFree 3 years ago

    how long does a numpy matrix multiplication take?

  • @seunghwacho5889
    @seunghwacho5889 5 years ago +1

    Thank you for sharing great lessons.

  • @serkardis292
    @serkardis292 3 years ago +1

    These optimizations are amazing; the Python code isn't great, though. I managed to write a pure Python version that runs in 300s (using PyPy), with no optimizations other than loop reordering. Still slow, but it's just 5 LOC written in 3 minutes.

    • @miguelescutia5556
      @miguelescutia5556 1 year ago

      You used concepts from his lesson anyway.

    • @serkardis292
      @serkardis292 1 year ago

      @@miguelescutia5556 Yes, it's an amazing lecture. I just wanted to highlight that barely-optimized Python is less than 2x slower than C at the same level of optimization.

  • @regulus8518
    @regulus8518 5 years ago +2

    What is the prescribed reading for this course?

    • @mitocw
      @mitocw  5 years ago +18

      6.172 has no required readings. There is a list of supplemental readings on MIT OpenCourseWare at: ocw.mit.edu/6-172F18. Best wishes on your studies!

    • @maloxi1472
      @maloxi1472 4 years ago

      Thanks @@mitocw

    • @erikremmelzwaal1225
      @erikremmelzwaal1225 1 month ago

      In performance engineering, all aspects of computer science come together: programming, operating systems, memory models, (parallel) algorithm design, computer/chip architecture, infrastructure architecture, etc.
      Performance optimization is about going from one layer (to optimize) to the next layer of your application's architecture, and cycling through the layers until the goals are met. Always measure, qualify, and quantify before changing something. OK, sometimes you do not have enough understanding, or some parts may be a black box; that leaves room for experimenting.

  • @유현수-s8o
    @유현수-s8o 3 years ago

    It's a cool lecture!

  • @JP-vg8vl
    @JP-vg8vl 3 years ago +3

    7:00 so Michael Jackson is a computer scientist, huh

  • @chessimprovement8804
    @chessimprovement8804 2 years ago

    Great resource.

  • @Kingpin.7
    @Kingpin.7 3 years ago

    Can anyone say whether these videos are meant to be watched in order, or are they all independent of each other?

    • @mytech6779
      @mytech6779 2 years ago +1

      These are recorded class lectures, as it clearly says in the description. They are a series, not independent.

  • @RonMartinJr
    @RonMartinJr 5 years ago +1

    That was amazing

  • @21fad24
    @21fad24 3 years ago

    what was changed in this re-upload?

    • @mitocw
      @mitocw  3 years ago +1

      This was not a re-upload. We just forgot to make this lecture public. Everything else was public, except this one lecture. 🤦‍♂️

  • @manfredbogner9799
    @manfredbogner9799 5 months ago

    Very good

  • @SFXFF
    @SFXFF 2 years ago

    Chalkboard is so cool

  • @aaronrenfroe5199
    @aaronrenfroe5199 4 years ago +3

    What is absolutely bonkers is that Python's numpy.matmul holds its own up until #9 on my 5-year-old MacBook Pro, and it's just C = np.matmul(A, B). Lol. So yes, plain Python is slow, but it can be fast. Good lecture.

    • @sauravtiwary8513
      @sauravtiwary8513 3 years ago +8

      @Aaron, The core logic of NumPy that actually does the multiplication is written in C and not Python. github.com/numpy/numpy/tree/master/numpy/core/src

    • @mytech6779
      @mytech6779 2 years ago

      @@sauravtiwary8513 Yep, can't get around the laws of physics. (Same algorithms pre-compiled vs interpreted in real time.)

  • @abpccpba
    @abpccpba 3 years ago

    Why not try the Forth programming language? Super fast.

  • @Soulixs
    @Soulixs 4 years ago

    Thank you

  • @tolga1503
    @tolga1503 3 years ago

    VAHAB BEYAAZ AHMET ÇAKAAR

  • @diego44551
    @diego44551 4 years ago

    Does anyone know some good books on software? I want to study a different branch of engineering, but I would like to learn software too.

  • @p0187
    @p0187 1 year ago

    program vs software vs application 🤔

  • @DistortedV12
    @DistortedV12 3 years ago +1

    Bro why does this have so many views??

  • @This.Object
    @This.Object 1 year ago

    Did he just say "well, I took this 1.3-hour class to prove you don't need performance, just write your god damn code in such a way that it makes some fucking sense" 😂

  • @eagledove9
    @eagledove9 3 years ago +1

    If I wanted to make an Amish-like computer, using very simple and primitive parts, and if my goal was to produce all of it in the United States instead of in China, it might make sense to build computers and appliances with less power, less memory, and a frugal approach to programming. You might set a goal of building a computer that doesn't use certain kinds of rare-earth elements, using only elements that could be mined locally, so you would give up a lot of the advanced, modern hardware.

    • @mytech6779
      @mytech6779 2 years ago +1

      You can make computers with water and mechanical valves. So what?

  • @rileydavidjesus
    @rileydavidjesus 2 years ago

    Run it in Rust though

  • @xiaoxiashirleywu2799
    @xiaoxiashirleywu2799 3 years ago

    check

  • @Pabloparsil
    @Pabloparsil 6 months ago

    Web monkeys mentioned

  • @gp10020
    @gp10020 3 years ago

    i wonder what he's drinking?

  • @BaptisteCanton1
    @BaptisteCanton1 11 months ago

    TIL: web monkey concept 😂😅

  • @kenichimori8533
    @kenichimori8533 5 years ago

    Rank(0)(1,0)matrix permutix.

  • @Matowix
    @Matowix 2 years ago

    Basic stuff I learned early on in my first year.

    • @mytech6779
      @mytech6779 2 years ago +3

      This is a 100-level class... so your point is?

    • @johnwolves2705
      @johnwolves2705 2 years ago +3

      @@mytech6779 I think he was thinking of the 3×3 matrix multiplication in C that's taught mostly in the first year at most universities 😂😂