Apple M1 Ultra & NUMA - Computerphile

  • Published: 2 Oct 2024

Comments • 389

  • @50PullUps
    @50PullUps 2 года назад +720

    This entry is pure gold. Please make more vids where the latest tech is a jumping off point for the main topic.

    • @Stopinvadingmyhardware
      @Stopinvadingmyhardware 2 года назад

      Where did I apply for that?

    • @oskrm
      @oskrm 2 года назад +12

      That's the thing, this is not the latest tech.

    • @nezbrun872
      @nezbrun872 2 года назад +4

      NUMA's not new; it's been a facet of multi-socket Xeon systems for many years, for example, and of other architectures before that. The battle has always been to make the interconnect interfaces (QPI/UPI in Intel speak) as quick as possible to maximise performance. Software like RDBMSs is NUMA-aware to optimise workload placement across sockets (and hence memory domains) - a sketch of what that can look like in code is below.
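      For illustration only (my sketch, not from the video): on Linux, a NUMA-aware program can pin a worker to one socket and back its buffer with that socket's RAM using libnuma. The node id and buffer size are made-up values; compile with -lnuma.
        /* Sketch: keep a worker's buffer on its own NUMA node (Linux + libnuma). */
        #include <numa.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support on this system\n");
                return 1;
            }
            int node = 0;                              /* illustrative node id */
            numa_run_on_node(node);                    /* keep this thread on node 0's CPUs */
            size_t sz = 64UL * 1024 * 1024;            /* illustrative 64 MB buffer */
            void *buf = numa_alloc_onnode(sz, node);   /* back it with node 0's RAM */
            if (!buf) return 1;
            /* ... do the node-local work here, never crossing the interconnect ... */
            numa_free(buf, sz);
            return 0;
        }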

    • @darkidz24
      @darkidz24 2 года назад +1

      It could really take this channel to the next level!! Explaining modern day tech

    • @SilentlyContinue
      @SilentlyContinue 2 года назад

      Yes! Helps with understanding real world application.

  • @TechTechPotato
    @TechTechPotato 2 года назад +313

    Intel's EMIB in Sapphire Rapids, similar to UltraFusion, adds an additional latency of 5-8 nanoseconds. This takes the worst-case core-to-core latency from 54 ns to 70 ns. Apple's situation is similar, with similar bandwidth per connection, so we expect the latency to be an additional 5-8 nanoseconds as well. UltraFusion uses TSMC's InFO_LSI packaging.

    • @Joseph_Roffey
      @Joseph_Roffey 2 года назад +70

      But the difference is one is called “random string of letters” and the other is called “Ultra Fusion” 😍

    • @eddyecho
      @eddyecho 2 года назад +44

      @@Joseph_Roffey huh? More like one is a "stupid marketing name that really doesn't describe the underlying mechanism" and the other is called "embedded multi-die interconnect bridge"

    • @landspide
      @landspide 2 года назад +18

      @@Joseph_Roffey And begins with "We call this..." and is filled with "... only at Apple can we ..."

    • @shunyaatma
      @shunyaatma 2 года назад +3

      Any numbers for AMD (Zen 2 and 3) 2-socket systems with and without xGMI cables?

    • @egor1g
      @egor1g 2 года назад +14

      yeah, but it is ARM vs x86, 256 channel memory against 6 and also efficiency cores, also video memory... so not really the same!

  • @markholm7050
    @markholm7050 2 года назад +66

    Can one still purchase green lined, perforated line printer paper or are you working off an old stock? That stuff was great for physics homework. Worked pretty well in line printers, too.

    • @sajukkhar
      @sajukkhar 2 года назад +5

      Dot matrix paper is still sold.

    • @rabidbigdog
      @rabidbigdog 2 года назад +22

      I'm convinced there is a warehouse in Nottingham that is full of nothing but that tractor-feed paper, just for Computerphile.

    • @davidgillies620
      @davidgillies620 2 года назад +5

      You can buy a couple of thousand feet of the green ruled stuff for about forty quid from any wholesale stationery supply store.

    • @arpanmajumdar617
      @arpanmajumdar617 2 года назад +12

      I think they are still available at Dunder Mifflin.

    • @heisen9460
      @heisen9460 2 года назад +2

      @@arpanmajumdar617 lol

  • @as-qh1qq
    @as-qh1qq 2 года назад +105

    Why does making the interconnect (distributed shared memory) super fast not bring back the original problem we were trying to solve - increased memory access collisions as you add CPUs? After all, if far-away CPUs can access memory in nearly the same time as the nearby ones, how is it any different from just one memory with all near and far CPUs connected to it?

    • @ssvis2
      @ssvis2 2 года назад +11

      It probably would reintroduce the problem. However, I would suspect there is some trickery under the hood of the OS working with the hardware to optimize data locality to keep the data on the "near" memory for any core. It's possible that part of it is memory mapping in the data interconnect so that memory on the "far" chunk could still be viewed as local to a core, and the super fast interconnect effectively negates the performance penalty that a traditional NUMA system would have.

    • @samuie2
      @samuie2 2 года назад +20

      I agree that it was not super clear in the video. I think you could still have that issue; however, it happens roughly half as often since you have two banks of memory.

    • @davidgillies620
      @davidgillies620 2 года назад +6

      I would guess that it means you don't _have_ to tune data affinity (which makes development/deployment easier and therefore cheaper) but you _can_ if you want (which gives you the benefits of an optimised NUMA configuration).

    • @ssvis2
      @ssvis2 2 года назад +3

      @@davidgillies620 I'm thinking the same thing. By optimizing specific parts of the system, Apple has theoretically designed something that will perform really well in 99% of use cases. There's always more performance to squeeze out, but with severely diminishing returns.

    • @gajbooks
      @gajbooks 2 года назад +4

      UltraFusion is really just a memory... Fusion. Their memory gets twice as fast since they have twice as many banks, they just need a way to combine the M1 chips so that both of them can use the other's memory at high speeds. There was probably some tradeoff with the memory controller or packaging which made them need 2x64 rather than having external 128 GB. I imagine their real Mac Pro replacement will have external memory and GPU.

  • @paulledak291
    @paulledak291 2 года назад +195

    Nice explanation of how the NUMA architecture is implemented. However, you stated that the reason for moving to this architecture is that as you add more and more cores, you increase the probability of memory collisions. But then you completely forgot to explain how having 2 memory banks reduces the probability of the memory collisions that you would still get as you add more processors. That would seem to be the most essential element needed for this video, and it is completely missing. (Yes, I understand that there are now 2 memory banks with twice the bus bandwidth, but this is never explained. And there are different interleaved memory architectures which could increase the memory bandwidth without resorting to NUMA.)

    • @bberakable
      @bberakable 2 года назад +2

      Agree 100%

    • @mytech6779
      @mytech6779 2 года назад +21

      It's not bandwidth that's at issue; simultaneous access is the issue, and this allows the banks to be accessed in parallel. It's like using a network bridge to make two Ethernet sub-nets. Which I just realized is a really outdated reference, as nobody uses shared-media networks anymore.
      But basically all computers on a subnet could hear all packets on that subnet, as it was physically one solid wire, and as more nodes were added you would get more chance of collisions and congestion (a non-linear increase). So chop it in two with a bridge (like a filter of sorts) so only about half of the total traffic can be seen, because only packets addressed to the other subnet are passed through the bridge.
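      A toy simulation of that idea (mine, with made-up numbers): N cores each issue a request in a cycle with some probability, and a "collision" is any cycle in which one bank is asked for twice. Splitting the cores across two banks reduces how often that happens.
        #include <stdio.h>
        #include <stdlib.h>

        /* Did any bank get more than one request this cycle? */
        static int collide(int cores, int banks, double p)
        {
            int hits[2] = {0, 0};                  /* supports 1 or 2 banks here */
            for (int c = 0; c < cores; c++)
                if ((double)rand() / RAND_MAX < p) /* this core issues a request */
                    hits[c % banks]++;             /* it goes to its "home" bank */
            for (int b = 0; b < banks; b++)
                if (hits[b] > 1) return 1;
            return 0;
        }

        int main(void)
        {
            const int trials = 100000, cores = 16;
            const double p = 0.2;                  /* made-up request rate */
            for (int banks = 1; banks <= 2; banks++) {
                int n = 0;
                for (int t = 0; t < trials; t++)
                    n += collide(cores, banks, p);
                printf("%d bank(s): contention in %.1f%% of cycles\n",
                       banks, 100.0 * n / trials);
            }
            return 0;
        }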

    • @Sandeep-cz7ls
      @Sandeep-cz7ls 2 года назад +2

      @@mytech6779 wait im still confused, how does this allow the banks to be accessed in parallel? is it due to the interconnect?

    • @valshaped
      @valshaped 2 года назад +9

      @@Sandeep-cz7ls Each bank can be accessed by one CPU at a time
      More banks -> more CPUs at a time

    • @MaulikParmar210
      @MaulikParmar210 2 года назад +8

      @@Sandeep-cz7ls To keep it simple: in modern CPUs, or let's say a CPU cluster, there is a memory controller inside each cluster that makes requests on behalf of the physical CPU die. But in NUMA there are multiple clusters acting on their own, so there are multiple access points through which different CPUs can reach different (or the same) memory banks.
      When two controllers try to access the same bank and location, that parallel access causes a lot of data inconsistencies if reads and writes happen at the same time from different CPUs, unless it is handled at the software level so that the software is aware of such an architecture. The OS knows the memory space, and the kernel is generally responsible for making sure each CPU's request is translated in the proper order to the proper physical location, using translation tables or other hardware that accelerates this process, depending on what's available. In NUMA these are much more complex, as each node has to communicate and coordinate exactly what it needs; that's where the connecting fabric comes in, which provides the crucial functions to get data in and out of foreign clusters.
      Keep in mind that when we talk about software, it's mostly OS-level software and not consumer APIs, as consumer APIs abstract away these traits. Your software would never know or have to care whether it's running on 1 core, 4 cores, or 12 cores across 2 CPU sockets; in the eyes of userspace, resources are unified. Unless you want to optimise - then of course you can ask the system to allocate memory near a resource. It's the job of the OS to maintain and abstract the hardware and allow controlled access via syscalls or driver APIs.

  • @RegitYouTuber
    @RegitYouTuber 2 года назад +1

    Favourite bit of this was the chaotic side-angle crash zoom - it really complements the desperate addition of “well of course it's more complex than this, but” that seems necessary these days

  • @BenjyP.
    @BenjyP. 2 года назад +26

    I read ML instead of M1, so I thought this would be a video on how the neural cores work. I would love a video on how to use the Apple neural cores for machine learning, as they already take up 20% of the entire chip's area

  • @doctorpex6862
    @doctorpex6862 2 года назад +3

    Netflix gains most of speed by "video is not available in your country"

  • @zilog1
    @zilog1 2 года назад +6

    As much as I love their hardware, I still can't bring myself to get one. If they can't get it through their thick heads that people actually need to fix things and do what they want with their device, without jumping through hoops just to do something as simple as downgrading the OS to an unsigned IPSW, then I don't own one, nor will I ever own one. If I can't do what I want with it, I cannot by nature own it, nor will I ever own an Apple product. It's not because I don't want to, it's because I can't. If they would just stop being so stubborn and manipulative, I'll use one. Until then, no thanks.

    • @baronvonschnellenstein2811
      @baronvonschnellenstein2811 2 года назад +1

      Not to mention their price-gouging outside of the USA, and nobbling older devices (esp. iPhones) via firmware/OS updates when the new model comes out.

  • @KipIngram
    @KipIngram 6 месяцев назад

    It's worth noting that the PCI ports are usually also split into these two domains, so you want to take that into account as well.

  • @TheMrKeksLp
    @TheMrKeksLp 2 года назад +7

    Modern CPUs are only Harvard architectures in the most pedantic classification. Instructions are still kept in main memory; they just have separate level 1 instruction and data caches. Even levels 2 and 3 are shared...

  • @sholinwright6621
    @sholinwright6621 2 года назад +6

    Don't you still have to write code to distribute the memory hits across the two memory banks, or do you just get the same multi-core stalling effect mentioned earlier? The speed-up was the ability to partition core memory fetches into two batches, preventing all of the cores stalling while trying to fetch from the same bank. Side note: I work on a radar with 11 CPU cards, each with an 88000 and 2 MB of local RAM, with the collection tied to 2 global memory cards with 8 MB of RAM. GRAM memory fetches are really expensive.

  • @JohnnyWednesday
    @JohnnyWednesday 2 года назад +19

    Thank you kindly Dr. Bagley for sharing your knowledge with us. I'm quite surprised that Intel and AMD have not yet pushed for on-die memory given the M1's impressive demonstration

    • @SimonVaIe
      @SimonVaIe 2 года назад +12

      It does have some negative consequences. More expensive to produce, not expandable, if one thing breaks the whole thing is broken. I also don't know how much expertise would be required in ram design/production (keep in mind that Apple is far bigger than intel, which is far bigger than AMD) seeing there is a very well established ecosystem of memory manufacturers (they do have quite extensive cache systems on their CPUs already, don't know how well that translates). And not every task profits as much from faster ram. No idea if those are major reasons for amd and intel, but like for everything else it's just a matter of finding what best fits a job.

    • @dotted1337
      @dotted1337 2 года назад +12

      On-die RAM is rather limiting, so it won't really work well for either AMD or Intel to make such a product: that kind of RAM is much too slow, in terms of both bandwidth and latency, for use as a cache, and if used as RAM you'd have the same problem this video is talking about. But Intel had the i7-5775C back in 2015 with 128MB of eDRAM for the on-board GPU, which was also used as an L4 cache, and Intel's upcoming Sapphire Rapids Xeon will have a version with 64GB of on-package HBM2E with a bandwidth of well over 1TB per second. And finally you have AMD with their V-Cache, supposedly having a bandwidth of about 2TB per second. tl;dr Apple can do on-die memory because they know exactly who their customers are and can make almost tailor-made SoCs for them, whereas AMD and Intel have customers much too diverse to make on-die memory viable.

    • @JohnnyWednesday
      @JohnnyWednesday 2 года назад

      @@dotted1337 - Thank you for your detailed reply. I was unaware of the i7-5775C - that smells like it could have been designed for use in a console, given the perceived similarity to previous Xbox memory layouts. It is my understanding that a large part of the M1's 'boost' over other ARM designs is the lower-latency access to system memory?
      Perhaps naive, but if such performance can be gained for an ARM chip, then shouldn't a similar ratio of performance be seen with a similarly designed x86 chip?
      With ultra-fast streaming devices and multi-channel paradigms like the PS5's SSD controller, could we not see a slowing of average memory capacity for users? Perhaps the time for a fixed 16GB of memory on a CPU is now? Especially given that the console generations are locking game engine technology advancements in for years at a time?

    • @harshpatel9020
      @harshpatel9020 2 года назад +3

      I think this is because they use DDR in their desktop models (and not laptops, because laptops come with both) and not the LPDDR used in Apple's
      M1 line-up.
      In mobile processors, where both DDR and LPDDR are used, the RAM is mounted on the PCB (soldered to the motherboard, not on the die itself as you said is the case for Apple).
      Note - some of what I said may turn out to be wrong, so it's best to cross-check before drawing conclusions. I'd be happy to learn where I'm wrong and learn something new. Thank you)

    • @mytech6779
      @mytech6779 2 года назад +1

      On-die memory is called L1 cache; L2 and L3 are often placed on die as well. In fact, over 80% of late-generation CPU silicon area is taken up by on-die memory.
      (NB4: yes, the 386 had off-die L1, but it was 1986)

  • @kuroexmachina
    @kuroexmachina 2 года назад

    this channel is gold. always has been

  • @IceMetalPunk
    @IceMetalPunk 2 года назад +17

    Apple: "M1 ULTRA FUSION!"
    Reality: "It's a fast wire junction."

    • @G5rry
      @G5rry 2 года назад +10

      Reality: No, it's a bit more than that.

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu 2 года назад

      If it were so easy, everyone would make a 10TB/s interconnect 😂
      It's a lot more complex than that.

    • @giornikitop5373
      @giornikitop5373 2 года назад

      @@RunForPeace-hk1cu It IS actually fairly straightforward to make a 10TB/s interconnect, but the cost is beyond crazy. Besides, you need a CPU powerful enough to take advantage of it, so the cost makes even less sense. So the reason is not that they cannot make it; the reason is that they don't need to, at least not yet.

  • @JJ-fq3dh
    @JJ-fq3dh 2 года назад

    Great video, brings back memories of coding on an SGI Origin 2000 and IRIX

  • @danielsilva158
    @danielsilva158 2 года назад +7

    Would’ve been good to touch on how this memory system interfaces with the gpu!!

  • @petrilaakso7927
    @petrilaakso7927 2 года назад +1

    Excellent explanation of NUMA, excellent work🙏🏼

  • @X_Baron
    @X_Baron 2 года назад

    Ultra Fusion is basically Blast Processing, but more extreme and rad.

  • @tcornell05
    @tcornell05 2 года назад +3

    This might be the most informative video I've come across on YouTube in years. You have an amazing way of articulating topics like this to the ADHD and dyslexic programming community, like myself xD. Now I'm dying for a follow-up on how exactly they managed to make the distributed shared memory link so fast. Any resources you recommend?

  • @tomdchi12
    @tomdchi12 2 года назад +5

    Doesn't Apple provide the compilers (and IDE) so couldn't they be baking in the modifications to the code that is required to manage the non-uniformness of memory access times? (Regardless, early benchmarks indicate that performance is scaling only a little short of linearly with the number of cores, so we can infer that memory access across the two halves of the "fused" CPU isn't creating major delays.)

  • @mysteriousm1
    @mysteriousm1 2 года назад +4

    Was there an earthquake during filming or why is it so shaky?

  • @dembro27
    @dembro27 2 года назад +4

    Cool stuff. But now I have "Numa Numa" in my head...

  • @jfmezei
    @jfmezei 2 года назад +2

    Great to find someone who remembers NUMA !!
    BTW, you forgot to deal with cache coherence: core 1 modifying contents at a memory location that is also in core 2's cache.
    In the 1990s, Digital tried to scale its Alpha computers to have many cores with its Wildfire-class machines. They found that 4 cores was the max the memory controller could handle before performance increments stopped being interesting. So they created the Wildfires with 4-CPU "QBBs" that were boards, connected by what Digital called a switch. NUMA access between these QBBs was atrocious.
    This was dealt with at the operating system level, less so at the application level. You could pre-load shareable images onto a specific QBB and then launch the processes that use them on that QBB, so they would use local memory for shareable images etc. But this was nowhere near enough.
    Digital then worked on the next-generation Alpha, the EV7, which was delayed as long as possible because Compaq/HP, who had bought Digital, didn't want the EV7 to beat the pants off the Intel Itanium heat generator.
    The EV7 introduced a totally new memory controller that remained state of the art beyond the death of Alpha. HP donated Alpha IP to Intel, which used it for its CSI interconnect (later called QuickPath), and it evolved from there. Ex-Alpha engineers went to AMD, who developed their own version, and many ex-Alpha engineers formed PA Semi, which was purchased by Apple to create its own ARM chips. The EV7 had coherent cache (and I believe only IBM's POWER had this until AMD matched it). Intel's QuickPath did not implement coherent cache initially (despite having all the IP from DEC).
    If you google Alpha Wildfire NUMA, you will find a result "Optimizing for Performance on Alpha Systems - Semantic..." by Norm Lastovica. It provides some then-current memory access figures showing the differences between direct and NUMA accesses in the Wildfires. Page 26 also shows the EV7 memory architecture as a fabric (21364 is the EV7 CPU; the first generation was the 21064). Each CPU controlled a part of RAM. But because CPU 1 could request memory from CPU 2 at the same time as CPU 3 requested from CPU 4, CPU 5 from CPU 6, etc., it ended up having a huge performance advantage when scaling the number of cores.
    There was also the issue of CPU speed vs memory speed. Alpha came to surpass memory speed easily, hence the 4-core limit Digital found in the 1990s. But increasing memory speed (and it has increased tremendously since then) lets you increase the number of cores that have direct access, especially in the last little while when "Moore's Law" was more about adding cores than making each core faster.
    Before its demise, Digital's engineers would present at DECUS conferences and provide much information about Alpha advancements and how they improved things. It is a real shame that Apple hides all the real information and only provides marketing gobbledygook that is useless.

    • @andybaldman
      @andybaldman 2 года назад

      Nobody cares, man.

    • @RogerBarraud
      @RogerBarraud 2 года назад

      @@andybaldman You are wrong on the Internet.

    • @andybaldman
      @andybaldman 2 года назад

      @@RogerBarraud Nope you are

  • @wile123456
    @wile123456 2 года назад +1

    Maybe you've done it before but I would love a video explaining video games vs rendering/productivity workloads.
    Games get a big performance boost from more cache: the 5800X3D, an 8-core CPU, increased performance a lot by more than doubling the level 3 cache with 3D stacking. But why does it mostly benefit only games and not other workloads?

  • @grahmn886
    @grahmn886 2 года назад +1

    Lesson of the day, Thanks as always Steve :)

  • @newburypi
    @newburypi 2 года назад +7

    Think I missed something here. Totally got the "was slow but Apple made it fast." However I think there's a promise of "won't need to change the software." The NUMA method requires knowledge of which memory block has the desired data. Hence, a change to software. So... did they also build a way to hide the fact of two memory blocks?

    • @elliott8175
      @elliott8175 2 года назад +11

      The reason NUMA systems usually require the software developers to be aware of the positioning of CPUs and memory is because of the slower speeds when fetching data from memory that is farther away. However, the new M1 chip claims to make fetching data fast enough for the worst-case RAM position to still not cause any slow-down.
      I assume this means that the difference in time to fetch memory that is close, compared to memory that is far away, is less than a clock cycle. So from the core's point-of-view they have the same latency.

    • @newburypi
      @newburypi 2 года назад

      @@elliott8175 great. Thanks for the clarification. Thought I missed something.

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu 2 года назад +2

      @@elliott8175 the “trick” is literally the hardest part that no one could solve 😂

  • @user-cx2bk6pm2f
    @user-cx2bk6pm2f 2 года назад

    Finally!! I understand NUMA.. thank you !

  • @NinjaAdorable
    @NinjaAdorable 10 месяцев назад +2

    This has been one of the most intuitive and elegant explanations for NUMA I have ever heard!! Kudos

  • @SimonJentzschX7
    @SimonJentzschX7 2 года назад +4

    Great video. I learned something new! Just one question: could the operating system optimise my code when executing? When I allocate memory, the OS should know which CPU this process is running on and allocate the memory in the RAM that is faster to access. This way the code does not need to change, just the OS.

    • @mr_waffles_the_dog
      @mr_waffles_the_dog 2 года назад +4

      OSes already tend to do this :D
      The problem is what happens when you have multithreaded code (e.g. running on multiple cores/CPUs at once): there is no single ideal block of memory for the OS to allocate from. The Apple claim is that their system is non-NUMA, or at least sufficiently fast to be indistinguishable, so developers don't have to rearchitect things to maximize performance.

  • @OscarBerenguerPV
    @OscarBerenguerPV 2 года назад

    This was a great video

  • @salmiakki5638
    @salmiakki5638 2 года назад +13

    *It's only the first two generations of Threadripper CPUs that have 2 NUMA nodes.
    The latest one and both generations of Threadripper Pro have unified memory access.

    • @romevang
      @romevang 2 года назад

      The Threadripper 2990WX has 4 NUMA nodes. The 2950X, I think, has 2.

    • @salmiakki5638
      @salmiakki5638 2 года назад

      @@romevang Thanks, I thought I remembered it was the same throughout the range.

  • @radutopor8389
    @radutopor8389 2 года назад +3

    I still don't get why splitting the RAM in two wouldn't cause the same collisions problem with the high number of CPUs, given they effectively still share just one bus, albeit connected by some black box in the middle.

    • @YeOldeTraveller
      @YeOldeTraveller 2 года назад +2

      Because the two NUMA regions are separate, any access in one region does not impact access in another region. Even without coding for it, you reduce the likelihood of collision.

  • @bumbixp
    @bumbixp 2 года назад +12

    Doesn't the OS scheduler largely handle this? Even if you make a single threaded app, Windows will move it around on different cores but it stays within the same NUMA node.

    • @Pyroblaster1
      @Pyroblaster1 2 года назад +1

      Let's say you allocate a buffer and load data into memory in a single thread and then start many threads to process that data, which is a perfectly reasonable and usual way to do things with uniform memory access. Then, if you saturate the system with threads, half or more of the threads will run on NUMA nodes different from where the data buffer was allocated, incurring the longer access times. You have to explicitly handle the allocation and data loading so that the data is distributed in a way that the threads processing each part of the data are on the same NUMA node as the data they are processing. The sketch below shows one common way to do that.
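      A sketch of that pattern (mine, assuming Linux's default "allocate where first touched" policy and pthreads; the sizes and thread count are made up): instead of one thread initialising the whole buffer, each worker first-touches and then processes its own slice, so that slice's pages land on the worker's node. For this to be reliable the workers also want to be pinned to their nodes, as in the affinity example in the next reply.
        #include <pthread.h>
        #include <stdlib.h>
        #include <string.h>

        #define NTHREADS 8
        #define N (16u * 1024 * 1024)              /* illustrative element count */

        static double *data;
        struct slice { size_t begin, end; };

        static void *worker(void *arg)
        {
            struct slice *s = arg;
            /* First touch: THIS thread faults the pages in, so the OS backs
               them with RAM local to the node this thread is running on. */
            memset(data + s->begin, 0, (s->end - s->begin) * sizeof(double));
            for (size_t i = s->begin; i < s->end; i++)   /* then work on the same slice */
                data[i] = data[i] * 2.0 + 1.0;
            return NULL;
        }

        int main(void)
        {
            data = malloc((size_t)N * sizeof(double));   /* virtual pages only, not placed yet */
            pthread_t tid[NTHREADS];
            struct slice s[NTHREADS];
            for (int t = 0; t < NTHREADS; t++) {
                s[t].begin = (size_t)t * N / NTHREADS;
                s[t].end   = (size_t)(t + 1) * N / NTHREADS;
                pthread_create(&tid[t], NULL, worker, &s[t]);
            }
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
            free(data);
            return 0;
        }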

    • @Piktogrammdd1234
      @Piktogrammdd1234 2 года назад +1

      Yes and no. There are mitigations at every level to compensate for the problems, but every solution is worse than an idealised system with endless memory, zero latency and no collisions. OS schedulers try to keep data and the corresponding processes local, but the limits are still there. Things get problematic every time processes on a node need more memory than is available locally, or processes are relocated to other nodes.

    • @ivanskyttejrgensen7464
      @ivanskyttejrgensen7464 2 года назад

      The OS tries to handle this, but it's not perfect. E.g. the last time I dealt with this, the OS tried to serve memory allocations from the nearest memory but wouldn't move them around afterwards. So we ended up using processor sets to direct processes to be started on the "right" part of the CPUs, so the subsequent memory allocations could all be served from the local memory. That gave a 10-15% speedup compared to leaving it to the OS to figure things out.
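      Roughly what a processor set achieves, sketched with Linux's sched_setaffinity; the assumption that CPUs 0-7 belong to node 0 is mine, and real code would query the topology (e.g. via libnuma) instead of hard-coding it.
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        int main(void)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int cpu = 0; cpu < 8; cpu++)    /* assume CPUs 0-7 are node 0 */
                CPU_SET(cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* pid 0 = this process */
                perror("sched_setaffinity");
                return 1;
            }
            /* From here on we only run on node 0's CPUs, so memory we
               first-touch is served from node 0's RAM by default. */
            return 0;
        }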

  • @Derbauer
    @Derbauer 2 года назад

    Nicely explained!

  • @asmerhamidali9679
    @asmerhamidali9679 2 года назад

    Please make some videos on RISC-V. Lately it has been a hot topic.

  • @marklonergan3898
    @marklonergan3898 2 года назад +4

    Maybe I'm not understanding the problem correctly, but couldn't you just have a rudimentary controller sitting between the two that uses the most significant bit of the address to determine which RAM chip has the data? That way, with the controller between the chips acting as the central access point, all queries would take the same amount of time to fetch the data. By having this logic at the hardware level you would add minimal latency.
    I know this would only work on chips that are the same size, but you could combine composites with singles (i.e. 2x 32s connected with a controller could be combined with an actual 64 with a controller).
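    A toy software model of that decode step (tiny illustrative sizes, two 8-byte "chips"); as the replies point out, though, picking the chip from an address bit doesn't remove the extra physical distance to the far chip.
      #include <stdint.h>
      #include <stdio.h>

      #define BANK_BITS 3                                   /* each chip holds 2^3 = 8 bytes */
      static uint8_t chip0[1 << BANK_BITS];
      static uint8_t chip1[1 << BANK_BITS];

      static uint8_t read_byte(uint32_t addr)
      {
          uint32_t top    = (addr >> BANK_BITS) & 1;        /* which chip */
          uint32_t offset = addr & ((1u << BANK_BITS) - 1); /* where in that chip */
          return top ? chip1[offset] : chip0[offset];
      }

      int main(void)
      {
          chip0[5] = 0xAA;                                  /* address 5  lives in chip 0 */
          chip1[5] = 0xBB;                                  /* address 13 lives in chip 1 */
          printf("addr 5  -> %02X\n", read_byte(5));
          printf("addr 13 -> %02X\n", read_byte(13));
          return 0;
      }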

    • @Addlibs
      @Addlibs 2 года назад +6

      This suffers the same slowdowns that result from physically separate RAM locations, close to some groups of CPU cores but not as close to others. Even if the most significant bit picked the RAM module without any fancy chips in the way, fetching data from the module farther down the line is generally going to be slower, and it's easy to double or triple the tiny amount of time it takes to fetch data with computers this compact and fast - that is, 4 nanoseconds is twice as long as 2 nanoseconds, even though both are incredibly fast.

    • @katbryce
      @katbryce 2 года назад +1

      @@Addlibs Remember that a 4GHz CPU goes through 4 clock cycles every nanosecond, and in a nanosecond light travels about 30cm. Electricity is slower, so any round trip of more than about 3cm isn't going to happen within a clock cycle.

  • @fiddley
    @fiddley 2 года назад

    Everyone: Distributed Shared Memory Access Is Slow
    Apple: Let's make a fast one
    How did no-one else think of that?

  • @TheGTP1995
    @TheGTP1995 2 года назад +2

    To be fair, most programmers wouldn't have had to care about how the memory worked anyway because the compiler would have done the job for them. This spared Apple the cost of writing new optimization code for the compiler, at the cost of more hardware engineering

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu 2 года назад

      You’ve never written kernel drivers have you? 😂

    • @K4nj
      @K4nj 2 года назад +2

      @@RunForPeace-hk1cu he clearly hasn't worked within a c++ development environment

  • @kirtanmusica1999
    @kirtanmusica1999 2 года назад

    Namaskar, thank you for the education, thank you for the light

  • @hoagy_ytfc
    @hoagy_ytfc 2 года назад +8

    Given what they claim for "unified memory", the GPUs should be factored into this description, IMO

  • @JCBOOMog
    @JCBOOMog 2 года назад +4

    Hi steve

  • @prla5400
    @prla5400 2 года назад +5

    Back to you, Steve

  • @Sierra-Whisky
    @Sierra-Whisky 2 года назад +2

    What an excellent explanation! And what a coincidence too. I tried to explain NUMA and the potential performance hog on the exact same day this video was published but obviously my explanation was nowhere near as clear as this one. 🤣
    Thanks! I'll share it with my colleagues.

  • @whatthefunction9140
    @whatthefunction9140 2 года назад +1

    Numa numa guy was way ahead of his time

  • @MrMiryks
    @MrMiryks 2 года назад +1

    Why is the camera so wobbly and shaky? It is very irritating and makes me look away from the screen most of the time and just listen to the video like a podcast.

  • @InspektorDreyfus
    @InspektorDreyfus 2 года назад +1

    15 minutes of avoiding the term bus gateway.

  • @jbf81tb
    @jbf81tb 2 года назад +5

    I would like to know what the architecture of the GPU is like after seeing this. I believe GPUs have thousands of cores, but I don't think I see numbers for cache in their marketing materials, just memory, cores, and clock speeds.

    • @TDRinfinity
      @TDRinfinity 2 года назад +2

      I don't know about GPUs with 100% certainty, but I design SOCs with hundreds of cores and we still have individual caches for each core, as well as shared cluster caches

    • @Edekje
      @Edekje 2 года назад +3

      It's quite interesting how GPUs work actually. They function in a completely different way to CPUs, preventing this problem. Each core in a multi-core CPU system functions as a completely independent entity, accessing whatever memory it needs to complete its calculations. Different cores working on separate tasks can therefore end up accessing the same memory, thereby slowing down each other's progress. The gist of what happens in a GPU is that the entire GPU is dedicated to doing just one single task. Each one of its cores executes exactly the same task, piece of code, in lockstep. The key difference here is that each core performs the same actions, but on different pieces of memory. Often these pieces of memory are adjacent. So a (sensible) GPU program will never have different cores trying to lock the same piece of memory simultaneously. The GPU has chopped up one big task into 1000s of equal bite-sized chunks.
      Hope that explanation helps!
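      If it helps, here is the same idea written out in plain C (not real GPU code, just a sketch of the lockstep/indexing pattern, with made-up names): every "lane" runs the same statement and only the index differs, so neighbouring lanes touch neighbouring memory and never fight over one location.
        #include <stdio.h>

        #define LANES_PER_GROUP 32                          /* think "warp"/"wavefront" */

        static void saxpy_group(int group, float a, const float *x, float *y)
        {
            for (int lane = 0; lane < LANES_PER_GROUP; lane++) {
                int i = group * LANES_PER_GROUP + lane;     /* each lane gets its own element */
                y[i] = a * x[i] + y[i];                     /* same instruction, different data */
            }
        }

        int main(void)
        {
            enum { N = 128 };
            float x[N], y[N];
            for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
            for (int g = 0; g < N / LANES_PER_GROUP; g++)   /* groups could all run in parallel */
                saxpy_group(g, 3.0f, x, y);
            printf("y[0] = %.1f\n", y[0]);                  /* 3*1 + 2 = 5.0 */
            return 0;
        }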

    • @TDRinfinity
      @TDRinfinity 2 года назад

      @@Edekje so like vector/data level parallelism vs thread level parallelism?

    • @TDRinfinity
      @TDRinfinity 2 года назад

      @@Edekje like is a GPU usually just executing a single thread of vector instructions at a time, or can it split across clusters of execution units to execute different threads? I have really no GPU experience

    • @jbf81tb
      @jbf81tb 2 года назад +1

      @@Edekje Thank you, that is a helpful explanation. Is there some synchronizer that keeps the cores on task? I'm thinking like Tom, I know the GPU is basically a matrix multiplication fiend. Is that because it can easily distribute all the multiplications to a bunch of cores and then there's a synchronizer that can grab all those products and sum them together in the appropriate way to return the expected matrix?
      Rereading your answer, I'm wondering if it's like a pass-the-bucket situation. Like a core takes a bucket of memory from the core "to its left", it does some operation on that memory, and then hands it off to the core "on its right", looking left for its next bucket. Or is it more like "i've got 10 buckets and I need 10 cores to work on them", hands those off, and then next "I've got 37 buckets," hands them off, etc.

  • @michaellatta
    @michaellatta 2 года назад

    I would guess ram attached to each die and cache is on that die. Interconnect used for off-die access to the other die’s cache/ram.

  • @nameunknown007
    @nameunknown007 2 года назад

    Love you man!

  • @debojitmandal8670
    @debojitmandal8670 2 года назад

    Wait, but Apple isn't using a distributed shared memory like you mentioned.
    Rather, a CPU from one block can access the memory of the other CPU block directly, without even going through the distributed shared memory lane - at least that's what I understood from their presentation.
    There is no middleman like the shared distributed memory lane you mentioned.
    Please correct me if I am wrong.

  • @autohmae
    @autohmae 2 года назад +1

    I wonder if Linux scheduler already has a variable for the latency, so no new code is needed. My guess would be yes.

  • @TomekSw
    @TomekSw 2 года назад

    Great video. Came 10 years late for me. :( :)

  • @zaco-km3su
    @zaco-km3su 2 года назад

    I wonder if Apple just didn't eliminate the distributed shared memory system and fused the 2 systems together.

  • @willemvdk4886
    @willemvdk4886 2 года назад +2

    Isn't this something an OS should abstract away from the software developer? Seems a bit strange to me that the programmer should be aware of any of this.

    • @loknathshankar5423
      @loknathshankar5423 2 года назад

      Not if you want performance when you have 2 separate CPUs.

    • @YeOldeTraveller
      @YeOldeTraveller 2 года назад

      There are methods to restrict processes to NUMA regions, but that also reduces the available memory and cores. Like a lot of things, it depends on how much effort is required for the benefit. If your code is fast enough on a particular architecture, none of this matters. However, if you need to optimize for memory access, then you need to care about the layout. Apple's claim is that their NUMA cost is low enough that you don't need to worry about it. Of course, their marketing actually claims that this is UMA because the cost is so low. I'm sure it is low, but it is not zero. This does mean that the cases where it actually matters are reduced, but it is not magic.

  • @henrikjensen3278
    @henrikjensen3278 2 года назад +2

    Good explanation, but I would like some explanation of writes/reads, i.e. two threads reading and writing the same memory location. This would be easy enough to handle between the two sides, but what about two CPUs on the same side, each with their own cache? It sounds like a lot of circuitry to handle that.
    Are there some smart solutions?

    • @ClarkCox
      @ClarkCox 2 года назад +2

      That is indeed a problem that must be contended with. Look up "cache coherence"
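      A tiny illustration of the problem (mine, in C11 with pthreads): two threads bump the same counter. The hardware keeps the cores' caches coherent line by line, but unless the program also uses an atomic (or a mutex), the read-modify-write itself still races and increments get lost.
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static long plain_counter;                       /* racy */
        static atomic_long atomic_counter;               /* well-defined */

        static void *bump(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 1000000; i++) {
                plain_counter++;                         /* can lose increments */
                atomic_fetch_add(&atomic_counter, 1);    /* never loses increments */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;
            pthread_create(&a, NULL, bump, NULL);
            pthread_create(&b, NULL, bump, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            printf("plain:  %ld (often < 2000000)\n", plain_counter);
            printf("atomic: %ld (always 2000000)\n", atomic_counter);
            return 0;
        }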

  • @SproutyPottedPlant
    @SproutyPottedPlant 2 года назад +1

    That was great! When you showed the bus arbiter it reminded me of the Sega Mega Drive! It’s got one of those??

  • @fernandoblazin
    @fernandoblazin 2 года назад

    When was the last time I saw that type of paper?

  • @circuitgamer7759
    @circuitgamer7759 2 года назад +1

    Video idea (because I don't know where to look for this) - some of the finer details of caching implementation. I understand the idea behind caching, and the structure behind it, but not how it's actually implemented. I want to learn the actual control logic for reading/writing cache lines, and when and how it gets updated to/from RAM or a higher level cache. Do the CPU cores control the caches directly, or is there some control logic for each cache that isn't a part of a specific core?
    I think it would be an interesting video, but if there's already one that exists that I missed, can someone reply with a link? I've only been able to find high-level explanations so far.
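    In the meantime, here's a rough software model of the read path of a tiny direct-mapped cache (the sizes and the stand-in RAM are made up; real hardware does the tag compare with comparators and muxes, and the fill/eviction logic usually lives in a small per-cache controller rather than in the core itself):
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define LINE_BYTES 16
      #define NUM_SETS   64

      struct line { bool valid; uint32_t tag; uint8_t data[LINE_BYTES]; };
      static struct line cache[NUM_SETS];

      /* Stand-in for RAM / the next cache level: each byte holds the low 8 bits of its address. */
      static void fetch_line(uint32_t line_addr, uint8_t *out)
      {
          for (int i = 0; i < LINE_BYTES; i++)
              out[i] = (uint8_t)(line_addr + i);
      }

      static uint8_t cache_read(uint32_t addr)
      {
          uint32_t offset = addr % LINE_BYTES;
          uint32_t set    = (addr / LINE_BYTES) % NUM_SETS;
          uint32_t tag    = addr / LINE_BYTES / NUM_SETS;
          struct line *l  = &cache[set];
          if (!(l->valid && l->tag == tag)) {       /* miss: fill the line, evicting the old one */
              fetch_line(addr - offset, l->data);
              l->tag   = tag;
              l->valid = true;
          }
          return l->data[offset];                   /* hit path: just index into the line */
      }

      int main(void)
      {
          printf("%02X\n", cache_read(0x1234));     /* miss, then fill */
          printf("%02X\n", cache_read(0x1235));     /* hit in the same line */
          return 0;
      }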

  • @vladomaimun
    @vladomaimun 2 года назад +2

    Does application software needs to be NUMA-aware or does the OS kernel handle everything NUMA related?

    • @JamesClarkUK
      @JamesClarkUK 2 года назад +3

      The OS could do scheduling to keep your application on one NUMA node. You can use numactl on Linux to tell the kernel what you want to happen.

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu 2 года назад

      The whole point is that it's a HW implementation and no software needs to be changed.

  • @Xiaomi_Global
    @Xiaomi_Global 2 года назад

    How about the same architecture but different fab interconnect process? Does it affect performance?

  • @Manuel-j3q
    @Manuel-j3q 2 года назад +4

    Numa numa hey
    Nume numa numa hey

    • @rommysoeli
      @rommysoeli 2 года назад

      Not that kind of NUMA

  • @genhen
    @genhen 2 года назад +2

    I've always wondered: if we access more than one NUMA node's worth of memory, how does the memory get chunked up? Take half and half? Take most from one? Is it hardware dependent? Software/OS dependent?

    • @katbryce
      @katbryce 2 года назад +1

      On my Threadripper motherboard, there is the CPU, and either side of it, there are four memory slots for a total of 8. The 4 slots on one side are one NUMA node, and the 4 slots on the other side are the other NUMA node.

    • @PoseidonDiver
      @PoseidonDiver 2 года назад +1

      Also, there is no true virtual-to-physical CPU affinity, and the hypervisor generally allocates the compute to the VM as needed. When running performance graphs you can see big spikes across the sharing CPUs when it's allocating compute from another node. (Hope that actually answers your question :p )

  • @centerfield6339
    @centerfield6339 2 года назад

    I don't really understand this - if the NUMA architecture lets you access the other memory as fast as local CPUs, then doesn't the original contention problem become an issue again? I thought that's what the video would end with, given it was teed up like that.

  • @larrystone654
    @larrystone654 2 года назад

    So if I understand this correctly, the Ultra architecture makes it so developers don’t *have to* split their code across cores, but I suppose they *could* in order to achieve even more performance?

    • @magicmark3309
      @magicmark3309 2 года назад +1

      Unless you really really really know what you’re doing you’d probably just cause more bottlenecks. No real point with these Macs, unless you were clustering them, which would be cool but probably not a great use of resources when it comes to all the enterprise solutions.

  • @kelvinluk9121
    @kelvinluk9121 2 года назад +1

    Is it possible to address the RAM access conflict issue between different CPUs by introducing more memory channels?

  • @texmanro
    @texmanro 2 года назад

    I think Dr. Steve Bagley was a bruiser (scars on his fist). Respect! :)

  • @jurabondarchook2494
    @jurabondarchook2494 2 года назад

    Hmmm.
    But if you make the distributed shared memory system super fast, you will end up with the same problem as at the beginning.
    When the distributed shared memory system needs to access memory, the CPUs attached to that memory have to wait, don't they?
    So the probability of collision increases again.

  • @utubekullanicisi
    @utubekullanicisi 2 года назад +2

    It's been mentioned that the most special thing about the M1 Ultra is not that the 'two' 10-core CPUs act as one big 20-core CPU, but the fact that that's true of the two GPUs as well - the M1 Ultra is literally the *first* example of this being done successfully in the industry, since pulling off a fast connection between two GPUs is a lot harder. Definitely an interesting video nonetheless.

    • @ariellubonja7856
      @ariellubonja7856 2 года назад +1

      But those GPUs are on the chip itself; they use the same RAM as the CPU does. I don't think connecting the GPUs was the main challenge.

    • @utubekullanicisi
      @utubekullanicisi 2 года назад

      @@ariellubonja7856 It was.

    • @reinhardtwilhelm5415
      @reinhardtwilhelm5415 2 года назад

      I mean, RDNA 3 will do this at a much higher level of performance in about six months.

    • @utubekullanicisi
      @utubekullanicisi 2 года назад

      @@reinhardtwilhelm5415 I mean, I don't think RDNA3 will beat the M1 Ultra (or any M1-series) or at least leapfrog them by a lot when it comes to performance/watt. This GPU only draws about 100W at peak and sits around 60-70W for 99% of the workloads out there.
      I'm sure RDNA3 will be a great generation though. What both AMD and Apple are struggling with against Nvidia (from what I'm understanding) is software compatibility, in professional productivity apps like Blender that take advantage of both Nvidia's RT cores with OptiX, and CUDA. But Apple are right now working closely with those software teams and they will reach an optimal level of efficiency with the M1-series sooner or later (Apple has joined Blender's development fund in late 2021).

  • @eksadiss
    @eksadiss 2 года назад +1

    I can't tell if he's 30 or 60

  • @1idd0kun
    @1idd0kun 2 года назад +1

    No matter how fast the interconnect is, it's never gonna behave like a UMA system. If a core in die 1 tries to access the memory pool attached to die 2, there will be a latency penalty. We won't know how big that latency penalty is, and how much of an impact on performance it will have, until the system is properly tested. I'm hoping AnandTech will test it, since they usually do memory latency tests.

    • @bobo-cc1xw
      @bobo-cc1xw 2 года назад

      Ian Cutress, formerly of AnandTech, said above: 5 to 7 ns for just the interconnect vs 54 ns total. So call it 15 percent more latency.

  • @BR-lx7py
    @BR-lx7py 2 года назад +2

    Can the operating system take care of always allocating memory from the block that is closer to where the process requesting it is running? I know it's not perfect, but it would work 90% of the time.

    • @moritzhedtke8139
      @moritzhedtke8139 2 года назад +1

      Linux actually does as far as I know

    • @ssvis2
      @ssvis2 2 года назад +1

      To a certain extent, yes, the OS can. However, in order to do that effectively, it needs some information about the memory requirements and usage patterns of a process. Some can be gleaned from the raw byte code, especially if there are hints placed by the programmers, but a lot will come from actually running the process and then dynamically remapping and moving memory as needed. It works better for long-running processes, but is by no means optimal. That's why most super-high-performance programs, such as many video games, manually set CPU core affinity and use custom memory allocators to give direct control over memory locality. They'll even go as far as detecting which are the fast and slow cores and prioritize from there.

    • @shunyaatma
      @shunyaatma 2 года назад +1

      Yes, the Linux kernel can take care of this. The default memory policy (MPOL_DEFAULT) makes the page allocator always try to allocate memory from the local node but if that's not possible, it uses a different node. Over time, even if pages get scattered across NUMA nodes, Automatic NUMA Balancing will either try to move the pages to the node from where they were accessed the most or try to move the program itself to run on a CPU that is close to the memory that it accesses the most.
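      A small sketch of that default behaviour (assuming Linux + libnuma, compile with -lnuma): allocate a page, let the current CPU first-touch it, then ask the kernel which node it actually landed on via move_pages' query mode (passing NULL for the target nodes).
        #define _GNU_SOURCE
        #include <numa.h>
        #include <numaif.h>
        #include <sched.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support\n");
                return 1;
            }
            size_t page = (size_t)sysconf(_SC_PAGESIZE);
            void *buf = aligned_alloc(page, page);
            memset(buf, 0, page);                     /* first touch places the page */

            void *pages[1] = { buf };
            int status[1];
            /* nodes == NULL means "just report where these pages currently live" */
            if (move_pages(0, 1, pages, NULL, status, 0) == 0)
                printf("CPU %d first-touched the page; it lives on node %d\n",
                       sched_getcpu(), status[0]);
            free(buf);
            return 0;
        }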

  • @philmarsh7723
    @philmarsh7723 4 месяца назад

    Hmmm.... if only one CPU and/or its cache can access the system RAM at a time, then why are mutexes and/or atomic variables necessary in, say, C++ and other languages? If only one CPU could access the RAM, then all operations should effectively be atomic and you would never have RAM corrupted with indeterminate values when multiple threads share data.

  • @johongo
    @johongo 2 года назад +1

    I want to learn more about this stuff, but it seems very distant, even as someone who programs for work. Any advice?

    • @MrPBJTIME12
      @MrPBJTIME12 2 года назад +1

      Computer Organization & Architecture - William Stallings

  • @dustinmorrison6315
    @dustinmorrison6315 2 года назад

    Hopefully my programs aren't fetching instructions from RAM often enough for it to matter. Hopefully they're somewhere in the L1i/L2/L3/L4 caches.

  • @soundcheck6885
    @soundcheck6885 2 года назад +4

    If you have a massive high-speed low-latency die-to-die interconnect between two dies, accessing the memory on the other die could have a latency penalty of only 10-20ns. In a system using multi-level cache hierarchy, having 10-20% extra latency for remote memory access is irrelevant for most applications. What would be more interesting is if Apple can extend the same interconnect architecture to higher numbers of CPU clusters (e.g. 4 or 8) and how much extra latency would be involved for memory access in those cases. One interesting option to achieve this may be stacking dies vertically in addition to the planar configuration in M1 Ultra.
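    A quick back-of-envelope version of that point (all numbers except the die-to-die penalty are my own illustrative assumptions): if the caches catch ~97% of loads and local DRAM costs ~100 ns, adding ~15 ns to the half of DRAM accesses that land on the remote die barely moves the average.
      #include <stdio.h>

      int main(void)
      {
          double miss_rate = 0.03;    /* loads that reach DRAM (assumption) */
          double cache_ns  = 4.0;     /* blended cache-hit cost (assumption) */
          double dram_ns   = 100.0;   /* local DRAM latency (assumption) */
          double remote_ns = 15.0;    /* extra die-to-die hop, per the comment above */

          double local_only = (1 - miss_rate) * cache_ns + miss_rate * dram_ns;
          double split      = (1 - miss_rate) * cache_ns
                            + miss_rate * (0.5 * dram_ns + 0.5 * (dram_ns + remote_ns));
          printf("average load latency: %.2f ns local-only vs %.2f ns split\n",
                 local_only, split);
          return 0;
      }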

  • @DarkKnightLives
    @DarkKnightLives 2 года назад +3

    Wow Philip Hoffman is reborn a Computer Scientist!!

  • @Ruptured_AU
    @Ruptured_AU 2 года назад

    Repeated the same thing a LOT, with me saying "ahh, I get it, move on already". The first 12 minutes could have been 2 minutes, and then just end with "the M1's link is fast". 😑

  • @caffedinator5584
    @caffedinator5584 2 года назад

    My naive understanding of CPU architecture leads me to believe that the core-to-core memory interconnect is the lesser problem compared with GPU core kernel/instruction execution.
    Do you have any insight into that?

  • @JansthcirlU
    @JansthcirlU 2 года назад +24

    Since the M1 got its own episode, Microsoft's Pluton processor should also get one! It was created with security in mind in collaboration with AMD and it's used in the SteamDeck!

    • @pixelatedsethtube1271
      @pixelatedsethtube1271 2 года назад

      Whoa!

    • @G5rry
      @G5rry 2 года назад +3

      The M1 didn't get its own episode. This was specifically about the interconnect. You couldn't cover the whole processor in one episode.
      What *specifically* about the Pluton processor should they cover?

  • @qm3ster
    @qm3ster 2 года назад

    No CPU gets data "before it needs it" :v
    Going to main memory is really, REALLY slow (compared to anything else CPUs spend time doing these days).
    So, are any cache layers shared between the chiplets?

  • @sclabhailordofnoplot2430
    @sclabhailordofnoplot2430 Год назад

    Did anyone else search for the Easter egg @6:05, "3.14 happy birthday to you capital letters"? Since Pi Day was yesterday, I did.

  • @l.matthewblancett8031
    @l.matthewblancett8031 2 года назад

    WHERE DID YOU FIND THAT 1972 printer paper??!?!! lol.

  • @sevilnatas
    @sevilnatas 2 года назад

    Does Computerphile use greenbar paper for their illustrations because they still use greenbar a lot, so it's handy, or is it because they don't use it anymore and have a bunch of it sitting around unused, so they might as well use it for illustrations?

  • @Benny-tb3ci
    @Benny-tb3ci 2 года назад

    We, the people in chemistry and any other science that relies heavily on chemistry, have a very nice phrase for these kinds of things. It's called the "rate-limiting step" (in a chain of reactions).

  • @dafoex
    @dafoex 2 года назад

    All this is to say that people who think like the demoscene will write their programmes as if the M1 Ultra's distributed shared memory system was slow, just to squeeze out a little more speed

  • @peterhindes56
    @peterhindes56 Год назад

    Why have a memory interconnect at all then? Unless this was not intended to solve the problem mentioned about memory access getting clogged up.

  • @user-cc8kb
    @user-cc8kb 2 года назад +1

    Great explanation. Thanks!

  • @DEVAXTATOR-1
    @DEVAXTATOR-1 2 года назад

    The video is moving around too much in the background... maybe it's just me?

  • @Ojisan642
    @Ojisan642 2 года назад +1

    Was this filmed on board a ship at sea?

    •  2 года назад

      😭😭

  • @acanalesc
    @acanalesc 2 года назад +1

    12:55 Is that what is known as "NUMA-aware"?

    • @genhen
      @genhen 2 года назад

      yes it is

  • @tramsgar
    @tramsgar 2 года назад

    A somewhat redundant explanation - no need to say it all twice. The point got a bit lost at the end, namely why NUMA in the first place.

  • @Supreme_Lobster
    @Supreme_Lobster 2 года назад

    This sounds very similar to what I understand Tesla is doing in their Dojo chips

  • @KipIngram
    @KipIngram 6 месяцев назад

    Those problems aren't REALLY that hard.

  • @coolcat23
    @coolcat23 2 года назад

    You completely failed to explain why the M1 Max units would each have faster access to a specific area of the main(!!!) memory, and how the NUMA approach addresses access clashes better (without forcing programmers to write specific software; remember that the latter is your claim regarding what "Apple achieved").

  • @Rubacava_
    @Rubacava_ 2 года назад

    Even if it did require additional programming, it wouldn't make much difference, because no one will ever bother with this device for very computation-intensive applications.

    • @katbryce
      @katbryce 2 года назад

      Video editing is a very high computation intensive application, and is exactly what this is designed for.

  • @andredejager3637
    @andredejager3637 2 года назад

    wow thanks 😊

  • @5urg3x
    @5urg3x 2 года назад

    I very clearly remember the days of dual socket (like multiple physical CPUs with their own memory) workstations. It looked cool on paper, but in the real world, it usually didn't work out very well. Many times, even with software optimizations, it was more efficient (and simpler logistically) to just use one physical processor, rather than to attempt to have them both working together on the same task or set of tasks, and having to swap data in and out of cache and memory, etc. For servers, it could work, but most workstation workloads just aren't going to benefit from that type of an architecture.

  • @marcomaida1731
    @marcomaida1731 2 года назад

    I don't understand how the collision topic fits with the explanation of NUMA and the M1. Looks like once we have this very fast DSM, we are back at the problem of the beginning, that is, we will have many collisions

  • @zxuiji
    @zxuiji 2 года назад +2

    NUMA still seems slow to me. I have in mind a faster scheme that throws away NUMA for a single block of RAM/cache. You start with a single dedicated chip which is just an endlessly running loop of: "If bit N of the internal register is set, read the address & instruction registers of core N. If it's a read, fill the internal CPU value register with the contents of the address just read & clear bit N of the internal register; if it's a write, read the value register, clear bit N of the internal register & send the value read to the address previously read."
    Each core only needs to wait for the bit it set to be cleared before continuing. Since this relies on one internal register of the dedicated chip to distinguish when a core wants to do something that needs synchronising, there's no need for any complicated timing or probability logic; the cores just defer the operation to the chip doing the looped checking. When the number of cores is a power of 2 (2, 4, 8, 16, etc.) the loop can be optimised by giving the N register only enough bits to wrap straight back to 0 after incrementing past the max core index (so for 2 cores 1 flips to 0, for 4 cores 11 to 00, for 8 cores 111 to 000, etc.). Since cores tend to support 2 or more threads natively, the logic can be mapped directly to the thread registers instead, which saves a core from waiting for one operation to finish before the next thread's operation is handled.
    **Edit:** For the people who'd rather see some pseudo-code for what the logic chip would do, here's what I have in mind:
    uchar n;
    ulong b;
    THREAD *thread = NULL, fake = {0};
    fake.id = -1;
    for ( n = 0; 1; ++n )
    {
        b = bits & (1u << n);    // has core/thread n asked for an operation?
        if ( !b )
            continue;
        thread = threads[n];
        if ( thread->i == INST_WR )
        {
            // Instruction
            fake.i = thread->i;
            // Address
            fake.a = thread->a;
            // Value
            fake.v = thread->v;
            // Size of value
            fake.s = thread->s;
            bits ^= b;
            write_ram( &fake );
        }
        else
        {
            // Reading the values 1st would just slow us down, pass them directly to avoid issues
            read_ram( thread );
            bits ^= b;
        }
    }