*Reasons to go for M1*
- Adobe suite software
- Xcode
- Power Consumption is important to you
*Reasons to go for desktop PC*
- 3D Modeling
- GPU intensive workloads
- Visual Studio (still issues with preview on M1)
- Gaming
- Future Upgradability
- Maintenance (if the SSD/RAM etc. fails, you can easily replace it yourself)
very unbiased. love it.
Mac is more plug and play
Yep I love macbooks in general but sadly I am married to Visual Studio for now
> "- Xcode"
Shouldn't it be in the "Reason to go for desktop PC" list? xD
@@obrien8228 we are talking here about running specialized libraries such as TensorFlow or PyTorch, so I don't see how that comment is relevant in this instance
That wsl reveal animation was so nice!! I wonder what the neural engines inside the chips do....
Nice video btw!
Thanks for doing the benchmarks, but short tests with small datasets don't really tell you how well a system works for ML training. I've been doing benchmarks with the Pascal and COCO datasets to get a more realistic measurement. The downside is it takes 2+ hours to run the benchmarks. If you run with different batch settings, a series of benchmarks ends up taking several days.
The biggest limitation to training performance is memory. As long as your dataset fits in memory, the RTX card will be much faster than the M1 GPU. Keep in mind cuDNN is highly optimized and uses a lot more tricks than other hardware on the market. Google TPUs and Tesla's D1 are catching up to Nvidia hardware, but they are still behind on the software tooling side.
Aww come on man, not going to share results? So does the RTX card still win by a similar margin when running larger datasets or more epochs, or does that change, and in which direction?
Also, if the dataset doesn't fit in VRAM, how much of a penalty for swapping are we talking about here? Also, and I'm just throwing this out there, under that circumstance does the 3090 make up a bit of the lost time with PCIe 5.0?
Based on the average power usage over time it would seem the total power used was quite similar. 300w x 5 mins versus 100w x 15 mins.
Your comment is very biased towards the PC
It's more like 400 W x 6 mins vs 110 W x 15
It’s not 300w. It’s 400w with 500w peaks
@@redocious8741 But time is money, so 10 minutes less waiting for the model makes a huge difference if you do it 10-20 times a day. It's not the machine I would want for these kinds of workloads; even if it draws less power, my work time is more valuable.
This is an absolutely stupid comment. ML benchmarking is simply about performance. That's ALL that matters.
I agree we should focus on performance when it comes to ML benchmarking, but how is this comment stupid? OP was just sharing a simple observation.
Just saw a review of what looks like a very similar PC to yours over at GamersNexus (pre-built Alienware Aurora RF-13, i9-12900KF, RTX 3090) - they mention very bad airflow design on that machine. Could be the reason why the temperatures on it are so high and the fans so loud.
As a Zulu man I am really impressed with how you pronounced "Ubuntu".
Thanks, Alex. I really appreciate what you have done.
Yes, it depends on the test (e.g. AI training types, video post work, 3D modeling, etc.), as well as the software being used. We use RTX through H100s for most of our AI development, at least on the training side. For coding, data science work, inference and UI/UX, however, we all use our favorite OS, whichever that is. One thing to keep in mind for pro-level, large-parameter/dataset AI dev: you will often be using a dedicated server running in the kilowatts with AI-grade GPUs (e.g. V100s, H100s, etc.). Whether owned, hosted or otherwise, few jobs will be run locally.
few jobs CAN run locally.. especially when you need 80GB for your model. That's why the 96GB M1 (or 128GB M3) is interesting. But only if they test in a RAM constrained environment. The NVidia 4090 only has up to 16GB.
Yes please compare to other Macs! Also curious if there are other AI/ML tests that would show the improvement, vs the 3080ti
I wonder what the M1 machine learning cores are for, when they would be used, and if the benchmarks take advantage of them (and if it would even make sense to do so).
The NPU is only used via Core ML. TensorFlow right now is GPU-only, using Metal.
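A quick way to sanity-check that (a minimal sketch, assuming the tensorflow-macos and tensorflow-metal packages are installed on an Apple Silicon Mac) is to ask TensorFlow which devices it can see; the Metal plugin exposes the GPU, and the Neural Engine never shows up as a device:

```python
# Minimal sketch: list the devices TensorFlow can actually use on Apple Silicon.
# Assumes `pip install tensorflow-macos tensorflow-metal` has been done.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))  # Metal-backed GPU on M1
print("CPUs:", tf.config.list_physical_devices("CPU"))
# There is no device type for the Neural Engine here: TensorFlow training on
# Apple Silicon runs on the GPU (via Metal) and/or the CPU only.
```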
@@ben.scanlan no not for spying but CoreML and Reslove 18 takes advantage of them.
The ML cores are for running models, not training. I read that they're limited in floating-point precision (I forget the exact number, but it may be that they only support 16-bit floats, whereas training requires 32- or 64-bit floats).
As far as I'm aware, if I'm recalling what my professor once mentioned correctly, machine learning cores are simply called that because they're very specialized pieces of hardware that can perform a small matrix multiplication, or something of the sort, in one instruction. That's about as much as I know, but I'd guess the machine looks for such operations when running on a dataset and, if it finds them, uses those cores.
ml cores are only used for predicting
not fitting
Very cool comparison. Thanks Alex
Interestingly, each system ended up spending roughly the same amount of energy to complete the task. That gives a lot to contemplate: where do you put your priorities? Do you want cooler and slower, or hotter and louder? It all comes down to the cost of a single computational operation.
We do know that GPUs get quite a bit more efficient at lower power, at least up to a point (thanks mostly to crypto miners optimizing their margins). Makes one wonder whether it's possible to undervolt/underclock a 3080 Ti so much that it becomes practically silent... or at least sounds like a propeller plane rather than a jet.
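For what it's worth, you don't even need a full undervolt to experiment with this: nvidia-smi can cap the board power limit directly. A rough sketch (needs admin/root rights and nvidia-smi on the PATH; the 200 W figure is only an illustrative number, not a tested sweet spot):

```python
# Rough sketch: cap an RTX card's power limit and read back the draw/clock.
# Requires the NVIDIA driver's nvidia-smi tool and elevated privileges.
import subprocess

def set_power_limit(watts: int) -> None:
    subprocess.run(["nvidia-smi", "-pl", str(watts)], check=True)

def read_power_state() -> str:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=power.draw,power.limit,clocks.sm",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    set_power_limit(200)       # 200 W is illustrative; valid limits depend on the board
    print(read_power_state())  # current draw, configured limit, SM clock
```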
Is this utilising the Neural Engine/ML cores of the M1 Ultra or not? If not, how can you run this kind of workloads on those cores?
We need same tests with the M3 Max 👏
I believe last time I looked WSL2 is 94% as fast as a native box, well worth using and so convenient.
didn’t know that - thanks!
Thanks, this is what I was looking for!
I would like to see the m1 ultra compared to the m1 max and pro in a machine learning benchmark. I'm curious if the m1 ultra scales linearly for machine learning workloads.
What does "scale linearly" mean exactly? And what is the opposite in practical use? Computer science student here, interested to learn, so I hope you can teach me something. Thanks!
@@SimplicityForGood Scaling linearly means 2X the chip gives 2X performance. And that's not CS, just basic maths :)
@@SimplicityForGood So if the M1 chip has 8 GPU cores, the M1 Pro doubles that to 16, the M1 Max doubles that to 32, and the M1 Ultra doubles that to 64, do you then also get double the performance at each step? That would be linear scaling. If you double the cores but only get an 80% increase in performance, then it is below linear scaling.
Practically speaking, it means you get diminishing returns: you invest in extra resources but don't get extra performance commensurate with your increased investment (whatever the investment is, money or power for example). That may be a reason to stick with lower specs, because they are better value for your investment.
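To make that concrete, here's a tiny helper; plugging in the 11-minute run on the 32-core M1 Max and the 14-minute run on the 48-core Ultra mentioned elsewhere in this thread gives roughly 0.5, i.e. well below linear:

```python
# Tiny helper: how close did adding cores come to a proportional speedup?
def scaling_efficiency(base_cores, base_minutes, new_cores, new_minutes):
    speedup = base_minutes / new_minutes   # >1 means the bigger chip was faster
    ideal = new_cores / base_cores         # speedup that perfect linear scaling would give
    return speedup / ideal                 # 1.0 = linear, below 1.0 = diminishing returns

# Times taken from the runs discussed in this thread (32-core Max: 11 min, 48-core Ultra: 14 min).
print(scaling_efficiency(base_cores=32, base_minutes=11, new_cores=48, new_minutes=14))
# Prints about 0.52 - the Ultra was actually slower despite having more cores.
```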
@@ravenclawgamer6367 alright then!
@@truthmatters7573 Got it, and so far what have you seen?
Would it be worth purchasing 64 GB of RAM, 32 GB of GPU VRAM, and 4 TB right now in a MacBook Pro or a Mac Studio, or do you think one should go with a different or lower configuration, given the linear-scaling behaviour you've seen in the tests you describe?
I've got a chance right now to buy one Apple computer, or rather to get it if I develop for a company, as a gesture of trust instead of a salary, as my first job. I want to future-proof myself by making the right choice.
What do you think is a good configuration to order? The above configuration is the one I put in my order, but I have not yet signed it...
Or do you think that is way too much power in a too-heavy laptop, and one would be much better off waiting for a more lightweight MacBook Air M2 in May?
Thanks for getting back!
😊
I think it mostly demonstrates that TensorFlow is not yet fully optimized for Metal. I am sure Apple used a more low-level benchmark. Nevertheless, it is a useful learning from a practical standpoint. Actually, 3x slower at 1/3 of the power draw is not too bad. It would also be interesting to learn whether 128 GB of unified memory makes training more efficient, since it can train in larger batches than a typical GPU card, whose memory is 8 GB (the 4080 Ti has 20). Unsure what this benchmark does in terms of batch size.
The 3080 Ti has 12 GB, and 900+ GB/sec of memory throughput vs 800 for the M1 Ultra. I bet on large models the M1 Ultra could be quite fast.
@@dotinsideacircle Anywhere it doesn't fit in memory on the video card.
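On the batch-size question, here's a minimal TensorFlow sketch on toy random data (not the benchmark used in the video): batch_size is the knob that extra memory lets you turn up, and whether a bigger batch actually speeds up training on the M1 is exactly what would need measuring:

```python
# Toy sketch with random data - not the video's benchmark. The point is only
# that batch_size is what a larger (unified) memory pool lets you increase.
import numpy as np
import tensorflow as tf

x = np.random.rand(2048, 64, 64, 3).astype("float32")
y = np.random.randint(0, 10, size=(2048,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A memory-constrained card may force a small batch on a big model;
# 64-128 GB of unified memory leaves room to try much larger batches.
model.fit(x, y, epochs=1, batch_size=256)
```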
Would love to see how a pure linux system does in this test.
Really interesting videos and a totally different take on performance testing. Well done Alex.
In this particular case the performance per Watt seem to be the same, you get roughly 3 times faster for roughly 3 times the power.
Does the possible 128 GB of RAM available to the GPU help with those large models?
Yeah so far the M1 Ultra has seemed to be a disappointment in GPU workflows compared to the M1 Max with relatively low performance gains for the amount of $ upcharge. CPU perf looks really good though.
Still not good for everything. If you do a lot of float/matrix operations, it will be slower than Intel. TL;DR: this is a single-task box, only for video editing, where it's unparalleled, and internet consumption, because the M1 is designed for that. I am sure Apple compared against the 3090 only in video cut/compile; anything that fully utilizes the 3090 and doesn't go through the hardware codecs baked into the M1 will be massively slower than the 3090. Size and power do what they should.
I guess to show that, we could run Python PyMOL on the RCSB structure 6WM2, for example, which I hope is big enough to fully load everything.
@@deilusi what if i do data analysis with pandas/scipy etc?
@@jaksvlogs7195 It will depend on what you do, but in general, if you work mostly on float lists/matrices, then Intel will be better for the money. The same general performance as an $8000 Apple build costs ~$4000 for Intel and ~$3000 for AMD.
Right now there is one golden rule: if you care about performance, don't pick old stuff. All three parties made big leaps in performance literally just now, so don't even look at last year's stuff, let alone older generations; the differences are quite literally 15-20% each year.
It really is a three-dimensional problem of budget/expectations/time: pick two. I deal with laptops and cloud mostly, and there AMD + Ubuntu wins by a mile, as Intel is too power hungry for working on battery, and I need VMs, which makes Apple throw in the towel.
I would start with my budget and make a list of what I CAN get from Apple, Intel-based, and AMD-based options. The second part is what you need: if you have massive datasets to process or something, make a list of what you need and what you want. Apple limits you to 8 TB, while others can easily give you 80 TB if needed.
I would recommend you start with a "pcpartpicker guide" for a basic build after that and adjust it to your needs. If you are willing to invest time into making a purpose-built box, that gives you the best value for the money invested.
I recommend AMD for the smallest laptops, AMD + Nvidia for stronger ones, Intel for a full desktop, and Apple only if you already have an iPhone, Watch and the rest, as it synergizes well, and only if you don't have issues with your budget.
That is a matter of software optimisation. Apps have to take the existence of more cores into account, and that is not always easy in algorithms. Raw cumulative performance is not that easy to take advantage of.
@@jaksvlogs7195 You would need to use cuDF instead of pandas and CuPy instead of SciPy and NumPy; native pandas/scipy/numpy don't utilize the GPU, you are running them on the CPU.
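For anyone curious what that swap looks like in practice, a minimal sketch (assumes a RAPIDS install with an NVIDIA GPU; "data.csv" and the "category" column are placeholders, not anything from the video):

```python
# Minimal sketch of GPU-backed dataframes/arrays via RAPIDS (cuDF + CuPy).
# Assumes cudf and cupy are installed and an NVIDIA GPU is available.
import cudf        # pandas-like DataFrame API, runs on the GPU
import cupy as cp  # numpy-like array API, runs on the GPU

# "data.csv" and the "category" column are placeholder names for illustration.
df = cudf.read_csv("data.csv")
summary = df.groupby("category").mean()

a = cp.random.rand(4096, 4096)
b = cp.random.rand(4096, 4096)
c = a @ b  # matrix multiply executed on the GPU

print(summary.head())
print(float(c.sum()))
```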
thanks 🙂
Great video! Apple should give us a "high power" mode for the M1 Ultra. Ideally this would allow higher wattage for the GPU, closer to the 3x power the RTX draws, to reach higher performance. Currently, the monster copper cooling system is largely wasted.
They would need a different form factor
Thanks 😊
What kit do you use for your audio? It sounds very professional, very radio-like.
Great as always Alex.
Great video, thanks a lot for it. I'd love to know the price of each machine, or at least the full configuration of each of them. Mind sharing that with us?
This is very confusing. On your previous test, the M1 Max with 32 cores completed in 11 minutes. I get that the scaling isn't linear but the higher end Ultra with 48 cores completing in 14 minutes sounds like something is off. Why would it be slower than the M1 Max?
i don’t know, but that’s what i’ve got
it’s disappointing
@@AZisk Hmm. My intuition points to a flaw in how macOS is handling the computation on the other half of the M1 Ultra. It feels like only half of the cores are being used. The 48-core GPU is split across two 24-core dies, so 14 minutes would make sense if performance were, for some reason, capped at 24 cores, because that would perform proportionally to the 32-core GPU. If it's not the software, then it could be an even worse issue with the interconnect die tech in their processors not working at a hardware level. Every time I see people compare the 64-core Ultra to the 32-core Max, I see identical performance, which would also fit the theory that only half of the cores are actually being utilized. What are your thoughts? Would it be possible to show that all CPU/GPU cores are being used but the software can only handle half of the processing?
@@AZisk I've had issues with software on Windows in the past on machines that had dual-socket CPUs. I needed software updates in order to use both CPUs in parallel. I reached out to the devs about this and they explained that making an update like that is very complicated. Something like TensorFlow should be accessing the hardware directly, so I have no idea why these results would happen unless there is an issue with how macOS accesses the dual CPUs.
@@Khari99 I suppose that the latency is much higher across the "UltraFusion" and as the die is basically split in two, that's actually ending up harming the performance where latency is a big deal. In the PC, despite much lower bandwidth, the CPU is at one place and GPU at another, meaning data isn't traveling between (two) GPUs.
Nice, this is what I am looking for since M1 was comming
Wow, and I thought I would be the other way around, thanks again
Nice video. Thanks. It took almost as long as the wait for my new Macbook lol 😢
I'm curious if this is making use of the Apple Neural Engine, or only the GPU?
The high number of CUDA cores on the 3080 Ti helps a lot; it's almost the same count as the RTX 3090. Although the RTX 3090 will probably be even better, since it can store much bigger models in its massive 24 GB of GDDR6X VRAM.
You know that the Mac Studio can be configured with > 100GB of unified memory that the GPU has direct access to, right?
@@benjaminlynch9958 You might want to watch the video again. The Mac Studio was literally dragged through the mud here by an RTX 3080 Ti. That's a consumer card.
In short, unified memory sounds great on paper, but very few real-world cases actually benefit from it. Granted, for things like web development, iOS development and Android development the Mac Studio would be a great tool to have under your belt.
But CAD, machine learning, gaming, etc. are areas where the RTX cards shine.
The CPU part of the M1 family is damn impressive. The GPU has been a bit of a disappointment.
Also remember one thing: the Alienware that Alex used is probably the worst RTX 3080 Ti build you can get. Custom builds have much better acoustics and thermals.
@@metallicafan5447 Unified memory has its purpose, such as enabling rapid communication between the CPU and GPU for tasks that have highly parallel components but also sequential parts. A unified memory buffer would be better, though, so the GPU can have its own raw memory rather than managed memory, which would reduce latency, and the unified buffer could make tasks that swap between the GPU and CPU much faster.
@@metallicafan5447 Also, unified memory is probably good for cutting VRAM cost, since you can get more VRAM for less money than by using a Tesla card or stacking RTXs.
Machine learning training is a VRAM hog, so Apple Silicon's approach will probably pay off in the future.
The M1 Ultra has 64 GB of unified memory while the RTX 3080 Ti only has 12 GB of VRAM. Does this mean training some large models is only possible on the M1 Ultra?
You are correct
Thanks for your previous reviews. Why did it take you this long to do this test?
Keep up the great work, this series is super interesting. Seriously tempted by the M1 studio.
I really feel the cooling solution is overkill for the Mac Studio. I wonder if it's there to give headroom for the next iteration; in every test you've done, it never seemed to pull the power it could use. I should also mention a 3090 Ti costs about the same as an M1 Max. It's just no contest.
You are probably right about the next iteration. The 370 W power supply also isn't reasonable here: the M1 Max is limited to less than 100 W, so even with two M1 Max dies in here, you still wouldn't need more than a 200 W power supply.
In terms of energy efficiency they would be comparable, assuming performance scales linearly; otherwise the 3080 Ti would likely win, and thus the 3090 would be even better.
great test~
Hey, may I know which mic you use? Much appreciated :D
I've been watching some other videos checking out the Mac Studio M1 performance and I must say it is very impressive, given the power consumption, thermals, noise and size (but not upgradeability). BUT you can also look at the size as a bad thing. I've always asked myself why my friend had a big tower case in his music studio, even though he didn't even need all the expansion slots and whatnot. Then it hit me: imagine being a burglar stealing stuff... would you really grab that big tower, or would you rather grab a notebook or a small Mac Studio? :-P
Sure, you can put the computer in a safe(r) storage room, but you could do the same with a bigger case. In these cases I genuinely think that having a big tower is better and lowers the risk of it being stolen (assuming the burglar doesn't have excessive knowledge of hardware prices). Of course you have insurance for such cases, but imagine the downtime you're facing when the computer is missing..
He probably got that big tower in order to be able to use a good cooling solution with no noise. Usually those coolers are pretty damn large.
please do a vid on External NVIDIA GPUs Compatible With M1 Macs
[ via Thunderbolt Connection ]. Thank You !
...
That was a shot at apple by Nvidia , but what was the price comparison too performance ?
Meanwhile, the Ultra version of the Mac Studio falls FLAT in performance on the new DaVinci and also the latest Final Cut Pro. Seems like the M1 Max is the best bang for the buck.
Don’t think you’ve done a test I haven’t appreciated yet.
Please compare it with the 64 core ultra.
This just solidifies that the Studio is a powerhouse for certain workflows that take advantage of built-in CPU functionality, or any CPU operations in general.
That is undeniable, and the very low power draw is insane!
But for any heavy lifting on the GPU side it just fails to match up; the audacity of comparing it to a 3090 is just hilarious. Maybe next iteration they will catch up and simply not make those false claims.
Still keeping my M1 Air for Mac/iOS-specific build operations until then.
I wonder how a cheap P40 gpu would do. I think it would probably also beat the mac. For inference, mine seems to be a bit slower than half the speed of a 3090.
Alex! Such a great video!! Could you please make a video setting up the Alienware for Linux environment? Thank you so much for reading this message
thanks. i tried for many hours and gave up. afraid the alienware machine i have just doesn’t let you install linux
Thank you.
btw if you look at gamersnexus review of those dell alienware machines, they throttle (bad airflow), so the windows machine was handicapped... useful to know
yep, seen that. pretty bad
The M1 Ultra has 8192 execution cores versus 10240 CUDA cores in the RTX 3080 Ti (are they equivalent, architecturally?), so should we just expect the 3080 Ti to be faster? The margin is 3x admittedly, so the gain is not in proportion to the number of cores. The power consumption for the Ultra was a third or less of the system with the RTX, which matters financially and in a greener world. How much did the system with the RTX cost? The card alone is $1200 on Best Buy. The Ultra is $5k in total.
If the RTX draws 3 times the power but finishes 3 times faster, then the Mac will not make the earth any greener ;)
Regarding your question about architecture: they have completely different architectures, so the number of execution units is not representative. Actually, since Apple has a more modern manufacturing process (5 nm TSMC vs. 8 nm Samsung for Nvidia) and far more transistors, I assumed the numbers Apple gave at the presentation regarding GPU performance were realistic. But what they delivered is nothing more than a disappointment (if you are not a video editor). In other fields like Blender, the performance difference is even higher, far higher (an Nvidia RTX 3090 renders about 10x faster in Blender than the M1 Ultra).
I expected the outcome of your experiment, and I was right. But I am curious about an inference-only case. Training requires heavy throughput and therefore many GPU cores, but inference does not need GPU cores as much as training does.
That noise is kind of scary from Mac Studio -- I just placed an order for the Ultra chip.
I am curious what the results will be on the upcoming new Mac Pro
they were comparing at the same watts
Compare it to the other mac's pls!!!!
That cricket in your m1 ultra is hurting my brain
Could you make a PyTorch test for both the systems
How much unified memory did you have on your machine?
What's surprising is that the Ultra did that well. The 3080Ti is a beast. I'd be interested in seeing this on an M1 Mac Mini.
Thank you, now I'll save some money on the M1 Ultra... and probably go with an all-Linux setup.
is wsl2 really slowing the training? i cannot think of a reason why it would.
In my experience I can't find any performance penalties running deep learning tasks in WSL, especially if you copy all the runtime files to the linux file system. Only when I run very heavy tasks that span several hours, WSL is a few minutes slower than native Ubuntu. I only ever boot into Ubuntu if the task I'm running will eat up ALL of the RAM. Otherwise, Windows is much easier to live with day-to-day.
RPCS3 recently got native macOS version, can you test the performance on M1 Ultra?
great video as always. Can you find some coding benchmark that hits the neural engine specifically ? maybe something in swift ?
Currently you can use the Neural Engine only for inference.
Sorry, I don't know much about Machine Learning, but isn't this what the Neural Engine is for?
The ANE is for inference (i.e. running pretrained models), while this test measures _training_ performance, for which the ANE is basically useless.
Makes a lot of sense, too, since running ML models is *much* more common than training them.
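If you do want the ANE in the loop, that happens on the inference side through Core ML. A rough sketch with coremltools (the model file name and input key are hypothetical; Core ML decides at runtime whether a given layer actually runs on the ANE, the GPU, or the CPU):

```python
# Rough sketch: running (not training) a converted model through Core ML,
# which is the path that can use the Apple Neural Engine.
# "MyModel.mlpackage" and the "input" key are hypothetical placeholders.
import numpy as np
import coremltools as ct

model = ct.models.MLModel(
    "MyModel.mlpackage",
    compute_units=ct.ComputeUnit.ALL,  # allow CPU, GPU and ANE
)

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
out = model.predict({"input": x})
print(out.keys())
```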
you should redo the test with M3 Max or M3 Ultra
I waited so long for this video that the universe was destroyed and built 3 times.
That graphics card at times was pulling 5x the power! I'd actually go for the mac for development purely for the power draw and quietness, minus the cricket noise! nice vid
If you develop ML often, you would never take that Mac over a dedicated workstation GPU. I work with an Nvidia A6000 and often run training jobs for over 10 hours, which would take weeks if not months on an M1 Ultra.
Doing machine learning on a machine for 5 minutes hardly concludes anything about whether I could retrain a 16 GB BERT model to do something useful.
Please compare RTX 4090 Laptop with M3.
Do you know that the Apple Silicon Neural Engine cores are designed to do machine learning? You can use those cores as well as the GPU cores.
I'm getting an M1 MAX Mac Studio. Just for the low power draw. I'm moving to Berlin Germany to attend university for Nursing and electric rates are more than twice as high as the USA. I believe in saving money over performance.
Unless you plan on running your system on full tilt for several hours a day, the difference in power draw is going to be insignificant.
I think the performance is still pretty good compared with the power consumption: Apple at ~120 W vs 430 W. Even if the run takes 3 times longer, I guess you still save some energy, right? Anyhow, both systems are incredible in terms of computing power. Nice test + nice video, thanks for that.
You would spend the same amount of energy. While the M1 required 3x less power, the training time was 3x longer, so in the end the energy spent is about the same. Both systems will cost you the same in energy, but with the RTX you save your own time, since you get results faster.
Does that mean if he ran it on a Mac Studio base model… would it have taken 30 minutes? Will most likely rerun this test on my Mac Studio myself…
do this for the new ultra
and also run the leaked LLaMA with the full parameter set; I hear it is memory limited
Have you seen the latest Gamers Nexus video reviewing the new Alienware PC? The one they reviewed has a 3090 in place of the 3080 Ti you have but the review overall is extremely negative. The cooling in that case is really terrible. As such I am not sure you are getting your money's worth out of yours either (same CPU in both - 12900KF, same liquid cooling solution with a measly 120mm fan too probably). I'd recommend selling it off at the first opportunity you get and buying yourself a better assembled machine from anyone but Alienware. From all the videos I've seen reviewing pre-built machines I think custom shops like Maingear are better than the bigger name vendors like Dell, Alienware, Asus, etc.
yes, i haven’t had great results with that machine. and it’s not a cheap one.
Would be cool to see a comparison with other macs and would be cool to make ml tests when Pytorch will be ported to m1 GPU
Question though: if you're running at a scale where power draw and cooling are important, the Mac comes out ahead on that metric. If you have additional GPUs in the system, does that drop the PC's normalised power consumption below that of the Mac?
117W * 15min = 1755Wmin vs. 450W * 6min = 2700Wmin. I think people who do any real machine learning are going to go for the speed. And consider that the 3080Ti is only like $1400. I don't think the M1 Ultra can compete in this space. Better power efficiency doesn't go very far in a desktop.
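Spelling that out per run (the electricity price below is an assumed figure, purely to show the order of magnitude):

```python
# Back-of-the-envelope energy cost per training run, using the numbers above.
# 0.40 EUR/kWh is an assumed price, only for illustration.
mac_wh = 117 * (15 / 60)   # 29.25 Wh per run on the M1 Ultra
pc_wh = 450 * (6 / 60)     # 45.0 Wh per run on the 3080 Ti box
price_per_kwh = 0.40

for name, wh in (("M1 Ultra", mac_wh), ("RTX 3080 Ti box", pc_wh)):
    cents = wh / 1000 * price_per_kwh * 100
    print(f"{name}: {wh:.1f} Wh  ->  {cents:.2f} cents per run")
```

Either way it's around a cent or two per run, which is why the efficiency argument only really matters at scale.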
If you care about power efficiency, you could underclock and undervolt the 3080 Ti. You could probably get it to roughly 50% power at 75% performance, which would still crush the M1.
Yeah, what the other folks said: Nvidia runs their GPUs really, really hot. Also, power-to-performance is usually in favor of AMD's Radeon series, but AMD's software support for machine learning is worse than Nvidia's (ROCm vs CUDA).
Why is the GPU used for machine learning here and not the Neural Engine? Isn't that like transcoding video on the CPU instead of the Media Engines?
I don't get it. Why not just build a Linux Nvidia box? You can get way more RAM and have as many GPUs as you like. Alternatively, spin up some GPU containers in the cloud. No one serious is building models on one of these things...
Well, per watt that's pretty good. However, the really interesting thing for ML on the M1 Ultra is the huge VRAM. Nvidia charges a hefty premium to be able to hold large ML models in memory. The speed difference might cancel out (no memory paging, less batching), since I/O is the slow part of ML. The tooling on non-Nvidia GPUs isn't great, though, and will likely stay that way.
The difference is I've run that test on an A6000 and it takes an average of 17 seconds. He's comparing a gaming GPU (RTX 3080 Ti) against a workstation GPU like the Mac Ultra; if you put the Mac against an A6000, it's way behind.
Having watched loads of Mac performance comparison videos I have one key takeaway; we're completely locked into a discrete architecture mindset. Apple says "Apple Silicon" the tech community hears "CPU" OR "GPU". Why not CPU AND GPU AND media encoders (AND ANE AND AMX)? Apple Silicon won't shine until we take the specs off (excuse the pun) and start looking at multi-processor/APU benchmarks which engage the silicon in the same way real world apps do. Are there no AI benchmarks which use all available silicon simultaneously? The only "APU/UMA" benchmark I can find is for 2D image manipulation - Affinity Photo benchmark 11021 Combined score, this outperforms PC workstations by 3-4x.
Alex - you're technical, why not show powermetrics or asitop instead of a wall meter?
Because limitations apply to these special compute blocks. For example, the ANE works really well for inference (using AI models) but not for training, and even so the insane inference performance gains with the ANE only apply to convolutional and ReLU layers with FP16. This makes the ANE essentially useless in AI training.
@@michel8847 you're still thinking discretely. With UMA it doesn't matter that the ANE isn't best at all tasks as other silicon (like AMX/SIMD or GPU) can pick up the compute functions (initialisation overhead permitting). Sadly benchmarks and many applications are wedded to this architectural view - that was my point.
Nice video.
thanks!
Maybe that weird sound is coil whine on the gpu
If the memory on the Mac is equal to that on the PC, do we also have to consider that the GPU memory isn't being evaluated? And what about the test running under Windows, and under emulation on the Mac?
Try making a Swift ML test that uses the ML cores.
Mac Studio's performance was impressive, considering its power draw. Though, it's probably even more impressive on the lower-wattage chip inside the 16" MacBook Pro.
It is not impressive at all, if you ask me, at least for ML tasks. Four times the energy draw for 3 times the performance is a clear win for Nvidia, since performance per watt is not a linear function. I would not be surprised if the RTX at 100 W still outperformed the Mac on TensorFlow.
@@rapidfan92 Yeah in general GPU tasks that don't use any special codecs or hardware to accelerate workloads RTX and RX GPUs are still faster.
can you compare the mac mini m2?
Would there be any benefit in comparing both GPU times to the same routine run natively on the neural engines?
Hi Alex! Quick question, whats the name of the app that puts the rpm in the menu bar?
Machine learning content for Mac ecosystem videos!!! Please :-)
would be interesting to compare the results when normalised by power draw
I don't think it matters lol. The cost to your electric bill isn't being offset by the cost difference between the m1 ultra and an "equivalent" build.
If the RTX 3080 Ti is undervolted, I'm pretty sure it will beat the M1 Ultra as well; Nvidia runs their cards really, really hot (in search of the performance crown, of course).
Is there any way to run that test using both the GPU of the M1 and its Neural Engine?
Did you get the Numpy with np_veclib support ?
The benchmark should probably be optimized to make use of the M1's Neural Engine. When Apple compared the M1 Ultra against the RTX 3090, the message was that, at a certain point on the performance/power curve, the M1 Ultra takes 200 W less than the 3090 while having the same performance. What they conveniently didn't say is that the 3090 has a lot more headroom and can draw even more power and be a lot faster than the M1 Ultra.
My 5900X + 3090 Ti can do it in between 3:50 and 4:00. But I don't think the 3090 Ti gets that much of a boost over the 3080 Ti.
Youn Sunggu (201810740): The M1 is a one-chip design, so it should be good at handling the training dataset, with strength where bottlenecks matter, but it is shocking that there is a difference of nearly 3 times. How different are the 64-core and 48-core versions?