No sponsor on this video, well, except the amazing members of the channel. You can also join here: ruclips.net/channel/UCajiMK_CY9icRhLepS8_3ugjoin
Could you please try DrawThings or another app with an image-generation model on Macs? Really curious how it handles speed. I have an old Intel Mac and it takes ~5 min to generate one image.
Hey Alex, excellent video! Just to add a quick note: Apple chips support INT8 and FP16 instructions, but not INT4. It might seem counterintuitive, but Q8 models actually run much faster than Q4 on Apple processors, because 4-bit weights need extra computation at runtime to unpack and rescale them into a supported format. MLX 4-bit models are faster because they are optimized for Apple's Neural Engine and AMX, avoiding CPU/GPU bottlenecks. This makes Q4 in MLX faster than Q8 and other non-optimized Q4 models.
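A toy Python sketch of that point (not how llama.cpp or MLX actually implement their kernels, just the extra unpack-and-rescale step that 4-bit weights need on hardware with only INT8/FP16 paths):

import numpy as np

rng = np.random.default_rng(0)
scale = np.float16(0.01)

# Q8-style: one weight per byte, a single rescale and it's ready for FP16 math
w_int8 = rng.integers(-128, 128, size=4096, dtype=np.int8)
w8 = w_int8.astype(np.float16) * scale

# Q4-style: two weights packed per byte, so each byte must be split into
# nibbles, re-centered, and rescaled before the same FP16 math can run
packed = rng.integers(0, 256, size=2048, dtype=np.uint8)
lo = (packed & 0x0F).astype(np.int8) - 8
hi = (packed >> 4).astype(np.int8) - 8
w4 = np.stack([lo, hi], axis=1).reshape(-1).astype(np.float16) * scale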
Wait, you lost me at the end; Q8 models are faster than Q4 models on Apple silicon [when choosing gguf, I'm assuming?], but Q4 models become faster than Q8 models when choosing MLX.
Did I get that right?
I also noticed he only installed LM Studio, but didn't go over installing MLX from GitHub; is installing MLX not necessary?
I downloaded MLX Community's DeepSeek R1 Distill Qwen 14B 4bit
Running on a slightly upgraded 24GB unified-memory M4 Mac mini.
It gets around 11 tokens per second.
My use case is for coding assistance, so I'm not too concerned about speed, but I still have no idea what I'm doing yet.
If the download size relates to RAM or unified memory, would that mean I should be able to download up to a 16GB model, and that Q8 should perform better if it's GGUF, while Q4 performs better if it's MLX?
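On that last question, a rough rule of thumb (just a sketch, assuming roughly 4.5 effective bits/weight for Q4_K and 8.5 for Q8_0 once quantization scales are included, and ignoring KV cache and OS headroom):

def approx_gb(params_billion: float, bits_per_weight: float) -> float:
    # weight file size only; leave several GB of headroom for the KV cache,
    # macOS itself, and whatever else is running
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, params, bits in [("14B Q4", 14, 4.5), ("14B Q8", 14, 8.5), ("8B Q8", 8, 8.5)]:
    print(f"{label}: ~{approx_gb(params, bits):.1f} GB")

So a ~16GB download would be very tight on a 24GB machine once context and the OS are accounted for; a 14B at Q4/Q6 or an 8B at Q8 tends to be the more comfortable fit.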
This, mate. You should be making the vids and getting the views.
Ever since the model was released, I knew this vid was coming. Tonight I checked your channel's feed for it, couldn't find it, and five mins later it's on my home feed.
Right on!
We were waiting for this Alex!
BTW DeepSeek told me this (as an example comparing quantization vs parameters):
Choose the 3B model with Q6_K if:
- You prioritize response quality over model size.
- You have limited hardware.
- You need fast inference.
Choose the 14B model with Q2_K only if:
- You need a larger model for tasks that require greater generalization capability.
- You have sufficient hardware to handle the model size.
- You can tolerate a potential loss in quality due to low quantization.
In most cases, the 3B model with Q6_K will be a more balanced and practical choice.
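For a sense of scale, a quick sketch of the two options' approximate weight sizes (using rough effective bits/weight of ~6.6 for Q6_K and ~2.6 for Q2_K; actual GGUF files vary):

for label, params_b, bits in [("3B Q6_K", 3, 6.6), ("14B Q2_K", 14, 2.6)]:
    gb = params_b * 1e9 * bits / 8 / 1e9
    print(f"{label}: ~{gb:.1f} GB of weights")

Roughly 2.5 GB vs 4.6 GB, so the 14B at Q2_K is still the larger download despite the aggressive quantization.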
My God!! That's good editing (zooming in and out to focus on the text). Keep it up! Great video!
The brain bit is the cutest thing I’ve ever seen 😊
You are truly doing God's work! I've been thinking about exactly this since the models dropped.
Excited to see you use MLX and talk about quantization. As a MacBook Pro M3 Pro owner, I want to look a little into MLX and quantization; it would be amazing if you did some videos on those!
Awesome, was just looking for this an hour ago! Thanks, Alex!
You read my mind and made the video I was looking for.
I am from India
It's 4am here; I saw your video and just installed LM Studio.
Man, thanks for making this video.
I'm just going to binge-watch all your LLM videos now.
FINALLY, an incredible video. I love AI on the Mac; as a software engineer, getting DeepSeek R1 working as a local AI is really worth it. Next time, try to destroy your M4 Max 128GB with the largest DeepSeek R1 model you can fit.
Hey, thanks. Great introduction to running LLMs on Mac hardware.
Thanks. I still own an M1 Air; this gives me good insight for upgrade decisions.
Great video as always!
Thanks for the breakdown !
hi bro i love your videos, keep it up
I appreciate it!
Excellent topic! Thanks!
Glad you liked it!
I would have loved to see the performance difference of the 70b model between GGUF and MLX on the M4 Max. I had my fingers crossed at the end of the video, but alas, no joy.
next time
Yes! The best way to use this is with 2 maxed-out Mac Studios (2 × 192 GB of unified memory), using Exo to run it in parallel across multiple machines! The 2nd best is having access to H200 GPUs.
Finally, was waiting for this one 👨🏻💻
I tested the DeepSeek-R1-Distill-Qwen-32B 4-bit MLX version on an M1 Max 64GB and got 16 tok/sec. Not bad, but not great either. I didn't like the model's output, although I didn't test it extensively. Love your videos. Keep it up.
@@sh0me14 What are you using to run them? LM Studio?
@ Yes, I used LM Studio.
Hello Alex,
It's a great video on running DeepSeek on different Apple silicon chips.
If you could test between different parameter counts, I think it could be a nice addition, to see how small of a model we can use without losing much quality, like 7B vs 14B. Good video as always 💪
Thanks!
thanks so much!
I've run the 14B version (almost 10 GB) with Ollama on a MacBook Pro M2 16GB. It runs slow but OK. Impressive results, to be honest.
I appreciate your channel so much, I watch your videos sometimes just to give you views and thumbs up!
Great video! I'd be curious to know how a Mac mini M4 Pro 64GB would perform compared to the M4 Max in this video, and maybe an older M3 Max.
Yep, I am also eyeing an M3 Max, either the binned 36GB or the full 48GB, or maybe even 64GB. I do Stable Diffusion and my 32GB M1 Max already struggles a bit with SDXL upscaling, and I am interested in LLMs…
Wonderful video as always. Would you cover the heavy disk read/write activity when loading and running models, how it wears the SSD, and whether we should even be worried about it? A single 70B model could easily read and write a few hundred GB in one session. Great idea for content.
Saved for tomorrow. My job involves selling laptops with AI capabilities so I need to know how Apple silicon stacks up in DeepSeek. Thank you!
Ok, see you tomorrow
Brilliant video.
Now I can run my own AI on my 16GB system using LM Studio.
It makes setting up so easy. Thanks for the tip.
Alex, I love you man, but for people who don't already understand local installs, you are going FAR into the weeds here. But it's a good tutorial. Most people should just download and install the mistral o3 mini model and they should be good. Lol. Good job man, love you. Keep up the good work. Keep it Mac ;)
yep, this could have been two videos, but I threw it all in there
None of these are DeepSeek; they are distilled models that are OK, but not the same as the true 671B model.
The "true" DeepSeek model was not trained purely on a curated data set either; it was also created by distillation. Had DeepSeek been trained from scratch, it too would have taken years and cost billions.
Ay ay new hand appeared in the channel 🤔
Running 14B on a base Mac mini M4. It's not that slow, totally usable. You can use 20B, but it's slow; it takes a minute or so.
muahahah exactly what I was looking for. thx buddy
You bet!
I think about something and this guy releases a video about it
Open-source, local LLMs on my powerful 128GB PC = my real second brain (thanks, Tiago Forte)
I'm running a 14-billion-parameter model on my M1 Mac mini and it does just fine. The memory pressure is in the yellow, not the red, and it's not using swap memory. I'm running it with Open WebUI. I don't think those recommended settings in that app you're using are accurate. 19:33
Hi Alex, great comparison video and insights into quantization. Up for a fun challenge? I think a lot of us are puzzling over a good local code-assist setup, to save API cash, you know :). So the challenge is to find the most optimal code-assist server setup, testing an M3/M4 Mac mini, a PC mini, and PCs with 16GB and 24GB GPUs, running Ollama with Mistral code and DeepSeek models in whatever billion-parameter variants fit the hardware. To get a proper test result, it should be tested with a bunch of code challenges in JS + HTML + CSS: a Flappy Bird clone, a Snake game, a rotating triangle with a ball bouncing inside it, a couple of web apps, etc. :D The CODE-ASSIST SHOWDOWN :D
To fix the "this message has no content" issue, you can reduce the GPU offload / CPU thread pool size and it might start responding.
Nice work! Could you please create a video on how to set up a personal server with R1 and connect a mobile app or webpage to it, if possible? I apologize if this request is inconvenient.
First. I should try this on my M3 MacBook Air 16GB.
Great video 👍 I guess it would be possible to train the ai? If so, that could be your next video showing how. No pressure 😉
I think you should test out Private LLM and compare it to Ollama and the likes
Solid video, but more memory variation instead of just 8GB (3x) and 128GB would be more realistic. Most MacBook users have 16, 24, or 32/36GB of memory.
Never mind, the M1 at least had 16GB…
It's 2am here in India and I am here watching this guy like I never did
wow it's late!
what does "watching this guy like I never did" mean?
@@gpreddy172 Haha, we can’t unsee the title now
So the bottom line when it comes to running these models locally: you either need lots of unified memory (in the case of the Mac, Pi, Orin, etc.) or lots of VRAM on video cards. Even 3090s or 4090s only come with 24 or 32GB of memory, so it would take several of them to run the larger models; it's been a while, but I thought there was some sort of problem getting CUDA to see the VRAM on multiple cards as one big memory pool.
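Right, CUDA doesn't merge VRAM into one address space, but inference stacks can still split a model layer by layer across cards. A hypothetical sketch with Hugging Face Transformers + Accelerate (the repo id is just an example; pick a size that fits your cards):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # example repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",   # Accelerate assigns layers to each visible GPU (and CPU overflow)
)
print(model.hf_device_map)  # shows which device each block landed on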
Finally.
I usually do the pull command first and then run with the verbose option.
Now I will download the same models and see how my config compares to yours.
1.5B model: Radeon 7900 XTX, 215 tokens/sec (225 tokens/sec with LM Studio).
32B model: I am getting 24.5 tokens/sec.
not bad at all
That's nice. Do you need to install anything extra to make the Radeon card work?
@@osman2k No, just install either Ollama or LM Studio; everything you need is included.
@@AZisk Yeah, not bad; I'll try it on an Nvidia 3070 later.
An interesting thing happened. After the first run, I gave it the same prompt two more times (write me a 1000 word story), but it got slower, as if it was thinking: the user is not content with the first answer, so it tried harder!
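If anyone wants the same numbers without eyeballing the verbose output, a small sketch against the local Ollama REST API, which reports eval_count and eval_duration (in nanoseconds) with each response:

import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "deepseek-r1:1.5b",
          "prompt": "Write me a 1000 word story.",
          "stream": False},
    timeout=600,
)
stats = r.json()
# generated tokens divided by generation time gives the same tok/s figure
print(f'{stats["eval_count"] / (stats["eval_duration"] / 1e9):.1f} tokens/sec')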
I wonder how it would perform on the Mac mini cluster.
With the proliferation of all these models… I wonder how this may worsen local machine/network security in novel or otherwise yet-to-be-imagined ways.
Almost 10 tokens/s for a 70B on a laptop is gooood!
What is the lowest recommended CPU generation and RAM size to run a decent model?
Alex, what do you think about running multiple models together on a Mac Studio?
Great, but please test the AMD Ryzen AI 395
Strix Halo hasn't been released in any device yet, so no one can test it
24GB M4 Pro?
Thanks for the info. By the way, have you tried copying an Ollama model from an old Mac to a new Mac, to save yourself from downloading it again? And does it work? I tried, and it doesn't show up in ollama ls! I believe this is a common problem. The only workaround is to delete the manifest and run ollama so it downloads a new manifest, but then the model acts really strange...
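For anyone hitting the same thing: Ollama keeps models under ~/.ollama/models in two pieces, manifests/ and blobs/, and ollama ls only lists models whose manifest and blobs both made it across. A rough sketch to spot the mismatch after a copy (assuming the default store layout and OCI-style manifests with sha256 digests; adjust paths if yours differ):

import json, pathlib

store = pathlib.Path.home() / ".ollama" / "models"
for manifest in (store / "manifests").rglob("*"):
    if not manifest.is_file():
        continue
    doc = json.loads(manifest.read_text())
    entries = doc.get("layers", []) + [doc.get("config", {})]
    for entry in entries:
        digest = entry.get("digest", "")          # e.g. "sha256:abc..."
        blob = store / "blobs" / digest.replace(":", "-")
        if digest and not blob.exists():
            print(f"{manifest.relative_to(store)}: missing blob {digest}")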
Can you try the mlx ports for Apple Silicon?
I do in this video
26:06 But why should we share our data with the USA through ChatGPT?
Those quantized models don't have any knowledge in them. You have to add RAG knowledge data sets to them. 21:06
You are not installing "DeepSeek R1" on all those Macs. You're installing a smaller model, based on a Llama or Qwen model, that is distilled from R1.
Are Ollama models better than MLX models on Apple silicon? Is there any way to run MLX models through the Ollama interface?
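Ollama is built on llama.cpp, so it runs GGUF rather than MLX conversions; for MLX models the usual routes are LM Studio or the mlx-lm package. A minimal sketch, assuming the current mlx_lm load/generate API (the repo id is the mlx-community model mentioned above):

from mlx_lm import load, generate  # pip install mlx-lm

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-14B-4bit")
text = generate(model, tokenizer,
                prompt="Explain quantization in one short paragraph.",
                max_tokens=256, verbose=True)  # verbose prints tokens/sec stats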
Maybe it's worth showing people how an LLM actually performs on Apple silicon? For example, give it source code so that the context is 32k tokens and the model is 30B or larger.
Can you run it? Or is it a distill? I like that you're intellectually honest.
You have a son? I thought you were an irreproducible result!
teaching him to make yt videos now
So, 8 M4 Pro minis with 64GB of memory each = 512GB of memory. Couldn't you use Petals to distribute the full-sized FP32 671B DeepSeek R1 model across all of them, on a system that costs under $20,000? And when can we expect you to do that for a video?
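A quick back-of-envelope check on what fits in 512GB, weights only (ignoring KV cache, activations, and per-node overhead):

params = 671e9
cluster_gb = 8 * 64
for label, bytes_per_weight in [("FP32", 4.0), ("FP16", 2.0), ("~4-bit", 0.5625)]:
    need_gb = params * bytes_per_weight / 1e9
    fits = "fits" if need_gb <= cluster_gb else "does not fit"
    print(f"{label}: ~{need_gb:,.0f} GB -> {fits} in {cluster_gb} GB")

So the full-precision weights alone are several times larger than the cluster; a roughly 4-bit quantization is where an 8 × 64GB setup starts to look plausible.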
Hey Alex, long time watcher first time commenter here. Any chance you could show us a comparison of passwords being cracked on M chips vs ARM vs Intel?
My uni assignments from 20 years ago are trapped in a zip file from WinZip 2.0-ish. I want to show my kids and have a trip down memory lane, but I can't find the best hardware to crack the zip password.
You missed MLX! Check out the MLX community on Hugging Face.
i think you missed mlx in my video
See 15:26 for MLX
Yo Alex, if the M4 Max is out here breaking speed limits at 182 tokens/sec, but the M3 is crawling at 7.5 with the 8B model, was the M3 just built to keep my coffee warm while the M4 writes novels? 🖋️☕ Or is there hope for us mere mortals without maxed-out GPUs and overflowing RAM? 😂
Keep running this and the government is going to put you on a list.
Buy a bunch of Macs, or wait for Nvidia Digits and hope I can get a few before scalpers do their thing...
22:00 Just download more RAM, duh.... ;-)
Our current computers are not designed for this task.
It's best to save the money and invest in future processors that will run these models in full at a fraction of the current power draw.
These machines are already outdated.
My M1 will be my last PC before the AI era.
❤❤
Brother, just donate a Mac :) to an aspiring dev
You're going to jail Alex ;-)
DeepSeek isn't that good. It's only good if you want an annoying, talkative girlfriend, honestly. It still sucks at coding. Sonnet 3.5 is still the best.
interesting
RAG: just experimenting with different embeddings and context sizes 🥸
Have you got any advice on choosing between sentence and semantic embeddings?
Any ideas are welcome
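A small sketch for experimenting with granularity using one embedding model (sentence-transformers; the model name is just a common default, swap in your own): embed the same text at sentence level and as one bigger chunk, then see which retrieves better for your queries.

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "How large should my RAG chunks be?"

sentence_chunks = ["Context windows are limited.",
                   "Documents are split into chunks before embedding.",
                   "Cosine similarity ranks the retrieved chunks."]
paragraph_chunks = [" ".join(sentence_chunks)]  # same text, coarser granularity

q = model.encode([query], normalize_embeddings=True)[0]
for name, chunks in [("sentence-level", sentence_chunks), ("paragraph-level", paragraph_chunks)]:
    emb = model.encode(chunks, normalize_embeddings=True)
    print(name, np.round(emb @ q, 3))  # cosine scores, since embeddings are normalized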
I wish I could own as many laptops as you! I'm currently struggling with my HP Pavilion x360, which has a dead battery. The replacement battery from the manufacturer costs almost $100, but I could buy a new laptop for just $300. I'm feeling quite confused about what to do 🥲
Thanks!