DeepSeek on Apple Silicon in depth | 4 MacBooks Tested

  • Published: 4 Feb 2025

Comments • 116

  • @AZisk
    @AZisk  3 hours ago +6

    No sponsor on this video, well, except the amazing members of the channel. You can also join here: ruclips.net/channel/UCajiMK_CY9icRhLepS8_3ugjoin

    • @kotenkostiantyn1282
      @kotenkostiantyn1282 2 hours ago

      Could you please try DrawThings or another app and run an image-generation model to see how it handles on Macs? I'm really curious about the speed. I have an old Intel Mac and it takes ~5 minutes for one image generation.

  • @wragnini
    @wragnini 3 hours ago +60

    Hey Alex, excellent video! Just to add a quick note: Apple chips support INT8 and FP16 instructions, but not INT4. It might seem counterintuitive, but Q8 models actually run much faster than Q4 on Apple processors, because the more aggressively quantized weights need extra computation to unpack at inference time. MLX 4-bit models are faster because they are optimized for Apple's Neural Engine and AMX, avoiding CPU/GPU bottlenecks. That makes Q4 in MLX faster than Q8 and other non-optimized Q4 models.
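    For anyone who wants to try the MLX path concretely, here is a minimal sketch using the mlx-lm package (the mlx-community repo name below is just an example; swap in whichever quant you actually downloaded):
    # Minimal sketch, assuming `pip install mlx-lm` and an mlx-community 4-bit quant.
    from mlx_lm import load, generate
    model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit")  # example repo
    reply = generate(
        model, tokenizer,
        prompt="Explain what quantization does to an LLM in two sentences.",
        max_tokens=256,
        verbose=True,  # prints generation speed (tokens/sec) as it runs
    )
    print(reply)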

    • @Rob.Tufino
      @Rob.Tufino 33 minutes ago +2

      Wait, you lost me at the end: Q8 models are faster than Q4 models on Apple silicon [when choosing GGUF, I'm assuming?], but Q4 models become faster than Q8 models when choosing MLX.
      Did I get that right?
      I also noticed he only installed LM Studio but didn't go over installing MLX from GitHub; is installing MLX not necessary?
      I downloaded MLX Community's DeepSeek R1 Distill Qwen 14B 4-bit,
      running on a slightly upgraded M4 Mac mini with 24GB of unified memory.
      It gets around 11 tokens per second.
      My use case is coding assistance, so I'm not too concerned about speed, but I still have no idea what I'm doing yet.
      If the download size relates to RAM or unified memory, does that mean I should be able to download up to a 16GB model, and that Q8 should perform better if it is GGUF, or Q4 if it is MLX?

    • @gsharrock
      @gsharrock 33 minutes ago +2

      This. Mate, you should be making the vids and getting the views.

  • @mehregankbi
    @mehregankbi 3 hours ago +30

    Ever since the model was released, I knew this vid was coming. Tonight I checked your channel feed for it, couldn't find it, and five minutes later it's on my home feed.

    • @AZisk
      @AZisk  3 hours ago +3

      Right on!

  • @el_manu_el_
    @el_manu_el_ 2 hours ago +4

    We were waiting for this, Alex!
    BTW DeepSeek told me this (as an example comparing quantization vs parameters):
    Choose the 3B model with Q6_K if:
    - You prioritize response quality over model size.
    - You have limited hardware.
    - You need fast inference.
    Choose the 14B model with Q2_K only if:
    - You need a larger model for tasks that require greater generalization capability.
    - You have sufficient hardware to handle the model size.
    - You can tolerate a potential loss in quality due to low quantization.
    In most cases, the 3B model with Q6_K will be a more balanced and practical choice.
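    A rough way to sanity-check that advice is to compare the weight footprints directly (a back-of-envelope sketch; real GGUF files also carry quantization scales, so actual sizes run a bit larger):
    # Approximate weight size: parameters * bits_per_weight / 8 bytes
    # (ignores quantization metadata and the KV cache, so treat these as lower bounds).
    def approx_weight_gb(params_billion: float, bits: float) -> float:
        return params_billion * bits / 8  # GB, since 1e9 params * bits/8 bytes = (params_billion*bits/8) GB
    for name, params, bits in [("3B Q6_K", 3, 6), ("14B Q2_K", 14, 2), ("14B Q4_K", 14, 4)]:
        print(f"{name}: ~{approx_weight_gb(params, bits):.1f} GB of weights")
    # 3B Q6_K ~ 2.3 GB vs 14B Q2_K ~ 3.5 GB: similar footprint, so the choice is
    # really about quality per byte rather than raw parameter count.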

  • @tarangsharma2898
    @tarangsharma2898 1 hour ago +1

    My God!! That's good editing (zooming in and out to focus on the text). Keep it up! Great video!

  • @mortysmith2192
    @mortysmith2192 7 minutes ago

    The brain bit is the cutest thing I’ve ever seen 😊

  • @andrewmerrin
    @andrewmerrin 2 hours ago +1

    You are truly doing gods work! I’ve been thinking about exactly this since the models dropped.

  • @JohnLark-k9y
    @JohnLark-k9y 2 hours ago +3

    Excited to see you use MLX and talk about quantization. As a MacBook Pro M3 Pro owner, I want to look a little into MLX and quantization; it would be amazing if you did some videos on those!

  • @aarondavis156
    @aarondavis156 3 hours ago +1

    Awesome, was just looking for this an hour ago! Thanks, Alex!

  • @beauslim
    @beauslim 39 minutes ago

    You read my mind and made the video I was looking for.

  • @nisarg254
    @nisarg254 1 hour ago

    I am from India.
    It's 4am here; I saw your video and just installed LM Studio.
    Man, thanks for making this video.

  • @SamMeechWard
    @SamMeechWard 2 hours ago

    I'm just going to binge-watch all your LLM videos now

  • @RTXON-h1h
    @RTXON-h1h 3 hours ago +6

    FINALLY! Incredible video. I love AI on the Mac; for a software engineer, getting DeepSeek R1 running as a local AI is really worth it. Next time, try to destroy your M4 Max 128GB with the largest DeepSeek R1 model you can fit.

  • @davehachey3888
    @davehachey3888 2 hours ago +1

    Hey, thanks. Great introduction to running LLMs on Mac hardware.

  • @himawanthsomarowthu9897
    @himawanthsomarowthu9897 3 hours ago +1

    Thanks. I still own an M1 Air; this gives me good insight for upgrade decisions.

  •  1 hour ago

    Great video as always!

  • @owobobhere
    @owobobhere 3 hours ago

    Thanks for the breakdown !

  • @mrpro7737
    @mrpro7737 3 hours ago +2

    Hi bro, I love your videos, keep it up

    • @AZisk
      @AZisk  3 hours ago +1

      I appreciate it!

  • @akin242002
    @akin242002 3 hours ago +1

    Excellent topic! Thanks!

    • @AZisk
      @AZisk  3 hours ago

      Glad you liked it!

  • @johnmarshall4_
    @johnmarshall4_ 2 hours ago +2

    I would have loved to see the performance difference of the 70b model between GGUF and MLX on the M4 Max. I had my fingers crossed at the end of the video, but alas, no joy.

    • @AZisk
      @AZisk  2 hours ago +1

      next time

  • @CaimAstraea
    @CaimAstraea 20 minutes ago

    Yes! The best way to use this is with 2 maxed-out Mac Studios (2 x 192 GB of unified memory) and Exo to run them in parallel across multiple machines! The 2nd best is having access to H200 GPUs.

  • @bufuCS
    @bufuCS 2 hours ago

    Finally, was waiting for this one 👨🏻‍💻

  • @sh0me14
    @sh0me14 24 minutes ago

    I tested the DeepSeek-R1-Distill-Qwen-32B-4bit MLX version on an M1 Max 64GB and got 16 tok/sec. Not bad, but not great either. I didn't like the model's output, although I didn't test it extensively. Love your videos. Keep it up.

    • @lorenzo6777
      @lorenzo6777 8 minutes ago

      @@sh0me14 What are you using to run them? LM Studio?

    • @sh0me14
      @sh0me14 3 minutes ago

      @ Yes, I used LM Studio.

  • @bhanuprakashp9616
    @bhanuprakashp9616 3 hours ago

    Hello Alex,
    it's a great video running DeepSeek on different Apple silicon.

  • @emilianosteinvorth7017
    @emilianosteinvorth7017 2 hours ago

    If you could test models with different parameter counts, I think it would be a nice addition, to see how small a model we can use without losing much quality, like 7B vs 14B. Good video as always 💪

  • @keithdow8327
    @keithdow8327 2 hours ago +2

    Thanks!

    • @AZisk
      @AZisk  2 hours ago

      thanks so much!

  • @macsoyyo
    @macsoyyo 3 hours ago +2

    I've run the 14B version (almost 10 GB) with Ollama on a MacBook Pro M2 16GB. It runs slow but OK. Impressive results, to be honest.

  • @Adino144k
    @Adino144k 1 hour ago

    I appreciate your channel so much, I watch your videos sometimes just to give you views and thumbs up!

  •  2 hours ago +1

    Great video! I'd be curious to know how a Mac mini M4 Pro 64GB would perform compared to the M4 Max in this video, and maybe an older M3 Max.

    • @BlueHawk80
      @BlueHawk80 1 hour ago

      Yep, I am also eyeing an M3 Max, either the binned 36GB or the full 48GB, or maybe even 64GB. I do Stable Diffusion and my 32GB M1 Max already struggles a bit with SDXL upscaling, and I am interested in LLMs…

  • @leonardong2409
    @leonardong2409 1 hour ago

    Wonderful video as always. Would you cover the heavy disk read/write activity when loading and running models, how it wears the SSD, and whether we should even be worried about it? A single 70B model can easily read and write a few hundred GB in one session. Great idea for content.

  • @quicogil4565
    @quicogil4565 3 hours ago +2

    Saved for tomorrow. My job involves selling laptops with AI capabilities, so I need to know how Apple silicon stacks up running DeepSeek. Thank you!

    • @AZisk
      @AZisk  3 hours ago

      Ok, see you tomorrow

  • @BillyNoMates1974
    @BillyNoMates1974 2 hours ago

    Brilliant video.
    Now I can run my own AI on my 16GB system using LM Studio.
    It makes setting up so easy. Thanks for the tip.

  • @HadesTimer
    @HadesTimer 2 hours ago +1

    Alex, I love you man, but for people who don't already understand local installs, you are going FAR into the weeds here. Still, it's a good tutorial. Most people should just download and install the mistral o3 mini model and they should be good. Lol. Good job man, love you. Keep up the good work. Keep it Mac ;)

    • @AZisk
      @AZisk  2 hours ago

      yep, this could have been two videos, but I threw it all in there

  • @CreativeFunction
    @CreativeFunction 2 hours ago +9

    None of these are DeepSeek; they are distilled models that are OK, but not the same as the true 671B model.

    • @gaiustacitus4242
      @gaiustacitus4242 50 minutes ago +2

      The "true" model of DeepSeek was not trained on a curated data set but also created by distillation. Had DeepSeek been trained it too would have taken years and cost billions.

  • @gpreddy172
    @gpreddy172 3 hours ago +1

    Ay ay, a new hand appeared on the channel 🤔

  • @Thevikingcam
    @Thevikingcam 3 hours ago +1

    Running 14B on the base M4 mini. It's not that slow, totally usable. You can use a 20B but it's slow, takes a minute or so.

  • @cacamacaaa
    @cacamacaaa 3 hours ago +4

    muahahah exactly what I was looking for. thx buddy

    • @AZisk
      @AZisk  3 hours ago

      You bet!

  • @ish2222
    @ish2222 3 hours ago +1

    I think about something and this guy releases a video about it

  • @swissheartydogs
    @swissheartydogs 3 hours ago +1

    Open source & local LLM, on my powerful PC 128GB= my real second brain (thanks Tiago Forte)

  • @glych002
    @glych002 1 hour ago

    I'm running a 14-billion-parameter model on my M1 Mac mini and it does just fine. The memory pressure is in the yellow, not the red, and it's not using swap memory. I'm running it with Open WebUI. I don't think the recommended settings in that app you're using are accurate. 19:33

  • @madsnylarsen
    @madsnylarsen 2 hours ago

    Hi Alex, great comparison video and insights into quantization. Up for a fun challenge? I think a lot of us are puzzling over a good local code-assist setup, to save API cash you know :). The challenge: find the most optimal code-assist server setup, testing an M3/M4 Mac mini, a PC mini, and a PC with a 16GB or 24GB GPU, running Ollama with Mistral code models and DeepSeek models in whatever billion-parameter variants fit the hardware. To get a proper result, it should be tested with a bunch of code challenges in JS + HTML + CSS, like a Flappy Bird clone, a snake game, a rotating triangle with a ball bouncing inside it, a couple of web apps, etc. :D The CODE-ASSIST SHOWDOWN :D

  • @wiktor0985
    @wiktor0985 1 hour ago

    To fix the "this message has no content" issue, you can reduce the GPU offload / CPU thread pool size and it might start responding.
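    For context, those two settings roughly correspond to llama.cpp's n_gpu_layers and n_threads options (LM Studio uses llama.cpp for GGUF models). A minimal sketch with llama-cpp-python, where the GGUF path and the numbers are placeholders to tune for your machine:
    from llama_cpp import Llama
    llm = Llama(
        model_path="DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,  # lower this if responses come back empty or generation stalls
        n_threads=6,      # a smaller CPU thread pool can also help on a busy machine
        n_ctx=4096,
    )
    out = llm("Say hello in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])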

  • @Andrew-v2g
    @Andrew-v2g 55 minutes ago

    Nice work! Could you please create a video on how to set up a personal server with R1 and connect a mobile app or webpage to it, if possible? I apologize if this request is inconvenient.

  • @ye6207
    @ye6207 3 hours ago +1

    First. I should try this on my M3 MacBook Air 16GB.

  • @RnRoadkills
    @RnRoadkills 2 hours ago

    Great video 👍 I guess it would be possible to train the AI? If so, that could be your next video, showing how. No pressure 😉

  • @chidorirasenganz
    @chidorirasenganz 2 hours ago

    I think you should test out Private LLM and compare it to Ollama and the like.

  • @BlueHawk80
    @BlueHawk80 1 hour ago

    Solid video, but more memory variation instead of just 8GB (3x) and 128GB would be more realistic. Most MacBook users have 16, 24, or 32/36GB of memory.

    • @BlueHawk80
      @BlueHawk80 1 hour ago

      Never mind, the M1 at least had 16GB…

  • @gpreddy172
    @gpreddy172 3 hours ago +5

    It's 2am here in India and I am here watching this guy like I never did

    • @AZisk
      @AZisk  3 hours ago +1

      wow it's late!

    • @vintage0x
      @vintage0x 3 hours ago

      what does "watching this guy like I never did" mean?

    • @suyashbhawsar
      @suyashbhawsar 3 hours ago

      @@gpreddy172 Haha, we can’t unsee the title now

  • @lashlarue59
    @lashlarue59 21 minutes ago

    So the bottom line when it comes to running these models locally: you either need lots of unified memory (in the case of the Mac, Pi, Orin, etc.) or lots of VRAM on video cards. Even 3090s and 4090s only come with 24GB of memory, so it would take several of them for the larger models; it's been a while, but I thought there was some sort of problem getting CUDA to see the VRAM on multiple cards as one big memory pool.

  • @DS-pk4eh
    @DS-pk4eh 3 hours ago +1

    Finally.
    I usually do the pull command first and then run with the verbose option.
    Now I will download the same models and see how my config compares to yours.
    1.5B model: Radeon 7900 XTX, 215 tokens/sec (225 tokens/sec with LM Studio).
    32B model: I am getting 24.5 tokens/sec.
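    (If anyone wants the same numbers without eyeballing the verbose output: a small sketch that, after running ollama pull, calls the local Ollama REST API and computes the eval rate. It assumes Ollama's default port and the eval_count / eval_duration fields the API returns.)
    import requests
    resp = requests.post(
        "http://localhost:11434/api/generate",   # Ollama's default local endpoint
        json={"model": "deepseek-r1:1.5b",        # same idea for the 32b tag, just slower
              "prompt": "Write a 100-word story about a fox.",
              "stream": False},
        timeout=600,
    ).json()
    tok_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)  # duration is in nanoseconds
    print(f"{resp['eval_count']} tokens at ~{tok_per_sec:.1f} tok/s")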

    • @AZisk
      @AZisk  2 hours ago +1

      not bad at all

    • @osman2k
      @osman2k 2 hours ago

      That's nice. Do you need to install anything extra to make the Radeon card work?

    • @DS-pk4eh
      @DS-pk4eh 2 hours ago

      @@osman2k No, just install either Ollama or LM Studio; it comes with everything you need.

    • @DS-pk4eh
      @DS-pk4eh 2 hours ago

      @@AZisk Yeah, not bad; I'll try it on an Nvidia 3070 later.
      An interesting thing happened: after I ran it the first time, I gave it the same prompt two more times (write me a 1000-word story), but it got slower, as if it were thinking: the user is not content with the first answer, so try harder!

  • @Super1994jorge
    @Super1994jorge 2 hours ago

    I wonder how it would perform on the Mac mini cluster.

  • @generalbystander1631
    @generalbystander1631 1 hour ago

    With the proliferation of all these models… I wonder how this may worsen local machine/network security in novel or otherwise yet-to-be-imagined ways.

  • @Techonsapevole
    @Techonsapevole 3 hours ago

    Almost 10 tokens/s for a 70B on a laptop is gooood!

  • @JosephAgbakpe-kafui
    @JosephAgbakpe-kafui 2 hours ago

    What is the lowest recommended CPU generation and RAM size to run a decent model?

  • @manwithbatsuit
    @manwithbatsuit 2 hours ago

    Alex, what do you think about running multiple models together on a Mac Studio?

  • @Techonsapevole
    @Techonsapevole 3 hours ago

    Great, but please test the AMD Ryzen AI 395.

    • @nickd7935
      @nickd7935 2 hours ago

      Strix Halo hasn't been released in any device yet, so no one can test it.

  • @squachmode1360
    @squachmode1360 37 minutes ago

    24GB M4 Pro?

  • @MrOktony
    @MrOktony 3 hours ago

    Thanks for the info. By the way, have you tried copying an Ollama model from an old Mac to a new Mac, to save downloading it again? And does it work? I tried, and it doesn't show up in ollama ls! I believe this is a common problem. The only workaround is to delete the manifest and run Ollama so it downloads a new manifest, but then the model acts really strange...

  • @secRaphyTwin
    @secRaphyTwin 3 hours ago

    Can you try the MLX ports for Apple Silicon?

    • @AZisk
      @AZisk  3 hours ago

      I do in this video

  • @vinoskey5243
    @vinoskey5243 26 minutes ago

    26:06 But why should we share our data with the USA through ChatGPT?

  • @glych002
    @glych002 1 hour ago

    Those quantized models don't have any knowledge in them. You have to add RAG knowledge datasets to them. 21:06

  • @chandebrec5856
    @chandebrec5856 15 minutes ago

    You are not installing "DeepSeek R1" on all those Macs. You're installing a smaller model, based on a Llama or Qwen model, that is distilled from R1.

  • @EnricBatalla-dw5cp
    @EnricBatalla-dw5cp 3 hours ago

    Are Ollama models better than MLX models on Apple silicon? Is there any way to run MLX models in the Ollama interface?

  • @vifvrTtb0vmFtbyrM_Q
    @vifvrTtb0vmFtbyrM_Q 1 hour ago

    Maybe it's worth showing people how an LLM actually works on Apple silicon? For example, give it source code so that the context is 32k tokens and the model is 30B or larger.

  • @JohnAlanWoods
    @JohnAlanWoods 3 hours ago

    Can you run it? Or is it a distill? I like that you're intellectually honest.

  • @keithdow8327
      @keithdow8327 2 hours ago +2

    You have a son? I thought you were an irreproducible result!

    • @AZisk
      @AZisk  2 hours ago

      teaching him to make yt videos now

  • @Enough2Choke
    @Enough2Choke 1 hour ago

    So, 8 M4 Pro minis with 64GB of memory = 512GB of memory; couldn't you use Petals to distribute the full-sized f32 671B DeepSeek R1 model across all of them, on a system that costs under $20,000? And when can we expect you to do that for a video?
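    (A quick back-of-envelope check on that, a sketch that counts only the weights and ignores the KV cache and the overhead of sharding across machines: at f32 it will not fit, only a heavily quantized copy would.)
    PARAMS = 671e9  # DeepSeek R1 parameter count
    for label, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1), ("4-bit", 0.5)]:
        weights_gb = PARAMS * bytes_per_param / 1e9
        verdict = "fits" if weights_gb <= 512 else "does not fit"
        print(f"{label}: ~{weights_gb:,.0f} GB of weights -> {verdict} in 512 GB")
    # FP32 ~2,684 GB, FP16 ~1,342 GB and FP8 ~671 GB all exceed 512 GB;
    # only a ~4-bit quant (~336 GB) leaves room to spare.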

  • @robinthekidd
    @robinthekidd 1 hour ago

    Hey Alex, long-time watcher, first-time commenter here. Any chance you could show us a comparison of passwords being cracked on M chips vs other ARM chips vs Intel?
    My uni assignments from 20 years ago are trapped in a zip file from WinZip 2.0-ish. I want to show my kids and have a trip down memory lane, but I can't find the best hardware to crack the zip password.

  • @devenbhatt2137
    @devenbhatt2137 2 hours ago

    You missed MLX! Check out the MLX community on Hugging Face.

    • @AZisk
      @AZisk  2 hours ago +1

      I think you missed MLX in my video

    • @abulka
      @abulka 2 hours ago

      See 15:26 for MLX

  • @avetissian
    @avetissian 2 hours ago

    Yo Alex, if the M4 Max is out here breaking speed limits at 182 tokens/sec but the M3 is crawling at 7.5 with the 8B model, was the M3 just built to keep my coffee warm while the M4 writes novels? 🖋️☕ Or is there hope for us mere mortals without maxed-out GPUs and overflowing RAM? 😂

  • @electronicmaji
    @electronicmaji 1 hour ago

    Keep running this and the government is going to put you on a list.

  • @eduardomoura2813
    @eduardomoura2813 3 hours ago

    Buy a bunch of Macs, or wait for Nvidia DIGITS and hope I can get a few before scalpers do their thing...

  • @DS-pk4eh
    @DS-pk4eh 2 hours ago

    22:00 Just download more RAM, duh.... ;-)

  • @x1989Minaro
    @x1989Minaro 2 hours ago

    Our current computers are not designed for this task.
    It's best to save the money and invest in future processors that will run these models in full with a fraction of the current power.
    These machines are already outdated.
    My M1 will be my last PC before the AI era.

  • @spirosgmanis
    @spirosgmanis 1 hour ago

    ❤❤

  • @avrenos
    @avrenos 3 hours ago

    Brother, just donate a Mac :) to an aspiring dev

  • @rydmerlin
    @rydmerlin 3 hours ago

    You're going to jail, Alex ;-)

  • @hedgehogform
    @hedgehogform 2 hours ago +1

    DeepSeek isn't that good. It's only good if you want an annoyingly talkative girlfriend, honestly. It still sucks at coding. Sonnet 3.5 is still the best.

  • @Mikoaj-ie6gt
    @Mikoaj-ie6gt 56 minutes ago

    interesting

  • @espero7757
    @espero7757 1 hour ago +1

    RAG: just experimenting with different embeddings and context sizes 🥸
    Do you have any advice on choosing between sentence and semantic embeddings?
    Any ideas are welcome.

  • @HunterXron
    @HunterXron 3 hours ago

    I wish I could own as many laptops as you! I'm currently struggling with my HP Pavilion x360, which has a dead battery. The replacement battery from the manufacturer costs almost $100, but I could buy a new laptop for just $300. I'm feeling quite confused about what to do 🥲

  • @davehachey3888
    @davehachey3888 1 hour ago

    Thanks!