Optimize Your AI - Quantization Explained

  • Published: 31 Dec 2024

Comments

  • @lemniscif
    @lemniscif 4 days ago +7

    the kid at the end is my spirit animal

  • @andikunar7183
    @andikunar7183 4 days ago +1

    Amazing explanation, thanks!

  • @lofiurbex2511
    @lofiurbex2511 4 days ago +4

    Great info, thanks! Also, very glad you put that clip in at the end

  • @vincentnestler1805
    @vincentnestler1805 3 days ago

    Thanks!

  • @eyeseethru
    @eyeseethru 4 days ago +2

    Bad Dad, using up all the emergency tape!
    But really, thanks for another GREAT video that simplifies something super useful for so many. We appreciate you and your family's tape!!

  • @MikeCreuzer
    @MikeCreuzer 1 day ago

    I just tossed money at a new laptop with a 4070 JUST for Ollama, and with this video I was also able to throw smarts at it too to get it to do more with the 8GB of VRAM on laptop 4070s. Thanks so much!
    I'd been spending a lot of time building models with various context widths and benchmarking the VRAM consumption. Deleted a bunch of them because they ended up getting me a CPU/GPU split. Time to create them again because they will now fit in VRAM!
    Thanks again!
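
    A minimal sketch of one way to build such context-width variants, assuming Ollama's Modelfile syntax; the base model, the new name, and the num_ctx value below are placeholders:

    ```sh
    # Create a variant of a base model with a fixed context window (placeholder values)
    printf 'FROM llama3.1:8b\nPARAMETER num_ctx 8192\n' > Modelfile

    # Register it under a new name, load it, then check the memory footprint
    ollama create llama3.1-8k -f Modelfile
    ollama run llama3.1-8k "hello"
    ollama ps   # lists loaded models with their size and CPU/GPU split
    ```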

  • @ShaneHolloman
    @ShaneHolloman 4 days ago +1

    Absolute champion! Really appreciate you Matt. Thank you ...

  • @octopusfinds
    @octopusfinds 4 days ago +1

    Thank you, Matt! 🙌 This was the topic I was going to ask you to cover. Great explanation and props! 👏👍

  • @yuda2207
    @yuda2207 4 days ago

    Thank you so much! This has helped me a lot! Please keep going. I also enjoy videos that aren’t just about Ollama (but of course, I like the ones that are about Ollama too!). Thank you!

  • @romayojr
    @romayojr 4 days ago +2

    you may be a bad dad, but you're a great teacher!

  • @TheYuriTS
    @TheYuriTS 3 days ago +1

    I don't understand how to activate flash attention.
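
    A minimal sketch of the usual way to enable it, using the OLLAMA_FLASH_ATTENTION environment variable quoted in another comment below; the syntax assumes a POSIX shell on the machine that runs the Ollama server:

    ```sh
    # Enable flash attention for this server session only
    OLLAMA_FLASH_ATTENTION=true ollama serve

    # Or export the variable first, then start the server
    export OLLAMA_FLASH_ATTENTION=true
    ollama serve
    ```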

  • @TheInternalNet
    @TheInternalNet 23 hours ago

    Thank you Matt for this amazing explanation. I had a brief understanding, so I had thought about this, but you really helped me fully grasp how this works. Also, your videos are an emergency

  • @TomanswerAi
    @TomanswerAi 3 days ago

    This is a good one. Nice topic.

  • @skyak4493
    @skyak4493 4 days ago +1

    This is just the info I was looking for. My goal is to get useful coarse AI running on the GPU of my gaming laptop, leaving the APU to give me its full attention.

  • @greatermoose
    @greatermoose 2 days ago

    Hi Matt, what about the quality of responses with flash attention enabled?

  • @CptKosmo
    @CptKosmo 4 days ago

    Nice, way to end with a smile :)

  • @cloudsystem3740
    @cloudsystem3740 4 days ago

    thank you very much 👍👍😎😎

  • @AhmedAshraf-pw1bn
    @AhmedAshraf-pw1bn 3 days ago

    What about IQ quantization, such as IQ3M?

  • @Noctalin
    @Noctalin 4 days ago

    Thank you for your awesome AI videos!
    Is there an easy way to always set the environment variables by default when starting ollama? I sometimes forget to set them after a restart.
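
    One common way to persist it, sketched under the assumption that Ollama runs as the macOS menu-bar app or as a systemd service on Linux (the commands are illustrative, not from the video):

    ```sh
    # macOS (Ollama app): set the variable for launchd, then restart the app
    launchctl setenv OLLAMA_FLASH_ATTENTION 1

    # Linux (systemd service): add an override containing the variable, then restart
    sudo systemctl edit ollama.service
    #   [Service]
    #   Environment="OLLAMA_FLASH_ATTENTION=1"
    sudo systemctl restart ollama
    ```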

  • @ArtificialIntelligenceSP
    @ArtificialIntelligenceSP 4 days ago

    Thank you! What is the name of the tool on macOS that you are using to see those memory graphs?

  • @styxlegendgaming
    @styxlegendgaming 4 days ago +1

    Nice information

  • @BlenderInGame
    @BlenderInGame 3 days ago

    Wow! This made my favorite model much faster! 🤯 I couldn't run `OLLAMA_FLASH_ATTENTION=true ollama serve` for some reason, so I set the environment variable instead. Now, if only Open WebUI used those settings...

  • @tecnopadre
    @tecnopadre 4 days ago +1

    It would be nice to have a video on downloading a model and modifying it, for example for a Mac mini with 16 GB or 24 GB, as a real case. Awesome as usual. Thank you

    • @technovangelist
      @technovangelist  4 days ago

      I am using my personal machine, an M1 Max with 64 GB. Pretty real case

    • @themax2go
      @themax2go 3 days ago

      I think what the viewer meant was specifically those memory availabilities. That said, it's also not that realistic, because everyone has different available memory depending on what else they have running (vscode, cline, docker / podman + various containers, browser windows, n8n / langflow / ..., ...). It all depends on one's specific setup and use case. People keep forgetting that it's all apples and oranges.

    • @technovangelist
      @technovangelist  3 days ago

      Some have 8 or 16 or 24 or 32 GB. But the actual memory isn't all that important. Knowing what model fits in the space available is the important part.
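
      A rough back-of-the-envelope way to check that fit, with purely illustrative numbers (weights only; the KV cache and runtime overhead come on top):

      ```sh
      # size in GB ≈ parameters (billions) × bits per weight ÷ 8, weights only
      params_billions=8     # e.g. an 8B model (placeholder)
      bits_per_weight=4     # e.g. a q4 quantization (placeholder)
      echo "approx $(( params_billions * bits_per_weight / 8 )) GB of weights"
      # -> ~4 GB, which leaves room for context on an 8 GB GPU
      ```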

  • @dylanelens
    @dylanelens 4 days ago

    Matt, you blew my mind

    • @dylanelens
      @dylanelens 4 days ago

      Flash attention is precisely what I needed.

  • @brentknight9318
    @brentknight9318 4 days ago

    Super helpful: S, M, L … I didn’t realize that was the scheme, duh.
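
    For anyone else who missed it, those S/M/L suffixes appear directly in the quantization tags on model pages; a hypothetical pull of the same model at two K-quant sizes (the exact tag names vary by model, so check its tags page):

    ```sh
    # Same model and bit width, different K-quant "size" variants (S = small, M = medium)
    ollama pull llama3.1:8b-instruct-q4_K_S
    ollama pull llama3.1:8b-instruct-q4_K_M
    ollama list   # compare the download sizes side by side
    ```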

  • @leondbleondb
    @leondbleondb 4 days ago

    Good info.

  • @vincentnestler1805
    @vincentnestler1805 2 days ago

    Thanks, this was very helpful!
    Question: I have a Mac Studio M2 Ultra with 192 GB of unified RAM. What do you think is the largest model I could run on it? Llama 3.1 has a 405b model that at q4 is 243 GB. Do you think I could run it with flash attention and KV/context quantization?

    • @technovangelist
      @technovangelist  2 days ago

      I doubt it, but it's easy to find out. But I can't think of a good reason to want to.

  • @wawaldekidsfun4850
    @wawaldekidsfun4850 3 days ago +1

    While this video brilliantly explains quantization and offers valuable technical insights for running LLMs locally, it's worth considering whether the trade-offs are truly worth it for most users. Running a heavily quantized model on consumer hardware, while impressive, means accepting significant compromises in model quality, processing power, and reliability compared to data center-hosted solutions like Claude or GPT. The video's techniques are fascinating from an educational standpoint and useful for specific privacy-focused use cases, but for everyday users seeking consistent, high-quality AI interactions, cloud-based solutions might still be the more practical choice - offering access to full-scale models without the complexity of hardware management or the uncertainty of quantization's impact on output quality.

    • @technovangelist
      @technovangelist  3 days ago +2

      Considering that you can get results very comparable to hosted models even when using q4 and q3, I'd say it certainly is worth it.

    • @themax2go
      @themax2go 3 days ago

      GPT is a tech and not a (cloud) product

    • @technovangelist
      @technovangelist  3 days ago

      In this context it is absolutely a cloud product

  • @60pluscrazy
    @60pluscrazy 4 days ago

    🎉🎉🎉

  • @eric81766
    @eric81766 4 days ago

    Yes, but where can I buy that rubber duck shirt? That is the ultimate programming shirt.

    • @technovangelist
      @technovangelist  4 days ago +1

      Ahhh, purveyor of all things good and bad: Amazon

    • @eric81766
      @eric81766 3 days ago

      @@technovangelist That moment of realization that amazon has *pages* of results with "men rubber duck button down shirt".

  •  4 days ago

    Thanks

  • @ChiefTormentOfficer
    @ChiefTormentOfficer 4 days ago +1

    I'm reporting you to the emergency tape misappropriation department.

  • @JNET_Reloaded
    @JNET_Reloaded 4 days ago +1

    Combine this with a bigger swap file and you're laughing! You don't need a GPU; a swap file is your friend!

  • @dave24-73
    @dave24-73 2 days ago

    What am I going to do with my 300 GB dual Xeon server, now that I can do it on a laptop? LOL

  • @Talaria.School
    @Talaria.School 4 days ago

  • @pabloescobar2738
    @pabloescobar2738 4 days ago

    The audio 😢, but no problem, I understand English 😅, the dev life 😂, thanks

  • @reserseAI
    @reserseAI 4 days ago

    I hate it when viewers say "nice explanation"; I have absolutely no idea about any of this.

    • @bobdole930
      @bobdole930 4 days ago

      There's a lot to learn; this same channel has a playlist for learning Ollama. Ollama is the open-source platform your AI models run on. If his style of explanation isn't clicking, try someone else who does something similar.

  • @andrei-xe7nu
    @andrei-xe7nu 4 days ago

    Thank you, Matt.
    You might be interested to know that large models are cheapest to run on the Orange Pi 5 Plus, where RAM is used as VRAM. You get up to 32 GB of VRAM for $220, with good performance (6 TOPS) and power consumption of 2.5 A at 5 V. Ollama is in the Arch packages and available for arm64.
    price/performance!