the kid at the end is my spirit animal
Amazing explanation, thanks!
Great info, thanks! Also, very glad you put that clip in at the end
Thanks!
Bad Dad, using up all the emergency tape!
But really, thanks for another GREAT video that simplifies something super useful for so many. We appreciate you and your family's tape!!
I just tossed money at a new laptop with a 4070 JUST for Ollama, and with this video I was able to throw some smarts at it too, to get it to do more with the 8 GB of VRAM on laptop 4070s. Thanks so much!
I'd been spending a lot of time building models with various context widths and benchmarking the VRAM consumption. Deleted a bunch of them because they ended up getting me a CPU/GPU split. Time to create them again because they will now fit in VRAM!
Thanks again!
Absolute champion! Really appreciate you, Matt. Thank you ...
Thank you, Matt! 🙌 This was the topic I was going to ask you to cover. Great explanation and props! 👏👍
Thank you so much! This has helped me a lot! Please keep going. I also enjoy videos that aren’t just about Ollama (but of course, I like the ones that are about Ollama too!). Thank you!
You may be a bad dad, but you're a great teacher!
I don't understand how to activate flash attention.
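For anyone else stuck on this, a minimal sketch, assuming you start the server yourself from a terminal (the variable is the one quoted in a later comment; a persistent setup is sketched further down):

```sh
# Turn on flash attention for the Ollama server, then start it.
export OLLAMA_FLASH_ATTENTION=true
ollama serve
```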
Thank you Matt for this amazing explanation. I had a brief understanding so I thought of this but you really helped me fully grasp how this works. Also, your videos are an emergency
This is a good one. Nice topic.
This is just the info I was looking for. My goal is to get useful, coarse AI running on the GPU of my gaming laptop, leaving the APU free to give me its full attention.
Hi Matt, what about the quality of responses with flash attention enabled?
Nice, way to end with a smile :)
thank you very much 👍👍😎😎
What about IQ quantization, such as IQ3M?
Thank you for your awesome AI videos!
Is there an easy way to always set the environment variables by default when starting Ollama? I sometimes forget to set them after a restart.
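A sketch of the usual ways to make the variables stick across restarts; the service name and paths assume a default install (systemd service on Linux, the menu-bar app on macOS), so treat them as assumptions and adjust for your setup:

```sh
# Linux, systemd-managed install: add the variables to a service override,
# then restart the service.
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=true"
sudo systemctl restart ollama

# macOS app: set the variable for launchd, then quit and reopen Ollama.
launchctl setenv OLLAMA_FLASH_ATTENTION true

# Or, if you run `ollama serve` yourself, export it in your shell profile.
echo 'export OLLAMA_FLASH_ATTENTION=true' >> ~/.zshrc
```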
Thank you! And what is the name of the tool on macOS that you're using to see those memory graphs?
Nice information
Wow! This made my favorite model much faster! 🤯 I couldn't run `OLLAMA_FLASH_ATTENTION=true ollama serve` for some reason, so I set the environment variable instead. Now, if only Open WebUI used those settings...
It would be nice to have a video on downloading a model and modifying it for, say, a Mac mini with 16 GB or 24 GB as a real case. Awesome as usual. Thank you!
I am using my personal machine, an M1 Max with 64 GB. Pretty real case.
I think what the viewer meant was specifically those memory availabilities. That said, it's also not that realistic, because everyone has different available memory depending on what else they have running (VS Code, Cline, Docker/Podman plus various containers, browser windows, n8n/Langflow, ...) - it all depends on one's specific setup and use case. People keep forgetting that it's all apples and oranges.
Some have 8 or 16 or 24 or 32 GB. But the actual amount of memory isn't all that important. Knowing what model fits in the space available is the important part.
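To make that concrete, a rough back-of-envelope sketch (my own rule of thumb, not something from the video) for guessing whether the weights alone fit, using an 8B model at q4 as the example:

```sh
# Weights take roughly parameters × bits-per-weight ÷ 8 bytes;
# leave headroom on top of that for the context / KV cache.
params_b=8   # model size in billions of parameters (example value)
bits=4       # bits per weight at this quantization, e.g. 4 for q4, 8 for q8
awk -v p="$params_b" -v b="$bits" 'BEGIN {
  gb = p * b / 8                      # 8 * 4 / 8 = 4 GB of weights
  printf "weights: about %.1f GB; keep a few GB spare for the KV cache\n", gb
}'
```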
Matt, you blew my mind
Flash attention is precisely what I needed.
Super helpful: S, M, L … I didn’t realize that was the scheme, duh.
Good info.
Thanks, this was very helpful!
Question: I have a Mac Studio M2 Ultra with 192 GB of unified RAM. What do you think is the largest model I could run on it? Llama 3.1 has a 405b model that at q4 is 243 GB. Do you think I could run it with flash attention and KV/context quantization?
I doubt it, but it's easy to find out. But I can't think of a good reason to want to.
While this video brilliantly explains quantization and offers valuable technical insights for running LLMs locally, it's worth considering whether the trade-offs are truly worth it for most users. Running a heavily quantized model on consumer hardware, while impressive, means accepting significant compromises in model quality, processing power, and reliability compared to data center-hosted solutions like Claude or GPT. The video's techniques are fascinating from an educational standpoint and useful for specific privacy-focused use cases, but for everyday users seeking consistent, high-quality AI interactions, cloud-based solutions might still be the more practical choice - offering access to full-scale models without the complexity of hardware management or the uncertainty of quantization's impact on output quality.
Considering that you can get results very comparable to hosted models even when using q4 and q3, I'd say it certainly is worth it.
GPT is a tech and not a (cloud) product
In this context it is absolutely a cloud product
🎉🎉🎉
Yes, but where can I buy that rubber duck shirt? That is the ultimate programming shirt.
Ahhh, purveyor of all things good and bad: Amazon.
@technovangelist That moment of realization that Amazon has *pages* of results for "men rubber duck button down shirt".
Thanks
I'm reporting you to the emergency tape misappropriation department.
Combine this with a bigger swap file and you're laughing! You don't need a GPU; a swap file is your friend!
What am I going to do with my 300 GB dual Xeon server now that I can do it on a laptop? LOL
❤
The audio 😢, no problem, I understand English 😅, the dev life 😂, thanks!
I hate when viewers say "nice explanation"; I have absolutely no idea about this.
There's a lot to learn; the same channel has a playlist for learning Ollama. Ollama is the open-source platform your AI models run on. If his style of explanation isn't clicking, try someone else who does something similar.
Thank you, Matt.
You might be interested that large models are cheapest to run on the Orange Pi 5 Plus, where RAM is used as VRAM. You get up to 32 GB of VRAM for $220, with great performance (6 TOPS) and power consumption of 2.5 A × 5 V. Ollama is in the Arch packages and available for arm64.
Price/performance!