The best way to support this channel? Comment, like, and subscribe!
Great concise presentation. Thank you so much!
Thank you! 🙏
this is super valuable. awesome vid!
Thank you! 🙏
Thanks, very nice tutorial!
Thank you
Maybe a dumb question: how do you turn the stream data you receive into readable sentences?
You could accumulate tokens, split on sentence-ending punctuation (. ! ? etc.), and then send the response after grouping them like that; see the sketch below.
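A minimal sketch of that idea, assuming `stream` is any iterable of token strings (for example, the text chunks an Ollama streaming response yields):

```python
import re

def sentences_from_stream(stream):
    """Accumulate streamed tokens and yield complete sentences as they form."""
    buffer = ""
    for token in stream:
        buffer += token
        # Split after sentence-ending punctuation (., !, ?) followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        # Every part except the last is a complete sentence; keep the remainder.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():  # flush whatever is left when the stream ends
        yield buffer.strip()

# Usage:
# for sentence in sentences_from_stream(token_chunks):
#     print(sentence)
```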
For models at ~70B, I am getting timeout issues using vanilla Ollama. It works on the first pull/run, but times out when I need to reload the model. Do you have any recommendations for persistently keeping the same model running?
github.com/ollama/ollama/pull/2146
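That PR adds a keep_alive option to the API. A minimal sketch of using it, assuming the change has landed in your Ollama version (the model name here is just a placeholder):

```python
import requests

# keep_alive=-1 asks the server to keep the model loaded indefinitely
# instead of unloading it after the default idle timeout.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:70b",   # placeholder; use whatever model you pulled
        "prompt": "Hello",
        "stream": False,
        "keep_alive": -1,        # or a duration string like "30m"
    },
)
print(resp.json()["response"])
```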
Can you use Open WebUI?
This is very informative! Thanks :)
Curious why you used a g4dn.xlarge GPU instance ($300/month) instead of a t3.medium CPU instance ($30/month)? I assumed the 8-billion-parameter model was out of reach on regular hardware. What is the largest model that works on the g4dn.xlarge GPU? To put it into perspective, I have a $4K MacBook (16 GB RAM) that can really only run the large (150 million) or medium (100 million) parameter models, and I think the t3.medium on AWS can only run the 50-million-parameter small model.
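A rough rule of thumb for whether a model fits (my own back-of-envelope, not from the video): the weights take roughly parameter count × bytes per parameter, plus some overhead for the KV cache and runtime buffers. The g4dn.xlarge's T4 has 16 GB of VRAM, so for example:

```python
def estimate_model_memory_gb(params_billions, bytes_per_param=2.0, overhead=1.2):
    """Back-of-envelope VRAM estimate: weights * precision width * overhead.

    bytes_per_param: 4.0 for fp32, 2.0 for fp16, ~0.5-1.0 for 4-8 bit quants.
    overhead: rough multiplier for KV cache and runtime buffers (assumption).
    """
    return params_billions * bytes_per_param * overhead

print(estimate_model_memory_gb(8))        # ~19 GB in fp16 -> too tight for a 16 GB T4
print(estimate_model_memory_gb(8, 0.5))   # ~5 GB at 4-bit -> fits comfortably
```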
Nice explanation!
Thank you!