Great video, llamafile is really interesting, it adds a lot of flexibility for deployment options. Can't wait to start seeing the various ways people are leveraging this.
Thanks! I've been super busy with client work so had to put Jar3d on the back burner. I still intend to include additional features though when I get the chance.
Are you planning on updating Jar3d any time soon? I would like it to work locally with gpu with ollama, i don't want to spend time modifying it if you have already done it and more.
John, this is great! Could you share a quick vid on integrating this into apps to replace ollama via API. Also a vid on how we can include a GPU via this method would be great. Thanks - and keep it up!
@Data-Centric thank you. You are easily one of my favourite content creators. Fact based, you talk about the good and bad and test relatable, practical use cases. Don't change.
@Data-Centric Hi, I need your suggestion for below: Want to build a workflow automation using Multi-Agent Framework. For Example Insurance claim workflow which has Agents (Raise New claim, validate policy, validate customer, determine payout, approve, deny). Whereas we have to implement these individual agents in our own BPMN workflow which will be exposed as APIs and We need a best multi-Agent Framework to orchestrate these Agents (by calling these agents via API as tools). Which is best-fit multi-Agent framework (LangGraph,CrewAI,AutoGen)? We are looking for hybrid approach (Individual Agents like 'Raise New Claim' implementation will be in our own APIs and Supervisor Agent will be on one of these Framework to orchestrate these Agents. Please advice.
I have an i5 @3.3GHz (4cores). I think I can reach 4.2Ghz overclocked. And an 8gb AMD R9 200series GPU. Is it possible to run ollama & train my own LLMs? Everywhere seems to recommend a min of 16gb, so I haven't spent the time.
I think you might want to consider renting GPUs or using an existing platform to train LLMs (assuming you are referring to fine-tuning when you say training).
I haven't tested ollama with 8 gb. In general for training you are facing two challenges: you have an AMD card not one from Nvidia. Support for training on AMD cards (ROC) is only starting and seems to be problematic yet. 8 GB is really small and doesn't work for most language models with slightly larger parameter counts (see the page "Can you run it? LLM version" on Huggingface). Inference is a different thing, especially if you have lots of RAM. Ollama /llama.cpp is able to make use of both.
you can also use cpu for inferance with ollama. And what's more, you can easily see the tokens/s generated by ollama if you add --verbose at the end of your command to run a LLM
@@serikazero128 i cant find anywhere where to run it in CPU, i know in LLMStudio you can switch it to use cpu or gpu and i did do a little test and CPU was so much more slower like mega slow
@@BenjaminK123 ollama automatically detects your system, so if lets say you have only a CPU, it runs on that. take my laptop for example, its a 4 years old laptop that has a gen10 intel i7 CPU. It generates around 3 to 5 tokens/s with Llama 3.1 8b model To give you a speed perspective, an RTX 4090 will generate at around 70 to 80 token/s. It depends on the model you use. The larger the model, the more slow it will run. Your memory bandwidth since you will be using RAM for the AI calculations, and the CPU's capability to process data. To give another perspective, last month I tested the new intel lunar lake 258v or something like that in shop. And that was scoring around 8 to 10 tokens/s on the same question. While the AMD variant from asus was scoring 10-12 tokens/s I ordered the intel variant in the end because I went with the 14inch laptop over the 16 inch one
I do it all the time. I have a measly 1050ti and I usually opt to not use it for offloading. I am looking for answers and do not care if it is a "chat". I think of it like "a person at work"... I might not get an answer right away, but I want the right one, and quantizing to get it on the GPU is not ALWAYS the right choice. You can get a few older PCs and 64 Gigs of RAM for less than a newer GPU. This also opens VPS usage beyond GPU providers... or hybrid systems when you metric the hell out of it to know when to rent a GPU for high priority inferences... Lamafiles on Android in GGUF works, but all the models I tried have been blubbering idiots because of their small size... but for boolean function choosing they are priceless. It replaced a PC with OVOS and put it in my hand.
@ Not really. When using the LLM within the limits of its training & fine-tuning, its accuracy is comparable. Also ChatGPT is a MoE (mixture of experts) based language model meaning it’s multiple LLMs & LVMs working together compared to a single LLM most users run locally. Now there are ways to expand the capabilities & accuracy of local LLMs such as building an MoAs (mixture of agents) but that’s another topic.
I've not tried this yet, but really well put together video. Thanks!
please do more content. loved the channel. subscribed
Great video, llamafile is really interesting, it adds a lot of flexibility for deployment options. Can't wait to start seeing the various ways people are leveraging this.
Thanks for this man. Good stuff
🎉 Thanks for sharing!
amazing. I had no idea about this. thank you! I'll check it out tonight. How's the Jer3d project going?
Thanks! I've been super busy with client work so had to put Jar3d on the back burner. I still intend to include additional features though when I get the chance.
thanks for the share!
Are you planning on updating Jar3d any time soon? I would like it to work locally with gpu with ollama, i don't want to spend time modifying it if you have already done it and more.
Soon, but feel free to submit a PR. I have some commitments I need to prioritise ahead of Jar3d right now.
@ Yes, I understand, I will try to tidy up my mods and put it into a PR if I get it working ok.
John, this is great!
Could you share a quick vid on integrating this into apps to replace ollama via API. Also a vid on how we can include a GPU via this method would be great. Thanks - and keep it up!
I think it's possible to run llamafile inference on GPU. I'll look into doing something showing how you can integrate it into your apps.
@Data-Centric thank you. You are easily one of my favourite content creators. Fact based, you talk about the good and bad and test relatable, practical use cases. Don't change.
I'm excited to see how this works, I have only 4g vram but I got 64g of RAM so I'd love to see if I can run bigger models at less then a snails pace
Nice, thanks bro
@Data-Centric Hi, I need your suggestion for below:
Want to build a workflow automation using Multi-Agent Framework. For Example Insurance claim workflow which has Agents (Raise New claim, validate policy, validate customer, determine payout, approve, deny). Whereas we have to implement these individual agents in our own BPMN workflow which will be exposed as APIs and We need a best multi-Agent Framework to orchestrate these Agents (by calling these agents via API as tools). Which is best-fit multi-Agent framework (LangGraph,CrewAI,AutoGen)? We are looking for hybrid approach (Individual Agents like 'Raise New Claim' implementation will be in our own APIs and Supervisor Agent will be on one of these Framework to orchestrate these Agents. Please advice.
Ohhh yesss thank u
I cannot believe ollama eould be even slower than that
Don't take my word for it. Read the blog post, and try the approach for yourself. Ollama is even slower than that on my machine.
@Data-Centric it would be a nice test to do then. I only use llms that fit on my GPU though, otherwise it is too slow... Sadly I only have 8gb
How is this different than using hugging face models on ollama? I see nothing in this video where this makes anything faster
ollama is still faster if you have GPU (dedicated or in SoC)
llamafile is faster if you only have CPU
I suggest you read the paper I posted in the description.
❤
thank yu
👍🏼
I have an i5 @3.3GHz (4cores). I think I can reach 4.2Ghz overclocked.
And an 8gb AMD R9 200series GPU.
Is it possible to run ollama & train my own LLMs?
Everywhere seems to recommend a min of 16gb, so I haven't spent the time.
I think you might want to consider renting GPUs or using an existing platform to train LLMs (assuming you are referring to fine-tuning when you say training).
I haven't tested ollama with 8 gb. In general for training you are facing two challenges: you have an AMD card not one from Nvidia. Support for training on AMD cards (ROC) is only starting and seems to be problematic yet. 8 GB is really small and doesn't work for most language models with slightly larger parameter counts (see the page "Can you run it? LLM version" on Huggingface). Inference is a different thing, especially if you have lots of RAM. Ollama /llama.cpp is able to make use of both.
i'll try that out on my amd 100gb ram, hopefully running the larger 20gb+ will give this a perf boost
How did it go?
Are people really using CPU for inference?
Seems like it
you can also use cpu for inferance with ollama. And what's more, you can easily see the tokens/s generated by ollama if you add --verbose at the end of your command to run a LLM
@@serikazero128 i cant find anywhere where to run it in CPU, i know in LLMStudio you can switch it to use cpu or gpu and i did do a little test and CPU was so much more slower like mega slow
@@BenjaminK123 ollama automatically detects your system, so if lets say you have only a CPU, it runs on that.
take my laptop for example, its a 4 years old laptop that has a gen10 intel i7 CPU. It generates around 3 to 5 tokens/s with Llama 3.1 8b model
To give you a speed perspective, an RTX 4090 will generate at around 70 to 80 token/s.
It depends on the model you use. The larger the model, the more slow it will run. Your memory bandwidth since you will be using RAM for the AI calculations, and the CPU's capability to process data.
To give another perspective, last month I tested the new intel lunar lake 258v or something like that in shop. And that was scoring around 8 to 10 tokens/s on the same question.
While the AMD variant from asus was scoring 10-12 tokens/s
I ordered the intel variant in the end because I went with the 14inch laptop over the 16 inch one
I do it all the time. I have a measly 1050ti and I usually opt to not use it for offloading. I am looking for answers and do not care if it is a "chat". I think of it like "a person at work"... I might not get an answer right away, but I want the right one, and quantizing to get it on the GPU is not ALWAYS the right choice. You can get a few older PCs and 64 Gigs of RAM for less than a newer GPU. This also opens VPS usage beyond GPU providers... or hybrid systems when you metric the hell out of it to know when to rent a GPU for high priority inferences... Lamafiles on Android in GGUF works, but all the models I tried have been blubbering idiots because of their small size... but for boolean function choosing they are priceless. It replaced a PC with OVOS and put it in my hand.
is slower than ollama 10 times :-)
I agree. I run Ollama on my MacBook M1 Max and it’s faster than ChatGPT.
Well he compares it to Ollama on CPU.
@@TheWallReports It's also 10x more inaccurate
@ Not really. When using the LLM within the limits of its training & fine-tuning, its accuracy is comparable. Also ChatGPT is a MoE (mixture of experts) based language model meaning it’s multiple LLMs & LVMs working together compared to a single LLM most users run locally. Now there are ways to expand the capabilities & accuracy of local LLMs such as building an MoAs (mixture of agents) but that’s another topic.