Run Any Local LLM Faster Than Ollama - Here's How

  • Published: Jan 15, 2025

Comments • 51

  • @rikmoran3963 · 2 months ago · +2

    I've not tried this yet, but really well put together video. Thanks!

  • @johnbox5540 · 2 months ago · +5

    please do more content. loved the channel. subscribed

  • @brinkoo7 · 2 months ago

    Great video, llamafile is really interesting, it adds a lot of flexibility for deployment options. Can't wait to start seeing the various ways people are leveraging this.

  • @IkechiGriffith · 2 months ago · +1

    Thanks for this man. Good stuff

  • @andyzheng7377 · 2 months ago · +2

    🎉 Thanks for sharing!

  • @ashgtd · 2 months ago

    Amazing, I had no idea about this. Thank you! I'll check it out tonight. How's the Jar3d project going?

    • @Data-Centric · 2 months ago

      Thanks! I've been super busy with client work, so I had to put Jar3d on the back burner. I still intend to add more features when I get the chance.

  • @SheeceGardazi · 2 months ago

    thanks for the share!

  • @nedkelly3610 · 2 months ago · +1

    Are you planning on updating Jar3d any time soon? I would like it to work locally on a GPU with Ollama; I don't want to spend time modifying it if you have already done that and more.

    • @Data-Centric · 2 months ago

      Soon, but feel free to submit a PR. I have some commitments I need to prioritise ahead of Jar3d right now.

    • @nedkelly3610 · 2 months ago

      Yes, I understand. I will try to tidy up my mods and put them into a PR if I get it working OK.

  • @SejalDatta-l9u · 2 months ago

    John, this is great!
    Could you share a quick vid on integrating this into apps to replace Ollama via an API? Also, a vid on how we can use a GPU with this method would be great. Thanks - and keep it up!

    • @Data-Centric · 2 months ago · +2

      I think it's possible to run llamafile inference on GPU. I'll look into doing something showing how you can integrate it into your apps (see the sketch after this thread).

    • @SejalDatta-l9u · 2 months ago

      @Data-Centric thank you. You are easily one of my favourite content creators. Fact based, you talk about the good and bad and test relatable, practical use cases. Don't change.

    • @ashgtd · 2 months ago

      I'm excited to see how this works. I have only 4 GB of VRAM but 64 GB of RAM, so I'd love to see if I can run bigger models at less than a snail's pace.
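
On integrating llamafile into apps, as mentioned in one of the replies above: llamafile ships with a built-in server that exposes an OpenAI-compatible API. A minimal Python sketch, assuming the server is running on its default port 8080 and the openai package is installed; the model name is a placeholder that the local server ignores.

    from openai import OpenAI

    # Point the standard OpenAI client at llamafile's local server
    # (started by running the .llamafile executable; default port 8080).
    client = OpenAI(
        base_url="http://localhost:8080/v1",
        api_key="sk-no-key-required",  # the local server does not check the key
    )

    response = client.chat.completions.create(
        model="local-model",  # placeholder; the server uses whatever model it loaded
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarise what a llamafile is in one sentence."},
        ],
    )
    print(response.choices[0].message.content)

Because the endpoint shape matches OpenAI's, swapping this in for Ollama (or a hosted API) in an existing app is mostly a matter of changing the base URL.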

  • @d3mist0clesgee12 · 2 months ago

    Nice, thanks bro

  • @bvinodmca · 12 days ago

    @Data-Centric Hi, I need your advice on the following:
    We want to build workflow automation using a multi-agent framework, for example an insurance-claim workflow with agents such as Raise New Claim, Validate Policy, Validate Customer, Determine Payout, Approve, and Deny. The individual agents will be implemented in our own BPMN workflow and exposed as APIs, and we need a multi-agent framework to orchestrate them by calling those APIs as tools. Which framework is the best fit (LangGraph, CrewAI, AutoGen)? We are looking for a hybrid approach: individual agents like 'Raise New Claim' will live behind our own APIs, while a supervisor agent on one of these frameworks orchestrates them. Please advise.
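
On the question above: whichever framework is chosen, the "agents exposed as APIs, orchestrated as tools by a supervisor" shape can be sketched without committing to LangGraph, CrewAI, or AutoGen. A rough, framework-agnostic Python sketch; the endpoints, payloads, and status field are hypothetical placeholders, not a recommendation of any particular framework.

    import requests  # assumes the requests package is installed

    # Hypothetical endpoints for agents exposed as APIs by a BPMN workflow;
    # replace with your real services.
    AGENT_ENDPOINTS = {
        "raise_new_claim":   "https://example.internal/agents/raise-new-claim",
        "validate_policy":   "https://example.internal/agents/validate-policy",
        "validate_customer": "https://example.internal/agents/validate-customer",
        "determine_payout":  "https://example.internal/agents/determine-payout",
    }

    def call_agent(name: str, payload: dict) -> dict:
        """Invoke one API-backed agent as a 'tool' and return its JSON result."""
        response = requests.post(AGENT_ENDPOINTS[name], json=payload, timeout=30)
        response.raise_for_status()
        return response.json()

    def supervise_claim(claim: dict) -> dict:
        """A naive supervisor: run the claim through each agent in order and
        stop early if any step rejects it. A framework (LangGraph, CrewAI,
        AutoGen) would replace this hand-written routing with a graph- or
        LLM-driven orchestrator, but the tool-calling shape stays the same."""
        state = dict(claim)
        for step in ["raise_new_claim", "validate_policy",
                     "validate_customer", "determine_payout"]:
            result = call_agent(step, state)
            state.update(result)
            if result.get("status") == "deny":  # hypothetical status convention
                break
        return state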

  • @and1play5 · 2 months ago

    Ohhh yesss thank u

  • @vertigoz · 2 months ago

    I cannot believe Ollama would be even slower than that.

    • @Data-Centric · 2 months ago · +1

      Don't take my word for it. Read the blog post, and try the approach for yourself. Ollama is even slower than that on my machine.

    • @vertigoz · 2 months ago

      @Data-Centric It would be a nice test to do, then. I only use LLMs that fit on my GPU though, otherwise it is too slow... Sadly I only have 8 GB.

  • @ChrisSteurer · 2 months ago

    How is this different from using Hugging Face models with Ollama? I see nothing in this video that makes anything faster.

    • @tuna1867 · 2 months ago

      Ollama is still faster if you have a GPU (dedicated or in an SoC);
      llamafile is faster if you only have a CPU.

    • @Data-Centric · 2 months ago · +1

      I suggest you read the paper I posted in the description.

  • @joesmoo9254 · 2 months ago

  • @existentialbaby · 2 months ago

    thank you

  • @Talaria.School · 2 months ago

    👍🏼

  • @IJH-Music · 2 months ago

    I have an i5 @ 3.3 GHz (4 cores); I think I can reach 4.2 GHz overclocked.
    And an 8 GB AMD R9 200-series GPU.
    Is it possible to run Ollama and train my own LLMs?
    Everywhere seems to recommend a minimum of 16 GB, so I haven't spent the time.

    • @Data-Centric · 2 months ago

      I think you might want to consider renting GPUs or using an existing platform to train LLMs (assuming you are referring to fine-tuning when you say training).

    • @fotisj321 · 2 months ago

      I haven't tested Ollama with 8 GB. In general, for training you face two challenges: you have an AMD card, not one from Nvidia, and support for training on AMD cards (ROCm) is only getting started and still seems problematic. Also, 8 GB is really small and doesn't work for most language models with slightly larger parameter counts (see the "Can you run it? LLM version" page on Hugging Face). Inference is a different matter, especially if you have lots of RAM; Ollama/llama.cpp can make use of both.
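
To put the reply above in rough numbers: memory for the weights is roughly parameter count times bits per weight divided by 8, plus some headroom for the KV cache and runtime buffers. A back-of-envelope Python sketch; the 20% overhead figure is an assumption, not a measurement.

    def estimated_memory_gb(params_billions: float, bits_per_weight: float,
                            overhead: float = 1.2) -> float:
        """Rough memory needed to hold the weights, with ~20% headroom for
        the KV cache and runtime buffers. Real usage varies with context
        length, quantisation scheme, and runtime."""
        weight_bytes = params_billions * 1e9 * bits_per_weight / 8
        return weight_bytes * overhead / 1e9

    # An 8B model needs ~16 GB for fp16 weights alone, but a 4-bit GGUF build
    # (Q4_0 works out to about 4.5 bits/weight once block scales are included)
    # fits comfortably in system RAM, which is why quantised models are the
    # usual choice for CPU inference.
    print(f"8B @ fp16 : {estimated_memory_gb(8, 16):.1f} GB")   # ~19 GB with headroom
    print(f"8B @ 4-bit: {estimated_memory_gb(8, 4.5):.1f} GB")  # ~5.4 GB with headroom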

  • @themax2go · 2 months ago

    I'll try that out on my AMD machine with 100 GB of RAM; hopefully running the larger 20 GB+ models will give this a perf boost.

  • @BenjaminK123 · 2 months ago · +1

    Are people really using CPU for inference?

    • @ZacMagee · 2 months ago · +2

      Seems like it

    • @serikazero128 · 2 months ago · +3

      You can also use the CPU for inference with Ollama. What's more, you can easily see the tokens/s generated by Ollama if you add --verbose to the end of the command you use to run an LLM (see the sketch after this thread).

    • @BenjaminK123 · 2 months ago · +1

      @serikazero128 I can't find anywhere how to run it on the CPU. I know in LLMStudio you can switch between CPU and GPU, and I did a little test: the CPU was so much slower, like mega slow.

    • @serikazero128 · 2 months ago

      @BenjaminK123 Ollama automatically detects your system, so if, say, you only have a CPU, it runs on that.
      Take my laptop, for example: it's a 4-year-old laptop with a 10th-gen Intel i7 CPU. It generates around 3 to 5 tokens/s with the Llama 3.1 8B model.
      To give you a speed perspective, an RTX 4090 will generate around 70 to 80 tokens/s.
      It depends on the model you use (the larger the model, the slower it runs), on your memory bandwidth (since you will be using RAM for the AI calculations), and on the CPU's ability to process data.
      To give another perspective, last month I tested the new Intel Lunar Lake 258V (or something like that) in a shop, and it was scoring around 8 to 10 tokens/s on the same question,
      while the AMD variant from Asus was scoring 10-12 tokens/s.
      I ordered the Intel variant in the end because I went with the 14-inch laptop over the 16-inch one.

    • @timothywcrane · 2 months ago · +1

      I do it all the time. I have a measly 1050 Ti and I usually opt not to use it for offloading. I am looking for answers and do not care if it is a "chat". I think of it like "a person at work": I might not get an answer right away, but I want the right one, and quantizing to get it onto the GPU is not ALWAYS the right choice. You can get a few older PCs and 64 GB of RAM for less than a newer GPU. This also opens up VPS usage beyond GPU providers, or hybrid systems where you metric the hell out of it to know when to rent a GPU for high-priority inferences. Llamafiles on Android in GGUF work, but all the models I tried have been blubbering idiots because of their small size; for boolean function choosing, though, they are priceless. It replaced a PC with OVOS and put it in my hand.
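
As noted in the replies above, Ollama runs on the CPU and reports generation speed when you add --verbose to an "ollama run" command. You can also compute tokens/s yourself from Ollama's REST API, whose final (non-streamed) response includes eval_count (generated tokens) and eval_duration (nanoseconds). A minimal Python sketch, assuming Ollama is serving on its default port 11434 and the model name below is one you have already pulled.

    import requests  # assumes the requests package is installed

    # Ask the local Ollama server for a completion and read the timing
    # fields from its final (non-streamed) response.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",  # placeholder: any model you have pulled
            "prompt": "Explain what a llamafile is in two sentences.",
            "stream": False,
        },
        timeout=600,
    )
    data = resp.json()

    # eval_count = generated tokens, eval_duration = generation time in nanoseconds.
    tokens_per_second = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"{data['eval_count']} tokens in {data['eval_duration'] / 1e9:.1f}s "
          f"-> {tokens_per_second:.1f} tokens/s")

Running the same prompt against llamafile's server and against Ollama on the same hardware gives a like-for-like comparison of the two runtimes.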

  • @fabriziocasula · 2 months ago · +3

    It's 10 times slower than Ollama :-)

    • @TheWallReports · 2 months ago

      I agree. I run Ollama on my MacBook M1 Max and it’s faster than ChatGPT.

    • @foxusmusicus2929 · 2 months ago · +2

      Well he compares it to Ollama on CPU.

    • @Eldorado66 · 2 months ago

      @TheWallReports It's also 10x more inaccurate

    • @TheWallReports · 2 months ago

      Not really. When using the LLM within the limits of its training and fine-tuning, its accuracy is comparable. Also, ChatGPT is an MoE (mixture of experts) based language model, meaning it's multiple LLMs and LVMs working together, compared to the single LLM most users run locally. There are ways to expand the capabilities and accuracy of local LLMs, such as building an MoA (mixture of agents), but that's another topic.