I'm trying to turn my RPi 5 into a local virtual assistant that only communicates data from PDFs, with low latency. I've installed a 1TB Samsung 980 PRO PCIe 4.0 NVMe M.2 SSD in it, hoping it will help with all the PDF data as well as whatever LLM I decide to install. But I'm in a rut: I'm not familiar with RAG, I'm not too great at coding, and I'm not even sure the RPi 5 can handle all this (I've been considering the Jetson Orin Nano Developer Kit as an alternative) 😮💨... Could you please offer your wise counsel?
Hi, thank you for sharing this, sounds awesome!
Your SSD will handle things just fine, but as you can see in my video, performance on a Raspberry Pi is unfortunately not going to be low latency. Typical speed on a Raspberry Pi (depending on the board model, memory, and LLM being used) tends to be around 1 token per second. Meaning, it's going to be slowwww, and there aren't many ways around it. So it's okay for use cases where the response does not need to be immediate, but it's pretty far from low latency. Mostly great for experimentation, I would say.
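To put that 1 token per second in perspective, here's a rough back-of-the-envelope sketch. The function name and the answer lengths (100 tokens for a full answer, 20 tokens for a first sentence) are just illustrative assumptions, not measurements from the video.

```python
def estimated_response_seconds(answer_tokens: int, tokens_per_second: float) -> float:
    """Rough time to generate an answer, ignoring prompt processing and model load time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return answer_tokens / tokens_per_second


# Assumed numbers: a 100-token answer and a 20-token first sentence,
# at the roughly 1 token per second mentioned above.
full_answer = estimated_response_seconds(100, 1.0)
first_sentence = estimated_response_seconds(20, 1.0)

print(f"Full answer: about {full_answer:.0f} s")              # Full answer: about 100 s
print(f"First sentence ready: about {first_sentence:.0f} s")  # First sentence ready: about 20 s
```

So even if you only wait for the first sentence, you're looking at tens of seconds before the assistant can start talking.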
I've been toying with virtual assistants that I actually use myself, and for this, a Raspberry Pi won't cut it. You want:
- A heavy machine with a lot of oomph, definitely with a good graphics card and working CUDA drivers. The more the better: most models offered as a service run on CRAZY hardware. But my personal gaming machine does okay.
- A coding approach centered around streaming, aka: give me the tokens immediately as they are generated, don't wait for the full answer. You have to play a bit with granularity. I think a good starting point would be to grab the tokens and send them to the speech interface once you have full sentences. Otherwise the intonation will be far off.
- The fastest, real-time, GPU-accelerated versions of every part, so use a very low-latency text-to-speech solution, preferably GPU-accelerated, along with the model.
- Unfortunately, offline models you can run locally are somewhat slower and stupider than the big models you use via API. But they make up for that by being more secure, especially with RAG, or at least by letting you control the security. They are also potentially more cost-effective, depending on how you calculate costs.
So, that's just general advice, but from the slowness demonstrated in this video you can see that a Raspberry Pi is not good for the low-latency case. I built my own virtual assistant on top of a local model, running it on my gaming beast, and it runs with acceptable latency, aka some seconds most of the time. To get natural dialogue going, you actually want it faster, and that requires heavier hardware.
But it's all good for research and experimentation! Latency and speed are an optimization question once you know what you want to be building.
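Here's a minimal sketch of the streaming idea from the list above: collect tokens as they arrive and hand text to the speech side only once a full sentence is ready. Everything named here is hypothetical: `sentences_from_tokens` is an illustrative helper, the token list stands in for whatever streaming output your model library gives you, and `speak` is a placeholder for a real text-to-speech call. The sentence detection is deliberately naive (it will split on abbreviations like "e.g."), so treat it as a starting point for playing with granularity.

```python
import re
from typing import Iterable, Iterator

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()


spoken = []


def speak(sentence: str) -> None:
    """Placeholder for a real text-to-speech call (assumption)."""
    spoken.append(sentence)


# Stand-in for tokens streamed from a local model (assumption).
fake_stream = ["The", " PDF", " says", " yes", ".", " Anything", " else", "?"]
for sentence in sentences_from_tokens(fake_stream):
    speak(sentence)

print(spoken)  # ['The PDF says yes.', 'Anything else?']
```

Note that a sentence is only released once the next token confirms it has ended (or the stream finishes), so the speech side never gets half a sentence.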
@DevXplaining Would you recommend Nvidia's Jetson Nano then? And thanks, by the way, I appreciate the detailed response.
Thanks for the video!!! Will try it on my Raspberry Pi 5 with 8GB of RAM!!!
Perfect! It's gonna be slowwww... But fully local too :)
Thanks for this and for your "How to run ChatGPT in your own laptop for free? | GPT4All" video showing the practicalities of running a language model. I think an important aspect and benefit of a local model is being able to train it. Please cover this or point us to something that does. Being able to have it read PDFs to learn would be great.