Hey everyone! Thanks for watching and asking for the tutorial! I've just posted it on my new channel! Enjoy!
ruclips.net/video/yoze1IxdBdM/видео.html
There are two features that I am especially looking forward to:
a) Video text search: I have security cameras using Frigate NVR, which uses AI image recognition to trigger if a person enters an area, plus an audio AI model that listens for a fire alarm or breaking glass. They are working on implementing text search for video clips, so you could search for clips of a guy in a red jacket.
b) Local audio transcription: I tested Whisper large models for transcribing non-English call recordings and it works, but it is sloooow. I ran out of time on Google Colab. I saw that there is an optimized Whisper version that I can run locally on a Google Coral without a GPU, so I still need to test that one out. I would love to be able to search my calls.
You should try faster-whisper and insanely-fast-whisper. I got good performance in Portuguese even with smaller models, or you can try a fine-tuned model for your language; there are people on Hugging Face who have already done that for various languages.
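For anyone who wants to try it, here's a minimal sketch of the usual faster-whisper flow; the model size, file name, and language code are just placeholders:

```python
from faster_whisper import WhisperModel

# CPU-friendly settings; use device="cuda", compute_type="float16" if you have a GPU
model = WhisperModel("small", device="cpu", compute_type="int8")

# language is optional; leaving it out lets the model auto-detect
segments, info = model.transcribe("call_recording.mp3", language="pt", beam_size=5)

for segment in segments:
    print(f"[{segment.start:6.2f}s -> {segment.end:6.2f}s] {segment.text}")
```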
Great video as always :) Would love to see a video about the hardware setup & requirements and some guidelines for which models to choose for different hardware configs
9:51 nah, I work for a company where we've been doing this since before "AI" was mainstream and the e2e models have not only helped accuracy but improved performance, even with our CPU workloads. It's been incredible to be working on this and seeing the sudden rapid development.
One of the most useful tech videos of this year, unlike some other channels that post so many videos of which 95% are useless.
I'm all about self-hosting these technologies. Ever since DALL-E hit the scenes, I've been thinking that artists should train a model on their own art so if they get creatively stuck, they can, "ask themselves" for inspiration.
That's an awesome idea. I really wish I knew more about training. Maybe soon!
@@TechnoTim If you can train a dog... actually that's nothing like training an A.I.
On top of that one of my dogs still bites me!
I've played with Ollama, the open-webui, a different open-webui, and Automatic1111.
One of the models ended up needing about 40 GB of VRAM, so I had to use two 3090s to be able to have enough VRAM for the model.
Pretty nifty though.
Not perfect, but still fun to play with.
Are you going to release any how-tos for this? Preferably with you explaining what each step does rather than just going down a list of steps.
Yes, coming soon on my Techno Tim Tinkers channel! Subscribe there to know when it's available!
@@TechnoTim I'm surprised it will end up on Tinkers, given these videos would seem to hit your core main channel. Interesting.
If not for reading this comment I would never have known about Tinkers.
@@TechnoTim Any chance that you might post your AI rig's hardware composition in this description before finishing up with the more detailed video on the other channel?
I can understand that but this tutorial might be close to 40 minutes long (or longer) 😅. Videos that long do not perform well and ultimately hurt the channel.
Man this looks interesting, you gotta show us how you set this all up
I second this
I approve this
glasses off? it's about to get serious!
Yeah. Nice try Tim AI
There is a HACS version of the Ollama integration where you can already control your devices with it in Home Assistant.
well done, local LLMs are the future
1:45 - Third option: Let Surfshark snoop on you.
VPN providers are no more trustworthy than your mobile ISP. VPNs are for getting around region blocks, NOT for privacy.
They both have data logging, selling, sharing, and trading policies... the ISP's policy is to do it; a VPN like this one's policy is not to.
VPNs are for creating encrypted tunnels for sensitive data. Not all privacy revolves around torrenting and hiding IPs.
That Dances with Wolves bit earned my thumbs up.
You can also set up Piper as a server and just feed it text via curl (local or remote). Then it generates audio files super quick. It can also be piped to stdout, IIRC, if you don't need the files.
Thank you! I will look into how to connect this to HASS!
@@TechnoTim I think the problem was that there is just a ton of overhead every time you run the executable, so by keeping a server running, the .exe is "running" all the time.
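For reference, once a Piper server is running, feeding it text is just an HTTP request. This is only a sketch under the assumption that your server accepts plain text on localhost:5000 and returns WAV audio; the exact port and endpoint depend on how you launched it:

```python
import requests

# Hypothetical local Piper HTTP server; adjust the URL to however you started yours
PIPER_URL = "http://localhost:5000"

text = "The garage door has been open for ten minutes."
resp = requests.post(PIPER_URL, data=text.encode("utf-8"), timeout=30)
resp.raise_for_status()

# The server returns WAV audio; save it (or pipe it straight to a player instead)
with open("announcement.wav", "wb") as f:
    f.write(resp.content)
```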
I've watched a few videos on people setting up AI like this, but this just has the perfect blend of information AND instruction. Your 230K followers should be more like 2.30M. Thanks for sharing so much good stuff!
Thank you so much! If you can believe it, it's actually more difficult to say less. I had to constantly remind myself to not ramble or go on side quests 😅. Thanks for noticing and a full tutorial will be coming soon on my other channel, @technotimtinkers
@@TechnoTim I get that! I used to be an educator and it's hard not to tell everyone you meet all of the facts you know, especially when it's stuff that excites you. For the record I would happily listen to all of the side quests haha. And how did I not know about your other channel??? HERE I GO
@@andrewbennett5733 Sometimes side quests are more fun than the main quest!
I need you to go the @JeffGeerling route and start a third channel for side quests 🤣
That's what Techno Tim Tinkers is for ;)
This is 12 minutes of pure gold, thank you very much. 😊
Awesome video. Would love to see a follow up video where you go over the hardware for inferencing these models. And what kind of performance changes you noticed when playing around with different components
This seems to be covered in many other places, and it's almost entirely subject to the models you run. Hard to generalize such a thing. Google for Llama.cpp benchmarks and INT8 performance for GPUs.
Really cool video TIm! I've been wanting to play with some image to image "AI" stuff, but it's been hard to find much about it when self hosting is involved. I'll be poking around with the tools you mentioned to see if I can find something.
This teaser was nice, where is the setup video? :D
Very cool idea, the private search AI!
Thank you, great video. I wish you would run through what hardware you run this on.
Thanks for the feedback. I have a video on it, it's my new All in One HomeLab server. More to come!
I wonder how well these tools work in an offline or no-internet VLAN. Most still tend to connect to third party domains/servers, and we have no clue what data is being sent when it does. I'm not ready to trust these yet. Would make a good video to showcase the endpoints they do try and connect to.
I have pretty much been running local AI since the onset of all the open-source models and have run plenty of backends. Now I'm on Ollama and plan to stick with it, as it's the fastest backend I have run out of all of them, and on Linux it's so easy to run the models on AMD or Nvidia. I run 7B-13B models on my little ol' RX 6600 XT with ROCm and, tbh, it runs great. IMO, running locally, 7B-13B is about all anyone needs; just keep specialty models at the ready for different tasks, which Ollama makes easy af haha. The best feature to me is having Ollama set up to auto-unload models when not in use.
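If anyone wants the auto-unload behavior mentioned above: Ollama's generate endpoint takes a keep_alive value that controls how long the model stays in VRAM after a request (there is also an OLLAMA_KEEP_ALIVE server setting for the default). A minimal sketch, assuming a default install on localhost:11434 and a model you've already pulled:

```python
import requests

OLLAMA = "http://localhost:11434"

# keep_alive controls how long the model stays loaded after this request:
# "5m" unloads after five idle minutes, 0 unloads immediately, -1 keeps it loaded.
resp = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "llama3:8b",          # any model you have pulled locally
    "prompt": "One-line summary of ROCm, please.",
    "stream": False,
    "keep_alive": "5m",
})
print(resp.json()["response"])
```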
Been waiting for this one. Let's go!
You’ve just given me so many ideas. This is awesome.
What are the gpu requirements for all this? Are we talking a recent-enough gaming gpu like a 3060, or do you have to shell out for those enterprise cards with no video output?
3060 should work fine. Smaller models should fit fine!
@@TechnoTim good to hear!
Love the vid. Please also try to include a note about helping these free models, either via training or donations, to accelerate their further development.
Hi! Nice video!
Can you dive a bit deeper into how to set it up, what the drawbacks are, and the hardware requirements (CPU/GPU/disk space/...)? The positive things as well, but those are covered a lot here already.
Thanks!
There’s a link in the description and pinned comment for the full tutorial
Super awesome video - unique cutting edge I can't wait to give it a go
We need info on the hardware setup! Like are Nvidia GPUs the only option or can we use NPUs in the newer Intel processors?
NPU performance is going to be bound by memory bandwidth, and DDR5 isn't where you want to be.
Soldered LPDDR5X is going to have much better memory bandwidth, and that's when these chips will start to get some reasonable performance.
Lunar Lake and Zen 5 should both come in this configuration at some point.
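A rough way to see why bandwidth is the limiter: generating each token streams (roughly) the whole quantized model through memory, so bandwidth divided by model size gives a ceiling on tokens per second. The numbers below are illustrative guesses, not measurements:

```python
def tokens_per_second_ceiling(model_size_gb: float, bandwidth_gb_s: float) -> float:
    # Each generated token reads roughly all weights once, so sustained
    # memory bandwidth divided by model size bounds the generation rate.
    return bandwidth_gb_s / model_size_gb

MODEL_7B_Q4_GB = 4.0  # ~7B parameters at 4-bit quantization (illustrative)

for name, bandwidth in [("dual-channel DDR5", 90), ("soldered LPDDR5X", 135), ("RTX 3060", 360)]:
    ceiling = tokens_per_second_ceiling(MODEL_7B_Q4_GB, bandwidth)
    print(f"{name:20s} ~{ceiling:5.0f} tok/s ceiling")
```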
I was so ready for you to do a video on this.
Great video, thank you so much for the info. I am completely new to this space (boomer status front and center), but I am going to try to go all in on a self-hosting setup and try to have fun and learn, including taking up some Python to enhance my experience. Keep up the great work.
@@skelious thank you! It’s never too late!
What kind of GPU are you using? I have a Dell R730 and I wanted to try putting a GPU in it and running Ollama. I really wish there was a low-power AI processor that we could plug into any device with sufficient RAM to run models effectively and efficiently at a relatively affordable cost.
For anyone who's using Ollama, what's the minimum hardware needed to run a 70b model?
I would say an RTX 4090, but expect a poor performance experience. For a GPT-like experience you will need something like 4x RTX 4090. But then you could deploy Mixtral 8x7B, which is a GPT-4-class LLM with good performance and a good context window.
I'd say two 4090s, or a 4090 plus another Nvidia card like a 4060 or 3060. You will need about 40 GB of VRAM for decent quantization, but if you are willing to give up response quality, go for about 30-ish. Just steer clear of 2-bit quantization; 3-bit is OK, with 4-bit being the standard. 8-bit is about the same as the full float16 model but needs huge amounts of VRAM. Anyway, more VRAM/CUDA = better.
Phi-3 14B 128K is really good, and I've heard good things about Gemma 2 27B. Though overall I'm still a fan of Llama 3.
It varies since you can adjust the quantization for fit.
For the big models (70B) I would suggest > 40 GB if you can swing it, and > 70 GB if you want to run 120B models.
A pair of P40s off eBay isn't too bad to buy. Probably the best budget path presently.
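The numbers in this thread line up with a simple rule of thumb: VRAM ≈ parameters × bits-per-weight ÷ 8, plus some headroom for the KV cache and activations. A sketch, with the 20% overhead factor being an assumption rather than a measurement:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    # Weights take params * (bits / 8) bytes; the overhead factor covers
    # the KV cache, activations, and framework bookkeeping.
    return params_billion * (bits_per_weight / 8) * overhead

for bits in (2, 3, 4, 8, 16):
    print(f"70B at {bits:>2}-bit: ~{vram_estimate_gb(70, bits):5.0f} GB")
```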
It was a great video, but you didn't show us how we can install it in our home lab 😢
I’m loving Gemini for sure! It’s a bit better than llama or ChatGPT.
Gonna need a video on Whisper. Also, any chance it can be integrated into Plex for drafting subtitles?
What is the project called that you use for the Whisper web UI?
Thank you 😊
What hardware are you using to run this?
I'm ready for the how-to! I have messed with it and have something running, but these features look awesome!
Soon on my other channel!
You have another channel?
This is the first time I have done something Techno Tim is showing before he showed it :D
Ha! It took a while for me to build, integrate, and actually evaluate all of these systems!
Did you try the 70B model from Llama? (I saw you only used the 8B model.) I read some stuff about running it with two RTX 4070s or an RTX 6000 Ada, but sadly I don't have the hardware to run that purely on graphics cards yet. The results should be even better than the paid ChatGPT stuff.
An RTX 4090 with 24 GB of VRAM, I mean.
Curious if there is a self-hosted AI which could serve as a replacement for Grammarly? I recently noticed my Office 2016 had a new AI process running. From a privacy perspective I'd prefer not sharing my documents with organizations like MS/Google/Grammarly.
You don't need a full AI for grammar. LanguageTool is self-hostable, and they have browser extensions you can configure to use your local copy.
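If you go the LanguageTool route, a self-hosted server exposes a simple check endpoint. A minimal sketch, assuming the server is reachable on localhost:8010 (the port depends on how you run the container):

```python
import requests

# Self-hosted LanguageTool server; adjust the port to match your deployment
LT_URL = "http://localhost:8010/v2/check"

text = "This sentence have a error."
resp = requests.post(LT_URL, data={"text": text, "language": "en-US"})
resp.raise_for_status()

for match in resp.json()["matches"]:
    flagged = text[match["offset"]: match["offset"] + match["length"]]
    suggestions = [r["value"] for r in match["replacements"][:3]]
    print(f"'{flagged}': {match['message']} -> suggestions: {suggestions}")
```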
I tried this a while back with an Nvidia RTX 3060 12GB, and of course bigger models wouldn't load. Would using two GPUs help load bigger models, giving a combined memory of 24 GB? Also, do you know if mixing GPUs works, for example a 3060 12GB with a 4060 16GB to give a combined 28 GB?
If you don't have at least a 3090, you are really limited. Yes, that setup exists, but you could also just buy a workstation card, which means insane costs. So if you really want to play with AI you need a 4090 because of the VRAM; it's the only real option other than going with an NVIDIA RTX 6000 for 6 grand and 48 GB of VRAM.
@@xythiera7255 I'm going for the cheapest option. If I can buy two 4060 16GB cards to have a combined 32 GB of GPU memory, then I will do that!
What is the hardware stack you are using for your AI solution?
Do you know the most cost-effective GPU to get this done? I doubt it will work well generating images or processing PDFs smoothly on a CPU.
the thing with AI is that even if you are running it locally you need to get the training data from somewhere, so someone still has to give up their privacy :)
touché
Do you have a parts list and/or a setup tutorial?
Ok, Tim, where is the guide for how to set this all up ? Especially the Home Assistant stuff....
Soon on my Techno Tim Tinkers channel!
@@TechnoTim Standing by then......
If you don't have a really, really powerful GPU, it's not really possible in terms of usability. If you have to wait ages for something to happen, it's kind of pointless.
@xythiera7255 It really depends on the GPU, I will cover this in my tutorial!
@@xythiera7255 A 4090 is enough for Llama 3 8B. 4x 4090 or one A100 will work for the 70B version, or even for Mixtral 8x7B, nearly as good as GPT-4 and super fast :) But Phi-3 and Llama 3 8B are really not that bad. They are better than GPT-3.5, so I see this as a good starting point. I would recommend waiting for new hardware like LLM-specific GPUs, because they could be much cheaper, like 1/4 of the price.
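For anyone wanting to try the "good starting point" models mentioned here, the Ollama Python client makes a quick test easy. A sketch, assuming Ollama is running locally and the model has already been pulled with "ollama pull llama3:8b":

```python
import ollama  # pip install ollama; talks to the local Ollama server

response = ollama.chat(
    model="llama3:8b",  # swap in phi3 or any other model you have pulled
    messages=[{"role": "user", "content": "Explain RAG in two sentences."}],
)
print(response["message"]["content"])
```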
What is the difference between Ollama with a web UI and LangChain for NLP tasks?
This was really interesting, now I want to build it :D
HA voice integration is, unfortunately, very strange. They insist on using HA "add-ons" for voice, which I really don't want because I do not use HAOS; I deploy HA like any other container.
The addons are just docker containers, you can find them in the rhasspy git repo
Mac Whisper is amazing
100% agree! I bought it for better models and they work even better for scripted talks (like this). It's so accurate!
What is the UI that shows the app stack flow? Is it an actual app or just After Effects?
Love your videos. Even though there are plenty of how-to videos on these topics, I would love to hear it in your mesmerizing voice 😊
🥰. thank you! Audio in this old wooden / plaster room is hard, so hopefully it sounds ok!
This is Tim’s evil twin brother NoTechTim.
Insert Travolta meme looking for the tech.
TechNOTim 😂
What are the hardware requirements?
What's the no-code-workflow-looking thing you are using?
Just subscribed for the upcoming guides on local AI 😃🥰😎
@@droneforfun5384 soon!!!
Which rack is that at 0:44?
So power hungry, the good AI GPUs are.
@@tendosingh5682 for sure.
Why do so many services go with such odd names? Like SearXNG, which I'd pronounce "Sear XNG", not "search NG". That's how it's written, after all.
I think in the area it comes out of the "x" makes a "ch" sound.
@@benhillard919 I think so too, and I totally guessed so I hope that's how it's pronounced! Also, now that I see it again, it might be "searching". 🤣
Nice, welcome to Minnesota btw 😂
Haven't been able to get Home Assistant to give me any data back from AI agents, so frustrating.
You can run Home Assistant's faster-whisper on a GPU; I've been doing it for months. I've got a Dockerfile for this, let me know if you want it.
Thank you! I found a forked version of wyoming whisper but it didn't seem to help. I figured I'd wait for the official one to get updated.
@@TechnoTim I'm also using someone's fork. I don't remember if I changed it in any way, but it's running perfectly on my Quadro P2000.
Hello, what is the name of the open-source, web-based version of Whisper that is mentioned, please?
What are the system specs?
Glasses off so we don't see that DeskPi.
I see a future video of you building a dedicated AI server with multiple GPUs and benchmarking the tokens per second depending on the setup. It would get many views from r/LocalLLM or r/LocalLLaMA groups for sure.
Thanks! Sounds awesome! I am always hesitant to share my content on subreddits other than my own, but if you feel this is worthy of it feel free to!
I am generally skeptical of the AI hype but your way of going about it has piqued my interest. Hope more in-depth guides on setup and hardware are coming, subscribed ;)
Ah I just found your homelab video! That answers some questions!
Great video! ❤
Damn, you just sent me down a rabbit hole.. lol
How do you define a graphics card in Docker on Ubuntu?
Would this all run well on a 4090?
The bigger models need more vram than a single 4090 provides.
You can run the smaller models just fine. You will lose out on some performance the bigger models provide but it runs!
this is awesome!
Can I do these things with a 4060 TI 16Gb version ?
Yes, just use smaller models.
You can, but it will be really slow.
Ok. Plan dropped.
I will just keep watching TechnoTim 😁.
I currently run two 1070s (8 GB); while a little slow, it works fine, but for image generation you would need more VRAM. 8B LLM models work fine on a single 8 GB of VRAM. A 3090 is much faster, does images very well, and can run larger models. IMHO, integrating search had a bigger impact than using a larger model of the same type (haven't tested 70B).
I was promised cookies!!!!
YEESS!!!!
That's awesome, I like your vids...
Surfshark privacy???
Open Source?
nice vid
Sure, electricity is free nowadays
It uses a lot less power than a gaming machine since you only use it in spurts. Nothing new here, just shifting the workload that's using the card.
What happened to Tim? Who is this imposter?
🤓
you forgot to mention the script for this video was made by AI 🤖
Ha! Nope, 100% me! Bad grammar, bad jokes, stutters were all compliments of HI (Human Intelligence)
@@TechnoTim i love AI but HI will always win my heart. but seriously, thanks for this video, i've been waiting for this one. now i need to integrate more stuff to my open webui!
hi!
Surfshark has a no-logging policy, yeah right. A VPN seller with a no-logging policy will never exist. Don't lie, we like you too much.
lol
What hardware is used for this AI heaven?
When AI can write an OS it will have arrived.
The singularity!
You're a One Piece fan?
😀😀🥰🥰🥰🥰
Please use audio dubbing from English to Arabic in your videos
Can I put this into Proxmox?
"Proxmox, spin up LXC container for Plex and pass my gpu through from hardware encoding."
Yes, it works without problems in an LXC container with a GPU; you just need to configure the network address.