Tried this on a 13900HX laptop, 40GB RAM, RTX 4080 mobile 12GB dedicated, 19.8GB shared (according to Task Mgr). It does seem to use the GPU. 5.8GB memory used while ollama is running the 3.1 llm. Using the same test prompts as this video, it is wicked fast and doesn't seem to peg the CPU or GPU. See no reason you couldn't do this on any nvidia mobile GPU going back to the 1060 so long as it has 6-8GB ram. Could not get ollama serve to work on the command line, but I'm a windows terminal noob. GPU never maxed over 16% usage. Now pulling the 405b model and will update later.
Also, I think there is a correction to your video, Dave. You mention that you ran the 70B models on the lesser machine, but I don't think you did. "latest" pulls the 8b model. I'm totally green on the topic, so I don't really know what the difference is between the models.
There's no way you're even going to be able to load 405b. *Maybe* you could run a 70b quant at very low speeds. Dave definitely misspoke about all the previous models being 70b--there's absolutely no way. They must have been 8b, and even then, a quite quantized version. We need more of this kind of local speed testing for LLMs on the internet, but tests of models like an 8b quant are almost useless imo, because the models are almost useless. I wouldn't trust a model of that size with anything.
@@justtiredthings you are correct, the 405b did not run. Separately, ollama setting itself up to run on windows start up also caused the first win11 BSODs I've ever seen 30 seconds after boot until I disabled it. Odd, since it didn't say or ask to be allowed to do that.
I mean, my largest problem with the previous video was running a 2GB model on a 50k machine, a system with at least 45GB available VRAM... I can easily run a quantized 14B (10GB) model on my 2080TI 100% on GPU. Kinda expected more. And seems to be the same issue in this video. Maybe editing issue?
@@_chipchip it's fair to critique relatively useless testing. Dave is already putting all the effort in--the videos could be much more useful if he tested appropriate model sizes for each machine. And an 8b quant is a virtually useless model in general.
@@justtiredthings Yeah. At least back when I fiddled around with this stuff I found 13b 4 bit models were still too incoherent to be useful, which was a pity because those were the biggest ones I could get running on my GPU. I ended up upgrading to 64 GB of ram and running much larger models on my CPU. They were slooow, but the results were much better. Though this was a while ago. I assume the latest generation of models are a bit more efficient.
@@fnorgen yeah, I will acknowledge that Qwen2.5 14b is pretty impressive for its size, at least. I'm new to playing with it, but I think it could probably do some useful work. But even that is almost twice the size of an 8b model, and I'm running it at an 8-bit quant, I believe. Also, Qwen2.5 is just a lot more impressive than Llama in general.
I seriously think your show is great. It's interesting and it's entertaining. I wasn't born in the age of the computers you grew up using, but you explain them in a very good and interesting way. I think there are YouTubers that could benefit from presenting the material as well as you do. You're not just staring at a screen watching someone do stuff.
Dave: Your list of supported Radeon GPUs suggests to me that ollama demands ROCm driver support for GPU acceleration. Unfortunately for us ordinary users, AMD's presentation to investors all the way back in spring 2020 made it crystal clear that ROCm was geared more towards the data center and the AMD CDNA architecture, not the consumer RDNA architecture. AMD has only supported ROCm on the very top-tier, more expensive consumer GPUs.
When you did the intro into the last video, I knew this would be a followup kind of video. It made no sense to just leave the demo out of youtube watcher reach :D
I'm here for the moment when the Pi says: "I can't do that, Dave"
it has to wait for dave to forget his space helmet
Open the pod bay doors!!
The irony being that the Pi could do that
1:17 on this part it would actually be I CAN DO THAT, Dave
😆😆😆🤣
Dave, I appreciate your mindfulness of how valuable our time is and editing this vid down to a reasonable time frame.
The main (and almost only) factor for speed is memory bandwidth. Every token is generated by pulling the entire model from RAM and doing a bit of math to it. An 8GB model on an RTX 3060 Ti (256-bit bus, 448 GB/s) gets about 50 tokens/s (accounting for some overhead). That's why GPUs are so fast. If you have 2 channels of 3200 DDR4 memory, you have 51.2 GB/s - so you'll get about 6 tokens/s, or around 1 token/s on a ~48 GB Llama 3 70B model with 4-bit quantization. DDR5 helps a lot, and so does having more than 2 channels. The CPU doesn't really matter. (Unless you're limited to 2933 MHz by a shoddy memory controller in a Ryzen 2600 and upgrade to a 5600X and get a 22% boost by pushing your DDR4 to 3600 MHz.)
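For anyone who wants to sanity-check their own setup against that rule of thumb, here's a rough back-of-the-envelope sketch; the bandwidth figures and the ~80% efficiency factor are illustrative assumptions, not measurements:

```python
# Rough decode-speed estimate: each generated token streams the whole quantized
# model through memory once, so tokens/s is roughly bandwidth / model size.
def estimate_tokens_per_sec(model_gb: float, bandwidth_gbps: float,
                            efficiency: float = 0.8) -> float:
    """Upper-bound estimate for a memory-bandwidth-bound LLM."""
    return bandwidth_gbps * efficiency / model_gb

setups = {
    "RTX 3060 Ti (448 GB/s), 8 GB model": (8.0, 448.0),
    "Dual-channel DDR4-3200 (51.2 GB/s), 8 GB model": (8.0, 51.2),
    "Dual-channel DDR4-3200 (51.2 GB/s), ~48 GB 70B Q4": (48.0, 51.2),
}

for name, (model_gb, bw) in setups.items():
    print(f"{name}: ~{estimate_tokens_per_sec(model_gb, bw):.1f} tokens/s")
```

The numbers land close to the figures above: ~45 tokens/s on the GPU, ~5 tokens/s for an 8 GB model on dual-channel DDR4, and about 1 token/s for a 4-bit 70B.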
Ok, to be fair: if you're running Llama on an old ThinkPad X260, you actually do get twice the performance by running the model on *both* cores. Having true AVX2 (256-bit) or better and more than two cores really helps with doing the math.
"A bit of math" is.... an interesting way of putting it. I'm aware that training is several orders of magnitude more compute intensive than inferencing, but weather I run in CPU or GPU mode both are taxed pretty heavily. Never to 100%, which does indeed confirm that memory bandwidth/latency is the bottleneck, but still, taxing an 8 core CPU to 45% on LP-DDR5 6400 is hardly "a bit of math".
@@andersjjensen it really isn't that much math. The only reason it even registers as 45% is because we're talking about models that use all the input tokens and the output tokens as active bi-LSTM nodes.
So it's more like it's constantly rechecking its work.
Just consider how fast the Mac Pro pumps the tokens out when any other benchmark doesn't make its GPU look all that impressive. The Mac Pro is more similar to an RTX 2060 with loads of fast RAM strapped onto it.
This is a case where the way usage data is monitored isn't representative of how the hardware is really taxed. Usage monitoring is more an indicator of how full the wait queue is.
Ah, I just realized you specifically mentioned the CPU for the 45% figure. But either way, my point is that you can't actually extrapolate from that number what the ideal hardware configuration would be. The same amount and bandwidth of RAM with half the raw compute is still much faster than it needs to be, even if the usage figure seems to say it's at the limit.
Use a Vega 20 GPU (excluding the Radeon VII) and you can pool VRAM with RAM to run whatever models you want. You can even add swap space on NVMe drives. I got Llama 405B running on a system with a Vega 56, which supports HBCC (although it's worse), and I used 4 NVMe drives in RAID 0 for swap. PCIe Gen 3 is part of the problem, but the system prioritized VRAM, then RAM, then swap, as I expected, so about 192GB of real RAM was used and only 600GB of swap.
Vega 20 (the MI60, for example) has PCIe 4.0, and Optane DIMMs or Optane U.2 drives would work better, though.
@@JonVB-t8l you can basically always do this. It's not Vega-specific. Computers just work that way.
What you're doing is changing how it's reported to the system so the basic flag checking that the software does before sending the model clears without complaining.
But you could also just remove the flags or use wrappers that don't check.
The reason they do try to prevent it is because you lose 90% of the speed when you do this. And it can be unstable on some systems.
I found this video both informative and entertaining! I chuckled when you mentioned that it made you sad to see that big boss PC struggling. Great video as always Dave!
I smiled too, but got the impression that Dave cares for his viewers.
He is quite precise when he talks which rather suits me.
Definitely learned something there. 😀
Hey Dave, in your next LLM tutorial, can you give us a demo on how to connect external data sources to it? I'm struggling to wrap my brain around it.
Do you mean using your own reference documents? If so, take a look at AnythingLLM, it might meet your requirements
Check out N8N or Dify
LM Studio, AnythingLLM, or similar.
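If you just want to see the core idea before reaching for one of those tools, a minimal sketch is to stuff your own document into the prompt and send it to the local Ollama HTTP API. This assumes a default Ollama install listening on localhost:11434 and a hypothetical notes.txt of your own; real RAG tools add chunking and embeddings on top of the same principle:

```python
# Minimal "bring your own data" sketch: prepend a local document to the prompt
# and ask the local Ollama server about it. Real RAG tools chunk and embed the
# documents instead of pasting them wholesale, but the principle is the same.
import json
import urllib.request

def ask_with_context(question: str, context_path: str, model: str = "llama3.1") -> str:
    with open(context_path, encoding="utf-8") as f:
        context = f.read()

    prompt = (
        "Answer the question using only the reference text below.\n\n"
        f"Reference:\n{context}\n\nQuestion: {question}"
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_with_context("What does the document say about backups?", "notes.txt"))
```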
Having failed to get the web server running on your previous WSL demo, I removed everything in frustration. Great to see it works from the command line equally well under Windows. I now have AI on my laptop (8GB RAM, no GPU), something I never thought possible! Thanks for showing something for everyone.
11:00 I believe you've been running the 8B model if you're pulling 3.1 latest. I could be wrong, but I believe latest defaults to 8B flavor.
Correct, llama3.1:latest = llama3.1:8B.
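If you want to double-check which flavor you actually pulled, one quick way (assuming a default Ollama install on localhost:11434) is to list the local models and their on-disk sizes; a ~5 GB llama3.1:latest is the 8B, while a 70B Q4 would be roughly 40 GB:

```python
# List locally pulled Ollama models and their on-disk sizes via /api/tags.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.loads(resp.read())["models"]

for m in models:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")
```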
With a 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy.
Came here to say the same. The 70B might be a great fit for the faster machines.
I haven't played with llama yet, mostly mistral, so I was also surprised when the 70b param model was only 5gb 🥲
@@sharpenednoodles 70b llama3.1 is more like 40gb 😅
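The arithmetic is easy to eyeball: parameter count times bits per weight gives a floor on the file size. A quick illustrative sketch, where the bits-per-weight values are approximations (real GGUF files add some overhead for embeddings and metadata):

```python
# Lower-bound file size = parameters * bits-per-weight / 8.
# The bits-per-weight figures are approximate, not exact GGUF numbers.
QUANTS = {"Q8_0": 8.5, "Q4_0": 4.5, "IQ2_XS": 2.3}

def min_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # billions of parameters * bits / 8 bits-per-byte = gigabytes
    return params_billion * bits_per_weight / 8

for params in (8, 70, 405):
    for name, bpw in QUANTS.items():
        print(f"{params}B @ {name}: ~{min_size_gb(params, bpw):.0f} GB")
```

That puts an 8B Q4_0 at roughly 4.5 GB (the ~5 GB download) and a 70B Q4_0 at about 40 GB.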
Thanks Dave, I really appreciate the time you spend to make these videos for us. Really enjoy these geeky rabbitholes.
As someone who gave you "heat" in the last video, thank you for the follow-up!
You bet!
I rather liked your having demonstrated with WSL, as I was able to follow along on my Ubuntu server
Thanks for updating and including budget friendly options.
Thanks Dave! Really appreciate your time, and energy on this topic. I was playing with the former video yesterday and thought, "man I hope he does a little more on this".... and BAM, you did. THANK YOU!
Superb content. Not many channels with this amount of quality in terms of delivery.
The Llama 3.2 1B and 3B models run surprisingly well using Ollama on my OrangePi 5+ 8-core RK3588 processor with 8G RAM. Both models generate tokens at speeds that match or exceed normal human speech. I believe additional cores make a big difference. I also want to test these models on the Radxa X4 8G, N100 processor.
What's the cost of such a home "server"
Pretty awesome the pi even ran. Super cool Dave thanks as always man!
Since some people (predictably) like to complain in your videos because you're not catering to their exact needs, here's my demand for a followup with you running it on your PDP-11.
Video to come out in 200 years
Do you want it done in real time?
@@20chocsaday What's the max allowed length for a YouTube video, 10 hours?
Watch it turn out to be faster than the 50K Dell.
I know, no chance of that. Yet a PDP-11 used to power a Xerox 9700 printer. It could read from network or tape, merge data with a form at 300 DPI, print at 2 pages a second duplex and do that hour after hour.
I'm so glad you're doing a hardware comparison. I watched your previous video and wanted this immediately.
I'd prefer it directly on Linux, but ofc I'm sure I can figure that out myself I'm just here watch 😂
I very much believe that local LLMs are an answer to privacy in the future. As long as a large group of open testers materialize, we can also try and remove bias as best we can.
🎯 Key points for quick navigation:
00:00:00 *💡 Introduction & Overview*
- Introduction to testing LLMs on different hardware setups, ranging from $50 to $50,000,
- Motivation for addressing viewers' requests for more budget-friendly hardware and direct Windows installation.
00:00:43 *🐢 Running on Raspberry Pi 4*
- Attempt to run LLaMA on a Raspberry Pi 4 with 8 GB of RAM,
- Installed on Raspbian, demonstrated extremely slow performance, impractical for real-time use.
00:03:27 *🔄 Testing on Consumer Mini PC (Orion Herk)*
- Upgraded to a $676 Mini PC with a Ryzen 9 7940HS and Radeon 780M iGPU,
- Faster performance compared to Raspberry Pi, but model could not fit in GPU memory, relying on CPU instead.
00:07:50 *🎮 Desktop Gaming PC with Nvidia 4080*
- Running the LLM on a 3970X Threadripper with Nvidia 4080 using WSL 2,
- GPU offloading enabled faster performance, similar to ChatGPT, demonstrating good use of available hardware.
00:09:42 *🍎 Mac Pro M2 Ultra Testing*
- Tested on Mac Pro with M2 Ultra and 128 GB unified memory,
- Model ran efficiently with GPU usage around 50%, producing rapid responses, demonstrating M2 Ultra’s suitability for LLMs.
00:10:51 *🚀 High-End 96-Core Threadripper & Nvidia 6000 Ada*
- Attempt to run a 405-billion-parameter model on an overclocked Threadripper with Nvidia 6000 Ada,
- Performance lagged significantly, highlighting that larger models can struggle even on high-end consumer hardware.
00:13:12 *⚡ Efficient Model on High-End Hardware*
- Switching to a smaller, more efficient LLaMA 3.2 model on the high-end setup,
- Demonstrated much better performance, producing rapid answers in real-time, highlighting the importance of model size optimization.
00:14:33 *📢 Conclusion & Call to Action*
- Summary of testing LLMs on various hardware from low-end to high-end,
- Encouraged viewers to subscribe and check out more content, highlighting the educational and entertainment aspects of the video.
Made with HARPA AI
This is amazing. I just installed it on my home PC. ZorinOS / Ryzen 5 3600 / AMD 5700XT / 16GB ... It runs great (running the 3.2:latest). I have been trying to learn how to make my first game in Unity and I've been struggling with some basic ideas on the interface to code a basic shader to apply to a material and get it into the scene. The format this thing uses is perfect! ChatGPT couldn't tell me in a way I understand, couldn't find a tutorial that was what I wanted... this thing spit it out in 3 questions. I can actually understand exactly what it means, not just some vague concept I'm going to have to stumble through! I don't understand how this is even possible with such a small data set, but I will take it. THANK YOU!!!!
Hey Dave - 11:00 With a sub 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy. You were running pretty much the smallest version possible.
I'm running 405b on an 8-year-old server with a Vega 56. Abusing the F outta HBCC to add RAM and swap into the pool of "VRAM". Yes, I have 600GB of the 810GB model running from swap spread across 4 NVMe drives.
@@JonVB-t8l That's quite the setup. I'd be very curious how that performs.
@@Steamrick I am pretty sure not well enough to be acceptable. Even with the NVMe drives, I think the read/write speeds are maybe a quarter of what a DDR4 RAM stick gets.
Came here to say this. I think 70b is like 40GB model
@@thecompanioncube4211 Oh, even the fastest NVMe SSD is far less performant than a quarter of DRAM. It's not just the speed, it's also the latency that's much worse.
I loved seeing how AI can bring super hardware to its knees. It instantly demonstrates why the AI cutting edge is moving to Blackwell and Rubin. Many thanks for this demo.
Dave, thank you for running those tests for us. While I am currently working with GPT through web browser and looking forward to switching to API, it is becoming more and more clear that the frameworks involved might hit hard limitations sooner than later and running a local model will be my only option in the future. Seeing that it is feasible, even today is very reassuring!
I've run Windows on my RPi4, tutorial videos are out there. Not too complicated.
That windows method is even more straightforward than the wsl from the last video. Thanks for sharing!
I saw your previous video. It made me want to make my system dual boot. Your first video I followed and was able to execute the LLM you suggested within VirtualBox. It worked just fine and I was grateful.
And so I installed Linux Mint in a dual boot, and your FIRST video was inspiring enough for me to figure out how to get Ollama on Linux and then pick out any LLM I wanted and install it from there.
I am grateful for this video, but to be fair, your first video shouldn't have garnered any hate. Because, if people are even your viewers they should be savvy enough to figure things out on their own, and use your videos as a guide. Otherwise, those viewers wouldn't be your subscribers if they were that afraid of their own computers.
I am freaking amazed to run this locally on my laptop (13900HX plus 4070 mobile); it is only 2GB and performs amazingly. Thanks for sharing this Dave, great content piece!
Good luck with the longevity of your laptop! If you have any random problems, crashes, things just not working, make notes of what and when (time, date) and contact the laptop company and have them officially note this as a warranty issue (if you have a warranty), and otherwise make preparations for a replacement laptop. Good luck and best wishes.
and how do you use this 2gb (8B?) model in daily use?
I used this on my machine, an i5-14500 with 16GB DDR5 and an Nvidia RTX 4060, running Linux Mint, and the speed is good enough for me.
What LLM?
@@ArthurFlimbimlinson-x1r Likely one with half a dozen to a dozen billion parameters. I get around 20-30 tokens/s on my RTX 3060 12 GB when using LLMs with those sizes. Intel i5-12400F, 32GB DDR4 and Windows 11 if you want the other details but I'm pretty sure the rest of your PC can be a potato as long as the entire model plus context window cache fits in the GPU.
I can also load a 70 billion parameter model that's been cut down to a smaller size (quantized to 2-bits) but it uses all my RAM+VRAM and runs at a glorious 1 token/s.
@@ArthurFlimbimlinson-x1r Dolphin
This testing is right up the alley of the sort of video that I've been looking for and I really appreciate it. Going through a wide range of machines is much more useful than just testing like a 20k machine. That being said, there's something I am super confused about. Before you start the Threadripper test, you said up till now we've been using the 70 billion parameter model. The download sizes were showing around 5GB and the 70 billion parameter model would be much larger than that on the order of over 10 times, even for a quantized version. And there's just absolutely no way a 70 billion parameter model would run on anything remotely close to as wimpy as a Raspberry Pi. I assume you misspoke, which does lead me into a request. I would actually really, really appreciate seeing this sort of range testing across a variety of machines, specifically for larger models around ~30 billion or ~70 billion parameters, because I assume that most of the early tests were for some quant of the 8 billion parameter model. Most of the results available online are for the 8 billion parameter models, which is really a shame because higher end consumer machines like a gaming PC or an M2 Ultra really should be able to handle larger models around 30-70 billion parameters.
You are always entertaining Dave! And considering your niche topic this is true talent! I'm not even that much of a nerd, nor am I interested in programming or computer hardware, but I really enjoy your channel. Keep up the great work!
Thanks so much for this great opportunity. We really love your online classes.
@DavesGarage @6:00 you are talking about the "fixed" RAM allocated to the GPU. The BIOS/UEFI "should" have an option to set the memory as "shared" (or similar wording), where the amount of RAM is dynamically allocated between the CPU and the GPU. This is one of the reasons why people are interested in the upcoming "Strix Halo", which has a beefy GPU (and CPU) but also quad-channel RAM and can be fitted with 256GB, which can be dynamically adjusted and then eaten up by the GPU! Please find this setting in the BIOS, change it to "dynamic" and post a video about your findings; many, I am sure, would be interested in such a thing. Thanks.
Ok, thanks Dave. Got it running. Any interest in setting it up to web scrape and analyze results based on a local query?
Hey, as for RPI4 and RPI5 there are tons of models of 1B-3B size, which are pretty fast even on Raspberry PI
You needed to run Minesweeper on the $50k Dell to really push it ;) Another great video Dave, thanks.
You are all wrong, you can run the biggest model perfectly, listen to me. Large Language Models (LLMs) can be run efficiently using a multi-tiered approach. A smaller, lightweight model (like a 2GB LLM) acts as a 'librarian,' managing the selection of relevant data or tasks. It decides which parts of the larger model or dataset to load based on the specific question or context. This eliminates the need to run the full 1TB model continuously, saving on computational resources. The 'librarian' retrieves and activates only the necessary parts of the larger model for focused processing. This modular approach balances speed, memory use, and accuracy, enabling effective use of massive models without their full resource load. Think of it as a smart filter between your query and a massive database, providing what's needed on demand. It's like going to a library and asking where the books are found and only reading those books, so yeah, you can run the max. My secret to you, thanks.
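The closest practical version of this that I've seen is routing between separate local models rather than partially loading one giant one: a tiny model classifies the question, and a script forwards it to whichever bigger specialist seems appropriate. Here's a hedged sketch against the local Ollama API; the model names are just placeholders for whatever you actually have pulled:

```python
# "Librarian" idea as model routing: a tiny model picks a category, then the
# question goes to a larger model suited to it. Model names are placeholders.
import json
import urllib.request

ROUTES = {"code": "qwen2.5-coder:14b", "general": "llama3.1:8b"}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def answer(question: str) -> str:
    category = generate(
        "llama3.2:1b",  # the small "librarian"
        f"Classify this question as 'code' or 'general'. Reply with one word.\n{question}",
    ).strip().lower()
    return generate(ROUTES.get(category, ROUTES["general"]), question)

print(answer("Write a Python function that reverses a linked list."))
```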
I built a system with 4x P102-100's which total 40GB of GPU ram. Now I can use the 70b quantized models and it is awesome! Best bang for your $$$.
Yet another fab video Dave. (It's amazing how many people who have never produced anything in their lives feel compelled to criticize the heck out of other people's work)...
Nice pivot and delivery, sir. Respect. I can't wait to follow along.
The 7940HS CPU in your mini PC has dedicated AI hardware acceleration dubbed "Ryzen AI". Hopefully the project enables it and starts optimizing for it (in addition to the iGPU) in the future. Looks promising for cheap devices.
Only 10 TOPS according to their website. For comparison, Copilot+ PCs need at least 40 TOPS. So it's questionable whether it's accelerating anything.
There are projects working on incorporating ROCm, which I believe can leverage the AI processor. Similar to MLX-based Apple Silicon models.
Turns out, 3.1 runs reasonably well on 4080. Thanks for the tip! Until this video I didn't know I could run an LLM on my PC.
That was best of the internet right there. Thanks, Dave.
Best I can do is like and say "thank you" since I've already subscribed. How about a heart? ❤
I'm surprised how smart an offline LLM is. I asked the question "I have a Ryzen X670E motherboard with a Ryzen 9700X CPU which idles at 45W from the wall; how much is from the chipset?", and the answer was correct and relevant, with pages of it.
I tried words with multiple meanings, spelling mistakes, etc., and the answers were correct.
Do LTO drives need drivers, what is the difference between LTO-5 and LTO-6... all the world's knowledge in a few gigabytes.
The fact that you got llama-3.1:405B running at all at home is just impressive, even if it's mostly running on the CPU.
My Ryzen 7 is hardware capped at 128GB of system RAM; I really should have waited for the AM5 socket.
I have a 7950x with 32GB RAM and a 3090. No probs running 405B if I can wait for the result. Also have a 64 core Threadripper, 256GB RAM and a 3090. Both machines are level pegging. The more GPU VRAM you have, the bigger your model can be
@@darksushi9000 Which quant of the 405B model are you using in your 32GB RAM machine? I can barely fit a 2-bit quant of the 70B model in 32GB RAM plus 12GB VRAM.
@@firecat6666 I am running the Q4
@@darksushi9000 Hmm, that doesn't fit in 32GB of RAM unless you have 10 RTX 3090. Didn't you mean to say you're running the 70b on your 32GB RAM machine and the 405b on your 256GB RAM machine?
I'm running full fat 405b on a 7-year-old Xeon Gold system with 192GB of RAM and a Vega 56 GPU.
I mean, I'm cheating because I'm using 4 NVMe drives in RAID 0 as swap space and HBCC to pull it off, but hey... It works, sorta.
Nice content. I like that you seem completely agnostic between Mac, Linux and Windows, and even the different hardware.
Thanks, Dave. You've given me a lot more confidence in my beat-up 2015 MacBook Pro. Off to Ollama now!
So kewl. Was just about to look for resources regarding this topic and this video got recommended. Amazing, thank you!
These vids are exactly what I need right now. Good to know that the pi can actually run it in some capacity.
Even an 8GB RAM Pi 5 is still under 100 US dollars, so it would be a reasonable entry-level platform. Beyond the learning experience of setting up AI and LLMs, there might be utility in having a Pi as an offline server which could e-mail answers to questions that don't need to be answered within a few seconds of real time.
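For anyone curious, that offline answer-by-e-mail idea is only a few lines: ask the local Ollama server and mail the reply. A minimal sketch, where the SMTP relay, addresses, and model name are placeholders, and a real version would add authentication and an inbox poller:

```python
# Sketch of a slow "answer by e-mail" Pi service: query the local Ollama server
# and send the reply via SMTP. Relay, addresses and model name are placeholders.
import json
import smtplib
import urllib.request
from email.message import EmailMessage

def ask_ollama(question: str, model: str = "llama3.2:3b") -> str:
    payload = json.dumps({"model": model, "prompt": question, "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def email_answer(question: str, to_addr: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"LLM answer: {question[:60]}"
    msg["From"] = "pi@example.com"                       # placeholder sender
    msg["To"] = to_addr
    msg.set_content(ask_ollama(question))
    with smtplib.SMTP("smtp.example.com", 587) as smtp:  # placeholder relay
        smtp.starttls()
        smtp.send_message(msg)

email_answer("Summarize the pros and cons of ZFS for a home NAS.", "you@example.com")
```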
win10 i7-13700k with no video card pegs at 100%, and llama3.2 generates about 80% as fast as normal reading speed.
With a 10600K it's at least 2-3 times faster than normal reading speed. But I am on Linux.
Correct me if I am wrong, but the reason the Herk box is using the CPU is that its GPU is an AMD. Pretty much every ML framework today expects to use the CUDA library for GPU acceleration. CUDA is a proprietary library developed by Nvidia. AMD has been fighting tooth and nail to gain wider adoption for their own alternatives, but they are simply not there yet.
It does support some AMD dedicated video cards as you saw in the video. Not sure how effective it will be vs CUDA.
I appreciate this vid using "affordable" (and not-so-affordable) hardware.
I'm already on a Mac; I'm researching Ubuntu and Windows as an option for some old vid cards.
Thanks for making this video. I'm building a new PC and wanted to play with running local LLMs. To see just how fast a 4080 is...holy crap!
Llama 3.2 3B is a clear winner for general chat tasks on local machines. I just love it! Thanks for testing the 405B - I was wondering how fast it would go and how much RAM it needs. Now I know it's not worth it. I'm looking forward to Llama 3.2 7B, which I think will be the sweet spot.
Tested the 70B Q4 (42GB) on a 5950X and 128GB RAM with RAG and 40K context. It used about 80GB of RAM and the inferencing was around 0.56 tokens/s (I usually get 30-50 on GPU using 11B). Then tried the IQ1_S, which was 15GB, on the 4060 Ti 16GB + 30K context and got the same speed (obviously offloading to RAM).
The good thing is that the 70B generates long and detailed answers, unlike the 3.2 1-3B models, which sometimes say that they did not find the query in the document attached (a 2-hour, 30K-word YT interview).
Wonderful!! Actually very useful. I plan on upgrading my own PC to do AI stuff, and now I can see roughly how well it'll do it! Thank you so much!
I don't know why anyone would give you heat, that video was OUTSTANDING!! I was up and running on my HP Gen 9 with an old Nvidia P2000 in no time at all! The thing ran GREAT! The replies were smooth and fast… The thing I don't understand is the three variants or size options in 3.1. I want the most powerful model available. My GPU seems to be doing just fine and I have a ton of CPU and memory.
Bigger models are (usually) smarter. But to run them fast enough, you need to fit the entire thing in VRAM or else your GPU has to pull data from the RAM, which is slow as fuck. Try loading a model that's bigger than your 5GB of VRAM and see how it goes for you, I bet you'll be disappointed.
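A rough way to know in advance whether a model will fit is to add the quantized weights to the KV cache for your context length and compare the total against your VRAM. A sketch using approximate Llama-3-8B-style dimensions as assumptions (32 layers, 8 KV heads, head dim 128, fp16 cache):

```python
# Rough "will it fit in VRAM?" check: quantized weights + KV cache + overhead.
# The layer/head counts below are approximate Llama-3-8B-style assumptions.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # Keys and values are cached per layer for every token in the context window.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len / 1e9

def fits(weights_gb: float, kv_gb: float, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    return weights_gb + kv_gb + overhead_gb <= vram_gb

kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache at 8K context: {kv:.2f} GB")          # roughly 1 GB
print(f"5 GB 8B Q4 model on a 12 GB card: fits = {fits(5.0, kv, 12.0)}")
```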
"Nothing but the 2nd best, for dave.... " Classic hahahaha
And..., the $50,000 Dell said, "I'm sorry, Dave. I can't do that". Excellent video. Much better than the previous one on LLM. I actually have it working now. Thanks!
This video should save me a lot of time when I get around to running an LLM, many thanks.
You should be using llama3.2 on the PI, which is designed specifically for edge devices like SBCs or smartphones
Wow, educational, interesting and inspiring! Thanks for showing us what is possible, in detail. I'd not even heard of ollama!
Top notch work Dave!!! Thank you!
My next-door neighbour has an autistic son aged 10. I am reading as much as I can find to understand the condition. Your book is my latest purchase. I'm not sure if it will help the lad as he has very complex needs, but the knowledge will be useful.
There's a lot of overlap even between mild and severe cases, so hopefully the info is still useful!
@DavesGarage Thanks, I'm sure it will help. I love your work on the channel. Keep it up.
I also came here for the dog playing the piano. You're the best, Dave!!
Have already tested on a few machines in the meantime, was good fun! My new Asus Vivobook with the Ryzen AI 365 runs the Llama 3.2 model very similarly to your desktop even when not using its GPU. I also tested an Intel N100 box and that is indeed barely useful, like the rPi, even though it had 32 GB of RAM available. An i7-1265 machine, although significantly slower than the Ryzen AI 365, was also quite usable.
I wonder, however, why you don't also install Ollama itself in Docker? On the Ollama GitHub there is a single Docker install running both Ollama and open-webui in one go, so easy :).
I just watched a video on the limits of LLM error rate as it relates to parameters, performance, etc. Basically, the relationship is asymptotic: more is better, but the returns diminish logarithmically. I think most people won't understand how AI models are being designed for levels of complexity and ambiguity that are difficult to grasp.
They do this by having a massive number of parameters and ability to discriminate finer and finer details. These are use cases for AI to interact with humans in a visual and audio world that is absurdly complex, all while hoping to have the ability to interact with millions or billions of humans.
For the AMD 7840, you should try LM Studio on Windows; I can run Llama 3.1 7B with respectable results. The GPU can be used; however, the NPU is idle.
You should use the --verbose flag when running the examples as it will give the tokens/sec
Nice one Dave, bravo.
Think the next good video should be on how to train it on your own data. Let's say a simple MS Access local DB?
I'll second that. It would be interesting to see what it takes to turn a database of help desk ticket problems and resolutions into an LLM which could try to answer technical questions.
Definitely! Or a collection of things, such as a bunch of emails or source code files.
This
@Dave's Garage: Thanks for the video! That LLM on Raspberry Pi looks painful, ouch.
I am testing some new beta releases of Windows Server and other Windows OSes, and I got my rig over here running on a Corsair Origin Neuron with an AMD 7950X3D and an NVIDIA 4090 GPU. I was not impressed with the last LLM software I used, but I am going to check out your recommendations in the video. Thanks! I usually go to ChatGPT for my subscription plan, but there are many use cases where I prefer working offline. Thanks again for all the awesome videos!
Good info… answers many questions I had without me having to do the experiments myself, so thanks.
Love it. Would also like to see a chart showing tokens per second on the same model across the hardware. Good Ollama benchmarks are hard to come by.
Even though you brought the 50K machine to its knees and were somewhat saddened, I'm guessing there was a well-hidden smirk as well... 😅
Awesome video Dave. I was playing with Stable Diffusion. Will try to explore Llama in WSL
I think it's worth mentioning that the quality of the wording is also important, not just the speed of an idea. Something well thought out has more value, and I personally could see the value of your expensive machine as a host body for the language model in the quality of the sentence that it came up with. Maybe it's nice to think about something for a bit, but I didn't see the word 'delightful' in the other examples. Thanks for making this video.
As always, great video Dave.
Bro, to use AMD GPUs you have to install ROCm... and for older/"unsupported" GPUs it's best to use Linux, because you can trick ROCm into running. That's how I use my RX 6600 XT: I trick it into thinking it's a 6700 so it loads. I can run up to 13B models on my modest system.
I have a 6800M in one of my laptops and a 6800 XT in my desktop, but I cannot get ROCm to install; it's frustrating.
@@erikgiggey4783 ROCm on Windows BLOWS. I use Linux, so it's not an issue, other than that I have to set an environment variable to force it to think it's an RX 6700.
@@tohur Rocm on Windows worked pretty well for me using zluda to run Stable Diffusion.
But now I just use Distrobox on Linux and it's so much better lol.
You can run 405b on a Vega 56 or other Vega cards that support HBCC. They can use system memory (including swap on Linux) as "VRAM".
I'd recommend Vega 20 and Optane U.2 drives if you want to go that route, though. NVMe swap is not ideal, even for my 4 NVMe drives. Most Vega 20 GPUs like the MI60 support PCIe 4.0, which helps. Also, the more real RAM and VRAM your system has, the better.
Great video, thanks for sharing 👍
There's a 3.2 11b that will be out soon. That's probably the sweet spot for most people, especially for 12 GB and up GPUs. It also adds image support.
Thanks for tickling my fancy with the "Do it Len" animations! 😂
I think the GUI of Jan makes installing models and trying things out more convenient. It also lets you set instructions per what it calls threads, which are basically what ChatGPT calls a new chat, and it has a nifty feature where you can tweak settings on the models and use different models per thread. For example, I have one model that's been trained heavily on code/documentation, which is useful for searching when I remember the concept of some language feature I need but not the specific keywords in the language I'm working in (most relevant when it's a language I either haven't touched in a while or don't use often). I have a separate model that's been trained on a lot of fictional writing, which I use to help proofread things I wrote. Even if it doesn't give me the fix I want, it at least points out where certain errors are that need looking at.
Another nice thing about Jan is that you can hook it up to online services as well if you want, so you can keep all your LLM stuff in one place. I'm predominantly doing things locally, but I know at least one person who does ChatGPT stuff through it.
Nice episode. I have been playing with a local AI on Win 11 (using LM Studio) on a 7950X / RTX 3070 Ti. I also have an RPi 4, an Orange Pi 5+ and an old 4790K that I am loading Linux onto. This video helps me decide what's fast enough.
Great episode! I loved this one.
It was nice to see the canals of the city of Brugge in the background of the Windows machine.
Perfect! Just in time for me to install Ollama on my new Lenovo Yoga Slim 7x Copilot+ PC with the Snapdragon X Elite processor and NPU!
I love your channel! The OGs of Tech Samurai!
Thank you for this, Dave!
Yeah, I installed ollama after your video. I had to comment some stuff out of the install script because it didn't notice that on my Fedora machine the CUDA drivers were installed from RPM Fusion. But after the install script went through, it works crazy fast on my office machine (i7-7800X / RTX 4070 Ti). And even on my old living-room machine (i5-4670K / Quadro P2000) it works faster than I can read, so it's enough ;)
If you type /set verbose you'll see the exact tokens/s as well as other inference stats, just FYI.
Thanks for listening to the comments. Great video!
Quick note: the 8b llama 3.1 model is about 5 GB, yet around 11:10 you said you were running the 70b model. Maybe you used the larger model on your PC and Mac and forgot to mention that? I did check the video footage, and you used the latest tag, which pulls the smaller model.
70b parameters... not file size.
I don't believe any of your examples ran the 70b model. I would be curious to see that.
@@ChristophBerg-vi5yr He said he watched the vid; if he did, then judging by his erroneous question he obviously didn't understand it.
@@ChristophBerg-vi5yr Yeah, that should be more than 5 GB; the 8b model is close to 5 GB, the 70b one is much larger.
@@TBizzleII Yes, same. I use the 8b and 70b llama quite frequently on my EC2 instance, so it raised red flags when he said the 70b model was running.
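For anyone confused by the tags in this thread, this is roughly how it looks from the command line (sizes are approximate and assume the default 4-bit quantizations ollama ships):

```bash
# the bare name pulls the :latest tag, which for llama3.1 is the 8b model (~5 GB)
ollama pull llama3.1

# the bigger models must be requested explicitly (roughly 40 GB and 230 GB downloads)
ollama pull llama3.1:70b
ollama pull llama3.1:405b

# shows what is actually on disk, with file sizes
ollama list

# recent versions also report parameter count and quantization per model
ollama show llama3.1
```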
Well-made, full of information for the public.
Tried this on a 13900HX laptop, 40 GB RAM, RTX 4080 mobile with 12 GB dedicated and 19.8 GB shared (according to Task Manager). It does seem to use the GPU: 5.8 GB of memory in use while ollama is running the Llama 3.1 model. Using the same test prompts as this video, it is wicked fast and doesn't seem to peg the CPU or GPU. I see no reason you couldn't do this on any NVIDIA mobile GPU going back to the 1060, so long as it has 6-8 GB of RAM. I could not get ollama serve to work on the command line, but I'm a Windows terminal noob. The GPU never maxed out over 16% usage. Now pulling the 405b model and will update later.
Also, I think there is a correction to your video, Dave. You mention that you ran the 70b model on the lesser machines, but I don't think you did: "latest" pulls the 8b model. I'm totally green on the topic, so I don't really know what the difference is between the models.
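On the ollama serve issue above: on Windows the installer normally starts Ollama as a background/tray app that is already listening on the default port, so running ollama serve a second time tends to fail with an address-already-in-use error. Assuming the default port hasn't been changed, a quick sanity check is:

```bash
# if the background service is already up, this should answer "Ollama is running",
# in which case there is no need to run "ollama serve" yourself
curl http://localhost:11434
```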
405b is going to suck. Even the 70b was quite slow on my system (13600K, 3080, 64 GB).
There's no way you're even going to be able to load 405b. *Maybe* you could run a 70b quant at very low speeds. Dave definitely misspoke about all the previous models being 70b--there's absolutely no way. They must have been 8b, and even then a quite quantized version. We need more of this kind of local speed testing for LLMs on the internet, but tests of models like an 8b quant are almost useless imo, because the models are almost useless. I wouldn't trust a model of that size with anything.
@@justtiredthings You are correct, the 405b did not run. Separately, ollama setting itself up to run on Windows startup also caused the first Win11 BSODs I've ever seen, 30 seconds after boot, until I disabled it. Odd, since it didn't say or ask to be allowed to do that.
I mean, my biggest problem with the previous video was running a 2 GB model on a 50k machine, a system with at least 45 GB of available VRAM... I can easily run a quantized 14B (10 GB) model on my 2080 Ti, 100% on GPU. Kinda expected more. And it seems to be the same issue in this video. Maybe an editing issue?
Maybe it’s just your expectations?
@@_chipchip It's fair to critique relatively useless testing. Dave is already putting all the effort in--the videos could be much more useful if he tested appropriate model sizes for each machine. And an 8b quant is a virtually useless model in general.
@@justtiredthings Yeah. At least back when I fiddled around with this stuff I found 13b 4 bit models were still too incoherent to be useful, which was a pity because those were the biggest ones I could get running on my GPU. I ended up upgrading to 64 GB of ram and running much larger models on my CPU. They were slooow, but the results were much better. Though this was a while ago. I assume the latest generation of models are a bit more efficient.
@@fnorgen Yeah, I will acknowledge that Qwen2.5 14b is pretty impressive for its size, at least. I'm new to playing with it, but I think it could probably do some useful work. But even that is almost twice the size of an 8b model, and I'm running it at an 8-bit quant, I believe. Also, Qwen2.5 is just a lot more impressive than Llama in general.
I seriously think your show is great. It's interesting and it's entertaining. I wasn't born in the age of the computers you grew up using, but you explain it in a very good and interesting way. I think there are YouTubers that could benefit from presenting material as well as you do. It's not just staring at a screen and watching you do stuff.
Dave: Your list of supported Radeon GPUs suggests to me that ollama demands ROCm driver support for GPU acceleration. Unfortunately for us ordinary users, AMD's presentation to investors all the way back in spring 2020 made it crystal clear that ROCm was geared more towards the data center and the AMD CDNA architecture, not the consumer RDNA architecture. AMD has only supported ROCm on the very top-tier, more expensive consumer GPUs.
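On Linux, a quick way to see what ROCm actually recognizes on a given card (tool names assume a standard ROCm install):

```bash
# list the GPU agents ROCm can see; the gfxXXXX string identifies the architecture
rocminfo | grep -i gfx

# utilization and VRAM readout for supported cards
rocm-smi
```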
WSL2 Linux on Windows is a perfectly cromulent decision. That WSL2 tech is magical.
When you did the intro to the last video, I knew this would be a follow-up kind of video. It made no sense to just leave the demo out of YouTube watchers' reach :D
The highlight for me was the internet speed flexing downloading a 240 GB file at 300 megs per second. 💪💪💪