AI Hardware Writeup: digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways
I would like to see the major open-source LLM developers start focusing on good 16-24B parameter models. You wouldn't have to quantize them much to run them locally, so they would still retain most of their quality. This 70B model is impressive, but you still need pretty expensive hardware to run it.
Mistral Small seems to be the best for 24GB cards, the others being Qwen2.5 32B and Gemma 2 27B. Mistral-Nemo is my second go-to; it will fill up memory if you raise the context, but it still allows fast TTS and watching 4K YouTube vids simultaneously. Ministral is probably going to replace it for me. Tried Nemotron and was getting 2 tk/s, and it wrote a story that completely avoided the prompt by not having the cat eat the mouse; rather, they became friends instead.... Going to have to try the jailbreak technique, but the model is probably going in the trash :( way too slow
My thoughts exactly. 8B is not good enough, while 8-16GB GPUs can certainly run something bigger.
Try Google's Gemma 2 27B.
Seriously! I really do not understand why nobody is training LLMs that fully utilize 16GB-24GB cards.
Who made up this 1B, 4B, 8B, 70B format?!
Money is usually the answer.
The 70B models fit nicely in GPUs with 80GB VRAM and are thus relatively cheap to run compared to ChatGPT, yet they can be nearly as powerful. That makes a lot of business sense.
The 8B models of today are more capable than the 13B models of a year ago and fit nicely in GPUs with 8-12GB VRAM, which are common. It is extremely expensive to train LLMs, so you want to make sure that you have a big market for them.
Mistral Nemo is 12B, Microsoft's Phi-3 Medium is 14B, and Gemma 2 is 27B; they all work great on cards with 16-24GB VRAM.
Fabulous content, thank you for sharing these interesting results! In your tests, Nemotron is using the output from your previous test with Llama 3, since all previous content from the chat is sent to the model as context, even though you switched models. You can see this clearly from some of the responses (most evident with the pi digits question). This gives the later models being tested an advantage. Could you perhaps run future tests in a new chat each time?
It's also why the model thought he was asking about the apocalyptic scenario as a plot for a book.
Great point, will do.
Wow, your setup just earned a sub.
Appreciate that!
Great info as always. Especially interested in the jailbreak content at the end. When laws are "ABSOLUTE", there can be NO justice. The morality wall introduces bias in these LLMs, which in turn means you can't trust them...
My rant: your PC builds are unattainable for the majority of your viewers. I know at the end you detail models that run on less, but we, the majority, simply do not have the capital to purchase more than a single RTX 3090 or 4090 (forget the 5090). We really can't learn alongside you on our own rigs out of sheer $$$$ limitations. Maybe start where we are and then show us where we could go with enterprise/server motherboards.
Title Ideas:
- "Best Ollama LLM's for 24GB & 15GB cards"
- "The best local AI setup for Single RTX owners"
For the record, I have managed to run it at a slow but kinda usable speed with just one 3090 and some RAM (something like 17% in RAM), using a 2-bit quantized version. It actually works surprisingly well at 2-bit.
It really is great at coding Python! It made an awesome snake game with multiple levels and high-score saving, keeping the 5 highest scores and saving them to disk. It even added sound effects, although I needed to provide the samples myself.
Hi, how do you display live GPU statistics in PowerShell, and how do you add more options for model capabilities other than vision? I'm using Open WebUI and Ollama with the latest Docker version. Thanks
nvtop, but I'm not sure if it works in PowerShell. I'll test that and include it in the next video here.
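In the meantime, one option that does work from PowerShell is polling the NVIDIA driver through the nvidia-ml-py (pynvml) bindings. This is just a minimal sketch, assuming you have the NVIDIA driver installed and have run pip install nvidia-ml-py; it prints utilization and VRAM per GPU once a second:

```python
# Minimal live GPU stats poller via the NVML bindings (pip install nvidia-ml-py).
# Works in PowerShell, cmd, or a Linux shell as long as the NVIDIA driver is present.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)  # percent busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)         # bytes used / total
            print(f"GPU{i}: {util.gpu:3d}% util | "
                  f"{mem.used / 2**30:6.1f} / {mem.total / 2**30:.1f} GiB VRAM")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```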
Are you using x16 PCIe lanes per GPU? Can we use fewer than x16 per GPU? Thanks!
Yes, you can absolutely use fewer lanes for inference (using models) workloads with negligible impact. I should test PCIe x1 risers; I bet even those would work. If you are doing something like training (creating models) or some RAG workflows, it will have a catastrophic impact and basically not work well at all.
So it's too big to run on one 3090, but you can spread it across multiple GPUs?
Just found the channel. Great videos! What model would you recommend for one RTX 4090?
Good timing on finding me. About 8 hours from now there is a video on that exact topic. Overall, Qwen2.5 is pretty great, but there are others.
Gemma 2 27B runs fine on a 4070 Ti SUPER, so it should run great on a 4090.
How do you think an AMD 7900 XTX 24GB would fare? I've seen those can be used too, and as long as the model is fully offloaded to the GPU it should work.
I'm not sure, but if you do get one (or any modern AMD card) to test, let me know!
@@DigitalSpaceport ok
What site are you on when you say "I am going to grab the latest here" in the second section of the video? I don't understand where you are.
I'm on the Ollama website, but when I refer to "latest" it is literally pulling an image called nemotron:latest, which FTR is also the q4 and the same as just using nemotron without the latest tag. It wasn't planned that way or anything; sometimes you just ramble a bit when recording lol.
7:45 Have you ever wondered why on Star Trek they all have jobs but they don't use money? Starfleet negotiating with the union for personnel: "So, you want oxygen tomorrow also?"
It was because I reused the window from last time and it didn't click that it was sending the whole thing again. User error, but it was not right before; it was the prior model test. It's been pointed out several times now and I'm well aware. It's a pass.
What are your hardware specs and GPU memory? 👍
This is the full build, here in this vid: ruclips.net/video/JN4EhaM7vyw/видео.html
Dude has 4 x RTX 3090s
Would 2x 4060 16GB cards handle this model?
You know that we can see your context, where you explicitly DID say that you were writing it for a book? Similarly, the game had a whole prior context where it had written some code. Be aware of the context, because your testing is not ab initio. (Edit: Not saying it's not a great model, my testing says it's pretty good also, but the context is going to affect it.)
Ah, you probably didn't see the other dude's comment that reminded me I had reused the chat window (lazy) and that I shouldn't do that, as it resends the whole thing to the LLM again, which didn't click for me when I was recording this (dumb). That was leftover from prior model testing. Luckily I am okay with looking dumb, as I am learning more and more myself every day. Also thanks, don't let me slide on things I am doing wrong. Appreciate it.
@@DigitalSpaceport I did not, unfortunately. I skimmed to see if anyone had mentioned it, but I must have missed their comment myself. :) No worries; I've just been digging in to see what folks are doing and how folks are evaluating the model (since the usual measurements are getting more questionable over time), and I liked your questions. I'm also curious if you have a default system prompt in place, which might also bias the answers. Lastly, I was _fairly_ sure that it's a text-to-text model _only_, not multi-modal, so it interpreted what your other model told it the image had. I wasn't sure you were being clear about that part. I'm running it at q8 and it's great, but that's 'vibes', not detailed testing. :) Edit: Oh gods, it's the next comment down, I only read the first sentence... 🪦
I noticed that you have kept using Windows instead of Linux for your recent videos. Is it easier for exploring and testing new models?
No, it's not easier to test or anything. It is much more stupid than that. I spent several hundred bucks on Elgato USB 4K capture cards so I could plug in DSLR cameras and get good video quality. Guess what doesn't have Linux support? I keep telling myself I need to sell them and buy new cards, but the Blackmagic ones that work in Linux are a good deal more expensive.
Great vid! Sub'd!
Welcome to the channel!
OK, so I want to, just for the sake of an interesting intellectual exercise. The only way to have a greater good would be if there were life outside of this planet whose civilization our extinction would somehow save, and that civilization would have to be larger or far more significant than we are for one reason or another. Either that, or you would need to value insects and the other life forms that would survive whatever catastrophe more than the entirety of the human population. But I like your point, though.
Would this run an agent swarm to use for crypto trading? I'm looking to build a local AI network to run my swarm.
You would likely want tool support, which Nemotron does have, but I'm not sure. If this model works, let me know.
@DigitalSpaceport I'm in the pre-education phase. What would be your recommendation for a user-friendly Linux release to help me get away from Windows? I don't want to use Apple. I will make fully secure networks to isolate my swarm, with API tunneling to use cloud services for larger controller agents.
Nothing better than just following my latest guide on Proxmox and Debian LXCs. It's not just what I do in videos, it's what I use myself too. Fast, capable, and highly functional. ruclips.net/video/lNGNRIJ708k/видео.html
How do I adjust Ollama settings when it's not loading the model into memory? LM Studio loads the model into memory fine without any adjustments. Thanks
How many and what GPUs are you running?
@@DigitalSpaceport I have one 3090 and 128GB RAM. I can load a 70B 6_K into LM Studio, and I see it loading into the 3090 and sometimes into the RAM, and the model performs well. But when I try to load a 70B model with Ollama it doesn't load it into memory and the model won't output text; it works well with smaller models. I'm using Open WebUI with Ollama too. I have the factory Ollama and Open WebUI settings and the factory LM Studio settings. Thanks
Hmm, it's llama.cpp under the hood with Ollama, and I've seen their auto-layering in action on the 405B. I think you are hitting a timeout. Adjust that up in Open WebUI and see. That initial split load can take a really long time, and Nemotron is also slow for a 70B, but more accurate.
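If you want to rule the UI timeout out completely, you can also pre-load the model by calling the Ollama API directly with an empty prompt, which just loads it into memory without generating anything. Rough sketch below; it assumes Ollama is on its default port and uses the nemotron:latest tag from the video, so swap in whatever quant you actually pulled:

```python
# Pre-load a large model through the Ollama API so a front-end timeout
# can't interrupt the slow initial VRAM/RAM split load.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nemotron:latest",  # swap for the quant you pulled, e.g. a 6_K build
        "prompt": "",                # empty prompt = load the model, generate nothing
        "keep_alive": "30m",         # keep it resident so follow-up requests are fast
        "stream": False,
    },
    timeout=600,                     # give the first load up to 10 minutes
)
resp.raise_for_status()
print(resp.json())                   # should report the model as loaded/done
```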
@@DigitalSpaceport I had the NVIDIA container software installed, but I needed to adjust some other settings and bind my NFS share for the models. It's working now! It seems I can load the 70B Nemotron 6_K model entirely into GPU, although it's slow. Any way to speed it up, or any tips? Thanks!
Can your quad RTX 3090 setup handle quant 8? I am going to have the same setup but want to trial it with q8.
Yes, it can run it at q8 and leaves enough room for another, smaller model like an 8B easily.
Hi, love your content… After 15 years of Apple I want to return to desktops and my own build for AI… For this model, what amount of GPU and RAM do you see as adequate? Greetings from Munich, Martijn
The Flappy Bird game is accurate to the original by DotGears in its incredible difficulty. It's an achievement to pass the first pipe lol.
I picked a game I was horrible at then and now lol. It getting the score and restart in was the first time I've seen that level of accuracy.
Can I run this using one RTX 3080? Or will it be slow af?
@@guitaripod If you just have 1 GPU it would be very slow, and you would need to have 64GB of RAM. Ollama will place the layers that don't fit in VRAM into physical RAM. I feel that speed, 1 t/s-ish, would be unusable. I'd go with Qwen2.5 on that GPU.
@@DigitalSpaceport thanks dawg
Wow, thanks for the reply DigitalSpaceport. So does that mean this is doable if I had 128GB RAM and 1 NVIDIA 3080?
It shouldn't take 128, but likely 64. Heck, maybe even 32. It depends on whether the broken-up layers can be laid out across the VRAM and system RAM and fit inside them. No reason not to try. Do adjust the timeout higher if you want to try; it certainly won't be fast, and the initial load is likely to time out.
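For a rough idea of what will fit before you download anything, here is a back-of-envelope sketch. The numbers are my own assumption of roughly bits-per-weight / 8 bytes per parameter plus about 15% overhead for KV cache and runtime buffers, so treat them as ballpark only:

```python
# Ballpark memory footprint for a quantized 70B model and how much would
# spill past a 10 GB card (e.g. an RTX 3080) into system RAM.
def rough_size_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 0.15) -> float:
    return params_billion * (bits_per_weight / 8) * (1 + overhead)

VRAM_GB = 10  # RTX 3080
for bits in (2, 4, 6, 8):
    total = rough_size_gb(70, bits)
    spill = max(0.0, total - VRAM_GB)
    print(f"70B @ q{bits}: ~{total:.0f} GB total, ~{spill:.0f} GB in system RAM")
```

At q4 that works out to roughly 40 GB total, which is why 64GB of system RAM is comfortable and 32GB is borderline.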
@@DigitalSpaceport thanks 🤩🙏 appreciate the reply sir, you are the hero
Can we run this on a free Azure trial VM?
I think you should try it and post back if you can. Do they offer a free GPU on the Azure trial?
@@DigitalSpaceport Good point. I will try. What's the local GPU capacity that you need to run such models?
Try the Qwen2.5 model out. It comes in a very wide range of sizes, all the way down to about 1GB, so hopefully one of those can fit. It's also a good model.
You asked it right before to reconsider your previous question in the context of a book; that's what it was referring to. But it is really dumb anyway. A smart AI would have told you that you are lying, especially if it is able to search the web. If pushed further, it would have understood that the crew will be dead either way, but it would have also told you that it is not the best option for the job. In terms of coding it should be compared to base Llama 3.1, and then further against Llama Reflection and Claude Sonnet, which in terms of coding is the reference in my opinion. Only o1-preview is stronger in reasoning, and likely in coding too, but it has multiple issues that make it unusable for more complex tasks.
Your channel is so freaking educational, I love it.
Do you think your rig can run a 400B model?
It can, but painfully slowly. Unusable, really.
@@DigitalSpaceport Wow, and the machine is a super beast too. What if it had 1TB of RAM instead of 512GB? Do you think it would be usable?
@@lshadowSFX 1 billion parameters takes roughly 1.5 gigabytes of space. You can get it down to about 1 GB per billion, but you sometimes run into issues. I launched 70B on a 7282 EPYC with 64 GB of RAM: about 40 GB was eaten, performance 1-1.8 tokens per second (the 3070 failed, but if you launch an 8B model on it, it is more or less tolerable and gives a response quickly).
12.9 tokens/second on this new one. NICE.
Would you ever be interested in doing some co-hosted livestreaming during model testing? Might be fun.
Writing a non-working Python version of Flappy Bird is not a sign of being good at coding. I understand that for someone who can't code, creating Snake or Flappy Bird might be impressive, but it's essentially like asking an AI to do the 5 times table and then being amazed it can do math.
You are making a bunch of assumptions, and while I appreciate your viewpoint, in essence on some level everything is next-token prediction. The reason I was going on about this particular one is that it is far better than anything any other model has demonstrated to date, and not by a little. Welcome to the channel.
"The results where absolutely great".. Dude, you have an unplayable game.. but ok
Have you seen the utterly unworkable garbage every other model has produced? This one had a restart and a score that it decided to include on its own. This could be worked into something playable. No other time I've asked this question has it gotten this far. Not even close.
@@DigitalSpaceport That might be correct, but it isn't great. I'm a software engineer myself, so I set the bar a bit higher.
I'm not afraid of being replaced by AI or something. I'm just saying that the result you got is OK-ish, but still very much unplayable.