Good Morning Everyone. Have an amazing day.
GM ☕
@@alexandrew83 bless up
Man, you have no idea how much help this video has been. There are so few reviews of the 4090 for LLMs. Awesome video, kudos!!
Glad it has helped. Any other questions on 4090s for LLMs that you have?
@@DigitalSpaceport You have answered every question I had with your diverse testing. Thanks! The VRAM really holds this GPU back in NLP tasks. Would you recommend waiting for the 5090?
The video I was looking for ❤. Thank you so much. Would it be possible to “cluster” these GPUs and potentially run larger models?
Yes it does work like that. Here is a dual 4090 video. ruclips.net/video/Aocrvfo5N_s/видео.html
and you should check this channel's history for quad demonstrations. You can also mix and match generations and VRAM sizes, to a point.
@@DigitalSpaceport oh sweet! Thanks! Checking it out now.
You can definitely run Nemotron 70B on a single 4090, I do that every day. You can offload about half of the model onto the VRAM and compute the other layers on CPU; this gives me around 2.5 tok/s, which is still slow, but workable.
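For anyone who wants to try that kind of split, here is a minimal sketch using Ollama's num_gpu option over its local REST API; the model tag, prompt, and layer count below are placeholders, not the exact setup described above. If num_gpu is omitted, Ollama picks a split automatically; pinning it just makes the VRAM/CPU balance explicit.

```python
# Minimal sketch: cap how many layers Ollama offloads to the GPU so a 70B quant
# runs partly in VRAM and partly on CPU. Assumes Ollama is serving on its default
# port; the model tag and num_gpu value are placeholders to tune for your card.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nemotron:70b",     # whatever 70B tag you actually pulled
        "prompt": "Explain mixture-of-experts in two paragraphs.",
        "stream": False,
        "options": {"num_gpu": 40},  # layers sent to the GPU; raise until VRAM is nearly full
    },
    timeout=600,
)
data = resp.json()
print(data["response"])
# eval_duration is reported in nanoseconds, so this gives a rough tok/s figure
print(round(data["eval_count"] / (data["eval_duration"] / 1e9), 2), "tok/s")
```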
We should see 3090 and 4090 prices come down when the 5090 drops, so hopefully 2.5 t/s doesn't have to be the norm. I also suspect Nvidia releases a 32B in the next few months.
Running Llama-3.1-Nemotron-70B-Instruct_iMat_GGUF/Llama-3.1-Nemotron-70B-Instruct_iQ2xxs.gguf (MarsupialAI, 19.1GB) using LM Studio and a 7900 XTX: 14.81 tok/sec. It fits.
@@DigitalSpaceport My dual 3090 rig runs nemotron 70b Q4 (a 43 GB model) at 14.7 TPS. Definitely usable. That's with a 270 W power limit applied.
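A quick sketch of how a power cap like that can be checked, or applied, programmatically, assuming the pynvml bindings (nvidia-ml-py) are installed; the 270 W figure simply mirrors the comment above, and the usual one-liner is `sudo nvidia-smi -pl 270`.

```python
# Sketch: read (and optionally lower) the power limit on each NVIDIA card with
# pynvml. Setting the limit needs root/admin privileges; querying does not.
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetName, nvmlDeviceGetPowerManagementLimit,
    nvmlDeviceSetPowerManagementLimit,
)

TARGET_WATTS = 270  # the cap mentioned above

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        name = nvmlDeviceGetName(handle)
        limit_w = nvmlDeviceGetPowerManagementLimit(handle) / 1000  # API reports milliwatts
        print(f"GPU {i} ({name}): current power limit {limit_w:.0f} W")
        # Uncomment to apply the cap (requires elevated privileges):
        # nvmlDeviceSetPowerManagementLimit(handle, TARGET_WATTS * 1000)
finally:
    nvmlShutdown()
```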
I think it would be fun to see a GPU showdown for these AI tasks. Compare some Tesla GPUs and some budget consumer GPUs.
Good idea
For the 5090, Nvidia should include a trolley
I think I know why Qwen went crazy: by default, Ollama uses a 2048-token context limit, so I think Qwen exceeded the limit and couldn't see "limit to 5000 words" anymore, so it just kept going. In openwebui, you can set the context length.
I think you're onto something there. I had set it to 4096 but it's back to its default now.
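For anyone hitting the same runaway output, a minimal sketch of raising the context window per request through Ollama's API (local default port assumed; the model tag is a placeholder). The same num_ctx parameter can also be baked into a Modelfile so it sticks, and Open WebUI exposes the same setting in its model options.

```python
# Sketch: override Ollama's default 2048-token context window for one request so
# long instructions (like "limit to 5000 words") don't scroll out of the context.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b",        # placeholder tag, use the one you pulled
        "prompt": "Write a short story, limited to 5000 words.",
        "stream": False,
        "options": {"num_ctx": 8192},  # default is 2048
    },
    timeout=600,
)
print(resp.json()["response"][:500])
```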
Hi, another great video! Regarding the speed: is there a big difference in speed between the 4090 and the 3090, because of the extra CUDA cores?
Yes, in some AI applications the 4090 is 60-80% faster than the 3090, such as when the model needs to be unloaded several times. In most applications the 4090 is ~40% faster.
I'm going to test this since I have things in a mess right now already. I'll use the same models unless you have an extra one you would like checked.
The number of output tokens varies with the number of CUDA cores and the amount of VRAM. The 3090 has been discontinued, so new units can no longer be purchased. In that case, which of the 40XX cards should I buy to get an output token rate similar to a 3090?
For inference only, the decision about which model size you want to run should drive the purchase. If you are targeting 8b-class models, the 4070 16gb is a very economical route. For larger models, a combination of GPUs that reaches 48gb of VRAM is preferred.
That is a comically large card :D
Great video! I downloaded both recommended models and they are super fast. Is there a site that lets one know what models will fit completely in a GPU at certain quantization levels? Do you have a RAG video with openwebui and a 3090? Thanks
I just use the GB listed on the ollama site, and realistically not a lot of the older models are that good. I don't go back past llama3.1 myself for main models. The RAG video is getting redone. Making a video on a topic that's new to me gets out of scope fast, and folks shouldn't watch multi-hour videos of me rambling and fixing things.
@@DigitalSpaceport Sounds good. I used to try to get large models running, but it's better to get models that fit 100% in the GPU. I found a decent calculation site, and it looks like qwen2.5:32b-instruct-q5_K_S is the largest new model that can fit 100% in a 3090, at about 20 TPS response. I've been so busy setting up an EPYC virtualization server (virtualized storage server too) and an AI workstation, but it's all coming together! I want to get RAG and searxng set up for openwebui. I wish I could tell my AI to build those projects while I sleep without me submitting hundreds of commands.. :)
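As a rough cross-check on those calculator sites, here is a back-of-envelope sketch; the bits-per-weight, layer count, and KV dimensions below are approximations for a 32B-class model, not exact figures for any specific GGUF.

```python
# Back-of-envelope check of whether a quantized model fits in VRAM: weight bytes
# plus a rough KV-cache and runtime-overhead estimate. Quant formats vary in their
# effective bits per weight, so treat the result as a sanity check, not a guarantee.
def vram_estimate_gb(params_b, bits_per_weight, n_layers, kv_dim, ctx_tokens, overhead_gb=0.8):
    weights = params_b * bits_per_weight / 8                   # billions of params -> GB, roughly
    kv_cache = 2 * n_layers * kv_dim * ctx_tokens * 2 / 1e9    # K + V, fp16 (2 bytes each)
    return weights + kv_cache + overhead_gb

# e.g. a 32B model at ~5.5 bpw (q5_K_S-ish) with GQA (8 KV heads x 128 dims), 2048 ctx:
est = vram_estimate_gb(params_b=32, bits_per_weight=5.5, n_layers=64, kv_dim=1024, ctx_tokens=2048)
print(f"~{est:.1f} GB needed; a 24 GB 3090 is a tight but plausible fit")
```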
The searxng/redis/tika setup video I did has all that compose stuff linked in the description, so you could paste it in. Might save some time. I think it's labelled vision + web in the video history. Don't forget to leave some room for embed and vision models.
@@DigitalSpaceport Okay cool, I'll check it out. Thanks!
Never miss a chance to get a view, they tell me lol ruclips.net/video/IC_LGmqjryg/видео.html
I've come down to this, for personal use anyway: it's not really about tokens/sec once you pass a certain threshold of say, 9 or 10 - anything past that is gravy. It's more about memory usage on the GPU now, and running the larger models is still just out of reach for a single consumer card. And once you get past the consumer cards, it gets expensive real fast, and they need special cooling, etc. Finally, holy fat bottoms batman! That card is massive in size! I don't have a single case that would fit that monster, not even the Supermicro AI server I built, which is in a 4U rack! 32GB in the 4090 would be better, and really what we need is a cheaper H100 with modern cores and a consumer package.
Can we use 2 RTX 3060 12GB cards so combined we get 24GB of VRAM? One more question: how many TOPS will I get on a single RTX 3060 12GB card on average, irrespective of which model we are using?
I love your content
I can answer your first question: yes, easily. The second is around 90, but that may not be as important a metric, I think.
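Ollama and llama.cpp will spread layers across both cards on their own, but if you want to see the split explicitly, here is a sketch using Hugging Face transformers with device_map="auto" (requires accelerate); the model ID is just an example of something that overflows one 12GB card but fits across two.

```python
# Sketch: shard one model across two 12 GB GPUs so their VRAM effectively adds up.
# Requires: pip install torch transformers accelerate. The model ID is an example;
# at bf16 a ~7B model (~15 GB of weights) won't fit on one 3060 but fits across two.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"   # example model, swap in your own
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",                  # accelerate places layers on GPU 0 and GPU 1
)
print(model.hf_device_map)              # shows which layers landed on which card

inputs = tok("Two 3060s, one model. Briefly explain why that works:", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))
```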
Awesome video. Would love to see an AMD MI60 in a rig! Older but cheap, and 32GB.
I love playing with Flux and some other models with a mid-high end GPU....but your use case seems to be spending $5k to have two high-end GPUs tell you what you can cook with the contents of your pantry lol. Not hating on the love for crazy good hardware....but I don't understand the application of it in this case.
Practical application and creativity are the realms that those questions are testing. I do use it for such tasks myself, as well as many other tasks not demonstrated so far. The questions evaluate quality in no small part on things like following details and completing answers, which expose the kinds of minor failures a model can make. Also I have a lot more than 2 high-end GPUs 🤓
1:22 what kind of motherboard?
The build specs are linked in the description. Sorry, I can't copy-paste for some reason in the studio app and I'm out right now.
Because people will ask me - 3x 4060 Ti 16GB GPUs - here are some of my numbers: 3.2b 3B_instruct_fp16, 38 T/S on 1 GPU; Qwen2.5 32b_instruct Q6_K, 9.6 T/S on 3 GPUs. Qwen does cause ollama to freak out however, as noted - if anyone can suggest how to get these models working it would be appreciated. (I got "GGGGGGGGGG" as the output for the story question, and then it was unresponsive until reset.)
How much would you accept for 1 hour of consultation?
I am sharing as I learn, but I am not qualified for consulting on these topics. I do appreciate the thought however 🥰
@DigitalSpaceport Would you accept $25 an hour? I'm sure I won't need anything past 1 hour for our 1st meeting. I just want to run a vision by you, and you tell me if it's possible in your opinion, and if so, what equipment/setup I'll need to bring it to life.
When I saw it I immediately thought: how big is the 5090 going to be…
It's gonna be massive
@@DigitalSpaceport can you palm a 5090?
4090 is that big? or are you a small person? I'm confused.
I'm like Oompa-Loompa tall, bro. Low blow.
Weird, my Strix 4090 gets way better results. Might be my 128gb of RAM?
Can you tell me more about your complete hardware and software setup and results please?
@@DigitalSpaceport I have an ASUS Pro WS WRX80E-SAGE SE WiFi II with an Asus Strix 4090, a Threadripper Pro 3955WX, and 128GB RAM. I'm getting about 40% better tokens-per-second generation using the same models, GGUF versions, on LM Studio.
@@DigitalSpaceport I have an ASUS Pro WS WRX80E-SAGE SE WiFi II, with the Asus Strix 4090, a Threadripper Pro 3955WX, and 128GB RAM. I am getting about 40% better results than you with the same models; not sure if it is because of my RAM, or because I am using the GGUF versions on LM Studio.
If I were paying an outrageous amount to get a GPU early, I would much rather pay the scalper, because most of them are just normal people trying to make a bit of cash on the side because life is hard. Nvidia is not hard up for money.
I also have no desire to pay a dime over a $900 MSRP for an 80-class card at any point in time.
Scalpers are unlikely to go away. $3500 5090s on ebay are likely.
I wonder... will the next AMD RDNA4 come with 42 GB of VRAM? AMD, wake up, hurry up.
🎉😅
AMD can also go MAD, like they make their GPU customers lol
The 4090 is almost bigger than you 😀
5090 will put us all in our place 🤏