DeepSeek R1 Hardware Requirements Explained
- Published: Feb 8, 2025
- Wondering what hardware you need to run DeepSeek R1 models? This video breaks down the GPU VRAM requirements for models from 1.5B to 671B parameters. Find out what your system can handle.
In this video, I failed to mention that all the models shown are quantized Q4 models, not full-size models. Q4 models are smaller and easier to load on computers with limited resources. That's why I used Q4 models: to show what most people can run on their computers. However, I should have mentioned that these are not full-size models. If you have enough hardware resources, you can download larger Q8 and fp16 models from Ollama's website. Also, I didn't cover running local LLMs in RAM instead of VRAM in detail, because this video focuses mainly on GPUs and VRAM. I might make another video explaining running them in RAM in more detail.
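For rough sizing across quantization levels, here is a minimal back-of-the-envelope sketch in Python. The effective bits-per-weight values are assumptions (real GGUF files mix precisions and carry metadata), so treat the output as ballpark figures rather than the exact download sizes listed on Ollama's site.

```python
# Rough model-size estimate per quantization level.
# Assumption: size ~= parameters * effective bits per weight / 8; real GGUF
# files mix precisions and add metadata, so actual downloads differ somewhat.

QUANT_BITS = {"q4_k_m": 4.5, "q8_0": 8.5, "fp16": 16.0}  # approximate effective bits/weight

def model_size_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * QUANT_BITS[quant] / 8 / 1e9

for params in (1.5, 7, 14, 32, 70, 671):
    row = ", ".join(f"{q}: {model_size_gb(params, q):.1f} GB" for q in QUANT_BITS)
    print(f"{params}B -> {row}")
```

At Q4 this lands close to the sizes Ollama reports for the distilled models; Q8 roughly doubles the footprint and fp16 roughly quadruples it, which is why the recommendations in the video assume Q4.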
14b can fit fine on a 2080 Ti that's only got 11 GB of vram. 1.5B is a 2GB model - you don't need 8 gigs of ram for it.
Your specs all seem way higher than actually needed.
I think you should rewatch the video because I didn't say that
@@BlueSpork Okay, you did though.
0:25 "1.5B model, at least 8 gigabytes of ram" - It doesn't use nearly that much. The model's only 2 GB.
2:26 "14B model, 16GB VRAM" - It doesn't use nearly that much. I've run it on a 2080 with 11gb vram.
These are literally written on screen? I'm telling you, these specs are way too high.
@@jeffcarey3045 For the 1.5B model, I said you need a computer with 8GB of RAM-I didn’t say the model itself needs 8GB. Sure, you can run it on a computer with 4GB of RAM, but I left room for overhead. For the 14B model, I said it will run fine on 12GB of VRAM before recommending 16GB, again to allow for overhead.
@@BlueSpork I get that you're trying to be safe with extra overhead--but you're still off. The 14B model is a 9GB load, and even with a buffer, 12GB is plenty. Insisting on 16GB is overkill given real-world performance, so your caution doesn't change the fact that the numbers don't add up.
love how ppl went from having "no chance of owning intelligent robot" to "4 words per second is too slow"
It's not an intelligent robot.
@@MrViki60 it knows more than you and can do more stuff than you... who are you then
more like 20 years but yeah
If you're doing work with an AI at 4 words per second, you're going to get fired soon. Just download the app & run it in the cloud!
I don't think that a lot of people can speak faster than 4 words per second.
Thank you! Finally someone mentioning and explaining actual hardware requirements
🙏
Best Nvidia could do was give us 5 WHOLE gigabytes in 8 years (GTX 1080Ti - RTX 5080). Blessed be thy leather jacket!
One day, when the 5090 comes back in stock, you'll be able to get 32GB without paying obscene amounts of money for 80GB of VRAM. At least ollama is pretty good at splitting a model across multiple GPU's--I ran DeepSeek-r1:70b on a pair of 3090's I rented and it was pretty fast.
I bet you anything this is intentional. Slowly trickle out tech and charge a lot for each increment. That's why I'm looking forward to seeing China catch up in the chips race. I guarantee you we'll see huge advancements then.
@@poweredbydecaf1915 If Chy-na catches up, it won't need Taiwan anymore and will start a world war. Not looking forward to that!
@ How many wars has China started?
@@poweredbydecaf1915 Bro, do you even know what Tibet is? 🤨
With Chy-na it's the "3 T's": Tibet, Tiananmen, Taiwan.
Do you know what they did to Hong Kong? Before the CCP got their grubby hands on it they had F*KING SKYSCRAPERS THERE. Without the help of Chy-na.
You don't consider those invasions?
The concept of "country" is very new to Chy-na. They've always been a dynasty. An empire. They don't consider what they did before becoming a "country". BUT THEY SHOULD.
Concise, easy to understand. Thanks mate.
You do not need VRAM to run those models. It's all about memory bandwidth. VRAM is usually around 1000 GB/s, but you can get about 500 GB/s with RAM on better motherboards supporting 8-channel memory, or even 12 or 16 channels. You can run the 671B model on such a machine at 5 T/s, and it will be much cheaper than using GPUs.
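A rough sketch of that bandwidth argument (the numbers are assumptions for illustration; treating every generated token as one full pass over the active weights gives an upper bound, and real throughput lands below it):

```python
# Decode-speed ceiling: tokens/s ~= memory bandwidth / bytes read per token.
# Assumes each token streams all active weights once; KV-cache reads and
# compute overhead push real numbers lower.

def ceiling_tokens_per_s(active_params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# DeepSeek R1 671B activates ~37B parameters per token; assume ~4.5 bits/weight at Q4.
for name, bw in [("8-channel DDR5 server (~500 GB/s)", 500),
                 ("RTX 3090 GDDR6X (~936 GB/s)", 936)]:
    print(f"{name}: ~{ceiling_tokens_per_s(37, 4.5, bw):.0f} tok/s ceiling")
```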
Some systems have cpu limitations for ram. Only 64 gig max.
Can you recommend workstation CPUs with 8-channel memory? Even an old one with 96 GB of DDR3?
Or do I need a newer-generation workstation? I can put some budget toward a test. Intel Xeon or AMD? Xeons are cheap because they lack the I/O and integrated graphics unit, and if those aren't needed you can get an 8-core CPU for just 20 dollars.
Or maybe I need DDR5 at 5000 MT/s or above, along with 8 channels.
Make a video about this and explain it to us please. And show us everything!
@@Mehtab20mehtab You still have to have the model in memory. You can use SSDs, so I have heard, but it wouldn't be very fast at all.
Not sure but I think MacBook Pro M4 Max's will be enough for the larger ones. Not entry-level hardware though.
A nice way to use DeepSeek R1 is Deep Infra, who are offering the 671B and 70B models for dirt cheap. The 70B distilled model actually works better for me and is 23 cents/69 cents per Mtoken input/output.
But the news media keeps harping on DeepSeek’s $0.14/Mtoken input! Don’t tell me the news media is embellishing.
I run the 70B model on my 128GB RAM with a 3rd gen Ryzen (R9 5950). I get around 1 token/s, which is slow, but the model is very good at reasoning and providing detailed answers.
VRAM is the secret sauce, not DRAM.
@examplerkey Correct, but I can never afford more than 16GB VRAM
@@xhobv02 So you're not going to get a 5090 with 32G of GDDR7 for the $5k street price? Where's your commitment to AI?😂😂😂
It's pretty fast on a pair of 3090's, but the 32GB model is pretty much just as smart anyway for most stuff. The 671b model is much smarter.
You can afford a 5950 but you can't afford a 3090? I feel fortunate to have a 3090 running 32B on a 5900. It's pretty snappy!
Nice! Short and crisp 👍
yeah... about 500GB of RAM for the 671b Q4... but the full model is 1.6+ TB
Thank you for pointing that out. I did not mention that these are Q4 models
@@BlueSpork ahh no wonder it didn't make sense to me.
I thought the full model was only ~700GB
this is not only the type of content i seek but also the no-boilerplate style i like. insta-subscribed.
Thanks!
I am running r1-14b on Ubuntu with a Ryzen 5900X, 16GB RAM and an RTX 3070. Getting 5 tps. But the quality of the responses is much better than ChatGPT.
Nice to have the specs. I'm tempted to try the 671b model on a server with 8 A6000s that I can rent for a few bucks an hour. That would be 384GB of VRAM, which is almost enough to run efficiently with 4-bit quantization. I can run DeepSeek-r1:14b at 11.93 tokens/s on a laptop with a Quadro P5000 video card, so it's nice to know a 3060 is 2.2x as fast. The 32b model was running at 1.73 tokens/s, but this is largely a CPU measurement. I'm tempted to upgrade to an AMD or 3090 or 5060 Ti or something.
I rented a server with 2x 3090's from Vast AI when DeepSeek-R1 first came out and tried the 70b model. It ran quite well with ollama, utilizing both GPU's at 250-300 watts. I didn't see a large difference in intelligence between the 70b and 32b models... though I wish there were a DeepSeek coder model with the R1-style thinking/fine-tuning.
Do you always rent servers from Vast AI?
Simple very clear informative video ❤❤ no unneeded nerdy talk
😁
It has really been an explanatory and beautiful piece of work. Thank you.
Thank you. I appreciate it 🙂
Macbook M3 Max 128 GB unified memory can run 32B at 17.5 token/s. Works great for me.
Great! Thanks for letting us know
@BlueSpork does this hurt the MacBook's health long term?
$600AU PC complete with nvidia GPU, 32G ram, runs DeepSeek R1 8b at 100 tokens per second.
did you try the 70B? With 128GB you should be able to do that.
I ran the 70B on my Mac Studio with 192GB of RAM, and it provided answers very quickly.
How many tokens per second? Thanks. And what CPU?
Which setup do you run on your M2 Ultra? Ollama or LMStudio?
The ROG Ally Z1E (white) works really well on 14b as long as I set the VRAM to auto and set 8 cores (via LM Studio) to allocate solely to RAM. So it's possible to go 20b on an Ally X or any 32GB-based handheld PC.
What are your settings on LM studio? How do you set vram to auto?
I accidentally tested the difference between running 85% GPU memory and 100% GPU memory for the deepseek-r1:32b on a 4090 because I launched ollama while still having World of Tanks minimized in the background.
Having the model 85% in GPU memory will get me 12.46 token/s, 100% GPU bumps that up to 40.51 token/s. It's a huge difference in user experience.
I run the 7B model on a toaster i3 Sandy Bridge with 8 GB of RAM, no GPU. Slow as hell, but it runs. The Chinese did some serious mathematical dark magic, and I'm completely grateful to them.
The 7B model runs great on my M1 Pro chip. Perhaps it's harnessing the ML cores on top of the CPU? The new 50-series cards boast a LOT of ML cores, so they should be able to outperform the 3090s significantly?
3090 also has 24G, so only 4090 and 5090 are better than 3090
You can't really use 32b on a 3090. It does run, but takes all VRAM and after a few following questions it gets into a thinking loop and times out. The only solution is to restart the model... 14b works wonderfully fast and you can do other things at the same time.
That's interesting. I'm able to run it on 3060 12GB and 32GB RAM. It uses all VRAM and all RAM and it's not very fast, but still runs fine.
thanks for your contribution! Very insightful!
What an excellent video!
Thanks this is very helpful
Glad it helped!
Used 3090 costs between €500 and €700 where I live. With that, you can run the r1 distilled 32b qwen model, pick a quant that barely fits in VRAM, at a decent speed. Or you could buy 16 * 3090's to run the whole 671B r1 :D
Q4 :)
The funny part is that it's not worth upgrading to a 4090 for LLM use because the memory bandwidth is barely higher and the huge L2 cache has negligible effect on this kind of workload... it's a much different story for Stable Diffusion where the extra compute grunt can shine.
i asked deepseek what the recommended hardware was for running deepseek locally, and it gave me numbers that were nearly double yours, lol
1. Will it run on (and use the AI accelerator of) an AMD HX370?
2. What software will it run on in this setup?
3. Are distilled models "reasoning models"?
4. Can you continue training the distilled models?
The question with the AMD npu is very interesting
1. Yes
2. Download LM Studio for Ryzen AI (not the normal LM studio, you need specifically this version)
3. Yes
4. I am not quite sure about that
2:45 - On MI25 at 220W I get:
total duration: 40.378089366s
load duration: 30.090414ms
prompt eval count: 11 token(s)
prompt eval duration: 69ms
prompt eval rate: 159.42 tokens/s
eval count: 765 token(s)
eval duration: 40.277s
eval rate: 18.99 tokens/s
I wonder if it's core speeds that affect this; the MI25 has HBM2 memory, which doesn't seem to play a dominant role in this case.
Note: your input was slightly inconsistent, at least once you had "larger" instead of "largest"
The MI25's memory bandwidth is more comparable to a 3060Ti, so it was probably compute limited. Did you check that the model loaded right and was running 100% on GPU memory?
@@Steamrick Ah, true about the memory. Despite the crazy 2048-bit bus, I forgot that it operates at significantly lower voltage and speed to reduce power and temperature. I did verify that it loads and runs fully from the GPU; that instance has only 4GB of RAM, so the 14b model would simply not fit. rocm-smi and radeontop both showed the memory and GPU load.
I just loaded R1 UD 130gb into ddr5 memory (of 192 in total) as a test and it runs fine.
Ran 14b on my Mac (M3, 24G): 9.24 tok/sec. It felt slow, so I chose 8b to run locally; that was 16.27 tok/sec.
You are the best, thank you!
I have Deepseek 70B working on an i3 with SSD drives, 32GB RAM and a 4GB GPU under Ollama.
I get about 1 - 5 tokens per second.
Ollama has a cache system and the model is a Mixture Of Experts so only a portion of the 70GB is in use at any one time.
The 70B model feels very similar to the full size version.
It did however take 1 hour to create the code for the Snake game .. which worked first time!!!
That said, students with big ideas, no money but loads of time could tolerate such slow speeds.
The problem is the distilled R1 versions are nowhere as good as the main one.
I've been running all models up to the 32b model on a 12-year-old machine, just on CPU: an AMD FX-8350 (8 cores) and 24GB DDR3 RAM. Getting 6-8 tokens per second on the 1.5b, 2.5 tokens on the 7b, 1 token per second on the 14b, and the 32b is very slow. However, the answers on anything smaller than the 14b are poor quality, so for this to be effective in a real-world setting I will need a better machine.
Would be nice to know elo rating of each
Thank you!
You forgot about the size of the context window. Ollama defaults to 2k, which is rather small for a thinking model; R1 can be set up to 128k. A 7B model with 128k ctx needs 22GB (it doesn't fit into the VRAM of my 24GB 3090; 6% spills to RAM), but a reasonable 32k only needs 9GB. A 32B model with 32k ctx needs 32GB. Sizes are reported by ollama ps; for estimating needed GPU RAM, add 1.5-4 GB depending on the system.
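For anyone wanting to estimate that context overhead themselves, here is a minimal sketch. The layer/head numbers are illustrative placeholders, not the exact configs of the distilled R1 models, and it assumes an fp16 KV cache (the cache can also be quantized):

```python
# Rough KV-cache size: 2 (K and V) * layers * KV heads * head dim * context length * bytes/elem.
# Placeholder architecture numbers for a 7B-class model with grouped-query attention.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

for ctx in (2_048, 32_768, 131_072):
    extra = kv_cache_gb(layers=28, kv_heads=4, head_dim=128, ctx_len=ctx)
    print(f"ctx {ctx:>7}: ~{extra:.1f} GB of KV cache on top of the weights")
```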
To run DeepSeek R1 70B and 671B you can deploy on 8-24 channel memory with an AMD EPYC CPU.
Could you also explore the quantization?
All of these models are Q4 (quantization Q4_K_M)
@@BlueSpork😂
Very good. Could you teach how to use 2 GPUs simultaneously? Is there any special configuration required?
Maybe, I’ll add it to the wishlist. Thanks
Can you benchmark how each model performs? There must be a sweet spot for performance compared to requirements.
Will it work if I run it with an AMD card?
I can run the 70b model, but I really want to run the 671b, the "distilled" aren't really distilled from that. They're just llama or qwen fine tuned with 800k reasoning examples. Their reasoning ability may be somewhat faked. In most situations, the "reasoning" doesn't change the answer the original model would have given. It can be useful to see what the model knows.
Apparently you can get the 671B model running on a 'mere' combined 168GB of total system memory if you use the IQ1_S quant, which is apparently still quite usable.
I'm using a Ryzen 5600, 16GB DDR4-3000 and a 3GB GTX 1060, and it's decent running 8b (at 10 tk/s). But it went BSOD (black) and rebooted due to VRAM use after 30 mins lol.
are you clearing the memory often? that might prevent bsod
I ran the 14b model on an RTX 4050 Laptop GPU with 6GB VRAM and a Ryzen 5 8645HS with a single 16GB RAM stick running at 5600MT/s. Getting an answer from it took 5-10 minutes. 😂😂😂
Same model run on a GTX 1660 Ti; a response takes approximately 1.5 minutes. I see this:
The model uses almost no RAM, barely 0.5GB
The model uses all of the GPU's VRAM (6GB)
The model only uses 100% of the GPU at the start, then usage drops to 20%.
The model uses all the processor cores at 50%, except one that gets saturated at 100% (i5-11400).
That model is supposed to use more VRAM, but it seems to run fine on this configuration. I recommend checking your NVIDIA drivers to see whether it is really using GPU acceleration, or whether there is an overheating problem.
The media hype around DeepSeek running locally is misplaced! Running DeepSeek at 5 tokens per second is ridiculous, but it helps NVDA get more business!
@@tringuyen7519 it's about privacy
Ran it on the GPU or CPU? The CPU might actually be faster than that crappy GPU.
@ With this hype of running AI models locally, NVIDIA will have to upgrade the VRAM on the next releases of graphics cards. People are already dissatisfied with 16GB VRAM on the 5080. I ran the 8b model and it took 5GB of VRAM, but the 14b needs 9-10GB of VRAM. My laptop has only 6GB of VRAM and it was bottlenecked because the shared GPU memory is slow compared to VRAM; single-channel DDR5 5600MHz is 4-5 times slower in bandwidth. It took me about 10-15 minutes to get a response from the 14b model. The 8b model has a decent speed because it doesn't fill up all 6GB of VRAM.
deepseek-r1 14b runs smoothly in 16GB VRAM + 32GB 3200MT/s RAM(it fits in VRAM alone), but 32b is molasses slow, not worth it.
how well do the 1.5b and 7b perform for simple tasks?
The 1.5b model is pretty much drunk and/or on LSD, and I only use it for fun just to see what madness it will cook up. I assume the 7b model will be somewhat similar.
you are getting better performance than I am
14b is giving me an eval rate of 25.57 tokens/s on a 4060 Ti 16GB. Could it be related to me running a Docker container with Ollama through WSL? Wondering how to speed up my setup.
why don't you run ollama directly through terminal to compare?
Running Ollama in a Docker container through WSL can introduce some performance overhead compared to running it natively on Windows
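One way to quantify that overhead is to time the same prompt in both setups through Ollama's local REST API. A minimal sketch, assuming Ollama is listening on its default port 11434 and that the model tag below is the one you pulled:

```python
# Measure generation speed from Ollama's /api/generate response, which reports
# eval_count (tokens generated) and eval_duration (nanoseconds).
import requests

def eval_rate(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    resp = requests.post(f"{host}/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False},
                         timeout=600)
    data = resp.json()
    return data["eval_count"] / (data["eval_duration"] / 1e9)  # tokens per second

# Run once inside the Docker/WSL setup and once natively, then compare.
print(f"{eval_rate('deepseek-r1:14b', 'Explain KV caching in two sentences.'):.1f} tok/s")
```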
Isn't the full model a MoE? Only 37GB is loaded to memory at a time.
Even though only 37GB is active per pass, total memory use is much higher due to other factors
You can easily run the entire DeepSeek R1 model with 128 GB RAM + 48 GB VRAM btw, but you will only get 1.6 t/s (you have to use the Unsloth 1.58-bit version).
Cool! I found going below 3-4bit quantization really starts to affect intelligence, but it's probably still smarter than the 32b & 70b version.
Can I use multiple Tesla K80 24GB cards instead of a 4090 (they cost around $50 used on eBay)? I suspect not. So, what exact hardware technical specs matter?
No, you can't. The K80 has CUDA compute capability 3.7 and you need at least 5.0, the bare minimum for Ollama/llama.cpp. But don't worry: the K80's performance is no good at all.
Btw, Ollama can also use AMD GPUs, through Vulkan API, instead of CUDA.
Was wondering why my 3060 GPU was so slow with the 14b model (6.7 t/s), but my 3060 laptop only has 6GB of dedicated VRAM and it's using 9.3 GB total.
Bro, 8b models can be FULLY loaded onto a 1660 Ti with 6 GB of VRAM. I run AI stuff on the 1660 Ti in my system, since that's the only thing loaded on it (I game on another GPU), and it runs quite fast with Ollama.
Clear and to the point
You are not talking about running full-size FP8 models, by the way; you are talking about running Ollama's 4-bit quantized models, which are around half the size of the true models.
For 671B parameters, you would need around 700 GB of RAM.
You are right. Thank you for pointing that out. I failed to mention that in the video, so now I've made a comment about it and pinned it to the top. Thanks!
Even fp8 isn't "full size". And during training, it's likely even more than fp16.
@@crackwitz i heard they only trained on FP8
Real DeepSeek R1 (NOT the distilled models) takes at least 2 TB of RAM to be used at a usable context level locally, and even then it will be very slow unless you're running on 2TB of VRAM, which would cost tens of thousands of dollars at least to build a GPU farm like that.
Thanks for the video... so the 70b model would be fine for me to run on DIGITS and still leave room for a large token response?
Are the Intel Arc B570 10GB and B580 12GB any good for this?
I have 4GB of VRAM on an NVIDIA RTX 3050 laptop GPU. Can I run the 1.5b or 7b model?
How much RAM?
@BlueSpork 16gb
@ Yes, you can run 1.5b, 7b and 8b. 7b and 8b might be on the slow side.
DeepSeek R1 32b: 29 tokens/s with an AMD 7900 XTX
What's the quality of 14b model compared to ChatGPT's 4o or o1 models?
Thank you... if it takes this much CPU and memory for one user, how are they processing millions of queries every minute? Can you speak to the cloud services they use, the hardware, and the costs they incur, please?
So a M4 Mac Pro / Max with 64GB and more unified memory could run a 70B model?
I'm pretty sure it would run, just not sure how fast
Can I run 10 GPUs with 8GB VRAM in parallel and run the 70b model?
Technically, but it would be rough. LLM performance is mostly limited by memory speed and in a multi-GPU setup you get more capacity, but speed will be limited by the speed of a single GPU.
That is to say, two RTX 3090's will perform about the same as an RTX A6000, which is the same chip with twice as much VRAM. The RTX 3090's are still the cheaper option, but the power draw will be twice that of a single GPU.
GPU's with smaller amounts of memory typically have slower memory, so three 8 GB 3060's will deliver much worse performance than a single 24 GB 3090.
I wish we could go back to the days when board partners could release models with twice the memory of the OEM version of a GPU.
It would be interesting to test this. I ran the 70b model on a pair of 3090's and it was reasonably fast, both GPU's were taking 250-300w of power, but I don't know if this is better or worse than a single A6000. For the 671b model, it's using a mixture of experts system which should be much more efficient than a large model like the Llama 405b because the GPU's don't need to communicate as much. Presumably this is because DeepSeek was using H800 GPU's instead of H100's...the Chinese variants have less inter-GPU communication and less 64-bit floating point arithmetic, but they both have 80GB of VRAM and for FP4 & FP8 calculations they're both fine. I've used Mixtral a few months back, and it was faster than other models with the same number of parameters, but I'm not sure if this was caused by inter-GPU communication. I think the computer I rented had 4x 4090's when I tested Mixtral 8x22b.
@nathanbanks2354 I'm pretty sure MoE models are faster even when run on one GPU. Because only a subset of parameters is active at any given time, the models run like a smaller model despite needing more VRAM than an actually smaller model.
As for the inter GPU connectivity, I don't think that's nearly as important for inferencing versus training. I saw a video a while back where someone distributed inferencing across multiple machines, including a custom build and a Mac and I don't recall it showing significant impact to the performance.
As I understand it, and please correct me if I'm wrong, the high memory bandwidth required for LLM inferencing only applies within processing a layer of the model. So as long as you distribute whole layers to each available GPU the traffic between GPUs is quite minimal.
Of course, distributing layers means that smaller gpus are even more wasteful.
For example, let's say we have a 40 GB model made up of eight 5 GB layers.
You would need eight 8 GB GPU's for good inferencing performance and likely a 9th GPU if you want decent context. That's a total of 64 to 72 GB of VRAM.
Compare that to a 48 GB GPU, where you can load all layers onto one GPU and still have 8 GB leftover for context.
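A small sketch of that layer-splitting arithmetic, using the illustrative 40 GB / eight-layer example above (greedy whole-layer placement; real schedulers are smarter, and the KV cache still needs its own room):

```python
# Place whole layers onto GPUs and see how many cards are needed and how much
# VRAM is left over (for context, or simply stranded).
import math

def place_layers(layer_gb: float, num_layers: int, gpu_gb: float):
    layers_per_gpu = int(gpu_gb // layer_gb)              # whole layers per card
    gpus_needed = math.ceil(num_layers / layers_per_gpu)
    leftover_gb = gpus_needed * gpu_gb - num_layers * layer_gb
    return gpus_needed, leftover_gb

print(place_layers(5, 8, 8))    # 8 GB cards  -> (8, 24.0): eight cards, ~3 GB spare on each
print(place_layers(5, 8, 48))   # 48 GB card  -> (1, 8.0): one card, 8 GB spare for context
```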
what about distilled models ?
They are all distilled models, except 671B
Can I repurpose a 12 GPU mining rig to make up for the vram needed?
It depends on the rig I suppose
Can you please also test APU with 64gb+?
I wish
What would be better/cheaper: RTX GPUs or an Apple Mac?
At this point, with inflated RTX prices and supply-vs-demand issues, a Mac might be cheaper, while RTX is going to be better in performance, especially with the 5090.
Apple charges tons for RAM, but not as much as NVIDIA. The new NVIDIA Project DIGITS will be most flexible. It's slower & more expensive than a 5090, but has 128GB of unified RAM.
I'm thinking of grabbing a 3090 or AMD card with 24GB of VRAM once the stock for the 5090's are available. But I may get a 5070 Ti or just rent GPU's from Vast AI or RunPod whenever I need them.
@@nathanbanks2354 I just tried one model today, a 16B model with a size of 16GB and loaded it into my 3080 10GB with offloading to system ram option with LM studio. I'm definitely looking for an upgrade to a 5080 or 5090 and in the meantime my current setup plus runpod is a good solution.
@@nathanbanks2354 Can AMD cards run these? Do I need to install additional software to make them run? They have more VRAM and are cheaper? Thanks.
Why run it locally? Isn't it free to use?
It's a security issue.
If I have a laptop with 64 GB RAM, what model of DeepSeekR1 can run on my machine. Also, is the 8 GB VRAM at 1:32 min purchased separately and connected to a laptop via USB cord?
VRAM is in the graphics card; it can only be added if you can add a graphics card.
Thank you, uploader
I have a Ryzen 1700 with 64GB RAM and a GTX 1070 with 7GB VRAM; the 14b model runs at a decent speed, of course not like the 8b one, which is about 3-4x faster. Tried the 32b but it's very slow.
Is it possible to run 70b on two 4090s? Thank you.
Can the 14B model be run on an 8GB VRAM GPU? Will it divide the workload between CPU and GPU or shift entirely to the CPU?
It will divide it between the GPU and CPU/RAM, since it needs around 10GB to run.
@@BlueSpork I have an i5-13400, 32GB RAM and a GTX 1080 8GB GPU. I am running the 8b model and it runs quite well on the GPU, but I want to run the 14b model. What kind of speeds can I expect if it divides the work between CPU and GPU? At least 4-5 tokens per second, or even lower?
My 2070 8GB + 32GB RAM test:
8b: 40~46 tokens/s
14b: 3.59~4.5 tokens/s (GPU RAM fully loaded)
If you run it using Ollama it will use both GPU and CPU. I'm not sure about the speed, but I would expect it to be faster than 5 TPS, though not much faster. Try it and let us know.
Thanks. So no 671B model for me for next 20 years.....
can you explain why people can run 671B using regular ram?
I might explain this in one of my upcoming videos. Thank you for the suggestion.
My 4070 runs 14b smoothly though?
Yeah, it runs smoothly on my 3060 12GB too, but if I run something else that requires VRAM at the same time … something will slow down. This is why I recommended 16GB to leave some room for other programs
Btw what do you use to generate speech for your video? Is this also AI or is this normal text to speech?
Unfortunately I don't really understand any of this, although I'm still trying to wrap my head around the basics. But from what I've gathered, if you utilize partial CPU/GPU offloading and, let's say, run a 7B model (GTX 1650: 4GB VRAM), wouldn't this cause premature wear to the hardware?
A lot of people will probably be using this 24/7. In my case I use AI as an optimized search engine, so I don't see myself using it locally, but I'd be lying if I said I wasn't curious to see whether my laptop can run it. At the same time I'm worried it might hurt my laptop in the long run.
I have a GTX 1650 and 8GB of memory, so I think I might be able to run a 1.5B model, but I still can't help but worry about the hardware wearing out prematurely. Is there a way to at least ensure that the graphics card doesn't get worn out?
14B can fit into 12GB of ram
This is for Q4
So… it would be possible to run R1 with 4x128 GB RAM. I wonder how slow…
Someone tried it and it was extremely slow.
Look up Digital Spaceport for his test.
For an idea of the speed at 8x80GB, you can see ruclips.net/video/bOsvI3HYHgI/видео.html
These servers cost over $20/hour unless you're youtube famous.
@ I actually meant RAM rather than VRAM. No budget for the VRAM 😅
Great video!
Do you have an estimate of how many tokens per second a 24gb VRAM gpu will generate?
Thanks! Do you mean for the 32b model? I’m not sure. Maybe someone with a 24GB GPU will see your comment and answer
With 32b model, while running at RTX 4090 24GB, I get around 34 tokens per second
I get 7-10 tokens/s or so for the eval rate on a 4090
I get ~32 tokens/s with 3090 on 32b model
@@SCPH79 : cpu spec? great results btw
Okay, let's try the 7B model out with an NVIDIA GT 1030 (2GB VRAM) and an Intel N100 with 32 GB RAM.
Ignorant question: I have a Mac mini M4 Pro with 48GB of unified memory.. I realize that I can run useful models on it, but I'm curious about whether it's feasible to run large models, e.g., undistilled R1, using an external Thunderbolt 5 SSD at ~0.4 GB/s in lieu of some RAM/VRAM.
Short answer is NO
This presentation is so old-fashioned PC-think. Just get an M4 version Mac and at least 24GB of unified memory, though more memory is always better.
I tried to use it and it was laughable. Is the response/mapping scope limited by RAM? So what I mean is - if there is not enough RAM, it gives simpler answers than on a beefier machine?
I hope NVIDIA won't stop making video cards with more VRAM just because they want to make more money...
Thanks, I won't try to run it locally. I'd prefer the online version instead.
I understand, we all have our preferences
yeah, running this locally would cost a lot of computing powers, my pc's gonna cry with its 4 tokens/second 😂
Has anyone tested Intel Arc? The A770 and B580 should be usable via IPEX-LLM.
I was able to run the 32b on my laptop with a 10th gen Intel 8-core, a 1050 with 3GB VRAM, and 16GB of RAM, on the condition that I need to restart every time I need AI assistance. I also increased my virtual RAM to 16GB, so that helps too.
I recommend installing the LLama studio, because for me the terminal is very confusing.
@@user-gw2vz5gh2n yeah, it's trash and just frustrating. For a better UI I recommend GPT4All from Nomic.
Can I run it on a Radeon? A 6900 XT, for example.
Yes
@@rajeebbhoumik4093 How? For me it's saying 100% CPU in the processor.
PyTorch supports ROCm, as does ollama. However older NVIDIA GPU's work better than older AMD GPU's. I've done AI stuff on an MI25 card. If it's not working for you, it could be a driver issue or an old card...I've never tried to run ollama in Windows or MacOS.
Are these requirements for the Q4 versions? The Q8 ones seem to have much larger requirements.
These are for Q4
This is wrong: the 671B model is a mixture-of-experts model, so VRAM is only needed for the active experts. The inactive experts can be offloaded into RAM. This means that usually you only need 4 to 8 active experts.
So for 8 active experts and the 4-bit quantized version, around 64GB of VRAM and 322GB of RAM.
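A hedged back-of-the-envelope version of that split, using DeepSeek R1's published totals (671B parameters, ~37B activated per token) and an assumed ~4.5 effective bits per weight for Q4. The catch, as the replies note, is that which experts are active changes every token, so the inactive ones can't simply sit cold in RAM without paging:

```python
# Split a MoE model's memory needs: weights active per token vs. total weights.
# Assumes ~4.5 effective bits/weight (Q4-style quantization).

def gb(params_billion: float, bits: float = 4.5) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

TOTAL_B, ACTIVE_B = 671, 37   # DeepSeek R1: total vs. activated parameters per token
print(f"all experts (RAM/disk):  ~{gb(TOTAL_B):.0f} GB")
print(f"active per token (VRAM): ~{gb(ACTIVE_B):.0f} GB, plus KV cache and buffers")
```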
I haven't heard of anyone doing that yet. There are some discussions and papers about methods, but I'm not sure this is an actual thing being done right now, tbh. It would be a huuuuuuge achievement if you could run DeepSeek V3/R1 on one or two consumer GPUs at home.
@@danielhenderson7050 and there are also people running it on EXO with stacked computers and high network bandwidth. With EXO you can run R1 671B with 4 different high-end computers (each being an AMD Threadripper + RTX 3090 24GB + 128 GB RAM).
I'm not sure how much this would speed things up because loading/unloading the correct expert for any given question is pretty hard. It's designed to avoid GPU to GPU communication, where an MoE model will use only 2 of the 8 GPU's. However maybe if you ask the same type of questions over and over it could keep the most commonly used weights cached on the GPU and the rest in RAM...I'm not familiar enough with how the weights are divided. I remember Mixtral 8x22b would typically activate two "experts" for one answer.
@@nathanbanks2354 It doesn't, per se, speed things up, but it makes it runnable. A particularly successful setup (3 to 4 tokens per sec) is to use 4 RTX 3090s and an AMD Threadripper with 512GB of RAM. And people have even successfully managed to distribute the compute across many computers using the EXO project (can be found on GitHub).
Is there no model that will run efficiently on 4GB of GPU RAM? I have a 1050 Ti. If I buy a second GPU with 4GB and use them together, will it efficiently run on the resulting 8GB of total GPU RAM?
1.5b will run great. 7b and 8b should run fine too, but they will use some of your RAM in addition to VRAM
Does anyone have rough data on how distilled models compare to the 671b one and other models? For example how much worse is 7b compared to 671b?
One interesting thing I found out is that neither 7b nor 14b can speak Slovenian, or only very poorly. 671b can speak it no problem.
Hm interesting, 8b is pretty good with Serbian, on 4080 super
thanks
I run 14b model on i5-8250U and got 1.7 token/s. 😅
Can this run on AMD gpus?
ollama supports RX 6000 and 7000 series cards, as well as W 6000 and 7000 series workstation gpus
@d42ks0ul thats good, i'll try it
So I guess you guys in the West do not have the modded 2080 Ti with 22GB VRAM?