How to Run Deepseek R1 671b Locally on $2K EPYC Server Writeup digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/
A Q4 model isn't going to be the full model though, coming in at only half the size of the full model. Or perhaps there's a Q16 at 4x that size?
5-star content for AI@home. Not many can afford the defacto AI hardware. It never occurred to me that anyone would piece together older server hardware and explore the "wide" approach. You rock!
I do want A100s pretty bad tbh 🤣 WEN NVDA, WEN! Seriously, I think smol models are amazing and doing so much, but they don't get the hype they deserve. Vision models that can run on things like a 12G 3060, simply amazing. Bigger models are fun challenges though and I do like that. Cheers!
@@DigitalSpaceport I am running Stable Diffusion XL (SDXL) on a 6GB VRAM RTX 2060 mobile GPU along with the DeepSeek 14B version. It's slow but works.
Quoted from the engineers of DeepSeek-R1:
We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:
1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. Avoid adding a system prompt; all instructions should be contained within the user prompt.
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.
BTW, thank you for such excellent content!
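For anyone applying those recommendations with Ollama, a minimal sketch could look like the one below. The custom tag name is just an example, there is deliberately no SYSTEM line per point 2, and the FROM line should match whatever `ollama list` shows on your machine.

```bash
# Bake the recommended sampling settings into a custom Ollama tag (tag/file names are examples).
cat > Modelfile <<'EOF'
FROM deepseek-r1:671b
PARAMETER temperature 0.6
EOF

# No SYSTEM line on purpose: DeepSeek recommends putting all instructions in the user prompt.
ollama create deepseek-r1-tuned -f Modelfile
ollama run deepseek-r1-tuned "Please reason step by step, and put your final answer within \boxed{}. What is 17 * 23?"
```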
I mentioned that I had 0.65 set previously but didn't elaborate on it. That was what was set in the first video on the R930, while learning how to overcome unrelated obstacles. I observed that this was corrected by using 0.9; however, I am not a scientific testing or benchmarking facility. I found 0.65 to be unbearably long-winded for most answer types, but excellent for ethical reasoning and thought. No system prompt and an ample context window were provided in both testing instances.
Everything's moving super fast at the moment. I personally love everything you're doing. Thank you very much for doing this. Another model with another algorithm just came out.
Best AI content I've come across. You should make a self-hosted LLM tier list for different hardware based on accuracy, speed, etc.
I've never clicked my mouse faster in my life, I've been waiting to see this!
EPYC is the way to the full 671b, it seems. Really pleased with the improvement over the R930. Enjoy!
Me too, I really like to see how smart it is.
Amazing video truly. I did not think it would be possible to run this model at home until quite some time in the future.
While I won't use this model for most things, I think it is a nice one to keep on hand for certain thought exploration, where I need insight into the LLM's thoughts so I can actively provide feedback.
Dude, love your channel.... No brainrot compared to other channels doing such things.
I have no script. Both in life and on the channel. Live dangerous.
Reading your write-up now. Good stuff. You do a great job of providing details and explaining options.
Also like that you assume nothing, like explaining how to set up the static IP.
I try to make the videos and written articles complement each other, but I put more effort into what is written. I still value the old ways.
I was really looking forward to this video
Excited is an understatement.
The big question is how it compares to the 70b... personally I haven't tried it on deepseek, but when I benchmarked the llama models, the 405B was not much different than the 70B. They were basically the same across the board. 70B (< 48GB vram) seems like as good as you need IMO, and I believe one could build a dual rtx 3090 machine within the $2k budget using ebay.
@@differentmoves Is 70b multilingual and multimodal? I think not.
@@meneguzzo68 deepseek 671B is also not multimodal
You can only get the true DeepSeek experience with 671b. The 70b is a distill onto Llama or Qwen; that's definitely a bigger difference than Llama 405B vs 70B.
Very cool, have to make notes and checkout your writeup.
My primary box is a "rebuilt", from eBay and Aliexpress, Dell C4135 (my own designation - most of it is C4130 parts, the rest C4140) with
- dual Xeon E5s (18 core/36 threads each) and 1TB of DDR4 (16x64GB ECC DIMMs)
- 4x 16GB V100 SXM2 GPUs nvlinked
- 3x 24GB P40s
and connectivity through a dual 40Gb/s InfiniBand/Ethernet card that's using up the rest of my PCIe lanes.
.. I got the P40s when they were less than half the price they are now - average cost was just under £130 each (not sure I'd go this way now), the V100s for an average of £160 apiece and the 1TB of RAM came in around £600.
Unfortunately the DGX daughterboard for the SXM2 GPUs was disturbingly close to £400 and then another £200 on cables.
Whole thing came in at just over £2k (plus SSDs)
amazing setup! congrats
This is the video I was needing
🫡 It may not be 🍓 ready but it sure can parse peppermints🤖
Quality explanation, good job
I'm enjoying following your journey thru all these models.
Have you done a video explaining your background and what got you into making these videos? If not, you should. A lot of us would enjoy it.
That's fantastic. I'm holding off on building a workstation for this type of work until later in the year, to see what Digits or a potential M4 Ultra looks like, but that's because I want to see if they deliver the capacity that I want: to be able to run these large models. If they don't do it, this is the type of thing I'm going to be looking to build.
This is actually important work
Bloody hell mate, this is amazing
Thanks for testing, R1 is great but the RAM isn't fast enough yet
Finally, someone shows this. Thanks dude.
There you go! Smart ways to run gigantic LLMs
Finally we see someone display the hardware aspect, thank you sir.
Have you considered using the 1.58-bit dynamic quant model from Unsloth instead? It only uses 160GB of RAM and you can offload some of it to the GPU for faster performance. Getting a couple of used 3090s and using RAM for the remaining memory should give it pretty good performance while maintaining a somewhat reasonable price.
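If anyone wants to try that, here is a rough sketch of the llama.cpp route under some assumptions: the Hugging Face repo is unsloth/DeepSeek-R1-GGUF, the 1.58-bit files follow the UD-IQ1_S naming, and `llama-cli` comes from a recent CUDA-enabled llama.cpp build. Verify the exact file names on the model page before downloading.

```bash
# Pull only the 1.58-bit dynamic quant splits (check the pattern/file names on Hugging Face first).
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "*UD-IQ1_S*" --local-dir ./DeepSeek-R1-GGUF

# Offload a slice of layers to the 3090s with -ngl and keep the rest in system RAM.
# Pointing at the first split is enough; llama.cpp picks up the remaining parts.
./llama-cli \
  -m ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  -ngl 10 -c 8192 --temp 0.6 \
  -p "Please reason step by step: how many prime numbers are there below 50?"
```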
1:33 I have no idea what any of this means, but your server makes beautiful sounding harmonics. It would make a great bed for an ambient drone track.
Appreciate the content even though I have no clue what you're talking about some of the time
Insane guide, the day when Iron Man's Jarvis lives in my house is near!!!
Fantastic video - thank you!
Won't be doing this but great that someone is. Keep this up man, love this.
The genius we all need!
Maybe it would be more reasonable to specify a constant 'seed' before running a test, to make everything reproducible?
[EDIT] Great that you've actually done exactly that. ;)
I did show the seed that I have pinned for this test as 42069. 😉
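For reference, here is what pinning that seed looks like against Ollama's standard HTTP API (endpoint and option names are stock Ollama; the prompt is just a placeholder):

```bash
# Fixed seed + fixed temperature so repeat runs of the same prompt stay comparable.
curl -s http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:671b",
  "prompt": "Summarize what a mixture-of-experts model is in one sentence.",
  "stream": false,
  "options": { "seed": 42069, "temperature": 0.6 }
}'
```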
This is awesome ❤
Nice. Most of the other videos on running deepseek-r1 locally are clickbait, they actually run the llama and qwen r1 distills. Have you tried the unsloth dynamic quants? They are faster with mostly the same quality output. What's the power usage when inference is running?
You should switch to using Portainer. Container ENV is much easier to manage. Dockge is just slightly better than just using CLI.
What about automation? One could write a bash script for batch-inference testing. That way you would be able to set up a list of questions, just let it run, and review the results later on. It should also be possible to dump the state of the neural net to a file and load it later, in order to ask additional questions later if required.
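The first half of that is easy to sketch in bash against the Ollama API. This assumes a `questions.txt` with one prompt per line and `jq` installed; saving and restoring the network state mid-conversation is not something Ollama exposes, as far as I know.

```bash
#!/usr/bin/env bash
# Batch-inference sketch: run every line of questions.txt through the model and save each answer.
set -euo pipefail
mkdir -p results
n=0
while IFS= read -r question; do
  n=$((n + 1))
  curl -s http://localhost:11434/api/generate \
    -d "$(jq -n --arg model "deepseek-r1:671b" --arg prompt "$question" \
          '{model: $model, prompt: $prompt, stream: false}')" \
    | jq -r '.response' > "results/answer_${n}.txt"
  echo "answered question ${n}"
done < questions.txt
```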
That's a great system.
In docker compose, `external: true` means the stack does not handle managing the creation and deletion of the docker network. It just expects the network to already exist
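In other words, you create the network yourself once and compose just attaches to it. Something like this, with the network name being whatever your compose file references:

```bash
# Compose files that declare the network with `external: true` expect it to exist already.
docker network create ai-stack-net        # one-time setup; name must match the compose file
docker network ls | grep ai-stack-net     # sanity check
docker compose up -d                      # containers attach to the pre-existing network
```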
Amazing work, J. I would opt for Ubuntu 22 over 24 if possible, so that LM Studio works easily and you have that alternative to Ollama. Also, once you have R1 running, you can use your unused GPU to talk to your computer w/ my free Froshine app, and one-shot app development w/ Freepoprompt & o1-xml-parser (also free). Cheers. P.S. Wonder how many toks/sec this rig would do with Unsloth A.I.'s 1.58-bit Dynamic Quant
7995wx wants to know 😜
I think that I might have missed this from your video, but why did you limit the amount of memory that the DeepSeek R1:671b model can use, from your 3090s?
I know that you added the memory parameter when you were launching/running it, but I don't exactly recall *why* you added that setting/flag though.
If you can expand on that a little further, that would be greatly appreciated.
Thank you!
Thank you for doing this so I don't have to 😉 Very interesting exercise, but yeah, not a daily driver.
Fully agree.
Do Macs with unified memory have an advantage as a setup? Although they do max out at 192GB and come in at more than double that amount.
You can exo up a cluster of macs and run the full 671b but it is not "cheap" at all. Performance I have seen looked like 5-8 tps from those.
@@DigitalSpaceport Thank you for your work. Another order of magnitude down (or simply technology speeding up over time), and we'll see good LLMs being ubiquitous and decentralized in any application and system - very exciting. (Btw, how is this good news for Nvidia, since there won't be a need to build massive data centers for inference?)
Given that I gobble around with deepseek-r1:32b at 3.5-4.5 tokens a second on a Ryzen 3800 with 32GB and a 1080 Ti, 3 tokens a second with this massive model doesn't seem too bad.
Seeing how Ollama can straight up use multiple GPUs without any problems, I may come back to my mining-rig idea. I've got an 8x P106-100 rig sitting around that I bought for cents on the dollar, to test if those cards would work for Stable Diffusion. But even though PyTorch can use multiple (NVIDIA) GPUs, there's no way to use them with SD at the moment (apart from running every GPU in its own instance and pooling those together). Maybe Ollama can utilize them...
What are the guidelines for EPYC processor selection for LLM tasks? For example, the 7003 series models range from 8c/16t to 64c/128t. What is "unusable", what is "acceptable", what is "good"?
Also, I thought that a 671b model would need 671GB of RAM, how did you manage to run it with only 512GB RAM?
Probably using int4
@@-tsvk- 671b is the number of inner parameters for this model, not required RAM size.
@@vit3060 Yes, I knew that the number is the parameter count, but I understood them to be the model weights, and if each value is stored at a resolution of at least one byte per value (if not more), then you would need at least as many bytes of RAM as there are parameters in the model.
@-tsvk- that is why a different quantization is used.
@@vit3060 Sure, but if it's a quantized model where the model weights don't have their original resolution and are stored in a smaller number of bits per weight, then he's not using the "full model" as he is letting us understand, which is a bit misleading since the results won't be the same as with the original DeepSeek.
"RES" in htop is resident, not reserve: it's the amount of memory that's actually present in RAM, versus VIRT for the amount that is allocated address space. ollama mmaps the model (increasing VIRT), and then the weights actually get loaded in RAM, increasing RES roughly to match. 8:06
TIL thanks! If you set parallel max 1, can you describe what happens on a new chat window opening? Are the weights reloaded?
@ I think the working set (like the "RAM" where the current chat is being processed, versus the "program/ROM" of the weights) is being reset, but I hope it's not reloading the weights every time! sort of a weird thing to observe with the memory measurements though, hmm!
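If anyone wants to watch that behaviour themselves, here is a quick sketch for the ollama process (the pgrep pattern is an assumption; adjust it to however your runner shows up in `ps`):

```bash
# RSS maps to htop's RES (pages actually resident), VSZ to VIRT (mapped address space).
# The mmap'd model inflates VSZ immediately; RSS climbs as the weights are actually touched.
pid=$(pgrep -f ollama | head -n1)
ps -o pid,rss,vsz,comm -p "$pid"            # sizes reported in KiB
grep -E 'VmRSS|VmSize' "/proc/$pid/status"  # same figures straight from /proc
```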
Needless to say I'll not be pursuing this project, but fair play to you. As for Pi, my puny brain currently has the first 14 digits logged ;-)
I also use the same kind of Xeon setup, even cheaper with previous-gen Xeons on a used board. I recommend getting a thermal camera ASAP, something like a 200-buck InfiRay from China, because these boards and especially the LRDIMMs get very hot and you can't check it anywhere (it's not reported); RAM without cooling easily gets to 90 C. Even the power cable connector heats above its rated 80 C under LLM load. I managed to keep the RAM at 60 C with a fan aimed directly at it, all thanks to the thermal camera. If the board fits in a standard PC case, it's mostly only usable mounted horizontally.
Thanks for the tutorial. While it can work, I wouldn't opt for the $2k rig; even if it costs $3k USD, being able to run a top-notch model on general-purpose hardware that could later be used for other VMs/stuff wouldn't be a bad deal.
God's work!
Looking forward, I wonder how fast a Project Digits type machine with 10x the memory would run this model at Q8. And how much extra memory would it need to run the full 130k context window? And at what cost - less than $10,000?
Still too expensive for me, BUT this is peanuts for this class of LLM. Thanks for this. Amazing AI is within reach. I liked the answers! It wasn't expected. haha. We have to be careful giving AI control of machines, apparently.
On most boards with IPMI you can upgrade/downgrade BIOS from IPMI without a CPU installed or the system turned on. You just need power to the motherboard, login to IPMI, and flash BIOS from there. Done that countless times. That's why it's called Out of Band Management
another awesome and very detailed video!!
Can you test DeepSeek version 2.5 as a local PC AI model?
Thank you; that was interesting and informative; appreciate the effort. What's the budget way to increase tokens/s though?
Love it! Quick question though: I basically copied your setup except I am aiming to set it up with three 3090s, as I am using the fourth in my daily PC until I can get a 5090. Will there be any load balancing/parallelization issues with this?
Run Unsloth Dynamic Quants (DeepSeek-R1-UD-Q2_K_XL) model with huggingface
Very helpful
well done!
Did you try it on your 4x 3090 (96GB VRAM)? Very curious about that one.
Thank you for fixing the audio brother. I appreciate you.
@DigitalSpaceport How about 4 NVMe drives with 7-10 GB/s each in RAID on a PCIe slot, acting as one really fast NVMe, and then loading the model onto them?
It would be very very slow. Possible but very very slow. If you test this out, please report back what you find.
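For anyone who does try it, the striping half is straightforward. Here is a rough sketch with hypothetical device names; note that this destroys whatever is on those drives.

```bash
# Stripe four NVMe drives into one RAID0 device (no redundancy; wipes the drives).
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
  /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /mnt/models && sudo mount /dev/md0 /mnt/models

# Crude sequential read check - even a 20-30 GB/s aggregate is far below RAM bandwidth,
# which is why streaming weights from disk every token stays very, very slow.
sudo dd if=/dev/md0 of=/dev/null bs=1M count=8192 status=progress
```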
Try something similar on the cheap Chinese Xeon boards.
I wonder how the quant 8 version would perform...
Did you configure the correct temperature?
It’s necessary if you are prompting Maths or Coding.
Nice! What would be your go-to open source model for coding?
I like your content, but can you please try to fix the simple errors when trying to run the code? An "unexpected indent" can be easily solved, and the rest of the code may work perfectly, but now we don't know. I don't suggest feeding the error back to the model, just fix the indent spacing and run the code again.
No. You may not have been watching prior videos, but I absolutely will not fix any of the code, including silly things like indents, in the video. This is a test, and several other models have gotten very good one-shots that are functional. FTR I did, after the fact, fix the indent and there was another error. Same as the R930 testing conducted prior. I may explain this better in the video going forward, however, so everyone understands I am providing a level playing field in the best manner possible to all the models. Also, I wouldn't be really intrigued unless a model got all the questions in the set right. Of note, this was the only question I gauged as outright missed. They are close.
Do you have a dedicated video/article (or know a resource from elsewhere) about the inner workings of spreading an AI workload (e.g. LLM inference) across CPU+RAM and > 1 NVGPU1+VRAM1 ... NVGPUn+VRAMn?
I don't understand how the workload is spread across GPUs and if it's all the workload or just parts of it. Until now I was under the firm impression that all "less than datacenter" AI workloads need to be executed in a single memory space so either in CPU+RAM (super slowly) or on a NVGPU+VRAM (faster but much more constrained due to how much VRAM you can get on a single NV card).
In data centers I was aware of being able to spread the load across many GPUs in the same box but I thought this is possible mainly due to NVLink providing that special ultra high speed NVGPU interconnect. I don't know details about how they spread the load across boxes, I know there are East-West ultra fast 400+ Gbit NICs used for this purpose but that's about all I know.
So now I'm looking at your video and see this talk about using multiple GPUs interconnected only via PCIe for a single LLM instance and I would like to know details about how this works.
how is quant 4 the full model?
@@MarshallYang I believe full FP precision requires GPUs, which would send the price into the hundreds of thousands of dollars. It's full size because the parameter count is the same, but it is being run at a lower precision.
22:20 - A lot of boards allow upgrading the main CPU "BIOS" without a CPU present by using the BMC. I have no experience with Gigabyte boards, though.
Every company could now have an AI model 🎉
The possibility of running this tool on a 2000 dollar machine is very impressive and great for us regular people. This hegemony of tech bros and closed systems destroys development and the possibilities to improve our lives, just because some dudes wanted more money from us, charging us 200 dollars for a subscription even while they use our data, sell it to other companies, and take funds from governments.
I am trying to build a dual EPYC CPU system in combination with 8x 4090 GPUs for analyzing and creating the readouts of advanced medical scan machines, like a 3T MRI, which is usually needed to get high enough resolution for areas like the C and T spine. The reality is that almost all hospitals perform these readouts themselves, but for late and overnight hours they outsource the readout of any CT or MRI scan performed on an emergency basis. RUclips creators who own their own data centers told me that a dual 64-core/128-thread EPYC setup with a maxed-out RAM configuration, working together with 8x 4090s (which have AI capabilities too), should be sufficient. But even such a massive AI CPU and GPU system will need at least 30 minutes to analyze the scanner data and transform it into a readout that radiology techs or even radiologists can use to work with the doctors/surgeons of that medical center.
Who told you to get 8x 4090s?
I wish to see a test where the active parameters are offloaded onto two GPUs, while the rest are kept on the CPU side (the model itself in RAM or very fast storage). This approach could potentially work really well and lead to a server under $10,000 or $15,000 that can run the model at more than 10 tokens per second. I would immediately build such a server, but NO ONE is doing this benchmark. I want to see the full context size tested in FP16. By the way, Q4 is not ideal; at the very least, Q5_K_M should be used!
"Unsloth" is a different topic, but I would love to see someone test their 1.58-bit model and the largest one (2.51-bit) against the full-size original, as per the research paper 'The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits.' It might offer almost the same quality as the full-sized original FP16 model, which would be insane!
So if I wanted to run a VLM or CNN to analyse an image and output a textual array of parameters that describe the image, I would need to fine-tune the transformer. For example, it might be camera images of hand gestures with text output (hand up, 2 fingers), etc.
The transformer would need fine-tuning. How practical is it to use, say, LM Studio and a RAG mode, then store the fine-tuned model as an SSD image to run each time the system is rebooted or turned on? I figure this is less complex than fine-tuning the model itself. Is that a reasonable statement?
23:20 Did you actually look at the error message? It was probably just a copy/paste issue, fixable in 10s by a junior intern.
Maybe I didn't understand perfectly, but do CPU-only inference and inference with GPU offload have equal speeds of 3.5 tokens per second? If so, what is the point of GPU offload if the inference speed didn't increase?
Thanks for the video. Does this mean that if we use DDR5 instead of DDR4, tokens/s can significantly increase?
I would expect to see tokens/s double, but not more. I plan to test the Unsloth version on both this rig and a gen 5 rig to determine a ratio that may help answer this question for others.
I'm always wondering: if you ran a Zen 5 EPYC with 12-channel DDR5-6000, bandwidth would be 576 GB/s, so what would the tokens per second be? It feels like it would still be cheaper than any GPU solution, but it will definitely be more expensive than $2000.
I do have a 7995WX that I will run the Unsloth version against. It has insane bandwidth as I have all 8 channels on that board filled. Unfortunately I won't be able to hit 12 channels. The 9xxx EPYCs are awesome, but if you only end up doubling the tps to 8, that would not be a big enough win imho.
insane
I just looked this up. Thanks man!!
Wow 64GB DIMMs are sure handy.
Please show the load via bpytop in the next video - it's a more detailed, cool tool.
AGI in the garage! Nice video man!
🙌
Might be a beginner question, but why choose Linux in this scenario instead of a Windows platform? Is it preference or is it strategic based on performance?
In layman's terms, is this as powerful as the MS Copilot I use at work?
Let's hope the hardware is going to get cheaper and cheaper.
The question is, does it work as well as they say against OpenAI o1?
4 T/s is a respectable number, but it's too slow: that translates to 10 minutes of R1 thinking and then 4 T/s, about 3 words spitting out per second, super slow since we've been spoiled by ChatGPT's speed. I would hate to see that speed after going through all those hoops. Thanks for letting me know in advance.
Yeah, but it's local though; if you want speed, just use the web version.
Bro, why can't you write out the t/s results in the description or a pinned comment...
Shouldn't you disable swap, or is it needed?
If it works with systemd it will work with Docker. Just use env properly. Go into the container, type "env", and see if these are actually configured.
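For example (container names here are just placeholders):

```bash
# Confirm the variables actually made it into the running container.
docker exec -it ollama env | sort                    # dump everything the process sees
docker exec -it open-webui env | grep -i ollama      # or filter for the ones you care about
```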
what happens if you use ZRAM or ZSWAP to compress memory?
Bruh, do you have solar? Because your power bill… woof
Lower-precision training: lighter and faster.
What's the limiting factor when wanting to increase your tokens per second performance?
System bandwidth in all cases is the limiting performance factor, both for VRAM and system RAM - basically how fast the clock can move the bits round trip, and how many. ASIC > FPGA > VRAM > RAM > me with pen and paper is another way to think of it. It is conceivable a 9xxx AMD system could double the performance here, but at more than double the cost. I am preparing a quasi-scientific test to isolate the performance of the Unsloth 1.58-bit version to run against this machine and also a top-end 7995WX workstation. Should be interesting. *inserts ring bell call to action*
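As a back-of-the-envelope illustration of why bandwidth is the ceiling, the numbers below are rough assumptions: about 37B active parameters per token for R1, ~4.5 bits per weight for a Q4-class quant, ~205 GB/s for 8-channel DDR4-3200.

```bash
# Upper bound: tokens/s <= memory bandwidth / bytes read per generated token,
# since every active weight has to be streamed through once per token.
awk 'BEGIN {
  active_params   = 37e9;    # DeepSeek R1 active parameters per token (MoE)
  bits_per_weight = 4.5;     # rough average for a Q4-class quant
  bandwidth_gbs   = 205;     # ~8-channel DDR4-3200
  gb_per_token = active_params * bits_per_weight / 8 / 1e9;
  printf "GB read per token : %.1f\n", gb_per_token;
  printf "ceiling           : %.1f tok/s\n", bandwidth_gbs / gb_per_token;
}'
```

Under those assumptions the ceiling lands around 10 tok/s for this class of box, which squares with the ~4 tok/s observed once real-world overheads are included.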
Hey! AGI in the garage!
Hello, how fast would this be with the 32B or 70B model?
Probably similar speed due to similar number of active parameters.
It's not about how fast, but how smart the model would be. I've tried running 1.5B to 70B:
1.5B is meh for coding, only small tasks
32B is okay for simple code
70B is 6/10 for code
I've been considering the 7352 (24c/48t) because that's the fastest I can buy in my region. I tried to gauge your CPU load,
but I can't see your CPU load clearly in the video, so can I use a lower-end CPU?
CPU load is maxed out during testing on my 7702. The 7352 may not hit max bandwidth; you should investigate. I think it will, but I can't remember. The 7302 for sure cannot, due to lacking chiplets.
My CPU reads as fully loaded. I do not believe it is good for me to advise you on this as I have not seen the performance of the 7352 myself. I can tell you the 7302 is not good for achieving high bandwidth however. I have owned one of those in the past.
marc andreessen sent me