Deepseek R1 671b Running and Testing on a $2000 Local AI Server

  • Published: Jan 31, 2025

Comments • 228

  • @DigitalSpaceport
    @DigitalSpaceport  14 hours ago +12

    How to Run Deepseek R1 671b Locally on $2K EPYC Server Writeup: digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/

    • @jimbig3997
      @jimbig3997 1 hour ago

      A Q4 model isn't going to be the full model though, coming in at only half the size of the full model. Or perhaps there's a Q16 at 4x that size?

  • @markldevine
    @markldevine 21 hours ago +116

    5-star content for AI@home. Not many can afford the de facto AI hardware. It never occurred to me that anyone would piece together older server hardware and explore the "wide" approach. You rock!

    • @DigitalSpaceport
      @DigitalSpaceport  20 hours ago +15

      I do want A100s pretty bad tbh 🤣 WEN NVDA, WEN! Seriously, I think smol models are amazing and doing so much, but they don't get the hype they deserve. Vision models that can run on things like a 12G 3060, simply amazing. Bigger models are fun challenges though and I do like that. Cheers!

    • @hackbaba999
      @hackbaba999 1 hour ago +1

      @@DigitalSpaceport I am running Stable Diffusion XL (SDXL) on a 6GB VRAM RTX 2060 mobile GPU along with the DeepSeek 14B version. It's slow but works.

  • @RickeyBowers
    @RickeyBowers 18 hours ago +76

    Quoted from the engineers of DeepSeek-R1:
    We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:
    1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
    2. Avoid adding a system prompt; all instructions should be contained within the user prompt.
    3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
    4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.
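
    A minimal sketch of applying those recommendations against a local Ollama instance; the endpoint, model tag, and example question are assumptions based on the setup shown in the video:

```python
import requests

# Apply the DeepSeek-R1 recommendations quoted above via Ollama's generate API.
# Endpoint and model tag are assumptions; adjust to your own rig.
OLLAMA_URL = "http://localhost:11434/api/generate"

question = "What is the integral of x^2 from 0 to 3?"
# Rec. 2 and 3: no system prompt, and the step-by-step/boxed directive goes
# directly into the user prompt.
prompt = (
    "Please reason step by step, and put your final answer within \\boxed{}.\n\n"
    + question
)

resp = requests.post(OLLAMA_URL, json={
    "model": "deepseek-r1:671b",
    "prompt": prompt,
    "stream": False,
    "options": {"temperature": 0.6},  # Rec. 1: stay within 0.5-0.7
}, timeout=3600)
resp.raise_for_status()
print(resp.json()["response"])
```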

    • @RickeyBowers
      @RickeyBowers 18 hours ago +5

      BTW, thank you for such excellent content!

    • @DigitalSpaceport
      @DigitalSpaceport  11 hours ago +6

      I mentioned that I had 0.65 set previously but didn't elaborate on this. That was what was set in the first video on the R930, while learning how to overcome unrelated obstacles. I observed that this was corrected by using 0.9, however I am not a scientific testing or benchmarking facility. I found 0.65 to be unbearably long-winded for most answer types, however excellent for ethical reasoning and thought. No system prompt and an ample context window were provided in both testing instances.

  • @Romans-310
    @Romans-310 12 hours ago +8

    Everything's moving super fast at the moment. I personally love everything you're doing. Thank you very much for doing this. Another model and another algorithm just came out.

  • @joejenkins9181
    @joejenkins9181 19 hours ago +36

    Best AI content I've come across. You should make a self-hosted LLM tier list for different hardware based on accuracy, speed, etc.

  • @TheJimmyCartel
    @TheJimmyCartel 21 hours ago +51

    I've never clicked my mouse faster in my life, I've been waiting to see this!

    • @DigitalSpaceport
      @DigitalSpaceport  20 hours ago

      EPYC is the way to the full 671b, it seems. Really pleased with the improvement over the R930. Enjoy!

    • @mahdir_8391
      @mahdir_8391 20 hours ago

      Me too, I really want to see how smart it is

  • @Residentevilfan1
    @Residentevilfan1 11 hours ago +5

    Truly amazing video. I did not think it would be possible to run this model at home until quite some time in the future.

    • @DigitalSpaceport
      @DigitalSpaceport  11 hours ago

      While I won't use this model for most purposes, I think it is a nice one to keep on hand for certain thought exploration where I need insight into the LLM's thinking so I can actively provide feedback.

  • @AnarchyEnsues
    @AnarchyEnsues 19 hours ago +8

    Dude, love your channel... No brainrot compared to other channels doing such things.

    • @DigitalSpaceport
      @DigitalSpaceport  10 hours ago +5

      I have no script. Both in life and on the channel. Live dangerous.

  • @MetaTaco317
    @MetaTaco317 21 hours ago +12

    Reading your write-up now. Good stuff. You do a great job of providing details and explaining options.
    Also like that you assume nothing, like explaining how to set up the static IP.

    • @DigitalSpaceport
      @DigitalSpaceport  20 hours ago +6

      I try to make the video and written articles complement each other, but I put more effort into what is written. I still value the old ways.

  • @StayAware9
    @StayAware9 16 hours ago +4

    I was really waiting for this video

  • @Romans-310
    @Romans-310 11 hours ago +3

    Excited is an understatement.

  • @differentmoves
    @differentmoves 21 hours ago +23

    The big question is how it compares to the 70b... personally I haven't tried it on DeepSeek, but when I benchmarked the Llama models, the 405B was not much different than the 70B. They were basically the same across the board. 70B (< 48GB VRAM) seems as good as you need IMO, and I believe one could build a dual RTX 3090 machine within the $2k budget using eBay.

    • @meneguzzo68
      @meneguzzo68 21 hours ago

      @@differentmoves Is 70b multilingual and multimodal? I think not.

    • @differentmoves
      @differentmoves 20 hours ago +3

      @@meneguzzo68 DeepSeek 671B is also not multimodal

    • @lovestudykid
      @lovestudykid 20 hours ago +11

      You can only get the true DeepSeek experience with 671b. The 70b is a distill onto Llama or Qwen; that's definitely a bigger difference than Llama 405B vs 70B.

  • @jelliott3604
    @jelliott3604 3 hours ago +2

    Very cool, have to make notes and check out your writeup.
    My primary box is a "rebuilt", from eBay and AliExpress, Dell C4135 (my own designation - most of it is C4130 parts, the rest C4140) with
    - dual Xeon E5s (18 cores/36 threads each) and 1TB of DDR4 (16x64GB ECC DIMMs)
    - 4x 16GB V100 SXM2 GPUs, NVLinked
    - 3x 24GB P40s
    and connectivity through a dual 40Gb/s InfiniBand/Ethernet card that's using up the rest of my PCI-E lanes.
    .. I got the P40s when they were less than half the price they are now - average cost was just under £130 each (not sure I'd go this way now), the V100s for an average of £160 apiece, and the 1TB of RAM came in around £600.
    Unfortunately the DGX daughterboard for the SXM2 GPUs was disturbingly close to £400, and then another £200 on cables.
    The whole thing came in at just over £2k (plus SSDs).

  • @ko95
    @ko95 19 hours ago +2

    Amazing setup! Congrats

  • @svjness
    @svjness 21 hours ago +17

    This is the video I was needing

    • @DigitalSpaceport
      @DigitalSpaceport  21 hours ago

      🫡 It may not be 🍓 ready but it sure can parse peppermints🤖

  • @hamadeng.1671
    @hamadeng.1671 11 hours ago +1

    Quality explanation, good job

  • @MetaTaco317
    @MetaTaco317 21 hours ago +1

    I'm enjoying following your journey through all these models.
    Have you done a video explaining your background and what got you into making these videos? If not, you should. A lot of us would enjoy it.

  • @anthonyperks2201
    @anthonyperks2201 4 hours ago

    That's fantastic. I'm holding off building a workstation for this type of work until later in the year to see what Digits or a potential M4 Ultra looks like, because I want to see if they deliver the capacity that I want: to be able to run these large models. If they don't do it, this is the type of thing I'm going to be looking to build.

  • @Fordtruck4sale
    @Fordtruck4sale 13 hours ago +1

    This is actually important work

  • @In_Development
    @In_Development 20 hours ago

    Bloody hell mate, this is amazing

  • @Techonsapevole
    @Techonsapevole 8 hours ago +2

    Thanks for testing, R1 is great but the RAM isn't fast enough yet

  • @danielkahbe964
    @danielkahbe964 6 hours ago

    Finally, someone shows this. Thanks dude.

  • @hemsingh23
    @hemsingh23 21 hours ago +4

    There you go! Smart ways to run gigantic LLMs

  • @David-y1i8c
    @David-y1i8c 15 hours ago

    Finally we see someone display the hardware aspect, thank you sir.

  • @darknight264441
    @darknight264441 10 hours ago +4

    Have you considered using the 1.58-bit dynamic quant model from Unsloth instead? It only uses 160GB of RAM and you can offload some of it to the GPU for faster performance. Getting a couple of used 3090s and using RAM for the remaining memory should give it pretty good performance while maintaining a somewhat reasonable price.
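
    For context on those sizes, here is a rough back-of-the-envelope estimate of model footprint at various average bits per weight; the bit-widths are approximations and real quant files vary, so treat these as ballpark figures only:

```python
# Rough memory estimate for a 671B-parameter model at different average
# bits per weight. Real quant files mix precisions and add metadata, so
# actual sizes differ somewhat; this is only a ballpark.
PARAMS = 671e9

def approx_gib(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 2**30

for label, bpw in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("UD 1.58-bit", 1.58)]:
    print(f"{label:>12}: ~{approx_gib(bpw):.0f} GiB")

# FP16 lands around 1.2 TiB, a Q4-class quant around 375 GiB (which is why it
# fits in 512GB of system RAM), and the Unsloth dynamic 1.58-bit quant roughly
# 125 GiB before runtime overhead, consistent with the ~160GB figure above.
```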

  • @zx5147
    @zx5147 6 hours ago

    1:33 I have no idea what any of this means, but your server makes beautiful sounding harmonics. It would make a great bed for an ambient drone track.

  • @amplifiergamingBTC
    @amplifiergamingBTC 15 hours ago

    Appreciate the content even though I have no clue what you're talking about some of the time

  • @bongkem2723
    @bongkem2723 3 hours ago

    Insane guide, the day when Iron Man's Jarvis lives in my house is near!!!

  • @DigiByteGlobalCommunity
    @DigiByteGlobalCommunity 20 hours ago

    Fantastic video - thank you!

  • @neponel
    @neponel 19 hours ago +2

    Won't be doing this but great that someone is. Keep this up man. Love this.

  • @gmtoomey
    @gmtoomey 20 hours ago

    The genius we all need!

  • @alekseyburrovets4747
    @alekseyburrovets4747 11 hours ago

    Maybe it would be more reasonable to specify a constant 'seed' before running a test, to make everything reproducible?
    [EDIT] Great that you've actually done exactly that. ;)

    • @DigitalSpaceport
      @DigitalSpaceport  11 hours ago +1

      I did show the seed that I have pinned for this test as 42069. 😉

  • @MatijaGrcic
    @MatijaGrcic 5 hours ago

    This is awesome ❤

  • @andutei
    @andutei 5 minutes ago

    Nice. Most of the other videos on running deepseek-r1 locally are clickbait; they actually run the Llama and Qwen R1 distills. Have you tried the Unsloth dynamic quants? They are faster with mostly the same quality output. What's the power usage when inference is running?

  • @foureight84
    @foureight84 19 hours ago +2

    You should switch to using Portainer. Container ENV is much easier to manage. Dockge is just slightly better than just using CLI.

  • @alekseyburrovets4747
    @alekseyburrovets4747 3 hours ago

    What about automation? One could write a bash script for batch-inference testing. This way you would be able to set up a list of questions, just let it run, and review the results later. It should also be possible to dump the state of the neural net to a file and load it later, in order to ask additional questions if required.
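
    A sketch of that batch-testing idea against a local Ollama instance (the endpoint, model tag, questions, and output file are assumptions; Ollama also keeps the model resident between requests via its keep_alive setting, so the weights are not reloaded per question):

```python
import json
import requests

# Run a fixed list of questions through a local Ollama instance and save the
# answers plus basic timing stats for later review. Adjust endpoint/model/paths.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "deepseek-r1:671b"

questions = [
    "How many r's are in the word strawberry?",
    "Write a Python function that checks whether a number is prime.",
]

results = []
for q in questions:
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": q,
        "stream": False,
        "options": {"seed": 42069, "temperature": 0.6},  # pinned for repeatable runs
    }, timeout=3600)
    resp.raise_for_status()
    body = resp.json()
    results.append({
        "question": q,
        "answer": body.get("response", ""),
        "eval_count": body.get("eval_count"),           # tokens generated
        "eval_duration_ns": body.get("eval_duration"),  # generation time in ns
    })

with open("r1_batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```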

  • @hooni-mining
    @hooni-mining 16 hours ago +1

    That's a great system.

  • @caedis_
    @caedis_ 19 hours ago

    In docker compose, `external: true` means the stack does not handle managing the creation and deletion of the docker network. It just expects the network to already exist

  • @AdrianDoesAi
    @AdrianDoesAi 1 hour ago

    Amazing work, J. I would opt for Ubuntu 22 over 24 if possible, so that LM Studio works easily and you have that alternative to Ollama. Also, once you have R1 running, you can use your unused GPU to talk to your computer w/ my free Froshine app, and one-shot app development w/ Freepoprompt & o1-xml-parser (also free). Cheers. P.S. Wonder how many toks/sec this rig would do with Unsloth A.I.'s 1.58-bit Dynamic Quant

  • @ewenchan1239
    @ewenchan1239 10 hours ago +1

    I think that I might have missed this from your video, but why did you limit the amount of memory that the DeepSeek R1:671b model can use, from your 3090s?
    I know that you added the memory parameter when you were launching/running it, but I don't exactly recall *why* you added that setting/flag though.
    If you can expand on that a little further, that would be greatly appreciated.
    Thank you!

  • @dr_harrington
    @dr_harrington 17 hours ago

    Thank you for doing this so I don't have to 😉 Very interesting exercise, but yeah, not a daily driver.

  • @dom2555
    @dom2555 20 hours ago +6

    Do Macs with unified memory have an advantage as a setup? Although they do max out at 192GB and come in at more than double that amount.

    • @DigitalSpaceport
      @DigitalSpaceport  20 hours ago +6

      You can exo up a cluster of Macs and run the full 671b, but it is not "cheap" at all. Performance I have seen looked like 5-8 tps from those.

    • @dom2555
      @dom2555 8 hours ago

      @@DigitalSpaceport Thank you for your work. Another order of magnitude down (or simply technology speeding up with time), and we'll see good LLMs being ubiquitous and decentralized in any application and system, very exciting. (Btw, how is this good news for Nvidia, since there won't be the need for massive data centers for inference?)

  • @marcus_w0
    @marcus_w0 18 hours ago +2

    Given that I gobble around with deepseek-r1:32b at 3.5-4.5 tokens a second on a Ryzen 3800 with 32GB and a 1080 Ti, 3 tokens a second with this massive model doesn't seem too bad.
    Seeing how Ollama can straight up use multiple GPUs without any problems, I may come back to my mining-rig idea. I've got an 8x P106-100 rig sitting around that I bought for cents on the dollar to test whether those cards might work for Stable Diffusion. But even though PyTorch can use multiple (Nvidia) GPUs, there's no way to use them with SD at the moment (except for running every GPU in its own instance and pooling those together). Maybe Ollama can utilize them...

  • @-tsvk-
    @-tsvk- 19 hours ago +5

    What are the guidelines for EPYC processor selection in LLM tasks? For example, the 7003 series models range from 8c/16t to 64c/128t. What is "unusable", what is "acceptable", what is "good"?
    Also, I thought that a 671b model would need 671GB of RAM; how did you manage to run it with only 512GB of RAM?

    • @arc8218
      @arc8218 12 hours ago +1

      Probably using int4

    • @vit3060
      @vit3060 11 hours ago +5

      @@-tsvk- 671b is the number of internal parameters of this model, not the required RAM size.

    • @-tsvk-
      @-tsvk- 10 hours ago +1

      @@vit3060 Yes, I knew that the number is the parameter count, but I understood them to be the model weights, and if each value is stored with a resolution of at least one byte per value (if not more), then you would need at least as many bytes of RAM as there are parameters in the model.

    • @vit3060
      @vit3060 10 hours ago +1

      @-tsvk- that is why a different quantization is used.

    • @-tsvk-
      @-tsvk- 10 hours ago +3

      @@vit3060 Sure, but if it's a quantized model where the model weights don't have their original resolution but are stored in a smaller number of bits per weight, then he's not using the "full model" as he is leading us to understand, which is a bit misleading since the results won't be the same as with the original DeepSeek.

  • @shvmoz
    @shvmoz 1 hour ago

    "RES" in htop is resident, not reserve: it's the amount of memory that's actually present in RAM, versus VIRT for the amount of allocated address space. Ollama mmaps the model (increasing VIRT), and then the weights actually get loaded into RAM, increasing RES roughly to match (a small demo follows this thread). 8:06

    • @DigitalSpaceport
      @DigitalSpaceport  1 hour ago +1

      TIL thanks! If you set parallel max 1, can you describe what happens on a new chat window opening? Are the weights reloaded?

    • @shvmoz
      @shvmoz 52 minutes ago

      I think the working set (like the "RAM" where the current chat is being processed, versus the "program/ROM" of the weights) is being reset, but I hope it's not reloading the weights every time! Sort of a weird thing to observe with the memory measurements though, hmm!
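
      A small Linux-only demo of that VIRT-vs-RES behavior, using a throwaway file as a stand-in for model weights (file path and size are arbitrary; this is just an illustration of the mmap pattern, not how Ollama itself is instrumented):

```python
import mmap
import os

def mem_kb(field: str) -> int:
    """Read a VmSize/VmRSS value (in kB) for this process from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field):
                return int(line.split()[1])
    return 0

path = "/tmp/fake_weights.bin"
size = 256 * 1024 * 1024  # 256 MiB stand-in for a model file
chunk = b"\0" * (1 << 20)
with open(path, "wb") as f:
    for _ in range(size >> 20):
        f.write(chunk)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
    # Mapping the file grows virtual size (VIRT) immediately...
    print("after mmap :", mem_kb("VmSize"), "kB virtual,", mem_kb("VmRSS"), "kB resident")
    # ...but resident memory (RES) only grows once pages are actually touched.
    _ = sum(m[i] for i in range(0, size, 4096))
    print("after touch:", mem_kb("VmSize"), "kB virtual,", mem_kb("VmRSS"), "kB resident")

os.remove(path)
```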

  • @PeteJohnson1471
    @PeteJohnson1471 9 hours ago

    Needless to say I'll not be pursuing this project, but fair play to you. As for Pi, my puny brain currently has the first 14 digits logged ;-)

  • @fontenbleau
    @fontenbleau 27 minutes ago

    I also use the same Xeon setup, even cheaper on a used board with previous-gen Xeons. I recommend getting a thermal camera ASAP, something like a $200 InfiRay from China, because these boards and especially the LRDIMMs get very hot and you can't check that anywhere (it's not reported); RAM without cooling easily gets to 90 C. Even the power cable connector heats above its rated 80 C under LLM load. I managed to keep the RAM at 60 C with a fan aimed directly at it, all thanks to the thermal camera. If the board fits in a standard PC case, it can mostly only be used horizontally.

  • @nqaiser
    @nqaiser 10 hours ago

    Thanks for the tutorial. While it can work, I wouldn't opt for a $2k rig; but even if it takes around $3k USD to run a top-notch model on general-purpose hardware that could later be used for other VMs/stuff, it wouldn't be a bad deal.

  • @bhargavk1515
    @bhargavk1515 6 hours ago

    God's work!

  • @moozoowizard
    @moozoowizard 11 hours ago +1

    Looking forward, I wonder how fast a Project Digits type machine with 10x the memory would run this model at Q8. And how much extra memory it would need to run the full 130k context window. And for what cost. Less than $10,000?

  • @tristanvaillancourt5889
    @tristanvaillancourt5889 13 hours ago

    Still too expensive for me, BUT this is peanuts for this class of LLM. Thanks for this. Amazing AI is within reach. I liked the answers! They weren't expected, haha. We have to be careful giving AI control of machines, apparently.

  • @iraqigeek8363
    @iraqigeek8363 6 hours ago

    On most boards with IPMI you can upgrade/downgrade the BIOS from IPMI without a CPU installed or the system turned on. You just need power to the motherboard; log in to IPMI and flash the BIOS from there. I've done that countless times. That's why it's called out-of-band management.

  • @DavidVincentSSM
    @DavidVincentSSM 20 hours ago +3

    another awesome and very detailed video!!

  • @mategames9073
    @mategames9073 21 hours ago +2

    Can you test the DeepSeek version 2.5 AI model on a local PC?

  • @memeyoyo-c5q
    @memeyoyo-c5q 6 hours ago

    Thank you; that was interesting and informative. Appreciate the effort. What's the budget way to increase tokens/s though?

  • @Grapheneolic
    @Grapheneolic 5 hours ago

    Love it! Quick question though: I basically copied your setup, except I am aiming to set it up with three 3090s since I am using the fourth in my daily PC until I can get a 5090. Will there be any load balancing/parallelization issues with this?

  • @zoo6062
    @zoo6062 20 hours ago +4

    Run the Unsloth Dynamic Quant model (DeepSeek-R1-UD-Q2_K_XL) from Hugging Face

  • @elyoni-corp
    @elyoni-corp 8 hours ago

    Very helpful

  • @georgechang5994
    @georgechang5994 12 hours ago

    well done!

  • @MichaelAsgian
    @MichaelAsgian 12 hours ago +1

    Did you try it on your 4x 3090 (96GB VRAM)? Very curious about that one.

  • @freecode.ai-
    @freecode.ai- 13 hours ago

    Thank you for fixing the audio brother. I appreciate you.

  • @m0neez
    @m0neez 19 hours ago +2

    @DigitalSpaceport How about 4 NVMe drives at 7-10 GB/s in RAID on a PCIe slot, acting as one really fast NVMe, and then loading the model onto them?

    • @DigitalSpaceport
      @DigitalSpaceport  10 hours ago +2

      It would be very very slow. Possible but very very slow. If you test this out, please report back what you find.

  • @20windfisch11
    @20windfisch11 19 hours ago +2

    Try something similar on the cheap Chinese Xeon boards.

  • @allxallthetime
    @allxallthetime 19 hours ago +1

    I wonder how the quant 8 version would perform...

  • @madeniran
    @madeniran 17 hours ago +1

    Did you configure the correct temperature?
    It's necessary if you are prompting for maths or coding.

  • @alan83251
    @alan83251 20 hours ago

    Nice! What would be your go-to open source model for coding?

  • @Victornemeth
    @Victornemeth 21 hours ago +2

    I like your content, but can you please try to fix the simple errors when trying to run the code? An "unexpected indent" can be easily solved, and the rest of the code may work perfectly, but now we don't know. I don't suggest feeding the error back to the model, but just fix the indent spacing and run the code again.

    • @DigitalSpaceport
      @DigitalSpaceport  20 hours ago +4

      No. You may not have watched prior videos, but I absolutely will not fix any of the code, including silly things like indents, in the video. This is a test, and several other models have gotten very good one-shots that are functional. FTR, I did fix the indent after the fact and there was another error. Same as the R930 testing conducted prior. I may explain this better in the video going forward, however, so everyone understands I am providing as level a playing field as possible to all the models. Also, I wouldn't be really intrigued unless a model got all the questions in the set right. Of note, this was the only question I gauged as outright missed. They are close.

  • @desild5869
    @desild5869 15 hours ago

    Do you have a dedicated video/article (or know a resource from elsewhere) about the inner workings of spreading an AI workload (e.g. LLM inference) across CPU+RAM and > 1 NVGPU1+VRAM1 ... NVGPUn+VRAMn?
    I don't understand how the workload is spread across GPUs and whether it's the whole workload or just parts of it. Until now I was under the firm impression that all "less than datacenter" AI workloads need to be executed in a single memory space, so either in CPU+RAM (super slowly) or on an NVGPU+VRAM (faster but much more constrained due to how much VRAM you can get on a single NV card).
    In data centers I was aware of being able to spread the load across many GPUs in the same box, but I thought this is possible mainly due to NVLink providing that special ultra-high-speed GPU interconnect. I don't know details about how they spread the load across boxes; I know there are East-West ultra-fast 400+ Gbit NICs used for this purpose, but that's about all I know.
    So now I'm looking at your video and see this talk about using multiple GPUs interconnected only via PCIe for a single LLM instance, and I would like to know details about how this works.
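
    A toy sketch of the layer-split (pipeline-style) offload that llama.cpp/Ollama-class runners use, which is presumably what is happening here: contiguous slices of layers live on different devices, and only the small activation tensor crosses PCIe between slices, so NVLink is not required. This is an illustration under those assumptions, not the actual Ollama code:

```python
import torch
import torch.nn as nn

# Assign contiguous slices of layers to each available device; whatever does
# not fit on the GPUs stays in system RAM on the CPU.
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
devices.append(torch.device("cpu"))

hidden = 1024
layers = [nn.Linear(hidden, hidden) for _ in range(12)]  # stand-ins for transformer blocks

per_dev = (len(layers) + len(devices) - 1) // len(devices)
placement = []
for idx, layer in enumerate(layers):
    dev = devices[min(idx // per_dev, len(devices) - 1)]
    placement.append((layer.to(dev), dev))

def forward(x: torch.Tensor) -> torch.Tensor:
    # Only the activation tensor (a few KB to MB) hops between devices over
    # PCIe; the weights themselves never move after placement.
    for layer, dev in placement:
        x = layer(x.to(dev))
    return x

out = forward(torch.randn(1, hidden))
print(out.shape, [str(d) for _, d in placement])
```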

  • @MarshallYang
    @MarshallYang 9 hours ago +4

    how is quant 4 the full model?

    • @ozerune
      @ozerune 1 hour ago +1

      @@MarshallYang I believe FP precisions require GPUs, which would send the price into the hundreds of thousands of dollars. It's full-size because the parameter count is the same, but it's being run at a lower precision.

  • @Jamey_ETHZurich_TUe_Rulez
    @Jamey_ETHZurich_TUe_Rulez 4 hours ago

    22:20 - A lot of boards allow upgrading the main CPU "BIOS" without a CPU present by using the BMC. I have no experience with Gigabyte boards, though.

  • @parstigdeyu6259
    @parstigdeyu6259 3 hours ago

    Every company could now have an AI model 🎉

  • @bartekb4191
    @bartekb4191 4 hours ago

    The possibility of running this tool on a $2000 machine is very impressive and great for us regular people. This hegemony of tech bros and closed systems destroys development and the possibilities to improve our lives, just because some dudes wanted more money from us by charging $200 for a subscription, even while they use our data, sell it to other companies, and take funds from governments.

  • @saminekunis8680
    @saminekunis8680 12 hours ago +1

    I am trying to build a dual EPYC CPU system in combination with 8x 4090 GPUs for analyzing and creating the readout of advanced medical scan machines, like a 3T MRI, which is usually needed to get a high enough resolution for areas like the C and T spine. The reality is that almost all hospitals perform these readouts themselves, but for late and overnight hours they outsource the readout of any CT or MRI scan performed on an emergency basis. YouTube creators who also own their own data center told me that a dual 64-core/128-thread EPYC CPU setup, together with a maxed-out RAM configuration, working together with 8x 4090s (which have AI capabilities too) should be sufficient. But even such a quite massive AI CPU and GPU system will need at least 30 minutes to analyze the scanner data and transform it into a readout that radiology techs or even radiologists will need to work on together with the doctors/surgeons of that medical center.

  • @DLLDevStudio
    @DLLDevStudio 6 hours ago

    I wish to see a test where the active parameters are offloaded onto two GPUs, while the rest are stored on the CPU (and the model itself in RAM or very fast storage). This approach could potentially work really well and lead to a server under $10,000 or $15,000 that can run the model at more than 10 tokens per second. I would immediately build such a server, but NO ONE is doing this benchmark. I want to see the full context size tested in FP16. By the way, Q4 is not ideal; at the very least, Q5_K_M should be used!
    Unsloth is a different topic, but I would love to see someone test their 1.58-bit model and the largest one (2.51-bit) against the full-size original, as per the research paper 'The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits.' It might offer almost the same quality as the full-sized original FP16 model, which would be insane!

  • @geor664
    @geor664 6 hours ago

    So if I wanted to run a VLM or CNN to analyse an image and output a textual array of parameters that describe the image, I would need to fine-tune the transformer. For example, it might be camera images of hand gestures with text output (hand up, 2 fingers), etc.
    The transformer would need fine-tuning. How practical is it to use, say, LM Studio and a RAG mode, then store the fine-tuned model as an SSD image to run each time the system is rebooted or turned on? I figure this is less complex than fine-tuning the model itself? Is that a reasonable statement?

  • @youtou252
    @youtou252 5 hours ago

    23:20 Did you actually look at the error message? It was probably just a copy/paste issue, fixable in 10 seconds by a junior intern.

  • @onewhoraisesvoice
    @onewhoraisesvoice 13 minutes ago

    Maybe I didn't understand perfectly, but are the inference speed on CPU and the inference speed with GPU offload equal, at 3.5 tokens per second? If so, what is the reason for GPU offload if the inference speed didn't increase?

  • @treksis
    @treksis 18 hours ago

    Thanks for the video. Does this mean that if we use DDR5 instead of DDR4, tokens/s can increase significantly?

    • @DigitalSpaceport
      @DigitalSpaceport  11 hours ago +3

      I would expect you to see tokens double, but not more. I plan to test the unsloth version on both this rig and a gen 5 rig to determine a ratio that may be applicable to solve this question for others.

  • @siegfriedcxf
    @siegfriedcxf 20 hours ago +1

    I'm always wondering: if you ran a Zen 5 EPYC with 12-channel DDR5-6000, bandwidth would be 576GB/s; what would the tokens per second be? It feels like it would still be cheaper than any GPU solution, but definitely more expensive than $2000.

    • @DigitalSpaceport
      @DigitalSpaceport  20 hours ago

      I do have a 7995WX that I will run the unsloth version against. It has insane bandwidth as I have all 8 channels on that board filled. Unfortunately I won't be able to hit 12 channels. The 9xxx EPYCs are awesome, but if you only end up doubling the tps to 8, that would not be a big enough win imho.

  • @xiv3r
    @xiv3r 12 hours ago

    insane

  • @nikozg2091
    @nikozg2091 21 hours ago +4

    I just looked this up. Thanks man!!

  • @goodcitizen4587
    @goodcitizen4587 18 hours ago

    Wow 64GB DIMMs are sure handy.

  • @drax5965
    @drax5965 12 hours ago

    Please show the load via bpytop in the next video; it's a more detailed, cool tool.

  • @goodfractalspoker7179
    @goodfractalspoker7179 20 hours ago +3

    AGI in the garage! Nice video man!

  • @509tippy
    @509tippy 18 hours ago

    Might be a beginner question, but why choose Linux in this scenario instead of a Windows platform? Is it preference or is it strategic based on performance?

  • @James-pyon
    @James-pyon 3 hours ago

    In layman's terms, is this as powerful as the MS Copilot I use at work?

  • @gene61ify
    @gene61ify 12 hours ago

    Let's hope the hardware is going to get cheaper and cheaper

  • @christcombiccombichrist2651
    @christcombiccombichrist2651 16 hours ago

    The question is, does it perform as they say against OpenAI o1?

  • @bittertruthnavin
    @bittertruthnavin 18 hours ago +1

    4 T/s is a respectable number, but it's too slow; that translates into 10 minutes of R1 thinking and then 4 T/s, like 3 words spitting out per second. Super slow, since we've been spoiled by ChatGPT's speed. I would hate to see that speed after going through all those hoops. Thanks for letting me know in advance.

    • @arc8218
      @arc8218 12 hours ago

      Yeah, but it's local though; if you want speed, just use the web version

  • @LuxmasterCZ
    @LuxmasterCZ 7 hours ago

    Bro, why can't you write out the t/s results in the description or a pinned comment....

  • @justinknash
    @justinknash 12 hours ago

    Shouldn't you disable swap, or is it needed?

  • @KA-kp1me
    @KA-kp1me 5 hours ago

    If it works with systemd, it will work with Docker. Just use the env properly. Go into the container, type "env", and see if these are actually configured.

  • @trendingtopicresearch9440
    @trendingtopicresearch9440 35 minutes ago

    What happens if you use ZRAM or ZSWAP to compress memory?

  • @betterwithrum
    @betterwithrum 13 hours ago +2

    Bruh, do you have solar? Because your power bill… woof

  • @jameschern2013
    @jameschern2013 11 hours ago

    Precision training, lighter and faster.

  • @509tippy
    @509tippy 18 hours ago

    What's the limiting factor when wanting to increase your tokens per second performance?

    • @DigitalSpaceport
      @DigitalSpaceport  10 hours ago +5

      System bandwidth in all cases is the limiting performance factor, both for VRAM and system RAM: how fast the clock can move the bits round trip, and how many, basically. ASIC > FPGA > VRAM > RAM > me with pen and paper is another way to think of it. It is conceivable a 9xxx AMD system could double the performance here, but at more than double the cost. I am preparing a quasi-scientific test to isolate the performance of the unsloth 1.58-bit version to run against this machine and also a top-end 7995WX workstation. Should be interesting. *inserts ring bell call to action*
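
      As a rough, hedged illustration of that bandwidth ceiling (assumptions: ~205 GB/s peak for 8-channel DDR4-3200, ~37B active parameters per generated token for this MoE model, and roughly 4.8 bits per weight for a Q4-class quant; real-world efficiency lands well below the theoretical number):

```python
# Back-of-the-envelope upper bound on decode speed for a bandwidth-bound MoE.
# All three inputs are approximations; tweak them for your own hardware.
mem_bw_gbs = 8 * 25.6        # GB/s peak, 8 channels of DDR4-3200
active_params = 37e9         # parameters read per generated token (MoE active set)
bytes_per_param = 4.8 / 8    # ~Q4-class average bits per weight

bytes_per_token = active_params * bytes_per_param
print(f"~{bytes_per_token / 1e9:.1f} GB read per token")
print(f"theoretical ceiling: ~{mem_bw_gbs / (bytes_per_token / 1e9):.1f} tokens/s")
# Roughly 9 tokens/s in theory; the ~4 t/s seen in the video is in the expected
# range once real-world memory efficiency and overhead are factored in.
```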

  • @toothy
    @toothy 19 hours ago +3

    Hey! AGI in the garage!

  • @chaospatriot
    @chaospatriot 16 hours ago

    Hello, how fast would this be with the 32B or 70B model?

    • @yahm0n
      @yahm0n 12 hours ago

      Probably similar speed due to similar number of active parameters.

    • @arc8218
      @arc8218 12 hours ago

      It's not about how fast, but how smart the model would be. I tried running 1.5B to 70B:
      1.5B is meh for coding, only small tasks
      32B is okay for simple code
      70B is 6/10 for code

  • @정명호-q1t
    @정명호-q1t 20 hours ago

    I've been considering the 7352 (24c/48t) because that's the fastest one I can buy in my region. I looked at your CPU load and tried to work it out, but
    I can't see your CPU load clearly in the video, so can I use a lower-end CPU?

    • @DigitalSpaceport
      @DigitalSpaceport  20 hours ago +1

      CPU load is maxed out during testing on my 7702. The 7532 may not hit max bandwidth, you should investigate. I think it will but can't remember. The 7302 for sure can not, due to lacking chiplets.

    • @DigitalSpaceport
      @DigitalSpaceport  19 hours ago

      My CPU reads as fully loaded. I do not believe it is good for me to advise you on this as I have not seen the performance of the 7352 myself. I can tell you the 7302 is not good for achieving high bandwidth however. I have owned one of those in the past.

  • @lutjenlee
    @lutjenlee 1 hour ago +1

    Marc Andreessen sent me