So I've got the same 4028GR-TRT. The RTX 3090s you have in there can't fit in the server, but the 3090 Founders Edition can fit in the case. It takes a little bit of forcing, but they will fit, as the power comes out the side with that weird adapter. That way you don't have to run a whole bunch of crazy risers all over the place.
Hey there. Thanks so much for the comment! Ah cool! This is great to know. I figured that the Founders Edition would at least have a shot at fitting in, but I wasn't sure, so I went the external route. As for the adapter, are you talking about something like this? www.ebay.com/itm/134077093756
40:55 My thoughts are that it would be much easier to cut the cover, or do some kind of 3D-printed cover with holes for the 3090 or 4090 PSU connectors. The only obstacle would be providing a sufficient amount of power to the 4090s, but you can always limit the power of the card... or just add a PSU on top of the server. Still... 8x 4090/3090 won't fit width-wise.
Yeah, you are probably right here. Cutting the cover or making a 3D-printed cover is certainly an interesting idea. I actually thought about 3D printing a mount for the GPUs, but I just haven't had time to design one. I am not sure what the power output is for each of the 12V EPS connectors; it may actually be enough to power the 3090s. Idk about the 4090s though. In any case, you could likely only fit 4 GPUs in the server if you went that route, so there should be plenty of power in the server (theoretically). There may actually be some other form factors that would allow you to fit all 8. I am just not sure.
@@TheDataDaddi I'm considering buying 6-8x 4090 and just playing with the cover. Those 4090s would be 2 PCI slots wide with individual water cooling. But the server would still be loud - ah, I want it to be silent.... ah, no good solutions. All these factors add up to a big mess ;)
Yeah, the silent part is really going to be your biggest issue. The SM4028GR-TRT is truly the loudest server I have ever worked with. On boot and under heavy load it can get above 90 dB. However, at idle it's not really too bad. I would be really curious to know if you can fit 8x 4090s. Please keep me updated on your journey if you remember! @@gileneusz
@@TheDataDaddi Sure, I'm leaning towards buying 2xA6000 to get 96GB of VRAM on a desktop, without any server... I have 2 cats, and they hate noise.... 😅 But I'm also thinking about the alternative of buying an SXM4 server and populating it with 4xA100 40GB. I can buy them used on eBay for $5k each, but reselling those things later would be almost impossible...
2xA6000 would be a good route if you are looking to keep things quiet. Thankfully, unless mine are booting up, my cat doesn't mind the noise. lol. As far as the SXM4 A100s are concerned, this would definitely be an interesting route. I think you could probably resell them actually. You just might have to be more patient, as most people do not have SXM4 servers. @@gileneusz
Hey. Thanks for all the materials that you create (the GPU Excel sheet, etc.). I have a question. Do you recommend that I invest a little bit more to build an SXM platform for a home lab? Something with cheap P100 16GB SXM2 cards, changing later to V100 32GB SXM2?
Hi there. Thanks so much for the comment! So glad the content is useful to you. I will be honest with you: I am not the most knowledgeable about SXM in general, but I certainly think it would be cool to experiment with. I cannot speak to the performance benefits of SXM over more traditional architectures. One thing I have heard about SXM is that it can be difficult to set up correctly. I would say this though. It is much more common to find non-SXM GPUs, so they are likely more plentiful and easier to trade, upgrade, etc., and there is more of a community around how to install and set them up. Personally, I would probably avoid the SXM architecture because of the added cost and murkiness around setup, but if you want the extra performance gains or just want to experiment with it, I would say go for it. I would be extremely interested to hear how that goes if you do go that route. Maybe one day when the channel gets bigger I can buy some SXM GPUs and make some videos on how all that works and how the performance compares with the traditional architecture. I am sorry I could not provide more guidance here, but I hope this helps you! Please do let me know what you decide to do.
Hm, I'm still watching, but I'm looking to build a deep learning rig to host Llama 3 400 dense. How many 3090s can I put together, and how do I learn to do that?
Hi there! Thanks so much for the comment. I am not really sure it is possible for you to host Llama 3 400 dense on 1 machine (without quantization). By my calculation, you would need close to 1TB of VRAM to hold the model. This would need to be split across many GPUs. Even if you used many H100 or A100 GPUs, you would likely not be able to host them in the same machine. It would take something like 12 or more of either. I do not know of any single server that could support this. In this particular case, the Supermicro 4028GR-TRT shown in this video could theoretically handle up to 8 RTX 3090s rigged externally. That is only going to give you about 192 GB of VRAM. You might be able to get away with hosting the full 70B model without quantization with that amount of VRAM. However, for the 400 dense you are still a long way off. To host a model of that size, a much more practical and realistic way to do it would be to set up a distributed computing cluster with many GPUs on several different servers. In a distributed computing setup, each server would handle a portion of the model, allowing you to distribute the computational load and memory requirements across several nodes. This approach not only makes it feasible to host large models like Llama 3 400 dense, but it can also improve overall performance through parallel processing. To implement this, you might try frameworks that support distributed deep learning, such as TensorFlow with tf.distribute.Strategy or PyTorch with torch.distributed. These frameworks are designed to help manage the distribution of data and computations so that the workload is spread evenly and the nodes synchronize effectively.
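The VRAM estimate above can be reproduced with some back-of-the-envelope arithmetic. The following is a minimal sketch, not a sizing tool: it assumes FP16 weights (2 bytes per parameter), a round 400 billion parameters, and a 20% overhead margin for things like activations and the KV cache. The overhead figure and the function names are assumptions made for illustration, not numbers from the video.

```python
import math


def weights_gb(params_billion, bytes_per_param=2):
    """Memory for the weights alone, in GB (FP16 = 2 bytes per parameter)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param


def gpus_needed(params_billion, vram_gb_per_gpu, bytes_per_param=2, overhead_pct=20):
    """GPUs needed to hold the weights plus an ASSUMED overhead margin."""
    total_gb = weights_gb(params_billion, bytes_per_param) * (100 + overhead_pct) / 100
    return math.ceil(total_gb / vram_gb_per_gpu)


if __name__ == "__main__":
    for name, vram in [("RTX 3090 (24 GB)", 24), ("A100/H100 (80 GB)", 80)]:
        print(name, "->", gpus_needed(400, vram), "GPUs for a 400B model in FP16")
```

Under these assumptions the weights alone come to 800 GB and the padded total to 960 GB, which is where a figure like "close to 1TB" and "12 or more" 80 GB cards comes from. Real requirements depend heavily on context length, batch size, and the serving framework.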
Tossing up between 3090s, A4000, and P40/P100 cards for my use case, which would not exactly be ML/DL but rather local LLM usage hosted using something like Ollama and various (I assume at least q4) models with higher parameter counts. I'm also dabbling with Stable Diffusion as well - at the moment I am shocked I'm able to run q4 quantized LLMs via LM Studio as well as Stable Diffusion models on my little old aging M1 2020 MacBook Air with 16GB RAM. I'm getting into the homelab idea, especially the idea of using a Proxmox server to spin up different VMs (including the Mac ecosystem) with way higher resources than what I'm working with currently. I'm also looking to integrate a NAS and other homelab services for media - but the GPU component is where I'm a little hung up - just what tier of card, exactly, is needed for this sort of use case? Am I nuts to think I could run some of the lesser quantized (as in, higher q number) LLMs on the low profile cards, as well as SD? It's been 10+ years since I've built a PC and I am totally out of my element in terms of knowing just how good I've got it using the M series of chips - I've even been hearing of people running this sort of setup on a 192GB RAM M2 Ultra Mac Studio, but would really love to get out of the Apple hardware if possible. I realize this was several questions by now... but, to distill this down, GPU thoughts? lol
Hi there. Thanks so much for your question! Yeah, so this is a really good question. It really depends on the size of model you are trying to run. For example, to host Llama 2 70B in FP16 you need approximately 140GB of VRAM. However, you could run quantized versions with much less. Or you could always work with the smaller model sizes. In terms of GPUs, I would recommend GPUs that have at least 24GB of VRAM. I have been looking at this a lot for my next build, and I think I actually like the RTX Titan best. The RTX 3090 would also be a good choice; its FP16 performance just isn't as good. I think the P40/P100 are also great GPUs for the price, but for LLMs specifically they may not be the greatest options because the P100 has only 16 GB of VRAM and the P40 has very poor FP16 performance. Another off-the-wall option is to look at the V100 SXM2 32GB. Since these are SXM2, they are cheaper, but there are a lot fewer servers that they will fit in. The only one I know of off the top of my head is the Dell C4140/C4130. From my research, the SXM2 GPUs are also fairly tricky to install. Anyway, these are the routes I would go to make a rig to host these models locally. I will eventually build a cluster to host and train these models locally, so stay tuned for videos to come on that subject if you are interested.
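To make the quantization point concrete, here is a rough sketch of how weight size scales with bits per weight, and whether a given model would fit on a single 24 GB card. It only counts the weights plus a fixed headroom; the 2 GB headroom value and the model sizes in the loop are assumptions picked for illustration.

```python
def quantized_weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in GB at a given precision."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, headroom_gb=2):
    """True if the weights plus an ASSUMED fixed headroom fit in vram_gb."""
    return quantized_weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for params in (7, 13, 70):          # model sizes in billions of parameters
        for bits in (16, 8, 4):         # FP16, 8-bit, and 4-bit quantization
            size = quantized_weights_gb(params, bits)
            verdict = "fits" if fits(params, bits, 24) else "does not fit"
            print(f"{params}B @ {bits}-bit: {size:.1f} GB, {verdict} in 24 GB")
```

By this estimate a 70B model needs about 140 GB in FP16 and about 35 GB at 4-bit, so even a 4-bit 70B model would need to be split across two 24 GB cards.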
Hi there. Thanks so much for the comment. First and foremost, NVIDIA SLI is different from NVLink. SLI is primarily designed for linking two or more GPUs together to produce a single output for graphically intensive applications (gaming in particular). It is not really designed for AI/ML/DL, and is not really used for this purpose to my knowledge. Also, the 3090 does not use SLI; it uses NVLink. As for NVLink, you do not necessarily need it. It does not change the amount of memory on each GPU, but it is certainly nice to have. It can significantly speed up multi-GPU operations, as it allows the GPUs to communicate directly at high bandwidth (roughly 100 GB/s on the 3090, and more on data center cards). So, it will not prevent you from working with LLMs, but it will make you much faster when dealing with them, if that makes sense.
Would love to see inference speed of something like llama on the P100 and P40 cards. I have dual 3090's so I'm familiar with that, but looking to 8x to gain more vram, but don't want the complexity of consumer cards. Have you considered the AMD MI100 card by chance?
I am working on some benchmarks now. I will try to get some quick ones for some of the open source LLMs because everyone seems to be most interested in those at the moment. Stay tuned, and I will try my best to get them out as quickly as I can. I have not personally gone the AMD route yet, but it's something I plan on experimenting with down the road. It's funny you mention the MI100 card though. I was actually talking to a viewer the other day and he was telling me about his experiences with AMD and using the MI100. To summarize his experience: "AMD is not worth it if you value your time, but once it's working it is fairly decent and a good alternative to Nvidia." If you are interested in this route, please reach out to me. You can find my contact info in my YouTube bio, and I can try to put you in touch with the other viewer.
Look up the PSU connector pinout. There is one pin that the motherboard connects to ground to make the PSU switch on. Pretty sure that is all the "tester" does. Historically, PSUs had trouble providing the rated 12V power without pulling any of the 5V power, so you had to put some load on the 5V lines, but I think this is no longer needed for modern PSUs. I have built the famous IKEA clusters, using one power supply for each pair of motherboards. You can do some crazy things with power supplies.
If you have different video card models like your Tesla p40 and 3090, do they still share vram and resources to do AI on Ollama with good efficiency? Is there one slowing others down?
Hi there. Thanks so much for your question! I have not actually tried this yet with respect to Ollama specifically, so take what I say with a grain of salt. However, my intuition is that when making use of data parallelism (copying the same model across both GPUs and splitting batches across GPUs) the slower GPU (the P40 in this case) will not create a bottleneck. It should still improve the throughput overall. However, when trying to use 2 GPUs to host larger models, specifically models that cannot fit on a single GPU and must be sharded, the P40 would be a bottleneck. Throughput would be constrained by the slowest GPU. Hope this helps!
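The intuition above can be illustrated with a toy throughput model. This is only a sketch under simplifying assumptions, not a measurement: each GPU is described by a single made-up rate (samples per second if it ran the whole model alone), communication cost is ignored, and the sharded case assumes the layers are split evenly and one sample passes through the GPUs in sequence. Note that the data parallel case only adds up cleanly if each GPU gets work in proportion to its speed; with equal batch splits and synchronous steps, the faster GPU waits for the slower one.

```python
def data_parallel_proportional(rates):
    """Each GPU gets batches in proportion to its speed: rates simply add."""
    return sum(rates)


def data_parallel_equal_split(rates):
    """Equal batch split with synchronous steps: every GPU waits for the slowest."""
    return len(rates) * min(rates)


def sharded_sequential(rates):
    """Model split evenly across GPUs; a sample visits each GPU in turn."""
    n = len(rates)
    # GPU i handles 1/n of the model, which takes (1/n) / rate_i seconds per sample
    time_per_sample = sum(1.0 / (n * r) for r in rates)
    return 1.0 / time_per_sample


if __name__ == "__main__":
    rates = [10.0, 30.0]  # HYPOTHETICAL samples/s for a slower and a faster GPU
    print("data parallel, proportional split:", data_parallel_proportional(rates))
    print("data parallel, equal split:", data_parallel_equal_split(rates))
    print("sharded, sequential:", round(sharded_sequential(rates), 2))
```

With these made-up rates, the proportional split reaches 40 samples/s, the equal split 20, and the sharded setup about 15, which is below what the faster GPU could do alone if the model fit on it.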
Cool video! Liked and subbed! Is there a reason for not using the blower design 3090s? Asus and Gigabyte have turbo versions that I believe would fit into your Supermicro case without any mods.
Hi there. Thanks so much for the comment, like, and sub! Really, really appreciate that. The main reason I went with that particular GPU was price. I found them for a really good deal, so I figured I would make them work. lol. There are a couple of form factors that would likely work without mods. The two you are mentioning, from what I remember, would probably work. I checked explicitly on the Founders Edition 3090s. By my calculations, these should work with no extenders.
Hello, as a noob, would you happen to know if you can cluster two GPU servers together? I've seen a couple of videos of servers with 8 GPUs; however, the cost of those rigs is way out of my range. If I remember correctly, those machines were running Threadripper CPUs, and I believe the guy in the video stated that the machine he was using was on loan and cost about $50k. So, suffice it to say, I can't afford that. 😂
Hey there! Thanks for the great question! Yes, you can definitely cluster two GPU servers together, and there are several approaches to doing this, some more formal and advanced than others. A key consideration when clustering GPU servers is high-speed networking: technologies like InfiniBand or 100GbE are commonly used to ensure fast, low-latency communication between nodes. Additionally, there are various software options for distributed GPU computing, such as SLURM, Docker Swarm, and PyTorch Distributed, which help manage and scale workloads across multiple servers. I'm actually planning to build a GPU cluster later this year and will be making videos on the most cost-effective ways to set it up! Regarding the cost, you're absolutely right that there are much cheaper alternatives to a $50k rig. If you opt for refurbished servers and older GPUs, the cost can drop to around 1/5 of that price (or even less in some cases). While $10k might still seem high for many, it's much more manageable than $50k, and it's definitely possible to build a powerful setup on that budget.
I wonder if PCIe to SFF-8643 and then to SFF-8644 and then back will work (there is something like EPCIE16XRDCA02A from IOI Technology Corporation, but they definitely cost a fortune)
Hi there. Thanks so much for the comment. So there is something I have found here that works, but it is really expensive. Check out the links below and let me know if this solution works for you. www.microsatacables.com/slimsas-16i-sff-8654-to-pcie-x16-gen4-slot-backplane?srsltid=AfmBOopW-iTPCxqUSrN6NX8635QgsuJ_YJXou2cZfmQcC2ozFhuR9U0d www.microsatacables.com/slimsas-16i-to-slimsas-16i-pcie-gen-4-cable-1-meter www.microsatacables.com/pcie-x16-with-redriver-to-slimsas-16i-add-in-card-pcie-4-0 a.co/d/8NgXBSz
Hi there. Thanks so much for your comment! Could you provide a bit more context for your question? I would be happy to help. I am just not sure exactly what you are asking here.
Hi, I was wondering if you could give me the basic specs of what you think is the optimal $2,000 ML rig. I'm imagining it might have 4 P40s or similar, but I can't find any cheap server for 4 GPUs. The 1u 720s appear to support a max of 2 GPUs, then the expensive Supermicro supports up to 8. Is there an intermediate solution? Thanks for your help
Hey Eric! I spent some time this morning looking for something that would fit your particular situation. Please take a look at the following and let me know what you think:
ASUS ESC4000 Server - $499.00
www.ebay.com/itm/134879048174?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=rI5jpFxtSLW&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
P40 GPUs x4 - $149.99 x 4 = $599.96
www.ebay.com/itm/196310785399?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=DDOYB0ZoRzO&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
3.5" HDD 10TB - $69.99 x 4 = $279.96
www.ebay.com/itm/156130335844?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=kE7Z0-UgRW6&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
Power Cords - $9.56 x 2 = $19.12
64 GB DDR4 RDIMM Modules (Optional) - $99.99 x 4 = $399.96 (DOUBLE CHECK THIS RAM WILL WORK WITH THIS SERVER)
www.ebay.com/itm/224440216180?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=hCBgAHuqSCi&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
Server Rails (Optional) (DOUBLE CHECK COMPATIBILITY) - $150.23
www.newegg.com/p/1B4-005K-01646
GRAND TOTAL: $1,948.23
Other considerations: This server does not come with a RAID controller, so if you want one that will be a bit more. I would personally recommend software RAID, as there are some pretty good options out there. Also, you do not need the additional RAM, but I would recommend getting at least 256GB. You can also just add to the RAM you already have, which would be cheaper, but I like to go with more memory-dense modules to allow room for expansion. Finally, this is just a suggestion, so please do your own research before you actually buy everything to make sure it will all work together. I have done my best to ensure compatibility, but double check me for sure before you buy. You will also need to take into account shipping costs and taxes for your area, so it will likely be a little above the $2K mark. If that is a hard limit for you, you can always remove things from the suggested setup and add them later (or not at all). With that said, please feel free to change the setup to whatever makes the most sense for you. This is just what I would do. Hope this helps, and please let me know how it goes for you!
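A quick way to sanity-check a parts list like this is to recompute every subtotal from the unit price and quantity. The sketch below does that with exact decimal arithmetic, using the prices quoted in the list. The item labels and the `optional` flag are just an illustrative way to organize the data; shipping and taxes are not included.

```python
from decimal import Decimal

# (item, unit price in USD, quantity, optional?) using the prices quoted in the list
PARTS = [
    ("ASUS ESC4000 server", Decimal("499.00"), 1, False),
    ("P40 GPU", Decimal("149.99"), 4, False),
    ('3.5" HDD 10TB', Decimal("69.99"), 4, False),
    ("Power cord", Decimal("9.56"), 2, False),
    ("64 GB DDR4 RDIMM", Decimal("99.99"), 4, True),
    ("Server rails", Decimal("150.23"), 1, True),
]


def build_total(parts, include_optional=True):
    """Sum unit price * quantity, optionally skipping the optional items."""
    return sum(
        (price * qty for _, price, qty, optional in parts
         if include_optional or not optional),
        Decimal("0"),
    )


if __name__ == "__main__":
    for name, price, qty, _ in PARTS:
        print(f"{name}: {price} x {qty} = {price * qty}")
    print("Total with optional items:", build_total(PARTS))
    print("Total without optional items:", build_total(PARTS, include_optional=False))
```

Dropping the two optional items leaves a fair amount of room under a $2K limit for shipping and taxes.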
There are risers with an OCuLink interface. They are more expensive, but they have more compact and longer cables (up to 1m, I believe). You can connect up to 4 cards to a single x16 PCIe slot, if 4 PCIe lanes per GPU are enough for your tasks.
Hi there! Thanks so much for the comment. Interesting. I have never heard of oculink, but I will certainly check it out! Thanks so much for the heads up.
I'mma have to fourth this suggestion. As-is that's a mildly terrifying setup. Typical PCIe extensions only come in certain lengths due to data integrity concerns with PCIe 4.0 and up. Quality Oculink cables have embedded redrivers to ensure link integrity and are tear-out resistant, which is MUCH safer...
Interesting. That explains why the long ones were so difficult to find. Haven't seemed to have any issues with data integrity, but I will certainly check into this issue. Thanks for the comment!@@KiraSlith
I have been looking and so far I can really only find Oculink cables in X4 and X8 lane configurations. I have not seen anything for the full X16 lanes. They seem like they are mostly for connecting SSDs. Do you have an example of one that could be used for x16 lanes to replace the PCIE extenders I used in my setup? I am struggling to find anything that looks like it would work. @@KiraSlith
@@TheDataDaddi You'll have to run 2 Oculink cables if you want the full 16x on both ends. Chenyang makes the x16 to dual Oculink i8 card for the server's end, and "Micro SATA Cables" (it's their brand name) sells the receiving adapter for the GPU, "Oculink 8i Dual Port to PCIe x16 Slot". Slot A on the card goes to CN1 on the receiver, Slot B on the card goes to CN2 on the receiver. Don't mix them up or you'll probably get caught in a boot loop.
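For anyone weighing x4 versus x8 versus x16 links in this thread, the theoretical bandwidth is easy to estimate from the PCIe line rate. The sketch below uses the published per-lane rates for PCIe 3.0 (8 GT/s) and 4.0 (16 GT/s), both with 128b/130b encoding. It gives the theoretical maximum per direction only; real-world throughput is lower because of protocol overhead, and the function name is just illustrative.

```python
# Per-lane line rate in GT/s and encoding efficiency for each PCIe generation
PCIE_GEN = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def pcie_bandwidth_gb_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    rate_gt_s, efficiency = PCIE_GEN[gen]
    # GT/s * encoding efficiency = usable Gbit/s per lane; divide by 8 for GB/s
    return rate_gt_s * efficiency / 8 * lanes


if __name__ == "__main__":
    for gen in (3, 4):
        for lanes in (4, 8, 16):
            print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

So splitting one Gen4 x16 slot into four x4 links leaves each GPU with a little under 8 GB/s, about a quarter of the full-slot figure. Whether that matters depends on how much data moves between host and GPU in a given workload.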
32:50 you could essentially do it using a clothes drying rack
Basically. Lol. It's not a very elegant solution, but it works pretty well actually. I wasn't sure if I liked it at first, but it has been working out quite well and it keeps the GPUs surprisingly cool.
thanks for the information and video! helped a lot! Keep up the good job!
Hi there. Thanks so much for the kind words, and I am so glad that this video helped you!
I'm new to this field. Could you make a video on a PC build for AI/ML/DL? I'm thinking of building one and accessing it remotely using AnyDesk to do computational work when needed. What do you recommend, a server or a PC build? Could you please suggest one?
So I actually have another video series on a PC build for AI/ML/DL. I will link it below.
ruclips.net/video/8JGe3u7_eqU/видео.html
I think the server route is more cost effective and faster to set up. However, when creating a custom build you have more control. Personally, I prefer the server route for the cost and time savings.
@@TheDataDaddi Thanks sir!
@@HemanthSatya-eo4rq Sure! No problem. Happy to help!
This is awesome, that’s exactly what I was looking for!
Hi there. So glad this is what you needed! Thanks for the comment!
Omg, you are awesome, thank you! I was trying to find a solution to fit 2 3090 Tis in my Design 7 case with 14 drives. I found the solution, thank you!
Congratulations on your channel. I would like some help, please: could you tell me if the Dell T630 supports 4 RTX 3090 24GB GPUs?
Hi there. Thank you so much for the kind words and the question.
I am not super familiar with the T630. However, from my brief research, it looks like there are technically enough x16 PCIe slots. It's going to depend on the form factor of the GPUs and whether they will actually fit in the machine. My gut says no. I would say you will most likely only be able to fit 2 3090s. Even that is still a maybe based on the space available inside the T630. Hope this helps!
Hi @@TheDataDaddi ,
Thank you so much for taking the time to look into my question and for your insightful analysis. I really appreciate you pointing out the other factors that could be an issue when installing 4 RTX 3090 GPUs in the T630. Your observations made me reconsider, and now I’m thinking about possibly mounting the GPUs externally like you did. Besides the space, I’m also concerned about the heat these cards will generate inside the server.
Wishing you continued success with your channel and thanks again for creating high-quality technical content!
All the best! Your videos have been incredibly helpful, and I’m sure they’ll keep helping many others in the tech community. Keep up the great work!
Any packet losses due to the long cable lengths? I tried to convince GPUrisers to start making long PCIE cables that are non ribbon style. They claimed there would be issues with lengths that long.
Hi there. Really appreciate the question.
Yeah, I have experienced this actually. It was not horrible, but once a week or so I would get a random error and have to reset the GPUs. I am not sure if this was due to the length or the poor build quality or both, but I have since found a better solution. I just have not had time to make a new video on it. I will make one soon though!
www.eproductsolutions.com/slimsas-16i-sff-8654-to-pcie-x16-gen4-slot-backplane
www.eproductsolutions.com/slimsas-16i-to-slimsas-16i-pcie-gen-4-cable-1-meter
www.eproductsolutions.com/pcie-x16-with-redriver-to-slimsas-16i-add-in-card-pcie-4-0
It works flawlessly, but is much more expensive.
Wait, is the KDE GUI not occupying any RAM on any GPU @41:07?
When I was working on a similarly sized workstation, it was annoying to see Gnome be a little too hungry..
P.S.
Just new on your channel and subscribed, good job. Waiting for the benchmarking, in particular how the P40 will perform with half precision. My experience with that generation is that it may not be worthwhile compared to double, but I may be wrong.
Hi there! Thanks so much for the comment and the subscription! Really appreciate that.
To my knowledge, the GUI rendering is handled by the Aspeed AST2400 BMC (Baseboard Management Controller), which is the motherboard's built-in graphics card. Yeah, I have read that Gnome can be like that. It was one of the reasons I went with KDE Plasma instead.
Yep, I am working on finishing up a preliminary benchmark suite. As soon as I get it finished, I will do a video on the P40 vs P100 and open source the project via a GitHub repo.
Thanks again!
@@TheDataDaddi Yeah, the embedded graphics, I suspected that. Mine doesn't have it, but I have to say that with that kind of setup it is very convenient to have. At least you will not experience the frustration of a GPU not fitting the whole model because someone is logged in using Firefox :D
Keep us posted on the benchmark project. If I can, I'll run it myself on the hardware I have.
When you get it running, I will be very interested to see where the bottleneck is and whether it's the CPU or the GPUs, because the older Xeons seem to be a bottleneck.
I will eventually make a whole series devoted to LLMs and best setups just for that. In that series, I will definitely report back on where the bounds are. Unfortunately, my time is just incredibly limited these days. I will try to get this info out as soon as I can.
Just leaving a comment to help the channel grow and let you know that I appreciate the content you're putting out. You may have covered this in another video, but can you please explain the power setup for your lab? I'm assuming a dedicated 20A circuit for the lab, but any details around the power consumption, UPS(s), and any gotchas you've encountered would be very beneficial. Thanks and looking forward to seeing the channel move forward.
Hi there. Thank you so much for commenting. It really does help out the channel.
So I have a power, heat, and sound video for the Dell R720s I have (linked below), but I have not done one for the full home lab. I was planning on doing one soon, once it is fully modified. Still have a few more things to do. I currently have one 20A circuit run for the lab, and it seems to be fine for my needs thus far. But I am going to run a second 20A circuit over there soon so that I can expand in the future if needed. Stay tuned though for a full video on this soon.
ruclips.net/video/tmMx8AouTGA/видео.html
Maybe it’s a beginner question: what is the difference between your 1 year old video on beginner’s ml rig and this server?
I assume the earlier one is more like a sandbox for machine learning hobby projects, while this server is more for local hosting of LLMs.
Or maybe they are in different price ranges too. Could you please clarify?
Hi there. Thanks so much for the question!
Yep, that is a fairly accurate view.
The first ML rig, which was a custom build, is more for smaller/single projects. It has also worked well for me as a nice test bed before scaling up projects. I think this was about $1000 at the time I made the video. Here is what it would cost today: pcpartpicker.com/user/sking115422/saved/#view=zrXn23
The server in this video was designed for large scale DL applications like working with some of the smaller open source LLMs, diffusion workloads, running many projects at once (this one has been a godsend), and large scale GNNs. I would say that this server is pretty much equipped to do anything except work with very large LLMs (think Llama 70B, as this pretty much requires a cluster to work with). I believe this rig cost me about $8300 to build. However, a large portion of that was RAM. I also bought RTX 3090s instead of P40s or P100s, so you could scale things down a bit and save a lot of money here. I chose these specs for particular use cases that may not be applicable to everyone. I believe I address that in part of the video series if you are interested in more details there.
So I've got the same 4028GR-TRT. The RTX 3090s you have in there can't fit in the server; however, the 3090 Founders Edition can fit in the case. It takes a little bit of forcing, but they will fit since the power comes out the side with that weird adapter. That way you don't have to run a whole bunch of other crazy risers all over the place.
Hey there. Thanks so much for the comment!
Ah cool! This is great to know. I figured that the Founders Edition would at least have a shot at fitting in, but I wasn't sure, so I went the external route. As for the adapter, are you talking about something like this?
www.ebay.com/itm/134077093756
40:55 my thoughts are that it would be much easier to cut the cover, or make some kind of 3D printed cover with holes for the 3090 or 4090 PSU connectors. The only obstacle would be providing a sufficient amount of power to the 4090s, but you can always limit the power of the card... or just add a PSU on top of the server. Still... 8x 4090/3090 won't fit width-wise.
Yeah, you are probably right here. Cutting the cover and replacing it with a 3D printed cover is certainly an interesting idea. I actually thought about 3D printing a mount for the GPUs, but I just haven't had time to design one. I am not sure what the power output is for each of the 12V EPS connectors. It may actually be enough to power the 3090s. Idk about the 4090s though. In any case, you could likely only fit 4 GPUs in the server if you went that route, so there should be plenty of power in the server (theoretically). There actually may be some other form factors though that would allow you to fit all 8. I am just not sure.
@@TheDataDaddi I'm considering buying 6-8x 4090 and just playing with the cover. Those 4090s would be 2 PCI slots wide with individual water cooling. But still the server would be loud - ah, I want it to be silent.... ah, no good solutions. All these factors add up to a big mess ;)
Yeah, the silent part is really going to be your biggest issue. The SM4028GR-TRT is truly the loudest server I have ever worked with. On boot and when under heavy load it can get above 90 dB. However, at idle it's not really too bad. I would be really curious to know if you can fit 8x 4090s. Please keep me updated on your journey if you remember!@@gileneusz
@@TheDataDaddi Sure, I'm leaning towards buying 2xA6000 to get 96GB of VRAM on a desktop, without any server... I have 2 cats, and they hate noise.... 😅 but I'm thinking about the alternative of buying an SXM4 server and populating it with 4xA100 40GB. I can buy them used on eBay for $5k each, but reselling those things later would be almost impossible...
2xA6000 would be a good route if you are looking to keep things quiet. Thankfully, unless mine are booting up, my cat doesn't mind the noise. lol. As far as the SXM4 A100s are concerned, this would definitely be an interesting route. I think you could probably resell them actually. You just might have to be more patient, as most people do not have SXM4 servers. @@gileneusz
Hey. Thanks for all the materials that you create, the GPU Excel sheet, etc. I have a question. Do you recommend that I invest a little bit more to build an SXM platform for a home lab? Something with cheap P100 16GB SXM2 cards that I could later swap for V100 32GB SXM2 cards?
Hi there. Thanks so much for the comment! So glad the content is useful to you.
I will be honest with you: I am not the most knowledgeable about SXM in general, but I certainly think it would be cool to experiment with. I cannot speak on the performance benefits of SXM over more traditional architectures. One thing I have heard about SXM is that it can be difficult to set up correctly. I would say this though. It is much more common to find non-SXM GPUs, so they are likely more plentiful and easier to trade, upgrade, etc., and there is more of a community around how to install and set them up. Personally, I would probably avoid the SXM architecture because of the added cost and murkiness around setup, but if you want the extra performance gains or just want to experiment with it, I would say go for it. I would be extremely interested to hear how that goes if you do go that route.
Maybe one day when the channel gets bigger I can buy some SXM GPUs and make some videos on how all that works and how the performance compares with the traditional architecture.
I am sorry I could not provide more guidance here, but I hope this helps you! Please do let me know what you decide to do.
Hm, I'm still watching, but I'm looking to build a deep learning rig to host Llama 3 400 dense. How many 3090s can I put together, and how do I learn to do that?
Hi there! Thanks so much for the comment.
I am not really sure it is possible for you to host Llama 3 400 dense on 1 machine (without quantization). By my calculation, you would need close to 1TB of VRAM to hold the model. This would need to be split across many GPUs. Even if you used many H100 or A100 GPUs, you would likely not be able to host them in the same machine. It would take something like 12 or more of either, even with the 80GB versions. I do not know of any single servers that could support this.
In this particular case, the Supermicro 4028GR-TRT shown in this video could theoretically handle up to 8 RTX 3090s rigged externally. At 24GB each, that is only going to give you about 192 GB of VRAM. You might be able to get away with hosting the full 70B model without quantization with that amount of VRAM. However, for the 400 dense you are still a long way off.
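If you want to check the VRAM math yourself, here is a rough sketch in Python. It only multiplies parameter count by bytes per parameter. The 400e9 parameter count and the 1.2 overhead factor (for KV cache, activations, and so on) are my own illustrative assumptions, not measured values.

```python
import math

def weights_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes). FP16 = 2 bytes."""
    return num_params * bytes_per_param / 1e9

def gpus_needed(num_params: float, vram_per_gpu_gb: float,
                bytes_per_param: float = 2.0, overhead: float = 1.2) -> int:
    """GPUs needed to hold the weights plus an ASSUMED overhead factor."""
    total_gb = weights_gb(num_params, bytes_per_param) * overhead
    return math.ceil(total_gb / vram_per_gpu_gb)

print(weights_gb(400e9))       # 800.0 GB of FP16 weights for ~400B parameters
print(weights_gb(70e9))        # 140.0 GB of FP16 weights for 70B parameters
print(gpus_needed(400e9, 80))  # 80GB cards (A100/H100 class)
print(gpus_needed(400e9, 24))  # 24GB cards (RTX 3090 class)
print(8 * 24)                  # 192 GB total across eight 24GB cards
```

Treat the GPU counts as a floor. Real deployments also depend on context length, batch size, and the serving framework.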
To be able to host a model of that size, a much more practical and realistic way to do it would be to setup a distributed computing cluster with many GPUs on several different servers. In a distributed computing setup, each server would handle a portion of the model, allowing you to distribute the computational load and memory requirements across several nodes. This approach not only makes it feasible to host large models like llama3 400 dense, but it also enhances overall performance through parallel processing.
To implement this, you might try utilizing frameworks that support distributed deep learning, such as TensorFlow with tf.distribute.Strategy or PyTorch with torch.distributed. These frameworks are designed to help manage the distribution of data and computations, ensuring that the workload is evenly spread and that the nodes synchronize effectively.
Tossing up between 3090s, A4000, and P40/P100 cards for my use case, which would not exactly be ML/DL but rather local LLM usage hosted using something like Ollama and various (I assume at least q4) models with higher parameter counts. I'm also dabbling with Stable Diffusion as well - at the moment I am shocked I'm able to run q4 quantized LLMs via LM Studio as well as Stable Diffusion models on my little old aging M1 2020 MacBook Air with 16GB RAM. I'm getting into the homelab idea, especially the idea of using a Proxmox server to spin up different VMs (including the Mac ecosystem) with way higher resources than what I'm working with currently. I'm also looking to integrate a NAS and other homelab services for media - but the GPU component is where I'm a little hung up - just what tier of card, exactly, is needed for this sort of use case? Am I nuts to think I could run some of the lesser quantized (as in, higher q number) LLMs on the low profile cards, as well as SD? It's been 10+ years since I've built a PC and I am totally out of my element in terms of knowing just how good I've got it using the M series of chips - I've even been hearing of people running this sort of setup on a 192GB RAM M2 Ultra Mac Mini Studio, but would really love to get out of the Apple hardware if possible. I realize this was several questions by now... but, to distill this down, GPU thoughts? lol
Hi there. Thanks so much for your question!
Yeah, so this is a really good question. It really depends on the size of model you are trying to run. For example, to host Llama2 70B in FP16 you need approximately 140GB of VRAM. However, you could run quantized versions with much less. Or you could always work with the smaller model sizes. In terms of GPUs, I would recommend GPUs that have at least 24GB of VRAM. I have been looking at this a lot for my next build, and I think I actually like the RTX Titan best. The RTX 3090 would also be a good choice; its FP16 performance just isn't as good. I think the P40/P100 are also great GPUs for the price, but for LLMs specifically they may not be the greatest options because the P100 has only 16 GB of VRAM and the P40 has very poor FP16 performance. Another off the wall option is to look at the V100 SXM2 32GB. Since these are SXM2, they are cheaper, but there are a lot fewer servers that they will fit in. The only one I know of off the top of my head is the Dell C4140/C4130. From my research, the SXM2 GPUs are also fairly tricky to install. Anyway, these are the routes I would go to make a rig to host these models locally. I will eventually build a cluster to host and train these models locally, so stay tuned for videos to come on that subject if you are interested.
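To get a feel for what a given card can hold at different quantization levels, here is a small sketch that goes the other way: from VRAM to the largest model that fits. The 20% reserve for KV cache and runtime overhead is an assumption of mine, and real q4 formats usually use a bit more than 4 bits per weight, so treat the results as optimistic ballpark figures.

```python
def max_params_billions(vram_gb: float, bits_per_weight: float,
                        reserve_fraction: float = 0.2) -> float:
    """Largest model (in billions of parameters) whose weights fit in VRAM.

    reserve_fraction is an ASSUMED share of VRAM held back for the KV cache,
    activations, and framework overhead; tune it for your own workload.
    """
    usable_bytes = vram_gb * 1e9 * (1.0 - reserve_fraction)
    bytes_per_weight = bits_per_weight / 8.0
    return usable_bytes / bytes_per_weight / 1e9

# VRAM sizes for the cards discussed above.
cards = [("P100 16GB", 16), ("P40 / RTX 3090 24GB", 24), ("V100 32GB", 32)]
for name, vram in cards:
    for bits in (16, 8, 4):
        size = max_params_billions(vram, bits)
        print(f"{name:>20} @ {bits:>2}-bit: ~{size:.1f}B params")
```

For a model split across several cards, pass the combined VRAM instead of a single card's.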
do we need nvidia SLI to fine tune LLM with multiple GPU?
Hi there. Thanks so much for the comment.
First and foremost, NVIDIA SLI is different from NVLink. SLI is primarily designed for linking two or more GPUs together to produce a single output for graphically intensive applications (gaming in particular). It is not really designed for AI/ML/DL, and is not really used for this purpose to my knowledge. Also, the 3090 does not use SLI; it uses NVLink.
For NVLink, you do not necessarily need it. It does not make the total memory pool for each GPU any different, but it is certainly a nice to have. It will significantly speed up most operations as it allows communication directly between the GPUs at hundreds of GB/s. So, it will not prevent you from working with LLMs, but it will make you much faster when dealing with them if that makes sense.
Would love to see inference speed of something like llama on the P100 and P40 cards. I have dual 3090's so I'm familiar with that, but looking to 8x to gain more vram, but don't want the complexity of consumer cards. Have you considered the AMD MI100 card by chance?
I am working on some benchmarks now. I will try to get some quick ones for some of the open source LLMs because everyone seems to be most interested in those at the moment. Stay tuned, and I will try my best to get them out as quickly as I can.
I have not personally gone the AMD route yet, but it's something I plan on experimenting with down the road. It's funny you mention the MI100 card though. I was actually talking to a viewer the other day, and he was telling me about his experiences with AMD and using the MI100. To summarize his experience: "AMD is not worth it if you value your time, but once it's working it is fairly decent and a good alternative to Nvidia."
If you are interested in this route, please reach out to me. You can find my contact info in my RUclips bio, and I can try to put you in touch with the other viewer.
Look up the PSU connector pinout. There is one pin that the motherboard connects to ground to make the PSU switch on. Pretty sure that is all the "tester" does. Historically, PSUs had trouble providing the rated 12V power without pulling any of the 5V power, so you had to put some load on the 5V lines, but I think this is no longer needed for modern PSUs. I have built the famous IKEA clusters, using one power supply for each pair of motherboards; you can do some crazy things with power supplies.
Interesting. This definitely makes sense. Thanks so much for the tip! Really appreciate that.
If you have different video card models like your Tesla p40 and 3090, do they still share vram and resources to do AI on Ollama with good efficiency? Is there one slowing others down?
Hi there. Thanks so much for your question!
I have not actually tried this yet with respect to Ollama specifically so take what I say with a grain of salt. However, my intuition is that when making use of data parallelism (copying the same model across both GPUs and splitting batches across GPUs) the slower GPU (the p40 in this case) will not create a bottleneck. It should still improve the throughput overall. However, when trying to use 2 GPUs to host larger models, specifically models that cannot fit on single GPUs and must be sharded, the p40 would be a bottleneck. Throughput would be constrained by the slowest GPU. Hope this helps!
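Here is a toy model of that intuition in Python. It is a simplification, not a benchmark: the tokens/s numbers are made up, the replicated case models independent inference copies (not synchronized training steps), and the sharded case assumes a plain layer split with no transfer overhead.

```python
from typing import Sequence

def replicated_throughput(tokens_per_s: Sequence[float]) -> float:
    """Each GPU serves its own full copy of the model, so the rates add up."""
    return sum(tokens_per_s)

def sharded_throughput(tokens_per_s: Sequence[float],
                       layer_share: Sequence[float]) -> float:
    """One model split by layers; every token passes through each GPU in turn.

    tokens_per_s[i] is the rate GPU i would reach if it could run the whole
    model alone, so its time on its share is layer_share[i] / tokens_per_s[i].
    """
    seconds_per_token = sum(
        share / rate for share, rate in zip(layer_share, tokens_per_s)
    )
    return 1.0 / seconds_per_token

# HYPOTHETICAL rates: a fast card at 30 tokens/s and a slow card at 10 tokens/s.
rates = [30.0, 10.0]
print(replicated_throughput(rates))           # 40.0 tokens/s combined
print(sharded_throughput(rates, [0.5, 0.5]))  # about 15 tokens/s with an even split
print(sharded_throughput(rates, [0.8, 0.2]))  # more layers on the fast card helps
```

In this model the even split lands well below what the fast card could do alone, which is the bottleneck effect described above. Giving the faster card a larger share of the layers softens it, as far as its VRAM allows.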
@@TheDataDaddi thank you
Cool video! Liked and subbed! Is there a reason for not using the blower design 3090s? Asus and Gigabyte has turbo versions that I believe would fit into your Supermicro case without any mods.
Hi there. Thanks so much for the comment, like, and sub! Really really appreciate that.
The main reason I went with that particular GPU was price. I found them for a really good deal, so I figured I would make them work. lol. There are a couple of form factors that would likely work without mods. The 2 you are mentioning, from what I remember, would probably work. I checked explicitly on the Founders Edition 3090s. By my calculations, these should work with no extenders.
great video
Hi there. Thanks so much for the kind words! Really appreciate you watching!
Hello, as a noob, would you happen to know if you can cluster two GPU servers together. I've seen a couple videos of servers with 8 GPU's however the cost for those rigs are way out of my range. If I remember correctly, those machines were running threadripper cpu's and I believe the guy in the video stated that the machine he was using was on loan and cost about $50k. So, suffice it to say, I can't afford that. 😂
Hey there! Thanks for the great question!
Yes, you can definitely cluster two GPU servers together, and there are several approaches to doing this, some more formal and advanced than others. A key consideration when clustering GPU servers is high-speed networking: technologies like InfiniBand or 100GbE are commonly used to ensure fast, low-latency communication between nodes. Additionally, there are various software management options for distributed GPU computing, such as SLURM, Docker Swarm, and PyTorch Distributed, which help manage and scale workloads across multiple servers.
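To see why the network link matters so much, here is a back-of-the-envelope sketch. It assumes gradients are synchronized with a ring all-reduce (one common approach, not necessarily what your framework will pick) and ignores latency, protocol overhead, and overlap with compute. The 7B parameter model is a hypothetical example.

```python
def ring_allreduce_seconds(payload_gb: float, nodes: int,
                           link_gb_per_s: float) -> float:
    """Rough time for one ring all-reduce over the slowest link.

    In a ring all-reduce each node sends (and receives) 2 * (N - 1) / N times
    the payload size.
    """
    if nodes < 2:
        return 0.0
    traffic_gb = 2 * (nodes - 1) / nodes * payload_gb
    return traffic_gb / link_gb_per_s

# HYPOTHETICAL workload: 7B parameters with FP16 gradients = 14 GB per sync.
grad_gb = 7e9 * 2 / 1e9
for label, gbit_per_s in [("10GbE", 10), ("100GbE", 100)]:
    link_gb_per_s = gbit_per_s / 8  # Gbit/s to GB/s at raw line rate
    seconds = ring_allreduce_seconds(grad_gb, 2, link_gb_per_s)
    print(f"{label}: ~{seconds:.2f} s per full gradient sync across 2 servers")
```

With two servers this works out to roughly 11 seconds per sync on 10GbE versus about 1 second on 100GbE, which is why the faster interconnects come up so often for training clusters.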
I'm actually planning to build a GPU cluster later this year and will be making videos on the most cost-effective ways to set it up!
Regarding the cost, you’re absolutely right: there are much cheaper alternatives to a $50k rig. If you opt for refurbished servers and older GPUs, the cost can drop to around 1/5 of that price (or even less in some cases). While $10k might still seem high for many, it’s much more manageable than $50k, and it's definitely possible to build a powerful setup on that budget.
I wonder if PCIe to SFF-8643 and then to SFF-8644 and then back will work (there is something like EPCIE16XRDCA02A from IOI Technology Corporation, but they definitely cost a fortune)
Hi there. Thanks so much for the comment. So there is something I have found here that works, but it is really expensive. Check out the links below and let me know if this solution works for you.
www.microsatacables.com/slimsas-16i-sff-8654-to-pcie-x16-gen4-slot-backplane?srsltid=AfmBOopW-iTPCxqUSrN6NX8635QgsuJ_YJXou2cZfmQcC2ozFhuR9U0d
www.microsatacables.com/slimsas-16i-to-slimsas-16i-pcie-gen-4-cable-1-meter
www.microsatacables.com/pcie-x16-with-redriver-to-slimsas-16i-add-in-card-pcie-4-0
a.co/d/8NgXBSz
tokens/s?
Hi there. Thanks so much for your comment!
Could you provide a bit more context for your question? I would be happy to help. I am just not sure exactly what you are asking here.
@@TheDataDaddi most likely they are asking about speed of inference of LLMs. e.g. ruclips.net/video/Z-JHgFs5BE0/видео.html
Good stuff
Thanks so much for the kind words!
Hi, I was wondering if you could give me the basic specs of what you think is the optimal $2,000 ML rig. I'm imagining it might have 4 P40s or similar, but I can't find any cheap server for 4 GPUs. The 1u 720s appear to support a max of 2 GPUs, then the expensive Supermicro supports up to 8. Is there an intermediate solution? Thanks for your help
Hey Eric! I spent some time this morning looking for something that would fit your particular situation. Please take a look at following and let me know what you think:
ASUS ESC4000 Server - $499.00
www.ebay.com/itm/134879048174?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=rI5jpFxtSLW&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
P40 GPUs x4 - $149.99 x 4 = $599.96
www.ebay.com/itm/196310785399?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=DDOYB0ZoRzO&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
3.5" HDD 10TB - 69.99 x 4 = $279.96
www.ebay.com/itm/156130335844?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=kE7Z0-UgRW6&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
Power Cords - $9.56 x 2 = $19.12
64 GB DDR4 RDIMM Modules (Optional) - $99.99 x 4 = $399.96 (DOUBLE CHECK THIS RAM WILL WORK WITH THIS SERVER)
www.ebay.com/itm/224440216180?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=hCBgAHuqSCi&sssrc=2047675&ssuid=xv3fo9_stiq&widget_ver=artemis&media=COPY
Server Rails (Optional) (DOUBLE CHECK COMPATIBILITY) - $150.23
www.newegg.com/p/1B4-005K-01646
GRAND TOTAL: $1,948.23
Other considerations:
This server does not come with a RAID controller, so if you want one that will be a bit more. I would personally recommend software RAID, as there are some pretty good options out there. Also, you do not need the extra RAM, but I would recommend getting at least 256GB. You can also just add to the RAM you already have, which would be cheaper, but I like to go with more memory dense modules to allow room for expansion. Finally, this is just a suggestion, so please do your own research before you actually buy everything to make sure it will all work together. I have done my best to ensure compatibility, but double check me for sure before you buy. You will also need to take into account shipping costs and taxes for your area, so it will likely be a little above the $2K mark. If that is a hard limit for you, you can always remove things from the suggested setup and add them later (or not at all). With that said, please feel free to change the setup to whatever makes the most sense for you. This is just what I would do.
Hope this helps and please let me know how it goes for you!
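If you want to play with the parts list (dropping the optional items, adding tax or shipping), here is a small sketch that tallies it in integer cents to avoid rounding slips. The prices are the ones listed above; the tax rate and shipping figures in the last call are placeholders I made up. Note that the RAM line works out to 4 x $99.99 = $399.96.

```python
# (item, unit price in cents, quantity, optional?) from the parts list above
parts = [
    ("ASUS ESC4000 server", 49900, 1, False),
    ("P40 GPU", 14999, 4, False),
    ("10TB 3.5in HDD", 6999, 4, False),
    ("Power cord", 956, 2, False),
    ("64GB DDR4 RDIMM", 9999, 4, True),
    ("Server rails", 15023, 1, True),
]

def total_cents(items, include_optional=True, tax_rate=0.0, shipping_cents=0):
    """Sum the list in cents, then apply an optional tax rate and shipping."""
    subtotal = sum(
        price * qty
        for _, price, qty, optional in items
        if include_optional or not optional
    )
    return round(subtotal * (1 + tax_rate)) + shipping_cents

print(f"Everything:    ${total_cents(parts) / 100:,.2f}")
print(f"Required only: ${total_cents(parts, include_optional=False) / 100:,.2f}")
# PLACEHOLDER tax and shipping values, swap in the numbers for your area
print(f"With 7% tax and $60 shipping: "
      f"${total_cents(parts, tax_rate=0.07, shipping_cents=6000) / 100:,.2f}")
```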
There are risers with an Oculink interface. They are more expensive, but they have more compact and longer cables (up to 1m I believe). You can connect up to 4 cards to a single x16 PCIe slot, if 4 PCIe lanes per GPU are enough for your tasks.
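For a sense of what dropping from x16 to x4 costs, here is a quick sketch of theoretical PCIe bandwidth per direction. It uses the standard per-lane transfer rates for Gen3 and Gen4 with 128b/130b encoding and ignores protocol overhead, so real throughput will be somewhat lower.

```python
# Per-lane transfer rate in GT/s; Gen3 and Gen4 both use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0}
ENCODING_EFFICIENCY = 128 / 130

def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8 * lanes

for gen in (3, 4):
    for lanes in (4, 8, 16):
        print(f"PCIe Gen{gen} x{lanes:<2}: ~{pcie_gb_per_s(gen, lanes):.1f} GB/s")
```

Whether x4 is enough depends on the workload. Inference with the model already loaded moves little data over the bus, while multi-GPU training that syncs gradients every step is much more sensitive to it.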
Hi there! Thanks so much for the comment.
Interesting. I have never heard of oculink, but I will certainly check it out! Thanks so much for the heads up.
I'mma have to fourth this suggestion. As-is that's a mildly terrifying setup. Typical PCIe extensions only come in certain lengths due to data integrity concerns with PCIe 4.0 and up. Quality Oculink cables have embedded redrivers to ensure link integrity and are tear-out resistant, which is MUCH safer...
Interesting. That explains why the long ones were so difficult to find. Haven't seemed to have any issues with data integrity, but I will certainly check into this issue. Thanks for the comment!@@KiraSlith
I have been looking and so far I can really only find Oculink cables in X4 and X8 lane configurations. I have not seen anything for the full X16 lanes. They seem like they are mostly for connecting SSDs. Do you have an example of one that could be used for x16 lanes to replace the PCIE extenders I used in my setup? I am struggling to find anything that looks like it would work. @@KiraSlith
@@TheDataDaddi You'll have to run 2 Oculink cables if you want the full 16x on both ends. Chenyang makes the x16 to dual Oculink i8 card for the server's end, and "Micro SATA Cables" (it's their brand name) sells the receiving adapter for the GPU, "Oculink 8i Dual Port to PCIe x16 Slot". Slot A on the card goes to CN1 on the receiver, Slot B on the card goes to CN2 on the receiver. Don't mix them up or you'll probably get caught in a boot loop.