One thing to note: the time to first token as discussed here is incorrect. It represents the overall response time of the LLM rather than the time until the first token arrives. This was due to a bug in the benchmark itself.
I have 3 P102-100 GPUs. One by itself runs great; however, with larger models combined they do struggle. For example, an 8B Q8 model runs at 32 tk/s but a 27B Q6 runs at 6 tk/s. Also, Ollama uploads the models into memory sequentially, meaning you have to wait 10 seconds per GPU over PCIe 1.0 x4 before processing starts. Still, these are cheap to play around with.
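The "10 seconds per GPU" figure checks out roughly against the link speed. A minimal sketch (the model sizes and the ~250 MB/s per-lane figure are my assumptions, not from the comment):

```python
# PCIe 1.x gives roughly 250 MB/s of usable throughput per lane,
# so the P102-100's x4 link moves about 1 GB/s.
PCIE1_LANE_GBPS = 0.25
LANES = 4

def sequential_load_seconds(model_gb: float) -> float:
    """Total upload time when GPU shards load one after another (as Ollama does).

    Adding GPUs does not help here: the uploads are sequential, so the
    total is just model size divided by the single link's throughput.
    """
    link_gbps = PCIE1_LANE_GBPS * LANES  # ~1 GB/s for an x4 Gen 1 link
    return model_gb / link_gbps

# Assumed on-disk sizes: 8B Q8 ~ 8.5 GB, 27B Q6 ~ 22 GB.
print(round(sequential_load_seconds(8.5), 1))   # ~8.5 s
print(round(sequential_load_seconds(22.0), 1))  # ~22 s total across 3 GPUs
```

At ~22 GB split over three cards, that works out to roughly 7 seconds per GPU, in the same ballpark as the 10 seconds observed.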
I have 2 P40 units and one Tesla M40, as well as an RTX 3060 12 GB for AI, spread around my lab. I love the P40s as they both have 24GB of VRAM.
Please, how did you use 2 P40s and one M40 in the same computer? I didn't understand why you need the M40. Thanks
@@bourgogneguillaume I did not use them in the same computer; I used the 2 P40s in one machine, and the M40 and RTX 3060 together in another computer.
Cool video, novel thoughts! Thank you!
Thank you!
I just bought a 3090 because I was frustrated at not being able to do everything with my Vega 56. For my usage, the P102 would have made more sense, I think.
Hi. Interesting. A couple of questions, if you're interested in doing a follow up. What's the power usage per operation like? How does it do on non-AI compute? How well does NVLink work? Here's my thinking. These might be decent for something like a personal Blender render farm. You could make a rig with maybe 2-4 of them. Especially for the RAM, 10 GB seems reasonable for the price.
these are great questions, thank you for them
power usage per operation is really interesting and something i would love to follow up on in the future. chipsandcheese.com probably has the most extensive information on things like this currently, but i am not sure if they measure in that much depth. will see what i can do, but it may take some time
the p102 does not have nvlink, just the older SLI. definitely curious to test it and see how it does
right now i am not looking at too many other workloads, but i could be interested in exploring them. i just don't work directly with them so my knowledge would be a bit less. if you point me at the workloads you'd like benchmarks for, i'll do what i can
I saw that someone tried to unlock x16 by soldering additional capacitors and hacking the BIOS. Do you know anything about this?
no, this is super interesting! would be really cool to see
I heard that it is impossible to unlock x16 on this device because some connections are missing inside the chip
I'll share my experience in case it helps other people :) I modded my M40: I installed a 980 Ti heatpipe cooler with three fans at 4k RPM and it works; the max temp is 65 °C ;) Look on the web to see if someone has modded your Tesla GPU. Now I'm trying to mod my Tesla P100 (it has a special chip; no other NVIDIA GPU has the same one) and I will install an AIO water cooler like the ones I've seen on the internet. Sorry for my bad English
It has a PCIe 1.0 x4 link. What about a multi-GPU setup and transfers between the cards? How big a bottleneck will it be?
Will follow up and test soon! Are you thinking mostly inference, or something else?
@@cj-pais inference, training, fine-tuning. What would be the best config for a multi-GPU setup with them? What would be the performance hit with row/layer split? Something like that. It would be useful for large LLMs and training: since each card only needs an x1 link, we can pack up to 16 into one PCIe x16 slot for 160GB of VRAM.
@@blarhblerh3436 this all sounds great, will see what's possible. i suspect training will be impacted fairly heavily by the lack of pcie bandwidth, but i'd love to test it and find out for real!
I have the same setup with 8 GPUs. At PCIe 1.0 x4 the initial upload of data to the GPUs is so slow :(
@@denismaleev3848 OK. I know it may be slow to load weights due to the PCIe speed. That should not be an issue because you do not change models very often. But what about inference speed?
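On the inference-speed question, a rough back-of-envelope suggests the slow link matters far less once the weights are loaded. With layer split, each generated token only moves one small activation vector across a GPU boundary. A minimal sketch (the hidden size and link speed are my assumptions, not measurements):

```python
# Assumed: a ~7-8B model with hidden size 4096, fp16 activations,
# and ~1 GB/s for the PCIe 1.0 x4 link.
HIDDEN = 4096
BYTES_PER_VALUE = 2           # fp16
LINK_BYTES_PER_S = 1e9        # ~1 GB/s

# With layer split, one activation vector crosses each GPU boundary per token.
per_token_bytes = HIDDEN * BYTES_PER_VALUE        # ~8 KB
transfer_us = per_token_bytes / LINK_BYTES_PER_S * 1e6

print(round(transfer_us, 2))  # ~8.19 microseconds per boundary crossing
```

A few microseconds per boundary is negligible next to the tens of milliseconds a single token takes to compute, which is why layer split tends to tolerate slow links for single-stream inference; row split moves data per layer and is hit much harder.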
What library/program did you use for inference on this GPU?
For language, vision, and speech-to-text I used the ggml-based llamafile and whisperfile
For diffusion it was done with ComfyUI
I'm from Indonesia, how much would it cost to get a p102-100?
I am not sure of the best way in Indonesia; here in the US we buy them off eBay
What is your system setup? Which drivers did you use to get it to work?
The benchmarking rig is:
* AMD EPYC 7352 24-core CPU
* 128GB RAM
* Ubuntu 22.04
* Kernel: 6.5.0-41-generic
* Drivers: NVIDIA 555.42.06
Drivers were installed via: sudo apt-get install -y cuda-drivers (proprietary, and supports the older GPUs)
@@cj-pais No other software or modifications required to get these cards to work? They are locked out of performing some things, as far as I know
@@def7782 nope! worked just like any nvidia card does, for CUDA applications at least. for other applications it may be different, but the workloads tested are all just compute, so it just worked
nice landing huahuaha
ahahahaha face first
That is not a good card: 5GB VRAM and 250W with only 3200 CUDA cores. There are better options than this. I'd even recommend a couple of V100s, as they're $150 CAD each with 300W and 5120 CUDA cores. So for nearly 2x the price you get 3x the VRAM at the same wattage. Also, setting up old NVLink servers is cheap.
It is almost a P40, but with 10GB of VRAM once the firmware is flashed. Best part: you can find them for $50. The V100 is good for ExLlama; the P40 or this card is better in Ollama.
Look into mini PCs with an AMD CPU/GPU; some can use 96GB of RAM,
and you can assign 16 to 42GB of that RAM to the GPU.
This is the best value for AI, hands down.
See Alex Ziskind's "Cheap mini runs a 70B LLM"