Putting a tariff on TSMC is absurd when the US doesn't have a competing product. The quantity of imported GPUs will basically stay the same, since big tech can't get enough of them, and prices will rise because of the tariff. TSMC isn't the one paying for that; US companies will.
Bro, use your status, nobody here will be mad at you. You earned it. You're not doing anything nefarious with it, and you're not part of the GPU shortage problem either, because you'd be getting a single one.
I would suggest using a cloud solution where you rent a GPU (cluster), do your work, and shut it down (to save costs). For local development, just use your current GPU (or upgrade it to the best available). You don't have to run a big AI/LLM while streaming; just use a small one. Or use a full-spec MacBook, where the RAM is also usable by the GPU because of the unified chip design.
It is time chip manufacturing went open source as well. Chips have become ubiquitous, and putting a brake on technology because of a "mine" mentality had a nice run but is no longer going to cut it for the future. ASML can lead the way or go bust when others go fully open source on their chip-making tech, so much so that we end up in an era of at-home chip fabrication akin to a 3D printer anyone can have at home.
I have seen a YT channel where someone created a chip in his garage. The basic principle is not that difficult; it's the tiny scale of a commercial chip that makes it so hard. You cannot be an atom off or you have a failing chip. That's not possible for DIY.
I bought a 3090 some time ago for $1500. Seemed way overpriced at the time, but I wanted to do AI dev. It works well enough that I don't feel the need to upgrade.
You may be very interested in the tests done on AMD GPUs with the DeepSeek models. The 7900 XTX outperforms the 4090 on the 14B distilled R1 and all smaller distills, and barely loses on the 32B.
I'd honestly go for the AMD 7900 XTX if I'm just trying to run smaller models. If I'm going for DeepSeek R1 671B, the cheapest way is somewhere between a Mac Studio and some retired server parts with huge amounts of RAM. GPUs are too expensive and hard to get right now.
There is a market open for anyone who figures out how to min-max this decision making. Instead of selling via the specs of the GPU, motherboard, etc., just sell based on the model size you can run on a rig and the tokens per second.
They won’t manufacture the 3090 because of the Sinclair lesson: don’t compete with your own product or you will end up with massive inventory you cannot move
My hot take / understanding (please correct me if I'm wrong): the fact that you can do 4-bit and below (there are even 2-bit quantizations!) suggests that current LLM architectures are oversized in parameters relative to their compute ability. If the neurons were "saturated", nearly any further quantization should significantly degrade the model's output.
Things are expensive when we don't have options, and Intel, Nvidia and AMD all take advantage of that. If we had multiple options for CPUs and GPUs, these tin cans would be priced sanely instead of overpriced, and we'd also see major innovation every year.
Not really. There are alternatives but they aren’t as good. You’re absolutely allowed to do that. What they step in for would be anti-competitive behavior. Hard to say if they meet that bar.
The problem is partially manufacturing capacity. There's no way TSMC can accommodate demand at this point, let alone allow for a competitive market. The other problem is that companies are using traditional gpu compute rather than ASICs. nVidia GPU prices will drop like a rock once some company figures out how to build a competitive AI-focused chip, cut costs from not needing 3D graphics support, cut costs by keeping traditional compute hardware external, and transpile CUDA (at least in some capacity) for adoption. It must be a very hard problem as this has been a needed area for about 15 years when scientific computing needed cheaper, more scalable alternatives to supercomputing clusters with thousands of traditional Intel/AMD CPU cores.
It's possible to run PyTorch code on Apple Metal API and I believe AMD ROCm as well. You just need to set PyTorch device to 'mps' for Apple instead of 'cuda'.
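A minimal sketch of what that device selection typically looks like in PyTorch; the fallback order and the tiny layer below are just illustrative placeholders, not anything from the video.

```python
import torch

# Pick the best available backend: CUDA (NVIDIA), MPS (Apple Metal), else CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Models and tensors are then simply moved to that device.
layer = torch.nn.Linear(4, 4).to(device)
x = torch.randn(2, 4).to(device)
print(device, layer(x).shape)
```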
Could probably use lambda labs (or similar) and figure out a way to easily spin an instance up/down (Terraform / OpenTofu?). Might be more interesting for watchers, too, since it's hard to drop $2-3k on a machine when just starting to experiment.
oml prime, I am doing that now with docs on my MacBook (64 gigs of RAM), using (for now) distilled models. You can just start exploring building agents that add to a RAG for that. It's still fun if you want to build out a crawler/agent to ingest, summarize, and then add it to a RAG that goes recursively through the site/docs.
The best you can hope for, if you don't go for a Threadripper socket, is 8 lanes to each of two GPUs on a high-end motherboard. The screw-up would be a motherboard where the second slot goes through the chipset and gets 4 slow lanes. Another issue is whether the second GPU physically fits in the case/motherboard combo, and whether the support bracket for the GPU's weight gets blocked. I stopped using my desktop with two users at the same time because my primary GPU couldn't handle its own weight well without a support bracket, and the secondary GPU slot blocked the bracket. A horizontal case instead of a tower could have solved that. But the trouble I had physically fitting a 1650 alongside a 2070 Super model that uses the 2080 Ti cooler makes me think you'd have serious problems if you don't think through fitting two 3090s, 4090s or 5090s in your system. The ideal would probably be a Threadripper in a horizontal case, and even then it would be a tight fit to get multiple cards inside one PC.
Did you watch Digital Spaceports videos? Ask the twitter guy you interviewed how to obtain the GPUs ... He had one in that video. Also, how many GPUs is Nvidia going to sell due to DeepSeek R1?
It's gonna sell more, because now every company that can do a 30-200k investment into a local AI assistant will. You no longer have to worry about your trade secrets leaking.
You can rent an H200 (140GB VRAM, 256GB RAM) for ~$3/hr, if you know where to look. Unless you're gonna run them 24/7, there is no good reason to buy any of the RTX cards mentioned here.
People don’t understand economies of scale and continue to yap about “1080ti” price to performance or older gen as if production, engineering, and R&D cost scales linearly. Nearly everything is described by a hyperbolic function, not a linear function.
If you are trying to train models, renting 8xA100 or 8xA6000 is pretty cheap. Then you can just turn them off when you aren’t training anymore. You will end up spending less money almost guaranteed.
We're not 32x faster than a 980 Ti today, so expecting us to advance that fast over the next 10 years is more than optimistic. The only way this would happen is if we had a radical breakthrough and shifted to an entirely different way of manufacturing chips.
We can generate 32x the frames in 10 years :)
Most of the articles and people kicking around those rumors are going off the supposed help AI will give to architecture and fabrication. Having the AI find the most optimal layout for the process and the tape-out is what is supposed to help incrementally . . . *Supposedly*
For real, absolute moronic take.
A 3060 would be perfectly fine for ALL LLM inference if it only had 40GB of RAM. No need for stupid H100s. But Nvidia themselves are scalping by not allowing your RAM controller to address more RAM, and I bet it's artificially locked and the hardware can in fact address more, just like overclocking is locked. (You think that wouldn't be the first thing I do when buying a graphics card, grab my hot-air solder station and replace the memory chips with bigger ones?)
32x fake frames maybe😂
You HAVE to talk to asianometry. He's the man to talk to about chip developments.
THIS! If you want to deep dive into the geopolitics and the future of CPU/GPU architecture, Asianometry is your guy.
A Collab between these two would be wild
This would be awesome.
But did he do a face reveal?
Or Ian Cutress
crypto, covid and now llm.. us normal consumers just can't catch a break.
But they are producing so many chips. The second hand market goes crazy
Don’t forget about tariffs!
Investment and progress is going nuts though. The newer cards are literally hitting the limits of currently known physics.
Don't worry, civilization will have collapsed by 2035. It might happen this year; if not, it won't be for lack of certain people trying their hardest.
@__Brandon__ most of the US chip factories are in Taiwan. So if China invades Taiwan, it will belong to China 🤣🤣
I bought a 4090 RTX FE about a year ago not realizing I just bought an appreciating asset.
I know, I was so hoping I’d be able to score a used one for cheap to replace my 3080 since all the whales would go for a 5090 but even the whales are backing out of this gen more than I thought
Whoever said 16-32x is out of their mind. We got only 6.31x since the Titan X in 10 years (it released in 2015), and the pace has slowed way down.
But the level of investment in manufacturing and development has increased by almost unimaginable levels, especially in the last two years. Ten years ago, NVIDIA's market cap was 11B and GPUs were a rounding error in global semiconductors. NVIDIA's market cap is now 3000B. The amount of money pouring into this space is wild. R&D at NVIDIA has increased over 10x in that period, and that doesn't even take into account TSMC and every startup in the world working on AI hardware. It's getting harder to shrink transistors, but the effort going into improving the process is increasing at ever faster rates. I don't know if we'll get 16x, but progress is coming.
The only way to do 16x is by linking 4 cards and going from the current N4 to a N1 process
@@Swiftriverrunning Market cap has nothing to do with the money a company has on hand for R&D. Often it doesn't even matter to the company at all. You don't understand what stocks are. How many shares has NVIDIA sold in the last 5 years? And I don't mean employees of NVIDIA.
@@Swiftriverrunning More investment has strongly diminishing returns on the speed of improvement.
2x every two generations. Which isn't yearly. More like 2.3 years per gen.
I see Casey, I click. Always an awesome conversation.
Thanks for this video, this was exactly what I have been thinking about right now.
why doesnt he include his name in the description? anyways these two are worthless retards.
As a GPU owner, I approve this message
which kidney did you sell?
@@EnterpriseKnightboth ☠️
@@EnterpriseKnight why not both?
But u have one…
Holding onto my 2070 till I see how this shakes out
Bought a 7800xt last month, from a 1060. It is pretty good value/benefit imo
What are you doing with that ?
@@Cahnisama I went from a 1060 to a 3060.
Was running 1080ti until winter 2023 :)
This was me. I had a 2070S and skipped the entire 3000 gen. Got a 4090 at a discount last year in the spring. Can't believe cards are getting scalped again like the previous gen.
Gamers Nexus put out a video about how they couldn't find a 5090 on day one. Truly a wild market.
There's no market if the market never existed to begin with.
"I just want to pay a $300-$500 more and have 48GB vram" nope. impossible.
I can understand the GPU processing gates limitation relative to price, but more memory lanes and chips should be cheaper in comparison...
GDDR is 32 bits wide, and it's moving from 2GB to 3GB modules, which ARE coming in the next year or two.
Nvidia may have problems because they moved to GDDR7, which only Samsung supplies currently.
The lower-end cards are going to need more VRAM, while bandwidth is improved by large caches.
@@rkan2 they aren't. The 5090 die is so big because the whole outer edge is driving I/O memory controllers at 512 bits; there's no room left.
VRAM is organised differently from the DDR modules used with CPUs, to maximise bandwidth.
@@RobBCactive I can understand the 5090 being a bin of the datacenter SKUs that have more RAM or fewer defects, but surely you could still at least double the amount of RAM by limiting the processing performance...
@@rkan2 have you seen the 5090 PCB?
It's fucking full, no more space.
They could make it 48GB if they REALLY wanted, by using 16 3GB GDDR7 chips instead of 16 2GB chips.
But the 5080 could have been given 2-3 more memory chips; they just didn't want to give you 20-24GB of VRAM for $1000 this gen.
If you mostly care about LLM inference, and especially if you're on Linux, AMD is perfectly fine. Ollama works, llama.cpp works, vLLM works. Performance is pretty good, and you get a lot of VRAM for cheap. Things only really get hairy (sometimes) when PyTorch enters the picture. Also, current AMD cards don't support FP8 and FP4, which is a bit of a problem for image generation but doesn't really matter for LLMs. I believe the 9070 will introduce FP8 support at least, but it only has 16GB of VRAM. That said, the upcoming Ryzen AI Max 395 might be a very interesting option for LLM inference, with 128GB of unified RAM and a much wider memory bus than previous APUs.
I'm by no means a power user and I've only ever wanted to do inference but AMD ROCm has always worked fine for me.
This is solid advice (for LLM) its a pain and a half to work with AMD GPUs for image gen (Flux, Stable Diffusion, etc)
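For reference, this is roughly what talking to a local Ollama instance looks like from Python, and it's the same whether the card underneath is NVIDIA (CUDA) or AMD (ROCm), since Ollama hides that. The model tag is just an example of something you'd have pulled already.

```python
import requests

# Ollama serves a local HTTP API on port 11434; the backend (CUDA or ROCm)
# doesn't change anything here. The model tag below is only an example.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # any model pulled with `ollama pull ...`
        "prompt": "Explain what a KV cache is in two sentences.",
        "stream": False,           # one JSON response instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])
```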
Didn't somebody run the 405B Llama on 8 mining AMD GPUs at 50 tokens/s?
To be honest, the more I look into this stuff, the more AMD cards make sense for inference: cheap and high VRAM.
I have an AMD GPU and I'm on Linux. I tried to apt install AMD ROCm and it asked for 50GB worth of library downloads 💀. Tried to push ahead anyway, and ended up bottlenecked on space in my root partition :(
@@comradepeter87 Find where it put the files, then make a symlink to another drive. It might also be possible to only install the ROCm runtime, which is way smaller, but some software may need the full dev package if it has to compile stuff. Anyway, I use symlinks a lot to keep my most-used models/checkpoints on the NVMe drive while offloading everything else to the SATA drive. The temporary download files are also sent to an HDD with a symlink, to avoid filling my home SSD with temporary trash.
I see chat spamming "Naive" / "Just you wait" at Casey's comment that we're at a point where we can barely push these new GPUs further. How dumb can people be?
People are hearing this from a veteran game developer who has some of the greatest insight into these things, and they don't believe him.
We are living at a time when you can have a 1080 Ti, a goddamned 8-year-old GPU, and it can still compete with the lower tier of the current generation of graphics cards (and it was also not that expensive at release, btw, at least before the crypto boom).
There's a reason Nvidia is pushing AI and software so hard: they know the current rate of hardware improvement is ass. Moore's Law died a while ago; it's not the early 2000s anymore.
Moore's law is "dead" because of the Nvidia and/or TSMC monopoly. You can only innovate so far with a single brain. The world needs more fabs from other countries.
The reticle limit is a thing, but then so are chiplet designs, from both AMD and Nvidia, even though only AMD has sold such cards as gaming GPUs.
Nvidia didn't go for N3 this gen and stuck with N4, a derivative of the N5 node. So Nvidia is holding back; they could have gone further and didn't. N2 is just about to release, and there are pathways to 18, 16 and 14 Angstrom nodes, so while some aspects of chips are not really scaling anymore, there's more than enough room for logic to keep shrinking and for GPUs to get more powerful in the next 10 years.
@@gggggggggghhhhoost my dawg, FinFET and gate-all-around are literally scraping the bottom of the barrel. Silicon wafers don't have the atomic radii for your precious electrons not to escape your increasingly delicate gates. We're relying on ASML here, not even TSMC! You don't even know what EUV is! There aren't any real nanometers below 9nm, it's all marketing!!!
@@gggggggggghhhhoost the monopoly for sure doesn't help the situation, but Casey is right about physics. A silicon atom has a diameter of about 0.2 nanometers and our best process nodes are right around 2nm. We only have about 10 atoms to play with between traces at that level. At that scale everything from simple optics (diffraction) to quantum mechanics like tunneling becomes a limiting factor. At 4nm, a single atom out of place is within a 10% (+-5%) manufacturing tolerance, while at 2nm you'd need a 20% margin of error.
Until we have tech that individually places atoms, lithography process improvements will keep slowing down dramatically the closer we get. I also didn't even talk about die size growth and how it affects yields. AMD's chiplet design helps mitigate yield defects, but they are currently not that competitive at the top end, and the stranglehold of CUDA adoption hurts them as well.
I don't know enough to claim that Moore's law is dead, but we are at some physical limits with chip production. Most people claiming /naive probably don't understand any of the manufacturing challenges. I mean, they already can't use optical lenses because the EUV light won't pass through glass.
if they can't produce enough of those chips, all they have to do is to activate mfgx4 to interpolate between two existing chips and it'll all be fine...
Pulled the trigger on a 7900 XTX, cause I can't keep waiting forever. Nvidia only leaving crumbs to consumers.
Prime, if you feel bad about asking for a 5090... ask for an H100 to use at home. It would be the first YT content about one being used at home 😂 and I want to see that content.
I like the people who said the 4090 would only be the 5th fastest video card once all the 50-series cards were out, thinking it was an own. It didn't quite work out that way.
I just bought a RX 570 (2017). Maybe I'll get a 3090 in 10 years time...
You should just get an RX 6600; it's $190 at the cheapest right now (or you could splurge 50 bucks more for the XT, but the Arc B570 is the better choice at that price range). That's ~50% more performance than the card you have right now at a relatively good price.
Just stay on AMD; a 7700 XT is 400 USD and is more than enough for most users at 1080p and 1440p.
@ that is still a very decent card; it can play GTA V with very high settings at 1080p at 60+ fps and Cyberpunk with low settings at 1080p at 50-60 fps (you can squeeze out some more with FSR). People tend to forget about the older cards since everyone wants the latest shiny thing, but these cards still have a lot of potential, especially if you don't plan on playing the latest, most demanding AAA games.
Though I don't know how well that one is going to perform if it was used for mining, haha.
@ I got the RX570 for $40 and the games I play aren't really that intensive
@@Definesleepalt yeah, you can use them to play videogames, who knew
I'm so happy that I don't need a beefy GPU. Mine is like a decade old or something.
26:38 bear in mind that combining architectures (Ampere and Ada) might give unexpected edge cases. Most often it will result in either disabled Ada features (best case) or, depending on what you're doing, it will simply refuse to combine the VRAM.
There's a one word explanation for this phenomenon: MONOPOLY. ASML is a monopoly, which has little incentive to boost production and reduce sales prices. The high price/scarcity in turn raises the barrier of entry for chip manufacturers, resulting in TSMC being almost a monopoly with just a little more competition and a little more incentive to reduce scarcity/prices. That in turn makes Nvidia just a little less of a monopoly for the same reasons. The AI companies and their investors have been hoping that the same concept will make them monopoly/oligopolies, which is why the Deepseek advancements tanked stock prices.
ASML does have competitors, except not for the high-end machines being used to create those chips. They paid a high price to create a machine that makes chips using EUV, which now gives them a competitive edge.
Just like nVidia also has competitors in AMD and Intel. Except there, too, AMD has given up on the high-end chip and Intel is just beginning (again).
So for now, we must wait until the AI hype is over, just like around the 4090 release we had to wait out the GPU demand for blockchain.
5090 costs ~€3,800 - €5,000 in Europe
5080 €1,300 - €2,300
Europe only has prices after taxes
Remember the crypto bubble that made me unable to buy the dream PC I saved up for.
Prime, I'll sell you my RTX 4090 for like $2.3k. Would ship from MN. Excellent condition, never used for mining or AI.
Get this to the top
I just bought one new for 1.8k :0
Prime, just give it 1-3 months, you'll be able to get a 5090 by then if you're faster than a snail checking out at an online retailer.
I bought my RTX 4090's two years ago when they came out, and now they are somehow worth 30% - 50% more than I paid for them. Wild times. I feel your disillusionment, prime.
's ??? damn.
"Why Buying GPUs Is a Disaster". Sorry.
Good job, A.
Good job 47, Fall back to base.
I thought for a sec that "buying GPUs" was a new category, as opposed to "gaming GPUs" 😂
As a non-native English speaker, I could understand that he did it on purpose, why couldn't some others?
Prime is biting the forbidden apple of rage bait
Just wait for the Chinese GPUs to prosper and the US will suddenly have a lot of chips
so another 5 years + then
@@Psikeadelic More like 2 years, if not 1. I have solid sources.
There are GPUs in China you can buy that are ~1080 performance, for more than a year now. They struggle with driver support and aren't really viable commercially, but supporting AI applications with software is a lot easier than supporting all of gaming. China's bottleneck remains whether TSMC is allowed to take orders from China on the latest node or not.
@@Kwazzaaap Harbin uni had an EUV lithography breakthrough, so more like 3 years. We don't need Chinese gaming GPUs; all we need are AI ones, to slash Nvidia's margins and make gaming attractive again.
I agree, but the tariffs will hurt. The 5090 shortage is a manufactured event. The Chinese are at least 10 years behind; with economic espionage they could shorten that gap, and they are trying the espionage route.
deepseek-r1:70b runs fine on a 64 GB M3 MacBook, at around 30 characters/second output, using ollama.
To run the full DeepSeek R1 model you need 800GB of memory; to train it, 1.5TB. You can use a few big CPUs with 128-256 cores. It will be slow, but it will work. Otherwise you need something like 10 GPUs with 80GB of memory, or 20 with 48GB, to run the model. The first option might draw up to 5kW of power, the second up to 10kW. That's $1-2 per hour in power alone, $24-50 per day, $1500-3000 per month. Double that if you want to train your model.
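The power math above works out roughly like this; the $0.25/kWh rate is an assumed figure, which is why the quoted range is wide.

```python
# Back-of-envelope check of the power-cost figures above.
# The electricity rate is an assumption; real rates vary a lot by region.
USD_PER_KWH = 0.25

def power_cost(draw_kw: float):
    per_hour = draw_kw * USD_PER_KWH
    return per_hour, per_hour * 24, per_hour * 24 * 30   # hour, day, ~month

for draw_kw in (5, 10):   # ~5 kW for 10x80GB GPUs, ~10 kW for 20x48GB GPUs
    hr, day, month = power_cost(draw_kw)
    print(f"{draw_kw} kW: ${hr:.2f}/hr, ${day:.0f}/day, ${month:,.0f}/month")
```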
Well i can get a 1.5TB RAM server with 44 cores for around $3000 at the moment.
@@llothar68 bro how?
Database Mart - you can rent a cloud machine there for like $450/mo with an A6000 - I think you can even get an A100 for under $1k/mo. Or LambdaLabs - rent by the hour.
Welcome to budgeting hell, have a nice stay. And be REALLY conscious of the VRAM aspect. The moment your model can't fit, you have different parts moving at different speeds, and the fast side will always be waiting on the slow side. A rough comparison: memory cache vs. on disk. So you have to weigh what the speedup on the GPU side buys you against the slowdown on the "non-GPU" side.
Ah well, I'm still waiting on the 5050, assuming it comes out, or the 5060 if not. It's a blessing when your models fit on "small stuff" because they're not trying to be everything for everyone.
CAN'T U JUST PUT IT IN MAIN SYSTEM MEMORY AND IF U CAN LOAD IN THE PARAMETERS YOU NEED FASTER THAN THE GPU CAN PROCESS THEM YOU'RE GOOD RIGHT? (I'm being fr)
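Partial offload is more or less what llama.cpp already does; below is a hedged sketch using llama-cpp-python, where the model path and layer count are placeholders. As the comment above warns, the layers left on the CPU become the bottleneck.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Split a GGUF model between VRAM and system RAM: the first n_gpu_layers layers
# run on the GPU, the rest on the CPU. Path and layer count are placeholders.
llm = Llama(
    model_path="models/some-model-q4_k_m.gguf",
    n_gpu_layers=20,   # raise until VRAM is full; -1 puts everything on the GPU
    n_ctx=4096,
)
out = llm("Q: Why is partial offload slower than all-GPU inference? A:", max_tokens=64)
print(out["choices"][0]["text"])
```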
Unfortunately, the most cost effective option for someone not running a continuous service is to rent space on the cloud...
That's not really unfortunate; it's exactly why the cloud was born.
@@christianferrario It IS unfortunate if the whole goal was to run these models offline
@@markdatton1348 well, it is offline, just on someone else's system lol
@@markdatton1348 Yeah, but it depends why. If the goal is to run it offline to avoid using their chatbot and prevent data leaks, you can still do that with your own cloud space. If it was to use it without a network connection, then yes, you have to pay for your own machine. Unlucky.
Dude, all the people in the chats going "IT'S THE SCALPERS" are so clueless. If you magically snapped your fingers and made scalping impossible, that wouldn't magically make more cards available. They just aren't making enough cards. You wouldn't get one either way.
apparently
Not to mention, scalping thrives on supply and demand. This wouldn't be an issue if you were always able to go to the store and get one at msrp. Scalpers will always exist, but they're only successful when supply falls short of demand.
I'd much rather have a lottery system at MSRP than high prices. Although I don't think fighting supply/demand like that will work (at least not without high costs and being intrusive - like suing everyone, implementing hardware limitations, background checks, and monitoring individual customers - that's not gonna happen, and even if it did it might just add barriers and raise the prices more).
The GTX 980 (165W) was released in 2014, and it scores 11110 benchmark points on PassMark.
The RTX 5080 (360W) was released in 2025, and it scores 37287 benchmark points on PassMark.
Basically you have a 3.34x improvement on this specific software benchmark.
I believe that's not even a fair comparison, because the RTX 5080 consumes 2.19x more power than the GTX 980.
It was a bigger jump from the 8800 GTX released in 2006 to the GTX 1080 released in 2016: basically 26.9x better for the newer one. I tend to agree that, unless something special is discovered, this way of building GPUs will not bring much improvement.
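Making the power point explicit with the comment's own numbers: once TDP is divided out, the per-watt gain is only about half the raw PassMark gain.

```python
# Raw PassMark gain vs. performance-per-watt gain, using the figures quoted above.
gtx_980  = {"score": 11110, "tdp_w": 165}   # 2014
rtx_5080 = {"score": 37287, "tdp_w": 360}   # 2025

raw_gain = rtx_5080["score"] / gtx_980["score"]
ppw_gain = (rtx_5080["score"] / rtx_5080["tdp_w"]) / (gtx_980["score"] / gtx_980["tdp_w"])

print(f"raw speedup:        {raw_gain:.2f}x")   # ~3.36x
print(f"perf-per-watt gain: {ppw_gain:.2f}x")   # ~1.54x once power is factored in
```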
Comparing models of that age difference has caveats as they will be 3x in some aspects and 10x in others but your point stands that it is nowhere near 16x or 32x. AI hypers and Nvidia fanboys will just keep lying for free until the end of time though.
In raster gaming it's more like a 6x-plus performance difference. That's not even talking about RT performance, which can leverage something like OptiX for path-traced 3D renders etc. with a much better boost than 6x. And lastly there's tensor-core AI performance, which is another world entirely and not comparable. Games can also leverage tensor cores now, so the performance difference gets multiplied by double-digit numbers.
To add to the AMD side, AMD on Windows also works pretty well. Never really ran into torch-directml issues, and ollama itself runs nicely. The XTX is such an underrated card.
Now, considering the enormous HW-accelerated AI power, transformer-model DLSS features, RT performance, CUDA, OptiX and much more that Nvidia cards have, the 7900 XTX is much worse value compared to the RTX 4080. Basically it's outdated already.
Tariffs are only going to inflate the costs of GPUs. It's about to get a lot worse.
which will just crash this ridiculous hype train. ppl need to get grounded.
Hey @prime, also, I hear you saying you want to not only run the models (inference) but also train models. Training models requires more VRAM than running them for inference. If a 16B model takes 24GB, then for training you'd need about 100 GB of VRAM. This is because in training you also need to store the gradients for backpropagation (plus optimizer state); see the rough sketch below.
Mostly an FYI in case you didn't know.
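A rough sketch of where the extra training memory goes (gradients plus optimizer state on top of the weights). The 16-bytes-per-parameter total is the common rule of thumb for full fine-tuning with Adam in mixed precision; lighter setups (8-bit optimizers, LoRA) land far lower, which is roughly where estimates like the 100 GB above come from.

```python
# Back-of-envelope memory for full fine-tuning with Adam in mixed precision.
# Per-parameter byte counts are rough rules of thumb, not exact figures.
def full_training_vram_gb(params_billion: float) -> float:
    n = params_billion * 1e9
    weights   = 2 * n    # bf16/fp16 weights
    grads     = 2 * n    # bf16/fp16 gradients
    optimizer = 12 * n   # Adam: fp32 master weights + two fp32 moment buffers
    return (weights + grads + optimizer) / 1e9   # GB, activations not included

print(f"16B model, fp16 inference:     ~{16 * 2:.0f} GB")
print(f"16B model, full Adam training: ~{full_training_vram_gb(16):.0f} GB + activations")
```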
If I recall, the reviewers get a review unit which they have to send back afterwards. While it is true that some of the big YouTubers do get GPUs like that for free (see the video game industry), it's mainly down to network connections, getting into the big club, and years of toeing the line.
The market needs a 48gb $5k Titan to relieve some of the datacenter market pressure off the 5090
That doesn't benefit Nvidia. They can overprice both Datacenters and Prosumers with their current strategy.
The 3090 used Samsung's custom 8nm (8N) process for its GA102 die, packing 28 billion transistors. While powerful, this node was less efficient than TSMC's alternatives, leading to higher power draw and thermal output. What are you even talking about, bro?
The root of the problem seems to be TSMC. It's not like this shortage problem only appeared just now; this has been going on for years, and TSMC seems either unable or unwilling to scale up production. At these premium prices you would expect competition to prosper, but we are going in the opposite direction. I still don't understand how billions of USD can't reproduce whatever they are doing there.
Because money is not enough and you need EXTREMELY competent people to do it right.
For hardware config maybe a collab with Wendell from L1techs would be cool.
Yes, fully agree
I kinda wanna see wendell and casey nerding out on a call
Sad they never could get SLI to work properly, unlike the professional cards: with NVLink you're just stacking more, 6 GPUs working together as one big GPU.
Yeah, honestly you would think this would be built into the OS (or standard drivers I guess) by now, and SLI would have been a temporary bandaid.
They could; it worked, but the method they used put it on the developer to integrate. They have more than enough staff and money to make a functional version now, but they didn't like how you could get 2 cheaper cards and beat the flagship for less. They're 100% going to move to chiplets anyway, making any form of newer SLI kind of pointless now.
@@taylor-worthington that's what DX12 tried to do, but nobody cared.
I've tried running DeepSeek with ollama on an RTX 6000 Ada. The 32B param model takes about 20+GB of VRAM on the GPU, so it should fit on a 3090/4090. The 70B model takes around 43GB, and although it fits on my GPU it's quite slow, really depending on the question. Don't ask "loaded" questions. I haven't tried to optimise the models, so those are out of the box as-is. I'd say the 5090 will be much more future-proof, but it might still be limited by its memory. Obviously, if you use unified memory to let the model spill over into RAM, performance will suffer like in a swapping scenario. Hope that helps someone. TBH I was impressed by DeepSeek at first but am now kind of disillusioned. I've gotten some better answers from ChatGPT and Claude on some C++ libav programming. But maybe the model is not trained much on that.
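Those numbers line up with a simple estimate: parameter count times bytes per weight, plus some headroom for the KV cache and runtime. The 1.2x overhead factor here is an assumption, not a measured figure.

```python
# Rough VRAM estimate for quantized inference. The overhead factor covering
# KV cache and runtime allocations is an assumed fudge factor.
def inference_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    return params_billion * (bits_per_weight / 8) * overhead

for params in (32, 70):
    print(f"{params}B @ 4-bit: ~{inference_vram_gb(params, 4):.0f} GB")
# 32B lands around 19 GB and 70B around 42 GB, close to the figures above.
```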
Well, it's called 1.58-bit quantization because the model is rounded to ternary weights instead of FP32, FP16, or whatever, and the new weights only have {-1, 0, 1} elements. This reduces the matrix-multiplication complexity in LLMs to roughly binary-operation level. The 1.58 bits comes from 2^1.58 ≈ 3, where 3 is the size of the ternary weight set. Prime, instead of considering many Apples you can also parallelize many NVIDIA 3090s; it's really hard to get a 4090, or just wait and try to buy them over time. You can also mix different NVIDIA cards as long as everything runs on CUDA with the auxiliary PyTorch libraries.
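A minimal sketch of the ternary rounding being described, in the spirit of BitNet-style 1.58-bit schemes (per-tensor absmean scale, weights clipped to {-1, 0, +1}); this is an illustration, not the exact recipe from any particular paper's code.

```python
import torch

def ternary_quantize(w: torch.Tensor):
    """Round a weight tensor to {-1, 0, +1} with one scale per tensor
    (absmean-style, as in 1.58-bit / BitNet-like schemes)."""
    scale = w.abs().mean().clamp(min=1e-8)   # single floating-point scale
    w_q = (w / scale).round().clamp(-1, 1)   # ternary weights
    return w_q, scale

w = torch.randn(4, 8)
w_q, s = ternary_quantize(w)
print(torch.unique(w_q))            # only -1, 0, 1 remain
print((w - w_q * s).abs().mean())   # rounding error after dequantization
```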
The 'racket' is the pricing for less performance, not necessarily the current limitations of the tech. I recently learned that the lower-tier chips are actually just the more defective chips. It is nuts.
What's nuts about it? You have a factory that spits out products. Some of them have more defects and some have fewer. Are you suggesting only keeping the perfect few and throwing the rest into landfill?
The only things you should look at are:
VRAM (nothing else matters if you cannot fit the model), and
tensor core precision support. You really want BF16, since you keep the exponent size of FP32 at half the cost; Ampere and newer support this. Working with lower precision is annoying if you want to do it yourself: you have to do a lot of work to maintain stability and accuracy. (A quick way to check both from PyTorch is sketched below.)
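Both checks are easy to script; a small sketch of how you'd query them from PyTorch on an NVIDIA card.

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
    # Ampere (compute capability 8.x) and newer expose native BF16 support.
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device visible.")
```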
VRAM is most important, and the motherboard and CPU need to have enough PCIe lane support. A lot of motherboards only run one slot at full PCIe width while the remaining slots are nerfed.
I literally started programming Fortran because of you guys' rant lol. I love it whenever you two start cookin' on stream.
I hope you limit every string to 100 chars ;)
With an ever increasing amount of transformers, fortran might be back in business!
@@Henrik_Holst yep, the plot twist is that "benchmark" was done using LLM slop and the Fortran code seemed faster because it limited the char size to 100. Lol😂
Yep. It isn't that they're selling all their Lemonade to one customer because it's easier for them no. It's actually WAY WAY WORSE.
They're selling us what they call "Lemonade" and maybe at some point in the past (GTX1080) It was still mostly Lemonade, but now they've been slipping in so much of this synthetic oil to the product because they want to mainly sell it as lubricant to their giant Corporate Oligarchs at a lower cost and get all the gamers to pay for it still. That's what's essentially going on here.
Poisoning us, while charging larger premiums, so giant Corporations can profit even MORE from our labour.
Welcome to late stage Capitalism, baby. It's only going to get worse from here.
They haven't been making graphics chips in quite some time. They're only pretending to.
Until 2 years ago I worked in HPC, and I can tell you that there are 2 classes of cards in the "enterprisy" category. There are the RTX A (Ampere/3000) series, which replaced the Quadro cards, are built around being put into workstations, have their own fans/coolers and are more consumer friendly; and then stuff like the A10/A40/A400 class passively cooled cards that go into server chassis. At the base they are pretty much the same thing as the consumer cards of a similar class, but with double the VRAM, same or better TDPs, and higher VRAM bandwidth. They perform almost identically to the consumer card. The A40 or RTX 6000 is within margin of error of the 3090 for this use case, with the difference that the 3090 uses a lot more power.
I want to explore using a Project Digits unit instead of setting up old server hardware loaded with GPUs.
@ThePrimeTime Going for the beefiest single gpu you can get is probably the most satisfying setup right now, especially compared to using just 2 or 3 gpus in total (and not many more).
Data transfer rate between the cards during inference puts a pretty hard cap on tokens/sec when the model is spread on multiple gpus, with the main benefit being you can run bigger models without parts of the model going into ram.
If you can fit a model entirely in one GPU's VRAM, then you can really see them fly on modern GPUs.
Strix Halo could also be a good setup, seeing how it can address quite a lot of memory
I made 2 A.I. builds in 2022. All server parts and welding and 3d printing and fans.
This is disheartening: I want a 5090, but in this situation it's not a smart investment.
Public cloud gpu instances perhaps?
So not only are people punching air over AI, they’re punching air over pricing and availability - despite the fact this overlap doesn’t even care (so they claim anyway).
Mac mini route should be fine for your use cases. You probably only need RAG for the doc search / coding anyway. Finetuning without a sufficient dataset often only hurts performance.
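For the doc-search case, RAG really can be this small. A minimal sketch follows; the embedding model name and the toy "docs" are assumptions, not a recommendation.

```python
# Minimal sketch of "RAG instead of finetuning" for doc search: embed the
# docs once, embed the question, take the nearest chunks, paste into prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "ollama run <model> starts an interactive session with a local model.",
    "Open WebUI talks to Ollama over its HTTP API on port 11434.",
    "BF16 keeps the FP32 exponent range at half the storage cost.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small, CPU-friendly
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def top_k(question: str, k: int = 2):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                            # cosine sim (normalized)
    return [docs[i] for i in np.argsort(-scores)[:k]]

context = "\n".join(top_k("How does the web UI reach the model?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
print(prompt)
```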
We need competition, from producing chips to GPU/cpu
1:30 NVIDIA calls this Speed of Light, it’s a company value
Prime, if you read this, what I recommend is to get a cheap X99 motherboard + CPU, a minimum of 128 GB RAM, and a 3090.
Set up Linux + CUDA, run podman/LXD, and set up Ollama + Open WebUI.
You’ll be able to do pretty much exactly what you want without finetuning. Or if you want to experiment with finetuning you can do that too.
I’d be happy to walk you through my setup and help you get up and running.
I spent about 3800 USD on my ai rig.
This is what I did, and at my current level of learning it's more than enough. Training small models isn't bad at all, inference is very usable at FP4 and FP8, and training on billions of parameters will be painful, if I were to guess.
I’d expect the GPUs to go to 3D layouts. Since they are embarrassingly parallel, it’s the perfect case for 3D layouts.
Not exactly, because that reduces thermal transfer so much. What AMD is doing is putting their 3D V-cache below the CPUs, and that could be done for the GPUs too. But right now, if you stacked cores vertically, they would cook themselves.
I am at least 500% more likely to click a prime video when I see Casey. Prime you're great, but Casey is a GOD.
There is a product called SCALE (a compiler toolkit) that is library compatible with CUDA. It creates ROCm (AMD GPU) linked binaries with almost no source code changes to a project that would normally use CUDA directly. So instead of requiring these incredibly scarce Blackwell chips, you can buy twice the Navi 31 cards (with 24GB VRAM) and end up in basically the same ballpark. GDDR6 vs. GDDR7 will be a slight performance downgrade, but the price per unit of compute is WAY lower. As for PCIe lanes, go with an EPYC server board with hundreds of lanes (vs. 24 usable for a regular AM5 CPU), so you can put a bunch of GPUs in one box.
Putting a tariff on TSMC is absurd when the US doesn't have a competing product. The quantity of imported GPUs will basically stay the same, since big tech can't get enough of them, and the price will rise because of the tariff; but TSMC isn't paying for that, the US companies will.
You know what we are not pushing enough? GAME OPTIMIZATION! Its a joke...
Bro use your status, nobody here will be mad at you. You earned it. You're not doing anything nefarious with it and neither are you part of the GPU shortage problem because you get a single one.
I would suggest to use a cloud solution where you rent a GPU (cluster), do your work and end it (to save costs). And for local development only use your current GPU (or update that to the best available). You don't have to process a big AI/LLM while streaming. Just use a small one.
Or use a full spec macbook, where the RAM is also usable by the GPU because of their chip design.
It is time chip manufacturing went open source as well. Chips have become ubiquitous, and putting a brake on technology because of a 'mine' mentality had a nice run but is no longer going to cut it for the future. ASML can lead the way, or go bust when others go fully open source on their chip-making tech, so much so that we end up in an era of at-home chip fabrication akin to a 3D printer anyone can have at home.
I have seen a YT channel where a guy created a chip in his garage. The basic principle is not that difficult; only the tiny scale of a commercial chip makes it so hard. You cannot be an atom off or you have a failing chip, and that precision is not possible for DIY.
do it then, the knowledge is out there, get a degree if needed, and opensource your findings/process, you can be the pioneer
I bought a 3090 some time ago for $1500. Seemed like it was way over priced at the time, but i wanted to do AI dev. Works well enough I don't feel need to upgrade.
Somebody commented "here comes a 10 minute answer" when Casey started his explanation of CPU sockets. Must have been a zoomer. Where's the respect?
You may be very interested in the tests done on AMD GPUs with the DeepSeek models. The 7900 XTX outperforms the 4090 on the 14B distilled R1 and all smaller distills, and barely loses on the 32B.
I'd honestly go for the AMD 7900 XTX if I'm just trying to run smaller models.
If I'm going for DeepSeek R1 671B, the cheapest way is somewhere between a Mac Studio and some retired server parts with a huge amount of RAM.
Gpus are too expensive and hard to get rn
There is a market open for anyone that just figures out how to min max this decision making. Instead of selling via specs of the gpu, motherboard, etc. just sell based on the model size you can run on a rig and the tokens per second.
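A back-of-the-envelope version of that sizing, treating batch-1 decode as memory-bandwidth bound; every number below is a rough assumption, not a benchmark.

```python
# Rule of thumb only: at batch size 1, decode speed is roughly
# memory_bandwidth / bytes_of_weights_read_per_token, and VRAM needed is
# roughly weights plus some overhead for KV cache and activations.
def vram_needed_gb(params_b: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Weights + ~20% for KV cache/activations (rough)."""
    return params_b * bytes_per_param * overhead

def tokens_per_sec(params_b: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / (params_b * bytes_per_param)

# (card, memory bandwidth in GB/s, VRAM in GB) -- public spec-sheet values
for name, bw, vram in [("3090", 936, 24), ("4090", 1008, 24), ("5090", 1792, 32)]:
    need = vram_needed_gb(32, 0.5)          # 32B model at 4-bit ≈ 0.5 bytes/param
    fits = "fits" if need <= vram else "does NOT fit"
    print(f"{name}: 32B@4bit needs ~{need:.0f} GB ({fits}), "
          f"~{tokens_per_sec(32, 0.5, bw):.0f} tok/s upper bound")
```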
This, good catch
Hey Editor, is it possible to add the date the clip was recorded in one of the corners at the beginning of the video?
De Beers withheld diamonds in the 1800s. It's no longer the case that there is manufactured scarcity, FYI
They won’t manufacture the 3090 because of the Sinclair lesson: don’t compete with your own product or you will end up with massive inventory you cannot move
My hot take / understanding (please correct me if I'm wrong): the fact that you can do 4-bit and below (there are even 2-bit quantizations!!!) suggests that current LLM architectures are oversized in parameters for the capability they deliver. I think that if the neurons were "saturated", nearly any further quantization would significantly degrade the model's output.
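A toy way to make "4-bit" and "2-bit" concrete: naive symmetric rounding of a random weight matrix, and how the reconstruction error grows as the bits drop. Real quantizers like GPTQ/AWQ are much smarter than this; it's only an illustration.

```python
# Naive symmetric k-bit quantization with per-group scales, to show what
# dropping bits does to the weights themselves (not to model quality).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 128)).astype(np.float32)

def quantize_dequantize(w, bits, group=64):
    qmax = 2 ** (bits - 1) - 1                       # symmetric signed range
    wg = w.reshape(-1, group)
    scale = np.abs(wg).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(wg / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)

for bits in (8, 4, 3, 2):
    err = np.linalg.norm(w - quantize_dequantize(w, bits)) / np.linalg.norm(w)
    print(f"{bits}-bit: relative weight error ≈ {err:.3f}")
```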
Things are expensive when we don't have options, and Intel, NVIDIA, and AMD are all taking advantage of that. If we had multiple options for CPUs and GPUs, these tin cans would be underpriced, not overpriced, and we'd also see major innovation every year.
The definition of a monopoly. Government has to step in and divide NVIDIA up.
no
@@LtdJorge yes. But the correct answer is they will never do that
Not really. There are alternatives but they aren’t as good. You’re absolutely allowed to do that.
What they step in for would be anti-competitive behavior. Hard to say if they meet that bar.
The problem is partially manufacturing capacity. There's no way TSMC can accommodate demand at this point, let alone allow for a competitive market. The other problem is that companies are using traditional gpu compute rather than ASICs. nVidia GPU prices will drop like a rock once some company figures out how to build a competitive AI-focused chip, cut costs from not needing 3D graphics support, cut costs by keeping traditional compute hardware external, and transpile CUDA (at least in some capacity) for adoption. It must be a very hard problem as this has been a needed area for about 15 years when scientific computing needed cheaper, more scalable alternatives to supercomputing clusters with thousands of traditional Intel/AMD CPU cores.
Never been happier I forked out MSRP for a 4090 over a year+ ago when I found one in stock. Big OOF
same, i thought the supply issue was over. guess not.
Who's the other guy though, no mention in the description. Ouch!
Which stream is this one? Which date? I would like to watch the whole vod
Why is it so hard to create a graphics card competitor? If one factory/process was built, why can't we build 2 or 100?
And that's why deepseek was such a shake up. Proved that good AI models don't need cuda. And if you don't need cuda, it's *much cheaper* to run AI.
It’s going to get worse, they recently had a 6.4 magnitude earthquake. It will take them a bit to recalibrate the machines in the fab.
I'd just wait for the digits platform personally.
good point, but i bet they are gonna be hard to get as well.
It's possible to run PyTorch code on Apple Metal API and I believe AMD ROCm as well. You just need to set PyTorch device to 'mps' for Apple instead of 'cuda'.
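In practice it's literally a device switch; here's a small sketch of what that looks like (the ROCm build of PyTorch also shows up as 'cuda').

```python
# Same PyTorch code targets Apple Metal (mps), NVIDIA/AMD (ROCm builds also
# expose the "cuda" device), or falls back to CPU.
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")       # Apple Silicon via Metal
elif torch.cuda.is_available():
    device = torch.device("cuda")      # NVIDIA, or AMD through the ROCm build
else:
    device = torch.device("cpu")

x = torch.randn(1024, 1024, device=device)
print(device, (x @ x).mean().item())
```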
So where is the market for used data centre chips?
3090s are going for $1600!? Geez, almost makes me want to sell my dual 4090s
Or half of them...
asianometry would absolutely blow your mind with the depth of the process
@ThePrimeTime the 4090 is available right now in Amazon Spain for example, for just under 3000€.
Could probably use lambda labs (or similar) and figure out a way to easily spin an instance up/down (Terraform / OpenTofu?). Might be more interesting for watchers, too, since it's hard to drop $2-3k on a machine when just starting to experiment.
oml Prime, I am doing that now with docs on my MacBook (64 gigs of RAM), using distilled models for now. You can just start exploring creating agents that add to a RAG for that. It's still fun to build out a crawler/agent that ingests, summarizes, and then adds it to a RAG, going recursively through the site/docs
The best thing you can hope for in terms of socket, if you don't go for a Threadripper, is 8 lanes to each of two GPUs on a high-end motherboard. The screw-up would be a motherboard where the second slot goes through the chipset and gets 4 slow lanes. Another issue is whether the second GPU physically fits in the case/motherboard combo, and whether the support bracket for the GPU's weight gets blocked. I stopped using my desktop with two users at the same time because my primary GPU couldn't handle its own weight well without a support bracket, and the secondary GPU slot blocked using that bracket. A horizontal case instead of a tower could have solved that.
The trouble I had physically fitting a 1650 next to a 2070 Super model that uses the 2080 Ti cooler makes me think you'd have serious problems if you don't think through how to fit two 3090s, 4090s, or 5090s in your system.
Ideal would probably be a Threadripper with a horizontal case, but it would still be a tight fit to get multiple cards inside one PC.
Did you watch Digital Spaceports videos? Ask the twitter guy you interviewed how to obtain the GPUs ... He had one in that video. Also, how many GPUs is Nvidia going to sell due to DeepSeek R1?
It's gonna sell more, because now every company that can do a 30-200k investment into a local AI assistant will. You no longer have to worry about your trade secrets leaking.
You can rent an H200 (140GB VRAM, 256GB RAM) for ~$3/hr, if you know where to look. Unless you're gonna run them 24/7, there is no good reason to buy any of the RTX cards mentioned here.
People don’t understand economies of scale and continue to yap about “1080ti” price to performance or older gen as if production, engineering, and R&D cost scales linearly. Nearly everything is described by a hyperbolic function, not a linear function.
Doesn't take much R&D to add more VRAM for almost no extra cost.
If you are trying to train models, renting 8xA100 or 8xA6000 is pretty cheap. Then you can just turn them off when you aren’t training anymore. You will end up spending less money almost guaranteed.
"What happens if you start selling GPUs in capitalism? For a long time: nothing. Then GPUs are getting scarce." SCNR.
We are slamming into the limits of Moore's Law with CPUs and GPUs, in my opinion.
You need to look into problems with “specialisation” of a model (fine tuning) for a domain.
Catastrophic forgetting hasn’t been solved.
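Agreed that it isn't solved. The usual mitigation, rather than a fix, is to keep the base weights frozen and only train small LoRA adapters, so you can always unload them and get the original behaviour back. A hedged sketch with the Hugging Face peft library; the model id and target module names are placeholders.

```python
# Not a fix for catastrophic forgetting, just the common mitigation when
# "specialising" a model for a domain: freeze the base, train LoRA adapters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder id

lora = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # assumption: Llama-style attention names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # typically well under 1% of the base weights
# ...train on the domain data, then merge the adapter or keep it separate.
```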
It used to be crypto. Now it's AI. What will we spin GPU cycles on next that is worthless?