Training hardware and electricity are far more expensive than what's needed for inference after the model has been trained. We will eventually hit a point of diminishing returns, especially as systems ingest more synthetic data, which leads toward model collapse. But once the training is done, what can we do with these machines? ASICs, by nature, can only perform one type of operation. Do they become the ultimate form of e-waste?
I was at a conference yesterday where one of the panelists said he didn't believe in the self-ingestion collapse. His argument was that the content that gets propagated is curated: a person picked the best outputs, or the ones that had a quality they liked.
Now I personally think that is a way too optimistic view. Loads of unfiltered trash is getting uploaded constantly. But I do think it's a good enough argument to make me question whether models will ever collapse completely. It feels like they'll just enshittify, but frankly... so does everything else...
There were also the metrics shared by an environmental researcher (same conference, different panelist): about 3M GPU hours to train an LLM, and then, depending on the size of the resulting model, 200 to 600M queries to match that training energy use. You can napkin-math what that means based on where any research lab or datacenter running these tasks is located.
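Rough napkin math on those numbers (the 3M GPU hours and 200-600M queries are as quoted above; the per-GPU power draw is my own assumption):

```python
# Napkin math for the figures quoted above. The GPU power draw is an
# assumption for illustration, not a measured value.
TRAIN_GPU_HOURS = 3_000_000   # quoted: ~3M GPU hours for one training run
GPU_POWER_KW = 0.7            # assumption: ~700 W per accelerator, all-in

train_energy_kwh = TRAIN_GPU_HOURS * GPU_POWER_KW
print(f"Training energy: ~{train_energy_kwh / 1e6:.1f} GWh")

# Working backwards: if 200M-600M queries match the training energy,
# each query lands somewhere around these figures.
for queries in (200_000_000, 600_000_000):
    per_query_wh = train_energy_kwh * 1000 / queries
    print(f"{queries / 1e6:.0f}M queries -> ~{per_query_wh:.1f} Wh per query")
```

Under those assumptions you get a couple of GWh per training run and a few watt-hours per query, which is at least in the ballpark of the figures the panelist quoted.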
They will use it to train larger models.
I don't think Meta throws away the hardware they used for Llama 2.
IMO AI research alone can keep those ASICs busy for their lifespan.
@@SquintyGears Hi guys. I'm a noob, so pardon my stupidity. Why don't we try to use ML models to discover/establish a theory of how these things actually work? Say, why use LLMs just to parrot language rather than to figure out a theory of it? Wouldn't that be a worthwhile investment, since it would not only reduce the computation needed but also end the need for further training? Just as you don't need to train your rocket on Newton's laws.
@@aniksamiurrahman6365 Mathematically speaking, what LLMs do is probabilistic guessing. A guess about how things work is also basically how humans came to conclusions about gravity and everything else.
But we aren't able to make the computer run proper tests. In models that have tried this, the computer ended up editing the limits of the test rather than working harder to reach the desired output values.
And because of a number of different mathematical proofs, you are far more likely to fall into a "good enough" final result than onto the actual ground truth.
That's why if your goal is finding out the underlying theory, machine learning will never find it. If you're trying to get something that will help engineers build something with tight tolerances based on test data they collected, it will almost always find a solution that fits the desired precision.
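To make that last point concrete, here's a toy sketch (the exponential "true law" and the cubic surrogate are entirely made up for illustration): the fitted model meets the engineering tolerance inside the tested range while telling you nothing about the underlying theory.

```python
import numpy as np

# Toy example: the "ground truth" is an exponential law, but a cubic fit
# matches the collected test data within a tight tolerance anyway.
rng = np.random.default_rng(0)
x = np.linspace(0, 2, 30)
y_true = np.exp(x)                              # the underlying "theory"
y_meas = y_true + rng.normal(0, 0.01, x.size)   # measured data with noise

coeffs = np.polyfit(x, y_meas, deg=3)           # the "machine-learned" surrogate
fit = np.polyval(coeffs, x)
print("max error inside the tested range:", np.max(np.abs(fit - y_true)))

# Step outside the data the model was fit on: the surrogate is useless for
# recovering the true law, even though it met the tolerance where it was tested.
x_out = 5.0
print("true value at x=5:", np.exp(x_out))
print("surrogate at x=5 :", np.polyval(coeffs, x_out))
```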
28 minutes, this will be a good listen.
I have known a handful of organizations that have started to move their customer service away from call centres towards AI chatbots in various cloud providers. It seems to be more than 10x the price of human labour. I am aware of one company that got a surprise bill one month to the effect of 20 million USD.
Most, most companies do not need giant LLMs. Most companies want a fine-tuned small or medium-sized model (something < ~8B parameters) for their specific needs. You can run a small or medium model on an A5500, and that can service many concurrent users.
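As a rough sketch of what that can look like in practice (the model name is a hypothetical placeholder, and this assumes the Hugging Face transformers library on a single A5500-class GPU):

```python
# Minimal sketch: serving a small fine-tuned model locally.
# The model name below is a placeholder; swap in whatever <8B fine-tune you use.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="your-org/your-finetuned-7b-model",  # hypothetical fine-tune
    device_map="auto",                         # fits on a single workstation GPU
)

def answer(prompt: str) -> str:
    out = generator(prompt, max_new_tokens=200, do_sample=False)
    return out[0]["generated_text"]

print(answer("Summarise our refund policy for a customer."))
```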
As mentioned, this changes when you move to agentic workflows. A dozen or more specialised 8B models add up.
@@TechTechPotato but LLMs talking to LLMs instead of well-defined APIs is just stupid, error-prone, and crazy inefficient
To me this sounds like the equivalent of saying this in the '80s/'90s:
"Most, most companies do not need powerful desktop or portable computers. A few low-powered desktops with a single moderate mainframe will cover their specific needs."
Once models outperform humans on any thinking-related task, I believe we'll continue to exponentially build out compute infrastructure. Even a several percent increase in model intelligence will continuously be worth chasing with billions of dollars.
Tommi, imagine three specialised 8B models working end to end to replace one 405B model.
@@TechTechPotato just 24B strong, but stronger than the 405B model; how could that work?
I believe that in-memory compute + analog matrix multipliers have way more potential than trying to scale the current architectures. There are several orders of magnitude of efficiency and density improvements possible in there. Imagine a chip that combines flash cells for the weights, DRAM cells for the inputs and activations (and the gradients, for training), and that does an analog matrix multiplication followed by an ADC and a more general-purpose, GPU-like block to do things like activation functions. The internal bandwidth between those components could be crazy high, and even at moderate clocks it could be really fast. You wouldn't need to fit a whole model per chip, but fitting a whole layer would be great, as the bandwidth needed to communicate between layers is much smaller.
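As a toy numerical sketch of where the ADC sits in that pipeline (the layer size and ADC bit-width are arbitrary assumptions on my part), you can simulate the analog MAC at full precision and quantise only the readout:

```python
import numpy as np

# Toy model of an analog in-memory matmul: weights sit in "flash", activations
# in "DRAM", the multiply-accumulate happens in the analog domain, and only the
# result is digitised by an ADC. Sizes and bit-widths are illustrative only.
rng = np.random.default_rng(1)
W = rng.normal(0, 1, (256, 256))   # one layer's weights
x = rng.normal(0, 1, 256)          # input activations

analog = W @ x                     # ideal analog MAC result

def adc(v, bits=8):
    """Quantise to a signed ADC of the given resolution over the signal range."""
    full_scale = np.max(np.abs(v))
    levels = 2 ** (bits - 1) - 1
    return np.round(v / full_scale * levels) / levels * full_scale

digital = adc(analog, bits=8)
rel_err = np.linalg.norm(digital - analog) / np.linalg.norm(analog)
print(f"relative error from an 8-bit ADC readout: {rel_err:.4f}")
```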
As Bill Dally has explained many times, analog multiplication fails whenever the data has to eventually be converted back into digital form. It's one of those things that sounds like a good idea but can't be done in practice. Apparently Nvidia has tried this many times and the hurdle of AD conversion can't be overcome.
This isn't possible. Even putting DAC and ADC problems aside, you're suggesting building a machine to a specific neural network topology spec. It's inherently a single-purpose machine at that point.
Bandwidth doesn't matter if you need to scrap the whole hardware every time you update the model.
And if the chip dynamically scales its input and output space and loads in weights per query, it's not faster than a GPU.
@@SquintyGears Kinda, but not really. It only assumes that there is a matrix multiplication and a maximum number of weights and activations per layer (and even that can be worked around). Those assumptions are true of most neural networks.
As for the specific topology, you would need at least as many chips as you have layers. If the topology can be static it is easy, but if you want to be able to change it, some kind of bus or routing between the chips is needed. It could look like a small FPGA, or even a CPLD, or several of them in a mesh at larger scale (the same thing can be built inside the chip, affording more flexibility, maybe allowing chips to process more than one layer if the layers are small).
This would of course mean that the hardware is underutilized if the model is smaller than what the hardware can do, and if the model is larger, it won't work (at least with a naive implementation), but that already happens to some extent with GPUs.
Regarding the DAC and ADC issue, I am aware of it, but I believe that it can and will be overcome. If you do enough work without converting (like a whole matrix multiplication of a large matrix), the overhead doesn't seem too bad, especially in the context of neural networks, at least for inference where you don't need much precision.
I don't think that what I proposed here is any less flexible than Google's TPUs.
@@eruiluvatar236 It's more complicated than you seem to think to have partially occupied layers in a system like this. The many-to-many connectivity of neural networks means it will throw off the output measurements completely and probably even reflect signals into the dead ends. Grounding doesn't magically happen internally.
And you have to be aware how silly it sounds to say you would just line them up on a bus. Having a package per layer is like walking all the way back to the '80s: your memory card was a bunch of 10-pin packages side by side, and the whole paper-sheet-sized thing was 256 kB. No matter how you run the bus, it can't be faster. Even if you imagine that this system would have virtually no actual compute time, you've introduced so many sources of signalling latency that it'll always end up slower than the current paradigm. Load, wait, execute, wait, read output.
Replacing computation time with hyper-complex tracing and bus topologies is not likely to be a valid solution. It's not stupid to consider something like this at all. But when you take a close look at the complexity the industry faces implementing the existing bus standards and upgrading them each generation (DDR, PCIe), it should be clear what trade-off we're talking about here.
But how fast will the NPU be that Samsung or Micron can put in the DRAM modules?
The 45 TOPS NPU in Lunar Lake is almost as big as the 4 P-cores.
I'd prefer an NPU card with 8 DIMM slots and a large SRAM cache.
8x 125 GB would be enough for the Llama 405B model in native bfloat16.
A large SRAM cache is what the GeForce 40 series uses to compensate for its slower/narrower GDDR bus.
Soldered HBM or GDDR is much faster, but loading a large model needs several accelerators,
and the interconnect is much slower than DIMM slots.
For example, 1 DDR5 slot is roughly equivalent to PCIe 5.0 x16 lanes.
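Rough napkin math behind those last two claims (the DDR5 speed grade below is my assumption):

```python
# Napkin math for the capacity and bandwidth claims above.
params = 405e9
bytes_per_param = 2                  # bfloat16
model_gb = params * bytes_per_param / 1e9
print(f"Llama 405B in bf16: ~{model_gb:.0f} GB vs 8 x 125 GB = 1000 GB of DIMMs")

# One DDR5 DIMM slot vs one PCIe 5.0 x16 link (speed grade assumed).
ddr5_mt_s = 6400                     # assumption: DDR5-6400
dimm_gbps = ddr5_mt_s * 8 / 1000     # 64-bit-wide DIMM -> 8 bytes per transfer
pcie5_x16_gbps = 64                  # ~63-64 GB/s per direction
print(f"DDR5-6400 DIMM: ~{dimm_gbps:.0f} GB/s, PCIe 5.0 x16: ~{pcie5_x16_gbps} GB/s")
```

So the weights (~810 GB) do fit in eight big DIMMs, and a single DIMM slot is indeed in the same bandwidth ballpark as a PCIe 5.0 x16 link.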
I've seen an interesting trend in AI research if you're on the floor working on training models and so on. In 2017, all the money in the world would not have gotten you a modern AI model's performance. In 2018 and 2019 we got GPT-1 and GPT-2, which were kind of the first modern LLMs; it took a large research team and a lot of money (perhaps in the millions) to train them, and yet we look at those models and think of them as almost unusably small. And yet, we can replicate those models for under $100 today. This is due to better training dynamics, better hyperparameters, better GPUs, better frameworks, better data, and more.
Stable Diffusion certainly cost either tens or hundreds of millions to make originally, and nowadays one can produce an analogous (or even superior) model for anywhere between $100 and $6000, depending on your existing hardware and patience.
I don’t really think that we’ll ever see a complete drop off in AI, but we might see a stop in frontier-class (AGI lab) models. If a hobbyist can run around and replicate previous flagships but trained from scratch for specific use cases (as well as produce models in new modalities), then I don’t see why an enterprise can’t do a slightly larger model for their own internal use case.
We tend to forget that these are technologies and by extension, tools.
We should learn about them, see their pros and cons, and apply them when needed.
I mean, isn't the final goal to reach AGI so that you can replace employees and thus sell B2B? I assume the current B2C side of the business is just to make revenue while they're rushing to the final goal of AGI, after which B2C will not really matter anymore.
If you can run an agentic workflow with a SLM on an edge device, it might be more cost effective.
Yeah, if you can still get useful results based on lower quality inference.
Definitely agree, particularly as corpora and the resulting models specifically tuned to "take actions" and "do reasoning with rollback" become more common. A 15W TDP unified-memory ARM-based edge node or mobile phone can already do 40 TOPS on 7B-ish models, and that can do amazing things already - way more impressive than the "cloud tethered", latency-impacted things like the Rabbit R1.
Running software neural networks is insanely energy wasteful. There's no way we'll ever achieve affordable/practical AGI this way.
We have to find a practical way of storing the weights and biases in the hardware substrate, a type of neuromorphic processor design where energy is only used as required to perform each task. I know this brings its own challenges but I can't see much of a future for AI that relies solely on software.
People are designing and building specialized hardware, but most of them fail because the way the models work, and their demands and requirements, keeps changing too fast.
It'll be used massively if it reaches superhuman? Or even subhuman at increased work rates. Once you reach those, you have things that can simply do more than humans. Look at how many insane things humans scale crazily for, now imagine a 24/7 actor that's even better or simply faster. It becomes a requirement to use it all over the place?
If they keep on scaling like that, then you're talking about entirely new regimes where it's simply required. It has got to the point where Microsoft and Google are funding nuclear fission reactors... That's simply an insane requirement.
Do you think Google wouldn't value a developer that can output 5x the amount of high-quality code and costs $1 million/year, when the top developers are already hitting $400k/year?
People are really underestimating how much these costs can scale. The 5x-the-output thing is simple, but what about when that output is simply qualitatively better? Then you enter an entirely new regime of costs.
Can the US afford not to switch to model-based fighter pilots if they're qualitatively better? I mean, they immediately become qualitatively better on maximum g-force, and then you start building airframes to handle way more than humans can. These all have the potential for huge amounts of scaling on the cost side...
At those power levels (even 5x, let alone 20x) we're talking about rapidly accelerating global warming in terms of power demands. This AI revolution, at scale, could be analogous to the crypto boom, but on a far larger scale if the AI gold rush continues on into the future.
Musk recently was talking about a new datacenter for AI that runs on 12-14 ish diesel generators so it's not on the grid. This isn't sustainable.
But it's also funding huge levels of fission now, with entirely new economic regimes? Yeah, duh, Musk is going to suggest the dumbest thing. But realistically, chasing fission seems to be the better solution.
This also has the ability to help solve climate change in multiple ways that weren't thinkable before. We can't just look at the potential energy scaling and go on that alone? If we get the ability to build fission reactors rapidly, and at a cost that works with modern power economics, then that's an unimaginably large win for climate change?
Optical connections don't automatically win on latency just because light is fast. In a fiber optic link, for instance, the light bounces around inside the fiber so much that the latency is actually comparable to a copper link.
Good overview of the technical and economical challenges. It makes a lot of sense to solve them together.
I am interested to see how far the infrastructure gets pushed before the industry right sizes. It would be funny to me if all this money gets spent to build out inference infrastructure in the cloud using data centers and afterwards inference compute gets done on local machines anyway.
Super insightful Ian!
Wouldn't optical lines be less prone to interference than copper traces? Will we see these in consumer MBs in a few -> 10 years?
Consumer motherboards are driven by cost more than any other factor. There really isn't anything in a desktop PC that needs optics, it doesn't solve any real problems and if it increases cost then I don't see a use for it in that application.
@@noname-gp6hk So, there are absolutely no signal issues with memory and/or PCIe lane traces? None whatsoever?
@@aldarrin Yeah, but I think his point is more that they'll scale down the capabilities and segment these features into HEDT instead of making them mainstream. And "you'll be happy with the 5% improvement we give you"...
It definitely makes me wonder why so many are investing so much $$$ in the hardware right now. But I guess they don't want to fall behind in market share.
As written on the Meta Hugging Face page,
creating the midrange-level Llama 70B model would take 100 years on a single Nvidia DGX server,
and the ChatGPT-equivalent Llama 405B model would take more than 400 years.
That is why they bought so many DGXs.
But AMD etc. are catching up,
and OpenAI etc. don't want to be locked into Nvidia either.
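Napkin math that roughly reproduces those figures, using the GPU-hour numbers Meta published for Llama 3 and assuming 8 GPUs per DGX:

```python
# Rough check of the "100 / 400 years on one DGX" claim, using Meta's published
# GPU-hour figures for Llama 3 (8 H100-class GPUs per DGX assumed).
gpus_per_dgx = 8
hours_per_year = 24 * 365

for name, gpu_hours in {"Llama 3 70B": 6.4e6, "Llama 3.1 405B": 30.8e6}.items():
    years = gpu_hours / gpus_per_dgx / hours_per_year
    print(f"{name}: ~{years:.0f} years on a single DGX")
```

That comes out to roughly 90 years for the 70B and ~440 years for the 405B, which matches the numbers quoted above.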
Optics will enable scaling, but if I had to guess, playing it smart on data locality can bring more cost- and energy-effective solutions. Huge models may bring some benefits and make the technology thrive, but we are already running into diminishing returns without even covering the costs.
I like that we are talking end to end here. I don’t think the end to end math works or is close.
Too many field of dreams use cases at the user level assuming this arms race of AI spending is the new normal.
I remember Intel has had Silicon Photonics for years now; I never heard about it being used in any products though. Wonder why?
@notDacian Most things related to silicon photonics that I've heard about are related to research in chip design. I agree that I've heard nothing about such devices being used commercially.
They mothballed their programmable Tofino switch business, which was meant to be their lead CPO platform. Now it's a bit of a mix - they're showing stuff, but it's the same as a few months ago? I can't get a good read - even former employees of that division are baffled.
Sounds as if they're gagging on the tech, whether to continue or just throw it out. Def not my kink though
It can be argued that now that manufacturers, mainly Nvidia, have had a taste of real business customers, they have forgotten their legacy value proposition. It's been a few generations of GPUs now, from Nvidia but also in general, where customers are only receiving minimal gains, or performance merely equal to previous flagships. That doesn't go unnoticed.
Everyone in gaming is still talking, completely unjokingly, about the lack of any price-to-performance generic graphics cards, because nothing offers previous-flagship performance at a reasonable price. They are paying extremely close attention to these slim offerings in benchmarking - there is no value proposition as there used to be. Even with mid-year refreshes to existing models there is no value proposition. There is mounting worry about these companies staying relevant across computing, including gaming, where the audience is ever-increasing and yet a tonne of traction has already been lost.
Yes, memory is way too slow and too expensive for current needs. Memory is in demand, they claim, while having no valid excuse for how overdue their timeline is for a breakthrough or a major expansion. If these minor gains take so long, they ought to be expanding and offering previous-flagship memory quantities at a cheaper price.
Completely seriously, the memory industry's timeline is so slow and disrespectful of current trends (16:32 holy sh**) that it is part of what makes it more vulnerable to natural disasters. We have seen the effects of natural disasters on memory manufacturing in the past, and if you compare this against how other sectors handle it, it does not take much research to show that sectors of computer hardware outside of memory are more capable of recovering or switching gears when necessary. Memory has typically not shown this. Its slow pace of progress could even indicate that we are subsidizing its recovery right now without knowing it.
I can get all this power from a $20 LLM subscription, but why doesn't that kind of value scale back to hardware customers? It makes no sense, until you start digging into what the heck is going on with memory.
when are we getting optical DDR traces in the motherboard?
InfiniBand? It has supported a scaled optical distributed compute / shared memory model for some time?
As to the value chain - I have significant use-cases that demonstrably save money and time for tasks humans are slower/worse at with current model API costs and model capabilities.
I actually think existing models are barely explored from a value POV even as new models generate even newer and "maybe" needed capabilities.
Another concern is that perhaps API costs are "discounted" at the moment and that enshittification will ensue.
Finally, I'm aware of workloads that worked well on certain proprietary models but no longer do, because their endpoints were replaced with updated, "sort of identical" fine-tunings that were WORSE for those use cases, leaving people scrambling to find other models for the same tasks.
AI can't solve every problem. Everyone is just trying to cover up their bad search algorithms. Look at the horrendous state of autocorrect, filtering in the bad just as much as the good.
Add to that purposeful censorship, as is also visible in autocorrect, and you have a lot more problems, e.g. health-relevant ones, that AI explicitly won't help solve, either at all or in a non-political manner.
Nobody is using LLMs for autocorrect yet. You want a sophisticated, contextually correct answer in realtime. An LLM can give you that, but you need a big model that won't fit locally on a phone.
_yet_
Not even the stranded assets that the AI boom is leaving behind can be 'fixed'. Those coal-fired power plants are opening at a record pace and will continue to operate for a couple of decades or more, undoing much of the work that even the big tech companies were bragging about in becoming green and sustainable.
But Microsoft and Google are already doing things we never predicted before, like private fission reactors for specific uses?
If the industry collapses we'll suddenly have a ton of research into more scalable fission reactors that have a different economic regime? Not to mention the reactors themselves...
Also, if AGI is close, then you suddenly have entirely new regimes and potential solutions? Fusion clearly isn't usable on the timescales we need if we continue with humans only. But if you can get human-level models that run faster, then we have the potential to build them much more quickly. It goes from being too far out to being a nearer-term solution.
AFAIK optical has more cost, latency, power draw and unreliability.
1st densify compute. Only then consider optical connects.
11:56 Isn't this why Nvidia bought Mellanox?
How about adding support for CXL memory expanders into GPUs?
I think MI300X has CXL support. Whether anyone uses it effectively, is anyone's guess. The bandwidth may be too low to be practical for LLM inference.
@@Nobody_Of_InterestTo the best of my knowledge, CXL and bandwidth are…Weird. I haven’t played around with any modules myself (I don’t have $4000+ to drop on an experiment), but from what I’ve read I’m pretty sure the bandwidth works additively, in the sense that to some extent you can add the bandwidth of the CXL module’s PCIe connection to the memory bandwidth of the accelerator (or CPU, making them surprisingly viable inference and training devices), but there’s obviously some sort of limitation on the amount of bandwidth you can add to a single PCIe device (defined presumably by the PCIe slot it’s hosted in), so there’s either a premium ratio between CXL expanders and PCIe devices, or you have to get more out of it via software and the architecture of your AI model.
As an example, you might imagine a Mixture of Experts model where a Transformer's feed-forward network is replaced by several small MLPs and the appropriate expert is selected for each token inferred. As each expert is smaller, it's less "expensive" to load it, and by virtue of allowing the experts to specialize, loading them is more "valuable" than loading the same number of parameters in a dense network. This is a weird way to look at it, but you might say that it allows for a higher "effective" bandwidth, or "software enabled" bandwidth, and so you could see storing the expert parameters in CXL expanders, and leaving the main parameters on-device. There's a happy synergy there where CXL expanders usually have a lot more capacity per unit of bandwidth (similar to the tradeoff seen in CPUs), and the backward pass is calculated per-expert, so you could also store the gradient in the CXL modules as well, meaning you could potentially train a monstrously sized model on a single accelerator, like an MI300X. Back-of-the-napkin math suggests that with 64 32GB CXL modules (obviously this is a ridiculous configuration, but bear with me), a single MI300X could probably train a 100B parameter model (in contrast to its base capacity of around a 4-8B model depending on exact setup), though I think in this specific configuration it would have to be a low-performance MoE (mixture of experts) because I think you would have to limit to best-of-one/topK MoE, whereas generally a fine-grained mixture of experts with multiple experts usually works better.
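Back-of-napkin check of that capacity claim (the bytes-per-parameter training overhead below is my own rough assumption, not a measured figure):

```python
# Rough capacity check for the "train ~100B params on one MI300X + CXL" idea.
# The per-parameter training overhead is an assumption for illustration.
cxl_modules = 64
gb_per_module = 32
hbm_gb = 192                    # MI300X on-package HBM

bytes_per_param_training = 16   # assumption: weights + grads + optimizer state
total_gb = cxl_modules * gb_per_module + hbm_gb
max_params_b = total_gb / bytes_per_param_training  # GB / (bytes/param) -> billions
print(f"Total memory: {total_gb} GB -> roughly {max_params_b:.0f}B trainable params")
```

With ~2.2 TB of total capacity and ~16 bytes per parameter of training state, you land in the low hundreds of billions of parameters, which is consistent with the 100B figure above (ignoring bandwidth entirely, of course).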
Anyone remember Folding@home? A distributed model for doing the calculations to fold a protein. Another one was SETI@home. Ultimately I can imagine 'training at home' in secret: model builders will offload training bits to the edge to reduce their costs, and users might not even notice. It will be bundled into the client so that you do training for OpenAI's model every time you use ChatGPT. After all, training is just matmul, massively parallel. And billions of people with cell phones is a massively parallel resource just waiting to be used. Sure, it's SLOW, but it will be FREE for OpenAI. Sort of like the way Facebook leveraged your desire to validate yourself by letting you be an exhibitionist online, all the while collating your data and selling it to data brokers.
Truly dystopian... Reliant on the possibility of splitting and collapsing the training workload effectively. Which isn't proven?
This won't happen because training has huge memory and bandwidth requirements compared to the usual computing. People would notice that the app is using much more cell data than expected. At OpenAI's scale you could not even hold the whole matrix in a cell phone's memory.
@@niamhleeson3522 he's intending for inference. Once you have all the weights figured out there's no reason you couldn't load or bake them into a completely different kind of system. Even a mechanical, analog or one time use system.
It's still very impossible, but not for the reasons you mentioned.
@@SquintyGears the original commenter was literally talking about distributing training to edge devices. what do you mean by "intending for inference"?
@@niamhleeson3522 oh I'm sorry I was replying to comments on a different thread and i got it mixed up.
So basically we are getting more silicon photonics ? Broader market adoption ?
Yeah, I don't know how many times in my life I have heard that optical computing will resolve our hardware limitations :) Ultimately, depending on an almost century-old algorithm for training "neural networks", a.k.a. (stochastic) gradient descent, will be the demise of all those systems. This is not intelligence, just silly, slow statistics with gargantuan electricity bills. You need real R&D to come up with new and radical models, not engineering scotch tape. For what it's worth, wetware seems more plausible to me right now. Instead of hopelessly trying to create intelligence out of sand, hack it and use whatever Nature has developed over millennia.
More videos like that please :)
How does Ayar Labs compare to POET?
They make the mistake of over-investing in parallel compute nodes? Why? Because of Nvidia and AI, they are now just moving from legacy compute nodes to mostly GPU-based compute nodes. Take a look at, say, the new compute nodes xAI is building now.
Thank you for caring about us. With the current subscription pricing of LLMs, less than 2% of Indian people can afford them.
Running against an exponential Wall?
Agentic is not a word. It is called “multi agent”
So we should train the AI model to better utilize the training that it already trained for depending on the load throughout the day.....and more bandwidth please lol
Sounds like Intel got it right with their focus on inference!
Soon there won't be a single economic pillar of AI left standing.
What happened to tachyum prodigy?
It still exists, they got a contract recently.
Truly capable AI will start from 14A GAAFET hardware.
time to start growing brain organoids and training them.
Practically, the deviation between "AI" and LLMs is a problem. So many orgs are hawking "AI" that varies from pure lies to just Excel. Usable LLMs (with a real-world purpose) are a curio / crypto 2.0 / the singularity.
Out of curiosity, have you built out an LLM agent yet? I'm going to make an assumption that the only case where you've used an LLM is as a chatbot, as that's typically the type of person I see this take from. "Oh, I asked ChatGPT to write me a song and it was kind of bland" or "I asked it to solve this problem, but it couldn't get it right" - ignoring that they just gave it a cold start, no background, no examples of how to solve it, and an unclear premise surrounding the problem.
What makes AI legitimately useful isn't necessarily using it as a chatbot. It's going in, documenting your workflow, identifying common themes over a long period of time, building individual modules that can help a model do each step, and then finally chaining it all together to get a robust **system** (not a model, a system) that can handle a specific task.
If you’ve ever played Factorio, you can kind of think of it as the difference between running around, hand-crafting and mining everything like it’s Minecraft versus playing the game as intended with abstracted out pipelines and production centers.
But what fundamentally is an agent? It's essentially just a loop where the LLM can add or remove information from its context, and call functions that let it do other things (calling itself to do a more specialized, smaller chunk of the work, calling a tool like a calculator, calling a symbolic AI program, getting some information out of a database for RAG, checking its own work for errors and hallucinations (please don't write this part off; it's surprisingly effective and makes a profound change in where you can use the system), etc).
I find the real difference in who understands what AI is useful for comes down to people who haven't built agents versus those who have. There are a small number of people who have built agents and were still left unsatisfied (usually very practical, down-to-earth people who've worked in a job that handles information an AI model can't cleanly deal with), but almost everyone I know who has actually gone through the effort of engineering an agentic pipeline has realized that it's an incredibly powerful abstraction for getting repetitive work done and focusing on the remaining unique / one-off problems they run into otherwise.
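For anyone curious, that loop really is the whole idea. Here's a minimal sketch (not any particular framework's API; `call_llm` and the tool names are placeholders I made up):

```python
# Minimal agent loop sketch. `call_llm` is a stand-in for whatever model
# endpoint you actually use; here it's a stub so the skeleton runs on its own.

def call_llm(messages):
    """Stub for a real model call. A real agent would return either a tool choice or a final answer."""
    return {"action": "finish", "answer": "(stub) no real model attached"}

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),              # toy tool, never use eval on untrusted input
    "lookup": lambda query: f"(stub) results for {query!r}", # e.g. a RAG/database query
}

def run_agent(task: str, max_steps: int = 10) -> str:
    # The whole "agent" is just this loop: decide, maybe call a tool,
    # append the result to the context, repeat until done or out of steps.
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(context)
        if decision["action"] == "finish":
            return decision["answer"]
        result = TOOLS[decision["action"]](decision["input"])
        context.append({"role": "tool", "content": result})
    return "gave up after max_steps"

print(run_agent("Summarise last week's support tickets."))
```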
Correction: every element in the chain must create more value than it costs to perform its function. That is a pro tip, especially for socialists. Without profit, nothing justifies the capital the element requires. That capital could be allocated elsewhere, but this function is competing for it.
And on the 3rd AI wave, AI said: "Let there be light"
Can AI solve the divorce rate or the murder rate?
or the declining birth rate?
Actually it’s gonna do the opposite.
Your premise is not correct. It is not about making money; it is all about quality of living. Liveability, not viability.