The one thing the paper neglects to mention, which should have been the biggest breakthrough of the 1-bit LLM, is that the VRAM required for training should be drastically less than for its full-fat 16-bit float counterpart. It should be possible to train the 70B 1-bit model on a single RTX 4090 - at present, the 70B model with any meaningful quantization cannot even be run on a single consumer GPU. I made a video on this subject last week.
At present the VRAM savings of current quantized LLMs only show up during inference, but what is more important is the democratization of LLM training. Lowering the barrier to training an LLM is a must to stop one company conquering the LLM space entirely.
I’m not an expert so the details are a bit fuzzy to me, but my understanding is that it accumulates the gradients at full precision during backprop even though the forward passes are ternary. I think the original 1bit paper that the new one references goes into more details on the training process but I haven’t had a chance to really dig into it. But I think the memory savings may not apply to the training process.
@@ShawnFumo My understanding is that the bulk of the LLM is a black box that takes a vector as input (which comes from the sentence encoder), applies several layers of matrix multiplication to it, and spits out another vector. The whole point of this is that the weights themselves are ternary, but the vectors at each layer may not be. Perhaps they have to keep all the resultant vectors of each layer in memory during training, resulting in greater memory usage. However, having all the weights be ternary should still save some memory.
From their follow-up paper ‘The Era of 1-bit LLMs: Training Tips, Code and FAQ’ they mention this:
“Training acceleration?
Our current implementation for training our model is still in FP16/BF16, so there is no actual speed-up in our experiments. However, there are significant opportunities for acceleration, especially for larger models. First, low-precision CUDA kernels (e.g., FP8 GEMM) can be used to accelerate the forward and backward computations of BitLinear2. Second, 1.58-bit language models can save a substantial amount of memory and communication bandwidth for distributed training. This is particularly beneficial for ZeRO [RRRH20], which conducts an all-gather operation to collect parameters during forward and backward computations.”
So they don’t actually even train the thing with ternary weights. Hell, we don’t even know when they convert the FP16 weights to ternary! This paper is not credible to me. Not only do they not provide clear methodologies allowing others to reproduce their findings, but they directly contradict what they claim to do in their follow-up paper.
They say they train the model from scratch using ternary weights, but 3 weeks later they post a follow-up stating “Our current implementation for training our model is still in FP16/BF16, so there is no actual speed-up in our experiments”.
If they had accomplished something notable that would be front and center yet they leave it up to how you interpret the above sentence.
If we give them the benefit of the doubt then we would interpret their statement as the COMPUTATION is still high precision and the weights are updated to represent the ternary values. If this is the case then it does not change the fact that you need many, many powerful GPUs/TPUs to train larger models.
Be skeptical of this paper.
@@SpencerP96 "If we give them the benefit of the doubt then we would interpret their statement as the COMPUTATION is still high precision and the weights are updated to represent the ternary values. If this is the case then it does not change the fact that you need many, many powerful GPUs/TPUs to train larger models."
Yes, and they never claimed you need less powerful systems for training. As I said in my other comment, their original BitNet paper (2310.11453 on arXiv) gets more into the training. They explicitly say "Most existing quantization approaches for large language models are post-training." and "To the best of our knowledge, this work is the first to investigate quantization-aware training for 1-bit large language models.", and they show the math for their BitLinear function that replaces the regular matrix multiplication.
The whole point of these two papers is that doing the quantization while training gives a better result than trying to quantize after training is done, and that the quality ends up similar to FP16 (in the case of ternary), while getting the normal benefits of quantizing (less VRAM needed during inference, and faster inference).
They already said they are releasing the code and weights in a week or two. That latest PDF was just to give people a quicker response. Not sure why the need to be suspicious when they've been so communicative and open on huggingface.
@@ShawnFumo I reread the BitNet paper and it’s addressed in there so I rescind my comment.
My general criticism is bigger than just this paper, though. I generally dislike when papers aren't clear about something they're intending to say (and I don't mean the math, just normal sentences). As far as I saw when I looked, they only responded to a couple of comments on huggingface. Additionally, I shouldn't have to dig around 3 different papers and a comment section to figure out what they mean. I'd prefer if they'd just say "for more details on X, see the original BitNet paper", which they do via a huggingface comment (though not directly applicable to the parts I quoted, so you'd have to make the connection yourself; again, the ambiguity is what's frustrating). But I shouldn't have to look at a comment section to get that, especially when that's not the only place someone could find the paper. If I just stumbled across the paper on arXiv or a blog post or a YouTube video I might never read those comments.
That extends beyond just this paper though, but since this is a comment section for this paper I’m picking on it.
Great! Very helpful. One suggestion I’d make: a fractional number of bits will be unfamiliar to many people. I think it would be useful to make clear that yes, of course, in practice you need two bits to represent three states, and the 1.58 number is the theoretical Shannon-Hartley entropy (log2(3) ≈ 1.585).
Correct. Note that in memory you could pack five ternary values into a byte (3^5 = 243, which is less than 2^8 = 256), so in fact you could store them at 1.6 bits per value in memory with custom hardware.
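A minimal sketch of that packing in Python (illustrative only, not from the paper): each group of 5 ternary weights becomes one base-3 number in the range 0..242, which fits in a single byte, i.e. 8/5 = 1.6 bits per weight.

# Pack 5 ternary weights {-1, 0, 1} into one byte (3^5 = 243 <= 256).
def pack_trits(trits):
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in trits:
        byte = byte * 3 + (t + 1)   # map -1/0/1 to base-3 digits 0/1/2
    return byte

def unpack_trits(byte):
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)  # digits come out least-significant first
        byte //= 3
    return trits[::-1]

print(pack_trits([1, 0, -1, -1, 1]))    # 191
print(unpack_trits(191))                # [1, 0, -1, -1, 1]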
1.58 bits is exactly 1 trit. Ternary computing.
I'm nitpicking but it's the information content, not the entropy.
We had a lecture about single-bit neural networks in one of my uni courses, some 5 years ago. It was interesting.
Wow this seems promising. I hope this will reproduce properly and work in other situations too. If it is truly better in general, new hardware could be so much more efficient
Promising? I really can't see why all these models haven't used trits from the start; it's basically LLMs in assembly. I guess more numerical computing experts with a background in mathematics and logic are needed.
Why not call it what it is? A trit
Wouldn't that be 3 bits?
@@lendo1457 1 bit holds 2 values - 1/0. To store 3 values you would need 2 bits. There were ternary logic machines that used trits (Setun), but using 3 values in binary computers seems a bit weird in terms of optimization. There could be some use for the fourth, "unused" value.
U twit
@lendo1457 In binary, to store the values 0 and 1 you use 1 bit (0, 1). An unoptimized signed -1 would require 2 bits total (00, 01, 10, but not 11). If you were building a data structure to store it, though, you would use an encoding that uses less than 2 bits.
@@playlist5455 Thank you!
1.58 bits, to be precise.
OK, but what is the theory on WHY it achieves the same performance? Maybe this shows no one really understands how neural networks work, and we are giving them much more complicated steps when they could just be some "quantised" states.
I think to achieve the same results, it would need to have way more layers, to make up for the lack of nuance. It might run faster because the math is simplified, but I think it would use like 16 times as many layers or more. I think the result would be the same amount of data, giving the same chat answers, but it might run faster.
@@Omnicypher001 But it apparently has the same number of parameters.
@@dsmogor If that were the case, then the weight of the model (in GB) would be about 20 times lower compared to float32.
So between this, Groq hardware, the Mojo language and the Mamba architecture... How many of these are compatible and stack their benefits synergistically? And where they stack, is the performance gain additive or multiplicative?
Mamba is a trade-off between memory and speed.
Maybe Mamba will be used only in hybrid or special cases.
How does the Mojo language relate to all this?
Simple: just get one of them to disassemble the others and rebuild them into its own platform... after all, isn't that the whole idea of multimodal input?
1:22 it's 2 bytes, not 4 bytes, right?
Right, thank you for the correction 🙏
wonder why people don’t use this approach from the beginning. It’s like LLMs in assembly language. And as far as I know, every linear operator has a kernel. The kernel means that a linear operator H always maps the zero vector to itself. When we use a computer, we represent the zero vector as a column matrix of n zeros. Since the layers of LLMs are in the same vector space, we have H\vec{0} = \vec{0} for any H. I apologize for my bad LaTeX, but \vec{0} is supposed to be a vector. It’s important to remember that 0 is the trivial element in the kernel. For example, let Z be the set of all integers, and let H be the multiplication operator. Then, in ordinary algebra, we have positive, zero, and negative integers. The operator is \cdot, not x. The multiplication operator is often used in quantum mechanics of many particles, where the vector space grows exponentially, just like the number of bits for multiple objects.
To summarize, BitNet was trained from scratch.
Therefore I cannot quantize an existing LLM to 1.58 bits?
Or is there a quantization approach for existing LLMs down to 1.58 bits?
Yes, it would need to be trained from scratch.
The math used in the model is different, so by my understanding they are entirely incompatible with each other.
Edit: Or I'm wrong - the supposed implementation I found actually has quantization code. I'm too lazy to try running it right now to verify it's not just garbage.
@@Alice_Fumo I haven’t dived into it too much, but the original 1-bit paper mentions quantization-aware training. Like, I think it uses full precision for gradient accumulation as it back-propagates, but uses the trits for the forward and backward passes. I’m not sure exactly how that works out in practice, but it definitely isn’t the same as doing all the training in full precision and quantizing after.
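Roughly how that could look in code, as a sketch based on my own reading of the BitNet papers (not their released code; a real BitLinear layer also quantizes activations and applies normalization, which is left out here): the layer keeps full-precision weights for the optimizer, uses their ternary projection in the forward pass, and passes gradients straight through to the full-precision copy.

import torch
import torch.nn as nn

class TernaryLinear(nn.Module):
    # Simplified BitLinear-style layer (illustrative, not the authors' code).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)          # absmean scaling
        w_q = torch.clamp(torch.round(w / scale), -1, 1) * scale
        w_q = w + (w_q - w).detach()                    # straight-through estimator
        return x @ w_q.t()

layer = TernaryLinear(16, 8)
out = layer(torch.randn(4, 16))
out.sum().backward()
print(layer.weight.grad.shape)                          # gradients land on the full-precision weights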
It’s too lossy for quantisation. You can see their model has many more parameters.
How feasible is it to adapt BitNet b1.58's ternary quantization (-1, 0, 1) for quantum computing using qutrits, given the current state of qutrit-based hardware, error correction, and the development of specialized quantum algorithms?
This is something I predicted would happen in AI. It's cool to see a concrete usage of it. Ternary computers are the most efficient computers and base 3 is the most efficient base, so this isn't surprising. Read up on Radix Economy to learn more.
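For reference, a quick way to see the radix-economy argument (standard textbook formula, nothing specific to this paper): the cost of representing a number N in base b grows like b * log_b(N), so the base-dependent factor is b / ln(b), which is smallest near e, with base 3 edging out base 2.

import math

# Radix economy factor b / ln(b): smaller means cheaper per unit of information.
for b in (2, 3, 4, 10):
    print(b, round(b / math.log(b), 3))
# 2 -> 2.885, 3 -> 2.731, 4 -> 2.885, 10 -> 4.343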
How would you represent ternaries in hardware?
Would you leave pins floating, force them to the middle with a voltage divider, or add a second pin? *
Also, in general computing, multiplication by unknowns and division by non-powers of 2 are rare operations.
All of that ignores the added complexity that would nullify the advantages of radix economy, because it would increase the complexity of division by abandoning the simple check in binary long division for the guess-and-check needed in bases larger than 2.
* In the first case you could not run at high clock speeds because stray capacitance and inductance would cause errors.
Second case: transistors become inefficient at the midpoint between high and low, causing massive energy consumption and heating.
Third case: a second line allows you to use nibbles, meaning you would just be ignoring certain states on principle and wasting computational power.
@@antonf.9278 Just use negative voltages. Also, division by non-powers of 2 is VERY common in computing, as most division in applications will not be by a power of 2, like in machine learning.
To me it highlights that compute is more important for neural networks than precision. That could also allow the use of probabilistic computing, allowing for much more compute per watt.
Model weights will make a lot more sense
Interesting how accuracy will be impacted in the end.
I've made a few contributions to quaternary algebra; I discovered the inclusive and exclusive NOT gates and am currently working on proofs for them.
The issue with ternary and quaternary at the moment is that current computers have to use numerous transistors per ternary or quaternary digit. Until we have a ternary or quaternary transistor, we may have to keep using bytes just like regular integers. I haven't seen any patents for a working one that isn't several times larger than a binary transistor, which makes going back to binary more efficient, though of course it depends.
I don't know what Microsoft is doing, but on top of this, running ternary requires at absolute minimum 2 binary bits, meaning 2 physical data lines at best. Depending on how optimized everything is, from your language's compiler to the kinds of operations you're performing, it may use significantly more.
Running ternary on current hardware doesn't quite make practical sense when, for the same amount of data lines, you could be using quaternary.
Who asked
I realize you have considerable expertise on this. However, what I would expect them to do is encode 5 ternary values into a single eight-bit byte.
As I understand it, this accomplishes two things:
1. The multiplication step is removed entirely, which could greatly simplify custom architectures (see the sketch after this list).
2. The amount of memory required to store the model decreases substantially. Even if it takes two bits per value to make it efficient, that is a fourfold reduction in the amount of memory required, which is a major improvement, especially for working on home GPUs with limited memory.
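To illustrate point 1 with a toy example (my own illustration, not code from the paper): with ternary weights, a dot product needs no multiplications at all, since each weight either adds the activation, subtracts it, or skips it.

def ternary_dot(weights, activations):
    # weights hold only -1, 0, or 1; activations are ordinary numbers
    acc = 0.0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a        # add
        elif w == -1:
            acc -= a        # subtract
        # w == 0: skip, no work at all
    return acc

print(ternary_dot([1, 0, -1, 1], [0.5, 2.0, 1.5, -3.0]))   # 0.5 - 1.5 - 3.0 = -4.0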
@ithaca2076 Your reasoning is correct, but it's beyond the scope of an LLM. This is more related to the architectures of neural networks; they require FP16 weights to solve all kinds of problems. The paper presents a rather basic quantization formula (not new at all), and then they present a bunch of metrics that you don't really know how they obtained. The paper explains very little about the "model", and the big problem here is: how do you transform your weights from FP16 to 1 bit (or 1.58 or whatever) without losing information (precision) in your model? There's a trade-off between quantization and precision, and they don't explain anything about it. No demonstration at all. So I don't think it's a real "paper" yet; it's maybe a hypothesis, or vaporware, at least until it's truly demonstrated. I wish I were wrong though, it would be a real breakthrough.
@@cristianpercivati5097 I think the only way it could work is if they scaled up the number of layers. Sixteen 1-bit decisions could give you the same nuance as one 16-bit decision, so maybe it's 16x the layers to get the same result, but it calculates faster by avoiding the floating-point math.
I wonder what the distribution is between the three values? It would be interesting if it were an even 33.33% each.
Never... because it would be more like 127:256 or 49.6% = 1,
1:256 or 0.4% = 0,
and 128:256 or 50% = -1,
because only a perfect zero gets to stay zero. I.e., sorting an 8-bit value by sign:
8 bit ==> 127 positive numbers
==> only 1 zero
==> 128 negative numbers
A program to sort it would be:
if (x != 0) {
    if (x > 0) { x = 1; }
    else { x = -1; }
}
Around 4:40 they explain the quantization algorithm. By default it looks like the quantization will distribute the values roughly evenly between 1, -1 and 0, but it won't be exact.
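A quick sanity check of that claim, assuming roughly Gaussian weights and an absmean-style rounding rule like the one described in the paper (my approximation, not their code):

import numpy as np

def ternary_quantize(w, eps=1e-8):
    scale = np.abs(w).mean() + eps                 # absmean scale
    return np.clip(np.round(w / scale), -1, 1)     # round, clamp to {-1, 0, 1}

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000_000)          # toy "weight matrix"
q = ternary_quantize(w)
for v in (-1, 0, 1):
    print(v, round(float((q == v).mean()), 3))
# Roughly 0.345 / 0.31 / 0.345 for Gaussian weights: close to even, but not exact.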
📝 Summary of Key Points:
📌 The research paper discusses the era of 1-bit LLMs, focusing on reducing the size of large language models to address issues related to compute and memory resources, as well as environmental concerns.
🧐 The introduction of the BitNet b1.58 model architecture, which uses ternary weights (-1, 0, 1) to reduce the number of bits required to represent the model, leading to improved efficiency without sacrificing performance.
🚀 Benefits of the BitNet b1.58 model include reduced memory usage, lower latency, and comparable performance to full-precision models, showcasing its potential for future applications in large language models.
💡 Additional Insights and Observations:
💬 "Quantization in machine learning refers to the process of reducing the precision of model weights to optimize memory usage and speed."
📊 The BitNet b1.58 model demonstrates significant improvements in memory usage, latency, and perplexity compared to existing models like LLaMA.
🌐 The research paper presents compelling evidence of the effectiveness of the BitNet b1.58 model through comparisons with established models and tasks.
📣 Concluding Remarks:
The era of 1-bit LLMs introduces innovative approaches to reducing the size of large language models, with the BitNet b1.58 model showing promising results in terms of efficiency and performance. This research opens up new possibilities for more accessible and environmentally friendly AI models in the future.
Generated using TalkBud
1:56 Is the "same performance" with "Pareto improvement" just an illustration of a theoretical prediction, or actual model weight data from a real trit model?
Why does it still work when it’s quantized from float16 to -1, 0, 1? There could be countless numbers in float16 but only 3 numbers after quantization. I’m confused by this. 😂
It's simple: it's either a cat or a dog or it's nothing... in a cat/dog model... I don't care if the picture has half a cat, it's still a cat in the picture...
It’s not quantized from float-16. It’s trained from the start with only the values (-1, 0, and 1)
@@ntal5859 To determine if it's a cat or a dog, you need to crunch a lot of numbers on fuzzy inputs, and using 1 bit instead of floats is like rounding during every step of a large math problem, so I doubt their claims that this could be as accurate as floating point at the same scale. You want a model that can think "this part looks a bit like a cat, and this other part also looks a bit like a cat, but this part looks like a dog, so I think this is 53% likely to be a cat". These things are supposed to add up uncertainty, but these 1-bit models would be certain about everything, which basically means they would be more likely to hallucinate unless you scaled them way up.
A nuanced model with high-precision floating points would think like a philosopher, while a BitNet model of the same scale would think like a Sith. Maybe if you scale up that Sith bot, it could become a philosopher, but you would need way more layers than an equivalent floating-point model to get back that nuance. I think you would need something like 16 times as many layers to get the same result, but maybe even at that scale increase it could still be faster by avoiding some of the math.
Technically they should call it a 2-bit LLM -- which has multiple meanings ;)
2^2 = 4 states; this only has 3 states, so 2^1.58 ≈ 3...
@@ntal5859 You could also say it only has 2 states, -1 and 1, where 0 is just a superposition between the two. (Kidding.) How do you store a ternary value in 1.58 bits? I would assume it would use two's complement and store it in 2 bits, where the second bit is the -/+ flag (therefore it is actually 2 bits in memory).
@@ntal5859 Yeah, but it's represented with two bits, and n-bit usually refers to the representation.
@@rayujohnson1302 In dedicated hardware, you can store 5 ternary values in a byte (since 3^5 = 243, which is less than 2^8), so you could actually store it at 1.6 bits per value, not too far off the 1.58-bit theoretical value.
Excellent explanations. This seems to be a comparison with Llama 1 though; any confirmation whether Llama 2 models also perform similarly after quantization? I am curious to know if this works on later generations. Conceptually Llama 2 outperforms Llama 1 at the same size (i.e. 7B vs 7B, 13B vs 13B), so in effect the same weights now hold more complexity than before, i.e. compression will work better when weights have more redundancy, compared to later versions where precision is more likely to be driving the performance differences.
This is great! FYI, you can create a model of your voice in ElevenLabs, do a voice-to-voice transformation, and out would come perfectly pronounced English.
I found this out by accident because I created a model of Arnold Schwarzenegger's voice, but everything I made it say LOST the accent but kept his tone of voice, LOL
That may be fun, but it can clearly be potentially much more dangerous than a password leak. You trained it with your voice, right? Would you want someone to make calls with hate speech using your voice, for example?
This might be even more of an advantage when we get dedicated hardware, since tri-state logic is already a thing in CMOS. A dedicated tri-state matrix multiplication architecture for this type of network should be easy to engineer with modern processes. NVIDIA should be all over that.
But how is the accuracy?
I think running transformers on current CPUs/GPUs is itself a CPU/GPU problem, because that hardware likes 1s and 0s, so the 1-bit model gets reduced to 1/0 to fit the limitations of current CPUs/GPUs. A GPU/CPU built for transformers might work better.
Does it mean every model can be quantized this way?
I don't think you can do that post-training on a float-based model.
@@valentindion5573 makes sense. Thank you!
possible to test this model with llm studio?
no you're gonna need slm studio for these babies
Do we have code, or at least something from the community?
They have released it, IIRC.
Yes, and it requires adapting llama.cpp to work with that; they say it's not hard and requires just minor adjustments. The only problem is models need to be trained from scratch. So let's hope they at least start. If we are lucky, we could start seeing something next month.
@@Backtitrationfan Where are you seeing code? Someone posted a few comments on the huggingface paper thread with code, but that is just a regular person’s attempt at re-implementing it from the paper. On the very first post, there is a link to a folder in MS’s unilm repo where it seems the code will eventually appear, but it isn’t there yet. I’m 🤞 that it’ll appear next week some time, but who knows.
Where did they release the code? @@Backtitrationfan
We are putting together a team to work on 1-bit LLMs, are you interested?
A lot of "trees" here....
🎉😊❤
What does the Pareto improvement mean? That it's the 20% giving 80% of the performance?
A Pareto improvement means it gets better on some axes (memory, latency, energy consumption) without getting worse on any other (like task performance), rather than trading one off against the others.
Well well well, doesn't this model seem like it would run best on a quantum computer? Please enlighten me.
Probably so, but there's still a ways to go before a quantum computer could handle running an LLM, I think, in terms of the size of the params.
This one - definitely no. Can a QC run a hypothetical ternary model faster than a classical one? Open question.
No... 3 states would be easy for any conventional CPU... in fact, the whole idea was to take away the effectively infinite-precision weights and bring them down to 3 states.
2:13
Sorry, but this paper is really just a brief introduction to 1-bit LLMs, and the video itself doesn't explain anything beyond reading it out loud.
There are multiple open questions, like what the viable option for training such models is, how it influences the activation functions, and what the real benefit is here, since it suggests that without multiplications today's GPUs would not be required, which is not really true. And requiring new optimized hardware is not really a great path forward.
No code = No proof
Sure, but keep in mind it is a research paper with a lot of details (plus the previous paper). So even if they never release code (though MS has a pretty good record on releasing code), others should be able to reproduce it anyway. There is a lot of hype going on from people online since this could be really important, but it isn’t like a big flashy presentation. I have no reason to think they’d fake something like this, that is more technical and isn’t a product. They’d have a lot more to lose than gain IMO. Plus, they say themselves it hasn’t been scaled up above 3.9b params. They’re hopeful it keeps up the same trajectory, but we’ll have to wait and see. We’ll know the answer pretty soon anyway. This is too big a deal that others won’t try it now.
The LLM has 0.1975 bytes. I don't think it's going to work.
What is that accent?
Was looking for the same.
I would guess French.
So in summary, everything is either Yes = 1, Never mind = 0, No = -1. If only women were so simple to work out.
It's a bit difficult to understand your accent, probably because I'm not a native speaker. Would you consider using an AI-synthesized voice?
Please don't. Most TTS engines have become my personal heuristic for low-effort spam (sometimes including automated content farms). Voice acting is a skill and will improve over time if you let it. Individuality, the subtle inflection and candidness of a person's interior thoughts matching the waveforms you hear, is something neither a hired voice actor nor a TTS model could replicate.