Lovely work. Keep 'em coming!!
Thanks! Surely will do! Stay tuned for new content. :)
Whoever figured out that one needs to get PAID. Possibly billions saved.
Most likely those who discover such things are paid generously; not billions, but still. :)
Good walkthrough! One thing I don't understand is how they can backpropagate through that RoundClip function when it's not differentiable.
Thanks for the feedback! I don't think that's mentioned in the paper, but this Reddit post explains what happens during backprop: www.reddit.com/r/MachineLearning/comments/1b22izk/comment/ksjphhj/?
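In case it helps others: from what I've read, the usual trick for non-differentiable quantizers like this is a straight-through estimator (STE), where the forward pass uses the rounded values but the backward pass treats the operation as the identity. Here's a minimal PyTorch-style sketch of that idea (the function names are just mine for illustration, not taken from the paper):

```python
import torch

def round_clip(x, a=-1.0, b=1.0):
    # Forward: round and clip to {-1, 0, 1}; not differentiable on its own.
    return torch.clamp(torch.round(x), a, b)

def round_clip_ste(x):
    # Straight-through estimator: the forward value is the quantized one,
    # but gradients flow through as if this op were the identity.
    return x + (round_clip(x) - x).detach()

w = torch.randn(4, requires_grad=True)
y = round_clip_ste(w).sum()
y.backward()
print(w.grad)  # all ones: the gradient "passes straight through" the rounding
```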
Can someone smart explain why we evaluate log2(3)?
Well, surely I'm not the smartest one around here, but here's my explanation: in information theory, the entropy (see the first equation here: en.wikipedia.org/wiki/Entropy_(information_theory)) measures the expected amount of information, in bits, needed to represent a random variable.
In BitNet 1.58, the random variable can take three values {-1, 0, 1}, each equally likely. If you put the numbers into the entropy equation, you get -1/3*log2(1/3) - 1/3*log2(1/3) - 1/3*log2(1/3) = -log2(1/3) = log2(3) ≈ 1.58 bits necessary to represent that random variable, which is where the "1.58" in the name comes from.
Hope that helps! :)
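If you want to sanity-check the arithmetic, here's a quick Python snippet (just my own illustration, not from the paper):

```python
import math

# Entropy of a uniform distribution over the three values {-1, 0, 1}
p = [1/3, 1/3, 1/3]
entropy = -sum(pi * math.log2(pi) for pi in p)

print(entropy)       # ~1.585 bits
print(math.log2(3))  # same value, hence the "1.58" in BitNet 1.58
```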
A binary number containing N bits can represent 2^N possible values. So if you want to represent 3 possible values (3 = 2^N) then N = log2(3).
10:48 It doesn't outperform LLaMA, they just bolded their results, which is a bit misleading.
Thanks for noticing that and pointing it out! Well, maybe not on all tasks, but the average does seem higher for BitNet 1.58. Still, the fact that they bolded only their own results is kinda misleading; I genuinely didn't notice that.
The Paper Explained series can be found here: ruclips.net/p/PL8hTotro6aVHhn5QUB3HDJTu3rPJ48LeP