The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits - Paper Explained

  • Published: 7 Sep 2024

Comments • 12

  • @mohammedrumaan2704
    @mohammedrumaan2704 5 months ago +1

    Lovely work. Keep 'em coming!!

    • @datamlistic
      @datamlistic  5 months ago +1

      Thanks! I surely will! Stay tuned for new content. :)

  • @jeffg4686
    @jeffg4686 5 months ago +1

    Whoever figured out that one needs to get PAID. Possibly billions saved.

    • @datamlistic
      @datamlistic  5 months ago +1

      Most likely those who discover such things are paid generously; not billions, but still. :)

  • @vlastimilmartinek9800
    @vlastimilmartinek9800 5 months ago +1

    Good walkthrough! One thing I don't understand is how they can backpropagate through that RoundClip function when it's not differentiable.

    • @datamlistic
      @datamlistic  5 months ago +1

      Thanks for the feedback! I don't think that's mentioned in the paper, but this Reddit post explains what happens during backprop: www.reddit.com/r/MachineLearning/comments/1b22izk/comment/ksjphhj/?
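      In short, the usual trick is the straight-through estimator (STE): quantize in the forward pass, but let gradients flow as if the rounding were the identity function. The paper doesn't spell this out, so the sketch below is just a common PyTorch implementation of that idea, using RoundClip(x, a, b) = max(a, min(b, round(x))) as defined in the paper:

          import torch

          def round_clip(x, a=-1.0, b=1.0):
              # RoundClip(x, a, b) = max(a, min(b, round(x)))
              return torch.clamp(torch.round(x), a, b)

          def round_clip_ste(x):
              # Forward pass returns the quantized value; the detached
              # (round_clip(x) - x) term contributes no gradient, so the
              # backward pass sees the identity function.
              return x + (round_clip(x) - x).detach()

      The detach trick is equivalent to writing a custom autograd function whose backward simply passes the incoming gradient through unchanged.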

  • @Jdog1681
    @Jdog1681 5 months ago +1

    Can someone smart explain why we evaluate log2(3)?

    • @datamlistic
      @datamlistic  5 months ago

      Well, surely I'm not the smartest one around here, but here's my explanation: in information theory, the entropy (see the first equation here: en.wikipedia.org/wiki/Entropy_(information_theory)) measures the expected amount of information, in bits, that you need to represent a random variable.
      In BitNet 1.58, the random variable can take three values {-1, 0, 1}, each equally likely. Plugging the numbers into the entropy equation gives -1/3*log2(1/3) - 1/3*log2(1/3) - 1/3*log2(1/3) = -log2(1/3) = log2(3) bits needed to represent that random variable.
      Hope that helps! :)
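      A quick numerical check of that computation (plain Python, just for illustration):

          import math

          p = [1/3, 1/3, 1/3]  # {-1, 0, 1}, each equally likely
          entropy = -sum(pi * math.log2(pi) for pi in p)

          print(entropy)       # ≈ 1.585
          print(math.log2(3))  # ≈ 1.585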

    • @nyx211
      @nyx211 3 months ago +1

      A binary number containing N bits can represent 2^N possible values. So if you want to represent 3 possible values (3 = 2^N) then N = log2(3).
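      To make that concrete: since log2(3) ≈ 1.585, you can pack five ternary values into a single byte (3^5 = 243 <= 2^8 = 256), which works out to 8/5 = 1.6 bits per value. A toy sketch of that counting argument (not how the paper stores weights, just an illustration):

          def pack5(trits):
              # Pack five values from {-1, 0, 1} into one byte (0..242).
              assert len(trits) == 5
              code = 0
              for t in trits:
                  code = code * 3 + (t + 1)  # map {-1, 0, 1} -> {0, 1, 2}
              return code

          def unpack5(code):
              # Inverse of pack5.
              trits = []
              for _ in range(5):
                  trits.append(code % 3 - 1)
                  code //= 3
              return trits[::-1]

          print(unpack5(pack5([-1, 0, 1, 1, 0])))  # [-1, 0, 1, 1, 0]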

  • @bobsmithy3103
    @bobsmithy3103 5 months ago

    10:48 It doesn't outperform LLaMA; they just bolded their results, which is a bit misleading.

    • @datamlistic
      @datamlistic  5 months ago

      Thanks for noticing that and pointing it out! Well, maybe not on all tasks, but the average seems to be higher for BitNet 1.58. Anyway, the fact that they bolded only their results is kinda misleading; I genuinely didn't notice that.

  • @datamlistic
    @datamlistic  5 months ago

    The Paper Explained series can be found here: ruclips.net/p/PL8hTotro6aVHhn5QUB3HDJTu3rPJ48LeP