The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet

  • Published: Jan 17, 2025

Comments • 25

  • @crypticnomad 9 months ago +6

    Thank you for the great breakdown. I really like how you went back to the first paper to explain the theory underlying BitNet and then explained what was different in the new paper. In general, these kinds of advancements excite me because they could potentially make running, and in some cases training, these huge models something us mere mortals without infinite compute can actually do locally.

  • @ehfik 10 months ago +1

    Oh! I've been thinking about this myself. How nice to see it realized!

  • @TheRealHassan789 10 months ago

    Excellent review! Thanks

  • @cbuchner1 10 months ago +2

    Does using FP8 for activations as opposed to INT8 offer a significant accuracy benefit?
    I suppose an integer adder is even simpler than a floating-point adder and may save additional power.

    • @cbuchner1 10 months ago

      OK, I found a paper discussing these details: arxiv.org/pdf/2303.17951.pdf

    • @gabrielmongaras 10 months ago +2

      I have no idea why I said FP8 during the video 😳
      INT8 is used, just like you said, since FP8 doesn't offer anything over INT8.

    • @cbuchner1 10 months ago +1

      @gabrielmongaras Actually, I have seen a paper that reports a stability benefit of FP8 over INT8 for LLMs during training once they scale beyond a certain size.

    • @gabrielmongaras 10 months ago

      @cbuchner1 That sounds really interesting. Can you please send the paper? I wonder if other formats would work better in FP8, such as how BFLOAT16 is usually better than FP16.

    • @cbuchner1 10 months ago

      @gabrielmongaras Chapter 4.6 here: arxiv.org/pdf/2303.17951.pdf. But I misremembered, insofar as they state that it works better for Transformers (not limited to very large ones) and that there are ways to make it work well with INT8 too.
      I will have to keep looking for papers that compare INT8 and FP8 in training GPTs.
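For readers following the INT8 question in this thread, here is a minimal sketch of per-tensor absmax quantization of activations to INT8, which is the general kind of scheme being discussed. The function names (`absmax_int8`, `dequantize`), the clipping range, and the epsilon are illustrative assumptions, not taken from the paper's code.

```python
def absmax_int8(activations, eps=1e-8):
    """Quantize a list of floats to int8 values with one shared absmax scale."""
    largest = max(abs(v) for v in activations)
    # Map the largest magnitude onto 127; eps guards against an all-zero input.
    scale = 127.0 / max(largest, eps)
    quantized = [max(-128, min(127, round(v * scale))) for v in activations]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original floats."""
    return [q / scale for q in quantized]


q, scale = absmax_int8([0.2, -2.0, 1.5])
print(q)  # [13, -127, 95]
print(dequantize(q, scale))
```

Integer values like these can then be accumulated with integer adders, which is the power argument raised at the top of the thread.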

  • @AnamikaChatterjee-j4l 10 months ago

    Excellent explanation! Can you please make a video on speculative streaming?

  • @FaultyTwo 10 months ago

    Can't wait for 0.5 bits models!

    • @notu483 10 months ago

      According to the calculation, log2(0.5) = -1, so does that mean you need a base -1 number system?

    • @blaineone 10 months ago

      😂😂😂

    • @Aerotune0 10 months ago

      Analog computing
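As a side note on the arithmetic in this thread: the "1.58 bits" in the title is log2(3), the information content of one weight that can take three values (-1, 0, 1). Going the other way, the number of states is 2 raised to the bit count, so a hypothetical 0.5-bit weight would correspond to about 1.41 states rather than a negative base. A small sketch, with hypothetical helper names:

```python
import math


def bits_per_weight(num_states):
    """Bits needed to encode one weight that takes num_states distinct values."""
    return math.log2(num_states)


def states_for_bits(bits):
    """Number of distinct values implied by a given bits-per-weight budget."""
    return 2.0 ** bits


print(round(bits_per_weight(2), 2))    # 1.0   (binary weights)
print(round(bits_per_weight(3), 2))    # 1.58  (ternary weights)
print(round(states_for_bits(0.5), 3))  # 1.414
```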

  • @ariabk 10 months ago

    very cool

  • @david1mdavis 10 months ago

    Will this work for CNNs or only LLMs? Does the training still need to use GPUs, or is this only an inference benefit? I don't see any examples or data other than this paper.

  • @TheRealHassan789 10 months ago

    Noob question… so binary/ternary quantization has been around for a while… which part was the major innovation/discovery in the BitNet paper?

    • @gabrielmongaras 10 months ago +5

      Mainly that a model can be trained from scratch using binary weights and still be competitive in terms of perplexity and accuracy.
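To make the weight side of this concrete: the first paper uses binary weights, while the 1.58-bit paper uses ternary weights obtained by scaling with the mean absolute value and then rounding and clipping to {-1, 0, 1}. The sketch below follows that description as I understand it; the function name `absmean_ternary` and the epsilon value are assumptions, and it omits training, where full-precision latent weights are typically kept and quantized on the forward pass.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a 2-D list of float weights to {-1, 0, 1} (absmean sketch)."""
    magnitudes = [abs(w) for row in weights for w in row]
    gamma = sum(magnitudes) / len(magnitudes)  # mean absolute weight
    scale = gamma + eps                        # eps avoids division by zero
    # Round to the nearest integer, then clip into the ternary range.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    ternary = [[max(-1, min(1, round(w / scale))) for w in row]
               for row in weights]
    return ternary, gamma


w = [[0.9, -0.05, -1.2],
     [0.3, 0.0, 2.0]]
ternary, gamma = absmean_ternary(w)
print(ternary)  # [[1, 0, -1], [0, 0, 1]]
```

With weights restricted to -1, 0, and 1, a matrix multiply reduces to additions and subtractions of activations, which is where the efficiency argument comes from.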

  • @KevinInPhoenix 7 months ago

    Does this technology make Nvidia's tech and NPUs obsolete?

  • @darshanputtaswamy3199 9 months ago

    This paper will put a ceiling on the price of Nvidia stock.

  • @szebike 5 months ago

    Technically impressive that it's possible; however, I only see limited application for this. Practically, most models below 8-bit quantization are way less "aware" of input and context. If I alter a situation in an 8b model, it can adjust its output accordingly; any model below that is very rigid. That being said, maybe you can mitigate those effects by training the model at that quantization to begin with, rather than compressing a model that was trained at higher precision, if what I'm saying makes sense...

  • @JorgetePanete 10 months ago

    we need the code

    • @crypticnomad 9 months ago

      I found an implementation already on pip under the name "bitnet". From looking at the code, they have fully implemented the first paper and are now making the changes needed to implement the second paper. They even have a BitNet version of Llama (bit_llama) in the repo. They also have a function that lets you replace all of the linear layers in a model with bitlinear layers.

  • @AIMevideochat 10 months ago +1

    Hi bro, could you recommend a large model that can provide a girlfriend API, or one that I can fine-tune into a girlfriend model? I need an uncensored model for a girlfriend role (not NSFW, just a warm girlfriend: when you ask her "can you be my girlfriend", the model won't reply "I am an AI model", which is so annoying), or any other way to solve the problem. I watched your video "Talking to girlfriend", but I'm worried that the model you mentioned might be outdated. I am looking forward to your reply. Thank you!

    • @szebike 5 months ago

      I don't think LLMs should be used for this task. You can interact with them, that's OK, but a "girlfriend" is something between humans. You shouldn't make money on the backs of lonely people, and even if it's free, it's unhealthy to form a forced relationship with a machine (apart from being morally and ethically questionable).