Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)

  • Published: 24 Dec 2024

Comments •

  • @scoffpickle9655
    @scoffpickle9655 23 hours ago +59

    Only true OGs know this was originally named "2024 12 24 15 43 12"

  • @lem0nhead84
    @lem0nhead84 20 hours ago +18

    Just wanted to say that we are very lucky to have this content for free.

  • @ai_outline
    @ai_outline 3 hours ago +1

    Computer science is evolving at an amazing pace. Impossible to keep track… thank you so much for this video!

  • @ДаниилИмани
    @ДаниилИмани 15 hours ago +5

    00:00 - Introduction and abstract of the article
    01:02 - Plots for comparing scaling properties of BLT vs. LLaMA 2 and LLaMA 3
    03:28 - Architecture of Byte Latent Transformer
    07:50 - Explains tokenization; byte-pair encoding
    13:25 - Problems with tokenization
    14:46 - Patch embeddings; dynamic tokenization
    20:35 - Entropy-based grouping of bytes into patches
    28:42 - Local encoder and local decoder
    29:48 - Encoder hash n-gram embeddings
    32:44 - BLT-specific hyperparameters: patch sizes
    33:26 - Comparison with LLaMA architectures
    35:35 - Limitations

    • @wolpumba4099
      @wolpumba4099 13 hours ago +1

      Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)
      * 0:00 Introduction: Introduces the Byte Latent Transformer (BLT), a novel architecture that replaces traditional tokenization with dynamically sized "patches." Claims improved scaling behavior compared to token-based LLMs.
      * 0:16 Dynamic Patching: BLT uses patches as the fundamental unit of computation, dynamically adjusting their size based on text complexity, offering a more efficient representation.
      * 1:04 Scaling Comparison: Presents graphs comparing BLT's scaling to LLaMA 2 and 3, showcasing BLT's superior performance at equivalent training FLOPs, using bits-per-byte as an analog to perplexity.
      * 3:28 BLT Architecture: Explains the two-tiered architecture. An inner, standard Transformer LLM operates on patch embeddings, while an outer system handles patch creation and decoding.
      * 7:50 Tokenization Explained: Briefly explains common tokenization methods like byte-pair encoding (BPE) and WordPiece, highlighting issues like large vocabulary sizes and out-of-vocabulary words.
      * 13:25 Problems with Tokenization: Discusses problems stemming from fixed vocabularies, such as difficulty handling numbers and limited chunk sizes.
      * 14:46 Patch Embeddings: Describes how patch embeddings are dynamically created from byte embeddings using a local encoder. This allows for a flexible, non-fixed vocabulary representation.
      * 20:35 Entropy-Based Grouping: Details the process of dynamically grouping bytes into patches based on the entropy of the next-byte prediction from a small, separate byte-level Transformer. High entropy triggers a new patch.
      * 28:42 Local Encoder/Decoder: Explains the function of the local encoder (bytes to patch embedding) and decoder (patch embedding to bytes), which operate more frequently than the inner LLM.
      * 29:48 Encoder Hash N-gram Embeddings: Describes how n-gram byte embeddings are hashed and incorporated into the byte embeddings to provide contextual information for the local encoder.
      * 32:44 Patch Size Advantage: Experiments show BLT achieves similar performance to LLaMA models with significantly larger patch sizes (6-8 bytes vs. 3.7-4.4 bytes).
      * 33:26 Comparison with LLaMA: BLT remains competitive with LLaMA models while demonstrating superior performance on tasks requiring character-level understanding, such as spelling inversion.
      * 35:35 Limitations: Acknowledges limitations in raw runtime performance compared to highly optimized token-based LLMs, but highlights that FLOP-matched comparisons demonstrate BLT's potential. Further optimization is needed, particularly regarding techniques like FlexAttention. Also mentions potential improvements from jointly training components such as the small patching LLM.
      I used gemini-1.5-pro on rocketrecap dot com to summarize the transcript.
      Cost (if I didn't use the free tier): $0.03
      Input tokens: 22558
      Output tokens: 601
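
For illustration, here is a minimal Python sketch of the entropy-based patching idea summarized at 20:35. The `next_byte_probs` function stands in for the small byte-level language model and the threshold is an arbitrary value; the paper also discusses other boundary criteria, so treat this as a sketch of the idea rather than the authors' exact procedure.

```python
import math

def next_byte_entropy(probs):
    # Shannon entropy (in bits) of a next-byte probability distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_patches(byte_seq, next_byte_probs, threshold=2.0):
    """Group a byte sequence into patches: whenever the small byte-level
    model is uncertain about what comes next (high entropy), close the
    current patch and start a new one."""
    patches, current = [], []
    for i in range(len(byte_seq)):
        current.append(byte_seq[i])
        entropy = next_byte_entropy(next_byte_probs(byte_seq[: i + 1]))
        if entropy > threshold:  # hard-to-predict position -> patch boundary
            patches.append(bytes(current))
            current = []
    if current:
        patches.append(bytes(current))
    return patches
```

With a real byte-level model, predictable stretches (e.g. the middle of a common word) stay inside one long patch, while hard-to-predict positions open a new patch, which is what makes the patch size adapt to text complexity.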

  • @Mordenor
    @Mordenor 36 minutes ago

    Thank you Mr Yannic for giving a thoughtful discussion on an alternative tokenisation scheme using character-wise patching.

  • @Kram1032
    @Kram1032 21 hours ago +14

    Surely this could be iterated to allow even larger patches (call them roughly "sentence level" and then "paragraph level" or so), right?
    If it were possible to dynamically scale up to entire paragraphs or pages, we'd quite quickly cover entire books, possibly even with fairly short attention widths. Like, if your average patch covers roughly one page and you have a max context length (at that level) of like 1024, most books ever written will comfortably fit inside it.
    All while, in principle, still having access to individual characters as needed.
    As for ASCII, surely this can also work like BPE-style encodings that can handle arbitrary UTF-8 Unicode?

    • @fuxtube
      @fuxtube 20 hours ago

      No, it couldn't. Byte patches then get encoded into fixed-dimensionality latents for the main LLM, so you couldn't compress larger and larger chunks of information into them in a "lossless" manner.
      The technique from the paper improves dictionary handling and tokenization, but you can't trick information theory with it.

    • @kellymoses8566
      @kellymoses8566 17 hours ago +6

      On Hacker News someone had the same idea, and one of the authors said that with more than two levels of patches it gets too hard to figure out how to allocate training compute time.

    • @lem0nhead84
      @lem0nhead84 13 hours ago

      @@Kram1032 This surely doesn't scale. Can you imagine an LLM that you feed 1 page of text and, in 1 iteration, it spits out a whole new page? That would be impossible to train.

    • @Kram1032
      @Kram1032 11 hours ago

      @@lem0nhead84 It's not technically 1 iteration. There would then be several loops, right? The increasingly nested transformers would have different jobs. Effectively, the ~sentence- and ~paragraph-level transformers would just keep around the longer-scale state and tell that to the ~word-level transformer, and the increasingly larger-scale transformers would be more expensive but would also get run more rarely, right? Like, the ~paragraph-level transformer might only run once a second or so. If you get one that can generate an entire page "in one step", it might only run every few seconds. The underlying smaller-scale transformers would each run much more often, though.
      Like, I'm making no claims about this being faster. A single step at the scale of the largest transformer may take a long time. But for shorter texts, that largest transformer wouldn't even necessarily be invoked a single time, because the EOT appears before that scale is relevant. So if we counted iterations, what would that be? Fractional iterations?

    • @Kram1032
      @Kram1032 11 hours ago

      @@kellymoses8566 too hard as of right now, or too hard, fundamentally?

  • @c.2518
    @c.2518 22 hours ago +13

    For some reason your videos are not showing up often for me

    • @yeezythabest
      @yeezythabest 20 hours ago

      That's why you should systematically hit like, to train the algorithm.

  • @helmutwollmersdorfer7314
    @helmutwollmersdorfer7314 6 hours ago +1

    Ok, their method is back to the roots, i.e. "reinventing the wheel".
    Letter Successor Variety (LSV) was introduced by Harris (1955, 1967).
    Hafer and Weiss (1974) named these LSV and Letter Predecessor Variety (LPV) and introduced
    Letter Successor's Entropy (LSE). These and improved methods are established in "conventional" (un)supervised text segmentation.
    If the variable-length (e.g. 2...8) n-grams are stored in a trie, then indexing them via a hash is obvious.
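
For illustration, a minimal Python sketch of hash-based n-gram embedding lookup in the spirit of the encoder hash n-gram embeddings discussed at 29:48. The table size, embedding dimension, hash function, and the 2..8 n-gram range here are assumptions for the sketch, not the paper's exact choices.

```python
import numpy as np

N_BUCKETS = 1 << 16  # assumed hash-table size per n-gram order
DIM = 32             # assumed embedding dimension

# One embedding table per n-gram length n = 2..8.
tables = {n: np.random.randn(N_BUCKETS, DIM).astype(np.float32)
          for n in range(2, 9)}

def fnv1a(ngram: bytes) -> int:
    # 64-bit FNV-1a hash; any fixed hash into the bucket range would do.
    h = 0xcbf29ce484222325
    for b in ngram:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h % N_BUCKETS

def hashed_ngram_embeddings(seq: bytes) -> np.ndarray:
    """For each byte position, sum the hashed embeddings of all
    n-grams (n = 2..8) ending at that position; these sums can then
    be combined with the plain byte embeddings."""
    out = np.zeros((len(seq), DIM), dtype=np.float32)
    for i in range(len(seq)):
        for n in range(2, 9):
            if i + 1 >= n:
                out[i] += tables[n][fnv1a(seq[i + 1 - n : i + 1])]
    return out

# Example: contextual embeddings for the bytes of a short string.
emb = hashed_ngram_embeddings("patching".encode("utf-8"))
```

The point of hashing is the same as the trie observation above: you never materialize the open-ended n-gram vocabulary, you just map each observed n-gram into a fixed-size table.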

  • @acasualviewer5861
    @acasualviewer5861 14 hours ago

    Good explanation of the paper. I saw another explanation and didn't understand a thing. You broke it down nicely.

  • @mshonle
    @mshonle 21 hours ago +1

    Seems like you could have your own data (e.g., corporation-specific documents) and, instead of fine-tuning an LLM to work better with your data, you could use NER, learn/compute the patch for each entity, and use this additional NER pre-processing to work directly with these specific terms. For example, the name of the CEO could be mapped to a patch that effectively means “gen X tech bro billionaire from California with blah blah blah.” You'd probably need to inject some extra context into the prompt to map in the most salient points about each custom entity. This could give you a form of learning that sits in the space between fine-tuning and ICL.

  • @PeterKornezos
    @PeterKornezos 3 hours ago

    Very nice paper. I have been thinking of this idea for a while now. I have two things to ask. Wouldn't it be beneficial if, instead of n-grams, we used some sort of convolution? And would it be a good idea to have a second layer of patching? I am asking because patching sort of makes words (but not exactly), and patching the patches would sort of make sentences, which to my mind makes sense: there are many ways to say the same thing, so the second patching should capture what we want to say and the first how we want to say it.

  • @Quazgaa
    @Quazgaa 16 hours ago +2

    Yannic to the rescue! I was honestly way more excited about this than o3 🤷

  • @JTMoustache
    @JTMoustache 22 hours ago +2

    Supposedly, once trained, the outer encoder/decoder could then be used as an interface to any inner-loop LLM. No need to retrain it, no?

    • @farrael004
      @farrael004 19 hours ago +1

      Not if they are trained end-to-end with the Latent Transformer, like he said. In that case, you would need to train either future Latent Transformers with the pre-trained outer loop, or a different outer loop with the same Latent Transformer. You won't be able to mix and match two separate models that were trained differently.

  • @braphog21
    @braphog21 3 hours ago

    With this more modular approach I wonder if the local encoder/decoder could be replaced to "increase" the performance of the inner transformer (by eliciting preferred behaviour).

  • @braphog21
    @braphog21 3 hours ago

    I wonder if this could be extended to other modalities. You could start off with a classifier to determine the modality of the input data (text, image, audio, etc.) then use a different encoder for each modality, then feed that into a "unifying" encoder which then feeds "patches" into the latent transformer (doing the reverse to decode).

  • @QuadraticPerplexity
    @QuadraticPerplexity 20 hours ago +2

    Not to be confused with Byte Latent Tomatoes, obviously.

  • @braphog21
    @braphog21 3 hours ago

    I wonder if this could be extended so that instead of encoding/decoding "words" (groups of tokens) it would encode/decode groups of words - either by adding another encode/decode step to group the groups of tokens or as a single unit.

  • @st33lbird
    @st33lbird 22 hours ago +1

    Do you think this addresses the classical "chunking" problem?

  • @yannickpezeu3419
    @yannickpezeu3419 13 hours ago

    Thanks a lot!
    Wouldn't it be possible to do several layers of tokenization/encoding? With 4 to 5 such layers, the central LLM would produce the next idea instead of the next token.

  • @gsusduke
    @gsusduke 17 hours ago

    This feels like a more complicated Perceiver. I guess the striding makes the cross-attention layer a little less expensive, but the procedure used to determine the strides is complicated and kinda hacky.

  • @draken5379
    @draken5379 15 hours ago +1

    First thing that pops into my mind with your example: 'ER' is common, so let's just make 'er' a token.
    That will hinder the model's ability to learn the relations between the tokens 'e' and 'er'.
    I feel like tokens not being single chars is something that was done as an 'easy fix' to save compute, but at this point it's 100% hindering the models.
    It's the reason every model, in its base form, is so bad at counting words, etc.: it pretty much has to do chain-of-thought reasoning in order to count words, because it's been hindered by the token setup so much that it needs to work insanely hard to do something any human can do without even thinking.
    Hell, if you ask any good LLM about this topic, it will say that training on non-char-level tokens WILL hinder the model in many aspects that could even compound.

  • @AndrewRafas
    @AndrewRafas 19 hours ago

    I watched it even though I had already read the paper, and I liked it. However, the video is very quiet relative to the advertisements, so making it 50% louder would be nice. Thanks!

  • @HUEHUEUHEPony
    @HUEHUEUHEPony 14 hours ago

    We can't tokenize, otherwise how can we count how many r's strawberry has?

  • @thivuxhale
    @thivuxhale 16 hours ago

    What do you think are the implications of this paper?

  • @twobob
    @twobob 8 hours ago

    NGL Yannic, this doesn't feel like a step TOWARD LLM transparency, amiright?

    • @twobob
      @twobob 7 hours ago

      Good explanation. Thanks

  • @_ARCATEC_
    @_ARCATEC_ 52 minutes ago

    Thank you.

  • @jondo7680
    @jondo7680 18 hours ago +1

    You aren't sure if you understood.
    I am sure that I didn't.
    We are not the same 👔

  • @florianvahl5494
    @florianvahl5494 9 hours ago

    Small large language model :D so a language model

  • @johnkost2514
    @johnkost2514 1 hour ago

    So N-grams ..

  • @mike___-fi5kp
    @mike___-fi5kp 23 hours ago

    1 mins!

  • @rm-ra8317
    @rm-ra8317 23 hours ago

    2 mins

  • @siddharth-gandhi
    @siddharth-gandhi 22 hours ago

    seems...hacky

    • @farrael004
      @farrael004 19 hours ago +1

      Welcome to machine learning!

    • @bright_minary6537
      @bright_minary6537 19 hours ago +3

      Less hacky than tokens I guess?
