Only true OGs know this was originally named "2024 12 24 15 43 12"
yep
Really? How can I find it
It still is
@@johndoe6011 he changed it again to the original title lol
Just wanted to say that we are very lucky to have this content for free.
Computer science is evolving at an amazing pace. Impossible to keep track… thank you so much for this video!
00:00 - Introduction and abstract of the article
01:02 - Plots for comparing scaling properties of BLT vs. LLaMA 2 and LLaMA 3
03:28 - Architecture of Byte Latent Transformer
07:50 - Explains tokenization; byte-pair encoding
13:25 - Problems with tokenization
14:46 - Patch embeddings; dynamic tokenization
20:35 - Entropy-based grouping of bytes into patches
28:42 - Local encoder and local decoder
29:48 - Encoder hash n-gram embeddings
32:44 - BLT-specific hyperparameters: patch sizes
33:26 - Comparison with LLaMA architectures
35:35 - Limitations
*Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)*
* *0:00** Introduction:* Introduces the Byte Latent Transformer (BLT), a novel architecture that replaces traditional tokenization with dynamically sized "patches." Claims improved scaling behavior compared to token-based LLMs.
* *0:16** Dynamic Patching:* BLT uses patches as the fundamental unit of computation, dynamically adjusting their size based on text complexity, offering a more efficient representation.
* *1:04** Scaling Comparison:* Presents graphs comparing BLT's scaling to LLaMA 2 and 3, showcasing BLT's superior performance at equivalent training FLOPs, using bits-per-byte as an analog to perplexity.
* *3:28** BLT Architecture:* Explains the two-tiered architecture. An inner, standard Transformer LLM operates on patch embeddings, while an outer system handles patch creation and decoding.
* *7:50** Tokenization Explained:* Briefly explains common tokenization methods like byte-pair encoding (BPE) and word piece, highlighting issues like large vocabulary sizes and out-of-vocabulary words.
* *13:25** Problems with Tokenization:* Discusses problems stemming from fixed vocabularies, such as difficulty handling numbers and limited chunk sizes.
* *14:46** Patch Embeddings:* Describes how patch embeddings are dynamically created from byte embeddings using a local encoder. This allows for flexible, non-fixed vocabulary representation.
* *20:35** Entropy-Based Grouping:* Details the process of dynamically grouping bytes into patches based on the entropy of the next byte prediction from a small, separate byte-level Transformer. High entropy triggers a new patch.
* *28:42** Local Encoder/Decoder:* Explains the function of the local encoder (bytes to patch embedding) and decoder (patch embedding to bytes), which operate more frequently than the inner LLM.
* *29:48** Encoder Hash N-gram Embeddings:* Describes how n-gram byte embeddings are hashed and incorporated into the byte embeddings to provide contextual information for the local encoder.
* *32:44** Patch Size Advantage:* Experiments show BLT achieves similar performance to LLaMA models with significantly larger patch sizes (6-8 bytes vs. 3.7-4.4 bytes).
* *33:26** Comparison with LLaMA:* BLT remains competitive with LLaMA models while demonstrating superior performance in tasks requiring character-level understanding, such as spelling inversion.
* *35:35** Limitations:* Acknowledges limitations in raw runtime performance compared to highly optimized token-based LLMs, but highlights that FLOP-matched comparisons demonstrate BLT's potential. Further optimization is needed, particularly regarding techniques like Flex attention. Also mentions potential improvements in jointly training components like the small patching LLM.
I used gemini-1.5-pro on rocketrecap dot com to summarize the transcript.
Cost (if I didn't use the free tier): $0.03
Input tokens: 22558
Output tokens: 601
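The entropy-based grouping described at 20:35 can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: a bigram count table stands in for the paper's small byte-level transformer, and the function names and the 1.5-bit threshold are made up for the example.

```python
import math
from collections import defaultdict

def next_byte_entropy(counts):
    """Shannon entropy (in bits) of a next-byte count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_patches(data, model_counts, threshold=1.5):
    """Split `data` into patches, starting a new patch whenever the model's
    next-byte entropy exceeds `threshold`. BLT uses a small byte-level
    transformer for this; a bigram count table is a cheap stand-in here."""
    if not data:
        return []
    patches, current = [], [data[0]]
    for i in range(1, len(data)):
        ctx = bytes(current[-1:])
        dist = model_counts.get(ctx) or {b: 1 for b in range(256)}  # unseen context: uniform
        if next_byte_entropy(dist) > threshold:
            patches.append(bytes(current))
            current = []
        current.append(data[i])
    patches.append(bytes(current))
    return patches

# Fit bigram counts on a toy corpus, then patch that same corpus.
corpus = b"the cat sat on the mat and the rat sat on the cat"
model_counts = defaultdict(dict)
for a, b in zip(corpus, corpus[1:]):
    d = model_counts[bytes([a])]
    d[b] = d.get(b, 0) + 1
patches = entropy_patches(corpus, model_counts)
assert b"".join(patches) == corpus  # patching is lossless by construction
```

High-entropy positions (here, mostly after spaces, where the next word is hard to predict) close the current patch, so predictable byte runs get grouped together while uncertain regions start fresh patches.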
Thank you Mr Yannic for giving a thoughtful discussion of an alternative tokenisation scheme using characterwise patching.
Surely this could be iterated to allow even larger patches (call them roughly "sentence level" and then "paragraph level" or so), right?
If it were possible to dynamically scale up to entire paragraphs or pages, we'd quite quickly cover entire books, possibly even with fairly short attention widths. Like, if your average patch covers roughly one page and you have a max context length (at that level) of like 1024, most books ever written would comfortably fit inside it.
All while, in principle, still having access to individual characters as needed.
As for ASCII, surely this can work for BPE-style encodings that can handle arbitrary UTF-8 Unicode too?
No, it couldn't. The byte patches get encoded into fixed-dimensionality latents for the main LLM, so you couldn't compress larger and larger chunks of information into them in a "lossless" manner.
The technique from the paper improves dictionary handling and tokenization, but you can't trick information theory with it.
On Hacker News someone had the same idea, and one of the authors said that with more than two levels of patches it gets too hard to figure out how to allocate training compute.
@@Kram1032 this surely doesn't scale. Can you imagine an LLM that you feed 1 page of text and, in 1 iteration, it spits out a whole new page? That would be impossible to train.
@@lem0nhead84 it's not technically 1 iteration. There would then be several loops, right? The increasingly nested transformers would have different jobs. Effectively, the ~sentence- and ~paragraph- level transformers would just keep around the longer-scale state and tell that to the ~word-level transformer, and the increasingly larger-scale transformers would be more expensive but also would get run more rarely, right? Like, the ~paragraph-level transformer might only run once a second or so. If you get one that can generate an entire page "in one step", it might only run every few seconds. The underlying smaller-scale transformers would each run much more often though
Like, I'm making no claims about this being faster. A single step on the scale of the largest transformer may take a long time. But for shorter texts, that largest transformer wouldn't even necessarily be invoked a single time because the EOT appears before that scale is relevant. So if we counted iterations, what would that be? Fractional iterations?
@@kellymoses8566 too hard as of right now, or too hard, fundamentally?
For some reason your videos are not showing up often for me
That's why you systematically like to train the algorithm
Ok, their method is back to the roots, i.e. "reinventing the wheel".
Letter Successor Variety (LSV) was introduced by Harris (1955, 1967).
Hafer and Weiss (1974) named it LSV, added Letter Predecessor Variety (LPV), and introduced
Letter Successor Entropy (LSE). These and improved methods are established in "conventional" (un)supervised text segmentation.
If the variable-length (e.g. 2–8) n-grams are stored in a trie, then indexing them via a hash is obvious.
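The LSE idea this comment refers to can be sketched minimally: segment a word wherever the entropy of the next letter, taken over a lexicon of words sharing the current prefix, spikes. The lexicon, threshold, and function names below are invented for illustration; this is the classic technique in the spirit of Hafer & Weiss (1974), not the paper's code.

```python
import math

def successor_entropy(lexicon, prefix):
    """Letter Successor Entropy: entropy (bits) of the next letter over all
    lexicon words that share `prefix`."""
    counts = {}
    for word in lexicon:
        if word.startswith(prefix) and len(word) > len(prefix):
            nxt = word[len(prefix)]
            counts[nxt] = counts.get(nxt, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def segment(word, lexicon, threshold=1.0):
    """Cut `word` before positions where successor entropy exceeds `threshold`."""
    parts, start = [], 0
    for i in range(1, len(word)):
        if successor_entropy(lexicon, word[:i]) > threshold:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts

# After the stem "play" the next letter is unpredictable (i/e/s), so the cut lands there.
lexicon = {"play", "plays", "played", "player", "playing"}
parts = segment("playing", lexicon)  # ["play", "ing"]
```

The parallel to BLT's patching is direct: both place boundaries where next-symbol uncertainty jumps, except BLT estimates that uncertainty with a learned byte-level model instead of lexicon counts.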
Good explanation of the paper. I saw another explanation and didn't understand a thing. You broke it down nicely.
Seems like you could have your own data (e.g., corporation-specific documents) and instead of fine-tuning an LLM to work better with your data, you could instead use NER and learn/compute the patch for that entity and work with this additional NER pre-processing to work directly with these specific terms. For example, the name of the CEO could be mapped to a patch that effectively means “gen X tech bro billionaire from California with blah blah blah.” You’d probably need to inject some extra context to the prompt to map in the most salient points about each custom entity. This could give you a form of learning that exists between the space of fine-tuning and ICL.
Very nice paper. I have been thinking of this idea for a while now. I have two things to ask: wouldn't it be beneficial to use some sort of convolution instead of n-grams, and would it be a good idea to have a second layer of patching? I am asking because patching sort of makes words, but not exactly, and patching the patches would sort of make sentences. To my mind that makes sense, because there are many ways to say the same thing: the second patching should capture what we want to say, and the first how we want to say it.
Yannic to the rescue! I was honestly way more excited about this than o3 🤷
Supposedly, once trained, the outer encoder/decoder could then be used as an interface to any inner-loop LLM. No need to retrain it, no?
Not if they are trained end-to-end with the Latent Transformer like he said. In that case, you need to train either future Latent Transformers with the pre-trained loop, or a different outer loop with the same Latent Transformer. You won't be able to mix and match two separate models that were trained differently.
With this more modular approach I wonder if the local encoder/decoder could be replaced to "increase" the performance of the inner transformer (by eliciting preferred behaviour).
I wonder if this could be extended to other modalities. You could start off with a classifier to determine the modality of the input data (text, image, audio, etc.) then use a different encoder for each modality, then feed that into a "unifying" encoder which then feeds "patches" into the latent transformer (doing the reverse to decode).
Not to be confused with Byte Latent Tomatoes, obviously.
I wonder if this could be extended so that instead of encoding/decoding "words" (groups of tokens) it would encode/decode groups of words - either by adding another encode/decode step to group the groups of tokens or as a single unit.
Do you think this addresses the classical "chunking" problem?
Thanks a lot !
Wouldn't it be possible to do several layers of tokenization/encoding? With 4 to 5 such layers, the central LLM would produce the next idea instead of the next token.
this feels like a more complicated Perceiver. i guess the striding makes the cross attention layer a little less expensive, but the procedure used to determine the strides is complicated and kinda hacky
First thing that pops into my mind with your example: 'ER' is common, so let's just make 'er' a token.
That will hinder the model's ability to learn relations between the tokens 'e' and 'er'.
I feel like tokens not being single chars is something that was done as an 'easy fix' to save compute, but at this point it's 100% hindering the models.
It's the reason every model, in its base form, is so bad at counting words, etc.: it pretty much has to do chain-of-thought reasoning in order to count words, because it's been hindered by the token setup so much that it needs to work insanely hard to do something any human can do without even thinking.
Hell, if you ask any good LLM about this topic, it will say that training on non-char-level tokens WILL hinder the model in many aspects that could even compound.
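The "'er' is common, so make 'er' a token" step this comment objects to is exactly one merge step of byte-pair encoding (discussed at 7:50). A minimal sketch of that single step, on a toy corpus with invented function names, not any library's actual implementation:

```python
from collections import Counter

def most_common_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of the adjacent `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("her per er")
pair = most_common_pair(tokens)   # ('e', 'r') is the most frequent adjacent pair
merged = merge_pair(tokens, pair) # ['h', 'er', ' ', 'p', 'er', ' ', 'er']
```

Real BPE repeats this greedily until a target vocabulary size is reached, which is how the fixed vocabularies the comment criticizes arise; BLT sidesteps the whole loop by keeping bytes and grouping them dynamically instead.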
I watched it even though I had already read the paper, and I liked it. However, the video is very quiet relative to the advertisements, so making it 50% louder would be nice. Thanks!
We can't tokenize, otherwise how can we count how many r's strawberry has?
what do you think is the implication of this paper?
NGL Yannic, this doesn't feel like a step TOWARD LLM transparency, amiright?
Good explanation. Thanks
Thank you.
You aren't sure if you understood.
I am sure that I didn't.
We are not the same 👔
Small large language model :D so a language model
So N-grams ..
1 mins!
2 mins
seems...hacky
Welcome to machine learning!
Less hacky than tokens I guess?