Have loved your work since the Annotated Transformer, thank you for sharing. Very clear explanation.
I definitely think attention in some form will survive even into a refined future Mamba model, due to its powerful ability to capture high-dimensional representations.
Hi, a correction: the number of parameters for the transformer is 361M, not the 261M stated in this video, as shown in the paper.
Great presentation!! Thank you!!
Nice work. I see the models on Hugging Face. Is there also a GitHub repo or notebooks to train or run inference on them?
Thank you for a great presentation, Sasha @srush_nlp. You mentioned that MambaByte is still behind token-based models. I wonder: is Mamba theoretically inferior to token-based transformers, or is it just a matter of discovering best practices and tricks?
Charformer (which is mentioned in the paper; gradient-based tokenizers sound like pog) evaluates itself on multilingual tasks. It would be interesting to see how RNNs behave there. RWKV can easily jump to the wrong language (the model card mentions that a space at the end of the prompt can "upset the tokenizer").
Also, can we go lower? MambaBit when? Just imagine: a vocab size of 2. 😱
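Just to make the bit-level idea concrete, here is a tiny sketch (plain Python; the MSB-first bit packing is only my assumption, not necessarily what an actual MambaBit would use) of how much longer sequences get with a two-symbol vocabulary:

```python
# Rough illustration: a "vocab size of 2" model sees the UTF-8 bytes of the
# prompt unpacked into individual bits, so sequences get 8x longer than
# byte-level ones (MSB-first bit order is an assumption, not MambaBit's spec).
text = "The cat can never"
data = text.encode("utf-8")

bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]

print(len(data))   # 17 positions at the byte level
print(len(bits))   # 136 positions at the bit level
print(bits[:8])    # 'T' == 0x54 -> [0, 1, 0, 1, 0, 1, 0, 0]
```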
Although it's an interesting idea, I don't understand how that would be beneficial. Humans neither understand nor produce output in bits. I think it's more reasonable for each token to be the smallest semantic/graphical unit (a sememe/grapheme), and a bit holds no semantic or graphical information.
@donnychan1999 Humans also don't produce output in bytes: there is no reason for 'кот' to take twice as many "thought units" as 'cat'.
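To put numbers on that, a quick byte-count check (nothing model-specific, just the raw UTF-8 encoding):

```python
# Raw UTF-8 byte counts behind the 'кот' vs 'cat' comparison.
for word in ("cat", "кот"):
    raw = word.encode("utf-8")
    print(f"{word}: {len(word)} characters, {len(raw)} bytes")

# cat: 3 characters, 3 bytes
# кот: 3 characters, 6 bytes  (Cyrillic letters are 2 bytes each in UTF-8)
```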
Well, meanwhile I decided to be the change I want to see in the world and published Maykeye/MambaBit on HF after torturing my laptop for 10 hours.
"The cat can never" -> "The cat can never many be my father,
Or else and the good many be my father,
In the good many lord, and my father come."
This is so cursed. Yet it's much better than I expected.
Great work and presentation! Have you also compared MambaByte to the baselines on any downstream tasks/benchmarks?
Great work!
Very impressive.