why is llama-3-8B 8 billion parameters instead of 7?

  • Published: 20 Apr 2024
  • llama-3 has ditched its old tokenizer and instead uses the same tokenizer as gpt-4 (tiktoken, created by openai); it even shares the same first 100K-token vocabulary.
    In this video Chris walks through why Meta switched tokenizer and the implications for model size, the embeddings layer and multi-lingual tokenization.
    He also runs his tokenizer benchmark and shows how the new tokenizer is more efficient in languages such as Japanese. (A rough sketch of the size arithmetic and the token-count comparison follows after the repo links.)
    repos
    ------
    github.com/chrishayuk/embeddings
    github.com/chrishayuk/tokeniz...
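
A rough, unofficial sketch of the two points above: back-of-the-envelope arithmetic for how the larger vocabulary inflates the embeddings layer, and a quick token-count comparison using tiktoken's cl100k_base encoding. The hidden size and vocabulary counts are assumptions taken from the published Llama 2 7B / Llama 3 8B configs, and the sample sentences are arbitrary.

```python
# Rough sketch: why the vocabulary jump shows up in the parameter count,
# plus a quick multilingual token-count check with tiktoken.
# Vocab/hidden sizes below are assumptions based on the published model configs.
import tiktoken

HIDDEN = 4096            # hidden size used by both Llama 2 7B and Llama 3 8B
LLAMA2_VOCAB = 32_000    # Llama 2 SentencePiece vocabulary
LLAMA3_VOCAB = 128_256   # Llama 3 tiktoken-style vocabulary (100K base + extras)

def embedding_params(vocab: int, hidden: int) -> int:
    # input embedding matrix + untied output (lm_head) projection
    return 2 * vocab * hidden

print(f"llama-2 7B embeddings: {embedding_params(LLAMA2_VOCAB, HIDDEN) / 1e9:.2f}B params")
print(f"llama-3 8B embeddings: {embedding_params(LLAMA3_VOCAB, HIDDEN) / 1e9:.2f}B params")

# Token-count comparison on the gpt-4 style vocabulary that llama-3 reuses.
enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
}
for lang, text in samples.items():
    print(f"{lang}: {len(enc.encode(text))} tokens")
```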
  • Science

Comments • 15

  • @charbakcg
    A month ago +1

    Excellent demonstration Chris, thanks for sharing!

    • @chrishayuk
      21 days ago

      thank you, glad it was useful

  • @goodtothinkwith
    A month ago +1

    Great stuff.. no-nonsense presentation style, clear and technical, as it should be 😅.. question: is there a reason why it’s not better to have common English syllables in the vocabulary? I understand “lov” being there, but I can’t imagine that “el” is a very useful token as part of “Lovelace”.. intuitively, I would think it should simply be tokenized as “love” and “lace”

    • @chrishayuk
      21 days ago

      tbh... the general trend is to go towards complete words where possible; the more you split the tokens, the harder it is for the LLM
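
As a quick aside, a minimal sketch that probes how a name like "Lovelace" actually gets split by tiktoken's cl100k_base encoding; the exact split is whatever the vocabulary contains and may not match the pieces discussed above.

```python
# Minimal sketch: inspect how cl100k_base (the gpt-4 / llama-3 style vocabulary)
# splits a proper noun into subword tokens. The printed split is whatever the
# vocabulary happens to contain; it may differ from the pieces in the comments.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["Lovelace", " Lovelace", "love", "lace"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:12} -> {len(ids)} token(s): {pieces}")
```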

  • @rluijk
    A month ago +1

    ok, that is all very concrete! Awesome, thanks for this. It seems like a lot of quick wins that are easy to discover, or is that just hindsight because you explain it so clearly? Anyway, it's all a bit new to me. Perhaps, let's say, Norway would be wise to run this with their own tokeniser? Or is that too simplistic thinking?

    • @chrishayuk
      21 days ago

      Glad it was helpful! You're spot on: if someone was building a Norwegian LLM, it'd make a lot of sense to have a Norwegian-focused tokenizer
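
For anyone curious what a language-focused tokenizer could look like in practice, here is a minimal, hypothetical sketch using the Hugging Face `tokenizers` library; the corpus file name, vocabulary size and special tokens are placeholder assumptions, not anything from the video.

```python
# Hypothetical sketch: train a small BPE tokenizer on a Norwegian corpus.
# "norwegian_corpus.txt", the 32K vocab size and the special tokens are
# placeholder assumptions for illustration only.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[BOS]", "[EOS]"])
tokenizer.train(["norwegian_corpus.txt"], trainer)   # train on the placeholder corpus
tokenizer.save("norwegian_bpe.json")

# Quick check: how many tokens does a Norwegian sentence need?
print(tokenizer.encode("Jeg liker å gå på tur i fjellet.").tokens)
```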

  • @aaravsethi6070
    A month ago +2

    I'm super excited to see the `llama.cpp`, `llama2.c`, etc. category be implemented for llama3!

  • @leeme179
    A month ago +1

    great video, thank you

    • @chrishayuk
      A month ago

      Thank you, glad it was useful

  • @leeme179
    A month ago +1

    What are your thoughts on including spaces in the tokenizer? I tried it once and the LLM was optimising to predict spaces, as those are easy wins for the LLM, but I like the way tiktoken keeps the space without making the space a token on its own....

    • @chrishayuk
      A month ago

      I’m okay with it; if you watch my visualizing embeddings layer video you’ll see that words with spaces and words without spaces are so closely correlated on the initial embeddings layer that it’s basically a non-issue. The cost, however, is the size of the vocabulary and therefore the embeddings layer size. It does make the model much more efficient not to handle spaces separately, so having words with leading spaces as their own tokens makes much more sense
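
A small sketch of the behaviour being described, assuming the cl100k_base encoding that llama-3 reuses: for common words, the space-prefixed form tends to be a single vocabulary entry rather than a standalone space token followed by the word.

```python
# Minimal sketch: probe how cl100k_base handles leading spaces. For common
# words the space-prefixed form is usually its own vocabulary entry rather
# than a separate space token plus the word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["hello", " hello", "hello world", " hello world"]:
    ids = enc.encode(text)
    print(f"{text!r:16} -> {ids} ({len(ids)} token(s))")
```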

  • @rogerc7960
    A month ago +1

    Why is there some PyTorch in there? Do finetuned or merged versions need it?

    • @chrishayuk
      21 days ago

      I was using it for some of the loading... it's not necessary for these demos