Low Level Technicals of LLMs: Daniel Han

AI Engineer

29K views

Published: Sep 7, 2024
This workshop will be split into 3x one hour blocks:
How to analyze & fix LLMs - how to find and fix bugs in Gemma, Phi-3, Llama & tokenizers
Finetuning with Unsloth - continued pretraining, reward modelling, QLoRA & more
Deep dive into LLM technicals - hand deriving derivatives, SOTA finetuning tricks
It's recommended that you have Python with PyTorch and Unsloth installed (or use online Google Colab / Kaggle). College-level maths and programming would be helpful.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at www.ai.enginee... & join us at the AI Engineer World's Fair in 2025! Get your tickets today at ai.engineer/2025
About Daniel
Hey I'm Daniel, the algos guy behind Unsloth. I love making LLM training go fast! We're the guys who fixed 8 of Google's Gemma bugs, a 2048 SWA Phi-3 issue, found tokenization issues and fixed untrained tokens with Llama-3, and I run Unsloth with my brother Michael!
Our open source package makes finetuning of LLMs 2x faster and uses 70% less VRAM with no accuracy degradation. I used to work at NVIDIA making GPU algos go fast and helped NASA engineers process data from a Mars rover faster!

Comments • 43

@danielhanchen 1 month ago ⁺⁷²
Thank you for inviting me! Let me know what you guys would like me to talk about next time! 😊
Also if you guys want me to clarify something, I'll try my best to reply! 🙏
@kumesh2785 1 month ago ⁺¹
Great talk! Can you share insights on how to read others’ codebases effectively, now that you have achieved such a great pace? Thanks!
@aaryanbhagat4852 1 month ago
Your talk was amazing however there is a small doubt, I do not quite understand the phrase "give model more freedom to train", does it imply?
Adding more latent variables to increase the hypothesis space?
Normalizing weights so that gradients do not explode or go to 0?
Both of the above things or something else entirely?
@sammcj2000 28 days ago
I loved how excited you get when diving deep into these issues - it’s great to see passion.
@sammcj2000 28 days ago
Keen to see unsloth work (locally) with multiple GPUs, this really needs to happen to unlock training on our home multi-GPU servers.
@thegovenor6166 27 days ago
Your contributions to the llm field and localllama community have been amazing, thanks so much for your work!
@Troll3rHD 1 month ago ⁺⁵¹
this guy is a war machine
@danielhanchen 1 month ago ⁺³
Oop thanks!
@lemonsqueeezey 1 month ago ⁺²²
Watching so far and like how engaging the workshop is! We need a workshop for Triton or CUDA entirely.
@danielhanchen 1 month ago ⁺⁷
Could definitely be an interesting topic! 👀
@user-xk3tj5cj8p 1 month ago ⁺²
@danielhanchen yes Triton please 🎉🎉
@oguzhanyldrm962 1 month ago ⁺³⁵
🎯 Key points for quick navigation:
00:35 *📊 Overview of Workshop and Introductions*
- Introduction to low-level technical analysis of language models by Daniel Han.
- Discussing the purpose of finding and fixing bugs in various language model implementations.
01:56 *🐛 Finding Bugs in Gemma and Other Models*
- Explanation of initial bug discoveries in Gemma, including issues with approximate vs exact calculations.
- Highlighting the complexity and variation in different model implementations.
04:03 *🧠 Analyzing Architecture and Quirks of Large Models*
- Discussing architectural quirks in large models like Nvidia's 340 billion parameter model.
- Exploration of non-standard implementations such as squared activations.
05:16 *🧩 Challenges in Tokenization*
- Addressing tokenization challenges and discrepancies across models.
- Example issues with token variants causing different results across implementations.
06:25 *📊 Broadening Discussion Beyond Language Models*
- Introduction to broader technical knowledge including SVD, PCA, and machine learning fundamentals.
- Encouragement for exploration and understanding of foundational algorithms.
10:09 *💻 Unsloth and Optimization Techniques*
- Introduction to Unsloth, optimizing fine-tuning of language models for efficiency.
- Discussion on Triton kernels and CUDA programming for GPU optimization.
13:05 *⚙️ Understanding Sparsity in GPUs*
- Explanation of sparsity feature in GPUs and its impact on training speed.
- Clarifying benefits and challenges of enabling sparsity in language model training.
15:09 *📈 Learning Rate Schedules and Model Training*
- Discussion on the impact of learning rate schedules and epochs on model training.
- Evaluation of different methodologies and their influence on model performance.
18:11 *🏆 Impact of Bug Fixes on Model Performance*
- Audience anecdote on the unexpected performance improvement post-bug fix in Gemma.
- Speculation on the multifaceted nature of bug fixes and their cumulative effect.
20:00 *🛠️ Overview of GPU memory management and efficiency in model training*
- Understanding how offloading GPU memory to system RAM can affect execution speed.
- Importance of correct memory offloading to avoid performance degradation.
21:13 *🧠 Introduction to Transformer architecture and its applications*
- Explanation of the Transformer's role in language models like GPT-4, GPT-3, and others.
- Versatility of Transformers beyond language modeling for sequence modeling tasks.
32:34 *📊 Tokenization strategies: from simple to industry standard*
- Creation and shortcomings of a basic tokenization method with combined punctuation.
- Issues identified such as vocabulary inflation and lack of normalization.
43:17 *🧠 Understanding Sequence Modeling*
- Sequence modeling involves predicting subsequent words iteratively.
- Never use future data in training machine learning models.
44:16 *🎲 Importance of Tokenization in Language Models*
- Tokenization requires each component to have the same number of numerical tokens.
- The number of combinations in token assignment can be infinite in theory.
46:06 *🛠️ Initialization and Training Considerations*
- Random initialization of model parameters can lead to issues like exploding gradients.
- Proper initialization is crucial to prevent training instability.
47:34 *📊 Structure of Training Data for Language Models*
- Training data for language models consists of sequences of tokenized text.
- Each sequence of tokens can be represented as a table of numerical embeddings.
49:15 *🔄 Training Mechanism and Transformer Architecture*
- Language models predict the next word in a sequence using shifted token predictions.
- The Transformer architecture includes attention and MLP layers for prediction refinement.
50:13 *🌐 Components of Language Models*
- Language models consist of prediction and MLP (Multi-Layer Perceptron) components.
- The attention mechanism in Transformers enhances sequence modeling capabilities.
51:08 *🤔 Exploring Multi-Token Prediction in Transformers*
- Transformers can predict multiple tokens at once by adjusting training objectives.
- Multi-token prediction can expedite inference time in language models.
52:51 *🔑 Tokenization and Embedding Process*
- Tokenizers convert text tokens into numerical IDs for embedding lookup.
- Embedding dimensions determine the vector representation's complexity for each token.
54:24 *🚀 Enhancing Training Efficiency with Padding and Tokenization*
- Padding tokens to a specific length can optimize GPU caching and training speed.
- Tokenizers with padded vocabularies enhance data processing efficiency.
56:06 *⚠️ Handling Tokenization Errors and Untrained Tokens*
- Tokenization errors can occur when using untrained tokens in fine-tuning.
- Setting untrained tokens to mean embeddings mitigates model training issues.
59:25 *🌐 Complexity Reduction in Language Model Training*
- Language models utilize attention mechanisms to reduce computational complexity.
- Masked attention allows language models to skip predicting future tokens.
01:00:49 *🧩 Mechanisms of Attention and Masking in Transformers*
- Attention mechanisms in Transformers utilize masking to skip irrelevant token interactions.
- Softmax normalization in attention mechanisms ensures probabilistic token predictions.
01:06:26 *🧮 Softmax and Layer Norms*
- Layer Norms normalize inputs across features, stabilizing training and improving model performance.
01:09:12 *📊 Backpropagation Challenges*
- Differentiating Layer Norms during backpropagation involves complex matrix operations.
- Triton's implementation complexities lie in managing gradients effectively for Layer Norms.
01:13:24 *🧩 Positional Encodings: RoPE Embeddings*
- RoPE embeddings enhance Transformer accuracy by encoding positional information dynamically.
- Absolute positional encodings are simpler but less effective than dynamic methods like RoPE embeddings.
01:21:01 *🔄 Derivatives and MLP in Transformers*
- Deriving gradients for RoPE embeddings involves specialized matrix operations like rotation matrices.
- MLP components in Transformers mix signals to enhance model expressiveness and learning flexibility.
01:27:54 *🧠 Understanding Matrix Operations in LLMs*
- Weight matrices like W_up, W_gate, and W_down are crucial in the MLP block.
- These matrices are trained to enhance model capacity and projection efficiency.
01:29:42 *📊 Managing Derivatives and Mathematical Formulas*
- Deriving formulas manually for complex functions like softmax derivatives is challenging and time-consuming.
- Tools like Desmos aid in visualizing and verifying mathematical derivations.
01:32:06 *🛠️ Enhancing Stability and Performance with Chunking*
- Chunking techniques optimize GPU memory usage for large vocabulary sizes in models like Llama.
- Techniques such as subtracting the maximum value in softmax enhance stability during training.
01:35:17 *🔍 Exploring Implementation Details of Llama Architecture*
- Detailed examination of key components like Layer Norms (LNorm) and rotary embeddings in Llama models.
- Insight into specific code segments for Layer Norm kernels and architectural optimizations.
01:50:03 *🧠 Low-level technical details of LLMs:*
- Understanding the architecture of LLMs involves multiple layers and operations, culminating in generating logits for token prediction.
- Upcasting to float32 from float16 enhances training stability by preventing NaNs due to large exponentials in softmax calculations.
01:57:40 *🔍 Analyzing Gemma bugs:*
- Detailed exploration of bugs in Gemma models reveals issues such as missing BOS tokens and typographical errors in papers.
02:01:22 *🔄 Decisions in model implementation:*
- Choosing between different model fixes (like the blue versus black line) involves balancing multiple errors and aligning with original implementations.
02:12:52 *🧮 Floating Point Formats and Performance Comparison*
- Overview of floating point formats (float16, float32) and their transistor requirements.
02:15:13 *🚀 Future of GPU Precision: Float16 and Beyond*
- Discussion on the potential future of GPU precision beyond float16.
02:22:12 *🔍 Analyzing Precision Issues in Machine Learning Models*
- Issues and considerations when implementing different precision formats in machine learning models.
02:28:11 *🛠️ Debugging Challenges in Precision Implementation*
- Challenges and methodologies for debugging precision-related issues in ML frameworks.
02:34:16 *🐍 Analyzing Implementation Differences*
- Comparing the Hugging Face, PyTorch, and JAX implementations.
02:36:26 *🐞 Issues with Sliding Window Implementation*
- Discussing the sliding window bug in LLMs, specifically with a token limit of 2047 instead of 2048.
02:40:25 *🛠️ Tokenization Challenges and Solutions*
- Addressing challenges in tokenizer configurations and functionality.
@tahirdm1170 1 month ago ⁺¹
Thank you for this😮 :)
@Phani-ix9sq 1 month ago ⁺¹
just looking for the timestamps guy in the comments😅😂
@kahvefincanim234 1 month ago
How did you make this summary?
@oguzhanyldrm962 1 month ago ⁺¹
@kahvefincanim234 I used a browser extension called Harpa AI
@sakuragi9570 28 days ago ⁺¹
Man, this guy is a beast. He compressed the knowledge and explained it abundantly clearly
@Jay-wx6jt 1 month ago ⁺¹⁴
This guy is a gem. Keep it up
@danielhanchen 1 month ago ⁺²
:)
@edd36 1 month ago ⁺¹
This is golden. Thank you so much for doing this workshop and for AI Engineer to create this awesome AI fair
@matawis 1 month ago ⁺⁵
I love the enthusiasm: it's infectious 😂😂
👍
@UshnishSengupta 29 days ago ⁺¹
International icon! 😁 Thanks for sharing your knowledge and the great work y'all are doing at Unsloth! If you're up for it, would love a tutorial on kernel optimizations with Triton and how to make model training and inference go brrrrr.
@666WolfWere 1 month ago ⁺⁵
Layer norm helps to avoid vanishing or exploding gradients. Before that, it was almost impossible to train a deep network.
@danielhanchen 1 month ago ⁺²
Oh yes! Vanishing and exploding gradients! I remember people first said batch norm was used to reduce "internal covariate shift", but I subscribe more to the smoother and easier optimization reasoning for layernorms
@milandean 1 month ago ⁺¹³
God it was painful to hear that one guy continuously ask questions and take 10 minutes to have his question actually make sense lol
@TheWebPotato 1 month ago
You're a beast dude! Respect!
@taptnsovereigns4024 1 month ago ⁺³
Clutch
@realisticlevel2553 1 month ago ⁺⁴
the goat
@jimshtepa5423 13 days ago
so what was the reason for using softmax? what would be otherwise had softmax not been used?
@JaazFelicio 28 days ago ⁺¹
The guy knows every random question lol
@Dom-zy1qy 23 days ago
The thing with SVD is that you need to understand everything in Linear Algebra to really "get" it. I guess math in general is sort of like this, but in the context of SVD, this is especially true. I remember when I finally understood it, it felt like the most epic moment ever.
But yeah, I think Linear Algebra should be mandatory for all CS degrees. My college doesn't require it, just calc and discrete math. Which is a shame, since linear algebra has been the most applicable of all the maths I've studied (as a programmer). Calc comes in at a really close second tho, the derivative is kinda OP.
@manncodes 1 month ago ⁺¹⁹
that one guy on the right is annoying!!
@hope42 1 day ago
I need to record this translated to text, feed Claude, requesting translation.
@reynoldoramas3138 27 days ago
Why can't this video be found on the channel? I got the link from LinkedIn.
@evanrsl 1 month ago ⁺³
that one guy trying to be smarter than the speaker
@Phani-ix9sq 1 month ago ⁺¹
where is the timestamps guy in the comment section
@judejin3066 17 days ago
this guy should be a comedian
@ShaunShibu-oz8yn 1 month ago ⁺²
How can you ever know if these LLM providers don’t intentionally mix leaderboard data in smart ways to game the ranking.
@danielhanchen 1 month ago ⁺¹
Yep a huge problem - I normally trust the Hard Prompts section in the Chat LMSYS Leaderboard, and just rely on Redditors liking or disliking models - sadly we don't know for sure if models include the outputs of the Chatbot Arena dataset - some models at least explicitly state they train on the inputs / instructions of conversations.
@Isomorphist 1 month ago
"I actually used to major in Maths and Computer Science" YOU DONT SAY
@wayne8863 14 days ago
This guy doesn't understand what is quantization and also transistor number does not play a role in efficiency necessarily. The main issue is GPU sm design and ALU units😅
