Low Level Technicals of LLMs: Daniel Han
- Published: Sep 7, 2024
- This workshop will be split into 3x one hour blocks:
How to analyze & fix LLMs - how to find and fix bugs in Gemma, Phi-3, Llama & tokenizers
Finetuning with Unsloth - continued pretraining, reward modelling, QLoRA & more
Deep dive into LLM technicals - hand deriving derivatives, SOTA finetuning tricks
It's recommended you have Python with Pytorch and Unsloth installed (or use online Google Colab / Kaggle). College level maths and programming would be helpful.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at www.ai.enginee... & join us at the AI Engineer World's Fair in 2025! Get your tickets today at ai.engineer/2025
About Daniel
Hey I'm Daniel, the algos guy behind Unsloth. I love making LLM training go fast! We're the guys who fixed 8 of Google's Gemma bugs, a 2048 SWA Phi-3 issue, found tokenization issues and fixed untrained tokens with Llama-3, and I run Unsloth with my brother Michael!
Our open source package makes finetuning of LLMs 2x faster and uses 70% less VRAM with no accuracy degradation. I used to work at NVIDIA making GPU algos go fast and helped NASA engineers process data from a Mars rover faster!
Thank you for inviting me! Let me know what you guys would like me to talk about next time! 😊
Also if you guys want me to clarify something, I'll try my best to reply! 🙏
Great talk! Can you share insights on how to read others' codebases effectively, now that you have achieved such a great pace? Thanks!
Your talk was amazing however there is a small doubt, I do not quite understand the phrase "give model more freedom to train", does it imply?
Adding more latent variables to increase the hypothesis space?
Normalizing weights so that gradients do not explode or go to 0?
Both of the above things or something else entirely?
I loved how excited you get when diving deep into these issues - it’s great to see passion.
Keen to see unsloth work (locally) with multiple GPUs, this really needs to happen to unlock training on our home multi-GPU servers.
Your contributions to the llm field and localllama community have been amazing, thanks so much for your work!
this guy is a war machine
Oop thanks!
Watching so far and like how engaging the workshop is! We need a workshop for Triton or CUDA entirely.
Could definitely be an interesting topic! 👀
@@danielhanchen yes triton please 🎉🎉
🎯 Key points for quick navigation:
00:35 *📊 Overview of Workshop and Introductions*
- Introduction to low-level technical analysis of language models by Daniel Han.
- Discussing the purpose of finding and fixing bugs in various language model implementations.
01:56 *🐛 Finding Bugs in Gemma and Other Models*
- Explanation of initial bug discoveries in Gemma, including issues with approximate vs exact calculations.
- Highlighting the complexity and variation in different model implementations.
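A minimal sketch of what "approximate vs exact" can look like in practice, assuming the point refers to the GELU activation (the summary does not name the function). It compares the erf-based GELU with the common tanh approximation; the sample inputs are arbitrary.

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Widely used tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Arbitrary sample points, chosen only for illustration.
xs = [-3.0, -1.0, -0.5, 0.5, 1.0, 3.0]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap on the samples: {max_gap:.2e}")
```

The gap per activation is small, but it is not zero, which is why two implementations that pick different variants do not produce identical outputs.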
04:03 *🧠 Analyzing Architecture and Quirks of Large Models*
- Discussing architectural quirks in large models like Nvidia's 340 billion parameter model.
- Exploration of non-standard implementations such as squared activations.
05:16 *🧩 Challenges in Tokenization*
- Addressing tokenization challenges and discrepancies across models.
- Example issues with token variants causing different results across implementations.
06:25 *📊 Broadening Discussion Beyond Language Models*
- Introduction to broader technical knowledge including SVD, PCA, and machine learning fundamentals.
- Encouragement for exploration and understanding of foundational algorithms.
10:09 *💻 Unsloth and Optimization Techniques*
- Introduction to Unsloth, which optimizes fine-tuning of language models for efficiency.
- Discussion on Triton kernels and CUDA programming for GPU optimization.
13:05 *⚙️ Understanding Sparsity in GPUs*
- Explanation of sparsity feature in GPUs and its impact on training speed.
- Clarifying benefits and challenges of enabling sparsity in language model training.
15:09 *📈 Learning Rate Schedules and Model Training*
- Discussion on the impact of learning rate schedules and epochs on model training.
- Evaluation of different methodologies and their influence on model performance.
18:11 *🏆 Impact of Bug Fixes on Model Performance*
- Audience anecdote on the unexpected performance improvement post-bug fix in Gemma.
- Speculation on the multifaceted nature of bug fixes and their cumulative effect.
20:00 *🛠️ Overview of GPU memory management and efficiency in model training*
- Understanding how offloading GPU memory to system RAM can affect execution speed.
- Importance of correct memory offloading to avoid performance degradation.
21:13 *🧠 Introduction to Transformer architecture and its applications*
- Explanation of the Transformer's role in language models like GPT-4, GPT-3, and others.
- Versatility of Transformers beyond language modeling for sequence modeling tasks.
32:34 *📊 Tokenization strategies: from simple to industry standard*
- Creation and shortcomings of a basic tokenization method with combined punctuation.
- Issues identified such as vocabulary inflation and lack of normalization.
43:17 *🧠 Understanding Sequence Modeling*
- Sequence modeling involves predicting subsequent words iteratively.
- Never use future data in training machine learning models.
44:16 *🎲 Importance of Tokenization in Language Models*
- Tokenization requires each component to have the same number of numerical tokens.
- The number of combinations in token assignment can be infinite in theory.
46:06 *🛠️ Initialization and Training Considerations*
- Random initialization of model parameters can lead to issues like exploding gradients.
- Proper initialization is crucial to prevent training instability.
47:34 *📊 Structure of Training Data for Language Models*
- Training data for language models consists of sequences of tokenized text.
- Each sequence of tokens can be represented as a table of numerical embeddings.
49:15 *🔄 Training Mechanism and Transformer Architecture*
- Language models predict the next word in a sequence using shifted token predictions.
- The Transformer architecture includes attention and MLP layers for prediction refinement.
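The "shifted token predictions" idea can be sketched as follows: the label for each position is simply the next token in the same sequence. The function name and token IDs are made up for illustration.

```python
def make_next_token_pairs(token_ids):
    # Inputs drop the last token, labels drop the first,
    # so position i is trained to predict token i + 1.
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return list(zip(inputs, labels))

# Hypothetical token IDs for a four-token sentence.
pairs = make_next_token_pairs([5, 9, 2, 7])
print(pairs)
```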
50:13 *🌐 Components of Language Models*
- Language models consist of prediction and MLP (Multi-Layer Perceptron) components.
- The attention mechanism in Transformers enhances sequence modeling capabilities.
51:08 *🤔 Exploring Multi-Token Prediction in Transformers*
- Transformers can predict multiple tokens at once by adjusting training objectives.
- Multi-token prediction can expedite inference time in language models.
52:51 *🔑 Tokenization and Embedding Process*
- Tokenizers convert text tokens into numerical IDs for embedding lookup.
- Embedding dimensions determine the vector representation's complexity for each token.
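A toy sketch of the ID lookup described above, using a three-entry vocabulary and a 4-dimensional embedding table. The vocabulary, the dimension, and the random initialization are all illustrative assumptions, not values from the talk.

```python
import random

# Hypothetical toy vocabulary mapping token text to integer IDs.
vocab = {"hello": 0, "world": 1, "!": 2}
embedding_dim = 4

# One row per token ID; small random values stand in for trained weights.
rng = random.Random(0)
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(tokens):
    # Step 1: text tokens -> integer IDs. Step 2: IDs -> rows of the table.
    ids = [vocab[t] for t in tokens]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed(["hello", "world", "!"])
print(ids, len(vectors), len(vectors[0]))
```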
54:24 *🚀 Enhancing Training Efficiency with Padding and Tokenization*
- Padding tokens to a specific length can optimize GPU caching and training speed.
- Tokenizers with padded vocabularies enhance data processing efficiency.
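One way to read the padding point is rounding the vocabulary size up to a hardware-friendly multiple. The sketch below shows only the arithmetic; the multiple of 64 and the vocabulary sizes are assumptions chosen for illustration.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    # Round n up to the nearest multiple (n is unchanged if already aligned).
    return ((n + multiple - 1) // multiple) * multiple

# Hypothetical vocabulary sizes.
print(pad_to_multiple(32000, 64))
print(pad_to_multiple(32001, 64))
```

The extra rows created by padding never appear in real text, so they stay untrained unless something is done about them.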
56:06 *⚠️ Handling Tokenization Errors and Untrained Tokens*
- Tokenization errors can occur when using untrained tokens in fine-tuning.
- Setting untrained tokens to mean embeddings mitigates model training issues.
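A rough sketch of the mean-embedding fix, assuming an untrained token shows up as an all-zero row in the embedding table (the summary does not say how untrained rows are detected, so that criterion and the `eps` threshold are assumptions).

```python
def fix_untrained_rows(embeddings, eps=1e-16):
    # Assumption: a row whose entries are all (near) zero was never trained.
    def is_untrained(row):
        return all(abs(v) <= eps for v in row)

    trained = [row for row in embeddings if not is_untrained(row)]
    if not trained:
        return [list(row) for row in embeddings]

    dim = len(embeddings[0])
    # Column-wise mean over the trained rows only.
    mean = [sum(row[j] for row in trained) / len(trained) for j in range(dim)]
    return [list(mean) if is_untrained(row) else list(row) for row in embeddings]

# Toy table: the middle row stands in for an untrained token.
table = [[1.0, 3.0], [0.0, 0.0], [3.0, 5.0]]
fixed = fix_untrained_rows(table)
print(fixed)
```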
59:25 *🌐 Complexity Reduction in Language Model Training*
- Language models utilize attention mechanisms to reduce computational complexity.
- Masked attention allows language models to skip predicting future tokens.
01:00:49 *🧩 Mechanisms of Attention and Masking in Transformers*
- Attention mechanisms in Transformers utilize masking to skip irrelevant token interactions.
- Softmax normalization in attention mechanisms ensures probabilistic token predictions.
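The masking and softmax steps can be sketched in plain Python: each query position only sees scores for itself and earlier positions, and the softmax turns the visible scores into weights that sum to 1. The score matrix here is a placeholder for the scaled query-key products.

```python
import math

def causal_attention_weights(scores):
    # scores[i][j] is the raw score of query i against key j.
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]          # mask out future keys j > i
        m = max(visible)                      # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get weight exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Placeholder scores: all zeros, so visible keys share the weight equally.
attn = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in attn:
    print(row)
```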
01:06:26 *🧮 Softmax and Layer Norms*
- Layer Norms normalize inputs across features, stabilizing training and improving model performance.
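For reference, a minimal sketch of a layer norm over one feature vector. The `eps` value and the identity scale and zero shift in the example are assumptions.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension, then scale and shift.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(out)
```

With unit scale and zero shift, the output has mean close to 0 and variance close to 1 regardless of the input's original scale.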
01:09:12 *📊 Backpropagation Challenges*
- Differentiating Layer Norms during backpropagation involves complex matrix operations.
- Triton's implementation complexities lie in managing gradients effectively for Layer Norms.
01:13:24 *🧩 Positional Encodings: RoPE Embeddings*
- RoPE embeddings enhance Transformer accuracy by encoding positional information dynamically.
- Absolute positional encodings are simpler but less effective than dynamic methods like RoPE embeddings.
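A small sketch of rotary position embeddings on a single vector, using the interleaved pairing (features 0 and 1, then 2 and 3, and so on). Some implementations pair the first half of the vector with the second half instead; the pairing and the base of 10000 are assumptions here.

```python
import math

def rope(vec, position, base=10000.0):
    # Rotate each consecutive pair of features by an angle that grows
    # with the token position and shrinks with the feature index.
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2) at two different absolute positions.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 7), rope(k, 5))
print(near, far)
```

The two dot products agree because the rotation makes the query-key score depend only on the relative offset between positions, which is the property that distinguishes this scheme from absolute encodings.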
01:21:01 *🔄 Derivatives and MLP in Transformers*
- Deriving gradients for RoPE embeddings involves specialized matrix operations like rotation matrices.
- MLP components in Transformers mix signals to enhance model expressiveness and learning flexibility.
01:27:54 *🧠 Understanding Matrix Operations in LLMs*
- Weight matrices like W_up, W_gate, and W_down are the core of the MLP block.
- These matrices are trained to enhance model capacity and projection efficiency.
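A sketch of how W_gate, W_up, and W_down are typically combined in a Llama-style gated MLP, assuming the SiLU activation. The matrix shapes and values are toy placeholders.

```python
import math

def silu(v):
    # SiLU (swish): v * sigmoid(v).
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_mlp(x, W_gate, W_up, W_down):
    gate = [silu(v) for v in matvec(W_gate, x)]   # project up, then activate
    up = matvec(W_up, x)                          # second up-projection
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)                 # project back down

# Toy shapes: model dim 2, hidden dim 3.
x = [1.0, 2.0]
W_gate = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.5]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
y = gated_mlp(x, W_gate, W_up, W_down)
print(y)
```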
01:29:42 *📊 Managing Derivatives and Mathematical Formulas*
- Deriving formulas manually for complex functions like softmax derivatives is challenging and time-consuming.
- Tools like Desmos aid in visualizing and verifying mathematical derivations.
01:32:06 *🛠️ Enhancing Stability and Performance with Chunking*
- Chunking techniques optimize GPU memory usage for large vocabulary sizes in models like Llama.
- Techniques such as subtracting the maximum value in softmax enhance stability during training.
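Both ideas can be illustrated together with a log-sum-exp, the normalizer inside softmax and cross-entropy: subtract the maximum before exponentiating, and process the logits in chunks whose partial results are combined at the end. This is a sketch of the principle only, not the kernel used in the talk; the chunk size and logits are arbitrary.

```python
import math

def chunked_logsumexp(logits, chunk_size):
    # Stable log-sum-exp per chunk: subtract the chunk max before exp.
    partials = []
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        m = max(chunk)
        partials.append(m + math.log(sum(math.exp(v - m) for v in chunk)))
    # Combine the per-chunk results with the same trick.
    m = max(partials)
    return m + math.log(sum(math.exp(p - m) for p in partials))

# Logits this large would overflow a naive math.exp(v).
logits = [1000.0, 1001.0, 999.0, 1002.0]
in_chunks = chunked_logsumexp(logits, chunk_size=2)
in_one_go = chunked_logsumexp(logits, chunk_size=len(logits))
print(in_chunks, in_one_go)
```

Only one chunk of exponentials needs to exist at a time, which is what keeps memory bounded when the vocabulary is large.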
01:35:17 *🔍 Exploring Implementation Details of Llama Architecture*
- Detailed examination of key components like Layer Norms (LNorm) and rotary embeddings in Llama models.
- Insight into specific code segments for Layer Norm kernels and architectural optimizations.
01:50:03 *🧠 Low-level technical details of LLMs:*
- Understanding the architecture of LLMs involves multiple layers and operations, culminating in generating logits for token prediction.
- Upcasting to float32 from float16 enhances training stability by preventing NaNs due to large exponentials in softmax calculations.
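A quick back-of-the-envelope check of why the exponential is the dangerous step: the largest finite float16 value is 65504, so `exp(x)` already overflows for fairly small `x`, while float32 has far more headroom. The helper name is made up for this sketch.

```python
import math

FLOAT16_MAX = 65504.0                  # largest finite IEEE half-precision value
FLOAT32_MAX = 3.4028234663852886e38    # largest finite IEEE single-precision value

def largest_safe_exponent(max_value):
    # Largest integer x for which exp(x) still fits below max_value.
    return math.floor(math.log(max_value))

fp16_limit = largest_safe_exponent(FLOAT16_MAX)
fp32_limit = largest_safe_exponent(FLOAT32_MAX)
print(fp16_limit, fp32_limit)
```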
01:57:40 *🔍 Analyzing Gemma bugs:*
- Detailed exploration of bugs in Gemma models reveals issues such as missing BOS tokens and typographical errors in papers.
02:01:22 *🔄 Decisions in model implementation:*
- Choosing between different model fixes (like the blue versus black line) involves balancing multiple errors and aligning with original implementations.
02:12:52 *🧮 Floating Point Formats and Performance Comparison*
- Overview of floating point formats (float16, float32) and their transistor requirements.
02:15:13 *🚀 Future of GPU Precision: Float16 and Beyond*
- Discussion on the potential future of GPU precision beyond float16.
02:22:12 *🔍 Analyzing Precision Issues in Machine Learning Models*
- Issues and considerations when implementing different precision formats in machine learning models.
02:28:11 *🛠️ Debugging Challenges in Precision Implementation*
- Challenges and methodologies for debugging precision-related issues in ML frameworks.
02:34:16 *🐍 Analyzing Implementation Differences*
- Comparing the Hugging Face, PyTorch, and JAX implementations.
02:36:26 *🐞 Issues with Sliding Window Implementation*
- Discussing the sliding window bug in LLMs, specifically with a token limit of 2047 instead of 2048.
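To show how an off-by-one in the window size changes what a token can attend to, here is a sketch of a causal sliding-window mask. The convention used (the window counts the current token) is an assumption, and tiny sizes stand in for 2047 and 2048.

```python
def sliding_window_mask(seq_len, window):
    # mask[i][j] is True when query i may attend to key j:
    # causal (j <= i) and within the last `window` positions, self included.
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Small stand-ins for the 2047 vs 2048 case.
narrow = sliding_window_mask(seq_len=5, window=2)
wide = sliding_window_mask(seq_len=5, window=3)
print(sum(narrow[-1]), sum(wide[-1]))
```

With the smaller window the last token sees one key fewer, so a model trained with one value and run with the other gets slightly different attention outputs on long sequences.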
02:40:25 *🛠️ Tokenization Challenges and Solutions*
- Addressing challenges in tokenizer configurations and functionality.
Thank you for this😮 :)
just looking for the timestamps guy in the comments😅😂
How did you make this summary?
@@kahvefincanim234 I used a web extension called Harpa AI
Man, this guy is a beast. He compressed the knowledge and explained it with abundant clarity
This guy is a gem. Keep it up
:)
This is golden. Thank you so much for doing this workshop and for AI Engineer to create this awesome AI fair
I love the enthusiasm: It's infectious 😂😂
👍
International icon! 😁 Thanks for sharing your knowledge and the great work y'all are doing at Unsloth! If you're up for it, would love a tutorial on kernel optimizations with Triton and how to make model training and inference go brrrrr.
Layer norm helps to avoid vanishing or exploding gradients. Before that, it was almost impossible to train a deep network.
Oh yes! Vanishing and exploding gradients! I remember people first said batch norm was used to reduce "internal covariate shift", but I more ascribe to the smoother and easier optimization reasoning for layernorms
God it was painful to hear that one guy continuously ask questions and take 10 minutes to have his question actually make sense lol
You're a beast dude! Respect!
Clutch
the goat
so what was the reason for using softmax? what would be otherwise had softmax not been used?
The guy knows every random question lol
The thing with SVD is that you need to understand everything in Linear Algebra to really "get" it. I guess math in general is sort of like this, but in the context of SVD, this is especially true. I remember when I finally understood it, it felt like the most epic moment ever.
But yeah, I think Linear Algebra should be mandatory for all CS degrees. My college doesn't require it, just calc and discrete math. Which is a shame, since linear algebra has been the most applicable of all the maths I've studied (as a programmer). Calc comes in at a really close second tho, the derivative is kinda OP.
that one guy on the right is annoying!!
I need to record this translated to text, feed Claude, requesting translation.
Why can't this video be found on the channel? I got the link from LinkedIn.
that one guy trying to be smarter than the speaker
where is the timestamps guy in the comment section
this guy should be a comedian
How can you ever know if these LLM providers don’t intentionally mix leaderboard data in smart ways to game the ranking.
Yep a huge problem - I normally trust the Hard Prompts section in the Chat LMSYS Leaderboard, and just rely on Redditors liking or disliking models - sadly we don't know for sure if models include the outputs of the Chatbot Arena dataset - some models at least explicitly state they train on the inputs / instructions of conversations.
"I actually used to major in Maths and Computer Science" YOU DONT SAY
This guy doesn't understand what is quantization and also transistor number does not play a role in efficiency necessarily. The main issue is GPU sm design and ALU units😅