LLM inference optimization: Architecture, KV cache and Flash attention

  • Published: 21 Nov 2024
  • Science

Comments • 5

  • @cliffordino
    @cliffordino A month ago +1

    Nicely done and very helpful! Thank you!! FYI, the stress is on the first syllable of "INference", not the second ("inFERence").

    • @yanaitalk
      @yanaitalk  A month ago

      Copy that! Thank you😊

  • @johndong4754
    @johndong4754 2 months ago

    I've been learning about LLMs over the past few months, but I haven't gone into too much depth. Your videos seem very detailed and technical. Which one(s) would you recommend starting off with?

    • @yanaitalk
      @yanaitalk  2 months ago

      There are excellent courses from DeepLearning.ai on Coursera. To go even deeper, I recommend reading the technical papers directly, which gives you a deeper understanding.

    • @HeywardLiu
      @HeywardLiu A month ago

      1. Roofline model
      2. Transformer arch. > bottleneck of attention > flash attention
      3. LLM inference can be divided into a prefill stage (compute-bound) and a decode stage (memory-bound); see the sketch after this comment
      4. LLM serving: paged attention, radix attention
      If you want to optimize inference performance, this review paper is awesome: LLM Inference Unveiled: Survey and Roofline Model Insights
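
To make point 3 of the comment above concrete, here is a back-of-the-envelope roofline estimate of why prefill tends to be compute-bound while decode is memory-bound. This Python sketch is illustrative only; the hidden size, prompt length, fp16 byte counts, and the ridge-point figure in the final comment are assumed values, not numbers from the video.

```python
# Illustrative roofline sketch (assumed numbers, not from the video):
# why prefill tends to be compute-bound and decode memory-bound.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# Hypothetical 7B-class projection layer: a (d x d) weight matrix in fp16.
d = 4096
bytes_per_value = 2
weight_bytes = d * d * bytes_per_value

# Prefill: one matmul over the whole prompt of s tokens (a GEMM).
s = 2048
prefill_flops = 2 * s * d * d                               # 2*m*n*k
prefill_bytes = weight_bytes + 2 * s * d * bytes_per_value  # weights + in/out activations

# Decode: each step processes a single new token with the same weights (a GEMV).
decode_flops = 2 * 1 * d * d
decode_bytes = weight_bytes + 2 * 1 * d * bytes_per_value

print(f"prefill: ~{arithmetic_intensity(prefill_flops, prefill_bytes):.0f} FLOPs/byte")
print(f"decode:  ~{arithmetic_intensity(decode_flops, decode_bytes):.2f} FLOPs/byte")

# A modern data-center GPU has a ridge point on the order of 10^2 FLOPs/byte
# (assumed figure), so prefill lands in the compute-bound region while decode
# sits far below the ridge, i.e. it is limited by memory bandwidth.
```

With these assumed numbers, prefill comes out around a thousand FLOPs per byte while decode is near one FLOP per byte, far below a typical GPU ridge point. That gap is why decode throughput is bounded by memory bandwidth and why KV caching, batching, and serving techniques such as paged attention target that stage.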