Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

MLOps.community

18K views


Published: Dec 28, 2024

Comments • 20

@iandanforth 1 year ago ⁺⁸
There seems to be a mistake in the cost estimate at 21:53. It uses the price for the A10 but the throughput of the H100. I believe the actual cost estimate would be $48, not $15.
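The kind of mix-up described above is easy to reproduce. Here is a minimal sketch of the cost arithmetic, assuming cost is simply the hourly GPU price divided by the tokens served per hour. The two GPUs, their prices, and their throughputs are hypothetical placeholders, not the figures from the talk.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost of serving one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical placeholder numbers, chosen only to illustrate the arithmetic.
SMALL_GPU = {"price_per_hour": 1.0, "tokens_per_second": 500.0}
LARGE_GPU = {"price_per_hour": 4.0, "tokens_per_second": 2000.0}

# Consistent estimate: price and throughput come from the same GPU.
consistent = cost_per_million_tokens(
    LARGE_GPU["price_per_hour"], LARGE_GPU["tokens_per_second"]
)

# Inconsistent estimate: the cheap GPU's price with the fast GPU's throughput.
mixed_up = cost_per_million_tokens(
    SMALL_GPU["price_per_hour"], LARGE_GPU["tokens_per_second"]
)

print(f"consistent: ${consistent:.4f} per million tokens")
print(f"mixed up:   ${mixed_up:.4f} per million tokens")
print(f"the mixed-up estimate is {consistent / mixed_up:.1f}x too low")
```

With these placeholders the mixed-up estimate is too low by exactly the ratio of the two hourly prices, which is the shape of the error the comment describes.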
@eduardoalvarez7152 11 months ago ⁺¹⁰
The math around 6:50 for the A100 batch size isn't working out. It would be great if the values used to calculate the 400 batch size were provided.
Based on the equations provided for compute time and model load time, the point of intersection is FLOPS/(2*MemoryBand), not the (2*FLOPS)/MemoryBand shown in the video.
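The two expressions can be compared with a few lines of arithmetic. The sketch below assumes the napkin formulas the comments refer to (compute time 2*P*B/FLOPS, weight load time P/MemoryBand) and an A100 with 312 TFLOP/s of fp16/bf16 peak compute and 1,555 GB/s of memory bandwidth. The bandwidth figure is an assumption (the 40 GB part), since the comment does not say which values the talk used.

```python
A100_PEAK_FLOPS = 312e12        # fp16/bf16 dense peak, FLOP/s
A100_MEM_BANDWIDTH = 1.555e12   # assumption: A100 40 GB, bytes/s


def compute_time(n_params: float, batch: float, flops: float = A100_PEAK_FLOPS) -> float:
    """Napkin compute time for one decode step: 2 FLOPs per weight per sequence."""
    return 2.0 * n_params * batch / flops


def load_time(n_params: float, bandwidth: float = A100_MEM_BANDWIDTH) -> float:
    """Napkin time to stream the weights once, in the P / MemoryBand form."""
    return n_params / bandwidth


def crossover_batch(flops: float = A100_PEAK_FLOPS,
                    bandwidth: float = A100_MEM_BANDWIDTH) -> float:
    """Batch size where compute_time == load_time: FLOPS / (2 * MemoryBand)."""
    return flops / (2.0 * bandwidth)


derived = crossover_batch()
video_formula = 2.0 * A100_PEAK_FLOPS / A100_MEM_BANDWIDTH

print(f"FLOPS / (2 * MemoryBand) = {derived:.0f}")
print(f"(2 * FLOPS) / MemoryBand = {video_formula:.0f}")
```

Under these assumptions the derived crossover is about 100, while the expression from the video gives about 401. That is close to the 400 quoted in the talk, which suggests (but does not prove) that the slide used that expression with similar hardware numbers.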
@TheAIEpiphany 7 months ago ⁺¹
I believe it was just a piece of napkin math: in reality he didn't account for the KV cache at all in the P / mem bandwidth line, and the KV cache grows with sequence length. That seems like the biggest approximation error I see here.
For the second line he ignored attention FLOPs and used just the MLP FLOPs (the error of this approximation increases as the sequence grows and depends on the model size you're using; e.g. for a 7B model with a big sequence length, that term might actually be important).
Additionally, peak FLOPS is a function of the data type and the operation you're executing. He's assuming bf16/fp16, which is what Mistral 7B uses, and that gives you ~312 TFLOP/s for the A100.
All in all, this is useful if you understand exactly the assumptions he's making.
@Venkat2811 7 months ago
@TheAIEpiphany Yes, I was looking for KV cache as well. Your explanation makes sense.
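To get a feel for how much the ignored attention term matters, here is a rough sketch that compares attention FLOPs with the 2 * P weight FLOPs per generated token. It assumes Mistral 7B's published shape (32 layers, model width 4096), a rounded parameter count of 7e9, and a standard estimate of 4 * layers * seq_len * d_model FLOPs for the QK^T and attention-times-V products. These are napkin assumptions, not numbers from the talk.

```python
N_LAYERS = 32      # Mistral 7B
D_MODEL = 4096     # Mistral 7B model width (query heads * head dim)
N_PARAMS = 7.0e9   # assumption: rounded parameter count


def weight_flops_per_token(n_params: float = N_PARAMS) -> float:
    """One multiply and one add per weight, the 2 * P napkin estimate."""
    return 2.0 * n_params


def attention_flops_per_token(seq_len: int,
                              n_layers: int = N_LAYERS,
                              d_model: int = D_MODEL) -> float:
    """FLOPs for QK^T plus attention-weighted V over seq_len cached tokens.

    Each of the two products costs about 2 * seq_len * d_model per layer.
    """
    return 4.0 * n_layers * seq_len * d_model


def attention_share(seq_len: int) -> float:
    """Fraction of per-token decode FLOPs spent in the attention products."""
    attn = attention_flops_per_token(seq_len)
    return attn / (attn + weight_flops_per_token())


for seq_len in (512, 4096, 32768):
    print(f"seq_len={seq_len:>6}: attention share = {attention_share(seq_len):.1%}")
```

Under these assumptions the attention share is small for short contexts and reaches roughly 13% at 4,096 tokens, so the approximation is reasonable for short sequences and gets worse as the context grows.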
@evermorecurious91 1 year ago ⁺³
This is gold!!!
@mndflctzn 1 year ago ⁺²
This is awesome. Thanks for sharing, super useful.
@janilbolswong1953 1 year ago ⁺²
@5:40 Why do we need to load the entire model every time? Can't we just load it once? If so, we would need less memory movement, and the intersection would shift left.
@attention42 1 year ago ⁺⁴
I guess "memory movement" means movement from GPU memory (HBM) to the GPU's compute units.
Model parameters are stored in GPU memory, not in the compute units, so they have to be moved from HBM to the compute units on every forward pass.
@fraternitas5117 7 months ago
Yes, it needs to be loaded in the GPU all the time. Advanced users optimize their applications by sending as many bytes as the memory maximum allows, to make full use of the memory in each clock cycle.
@frank96997 10 months ago
Great talk! Is there a link to the slides for this talk?
@iogbole 3 months ago
The right continuous profiling solution can help you find B* (7:23) with much less effort. 18:23 is where the power of low-level tracing with eBPF comes in; otherwise, the performance overhead is simply too high.
@boussouarsari4482 10 months ago
It's possible that I'm misunderstanding, but given our use of a significantly large key-value cache (2GB multiplied by the batch size), can we still assert that the memory bandwidth is solely influenced by the model's weights?
@yaxiongzhao6640 5 months ago
The KV cache's size follows directly from the attention layers' size, which in turn is proportional to the total number of model weights.
So the model weights still proportionally determine the KV cache size, hence the statement.
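The question above can be put in numbers. This is a minimal sketch, assuming Mistral 7B's published shape (32 layers, 8 key/value heads, head dimension 128) in fp16/bf16 and a rounded 7e9 parameters. The 16,384-token sequence length is an assumption chosen because it reproduces the "2GB" figure from the question; it is not confirmed as the value used in the talk.

```python
N_LAYERS = 32          # Mistral 7B
N_KV_HEADS = 8         # Mistral 7B (grouped-query attention)
HEAD_DIM = 128         # Mistral 7B
BYTES_PER_VALUE = 2    # fp16 / bf16
WEIGHT_BYTES = 7.0e9 * BYTES_PER_VALUE  # assumption: rounded parameter count


def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
    """Bytes of cached keys and values for batch_size sequences of seq_len tokens."""
    # Factor 2 covers keys plus values.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len * batch_size


def kv_to_weight_traffic(seq_len: int, batch_size: int) -> float:
    """KV cache bytes read per decode step, relative to the weight bytes."""
    return kv_cache_bytes(seq_len, batch_size) / WEIGHT_BYTES


SEQ_LEN = 16_384  # assumption: chosen to reproduce the 2 GB per-sequence figure

print(f"KV cache per sequence: {kv_cache_bytes(SEQ_LEN) / 1024**3:.1f} GiB")
for batch in (1, 8, 32):
    ratio = kv_to_weight_traffic(SEQ_LEN, batch)
    print(f"batch={batch:>2}: KV traffic is {ratio:.2f}x the weight traffic")
```

Under these assumptions the per-sequence cache is fixed by the model shape, but the total KV traffic scales with batch size and sequence length while the weight traffic does not. With full-length sequences it overtakes the weights at single-digit batch sizes, so the weights-only memory line is best read as a short-context approximation.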
@marvelfancollection3690 1 month ago
Guys, do you have a SQL AI inference model? I've been checking around but can't seem to find any.
@windmaple 1 year ago
Great talk!
@MLOps 8 months ago
Join us at our first in-person conference on June 25 all about AI Quality: www.aiqualityconference.com/
@Gerald-iz7mv 9 months ago
Hi, what benchmark did he run to generate the plots? Any open-source GitHub links?
@aneeinaec 6 months ago
Is that Ryan Gosling ❤
@AbdulK-kr2jv 8 months ago
What a horrible unethical response on the ethics of training data
