Videos: 56
Views: 88,909

vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024

In this session, we wrapped up 2024 with a comprehensive update on the vLLM project and shared exciting plans for 2025. Michael Goin, vLLM Committer, walked us through the latest updates in vLLM v0.6.5, including performant structured outputs, while Simon Mo, vLLM Maintainer, shared key insights from vLLM’s 2024 journey and the roadmap for 2025.
Highlights:
[00:00-02:45] A recap of 2024 vLLM Office Hours by the numbers
[02:46-09:03] About vLLM & Neural Magic
[09:04-15:58] What’s new in vLLM v0.6.5, including performant structured outputs
[15:59-25:59] vLLM’s 2024 milestones and achievements
[26:00-35:55] vLLM 2025 roadmap, including upcoming features and improvements
[35:56-56:03] Open discussio...

Videos

vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024

44:31

500 views · 14 days ago

In this session, we explored Machete, Neural Magic's newest innovation in mixed-input GEMM kernel design for NVIDIA Hopper GPUs. Built on top of advancements in NVIDIA CUTLASS 3.5.1, Machete is optimized for both compute and memory-bound regimes on Hopper GPUs (H100). Key features include on-the-fly upconversion of weights, latency hiding through overlapping compute and data movement, and robus...

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

48:06

604 views · 1 month ago

In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM improves performance through faster delivery and the composition of multiple KV caches. These advancements are designed to push the boundarie...

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

59:55

504 views · 1 month ago

In this session, we dive deep into the implementation of state-of-the-art (SOTA) tool-calling in vLLM. We discuss the importance of tools and functions in open-source AI and provide insights into the challenges and solutions around OpenAI-style tools in vLLM. During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from ...

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

49:38

819 views · 2 months ago

In this session of our bi-weekly vLLM office hours, we explored the exciting updates in the vLLM v0.6.3 release, featuring experimental fullgraph torch.compile, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, and NVLM-D, tool-use support for Llama 3.1, Llama 3.2, and InternLM2.5, and Reward LM support for Q...

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

1:04:28

1K views · 2 months ago

In this vLLM office hours session, we explore the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for API Server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joins us to discuss speculative decoding in vLLM. She provides insights into what speculative decoding is, its differ...

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

52:35

1.9K views · 3 months ago

In this session of Neural Magic's bi-weekly vLLM office hours, we cover the latest updates in vLLM v0.6.0 and v0.6.1, including Vision LM support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delve into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that deliver 2.7x throughput improvements and ...

vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 05, 2024

1:13:14

2.5K views · 3 months ago

In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM. During the Q&A, we tackled a variety of audience questions around hardware diversity,...

vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024

48:13

667 views · 4 months ago

In this exciting session, we were joined by Woosuk Kwon, the co-creator of vLLM, to dive deep into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline. During the Q&A, we tackled a variety of aud...

vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024

50:03

637 views · 4 months ago

In this session, we brought on Roger Wang, a vLLM Committer and Software Engineer, ML Platform at Roblox, to discuss the development of supporting transformer-based multimodal models on vLLM. Roger shared insights on effectively using vision-language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM. Additionally, we touched on the vLLM v0.5.4 release, inc...

vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

50:38

1.1K views · 5 months ago

In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar shared the why, when, and how to quantize LLMs for efficient inference. He introduced a new library called llm-compressor for optimizing LLMs for accurate inference in vLLM. Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support...

Deploy LLMs More Efficiently with vLLM and Neural Magic

33:21

1.2K views · 5 months ago

Learn why vLLM is the leading open-source inference server and how Neural Magic works with enterprises to build and scale vLLM-based model services with more efficiency and cost savings.

vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024

56:09

1.7K views · 5 months ago

In this session, we brought on vLLM Committers from Anyscale to give an in-depth dive into FP8 quantization. They discussed why FP8 is important, how to get started with FP8 in vLLM, and shared quality and performance results of FP8 quantization. We also covered the latest updates in vLLM v0.5.1, including pipeline parallelism and model support for Gemma 2, Jamba, and DeepSeek-V2. For more deta...

53:19

vLLM Office Hours - June 20, 2024

534 views · 6 months ago

vLLM and Neural Magic Office Hours - June 5, 2024

44:47

549 views · 6 months ago

6:31

Are MLOps disappearing?

361 views · 1 year ago

1:06

5x Faster YOLOv8 on CPUs

4.6K views · 1 year ago

Deploy Fast and Accurate YOLOv8 Object Detection Models on CPUs You Already Have

47:52

3.7K views · 1 year ago

Unlock Faster and More Efficient LLMs with SparseGPT

42:27

2.3K views · 1 year ago

Pruning and Quantizing ML Models With One Shot Without Retraining

52:31

2.2K views · 1 year ago

Sparse Transferring Hugging Face Models With SparseML

8:15

537 views · 1 year ago

Apply Second-Order Pruning Algorithms for SOTA Model Compression

41:42

967 views · 1 year ago

Use Sparse Transfer Learning to Create Sparse Models Fine-Tuned to Your Datasets

6:53

450 views · 1 year ago

5:02

Intro to SparseML

687 views · 1 year ago

Accelerate Image Segmentation Tasks With Sparsity and the DeepSparse Runtime

4:23

209 views · 1 year ago

Accelerate Image Classification Tasks With Sparsity and the DeepSparse Runtime

4:20

182 views · 1 year ago

Accelerate Object Detection Tasks With Sparsity and the DeepSparse Runtime

4:50

849 views · 1 year ago

7:38

Intro to DeepSparse Runtime

1.6K views · 1 year ago

Intro to Deep Learning Model Sparsification

7:08

957 views · 1 year ago

6:16

Intro to SparseZoo

348 views · 1 year ago

@mfc1190 9 days ago
Great work on this project! I’ve loved the simplicity + perf when using it.
@qiyuangong4785 10 days ago
Nice presentation! This feature may significantly reduce prefill recomputation in long contexts.
@Earthvssuna 1 month ago
And can it run out of the box, or is it difficult to run on AMD?
@Earthvssuna 1 month ago
So in the end, does vLLM run well on AMD GPUs?
@MEvansMusic 1 month ago
When trying to run the EAGLE model in the OpenAI api_server, how should you format the script? Since EAGLE doesn't use a draft model in its setup, I'm not sure how to make it work.
@불고기-u5o 1 month ago
12:12
@temka088 1 month ago
Which chat UI is being used at the end?
@michaelgoin4760 1 month ago
The frontend in the demo is custom built on Next.js and uses Vercel's AI SDK.
@dmytro7441 1 month ago
Thank you for the video. Do you expect it to work with multimodal models like llama-3.2-vision and Pixtral?
@불고기-u5o 1 month ago
7:16 Wow, this is exactly what I was looking for. The architecture guy is the best, the best, the best.
@불고기-u5o 1 month ago
45:04 Discussion of version 6
@불고기-u5o 1 month ago
7:08
@micuentadecasa 2 months ago
Hi, great video. I would like to know whether it is possible to use e5-mistral-7b-instruct in vLLM for both embeddings and completions with only one vLLM instance?
@curtwortman6995 3 months ago
Excellent progress and very informative. Thank you, Neural Magic and team, for your innovation and fantastic contributions.
@nickellas9882 3 months ago
Great session - thanks for posting!
@bigtymer4862 3 months ago
28:12 AWQ?
@hari000-f6y 3 months ago
I have a question! I'm serving a quantized multimodal model (InternVL2) on vLLM on an L4. It takes ~5-6 seconds to complete a request, so when multiple requests hit at once, it takes much longer, ~30 seconds, to complete them. How can I handle this so that multiple requests also complete in ~5 seconds? I have little understanding of batched requests and the like.
@shumshvenhiszali 4 months ago
You say the code is open source, but where?
@pseudokamp 5 months ago
Please share the source code
@pseudokamp 5 months ago
Please share the source code of quantization
@neuralmagic 5 months ago
See here: github.com/vllm-project/llm-compressor/tree/main/examples
@Reeves-k2k 4 months ago
Great Work! Thanks :)
@spaken2768 6 months ago
Can DeepSparse run on the Raspberry Pi AI Kit for even faster fps?
@muhammadwaleedather1726 7 months ago
Ma'am, my question is: if I have trained a model with plain YOLOv8 from Ultralytics and got the best.pt file as my trained model, can I directly remove the unwanted weights from it with your technique, or do I have to train the model completely through your technique to get that right?
@music_love21 8 months ago
Hi! We are creating a system that classifies tomato ripeness levels using image processing with a CNN architecture and a YOLOv8 model. We are using Raspberry Pi 4 OS with 4GB RAM and have encountered a problem: the system has a 2-3 minute delay in classifying the ripeness level. Would you happen to have any recommendations or suggestions for this problem?
@szhavel 8 months ago
The link to the Colab is no longer available.
@MahrukhAliKhan-x4c 9 months ago
Can you make a tutorial on practically pruning a GAN model like the GFPFGAN model?
@prateekpatel6082 10 months ago
Could you clarify whether these pruning strategies are post-training or training-aware? It seems like progressive sparsification is training-aware, but from what I recall the WoodFisher approach is post-training and requires some fine-tuning at the end?
@prateekpatel6082 10 months ago
In the recipes, it's shown that we do distillation to recover accuracy, followed by quantization. Curious: do you observe degradation in the quantization step? Why not distill or fine-tune post-quantization? Also, with these recipes, does training become more expensive compared to base dense models? Is there any data comparing the training cost and time?
@RobGreenberg-ri4hy 1 year ago
Wow, very insightful interview!
@hosseinsoleimani3193 1 year ago
Any C++ API for this?
@albertofernandez055 1 year ago
Hi, I am really interested in improving inference time for YOLO models in CPU. Here are some questions I have: (1) After using a SparseZoo recipe to apply to our data, using "sparseml.ultralytics.train", what is the format of the generated weights? (2) Moreover, is it possible to import the generated weights for the sparsified model in OpenCV using: cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? (3) As far as I see in all the Neural Magic repositories, all of them have Apache License 2.0. Is this correct? (4) Are there any commercial restrictions of using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? Many thanks in advance
@albertofernandez055 1 year ago
Many thanks for this video. I have several questions: (1) After "applying to our data" using a SparseZoo recipe to apply to our data, using "sparseml.ultralytics.train", what is the format of the generated weights? (2) Moreover, have you tried to import the generated weights for the sparsified model in OpenCV using: cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx') (3) And finally, I have a question with the license. As far as I see in all the Neural Magic repositories, all of them have Apache License 2.0. Is this correct? (4) So are there any commercial restrictions of using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? Many thanks!!!
@chihebnouri5541 9 months ago
Did you find a solution?
@shahid19297 1 year ago
@Neural Magic You are building amazing tools, but the demos aren't good. I don't like these PPT explanations; show us practical demos and all the different tools and methods you have. I still can't figure out how to train my YOLOv5 with a custom dataset using sparsification. I still can't figure out what DeepSparse is, or sparsifying, SparseML, SparseZoo... I just want to know whether I have to train my custom dataset with YOLOv5 using your sparsification method. I should see complete code examples for that. And what if I already have trained weights: how can we apply pruning and quantization? It's very confusing, as none of your articles explain things completely. There should be hands-on videos or articles. As a beginner, it is difficult to understand. By the way, you should try Roboflow-style demos; those are great.
@nishant.wankhade 1 year ago
Hi there, what are the GPU hardware specifications needed to run a YOLOv8 model? I have an Nvidia GT 730, and while running the model it gives me "cuda: no kernel image is available for execution on the device."
@neuralmagic 1 year ago
Hello! Our software is specific to CPU infrastructure. Our runtime, DeepSparse, is engineered to take advantage of CPU memory to deliver the performance we claimed in the video.
@yamenshahla8214 1 year ago
Hello, I followed all the steps and now have a best_pruned.onnx model file. DeepSparse only lets me test on an image; how can I deploy my model on a live camera stream or a video file? Thank you
@billykotsos4642 1 year ago
29:28 This truly is a game changer!
@billykotsos4642 1 year ago
This is truly groundbreaking... you guys are doing phenomenal work...
@Gstreng 1 year ago
Amazing stuff!
@billykotsos4642 1 year ago
LLMs running on CPUs is groundbreaking. You guys are doing amazing work!
@hritikakolkar 1 year ago
Hi, do you guys have any ML intern positions?
@xeetu.7065 1 year ago
Is it possible to write a YOLOv5 object recognition application using Neural Magic on Windows?
@neuralmagic 1 year ago
Hello! Yes, you could train the model in Windows or do it all in a WSL/a VM. Join our Slack community to ask questions if you run into issues: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
@CrushCuisine 1 year ago
Has anyone, outside of the NeuralMagic team, pruned their own model with SparseML and can confirm these claims?
@madhura305 1 year ago
Hello @Neural Magic, can I follow the same steps for training the model on Windows?
@neuralmagic 1 year ago
Hello! Yes, you could train the model in Windows or do it all in a WSL/a VM. Join our Slack community to ask questions if you run into issues: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
@madhura305 1 year ago
Hey! Can you please help me out with how exactly to train the model for a custom dataset? Though I tried and followed all the instructions mentioned on GitHub, I'm not able to train it.
@neuralmagic 1 year ago
@madhura305 Hi! We see that you asked your question in our Slack community. We will help you there!
@ch1n3du3 1 year ago
This is great work. Is there any research on the ideal sparse training to dense training ratios?
@dtransposed79 1 year ago
We may publish more information on that in the follow-up to this paper soon. But what I can share right now is that you can often reduce the length of the dense phases, but not remove them completely.
@ch1n3du3 1 year ago
@dtransposed79 thanks for the response
@andrewowens5653 1 year ago
What about LLMs and/or Stable Diffusion models? Can your techniques be used with GPTs?
@neuralmagic 1 year ago
We are actively working on optimizing generative models. Here is our latest research that shows you can sparsify LLMs 50% with one-shot: neuralmagic.com/blog/sparsegpt-remove-100-billion-parameters-for-free/ In about 30 days, we are holding a webinar where we'll discuss the SparseGPT method, how you can apply it to your models, and how you can run sparse generative models in DeepSparse super fast. We will post the registration link here by the end of the week: neuralmagic.com/neural-magic-events/
@RobGreenberg-ri4hy 1 year ago
Very well done!
@dtransposed79 1 year ago
Absolutely. Great job!
@flymousechiu 2 years ago
Congrats on making it to EMNLP 2022! Also, for those who can't take pineapple pizzas, you have no idea what you are missing out on.
@modeltrainer1246 2 years ago
We need YOLOv7 DeepSparse. Is that too much to ask for?
@neuralmagic 2 years ago
Dense YOLOv7 runs in the DeepSparse Engine! We are seeing speedups out of the box. YOLOv7-tiny seems quite accurate with YOLOv5s speeds. We are working on sparsifying and quantizing YOLOv7 for way better performance. We see promising results from sparsity, even more so than YOLOv5. Stay tuned - the best place is our Slack community as we post our model and engine updates there.
@dislike__button 2 years ago
When are big NLP models like GPT-2, M2M100, etc. coming? 😞😭
@neuralmagic 2 years ago
Hello! We are working on sparsifying large language models to make them more usable in production. We will keep the community informed of our efforts. Join us in Slack to stay in the loop: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
@kishoreg8835 2 years ago
WHEN IS YOLOV7 SPARSE COMING?
@neuralmagic 2 years ago
Soon! Join us in the DeepSparse Community to hear exactly when: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
@MeanGeneHacks 2 years ago
Definitely excited for YOLOv7 Sparse! @Neural Magic
@kishoreg8835 2 years ago
@neuralmagic It's been a month... how long is soon?

Neural Magic

Comments