Neural Magic
  • 56 videos
  • 88,909 views
vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024
In this session, we wrapped up 2024 with a comprehensive update on the vLLM project and shared exciting plans for 2025. Michael Goin, vLLM Committer, walked us through the latest updates in vLLM v0.6.5, including performant structured outputs, while Simon Mo, vLLM Maintainer, shared key insights from vLLM’s 2024 journey and the roadmap for 2025.
Highlights:
[00:00-02:45] A recap of 2024 vLLM Office Hours by the numbers
[02:46-09:03] About vLLM & Neural Magic
[09:04-15:58] What’s new in vLLM v0.6.5, including performant structured outputs
[15:59-25:59] vLLM’s 2024 milestones and achievements
[26:00-35:55] vLLM 2025 roadmap, including upcoming features and improvements
[35:56-56:03] Open discussio...
256 views
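For readers who want to try the structured-outputs feature mentioned above, here is a minimal client-side sketch. It assumes a vLLM OpenAI-compatible server is already running locally (e.g. via `vllm serve`); `guided_json` is vLLM's documented extension parameter for guided decoding, while the model name and schema are illustrative.

```python
# Minimal sketch: request JSON-schema-constrained output from a locally
# running vLLM OpenAI-compatible server. The schema and model name are
# illustrative; `guided_json` is vLLM's extension parameter for guided decoding.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
}

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Describe the vLLM project as JSON."}],
    extra_body={"guided_json": schema},  # constrain decoding to the schema
)
print(completion.choices[0].message.content)
```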

Videos

vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024
500 views • 14 days ago
In this session, we explored Machete, Neural Magic's newest innovation in mixed-input GEMM kernel design for NVIDIA Hopper GPUs. Built on top of advancements in NVIDIA CUTLASS 3.5.1, Machete is optimized for both compute and memory-bound regimes on Hopper GPUs (H100). Key features include on-the-fly upconversion of weights, latency hiding through overlapping compute and data movement, and robus...
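As a rough illustration of how a user would exercise the Machete kernel, here is a hedged sketch: loading a w4a16 (4-bit weight, 16-bit activation) checkpoint in vLLM on an H100, where recent versions select Machete automatically for this scheme. The checkpoint name is illustrative.

```python
# Hedged sketch: serve a w4a16 checkpoint with vLLM on a Hopper GPU (H100),
# where the Machete kernel is selected automatically for this weight scheme.
# The checkpoint name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16")
outputs = llm.generate(["What is a mixed-input GEMM?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```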
vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024
604 views • 1 month ago
In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM improves performance through faster delivery and the composition of multiple KV caches. These advancements are designed to push the boundarie...
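To make the idea concrete, here is a toy sketch (not vLLM's actual API; all names are hypothetical): prefill and decode run on separate workers, and the prompt's KV cache is handed off instead of being recomputed.

```python
# Toy illustration of disaggregated prefill (not vLLM's real API): one worker
# computes the prompt's KV cache, a second worker consumes it for decoding,
# so the two phases can be scaled and scheduled independently.
from queue import Queue

kv_channel: Queue = Queue()  # stand-in for a real KV-cache transfer backend

def prefill_worker(prompt: str) -> None:
    # A real prefill worker would run the model over the prompt and emit
    # per-layer key/value tensors; a dict stands in for them here.
    kv_channel.put({"prompt": prompt, "kv": f"<kv for {len(prompt)} chars>"})

def decode_worker() -> str:
    kv_cache = kv_channel.get()  # reuse KV instead of recomputing the prefill
    return f"decoding '{kv_cache['prompt']}' using {kv_cache['kv']}"

prefill_worker("Explain disaggregated prefill.")
print(decode_worker())
```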
vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024
504 views • 1 month ago
In this session, we dive deep into the implementation of state-of-the-art (SOTA) tool-calling in vLLM. We discuss the importance of tools and functions in open-source AI and provide insights into the challenges and solutions around OpenAI-style tools in vLLM. During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from ...
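A hedged client-side sketch of the OpenAI-style tool calling discussed here, assuming a vLLM server launched with tool support enabled (at the time, flags along the lines of `--enable-auto-tool-choice --tool-call-parser hermes` for Hermes-family models); the tool definition and model name are illustrative.

```python
# Hedged sketch: OpenAI-style tool calling against a vLLM server started with
# tool support enabled. The `get_weather` tool and model name are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, implemented by the caller
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="NousResearch/Hermes-2-Pro-Llama-3-8B",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call the tool
)
print(response.choices[0].message.tool_calls)
```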
vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024
819 views • 2 months ago
In this session of our bi-weekly vLLM office hours, we explored the exciting updates in the vLLM v0.6.3 release, featuring experimental full-graph torch.compile, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, and NVLM-D, tool-use support for Llama 3.1/3.2 and InternLM2.5, and Reward LM support for Q...
vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024
1K views • 2 months ago
In this vLLM office hours session, we explore the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for API Server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joins us to discuss speculative decoding in vLLM. She provides insights into what speculative decoding is, its differ...
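A hedged sketch of speculative decoding in vLLM's offline API as it looked around v0.6.x: a small draft model proposes several tokens per step and the target model verifies them in one pass, preserving the target model's output distribution. Model names and the token count are illustrative.

```python
# Hedged sketch (vLLM ~v0.6.x offline API): speculative decoding with a small
# draft model. Model names and num_speculative_tokens are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",              # target model
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # draft model
    num_speculative_tokens=5,                              # proposals per step
)
outputs = llm.generate(["Why does speculative decoding preserve quality?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```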
vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024
1.9K views • 3 months ago
In this session of Neural Magic's bi-weekly vLLM office hours, we cover the latest updates in vLLM v0.6.0 and v0.6.1, including Vision LM support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delve into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that deliver 2.7x throughput improvements and ...
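A hedged sketch of the kind of engine-level tuning covered in sessions like this one; the flag names are vLLM engine arguments from that era, but the values are illustrative starting points, not recommendations.

```python
# Hedged sketch: performance-oriented vLLM engine arguments. Values are
# illustrative starting points; the right settings depend on model and GPU.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,    # interleave long prefills with decodes
    max_num_batched_tokens=8192,    # per-step token budget for batching
    gpu_memory_utilization=0.90,    # VRAM fraction for weights + KV cache
)
```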
vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 05, 2024
2.5K views • 3 months ago
In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM. During the Q&A, we tackled a variety of audience questions around hardware diversity,...
vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024
667 views • 4 months ago
In this exciting session, we were joined by Woosuk Kwon, the co-creator of vLLM, to dive deep into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline. During the Q&A, we tackled a variety of aud...
vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024
637 views • 4 months ago
In this session, we brought on Roger Wang, a vLLM Committer and Software Engineer, ML Platform at Roblox, to discuss the development of supporting transformer-based multimodal models on vLLM. Roger shared insights on effectively using vision-language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM. Additionally, we touched on the vLLM v0.5.4 release, inc...
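A hedged sketch of offline vision-language inference with vLLM from around that release; the prompt template is model-specific, and the model name and image URL are illustrative.

```python
# Hedged sketch: offline vision-language inference in vLLM. The model,
# image URL, and LLaVA-style prompt template are illustrative.
import requests
from PIL import Image
from vllm import LLM, SamplingParams

url = ("https://upload.wikimedia.org/wikipedia/commons/d/dd/"
       "Gfp-wisconsin-madison-the-nature-boardwalk.jpg")
image = Image.open(requests.get(url, stream=True).raw)

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
outputs = llm.generate(
    {
        "prompt": "USER: <image>\nWhat is in this picture? ASSISTANT:",
        "multi_modal_data": {"image": image},  # attach the PIL image
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```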
vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024
1.1K views • 5 months ago
In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar shared the why, when, and how to quantize LLMs for efficient inference. He introduced a new library called llm-compressor for optimizing LLMs for accurate inference in vLLM. Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support...
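A hedged sketch based on the examples in the llm-compressor repository (github.com/vllm-project/llm-compressor); exact argument names may vary by version, and the model, dataset, and output directory are illustrative.

```python
# Hedged sketch, following the llm-compressor examples: one-shot W8A8
# quantization of a small model with GPTQ. Arguments may differ by version.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model
    dataset="open_platypus",                     # calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-Chat-W8A8",       # vLLM can load this directly
)
```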
Deploy LLMs More Efficiently with vLLM and Neural Magic
1.2K views • 5 months ago
Learn why vLLM is the leading open-source inference server and how Neural Magic works with enterprises to build and scale vLLM-based model services with more efficiency and cost savings.
vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024
1.7K views • 5 months ago
In this session, we brought on vLLM Committers from Anyscale to give an in-depth dive into FP8 quantization. They discussed why FP8 is important, how to get started with FP8 in vLLM, and shared quality and performance results of FP8 quantization. We also covered the latest updates in vLLM v0.5.1, including pipeline parallelism and model support for Gemma 2, Jamba, and DeepSeek-V2. For more deta...
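For the simplest path, vLLM also supports casting weights to FP8 at load time on FP8-capable hardware (e.g. H100); a minimal hedged sketch, with an illustrative model name:

```python
# Hedged sketch: dynamic FP8 quantization at load time (weights cast to FP8
# on the fly). Requires FP8-capable hardware such as an H100.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```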
vLLM Office Hours - June 20, 2024
534 views • 6 months ago
vLLM and Neural Magic Office Hours - June 5, 2024
549 views • 6 months ago
Are MLOps disappearing?
361 views • 1 year ago
5x Faster YOLOv8 on CPUs
4.6K views • 1 year ago
Deploy Fast and Accurate YOLOv8 Object Detection Models on CPUs You Already Have
3.7K views • 1 year ago
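For these YOLOv8-on-CPU videos, a hedged sketch of the DeepSparse pipeline API; the task name follows DeepSparse's YOLOv8 integration of the time, and the model stub and image path are illustrative.

```python
# Hedged sketch: run a sparsified YOLOv8 model on CPU with DeepSparse.
# The SparseZoo stub and image path are illustrative; a local ONNX export
# of a sparsified model also works as model_path.
from deepsparse import Pipeline

yolo = Pipeline.create(
    task="yolov8",  # task name per DeepSparse's YOLOv8 integration
    model_path="zoo:yolov8-s-coco-pruned50_quantized",  # illustrative stub
)
results = yolo(images=["street_scene.jpg"])
print(results.boxes, results.scores, results.labels)
```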
Unlock Faster and More Efficient LLMs with SparseGPT
2.3K views • 1 year ago
Pruning and Quantizing ML Models With One Shot Without Retraining
2.2K views • 1 year ago
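The actual methods behind this talk are more sophisticated, but as a simple stand-in, here is a minimal plain-PyTorch sketch of what "one-shot, no retraining" means mechanically: zero the smallest-magnitude weights in a single pass.

```python
# Minimal stand-in for one-shot pruning (NOT Neural Magic's actual algorithm):
# zero out the smallest-magnitude weights of each Linear layer in one pass,
# with no retraining afterwards.
import torch
import torch.nn as nn

def magnitude_prune_(model: nn.Module, sparsity: float = 0.5) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = max(1, int(w.numel() * sparsity))      # weights to zero
            threshold = w.abs().flatten().kthvalue(k).values
            w[w.abs() <= threshold] = 0.0              # one shot, no fine-tune

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
magnitude_prune_(model, sparsity=0.5)
for m in model.modules():
    if isinstance(m, nn.Linear):
        print((m.weight == 0).float().mean().item())   # ~0.5 per layer
```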
Sparse Transferring Hugging Face Models With SparseML
537 views • 1 year ago
Apply Second-Order Pruning Algorithms for SOTA Model Compression
967 views • 1 year ago
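For intuition about what "second-order" means here, a toy sketch of the Optimal-Brain-Surgeon-style saliency score that methods such as WoodFisher approximate at scale: rank weights by w_i^2 / (2 [H^-1]_ii), the predicted loss increase from removing weight i. A diagonal Hessian stands in for the real curvature estimate.

```python
# Toy sketch of an OBS-style second-order saliency score: the expected loss
# increase from zeroing weight i is w_i^2 / (2 * [H^-1]_ii). A diagonal
# Hessian approximation stands in for the curvature estimates that methods
# like WoodFisher compute efficiently.
import torch

weights = torch.tensor([0.9, -0.1, 0.4, 0.05])
hessian_diag = torch.tensor([2.0, 0.5, 1.0, 0.25])  # toy curvature values

inv_hessian_diag = 1.0 / hessian_diag               # [H^-1]_ii for diagonal H
saliency = weights**2 / (2 * inv_hessian_diag)

# Prune the lowest-saliency weights first: they hurt the loss the least.
print(saliency)  # tensor([0.8100, 0.0025, 0.0800, 0.0003])
```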
Use Sparse Transfer Learning to Create Sparse Models Fine-Tuned to Your Datasets
450 views • 1 year ago
Intro to SparseML
687 views • 1 year ago
Accelerate Image Segmentation Tasks With Sparsity and the DeepSparse Runtime
209 views • 1 year ago
Accelerate Image Classification Tasks With Sparsity and the DeepSparse Runtime
182 views • 1 year ago
Accelerate Object Detection Tasks With Sparsity and the DeepSparse Runtime
849 views • 1 year ago
Intro to DeepSparse Runtime
1.6K views • 1 year ago
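A hedged sketch of the DeepSparse runtime's basic flow, per its documented Python API: compile an ONNX model for CPU execution, then run batched NumPy inputs through it. The file path and input shape are illustrative.

```python
# Hedged sketch: compile an ONNX model with DeepSparse and run it on CPU.
# The ONNX path and input shape are illustrative (ResNet-50-style input).
import numpy as np
from deepsparse import compile_model

engine = compile_model("model.onnx", batch_size=1)
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
print(outputs[0].shape)
```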
Intro to Deep Learning Model Sparsification
957 views • 1 year ago
Intro to SparseZoo
348 views • 1 year ago

Comments

  • @mfc1190
    @mfc1190 9 days ago

    Great work on this project! I’ve loved the simplicity + perf when using it.

  • @qiyuangong4785
    @qiyuangong4785 10 days ago

    Nice presentation! This feature may significantly reduce prefill recomputation in long contexts.

  • @Earthvssuna
    @Earthvssuna 1 month ago

    And can it run out of the box, or is it difficult to execute on AMD?

  • @Earthvssuna
    @Earthvssuna 1 month ago

    So, in the end, is vLLM running on AMD GPUs good?

  • @MEvansMusic
    @MEvansMusic 1 month ago

    When trying to run an EAGLE model in the OpenAI api_server, how should you format the script? Since EAGLE doesn't use a draft model in its setup, I'm not sure how to make it work.

  • @불고기-u5o
    @불고기-u5o 1 month ago

    12:12

  • @temka088
    @temka088 1 month ago

    Which chat UI is being used at the end?

    • @michaelgoin4760
      @michaelgoin4760 1 month ago

      The frontend in the demo is custom-built on Next.js, using Vercel's AI SDK.

  • @dmytro7441
    @dmytro7441 1 month ago

    Thank you for the video. Do you expect it to work with multimodal models like Llama-3.2-Vision and Pixtral?

  • @불고기-u5o
    @불고기-u5o 1 month ago

    7:16 Yay, this is exactly what I was looking for. The architecture guy is the best, the best, the best!

  • @불고기-u5o
    @불고기-u5o 1 month ago

    45:04 Discussion of version 6

  • @불고기-u5o
    @불고기-u5o 1 month ago

    7:08

  • @micuentadecasa
    @micuentadecasa 2 months ago

    Hi, great video. I would like to know if it is possible to use e5-mistral-7b-instruct in vLLM for both embeddings and completions with only one instance of vLLM?

  • @curtwortman6995
    @curtwortman6995 3 months ago

    Excellent progress and very informative. Thank you, Neural Magic and team, for your innovation and fantastic contributions.

  • @nickellas9882
    @nickellas9882 3 months ago

    Great session - thanks for posting!

  • @bigtymer4862
    @bigtymer4862 3 months ago

    28:12 AWQ?

  • @hari000-f6y
    @hari000-f6y 3 months ago

    I have a question! I'm serving a quantized multimodal model (InternVL2) on vLLM on an L4. It takes ~5-6 secs to complete a request, so when multiple requests hit at a time, it takes much longer, ~30 secs, to complete them. How can I handle this so that multiple requests also get completed in ~5 secs? I have little understanding of batch requesting and all.

  • @shumshvenhiszali
    @shumshvenhiszali 4 months ago

    You say the code is open source, but where?

  • @pseudokamp
    @pseudokamp 5 months ago

    Please share the source code

  • @pseudokamp
    @pseudokamp 5 months ago

    Please share the source code for the quantization.

    • @neuralmagic
      @neuralmagic 5 months ago

      See here: github.com/vllm-project/llm-compressor/tree/main/examples

    • @Reeves-k2k
      @Reeves-k2k 4 months ago

      Great Work! Thanks :)

  • @spaken2768
    @spaken2768 6 months ago

    Can DeepSparse run on the Raspberry Pi AI Kit for even faster FPS?

  • @muhammadwaleedather1726
    @muhammadwaleedather1726 7 months ago

    Ma'am, my question is: if I have trained a model on plain YOLOv8 from Ultralytics and got the best.pt file as my trained model, can I directly remove unwanted weights from it with your technique, or do I have to completely retrain the model through your technique to get that right?

  • @music_love21
    @music_love21 8 months ago

    Hi! We are creating a system that classifies tomato ripeness levels using image processing in a CNN architecture with a YOLOv8 model. We are using a Raspberry Pi 4 with 4GB RAM and have encountered a problem: the system has a 2-3 minute delay/lag in classifying the ripeness level. Would you happen to have any recommendations/suggestions on this problem?

  • @szhavel
    @szhavel 8 months ago

    The link to the Colab is not available now.

  • @MahrukhAliKhan-x4c
    @MahrukhAliKhan-x4c 9 months ago

    Can you make a tutorial on practically pruning a GAN model like GFPGAN?

  • @prateekpatel6082
    @prateekpatel6082 10 months ago

    Could you clarify whether these pruning strategies are post-training or training-aware? It seems like progressive sparsification is training-aware, but from what I recall the WoodFisher approach is post-training and requires some fine-tuning at the end?

  • @prateekpatel6082
    @prateekpatel6082 10 months ago

    In the recipes, it's shown that we do distillation to recover accuracy, followed by quantization. Curious: do you observe degradation in the quantization step? Why not distill or fine-tune post-quantization? Also, with these recipes, does training become more expensive compared to base dense models? Is there any data comparing the training cost and time?

  • @RobGreenberg-ri4hy
    @RobGreenberg-ri4hy 1 year ago

    Wow, very insightful interview!

  • @hosseinsoleimani3193
    @hosseinsoleimani3193 1 year ago

    Any C++ API for this?

  • @albertofernandez055
    @albertofernandez055 1 year ago

    Hi, I am really interested in improving inference time for YOLO models on CPU. Here are some questions I have: (1) After applying a SparseZoo recipe to our data using "sparseml.ultralytics.train", what is the format of the generated weights? (2) Moreover, is it possible to import the generated weights for the sparsified model in OpenCV using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? (3) As far as I can see, all of the Neural Magic repositories have the Apache License 2.0. Is this correct? (4) Are there any commercial restrictions on using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? Many thanks in advance

  • @albertofernandez055
    @albertofernandez055 1 year ago

    Many thanks for this video. I have several questions: (1) After applying a SparseZoo recipe to our data using "sparseml.ultralytics.train", what is the format of the generated weights? (2) Moreover, have you tried to import the generated weights for the sparsified model in OpenCV using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? (3) And finally, I have a question about the license. As far as I can see, all of the Neural Magic repositories have the Apache License 2.0. Is this correct? (4) So are there any commercial restrictions on using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? Many thanks!!!

  • @shahid19297
    @shahid19297 1 year ago

    @Neural Magic You are building amazing tools, but the demos aren't good. I don't like these PPT explanations; show us practical demos and all the different tools and methods you have. I still can't figure out how to train my YOLOv5 on a custom dataset using sparsification. I still can't figure out what DeepSparse is, what sparsifying is, what SparseML and SparseZoo are... I just want to know how to train my custom dataset with YOLOv5 using your sparsification method. I should see complete code examples for that. And if I have already-trained weights, how can we apply pruning and quantization? It's very confusing, as none of your articles explain things completely. There should have been hands-on videos or articles. As a beginner, it is difficult to understand. By the way, you should try Roboflow-style demos; those are great.

  • @nishant.wankhade
    @nishant.wankhade 1 year ago

    Hi there, what are the hardware specifications of the GPU needed to run the YOLOv8 model? I have an Nvidia GT 730, and while running the model it is giving me "cuda: no kernel image is available for execution on the device."

    • @neuralmagic
      @neuralmagic 1 year ago

      Hello! Our software is specific to CPU infrastructure. Our runtime, DeepSparse, is engineered to take advantage of CPU memory to deliver the performance we claimed in the video.

  • @yamenshahla8214
    @yamenshahla8214 1 year ago

    Hello, I followed all the steps and now have a best_pruned.onnx model file. DeepSparse is only giving me the ability to test on an image. How can I deploy my model on a live camera stream or a video file? Thank you

  • @billykotsos4642
    @billykotsos4642 1 year ago

    29:28 This truly is a game changer!

  • @billykotsos4642
    @billykotsos4642 1 year ago

    This is truly groundbreaking... you guys are doing phenomenal work...

  • @Gstreng
    @Gstreng 1 year ago

    Amazing stuff!

  • @billykotsos4642
    @billykotsos4642 1 year ago

    LLMs running on CPUs is groundbreaking. You guys are doing amazing work!

  • @hritikakolkar
    @hritikakolkar 1 year ago

    Hi, do you guys have any ML intern positions?

  • @xeetu.7065
    @xeetu.7065 1 year ago

    Is it possible to write a YOLOv5 object recognition application using Neural Magic on Windows?

    • @neuralmagic
      @neuralmagic 1 year ago

      Hello! Yes, you could train the model in Windows or do it all in WSL or a VM. Join our Slack community to ask questions if you run into issues: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ

  • @CrushCuisine
    @CrushCuisine 1 year ago

    Has anyone outside of the Neural Magic team pruned their own model with SparseML and can confirm these claims?

  • @madhura305
    @madhura305 1 year ago

    Hello @Neural Magic, can I follow the same steps for training the model on Windows?

    • @neuralmagic
      @neuralmagic 1 year ago

      Hello! Yes, you could train the model in Windows or do it all in WSL or a VM. Join our Slack community to ask questions if you run into issues: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ

    • @madhura305
      @madhura305 1 year ago

      Hey! Can you please help me out with how exactly to train the model on a custom dataset? Though I tried and followed all the instructions mentioned on GitHub, I'm not able to train it.

    • @neuralmagic
      @neuralmagic 1 year ago

      @@madhura305 Hi! We see that you asked your question in our Slack community. We will help you there!

  • @ch1n3du3
    @ch1n3du3 1 year ago

    This is great work. Is there any research on the ideal sparse-training-to-dense-training ratios?

    • @dtransposed79
      @dtransposed79 1 year ago

      We may publish more information on that in the follow-up to this paper soon. But what I can share right now is that you can often reduce the length of the dense phases, but not remove them completely.

    • @ch1n3du3
      @ch1n3du3 1 year ago

      @@dtransposed79 thanks for the response

  • @andrewowens5653
    @andrewowens5653 1 year ago

    What about LLMs and/or Stable Diffusion models? Can your techniques be used with GPTs?

    • @neuralmagic
      @neuralmagic 1 year ago

      We are actively working on optimizing generative models. Here is our latest research showing that you can sparsify LLMs to 50% in one shot: neuralmagic.com/blog/sparsegpt-remove-100-billion-parameters-for-free/ In about 30 days, we are holding a webinar where we'll discuss the SparseGPT method, how you can apply it to your models, and how you can run sparse generative models in DeepSparse super fast. We will post the registration link here by the end of the week: neuralmagic.com/neural-magic-events/

  • @RobGreenberg-ri4hy
    @RobGreenberg-ri4hy 1 year ago

    Very well done!

  • @flymousechiu
    @flymousechiu 2 years ago

    Congrats on making it to EMNLP 2022! Also, for those who can't take pineapple pizzas: you have no idea what you are missing out on.

  • @modeltrainer1246
    @modeltrainer1246 2 years ago

    We need YOLOv7 in DeepSparse. Is that too much to ask for?

    • @neuralmagic
      @neuralmagic 2 years ago

      Dense YOLOv7 runs in the DeepSparse Engine! We are seeing speedups out of the box. YOLOv7-tiny seems quite accurate with YOLOv5s speeds. We are working on sparsifying and quantizing YOLOv7 for way better performance. We see promising results from sparsity, even more so than YOLOv5. Stay tuned - the best place is our Slack community as we post our model and engine updates there.

  • @dislike__button
    @dislike__button 2 years ago

    When will big NLP models like GPT-2, M2M100, etc. be supported? 😞😭

    • @neuralmagic
      @neuralmagic 2 years ago

      Hello! We are working on sparsifying large language models to make them more usable in production. We will keep the community informed of our efforts. Join us in Slack to stay in the loop: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ

  • @kishoreg8835
    @kishoreg8835 2 years ago

    WHEN IS YOLOV7 SPARSE COMING?

    • @neuralmagic
      @neuralmagic 2 years ago

      Soon! Join us in the DeepSparse Community to hear exactly when: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ

    • @MeanGeneHacks
      @MeanGeneHacks 2 years ago

      Definitely excited for YOLOv7 Sparse! @Neural Magic

    • @kishoreg8835
      @kishoreg8835 2 years ago

      @@neuralmagic It's been a month... how long is "soon"?