- Videos: 56
- Views: 88,909
Neural Magic
USA
Added Dec 20, 2019
Neural Magic is on a mission to bring the power of open-source LLMs and vLLM to every enterprise on the planet. The Future of AI is Open.
vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024
In this session, we wrapped up 2024 with a comprehensive update on the vLLM project and shared exciting plans for 2025. Michael Goin, vLLM Committer, walked us through the latest updates in vLLM v0.6.5, including performant structured outputs, while Simon Mo, vLLM Maintainer, shared key insights from vLLM’s 2024 journey and the roadmap for 2025.
Highlights:
[00:00-02:45] A recap of 2024 vLLM Office Hours by the numbers
[02:46-09:03] About vLLM & Neural Magic
[09:04-15:58] What’s new in vLLM v0.6.5, including performant structured outputs
[15:59-25:59] vLLM’s 2024 milestones and achievements
[26:00-35:55] vLLM 2025 roadmap, including upcoming features and improvements
[35:56-56:03] Open discussio...
Views: 256
Videos
vLLM Office Hours - Exploring Machete, a Mixed-Input GEMM Kernel for Hopper GPUs - December 5, 2024
500 views, 14 days ago
In this session, we explored Machete, Neural Magic's newest innovation in mixed-input GEMM kernel design for NVIDIA Hopper GPUs. Built on top of advancements in NVIDIA CUTLASS 3.5.1, Machete is optimized for both compute and memory-bound regimes on Hopper GPUs (H100). Key features include on-the-fly upconversion of weights, latency hiding through overlapping compute and data movement, and robus...
vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024
604 views, 1 month ago
In this session of our bi-weekly vLLM office hours, we explored the potential of disaggregated prefill and KV cache storage in vLLM to enhance distributed inference. We discussed the initial PR on disaggregated prefill and how KV cache sharing across vLLM improves performance through faster delivery and the composition of multiple KV caches. These advancements are designed to push the boundarie...
vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024
504 views, 1 month ago
In this session, we dive deep into the implementation of state-of-the-art (SOTA) tool-calling in vLLM. We discuss the importance of tools and functions in open-source AI and provide insights into the challenges and solutions around OpenAI-style tools in vLLM. During the Q&A, we explored questions around serving multiple models on a single vLLM server, the benefits of partial JSON decoding from ...
vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024
819 views, 2 months ago
In this session of our bi-weekly vLLM office hours, we explored the exciting updates in the vLLM v0.6.3 release, featuring experimental fullgraph torch.compile, the introduction of a Feature Compatibility Matrix, and the Machete w4a16 kernel for Hopper GPUs. We also covered new VLM support for GLM-4V, Molmo, NVLM-D, tool-use support for Llama 3.1, Llama 3.2, and InternLM2.5, and Reward LM support for Q...
vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024
1K views, 2 months ago
In this vLLM office hours session, we explore the latest updates in vLLM v0.6.2, including Llama 3.2 Vision support, the introduction of MQLLMEngine for API Server, and beam search externalization. Following these updates, Lily Liu, vLLM Committer and PhD student at UC Berkeley, joins us to discuss speculative decoding in vLLM. She provides insights into what speculative decoding is, its differ...
vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024
1.9K views, 3 months ago
In this session of Neural Magic's bi-weekly vLLM office hours, we cover the latest updates in vLLM v0.6.0 and v0.6.1, including Vision LM support for Pixtral and Qwen2-VL, and tool-use support for Mistral and Qwen2.5. We also delve into advanced techniques for maximizing inference performance in large language models, highlighting key optimizations that deliver 2.7x throughput improvements and ...
vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 05, 2024
2.5K views, 3 months ago
In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8 kernels in vLLM. During the Q&A, we tackled a variety of audience questions around hardware diversity,...
vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024
667 views, 4 months ago
In this exciting session, we were joined by Woosuk Kwon, the co-creator of vLLM, to dive deep into vLLM's performance on AMD GPUs and Google TPUs. Woosuk shared detailed performance benchmarks and discussed the supported features for each hardware platform. We also explored vLLM's diverse hardware support, including what's coming next in the pipeline. During the Q&A, we tackled a variety of aud...
vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024
637 views, 4 months ago
In this session, we brought on Roger Wang, a vLLM Committer and Software Engineer, ML Platform at Roblox, to discuss the development of supporting transformer-based multimodal models on vLLM. Roger shared insights on effectively using vision-language models with vLLM, upcoming changes, and the roadmap for multimodal model support in vLLM. Additionally, we touched on the vLLM v0.5.4 release, inc...
vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024
1.1K views, 5 months ago
In this session, we brought on model compression expert Eldar Kurtić to discuss Model Quantization for Efficient vLLM Inference. Eldar shared the why, when, and how to quantize LLMs for efficient inference. He introduced a new library called llm-compressor for optimizing LLMs for accurate inference in vLLM. Additionally, we touched on the vLLM v0.5.2 and v0.5.3 releases, including model support...
Deploy LLMs More Efficiently with vLLM and Neural Magic
1.2K views, 5 months ago
Learn why vLLM is the leading open-source inference server and how Neural Magic works with enterprises to build and scale vLLM-based model services with more efficiency and cost savings.
vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024
1.7K views, 5 months ago
In this session, we brought on vLLM Committers from Anyscale to give an in-depth dive into FP8 quantization. They discussed why FP8 is important, how to get started with FP8 in vLLM, and shared quality and performance results of FP8 quantization. We also covered the latest updates in vLLM v0.5.1, including pipeline parallelism and model support for Gemma 2, Jamba, and DeepSeek-V2. For more deta...
vLLM and Neural Magic Office Hours - June 5, 2024
549 views, 6 months ago
Deploy Fast and Accurate YOLOv8 Object Detection Models on CPUs You Already Have
3.7K views, 1 year ago
Unlock Faster and More Efficient LLMs with SparseGPT
2.3K views, 1 year ago
Pruning and Quantizing ML Models With One Shot Without Retraining
2.2K views, 1 year ago
Sparse Transferring Hugging Face Models With SparseML
537 views, 1 year ago
Apply Second-Order Pruning Algorithms for SOTA Model Compression
967 views, 1 year ago
Use Sparse Transfer Learning to Create Sparse Models Fine-Tuned to Your Datasets
450 views, 1 year ago
Accelerate Image Segmentation Tasks With Sparsity and the DeepSparse Runtime
209 views, 1 year ago
Accelerate Image Classification Tasks With Sparsity and the DeepSparse Runtime
182 views, 1 year ago
Accelerate Object Detection Tasks With Sparsity and the DeepSparse Runtime
849 views, 1 year ago
Intro to Deep Learning Model Sparsification
957 views, 1 year ago
Great work on this project! I’ve loved the simplicity + perf when using it.
Nice presentation! This feature may significantly reduce prefill recomputation in long contexts.
And can it run out of the box, or is it difficult to run on AMD?
So in the end, does vLLM run well on AMD GPUs?
When trying to run an EAGLE model in the OpenAI api_server, how should you format the script? Since EAGLE doesn't use a draft model in its setup, I'm not sure how to make it work.
12:12
which chat ui is being used in the end?
The frontend in the demo is custom built on Next.js and using Vercel's AI SDK
Thank you for the video. Do you expect it to work with Multimodal Models like llama-3.2-vision and Pixtral?
7:16 Ahh, this is exactly what I was looking for. The architecture guy is the best, the best, the best.
45:04 Discussion of version 6
7:08
Hi, great video. I would like to know if it is possible to use e5-mistral-7b-instruct in vLLM for both embeddings and completions with only one instance of vLLM?
Excellent progress and very informative. Thank you, Neural Magic and team, for your innovation and fantastic contributions.
Great session - thanks for posting!
28:12 AWQ?
I have a question! I'm serving a quantized multimodal model (InternVL2) with vLLM on an L4. A single request takes ~5-6 seconds to complete, but when multiple requests arrive at the same time, they take much longer, ~30 seconds, to complete. How can I handle this so that multiple concurrent requests also complete in ~5 seconds? I don't have much understanding of request batching.
You say the code is open source, but where is it?
Please share the source code
Please share the source code of quantization
See here: github.com/vllm-project/llm-compressor/tree/main/examples
Great Work! Thanks :)
Can DeepSparse run on the Raspberry Pi AI Kit for even faster FPS?
Ma'am, my question is this: I trained a model with plain YOLOv8 from Ultralytics and got a best.pt file as my trained model. Can I directly remove the unwanted weights from it using your technique, or do I have to train the model completely through your technique to get that right?
Hi! We are building a system that classifies tomato ripeness levels using image processing with a CNN architecture and a YOLOv8 model. We are using a Raspberry Pi 4 with 4GB RAM and have run into a problem: the system has a 2-3 minute delay when classifying the ripeness level. Would you happen to have any recommendations or suggestions for this problem?
The link to the Colab is no longer available.
Can you make a tutorial on practically pruning a GAN model such as GFPGAN?
Could you clarify whether these pruning strategies are post-training or training-aware? It seems like progressive sparsification is training-aware, but from what I recall the WoodFisher approach is post-training and requires some fine-tuning at the end?
In the recipes, it's shown that we do distillation to recover accuracy, followed by quantization. Curious: do you observe degradation in the quantization step? Why not distill or fine-tune post-quantization? Also, with these recipes, does training become more expensive compared to base dense models? Is there any data comparing the training cost and time?
Wow, very insightful interview!
Any C++ API for this?
Hi, I am really interested in improving inference time for YOLO models in CPU. Here are some questions I have: (1) After using a SparseZoo recipe to apply to our data, using "sparseml.ultralytics.train", what is the format of the generated weights? (2) Moreover, is it possible to import the generated weights for the sparsified model in OpenCV using: cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? (3) As far as I see in all the Neural Magic repositories, all of them have Apache License 2.0. Is this correct? (4) Are there any commercial restrictions of using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? Many thanks in advance
Many thanks for this video. I have several questions: (1) After using a SparseZoo recipe to apply to our data, using "sparseml.ultralytics.train", what is the format of the generated weights? (2) Moreover, have you tried to import the generated weights for the sparsified model in OpenCV using: cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? (3) And finally, I have a question about the license. As far as I can see, all the Neural Magic repositories have the Apache License 2.0. Is this correct? (4) So are there any commercial restrictions on using cv2.dnn.readNetFromONNX('yolov8n_sparsified.onnx')? Many thanks!!!
Did you find a solution?
@Neural Magic You are building amazing tools, but the demos aren't good. I don't like these PPT explanations; show us practical demos and all the different tools and methods you have. I still can't figure out how to train my YOLOv5 with a custom dataset using sparsification. I still can't figure out what DeepSparse is, what sparsifying is, SparseML, SparseZoo... I just want to know whether I have to train my custom dataset with YOLOv5 using your sparsification method. I should see complete code examples for that. And what if I already have trained weights: how can we apply pruning and quantization? It's very confusing, as none of your articles explain things completely. There should be hands-on videos or articles. As a beginner, it is difficult to understand. By the way, you should try Roboflow-style demos; those are great.
Hi there, what are the GPU hardware requirements to run a YOLOv8 model? I have an Nvidia GT 730, and while running the model it gives me "cuda: no kernel image is available for execution on the device."
Hello! Our software is specific to CPU infrastructure. Our runtime, DeepSparse, is engineered to take advantage of CPU memory to deliver the performance we claimed in the video.
Hello, I followed all the steps and now have a best_pruned.onnx model file. DeepSparse only lets me test on an image. How can I deploy my model on a live camera stream or a video file? Thank you.
29:28 This truly is a game changer!
This is truly groundbreaking... you guys are doing phenomenal work...
Amazing stuff!
LLMs running on CPUs is groundbreaking. You guys are doing amazing work!
Hi, do you guys have any ML intern positions?
Is it possible to write a YOLOv5 object recognition application using Neural Magic on Windows?
Hello! Yes, you could train the model in Windows or do it all in WSL or a VM. Join our Slack community to ask questions if you run into issues: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
Has anyone, outside of the NeuralMagic team, pruned their own model with SparseML and can confirm these claims?
Hello @Neural Magic, can I follow the same steps to train the model on Windows?
Hello! Yes, you could train the model in Windows or do it all in WSL or a VM. Join our Slack community to ask questions if you run into issues: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
Hey! Can you please help me with how exactly to train the model on a custom dataset? Though I tried and followed all the instructions mentioned on GitHub, I'm not able to train it.
@madhura305 Hi! We see that you asked your question in our Slack community. We will help you there!
this is great work. Is there any research on the ideal sparse training to dense training ratios?
We may publish more information on that in the follow-up to this paper soon. But what I can share right now is that you can often reduce the length of the dense phases, but not remove them completely.
@dtransposed79 thanks for the response
What about LLMs and/or Stable Diffusion models? Can your techniques be used with GPTs?
We are actively working on optimizing generative models. Here is our latest research that shows you can sparsify LLMs 50% with one-shot: neuralmagic.com/blog/sparsegpt-remove-100-billion-parameters-for-free/ In about 30 days, we are holding a webinar where we'll discuss the SparseGPT method, how you can apply it to your models, and how you can run sparse generative models in DeepSparse super fast. We will post the registration link here by the end of the week: neuralmagic.com/neural-magic-events/
Very well done!
Absolutely. Great job!
Congrats on making it to EMNLP 2022! Also, for those who can't take pineapple pizzas, you have no idea what you are missing out.
We need YOLOv7 on DeepSparse. Is that too much to ask for?
Dense YOLOv7 runs in the DeepSparse Engine! We are seeing speedups out of the box. YOLOv7-tiny seems quite accurate with YOLOv5s speeds. We are working on sparsifying and quantizing YOLOv7 for way better performance. We see promising results from sparsity, even more so than YOLOv5. Stay tuned - the best place is our Slack community as we post our model and engine updates there.
When will big NLP models like GPT-2, M2M100, etc. be supported? 😞😭
Hello! We are working on sparsifying large language models to make them more usable in production. We will keep the community informed of our efforts. Join us in Slack to stay in the loop: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
WHEN YOLOV7 SPARSE COMING
Soon! Join us in the DeepSparse Community to hear exactly when: join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ
Definitely excited for sparse YOLOv7! @Neural Magic
@neuralmagic It's been a month... how long is soon?