Detecting & Overcoming GPU Failures During ML Training - Ganeshkumar Ashokavardhanan & Sarah Belghiti

  • Published: Sep 15, 2024
  • Don't miss out! Join us at our upcoming conference: Open Source Summit + AI_Dev: Open Source GenAI & ML Summit in Tokyo from October 28-29, 2024. Connect with peers as the community gathers to further the education and advancement of open source and GenAI. Learn more at events.linuxfo...
    Detecting and Overcoming GPU Failures During ML Training - Ganeshkumar Ashokavardhanan, Microsoft & Sarah Belghiti, Wayve
    Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increase, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand the principles of fault-tolerant distributed training that mitigate the fallout of GPU failures. Drawing on experience from a cloud provider and an autonomous vehicle company, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task preemption for GPU workloads.
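    As a rough illustration of two of the ideas above (pre-flight GPU health checks and checkpoint-based fault tolerance), here is a minimal Python sketch. It assumes PyTorch with a CUDA GPU and, for the health check, DCGM's `dcgmi` CLI; the helper names (`run_health_check`, `CKPT_PATH`) and the checkpoint layout are illustrative assumptions, not taken from the talk.

```python
"""Minimal sketch: quick GPU health check, then a resumable training loop."""
import os
import subprocess

import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical checkpoint location


def run_health_check() -> None:
    """Run a quick DCGM diagnostic (level 1) and abort if the GPU looks unhealthy."""
    try:
        result = subprocess.run(
            ["dcgmi", "diag", "-r", "1"],  # assumes DCGM is installed on the node
            capture_output=True, text=True, timeout=300,
        )
    except FileNotFoundError:
        print("dcgmi not found; skipping health check")
        return
    if result.returncode != 0:
        raise RuntimeError(f"GPU health check failed:\n{result.stdout}")


def train(num_steps: int = 1000) -> None:
    model = nn.Linear(512, 512).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    start_step = 0
    if os.path.exists(CKPT_PATH):  # resume after a failure instead of restarting from scratch
        ckpt = torch.load(CKPT_PATH, map_location="cuda")
        model.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, num_steps):
        x = torch.randn(64, 512, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective for illustration
        opt.zero_grad()
        loss.backward()
        opt.step()

        if step % 100 == 0:  # periodic checkpoints bound the work lost to a GPU failure
            torch.save({"model": model.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, CKPT_PATH)


if __name__ == "__main__":
    run_health_check()
    train()
```

    Checkpointing every N steps bounds the work lost to a failure to at most N steps; choosing N is the basic cost-versus-recovery trade-off that fault-tolerant distributed training systems tune.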
