Detecting & Overcoming GPU Failures During ML Training - Ganeshkumar Ashokavardhanan & Sarah Belghiti

  • Published: Sep 15, 2024
  • Don't miss out! Join us at our upcoming conference: Open Source Summit + AI_Dev: Open Source GenAI & ML Summit in Tokyo from October 28-29, 2024. Connect with peers as the community gathers to further the education and advancement of open source and GenAI. Learn more at events.linuxfo...
    Detecting and Overcoming GPU Failures During ML Training - Ganeshkumar Ashokavardhanan, Microsoft & Sarah Belghiti, Wayve
    Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increase, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand the principles of fault-tolerant distributed training that mitigate the fallout of GPU failures. Drawing on experience from a cloud provider and an autonomous vehicle company, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task preemption for GPU workloads.
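    As a rough illustration of two of the ideas above (pre-flight GPU health checks and checkpoint-based fault tolerance), here is a minimal Python sketch. It assumes PyTorch with a CUDA GPU and, for the health check, DCGM's `dcgmi` CLI; the helper names (`run_health_check`, `CKPT_PATH`) and the checkpoint layout are illustrative assumptions, not taken from the talk.

```python
"""Minimal sketch: quick GPU health check, then a resumable training loop."""
import os
import subprocess

import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical checkpoint location


def run_health_check() -> None:
    """Run a quick DCGM diagnostic (level 1) and abort if the GPU looks unhealthy."""
    try:
        result = subprocess.run(
            ["dcgmi", "diag", "-r", "1"],  # assumes DCGM is installed on the node
            capture_output=True, text=True, timeout=300,
        )
    except FileNotFoundError:
        print("dcgmi not found; skipping health check")
        return
    if result.returncode != 0:
        raise RuntimeError(f"GPU health check failed:\n{result.stdout}")


def train(num_steps: int = 1000) -> None:
    model = nn.Linear(512, 512).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    start_step = 0
    if os.path.exists(CKPT_PATH):  # resume after a failure instead of restarting from scratch
        ckpt = torch.load(CKPT_PATH, map_location="cuda")
        model.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, num_steps):
        x = torch.randn(64, 512, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective for illustration
        opt.zero_grad()
        loss.backward()
        opt.step()

        if step % 100 == 0:  # periodic checkpoints bound the work lost to a GPU failure
            torch.save({"model": model.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, CKPT_PATH)


if __name__ == "__main__":
    run_health_check()
    train()
```

    Checkpointing every N steps bounds the work lost to a failure to at most N steps; choosing N is the basic cost-versus-recovery trade-off that fault-tolerant distributed training systems tune.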
