Name: Detecting and Overcoming GPU Failures During ML Training | 在ML训练过程中检测和克服GPU故障 - Ganeshkumar Ashokavardhanan, Microsoft & Sarah Belghiti, Wayve
Start: 2024-08-23T15:15:00+0800
End: 2024-08-23T15:50:00+0800

In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

亲临现场

2024年8月21-23日

了解更多并注册参加

Sched应用程序允许您创建自己的日程安排，但不能替代您的活动注册。您必须注册参加KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024，才能参加会议。如果您尚未注册但希望加入我们，请访问活动注册页面购买注册。

请注意：本日程自动显示为香港标准时间（UTC +8）。要查看您偏好的时区的日程，请从右侧“按日期筛选”上方的下拉菜单中选择。日程可能会有变动，会议席位先到先得。

Friday August 23, 2024 3:15pm - 3:50pm HKT

Level 1 | Hung Hom Room 3

Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increases, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues, and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on cloud provider and autonomous vehicle company experience, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task pre-emption for GPU workloads.

随着模型规模和训练规模的增加，机器学习训练需要强大的GPU基础设施，而GPU故障成为一种昂贵的风险。从硬件故障到性能逐渐下降，未被发现的GPU问题可能会破坏训练任务，增加成本并减缓开发速度。本次演讲将深入探讨在机器学习训练中GPU故障所带来的挑战，特别是在分布式训练中。我们将探讨各种GPU问题的范围，以及为什么即使是轻微的性能下降也可能瘫痪大型任务。了解如何通过观测性（利用诸如NVIDIA DCGM之类的工具）通过GPU健康检查实现问题的主动检测。了解容错分布式训练的原则，以减轻GPU故障的后果。借鉴云服务提供商和自动驾驶汽车公司的经验，我们将分享高效识别、纠正和预防GPU故障的最佳实践。我们还将探讨像CRIU和任务抢占等尖端想法，以应对GPU工作负载。

Speakers

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft

Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →

Sarah Belghiti

ML Platform Engineer, Wayve

Sarah Belghiti is an ML Platform Engineer at Wayve, a leading developer of embodied intelligence for autonomous vehicles. She works on the infrastructure, scheduling and monitoring of ML workloads. With GPUs becoming an increasingly scarce resource, her focus has been on building... Read More →

Friday August 23, 2024 3:15pm - 3:50pm HKT
Level 1 | Hung Hom Room 3

AI_dev: Open Source GenAI & ML Summit Sessions, MLOps + GenOps + DataOps

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

KubeCon + CloudNativeCon + Open Source Summit + AI_dev China 2024

Ganeshkumar Ashokavardhanan

Sarah Belghiti

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!