In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Fault tolerance during training, fine-tuning, and even inference is crucial for modern AI workloads at large scale, across clusters with many GPUs. For training and fine-tuning tasks, GPU failures, storage failures, and other hardware issues often extend training time significantly, by weeks or even months. For inference, when a massive load of requests arrives and one of the inference servers goes faulty, we need a policy and scheduler that can migrate the workload quickly and efficiently. In this talk, we will introduce a series of mechanisms we have designed to help Kubernetes clusters and the workloads themselves locate failures, diagnose the root cause, and schedule and perform mitigation whenever hardware or CUDA API call failures occur, reducing the overall operational burden. But the possibilities do not stop there: the fault-aware mitigation scheduler can help any workload mitigate failures as they happen.
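To make the idea concrete, here is a minimal, illustrative Go sketch (not the speakers' implementation) of a fault-aware mitigation loop built on the standard Kubernetes client-go API: it polls node conditions for a hypothetical "GPUHealthy" condition (for example, one surfaced by a node health agent) and cordons unhealthy nodes so the scheduler can place new workloads elsewhere.

```go
// Minimal sketch of a fault-aware mitigation loop, assuming a custom
// "GPUHealthy" node condition is reported by some node health agent.
// This is for illustration only and omits leader election, retries,
// and conflict handling that a production controller would need.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Println("list nodes:", err)
			time.Sleep(30 * time.Second)
			continue
		}
		for _, node := range nodes.Items {
			for _, cond := range node.Status.Conditions {
				// "GPUHealthy" is a hypothetical condition name used for this sketch.
				if cond.Type == corev1.NodeConditionType("GPUHealthy") && cond.Status == corev1.ConditionFalse {
					if !node.Spec.Unschedulable {
						// Cordon the node so no new pods are scheduled onto the faulty GPU host.
						node.Spec.Unschedulable = true
						_, err := client.CoreV1().Nodes().Update(context.TODO(), &node, metav1.UpdateOptions{})
						fmt.Printf("cordoned %s: %v\n", node.Name, err)
					}
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```

In the setting described by the talk, the same failure signal could instead drive deeper diagnostics or trigger checkpoint-aware rescheduling of the affected training job.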

Speakers

Kebe Liu

Senior software engineer, DaoCloud
Member of the Istio Steering Committee, focused in recent years on cloud native, Istio, eBPF, and related areas. Founder of the Merbridge project.

Neko Ayaka

Software Engineer, DaoCloud
Cloud native developer, AI researcher, and Gopher with 5 years of experience across AI, data science, backend, and frontend development. Co-founder of https://github.com/nolebase
Wednesday August 21, 2024 3:35pm - 4:10pm HKT
Level 1 | Hung Hom Room 3

