In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Friday August 23, 2024 10:35am - 11:10am HKT
As Large Language Models (LLMs) revolutionize various aspects of our lives, many companies are building cloud native AI platforms to train and fine-tune LLMs. However, managing large-scale LLM training and inference platforms presents even more critical challenges, such as training efficiency, fault tolerance, resource fragmentation, operational costs, and topology-aware scheduling across racks and supernodes. In this session, the speaker will share insights from their experience using a Kubernetes-based smart infrastructure, enhanced by Volcano, to manage thousands of GPUs and handle monthly workloads of thousands of LLM training and inference jobs at qihoo360. This talk will cover:
- Fault detection, fast job recovery, and self-healing that drastically improve efficiency.
- Dealing with long downtime in LLM training on heterogeneous GPUs.
- Intelligent GPU workload scheduling to reduce resource fragmentation and costs.
- Topology-aware scheduling on racks/supernodes to accelerate LLM training.
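The gang scheduling and self-healing themes above map onto Volcano's Job API. As a rough illustration (not the speaker's actual configuration; the job name, image, and replica counts are made up), a Volcano Job can request all-or-nothing placement of its workers and automatic restart on eviction:

```yaml
# Hypothetical Volcano Job sketch: gang scheduling plus a restart-on-eviction
# policy of the kind the talk discusses. All names and sizes are illustrative.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-pretrain
spec:
  schedulerName: volcano      # hand placement to the Volcano scheduler
  minAvailable: 8             # gang scheduling: start only when all 8 pods fit
  queue: default
  policies:
  - event: PodEvicted         # self-healing: restart the whole job on eviction
    action: RestartJob
  tasks:
  - replicas: 8
    name: worker
    template:
      spec:
        containers:
        - name: trainer
          image: example.com/llm-trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 8                   # GPUs per pod
        restartPolicy: Never
```

Because `minAvailable` covers the whole gang, a distributed training job never starts with a partial set of workers and deadlocks waiting for stragglers, which is the failure mode gang scheduling exists to avoid.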

Speakers

Xin Li

Senior Engineer of Server Development, qihoo360
Xin Li is a seasoned senior back-end developer and an approver for the Volcano project, with a keen focus on Kubernetes and AI. The infrastructure he is responsible for supports the training and inference of 360GPT. Moreover, he delves deeply into optimizing distributed...
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML
