In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Friday August 23, 2024 10:35am - 11:10am HKT
As Large Language Models (LLMs) revolutionize various aspects of our lives, many companies are building cloud native AI platforms to train and fine-tune LLMs. However, managing large-scale LLM training and inference platforms presents even more critical challenges, such as training efficiency, fault tolerance, resource fragmentation, operational costs, and topology-aware scheduling across racks and supernodes. In this session, the speaker will share insights from their experience using a Kubernetes-based smart infrastructure, enhanced by Volcano, to manage thousands of GPUs and handle monthly workloads of thousands of LLM training and inference jobs at qihoo360. This talk will cover:
- Fault detection, fast job recovery, and self-healing that drastically improve efficiency.
- Dealing with long downtime in LLM training on heterogeneous GPUs.
- Intelligent GPU workload scheduling to reduce resource fragmentation and costs.
- Topology-aware scheduling on racks/supernodes to accelerate LLM training.
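The gang scheduling and self-healing themes above map onto Volcano's Job API. As a rough illustration (not the speaker's actual configuration; the job name, image, and replica counts are made up), a Volcano Job can request all-or-nothing placement of its workers and automatic restart on eviction:

```yaml
# Hypothetical Volcano Job sketch: gang scheduling plus a restart-on-eviction
# policy of the kind the talk discusses. All names and sizes are illustrative.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-pretrain
spec:
  schedulerName: volcano      # hand placement to the Volcano scheduler
  minAvailable: 8             # gang scheduling: start only when all 8 pods fit
  queue: default
  policies:
  - event: PodEvicted         # self-healing: restart the whole job on eviction
    action: RestartJob
  tasks:
  - replicas: 8
    name: worker
    template:
      spec:
        containers:
        - name: trainer
          image: example.com/llm-trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 8                   # GPUs per pod
        restartPolicy: Never
```

Because `minAvailable` covers the whole gang, a distributed training job never starts with a partial set of workers and deadlocks waiting for stragglers, which is the failure mode gang scheduling exists to avoid.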

Speakers

Xin Li

Senior Engineer of Server Development, qihoo360
Xin Li is a seasoned senior back-end developer and an approver for the Volcano project, with a keen focus on Kubernetes and AI. The infrastructure he is responsible for supports the training and inference of 360GPT. Moreover, he delves deeply into optimizing distributed...
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML
