Attending this event?
In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Friday August 23, 2024 10:35am - 11:10am HKT
Large AI models are driving significant investment in GPU clusters. Yet managing these clusters is hard: Slurm-based HPC setups lack management granularity and stability, while Kubernetes poses usability challenges for AI users. This talk introduces TACC, an AI infrastructure management solution that combines the advantages of K8s and Slurm setups. It is joint work by computer systems researchers at HKUST and leading CNCF contributors at DaoCloud. TACC manages a large-scale cluster at HKUST that has supported over 500 active researchers since 2020. In this talk, we share our five-year journey with TACC, covering:

* [User Experience] A seamless UI for job submission and management, supporting both container and Slurm formats, all on the same backbone
* [Resource Management] Multi-tenant allocation with configurable strategies, using CNCF HAMi and Kueue
* [Performance and Scalability] A robust distributed infrastructure with networked storage and RDMA, via CNCF SpiderPool, Fluid...

Speakers

Peter Pan

VP of R&D Engineering, DaoCloud
* DaoCloud R&D Engineering VP
* CNCF wg-AI (AI Working Group) member
* Maintainer of a few CNCF projects (GitHub ID: panpan0000): CloudTTY, KuBean, HwameiStor
* Public tech events:
  * 2023 KubeCon SH Speaker (https://sched.co/1PTFI)
  * 2023 KubeCon EU Program Committee...

Kaiqiang Xu

Researcher, Hong Kong University of Science and Technology
Level 1 | Hung Hom Room 3
