In-person
21-23 August, 2024

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

AI_dev: Open Source GenAI & ML Summit Sessions
Wednesday, August 21
 

11:00 HKT

Addressing Challenges of Cross-Architecture Dynamic Migration Over Heterogeneous Acceleration System | 解决异构加速系统上跨架构动态迁移的挑战 - Yanjun Chen, China Mobile
Wednesday August 21, 2024 11:00 - 11:35 HKT
With the surge of application computing demand, the industry has begun to run AI applications on diverse acceleration hardware (GPU, FPGA, NPU...) to gain more processing capability. One key problem with using diverse accelerators is tool-chain and vendor lock-in across the application's dev-to-run process: cross-system (multi-architecture chips + multi-vendor tool chains) application development and migration is hard to achieve. In this presentation, China Mobile will introduce its practices for solving these challenges, allowing AI applications to migrate smoothly among different accelerators. They include a unified abstraction for diverse accelerators, a middle compiler that drives existing compilers (CUDA, ROCm, oneAPI...) to achieve cross-architecture compilation in a single execution, and a runtime supporting dynamic, replaceable linking. The goal is to let applications migrate freely between diverse accelerators without changing development habits; the talk will show the architecture design, open source plans, and a demo.
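For a flavor of what such a unified abstraction could look like, here is a minimal Python sketch. All class and method names below are hypothetical illustrations of hiding vendor tool chains behind one interface, not China Mobile's actual API.

# Hypothetical sketch of a unified accelerator abstraction. None of these
# names come from China Mobile's design; they only illustrate hiding
# vendor tool chains behind a single interface.
from abc import ABC, abstractmethod

class Accelerator(ABC):
    """One interface over diverse accelerators (GPU, FPGA, NPU...)."""

    @abstractmethod
    def compile(self, kernel_source: str) -> bytes:
        """Lower portable kernel source with the vendor's tool chain."""

    @abstractmethod
    def launch(self, binary: bytes, *args) -> None:
        """Run a compiled kernel on this device."""

class CudaAccelerator(Accelerator):
    def compile(self, kernel_source: str) -> bytes:
        # A real backend would drive nvcc / the CUDA driver API here.
        return b"cuda:" + kernel_source.encode()

    def launch(self, binary: bytes, *args) -> None:
        print(f"launching {binary!r} on a CUDA device with args {args}")

class AscendAccelerator(Accelerator):
    def compile(self, kernel_source: str) -> bytes:
        # A real backend would invoke the vendor's tool chain instead.
        return b"ascend:" + kernel_source.encode()

    def launch(self, binary: bytes, *args) -> None:
        print(f"launching {binary!r} on an Ascend NPU with args {args}")

def migrate(kernel_source: str, target: Accelerator, *args) -> None:
    """Recompile and relaunch the same kernel on a new target device."""
    target.launch(target.compile(kernel_source), *args)

# The same application code runs on either backend without changes.
migrate("vector_add", CudaAccelerator(), 1024)
migrate("vector_add", AscendAccelerator(), 1024)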

Speakers
Yanjun Chen

Open Source Expert, China Mobile
Yanjun Chen is an open source expert and CNCF delegate at China Mobile. She is an active contributor to many open source projects and a TSC member of LF Edge Akraino.
Level 1 | Hung Hom Room 3

15:35 HKT

Sit Back and Relax with Fault Awareness and Robust Instant Recovery for Large Scale AI Workloads | 坐和放宽,了解大规模 AI 负载场景下的故障感知和健壮的快速故障恢复 - Fanshi Zhang & Kebe Liu, DaoCloud
Wednesday August 21, 2024 15:35 - 16:10 HKT
Fault tolerance during training, fine-tuning, and even inference is crucial to modern AI workloads that run at large scale across GPU clusters. For training and fine-tuning tasks, failures of GPUs, storage, or other hardware often extend training time significantly, to weeks or even months. For inference, when massive request loads arrive and one of the inference servers goes faulty, we need a policy and a scheduler that mitigate the fault by transferring the workload quickly and efficiently. In this talk, we will introduce a series of mechanisms we have designed to help Kubernetes clusters and the workloads themselves locate failures, diagnose the root cause, and schedule and perform mitigation for any hardware or CUDA API call failure, reducing the overall operating burden. And the possibilities do not stop there: the fault-awareness and mitigation scheduler can help any workload mitigate during failures.
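To make the mitigation idea concrete, here is an illustrative Python sketch of a fault-awareness loop that probes nodes and moves work off a faulty one. The probe and the policy are hypothetical stand-ins, not DaoCloud's actual mechanism.

# Illustrative-only fault-awareness loop; the probe and mitigation policy
# are hypothetical, not the mechanism presented in this talk.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    healthy: bool = True
    workloads: list = field(default_factory=list)

def probe(node: Node) -> bool:
    # A real probe would run GPU diagnostics or trial CUDA API calls.
    return node.healthy

def mitigate(faulty: Node, pool: list) -> None:
    """Drain a faulty node and move its workloads to a healthy peer."""
    for target in pool:
        if target is not faulty and probe(target):
            target.workloads.extend(faulty.workloads)
            print(f"moved {faulty.workloads} from {faulty.name} to {target.name}")
            faulty.workloads.clear()
            return
    print(f"no healthy node available for {faulty.name}; paging an operator")

nodes = [Node("gpu-node-a", workloads=["train-job-0"]), Node("gpu-node-b")]
nodes[0].healthy = False  # simulate a GPU fault
for node in nodes:
    if not probe(node):
        mitigate(node, nodes)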

Speakers
Kebe Liu

Senior Software Engineer, DaoCloud
Member of the Istio Steering Committee, focused on cloud native, Istio, eBPF, and other areas in recent years. Founder of the Merbridge project.
Neko Ayaka

Software Engineer, DaoCloud
Cloud native developer, AI researcher, and Gopher with five years of experience across AI, data science, backend, and frontend development. Co-founder of https://github.com/nolebase
Level 1 | Hung Hom Room 3

16:25 HKT

Simplify AI Infrastructure with Kubernetes Operators | 使用Kubernetes Operators简化AI基础设施 - Ganeshkumar Ashokavardhanan, Microsoft & Tariq Ibrahim, NVIDIA
Wednesday August 21, 2024 16:25 - 17:00 HKT
ML applications often require specialized hardware and additional configuration to run efficiently and reliably on Kubernetes. However, managing the cluster lifecycle and the diversity and complexity of hardware configuration across nodes can be challenging. How can we simplify and automate this process to ensure a smooth experience for Kubernetes users? Kubernetes Operators offer a great solution. In this session, we will go over operators and demonstrate how they can help automate the installation, configuration, and lifecycle management of AI-ready infrastructure end to end, from cluster provisioning and Kubernetes node configuration to deep learning model deployments. We will demo an LLM fine-tuning workload to showcase how existing operators in the ecosystem, such as the Cluster API Operator, GPU Operator, Network Operator, and Kubernetes AI Toolchain Operator, can simplify the infrastructure. Finally, we will discuss challenges and best practices of using operators in production.
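For a flavor of the reconcile pattern these operators implement, here is a minimal sketch using kopf, a Python operator framework. The gpuworkloads.example.com CRD is hypothetical; it stands in for the kind of AI-infrastructure resource such an operator would manage.

# Minimal operator sketch using the kopf framework; run with
# "kopf run operator.py". The gpuworkloads.example.com CRD is hypothetical.
import kopf

@kopf.on.create('example.com', 'v1', 'gpuworkloads')
def configure(spec, name, logger, **kwargs):
    """Reconcile a new GPUWorkload: drivers, device plugin, then deploy."""
    gpu_type = spec.get('gpuType', 'nvidia')
    replicas = spec.get('replicas', 1)
    # A real operator would create DaemonSets/Deployments via the
    # Kubernetes API; here we only log what reconciliation would do.
    logger.info(f"ensuring {gpu_type} driver and device plugin for {name}")
    logger.info(f"deploying model server with {replicas} replica(s)")
    return {'phase': 'Ready'}  # recorded under the resource's status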

Speakers
Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine...
Tariq Ibrahim

Senior Cloud Platform Engineer, NVIDIA
Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA, where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects like kube-state-metrics, Istio...
Level 1 | Hung Hom Room 3

17:15 HKT

Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi | 解锁异构AI基础设施K8s集群:发挥HAMi的力量 - Xiao Zhang, DaoCloud & Mengxuan Li, The 4th Paradigm
Wednesday August 21, 2024 17:15 - 17:50 HKT
With AI's growing popularity, Kubernetes has become the de facto AI infrastructure. However, the increasing number of clusters with diverse AI devices (e.g., NVIDIA, Intel, Huawei Ascend) presents a major challenge. AI devices are expensive, so how can we improve resource utilization? How can these devices integrate better with K8s clusters? Managing heterogeneous AI devices consistently, supporting flexible scheduling policies, and providing observability all bring many challenges. The HAMi project was born for this purpose. This session includes:
* How K8s manages heterogeneous AI devices (unified scheduling, observability)
* How GPU sharing improves device utilization (see the sketch below)
* How to guarantee the QoS of high-priority tasks when GPUs are shared
* Flexible scheduling strategies for GPUs (NUMA affinity/anti-affinity, binpack/spread, etc.)
* Integration with other projects (such as Volcano, scheduler-plugins, etc.)
* Real-world case studies from production-level users
* Remaining challenges and the roadmap
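As a concrete flavor of GPU sharing, below is a sketch of a pod that requests a fraction of one GPU through HAMi. The fractional resource names (nvidia.com/gpumem, nvidia.com/gpucores) follow HAMi's documentation, but the values are illustrative and should be verified against the deployed HAMi version.

# Sketch of a pod sharing one physical GPU via HAMi; resource names follow
# HAMi's docs but should be checked against your HAMi version.
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-share-demo"},
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "nvidia/cuda:12.4.0-base-ubuntu22.04",
            "resources": {"limits": {
                "nvidia.com/gpu": 1,        # one virtual GPU slice
                "nvidia.com/gpumem": 4096,  # MiB of device memory
                "nvidia.com/gpucores": 30,  # ~30% of compute time
            }},
        }],
    },
}

# Convert to YAML/JSON and feed into `kubectl apply -f -`.
print(json.dumps(pod, indent=2))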

Speakers
Xiao Zhang

Senior Technical Lead, DaoCloud
Xiao Zhang is the leader of the Container team (focused on infra, AI, multi-cluster, cluster LCM, and OCI); an active Kubernetes / Kubernetes-sigs contributor and member; a maintainer of Karmada, kubean, and HAMi; a cloud native developer; and a CNCF open source enthusiast. GitHub ID: waw...
Mengxuan Li

Senior Developer, The 4th Paradigm Co., Ltd
Reviewer in the Volcano community and founder of HAMi, a CNCF Landscape project, he is responsible for the development of the GPU virtualization mechanism in Volcano, which has been merged into Volcano's master branch and will be released in v1.8. Speaker at OpenAtom Global Open Source Commit #2023; speaker...
Level 1 | Hung Hom Room 3
 
Thursday, August 22
 

11:50 HKT

VeScale: A PyTorch Native LLM Training Framework | veScale:一个PyTorch原生LLM训练框架 - Hongyu Zhu, ByteDance
Thursday August 22, 2024 11:50 - 12:25 HKT
The era of giant LLMs calls forth distributed training. Despite countless distributed training frameworks published in the past decade, few have excelled in real industry production, as the quality that matters most is often ease of use rather than pure performance. Ease of use rests on two essentials -- PyTorch and automatic parallelism -- because: i) the PyTorch ecosystem dominates, owning 92% of models on HuggingFace, and ii) giant models cannot be trained without complex nD parallelism. Currently, this ease of use is "broken" in industry-level frameworks, which are either not PyTorch-native (TensorFlow/JAX) or not fully automated (Megatron/DeepSpeed/torch). We propose a novel framework that combines PyTorch nativeness and automatic parallelism for scaling LLM training with ease of use: developers write only single-device torch code, and the framework automatically parallelizes it into nD parallelism, handling all the heavy lifting transparently.
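The kind of program the approach targets is ordinary single-device PyTorch, as in the sketch below. Nothing here is veScale-specific; it is simply the unmodified torch code that an automatic-parallelism framework would shard into nD parallelism.

# Plain single-device PyTorch -- the input an auto-parallel framework
# such as veScale aims to scale out without code changes.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = TinyMLP()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 512)            # one device-local batch
loss = model(x).pow(2).mean()      # dummy objective for the sketch
loss.backward()
opt.step()
# An automatic-parallelism framework would shard the module, activations,
# and optimizer states into nD parallelism without touching this code.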

Speakers
Hongyu Zhu

Machine Learning System Software Engineer, ByteDance
Hongyu is a Machine Learning System Engineer in ByteDance's AML group, working on systems and compilers for training workloads. He received his PhD from the University of Toronto, where he worked with Professor Gennady Pekhimenko. He is generally interested in machine learning compilers...
Level 1 | Hung Hom Room 3
 
Friday, August 23
 

10:35 HKT

Breaking Boundaries: TACC as a Unified Cloud-Native Infra for AI + HPC | 打破界限:TACC作为AI + HPC统一云原生基础设施 - Peter Pan, DaoCloud & Kaiqiang Xu, Hong Kong University of Science and Technology
Friday August 23, 2024 10:35 - 11:10 HKT
Large AI models are driving significant investment in GPU clusters. Yet managing these clusters is hard: Slurm-based HPC setups lack management granularity and stability, while Kubernetes poses usability challenges for AI users. This talk introduces TACC, an AI infrastructure management solution that bridges the advantages of both K8s and Slurm setups. It is joint work between computer system researchers at HKUST and leading CNCF contributors at DaoCloud. TACC manages a large-scale cluster at HKUST that has supported over 500 active researchers since 2020. In this talk, we share our five-year journey with TACC, covering:
* [User Experience] A seamless UI for job submission and management, supporting both container and Slurm formats on the same backbone
* [Resource Management] Multi-tenant allocation with configurable strategies, using CNCF HAMi and Kueue (a quota sketch follows below)
* [Performance and Scalability] A robust distributed infrastructure with networked storage and RDMA, via CNCF SpiderPool, Fluid...
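To illustrate the Kueue-based quota layer mentioned above, here is a sketch of a ClusterQueue manifest built in Python. The schema follows Kueue's v1beta1 API, but the flavor name and quota values are illustrative and should be checked against the deployed Kueue version.

# Sketch of a multi-tenant GPU quota via a Kueue ClusterQueue; field
# names follow Kueue's v1beta1 API, values are illustrative.
import json

cluster_queue = {
    "apiVersion": "kueue.x-k8s.io/v1beta1",
    "kind": "ClusterQueue",
    "metadata": {"name": "research-gpu"},
    "spec": {
        "namespaceSelector": {},  # admit workloads from any namespace
        "resourceGroups": [{
            "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
            "flavors": [{
                "name": "default-flavor",
                "resources": [
                    {"name": "cpu", "nominalQuota": 256},
                    {"name": "memory", "nominalQuota": "1Ti"},
                    {"name": "nvidia.com/gpu", "nominalQuota": 32},
                ],
            }],
        }],
    },
}

print(json.dumps(cluster_queue, indent=2))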

Speakers
Peter Pan

VP of R&D Engineering, DaoCloud
DaoCloud R&D Engineering VP; CNCF wg-AI (AI Working Group) member; maintainer of a few CNCF projects (GitHub ID: panpan0000): CloudTTY, KuBean, HwameiStor. Public tech events: 2023 KubeCon SH speaker (https://sched.co/1PTFI); 2023 KubeCon EU Program Committee...
Kaiqiang Xu

Researcher, Hong Kong University of Science and Technology
Level 1 | Hung Hom Room 3

13:20 HKT

Constructing the 10x Efficiency of Cloud-Native AI Infrastructure | 如何让你的 AI 底座效能提升 10 倍? - Peter Pan, DaoCloud & Qiuping Dai, DaoCloud
Friday August 23, 2024 13:20 - 13:55 HKT
Enterprises keep investing in AI. But once GPUs are installed in a data center, a challenge arises: how to construct an "AI cloud" atop bare metal. Even though K8s is recognized as the foundational infrastructure for AI, K8s alone is merely the initial step. Organizations may face challenges:
- Maximizing GPU utilization
- Unifying multi-architecture accelerators/GPUs (K8s DRA)
- Organization quotas and cost management
- Resource isolation among organizations
- Smarter scheduling, tiered GPU allocation, task prioritization...
- Sharing GPU clusters between VMs and containers
- Harnessing the full potential of high-speed networks, storage optimization, and dataset orchestration
Leveraging open source stacks from the Linux Foundation and CNCF, we have experience building AI clouds for IDC or internal usage. We will share that experience to empower communities on their journey towards constructing 10x-efficient cloud-native AI. Refer to the `Additional resources` chapter for more details.

Speakers
Peter Pan

VP of R&D Engineering, DaoCloud
DaoCloud R&D Engineering VP; CNCF wg-AI (AI Working Group) member; maintainer of a few CNCF projects (GitHub ID: panpan0000): CloudTTY, KuBean, HwameiStor. Public tech events: 2023 KubeCon SH speaker (https://sched.co/1PTFI); 2023 KubeCon EU Program Committee...
Qiuping Dai

Product Manager, DaoCloud
QiuPing Dai has been a senior Technology Product Manager at DaoCloud for five years, working on cloud computing (including Kubernetes compute, storage, and networking). Before that, she worked at IBM on cloud computing. QiuPing is interested in storage, networking, scheduling...
Level 1 | Hung Hom Room 2

13:20 HKT

Write Once Run Anywhere, but for GPUs | GPU 时代的“一次编写,到处运行” - Michael Yuan, Second State
Friday August 23, 2024 13:20 - 13:55 HKT
With the popularity of LLM apps, there is an increasing demand for running and scaling AI workloads in the cloud and on edge devices. Rust and Wasm offer a solution by providing a portable bytecode that abstracts hardware complexities. LlamaEdge is a lightweight, high-performance, cross-platform LLM inference runtime. Written in Rust and built on WasmEdge, LlamaEdge provides a standard WASI-NN API to developers. Developers only need to write against the API and compile to Wasm. The Wasm file can run on any device, where WasmEdge translates and routes Wasm calls to the underlying native libraries such as llama.cpp. This talk will discuss the design and implementation of LlamaEdge and show how it enables cross-platform LLM app development and deployment. We will also walk through several code examples, from a basic sentence completion app, to a chatbot, to a RAG agent app with external knowledge in vector databases, to a Kubernetes-managed app across a heterogeneous cluster.
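On the client side, LlamaEdge's api-server exposes an OpenAI-compatible HTTP API, so a deployed app can be exercised with a plain HTTP client, as in the Python sketch below. The port and model name are assumptions about a particular local setup.

# Client sketch against LlamaEdge's OpenAI-compatible API; assumes a
# local api-server on port 8080 with a chat model already loaded.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama-3-8b-chat",  # hypothetical model name
        "messages": [
            {"role": "user", "content": "Explain WASI-NN in one sentence."},
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])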

Speakers
Michael Yuan

Product Manager, Second State
Dr. Michael Yuan is a maintainer of WasmEdge Runtime (a project under CNCF) and a co-founder of Second State. He is the author of five books on software engineering published by Addison-Wesley, Prentice-Hall, and O'Reilly. Michael is a long-time open-source developer and contributor...
Level 1 | Hung Hom Room 3

16:05 HKT

Boosting LLM Development and Training Efficiency: Automated Parallelization with MindSpore | 提升LLM开发和培训效率:MindSpore自动并行化 - Yufeng Lyu, Huawei Technologies Co., Ltd
Friday August 23, 2024 16:05 - 16:40 HKT
With the popularity of LLMs, large-scale pre-training has become an indispensable step in AI research and deployment. However, large-scale distributed parallel training requires developers to consider the many factors affecting the efficiency of model development and training, such as partitioning and communication, and then modify the model accordingly. In this presentation, we will demonstrate an automatic parallelization approach that lets developers focus on algorithm research without intrusive model modifications: distributed training on a large-scale cluster can be achieved simply by configuring strategies. Developers can also use MindSpore's hyperparameter search to find the best parallelization strategy automatically. The parallel strategy obtained through search achieves 90%-110% of expert-tuned performance, significantly reducing the time required for model modifications while efficiently accelerating LLM training.
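As a sketch of what "configuring strategies" can look like, the snippet below uses MindSpore's documented auto-parallel context. Exact argument names and supported search modes vary across MindSpore versions, so treat it as illustrative rather than a definitive recipe.

# Illustrative MindSpore auto-parallel configuration; argument names and
# search modes vary across versions, so verify against your installation.
import mindspore as ms
from mindspore.communication import init

init()  # join the distributed job (e.g. launched via msrun/mpirun)
ms.set_context(mode=ms.GRAPH_MODE)
ms.set_auto_parallel_context(
    parallel_mode="auto_parallel",       # let the framework partition ops
    search_mode="sharding_propagation",  # search for a sharding strategy
    device_num=8,
)
# From here the model is written as ordinary single-device code; the
# framework searches for and applies an nD parallelization strategy.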

Speakers
Yufeng Lyu

Senior Engineer, Huawei Technologies Co., Ltd
Lyu Yufeng, a technical architect at MindSpore and maintainer of the MindNLP framework, focuses his research on natural language processing and distributed parallelism for LLMs. He has extensive experience in the development and implementation of LLM solutions.
Level 1 | Hung Hom Room 3
 
