In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

中级 (Intermediate)
Wednesday, August 21
 

11:00 HKT

Addressing Challenges of Cross-Architecture Dynamic Migration Over Heterogeneous Acceleration System | 解决异构加速系统上跨架构动态迁移的挑战 - Yanjun Chen, China Mobile
Wednesday August 21, 2024 11:00 - 11:35 HKT
With the surge in application computing demand, the industry has begun to run AI applications on diverse acceleration hardware (GPU, FPGA, NPU...) to gain more processing capability. One key problem in using diverse accelerators is tool-chain and vendor lock-in across the application Dev-to-Run process: cross-system (multi-arch chips + multi-vendor tool chains) application development and migration is hard to achieve. In this presentation China Mobile will introduce practices that solve the above challenges and allow AI applications to migrate smoothly among different accelerators. These include a unified abstraction for diverse accelerators, a middle-compiler that drives existing compilers (CUDA, ROCm, oneAPI...) to achieve cross-architecture compilation in the same execution, and a runtime supporting dynamic, replaceable linking. We want to enable applications to migrate freely between diverse accelerators without changing development habits, and will show the architecture design, open source plans, and a demo.
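As a rough Go sketch of what such a unified accelerator abstraction could look like (all names here are hypothetical illustrations of the idea, a vendor-neutral compile/load/run contract, not China Mobile's actual API):

```go
// Hypothetical sketch of a unified accelerator abstraction; all names are
// illustrative, not China Mobile's actual API.
package accel

import "context"

// Kind enumerates accelerator families the runtime can target.
type Kind string

const (
	KindGPU  Kind = "gpu"
	KindFPGA Kind = "fpga"
	KindNPU  Kind = "npu"
)

// Device describes one schedulable accelerator instance.
type Device struct {
	Kind   Kind
	Vendor string // e.g. "nvidia", "amd", "intel"
	Memory uint64 // bytes
}

// Backend hides a vendor tool chain (CUDA, ROCm, oneAPI, ...) behind a
// common compile-and-run contract, so a workload can be re-linked onto a
// different accelerator without source changes.
type Backend interface {
	// Compile lowers vendor-neutral IR into a device binary for the target.
	Compile(ctx context.Context, ir []byte, target Device) ([]byte, error)
	// Load links the binary into the runtime; the returned handle can be
	// swapped at run time, which is what enables dynamic migration.
	Load(ctx context.Context, bin []byte) (Handle, error)
}

// Handle is a loaded, runnable kernel whose link is replaceable.
type Handle interface {
	Run(ctx context.Context, args ...any) error
	Close() error
}
```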

Speakers
Yanjun Chen

Open Source Expert, China Mobile
Yanjun Chen is an open source expert and CNCF delegate at China Mobile. She has been an active contributor to many open source projects and is a TSC member of LF Edge Akraino.
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 3

11:00 HKT

Accelerating Serverless AI Large Model Inference with Functionalized Scheduling and RDMA | 通过功能化调度和RDMA加速无服务器AI大模型推理 - Yiming Li, Tianjin University & Chenglong Wang, Jinan Inspur Data Technology Co., Ltd.
Wednesday August 21, 2024 11:00 - 11:35 HKT
The deployment of AI large models on standard serverless inference platforms like KServe is gaining popularity due to its ability to improve resource utilization and reduce costs. However, existing large-model inference faces significant scheduling and communication bottlenecks, making it challenging to meet low-latency, high-throughput demands. The centralized control plane of Kubernetes leads to low scheduling efficiency and cannot deliver second-level responses to large-scale burst requests. Additionally, large-model inference needs to transfer a GB-scale KV cache for each request, resulting in high communication overhead. We have therefore developed a highly elastic functionalized scheduling framework that guarantees second-level scheduling for thousands of serverless AI large-model inference task instances. Additionally, we leverage RDMA technology to achieve high-speed KV cache migration, avoiding the high overhead caused by traditional network protocol stacks.

Speakers
Cookie

Senior Software Engineer, Jinan Inspur Data Technology Co., Ltd.
I work at Inspur, mainly on container-computing development, and am familiar with container networks, especially Calico and Cilium. I'm also a contributor to the OpenYurt community and mainly participate in the development of the Raven project.
Yiming Li

PhD candidate, Tianjin University
Yiming Li received the bachelor’s and master’s degrees from Tianjin University, China, in 2017 and 2019, respectively. He is currently pursuing the Ph.D. degree with the College of Intelligence and Computing, Tianjin University, China. His research interests include cloud com...
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

11:50 HKT

AI Inference Performance Acceleration: Methods, Tools, and Deployment Workflows | AI推理性能加速:方法、工具和部署工作流程 - Yifei Zhang & 钱磊, Bytedance
Wednesday August 21, 2024 11:50 - 12:25 HKT
As AI rapidly evolves and embraces cloud-native technologies, inference performance has become crucial for application value. GPU selection, serving framework configuration, and model/data loading significantly impact inference efficiency. We'll focus on cloud-native solutions to storage performance issues and tools for evaluating inference performance across configurations, offering optimal deployment setups integrated into cloud-native workflows. We'll discuss inference performance's impact on user experience and how optimization can reduce costs and improve efficiency. Using technologies like Fluid and model optimization, we'll share strategies to enhance inference performance. Based on performance and cost analysis of various GPUs, we'll guide AI engineers in hardware selection. Additionally, we'll introduce a performance testing tool to evaluate and recommend the best model, hardware, and acceleration scheme combinations, aligning with deployment workflows based on test results.

Speakers
Yifei Zhang

Software Engineer, Bytedance
Yifei Zhang, Software Engineer at Volcengine, focuses on technical research and product development in Kubernetes and AI. He has rich experience in public cloud and is now fully working on VKE (Volcengine Kubernetes Engine), the managed Kubernetes product in Volcengine...
钱磊

Software Engineer, Bytedance
A Kubernetes developer at ByteDance, focused on building a stable Kubernetes engine on public cloud.
Wednesday August 21, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

11:50 HKT

Implementing Fine-Grained and Pluggable Container Resource Management Leveraging NRI | 基于 NRI 实现精细化且可插拔的容器资源管理 - Qiang Ren, Intel & He Cao, ByteDance
Wednesday August 21, 2024 11:50 - 12:25 HKT
To overcome Kubernetes' limitations in resource management, ByteDance developed Katalyst, a resource management system. Katalyst employs a range of methodologies, including colocation, node over-commitment, specification recommendation, and tidal colocation, aimed at optimizing cluster resource utilization.

Initially, Katalyst introduced a QoS Resource Manager (QRM) framework within kubelet, facilitating versatile container resource allocation through a plugin architecture. Presently, the Node Resource Interface (NRI) presents a refined alternative.

This session elucidates how Katalyst leverages NRI for fine-grained and adaptable container resource management, ensuring efficiency without intrusive modifications to upstream components. This novel architecture allows Katalyst to seamlessly integrate with native Kubernetes, offering a user-friendly and easily maintainable solution.
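For readers unfamiliar with NRI, the sketch below shows the general shape of such a plugin in Go, assuming the github.com/containerd/nri stub API; exact method signatures vary across NRI versions, and the QoS label and cpuset values are hypothetical rather than Katalyst's actual policy:

```go
// Minimal NRI plugin sketch; illustrative only, not Katalyst's code.
package main

import (
	"context"

	"github.com/containerd/nri/pkg/api"
	"github.com/containerd/nri/pkg/stub"
)

type plugin struct{}

// CreateContainer is invoked synchronously before a container starts,
// letting the plugin adjust resources without patching kubelet.
func (p *plugin) CreateContainer(ctx context.Context, pod *api.PodSandbox, ctr *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
	adjust := &api.ContainerAdjustment{}
	// Example policy: pin low-priority (reclaimed) pods away from cores
	// reserved for latency-critical services. The label is hypothetical.
	if pod.Labels["qos.katalyst/tier"] == "reclaimed" {
		adjust.SetLinuxCPUSetCPUs("8-15")
	}
	return adjust, nil, nil
}

func main() {
	p := &plugin{}
	s, err := stub.New(p, stub.WithPluginName("resource-tuner"))
	if err != nil {
		panic(err)
	}
	if err := s.Run(context.Background()); err != nil {
		panic(err)
	}
}
```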

Speakers
Qiang Ren

Software Engineer, Intel
Ren Qiang works as a Cloud Orchestration Software Engineer in SATG, Intel. He mainly focuses on cloud native technologies in the runtime, actively participates in open source projects, and is committed to promoting the development of runtime and resource isola...
He Cao

Senior Software Engineer, ByteDance
He Cao is a senior software engineer on the Cloud Native team at ByteDance, a maintainer of Katalyst and KubeZoo, and a member of Istio. He has 5+ years of experience in the cloud native area. Since joining ByteDance, he has designed and implemented several critical systems for VKE...
Wednesday August 21, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 2

13:50 HKT

Boundaryless Computing: Optimizing LLM Performance, Cost, and Efficiency in Multi-Cloud Architecture | 无边界计算:在多云架构中优化LLM性能、成本和效率 - Jian Zhu, Red Hat & Kai Zhang, Alibaba Cloud Intelligence
Wednesday August 21, 2024 13:50 - 14:25 HKT
For large language model (LLM) inference, GPU resources within a single data center or cloud region often cannot meet all user demands. Additionally, for the end-users, deploying across multiple geographic regions is necessary to provide an optimal user experience. However, managing model distribution, synchronization, and consistency across multiple regions presents new challenges. To address this, the OCM and Fluid communities have collaborated to automate the multi-region distribution of inference applications through OCM's multi-cluster application deployment capabilities, combined with Fluid's data orchestration capabilities. This automation facilitates the cross-regional distribution and pre-warming of large models, enhancing the efficiency of model deployment and upgrades.

Speakers
Kai Zhang

Senior Staff Engineer, Alibaba
Kai Zhang is a Senior Staff Engineer at Alibaba Cloud Intelligence, where he has been part of the team developing the Alibaba Cloud container service for Kubernetes (ACK) for over 6 years. He currently leads ACK’s Cloud native AI product and solution offerings. Before this, he spent...
Jian Zhu

Senior Software Engineer, RedHat
Zhu Jian is a senior software engineer at RedHat and a core contributor to the Open Cluster Management project. Jian enjoys solving multi-cluster workload distribution problems and extending OCM with add-ons.
Wednesday August 21, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

13:50 HKT

Kubespray Unleashed: Navigating Bare Metal Services in Kubernetes for LLM and RAG | Kubespray大放异彩:在Kubernetes中为LLM和RAG部署裸金属服务 - Kay Yan, DaoCloud & Alan Leung, Equinix
Wednesday August 21, 2024 13:50 - 14:25 HKT
Kubespray, popular within the SIG-Cluster-Lifecycle of Kubernetes, is celebrated for deploying production-ready Kubernetes clusters, particularly on bare metal, which boosts performance for AI workloads like LLM and RAG. This session will explore using Kubespray in bare metal settings, addressing challenges, and sharing best practices. The first part of the talk will show Kubespray's key features and provide practical tips. The latter half will focus on swiftly deploying AI using Retrieval-Augmented Generation (RAG), demonstrating how Kubespray facilitates setting up Kubernetes clusters on bare metal. This setup enhances AI applications by integrating continuous knowledge updates and domain-specific information via RAG, improving the accuracy and credibility of the AI systems. The session will conclude with discussions on community engagement and future advancements, followed by a Q&A period to address participant queries.

Speakers
Kay Yan

Principal Software Engineer, DaoCloud
Kay Yan is a Kubespray maintainer and a containerd/nerdctl maintainer. He is a Principal Software Engineer at DaoCloud and has developed the DaoCloud Enterprise Kubernetes Platform since 2016.
Alan Leung

Digital Technical Specialist, Equinix
Alan is a Digital Technical Specialist at Equinix with a focus on enabling customers, prospects and partners to develop innovative solutions to solve business challenges at the digital edge.
Wednesday August 21, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 2

14:05 HKT

⚡ Lightning Talk: How Prometheus AI Agent Helps Build Interactive Monitoring? | ⚡ 闪电演讲: Prometheus AI代理如何帮助构建交互式监控? - Zhihao Liu, Quwan
Wednesday August 21, 2024 14:05 - 14:10 HKT
In day-to-day work, both SREs and developers often struggle with observability tools like Prometheus, mainly due to the complex PromQL syntax and disorganized metrics. This talk will showcase how to build such an agent: one with the ability to think, act, and analyze like a human, solving user issues through conversation. The talk presents two main standout ideas: 1. Leveraging RAG technology, the agent performs multi-path retrieval from local metric knowledge, the Prometheus API, request logs, and public domain knowledge to produce a consolidated answer. 2. Using the ReAct method, it engages in multi-round dialogues to refine and generate correct PromQL, call APIs, and render the dashboards returned. From this talk, we hope the audience will learn: 1. How to integrate LLMs effectively within the observability space. 2. The steps to create an easy-to-use and practical Prometheus AI agent. 3. Experience and insights from practical examples of the Prometheus AI agent.
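A minimal sketch of the ReAct-style refinement loop described above, assuming a generic LLM client (askLLM is a hypothetical stand-in) and the standard Prometheus HTTP query API:

```go
// ReAct-style loop sketch: the LLM proposes PromQL, the agent executes it
// against Prometheus, and errors are fed back as observations for the
// next round. Illustrative only.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func askLLM(prompt string) string {
	// Placeholder: call your LLM of choice with the prompt plus retrieved
	// context (metric docs, request logs, public domain knowledge, ...).
	return `sum(rate(http_requests_total[5m])) by (service)`
}

func queryProm(base, promql string) (string, error) {
	resp, err := http.Get(base + "/api/v1/query?query=" + url.QueryEscape(promql))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("prometheus: %s", body)
	}
	return string(body), nil
}

func main() {
	question := "What is the request rate per service?"
	prompt := question
	for round := 0; round < 3; round++ { // bounded multi-round dialogue
		promql := askLLM(prompt)
		result, err := queryProm("http://prometheus:9090", promql)
		if err != nil {
			// ReAct: surface the observation (the error) back to the model.
			prompt = fmt.Sprintf("%s\nPrevious PromQL %q failed: %v", question, promql, err)
			continue
		}
		fmt.Println(result) // render the dashboard from this result
		return
	}
}
```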

Speakers
Zhihao Liu

Senior Devops Engineer, Quwan
Zhihao Liu has three years of experience in the observability field and has been involved in the development of the company's observability platform.
Wednesday August 21, 2024 14:05 - 14:10 HKT
Level 1 | Hung Hom Room 1
  ⚡ Lightning Talks | ⚡ 闪电演讲, Observability

14:40 HKT

⚡ Lightning Talk: Kubernetes Raises Questions. Can a PaaS Answer Them? | ⚡ 闪电演讲: Kubernetes引发了问题。 PaaS能解答吗? - Ram Iyengar, Cloud Foundry Foundation
Wednesday August 21, 2024 14:40 - 14:45 HKT
The enormous success of the CNCF Landscape has produced an overwhelming number of options in the space, and organizations struggle to establish their platforms quickly. This talk will help guide the community through the thought process of building these platforms, explore some examples of what a healthy source-driven platform ecosystem looks like, and showcase the power that a good cloud native platform delivers to an organization. Though there are variations of platforms (i.e., data, application, machine learning, etc.), many start to have the same problems. These include artifact management, secrets management, TLS certificates, cloud permissions, and so on. Providing turnkey solutions for platforms that can be ready in minutes adds much velocity to engineering teams across organizations that adopt the platform engineering model.

Speakers
Ram Iyengar

Chief Evangelist, Cloud Foundry Foundation
Ram Iyengar is an engineer by practice and an educator at heart. He was (cf) pushed into technology evangelism along his journey as a developer and hasn’t looked back since! He enjoys helping engineering teams around the world discover new and creative ways to work. He is a proponent...
Wednesday August 21, 2024 14:40 - 14:45 HKT
Level 1 | Hung Hom Room 1

14:40 HKT

Best Practice: Karmada & Istio Improve Workload & Traffic Resilience of Production Distributed Cloud | 最佳实践:Karmada和Istio提高生产分布式云的工作负载和流量弹性 - Chaomeng Zhang, Huawei
Wednesday August 21, 2024 14:40 - 15:15 HKT
The distributed cloud offers better resilience by providing redundancy, scalability and flexibility, especially for cloud native applications. However, the complexity of multi-cluster workload and traffic management in hybrid or multi-cloud environments brings huge challenges in practice, such as a drop in the number of multi-cluster workload instances serving customer requests when unhealthy instances are isolated after failures. In this talk, Chaomeng introduces a production practice in which Karmada and Istio work together to improve the resilience of multi-cluster applications: how Karmada and Istio policies configured in a centralized control plane automatically control both replica and traffic distribution across clusters; how, in case of failures, Istio's failover removes unhealthy endpoints from the global load-balancing pool; and how Karmada rebuilds the corresponding number of instances in other healthy clusters, ensuring multi-cluster instances always meet the capacity design.

Speakers
Chaomeng Zhang

Architect of UCS (HUAWEI Distributed Cloud Native), Huawei
Zhang Chaomeng is the architect of UCS (HUAWEI Distributed Cloud Native) and has 9 years of cloud-computing design and development experience at HUAWEI Cloud, including service mesh, Kubernetes, microservices, cloud service catalog, big data, APM, cloud computing reliability and...
Wednesday August 21, 2024 14:40 - 15:15 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Connectivity

14:40 HKT

Scaling Kubernetes: Best Practices for Managing Large-Scale Batch Jobs with Spark and Argo Workflow | 扩展Kubernetes:管理大规模批处理作业的最佳实践与Spark和Argo工作流 - Yu Zhuang & Liu Jiaxu, Alibaba Cloud
Wednesday August 21, 2024 14:40 - 15:15 HKT
Are you managing large-scale batch jobs on Kubernetes, like data processing with Spark applications or genomics computing with Argo workflows? To complete these jobs promptly, a significant number of pods have to be scaled out and in quickly for parallel computation, which puts big pressure on the Kubernetes control plane. In this talk, we will use Spark and Argo workflows as examples to guide you through building a Kubernetes cluster that supports frequently creating and deleting 20,000 pods. Our focus will be on tuning the Kubernetes control plane, including optimizing the list-watch mechanism, service broadcasting, environment variable attachment, and API server configuration. Additionally, we'll share some best practices for configuring the Spark operator and the Argo workflows controller.
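Two of the client-side knobs implied here can be sketched in Go with client-go: raising the per-client QPS/burst for controllers that churn thousands of pods, and sharing one list-watch stream via an informer factory. The kubeconfig path and numbers below are placeholders, not the speakers' recommended values:

```go
// Client-side tuning sketch for high-churn batch workloads.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	cfg.QPS = 100   // the default of 5 is far too low for 20,000-pod churn
	cfg.Burst = 200 // allow short bursts during scale-out

	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer factory: every controller in the process reuses
	// the same watch connection and local cache instead of issuing its
	// own LIST/WATCH against the API server.
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	// Register the pod informer before Start so it joins the shared stream.
	factory.Core().V1().Pods().Informer()

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // run until stopped
}
```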

Speakers
Liu Jiaxu

Senior Engineer, Alibaba Cloud
Jiaxu Liu is a Senior Engineer on the Container Service Team at Alibaba Cloud. He specializes in observability enhancement and large-scale cluster management and optimization for Alibaba Cloud's container service offerings. Before joining Alibaba Cloud, he worked at Nokia as a Senior...
Wednesday August 21, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 2

14:50 HKT

⚡ Lightning Talk: Running Native WebAssembly AI Applications Everywhere | ⚡ 闪电演讲: 在任何地方运行原生WebAssembly人工智能应用程序 - Tiejun Chen, VMware
Wednesday August 21, 2024 14:50 - 14:55 HKT
In recent years WASM has been one of the hottest topics in the world of computing due to its portability, small size, fast loading, and compatibility. Given these advantages, WebAssembly's sandboxed design makes it an ideal technology for modern applications, including ML/AI. Beyond the browser, however, WebAssembly can currently only leverage the CPU to accelerate ML/AI. Here we offer a flexible way to run ML/AI on WebAssembly over a variety of AI accelerators by empowering WASM with a transparent backend interposer. With this, your native ML/AI WebAssembly workloads can seamlessly enjoy underlying AI accelerators such as CPUs, GPUs, and FPGAs with the best performance. During this presentation we will also show our latest implementation with demos to help users gain direct insight into running ML/AI with WebAssembly on AI accelerators.

Speakers
Tiejun Chen

Sr. Technical Lead, VMware
Tiejun Chen was a senior technical lead. He has worked at several tech companies such as VMware, Intel, and Wind River Systems, on cloud native, edge computing, ML/AI, RISC-V, WebAssembly, etc. He has given many presentations, at AI.Dev NA 2023, KubeCon China 2021, Kube...
Wednesday August 21, 2024 14:50 - 14:55 HKT
Level 1 | Hung Hom Room 1

15:35 HKT

Sit Back and Relax with Fault Awareness and Robust Instant Recovery for Large Scale AI Workloads | 坐和放宽,了解大规模 AI 负载场景下的故障感知和健壮的快速故障恢复 - Fanshi Zhang & Kebe Liu, DaoCloud
Wednesday August 21, 2024 15:35 - 16:10 HKT
Fault tolerance during training, fine-tuning, and even inference is crucial for modern AI workloads running at large scale on big GPU clusters. For training and fine-tuning tasks, failures of GPUs, storage, or other hardware often significantly extend training time, to weeks and even months. For inference, when massive loads of requests come in and one of the inference servers goes faulty, we need a policy and scheduler that mitigate by transferring the workloads quickly and efficiently. In this talk, we will introduce a series of mechanisms we have designed to help Kubernetes clusters and the workloads themselves locate and diagnose the root cause, then schedule and perform mitigation for any hardware or CUDA API call failure, reducing the overall operating challenges. The possibilities do not stop there: the fault-awareness and mitigation scheduler will help any workload mitigate during failures.
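One plausible building block, not the speakers' implementation, is a watcher that cordons nodes when a GPU health condition goes bad, so the scheduler moves work away; the "GPUUnhealthy" condition type below is hypothetical:

```go
// Fault-awareness sketch: cordon nodes whose (hypothetical) GPU health
// condition turns unhealthy, so new work is scheduled elsewhere.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	w, err := client.CoreV1().Nodes().Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		node, ok := ev.Object.(*corev1.Node)
		if !ok {
			continue
		}
		for _, cond := range node.Status.Conditions {
			// "GPUUnhealthy" would be published by a device health checker.
			if cond.Type == "GPUUnhealthy" && cond.Status == corev1.ConditionTrue && !node.Spec.Unschedulable {
				node.Spec.Unschedulable = true // cordon the faulty node
				if _, err := client.CoreV1().Nodes().Update(context.Background(), node, metav1.UpdateOptions{}); err != nil {
					fmt.Println("cordon failed:", err)
				}
			}
		}
	}
}
```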

Speakers
Kebe Liu

Senior software engineer, DaoCloud
Member of the Istio Steering Committee, focused on cloud native, Istio, eBPF, and other areas in recent years. Founder of the Merbridge project.
Neko Ayaka

Software Engineer, DaoCloud
Cloud native developer, AI researcher, Gopher with 5 years of experience in loads of development fields across AI, data science, backend, frontend. Co-founder of https://github.com/nolebase
Wednesday August 21, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 3

15:35 HKT

Tackling Operational Time-to-Market Decelerators in AI/ML Projects | 应对人工智能/机器学习项目中的运营时间市场减速器 - Adrian Matei & Andreea Munteanu, Canonical
Wednesday August 21, 2024 15:35 - 16:10 HKT
In the competitive AI market, Time To Market (TTM) is crucial for success. Ensuring secure, scalable, and compliant ML infrastructures often slows TTM due to the complexities of updates, patches, monitoring, and security enforcement. This leads to decreases in ROI, profitability, reproducibility, and competitive edge. To address this, companies can engage Managed Service Providers (MSPs) to offload operational burdens and focus on innovation, yet selecting the right MSP requires consideration of expertise, automation capabilities, and compliance adherence. This presentation explores the AI operational landscape, highlighting indicators and challenges in MSP collaboration. We will focus on the management of open source tools like Kubeflow and MLflow across hybrid and multicloud environments. By understanding operational excellence in AI and available options to achieve it, attendees will gain insights into choosing an approach that aligns with their greater objectives.

Speakers
Andreea Munteanu

AI Product Manager, Canonical
Andreea Munteanu is a Product Manager at Canonical, leading the MLOps area. With a background in Data Science in various industries, she used AI techniques to enable enterprises to benefit from their initiatives and make data-driven decisions. Nowadays, Andreea is looking to help...
Adrian Matei

Product Manager, Canonical
With a degree in Information Management for Business, Adrian is now guiding Canonical’s open-source operational management toolset as Product Manager. He has been working in open source operations for the past two years, having previously accumulated experience in technology consulting...
Wednesday August 21, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 2

16:25 HKT

Simplify AI Infrastructure with Kubernetes Operators | 使用Kubernetes Operators简化AI基础设施 - Ganeshkumar Ashokavardhanan, Microsoft & Tariq Ibrahim, NVIDIA
Wednesday August 21, 2024 16:25 - 17:00 HKT
ML applications often require specialized hardware and additional configuration to run efficiently and reliably on Kubernetes. However, managing the cluster lifecycle and the diversity and complexity of hardware configuration across nodes can be challenging. How can we simplify and automate this process to ensure a smooth experience for Kubernetes users? Kubernetes operators offer a great solution. In this session, we will go over operators and demonstrate how they can help automate the installation, configuration, and lifecycle management of AI-ready infrastructure end to end, from cluster provisioning and Kubernetes node configuration to deep-learning model deployments. We will demo an LLM fine-tuning workload to showcase how existing operators in the ecosystem, such as the Cluster API Operator, GPU Operator, Network Operator, and Kubernetes AI Toolchain Operator, can be used to simplify the infrastructure. Finally, we will discuss challenges and best practices of using operators in production.
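To make the operator pattern concrete, here is a bare-bones controller-runtime reconciler; the labeling policy is a toy stand-in for what the real operators named above do (driver installation, device plugins, monitoring, and so on):

```go
// Toy operator sketch: reconcile nodes toward a desired state by
// labeling GPU nodes so workloads can select them.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type NodeLabeler struct {
	client.Client
}

// Reconcile is called for every node change; it converges the node
// toward the desired state instead of reacting to individual events.
func (r *NodeLabeler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var node corev1.Node
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if _, ok := node.Status.Capacity["nvidia.com/gpu"]; ok {
		if node.Labels == nil {
			node.Labels = map[string]string{}
		}
		node.Labels["accelerator"] = "gpu"
		if err := r.Update(ctx, &node); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	r := &NodeLabeler{Client: mgr.GetClient()}
	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Node{}).Complete(r); err != nil {
		panic(err)
	}
	if err := mgr.Start(context.Background()); err != nil {
		panic(err)
	}
}
```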

Speakers
Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine...
Tariq Ibrahim

Senior Cloud Platform Engineer, NVIDIA
Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects like kube-state-metrics, Istio...
Wednesday August 21, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 3

16:25 HKT

Istio and Modern API Gateways: Navigating the Future of Service Meshes | Istio和现代API网关:引领服务网格的未来 - Jimmy Song & Jianpeng He, Tetrate; Jiaqi Zhang, Alibaba Cloud; Jintao Zhang, Kong Inc.; Xunzhuo Liu, Tencent
Wednesday August 21, 2024 16:25 - 17:00 HKT
Join our esteemed panel of experts as they delve into the latest advancements and integrations in the world of Istio and API gateways. This discussion, led by Jimmy Song of Tetrate, founder of the China Cloud Native Community, will feature insights from core contributors and thought leaders including Jianpeng He (Tetrate), Jintao Zhang (Kong), Xunzhuo Liu (Tencent), and Jiaqi Zhang (Alibaba Cloud). The panel will explore Istio's recent developments such as Ambient Mesh, sidecar-less architectures, and the application of eBPF, along with the evolving role of Envoy Gateway. Participants will gain an in-depth understanding of how API gateways are blending with service meshes to create more dynamic, efficient, and secure cloud-native environments.

Speakers
Jintao Zhang

Sr. SE, Kong
Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes Ingress-NGINX maintainer. He specializes in cloud-native technology and the Azure technology stack, and works for Kong Inc.
Jimmy Song

Developer Advocate, Tetrate
Jimmy Song is a developer advocate at Tetrate, CNCF Ambassador, and Cloud Native Community founder. He is an outstanding translator, author, and producer for PHEI, and an early adopter and evangelist of Kubernetes and Istio. Previously, he worked at iFlytek, TalkingData, and Ant Group.
Xunzhuo

Software Engineer, Tencent
Xunzhuo Liu, Software Engineer working at Tencent Kubernetes Engine Team. He is an Open Source Enthusiast, focusing on API Gateway, Service Mesh, and Kubernetes Networking. He is the steering committee member, core maintainer of Envoy Gateway, also maintaining a couple of CNCF projects...
Jianpeng He

Software Engineer, Tetrate
Jianpeng is a core maintainer of Istio and co-leader of its Extensions and Telemetry working group. He has been working on Istio for almost 3 years and is a maintainer of Envoy Gateway.
Jiaqi Zhang

software engineer, Alibaba Cloud
Zhang Jiaqi works on Alibaba Cloud Service Mesh as a software engineer, focusing on traffic management and telemetry, after graduating from the School of Computer Science, Peking University. He has participated in several computer-software academic conferences and is keen...
Wednesday August 21, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

16:25 HKT

Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training | 利用拓扑建模和拓扑感知调度加速LLM训练 - Yang Wang, Huawei
Wednesday August 21, 2024 16:25 - 17:00 HKT
In the LLM training and inference era, the bottleneck has shifted from computing to the network. High-throughput, low-latency interconnect technologies are widely used, e.g. NVLink and NVSwitch, to build hyper computers such as NVIDIA Super PODs, Google multi-slices, and AWS placement groups. However, Kubernetes has not yet addressed topology awareness efficiently, resulting in low performance when sub-optimal resources are provisioned. This talk will explore inter-node communication and intra-node resource interconnects, and analyze how these two topological factors impact the runtime performance of AI workloads, especially large language model training. The talk will cover:
- How to model the topology of underlying resources like NUMA, racks, super pods, and hyper computers
- How to make the scheduler topology-aware and produce the best scheduling decisions
- How to coordinate topology-aware scheduling with DRA on the node
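As an illustration of the scheduler side, a topology-aware score plugin for the Kubernetes scheduler framework might look like the sketch below; the interconnect-domain lookup is a hypothetical placeholder for the talk's topology model:

```go
// Sketch of a topology-aware score plugin for the scheduler framework;
// registration into a scheduler binary is omitted.
package topologyscore

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type TopologyScore struct{}

func (t *TopologyScore) Name() string { return "TopologyScore" }

// Score favors nodes in the same high-bandwidth domain (e.g. one NVLink
// island or rack) as the pods already placed for this job, so allreduce
// traffic stays on the fast interconnect.
func (t *TopologyScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	if sameInterconnectDomain(pod, nodeName) {
		return framework.MaxNodeScore, nil
	}
	return 0, nil
}

func (t *TopologyScore) ScoreExtensions() framework.ScoreExtensions { return nil }

// sameInterconnectDomain would consult a topology model (NUMA, rack,
// super pod) built from node labels or a CRD; hard-wired here.
func sameInterconnectDomain(pod *v1.Pod, nodeName string) bool {
	return false
}
```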

Speakers
Yang Wang

Senior engineer and maintainer of Volcano, Huawei Cloud Technologies Co., LTD
Volcano maintainer and speaker at KCD and GOTC. Focused on cloud native scheduling and multi-cluster management.
Wednesday August 21, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

17:15 HKT

Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi | 解锁异构AI基础设施K8s集群:发挥HAMi的力量 - Xiao Zhang, DaoCloud & Mengxuan Li, The 4th Paradigm
Wednesday August 21, 2024 17:15 - 17:50 HKT
With AI's growing popularity, Kubernetes has become the de facto AI infrastructure. However, the increasing number of clusters with diverse AI devices (e.g., NVIDIA, Intel, Huawei Ascend) presents a major challenge. AI devices are expensive: how can resource utilization be improved? How can these devices integrate better with K8s clusters? Managing heterogeneous AI devices consistently, supporting flexible scheduling policies, and providing observability all bring many challenges. The HAMi project was born for this purpose. This session includes:
* How K8s manages heterogeneous AI devices (unified scheduling, observability)
* How to improve device usage through GPU sharing
* How to ensure the QoS of high-priority tasks in GPU-sharing scenarios
* Flexible scheduling strategies for GPUs (NUMA affinity/anti-affinity, binpack/spread, etc.)
* Integration with other projects (such as Volcano, scheduler-plugins, etc.)
* Real-world case studies from production-level users
* Some remaining challenges and the roadmap
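For flavor, a pod requesting a slice of a shared GPU might be constructed as in the sketch below; the gpumem/gpucores resource names follow HAMi's documented vGPU convention, but treat them as an assumption to verify against your HAMi version:

```go
// Illustrative only: a pod spec asking HAMi for a slice of a GPU.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SharedGPUPod requests one vGPU slice with 8000 MiB of device memory
// and 50% of the SM cores, letting several pods share a physical GPU.
func SharedGPUPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "vgpu-demo"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "cuda-app:latest", // hypothetical image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"nvidia.com/gpu":      resource.MustParse("1"),
						"nvidia.com/gpumem":   resource.MustParse("8000"),
						"nvidia.com/gpucores": resource.MustParse("50"),
					},
				},
			}},
		},
	}
}
```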

Speakers
xiaozhang

Senior Technical Lead, DaoCloud
Xiao Zhang is the leader of the Container team (focused on infra, AI, multi-cluster, cluster LCM, OCI), an active Kubernetes / Kubernetes-sigs contributor and member, a Karmada, kubean, and HAMi maintainer, a cloud-native developer, and a CNCF open source enthusiast. GitHub ID: waw...
Mengxuan Li

Senior Developer, The 4th Paradigm Co., Ltd
Reviewer in the Volcano community and founder of HAMi, a CNCF Landscape project. Responsible for the development of the GPU virtualization mechanism in Volcano, which has been merged into Volcano's master branch and will be released in v1.8. Speaker at OpenAtom Global Open Source Commit #2023; speaker...
Wednesday August 21, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 3

17:15 HKT

Multi-Cluster Networking and Service Discovery Leveraging NRI | 利用NRI的多集群网络和服务发现 - LingMing Xia, Purple Mountain Laboratories & Di Xu, Xiaohongshu
Wednesday August 21, 2024 17:15 - 17:50 HKT
Connection and service discovery are usually key challenges for multi-cluster management; existing solutions such as Submariner introduce preconditions of public IPs and a specific CNI. This is problematic for projects like the "East-to-West Computing Resource Transfer Project", where clusters lack public IPs and have diverse CNIs due to different ownership. This session introduces a solution that establishes an independent, unified parallel network for east-west traffic across clusters based on the Node Resource Interface (NRI), avoiding intrusive modifications to clusters and limitations on CNI choice. A hybrid approach is provided for inter-cluster traffic: clusters can communicate through a hub cluster with a public IP, or connect directly if a public IP is available. Moreover, cross-cluster service discovery follows the MCS standard to ensure seamless service access. All functionality remains agnostic to Kubernetes and applications. A live demo will be shown in this session.
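The MCS side of this can be sketched with the upstream sigs.k8s.io/mcs-api types: exporting a Service from one cluster makes it resolvable in peers through a ServiceImport. The cluster wiring (hub relay vs. direct connection) described in the talk is out of scope for this sketch:

```go
// MCS export sketch: declaring intent to share a Service cluster-wide.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	mcsv1a1 "sigs.k8s.io/mcs-api/pkg/apis/v1alpha1"
)

func main() {
	// Creating this object exports "payments" from the local cluster.
	// A controller watching ServiceExports propagates it; consumers then
	// resolve payments.ns.svc.clusterset.local via the ServiceImport the
	// MCS implementation materializes in peer clusters.
	export := &mcsv1a1.ServiceExport{
		ObjectMeta: metav1.ObjectMeta{Namespace: "ns", Name: "payments"},
	}
	fmt.Printf("would create ServiceExport %s/%s\n", export.Namespace, export.Name)
}
```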

Speakers
Di Xu

Principal Software Engineer, Xiaohongshu
Currently, he serves as a Tech Lead at Xiaohongshu, where he leads a team focused on building a highly reliable and scalable container platform. He is the founder of the CNCF Sandbox project Clusternet and a top-50 code contributor in the Kubernetes community. He has spoken at many...
Lingming

Researcher, Purple Mountain Laboratories
Focusing on subjects such as cloud-native and distributed clouds. I am currently working as a researcher in the New Computing Architecture Research group of Purple Mountain Laboratories.
Wednesday August 21, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity
 
Thursday, August 22
 

11:00 HKT

A Story of Managing Kubernetes Watch Events End-to-End Flow in Extremely Large Clusters | 在极大规模集群中管理Kubernetes watch事件端到端流程的故事 - Bo Tang, Ant Group
Thursday August 22, 2024 11:00 - 11:35 HKT
The K8s watch mechanism has not been given the attention it deserves for an extended period. However, it is critical to the stability and performance of a K8s cluster, and watch latency is a perfect indicator of cluster health. This talk begins by introducing the measurement of watch event latency and then defines watch SLI and SLO metrics. Using the watch SLO as a guide, the talk will show the bottleneck identification process for watching. It will then describe the optimizations made to the apiserver, etcd, kubelet, controller-runtime, and clients such as controllers and schedulers in various aspects of watching, including watch latency, pod provisioning time, bandwidth, and CPU/memory. With these optimizations, daily P99 watch latency has improved by over 90% in large clusters (~20K nodes), impacting billions of watch events. Pod provisioning time has improved by over 60%. Apiserver bandwidth has decreased by 50%. The overall stability of the K8s cluster has improved greatly.
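A rough sketch of one way to approximate such a watch-latency SLI from the client side, using client-go and the object's managedFields timestamps; this is an illustration, not Ant Group's exact measurement method:

```go
// Approximate watch latency as (time the client receives the event)
// minus (the object's last server-recorded update time).
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	w, err := client.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		// Latest write time recorded by the API server across managers.
		var last time.Time
		for _, mf := range pod.ManagedFields {
			if mf.Time != nil && mf.Time.Time.After(last) {
				last = mf.Time.Time
			}
		}
		if !last.IsZero() {
			fmt.Printf("%s %s/%s latency=%v\n", ev.Type, pod.Namespace, pod.Name, time.Since(last))
		}
	}
}
```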

Speakers
Bo Tang

Senior Engineer, Ant Group
Bo Tang is a senior engineer in Ant Group. He is currently working on scalability and performance optimization of Kubernetes clusters.
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 2

11:00 HKT

Dollars and PPM's - Carbon Emissions and Cloud Spend | 美元和PPM - 碳排放和云支出 - Bryan Oliver, Thoughtworks
Thursday August 22, 2024 11:00 - 11:35 HKT
Cloud Carbon emissions are unfortunately not the priority of most enterprises. Costs, however, are. In the Cloud Native space, there is an ever-growing list of spend tracking and reduction tools. In this talk, we'll discuss several strategies you can adopt to unify the prioritization of cloud costs and carbon impact. We want to show how you can align with your business goal of simultaneously reducing cloud spend and overall carbon emissions.

Speakers
Bryan Oliver

Principal, Thoughtworks
Bryan is an experienced engineer and leader who designs and builds complex distributed systems. He has spent his career developing mobile and back-end systems whilst building autonomous teams. More recently he has been focused on delivery and cloud native at Thoughtworks. In his free...
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

11:00 HKT

OpenYurt & Dragonfly: Enhancing Efficient Distribution of LLMs in Cloud-Edge Collaborative Scenarios | OpenYurt和Dragonfly:增强云边协作场景中LLM的高效分发 - Linbo He, Alibaba Cloud & Jim Ma, Ant Group
Thursday August 22, 2024 11:00 - 11:35 HKT
As LLMs continue to grow in size, their deployment and delivery in cloud-edge environments are faced with substantial challenges, especially within edge computing settings that encompass multiple sites with thousands of edge nodes. In this presentation, we will explore how to efficiently distribute LLM applications across dispersed edge nodes using OpenYurt. We will also delve into how Dragonfly’s P2P image distribution technology can address the issue of public network bandwidth consumption encountered during cross-site transmission, reducing public network traffic consumption by up to 90% compared to conventional LLM distribution, and achieving rapid and efficient sharing of LLMs in physically isolated environments. During this presentation, container service experts from Alibaba Cloud and Ant Group will share this solution and introduce the practical application of combining OpenYurt with Dragonfly in edge computing scenarios for LLMs.

Speakers
Jim Ma

Senior Engineer, Ant Group
Kubernetes enthusiast at Ant Group, diving deep into Kubernetes CSI storage, OCI image distribution and maintaining CNCF Dragonfly.
Linbo He

Senior Software Engineer, Alibaba Cloud
I am a member of the Alibaba Cloud Container Service team and one of the founding contributors to the OpenYurt project. Since 2015, I have been actively engaged in the design, development, and open-source initiatives related to Kubernetes. I have taken on responsibilities in a variety...
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

11:50 HKT

VeScale: A PyTorch Native LLM Training Framework | veScale:一个PyTorch原生LLM训练框架 - Hongyu Zhu, ByteDance
Thursday August 22, 2024 11:50 - 12:25 HKT
Today's era of giant LLMs calls for distributed training. Despite countless distributed training frameworks published in the past decade, few have excelled in real industry production, as the quality favored most is often ease of use rather than pure performance. Ease of use rests on two essentials -- PyTorch and automatic parallelism -- because: i) the PyTorch ecosystem dominates, owning 92% of the models on HuggingFace, and ii) giant models cannot be trained without complex nD parallelism. Currently, this ease of use is "broken" for industry-level frameworks, which are either not PyTorch-native (TensorFlow/JAX) or not fully automated (Megatron/DeepSpeed/torch). We propose a novel framework that combines PyTorch nativeness and automatic parallelism for scaling LLM training with ease of use. Developers only write single-device torch code; the framework automatically parallelizes it into nD parallelism with all heavy lifting handled transparently.

Speakers
Hongyu Zhu

Machine Learning System Software Engineer, ByteDance
Hongyu is a Machine Learning System Engineer in ByteDance AML group, working on systems and compilers for training workloads. He got his PhD degree from University of Toronto, where he worked with Professor Gennady Pekhimenko. He is generally interested in machine learning compilers...
Thursday August 22, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 3

11:50 HKT

Beyond the Basics: Towards Making Thanos Production-Ready | 超越基础:朝着使Thanos达到生产就绪状态的方向前进 - Benjamin Huo & Junhao Zhang, QingCloud Technologies
Thursday August 22, 2024 11:50 - 12:25 HKT
As one of the most popular and powerful Prometheus long-term storage projects, Thanos is widely adopted by the community. But to use Thanos in production, there are still a lot of day-2 operations that need to be automated. In this talk, KubeSphere maintainers will share their experiences in using and maintaining Thanos in production, including:
- Kubernetes-native definition of all Thanos components
- Tenant isolation of ingestion, rule evaluation, and compaction
- Tenant-based autoscaling of the Thanos Ingester, Ruler, and Compactor
- Time-based partitioning of the Thanos store
- Tenant-based data lifetime management
- The sharding mechanism of the global ruler to handle massive recording-rule and alerting-rule evaluation workloads
- The gateway & agent proxy mechanism for read/write with tenant access control
- The gateway's basic_auth, built-in query UI, and external remote write and query support
- TLS support between Thanos components
- The 3-tier config management

Speakers
Benjamin Huo

Manager of the Architect and Observability Team, QingCloud Technologies
Benjamin Huo leads QingCloud Technologies' Architect team and Observability Team. He is the founding member of KubeSphere and the co-author of Fluent Operator, Kube-Events, Notification Manager, OpenFunction, and most recently eBPFConductor. He loves cloud-native technologies especially...
Junhao Zhang

Senior Software Engineer, QingCloud Technologies
Junhao Zhang, Senior Development Engineer at QingCloud Technologies, is responsible for the research and development of container platform monitoring, alerting, and other cloud-native services. With many years of industry experience, he has previously held positions at companies such...
Thursday August 22, 2024 11:50 - 12:25 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

11:50 HKT

Unlocking Scalability and Simplifying Multi-Cloud Management with Karmada and PipeCD | 使用Karmada和PipeCD解锁可扩展性并简化多云管理 - Khanh Tran, CyberAgent, Inc. & Hongcai Ren, Huawei
Thursday August 22, 2024 11:50 - 12:25 HKT
In the coming AI age, it has become inevitable for organizations to embrace the multi-cloud approach. Managing applications across multiple clouds presents various challenges, including resilience, performance, security, cost, and deployment management. How well prepared are you and your services for that age? This presentation will introduce Karmada and PipeCD, two powerful tools designed to help organizations effectively address these challenges and achieve seamless multi-cloud management. Karmada is a multi-cloud container orchestration system, while PipeCD is a multi-cloud continuous delivery solution. Both tools are built on extensive experience in managing applications at scale across multiple clouds. We will delve into the key features and benefits of Karmada and PipeCD and how they simplify multi-cloud management. Together, we can unlock the true potential of multi-cloud systems and empower organizations to thrive in the era of AI.

Speakers
Hongcai Ren

Senior Software Engineer, Huawei
Hongcai Ren (@RainbowMango) is a CNCF Ambassador who has been working on Kubernetes and other CNCF projects since 2019, and is a maintainer of the Kubernetes and Karmada projects.
Khanh Tran

Software Engineer, CyberAgent, Inc.
Khanh is a maintainer of the PipeCD project. He is currently employed at CyberAgent Inc, and responsible for the CI/CD system across the organization. As a member of the developer productivity team, his primary focus is on automation and anything that enhances the development process...
Thursday August 22, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, Platform Engineering

13:50 HKT

Implement Auto Instrumentation Under GraalVM Static Compilation on OTel Java Agent | GraalVM 静态编译下 OTel Java Agent 的自动增强方案与实现 - Zihao Rao & Ziyi Lin, Alibaba Cloud
Thursday August 22, 2024 13:50 - 14:25 HKT
GraalVM static compilation has a significant effect on improving Java application startup speed and runtime memory usage, and is very valuable for Java to flourish in the cloud native ecosystem. However, the automatic instrumentation originally provided via the Java agent becomes invalid after static compilation. We designed a static instrumentation solution in GraalVM to solve the above problem. This talk will introduce the overall design of the solution and related test results for the OTel Java Agent.

Speakers
Zihao Rao

Software Engineer, Alibaba Cloud
Zihao is a software engineer at Alibaba Cloud. Over the past few years, he has participated in several well-known open source projects; he is a steering committee member of the Spring Cloud Alibaba project and a triager for OpenTelemetry Java Instrumentation.
Ziyi Lin

Senior Software Engineer, Alibaba Cloud
Author of the book "Static Compilation for Java in GraalVM: The Principles and Practice". ACM SIGSOFT Distinguished Paper Award winner (ICSE'23). Committer of the Apache incubating Teaclave Java TEE SDK (https://github.com/apache/incubator-teaclave-java-tee-sdk). Active contributor to GraalVM (https://github.com/pulls?q=is%3Apr+org%3Aoracle+author%3Aziyilin...
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

13:50 HKT

Testing and Release Patterns for Crossplane | 跨平面的测试和发布模式 - Yury Tsarev & Steven Borrelli, Upbound
Thursday August 22, 2024 13:50 - 14:25 HKT
Crossplane has become the foundation of many Internal Developer Platforms (IDPs). A requirement for any IDP in production is the ability to make changes and upgrades to the platform with confidence. This talk will cover testing and release patterns based on our experience building production-ready environments across a range of Crossplane users. We’ll cover the lifecycle of a Crossplane Composition upgrade, from local commit to pull request to target customer environment, end-to-end testing tools, handling API changes, and how to control updates to customer environments. For quite a while, testing Crossplane Compositions meant relying exclusively on costly end-to-end layers. In this talk, we're unveiling new unit testing capabilities that allow you to evaluate and test your Composition code in complete isolation.

Speakers
Steven Borrelli

Principal Solutions Architect, Upbound
Steven is a Principal Solutions Architect for Upbound, where he helps customers adopt Crossplane.
Yury Tsarev

Principal Solutions Architect, Upbound
Yury is an experienced software engineer who strongly focuses on open-source, software quality and distributed systems. As the creator of k8gb (https://www.k8gb.io) and active contributor to the Crossplane ecosystem, he frequently speaks at conferences covering topics such as Control...
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, Platform Engineering

13:50 HKT

OS Migration Solution on Cloud | 云上操作系统迁移解决方案 - Jianlin Lv, eBay
Thursday August 22, 2024 13:50 - 14:25 HKT
Each Linux distribution has a lifecycle; this refers to when the OS developers stop providing updates or any form of support. Continuing to use an EOL Linux poses risks such as security vulnerabilities, compatibility issues, and lack of official support. Cloud providers face the challenge of quickly and safely migrating the OS to a supported distribution. The migration process involves several challenges:
1. Ensuring the safety of application data, which is especially significant during OS migrations between different Linux distributions;
2. Customizing the OS based on the Linux distribution, including changes to the kernel, deb packages, specific configurations, and tools;
3. Quickly rolling out the new OS to the production environment, with the goal of transitioning over 100,000 physical nodes each month without affecting customer operations and with minimal node downtime.
This talk will detail the issues encountered in OS migration and the proposed solutions.

Speakers
Jianlin Lv

Senior Linux Kernel Development Engineer, eBay
Jianlin Lv currently works at eBay CCOE as a Senior Kernel Engineer, responsible for the maintenance and release of eBay TessOS. He has long been involved in the development and maintenance of open-source software and operating systems and has contributed code to multiple open-source...
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 5
  Open Source Summit Sessions, Operating Systems

14:40 HKT

Kelemetry: Global Control Plane Tracing for Kubernetes | Kelemetry:面向Kubernetes控制面的全局追踪系统 - Wei Shao & Jonathan Chan, ByteDance
Thursday August 22, 2024 14:40 - 15:15 HKT
Debugging Kubernetes system issues is complicated: different controllers manipulate objects independently, sometimes triggering changes in other controllers. Unlike traditional RPC-based services, the relationships between components are not explicit; identifying which component causes an issue can be like finding a needle in a haystack. Components expose their own fragmented data, often limited to the lifecycle of a single request, which fails to illustrate the bigger picture of asynchronous causal events. This talk introduces Kelemetry, a global tracing system for the Kubernetes control plane that uses scattered data sources from audit logs, events, informers and component traces. Through several demonstrations of troubleshooting online problems, we will see how Kelemetry reveals the state transitions of related objects over a long timespan and reconstructs the causal hierarchy of events to provide intuitive insight into the What, When and Why of everything going on in a Kubernetes system.

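To make the idea concrete, here is a toy sketch (not Kelemetry's actual code or data model) of reconstructing a causal hierarchy from scattered audit events: events are grouped into per-object timelines and nested under their owner objects.

```python
# Toy illustration of the core idea: stitch scattered audit events into
# per-object timelines, then nest causally related objects (e.g. a
# Deployment's ReplicaSets) under their owner.
from collections import defaultdict

audit_events = [  # minimal stand-ins for audit log entries
    {"ts": 1, "verb": "update", "kind": "Deployment", "name": "web", "owner": None},
    {"ts": 2, "verb": "create", "kind": "ReplicaSet", "name": "web-6d4", "owner": "Deployment/web"},
    {"ts": 3, "verb": "create", "kind": "Pod", "name": "web-6d4-x7", "owner": "ReplicaSet/web-6d4"},
]

timelines = defaultdict(list)
children = defaultdict(list)
for e in sorted(audit_events, key=lambda e: e["ts"]):
    key = f'{e["kind"]}/{e["name"]}'
    timelines[key].append((e["ts"], e["verb"]))
    if e["owner"]:
        children[e["owner"]].append(key)

def show(key, depth=0):
    """Print the causal hierarchy rooted at `key`, depth-first."""
    print("  " * depth + key, timelines[key])
    for child in children[key]:
        show(child, depth + 1)

show("Deployment/web")
```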
Speakers
Wei Shao

Senior Software Engineer, ByteDance
Wei Shao is a tech lead on the Orchestration & Scheduling team at ByteDance, and a maintainer of KubeWharf projects. Wei has 6+ years of experience in the cloud native area, focusing on resource management and performance-enhanced systems in K8s. Wei led the development of multiple...

Jonathan Chan

Software Engineer, ByteDance
Jonathan is a software engineer at ByteDance working on Kubernetes related infrastructure such as observability systems and cluster federation. He is also a passionate contributor to a number of open source projects.
Thursday August 22, 2024 14:40 - 15:15 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

14:40 HKT

NanoVisor: Revolutionizing FaaS Cold Start Performance with Secure, Lightweight Container Runtime | NanoVisor:通过安全、轻量级容器运行时改变FaaS冷启动性能 - Tianyu Zhou, Ant Group
Thursday August 22, 2024 14:40 - 15:15 HKT
Function as a Service (FaaS) is booming, but cold start time, the time it takes to create a new container for a function, remains a significant bottleneck. This not only impacts user experience with noticeable delays, but also incurs unnecessary costs due to wasted resources. NanoVisor, a groundbreaking container runtime built on gVisor, tackles the challenge of slow cold starts in FaaS. It achieves this through a series of optimizations designed specifically for FaaS: lightweight containerd interaction for faster setup, a read-only filesystem for enhanced efficiency, and a sandbox fork mechanism that replaces heavyweight container creation for significant performance gains. These empower NanoVisor to create secure, sandboxed containers ready for function execution within an astonishing 5 ms, with less than 1 MB of memory overhead per instance and 1.5K QPS per node. It has been successfully applied across Ant Group's ecosystem, including the Alipay cloud base and SOFA Function, as well as for CI/CD acceleration.

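As a conceptual analogy for the sandbox fork mechanism (NanoVisor's implementation lives inside the runtime and is not shown here), the sketch below forks a pre-warmed Python process instead of cold-starting a new one; the child inherits the warmed state via copy-on-write and is ready almost immediately.

```python
# Conceptual analogy only: fork a pre-warmed parent instead of cold-starting
# a new interpreter, mirroring the motivation behind sandbox forking.
import os
import time

def warm_up():
    import json, http.client  # stand-ins for expensive runtime initialization
    return {"model": "loaded"}

state = warm_up()  # initialization cost is paid once, in the parent

t0 = time.perf_counter()
pid = os.fork()  # child inherits the warmed state via copy-on-write
if pid == 0:
    # The "function instance": ready almost immediately after fork.
    print(f"child ready in {(time.perf_counter() - t0) * 1e3:.2f} ms, state={state}")
    os._exit(0)
os.waitpid(pid, 0)
```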
Speakers
Tianyu Zhou

System Engineer, Ant Group
Tianyu Zhou is a system engineer at Ant Group. He graduated from Zhejiang University with a master's degree in cyberspace security. His research interests include kernel, system security and container security.
Thursday August 22, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, Emerging + Advanced

15:35 HKT

Empower Large Language Models (LLMs) Serving in Production with Cloud Native AI Technologies | 利用云原生人工智能技术在生产环境中赋能大型语言模型(LLMs) - Lize Cai, SAP & Yang Che, Alibaba Cloud Intelligence
Thursday August 22, 2024 15:35 - 16:10 HKT
LLMs have heightened public expectations of generative models. However, as noted in the Gartner report, running AI applications in production poses significant challenges. To tackle them, we redesigned and optimized the software capabilities of cloud native AI technologies. By extending KServe to handle OpenAI's streaming requests, it can accommodate the inference load of LLMs. With Fluid and Vineyard, we reduced Llama-30B model loading time from 10 minutes to under 25 seconds. The optimizations do not stop there: since LLM loading is not a high-frequency operation, it is crucial to use cronHPA for timed auto-scaling in order to strike a balance between cost and performance, and to evaluate the cost-effectiveness of the scaling process. As reviewers and maintainers of KServe and Fluid, we share our insights on these challenges in the session. We will showcase effective use of cloud native AI and share our experiences in production.

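As an illustration of the streaming-request side, here is a hedged client sketch against an OpenAI-compatible chat endpoint; the host, path and model name are placeholders, not the presenters' deployment.

```python
# Hedged sketch of an OpenAI-style streaming chat request against an
# LLM inference service; endpoint and model name are placeholders.
import json
import requests

resp = requests.post(
    "http://llm.example.com/v1/chat/completions",  # hypothetical endpoint
    json={
        "model": "llama-30b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,  # ask the server for server-sent events
    },
    stream=True,
    timeout=60,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```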
Speakers
Yang Che

Senior Engineer, Alibaba Cloud Intelligence
Yang Che is a senior engineer at Alibaba Cloud. He works in the Alibaba Cloud container service team, and focuses on Kubernetes and container related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor...

Lize Cai

Senior Software Engineer, SAP
Lize is a senior software engineer at SAP, based in Singapore. With a strong product mindset, Lize has extensive experience in building enterprise-grade machine learning platforms. A passionate advocate for open source technology, Lize actively contributes to various projects, including...
Thursday August 22, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 3

15:35 HKT

KubeSkoop: Deal with the Complexity of Network Issues and Monitoring with eBPF | KubeSkoop:使用eBPF处理网络问题和监控的复杂性 - Yutong Li, Alibaba Cloud & Bingshen Wang, AlibabaCloud
Thursday August 22, 2024 15:35 - 16:10 HKT
Troubleshooting network issues has always been one of the most difficult parts of operating Kubernetes. Containerization and microservices result in a denser network topology and more dependencies on the various layers of the network stack, and the new network technologies and architectures introduced by AI pose a significant challenge for observability and diagnosis. We developed KubeSkoop, a network monitoring and diagnosis suite for Kubernetes. Built on eBPF, it provides deep monitoring and tracing of the Kubernetes network to help users quickly locate network jitter problems in the cluster. It also provides a network connectivity check that helps users resolve connectivity issues with one click. This topic will cover: ● What makes Kubernetes networking complex. ● An introduction to KubeSkoop. ● How we use eBPF to monitor container networking. ● KubeSkoop in practice in large-scale production environments.

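The sketch below shows the underlying technique in miniature (not KubeSkoop's actual code): a BCC/eBPF kprobe on the kernel's tcp_retransmit_skb that counts retransmissions, one of the classic signals behind network jitter. It requires root and the bcc toolchain.

```python
# Minimal BCC sketch of the technique KubeSkoop builds on: hook a kernel
# network event with eBPF and export a counter to userspace.
from bcc import BPF
import time

prog = r"""
BPF_HASH(counts, u32, u64);   // single-key cumulative retransmit counter

int on_retransmit(struct pt_regs *ctx) {
    u32 key = 0;
    counts.increment(key);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="on_retransmit")

print("tracing TCP retransmits... Ctrl-C to stop")
try:
    while True:
        time.sleep(5)
        for _, v in b["counts"].items():
            print(f"total retransmits so far: {v.value}")
except KeyboardInterrupt:
    pass
```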
Speakers
Bingshen Wang

Senior Engineer, Alibaba Cloud
Bingshen Wang is a Senior Engineer at Alibaba Cloud, a maintainer of KubeSkoop/Terway/OpenYurt, and a contributor to Kubernetes/Containerd. He mainly focuses on container networking and runtime, and has many years of experience around managing Alibaba Cloud Kubernetes clusters. He...

Tony Li

Software Engineer, Alibaba Cloud
Yutong Li is a Software Engineer at Alibaba Cloud. He works on designing and maintaining the container network for Alibaba Cloud Container Service, and on KubeSkoop, an open source Kubernetes networking diagnosis tool.
Thursday August 22, 2024 15:35 - 16:10 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

16:25 HKT

Effortless Scalability: Orchestrating Large Language Model Inference with Kubernetes | 无缝扩展性:使用Kubernetes编排大型语言模型推理 - Joinal Ahmed & Nirav Kumar, Navatech Group
Thursday August 22, 2024 16:25 - 17:00 HKT
In the dynamic landscape of AI/ML, deploying and orchestrating large open-source inference models on Kubernetes has become paramount. This talk delves into the intricacies of automating the deployment of heavyweight models like Falcon and Llama 2, leveraging Kubernetes Custom Resource Definitions (CRDs) to manage large model files seamlessly through container images. Deployment is streamlined with an HTTP server facilitating inference calls using the model library. This session will explore eliminating manual tuning of deployment parameters to fit GPU hardware by providing preset configurations. Learn how to auto-provision GPU nodes based on specific model requirements, ensuring optimal utilization of resources. We'll discuss empowering users to deploy their containerized models effortlessly by allowing them to provide a pod template in the workspace custom resource's inference field; the controller, in turn, dynamically creates deployment workloads utilizing all GPU nodes.

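A hedged sketch of the deployment flow described above: submitting a hypothetical Workspace custom resource whose inference field carries a preset name and a user pod template. The group, version and field names are illustrative placeholders, not the talk's actual CRD.

```python
# Hedged sketch: create a hypothetical "Workspace" custom resource whose
# inference field carries a user pod template. Group/version/plural are
# placeholders, not real CRD coordinates.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

workspace = {
    "apiVersion": "workspace.example.com/v1alpha1",
    "kind": "Workspace",
    "metadata": {"name": "falcon-40b", "namespace": "default"},
    "resource": {"instanceType": "gpu-node-a100", "count": 2},
    "inference": {
        "preset": {"name": "falcon-40b"},  # preset config instead of hand tuning
        "template": {  # user-supplied pod template for the model server
            "spec": {
                "containers": [{
                    "name": "server",
                    "image": "registry.example.com/llm-server:latest",
                    "resources": {"limits": {"nvidia.com/gpu": "8"}},
                }]
            }
        },
    },
}

api.create_namespaced_custom_object(
    group="workspace.example.com", version="v1alpha1",
    namespace="default", plural="workspaces", body=workspace,
)
```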
Speakers
Joinal Ahmed

AI Architect, Navatech Group
Joinal is a seasoned Data Science expert passionate about rapid prototyping, community involvement, and driving technology adoption. With a robust technical background, he excels in leading diverse teams through ML projects, recruiting and mentoring talent, optimizing workflows, and...

Nirav Kumar

Head of AI and Engineering, Navatech Group
Nirav Kumar is a leader in the field of Artificial Intelligence with over 13 years of experience in data science and machine learning. As Head of AI and Engineering at Navatech Group, he spearheads cutting-edge research and development initiatives aimed at pushing the boundaries of...
Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 3

16:25 HKT

Uniting Sustainability and Edge Computing: Kepler & Open Horizon on RISC-V and Heterogeneous System | 团结可持续性和边缘计算:Kepler和Open Horizon在RISC-V和异构系统上 - Peng Hui Jiang & David Yao, IBM
Thursday August 22, 2024 16:25 - 17:00 HKT
The dynamic landscape of cloud-edge computing demands solutions that mitigate energy consumption and promote sustainability. Our proposal advocates integrating Kepler and Open Horizon with the CNCF and LF Edge ecosystems to address diverse hardware requirements in cloud and edge deployments, including x86, Arm, s390, and the emerging RISC-V architectures. Notably, the Chinese market, characterized by edge devices in the manufacturing, retail and surveillance domains, stands to benefit significantly from this initiative. By using Kepler's sophisticated energy estimation capabilities and Open Horizon's autonomous workload management features, this proposal endeavors to optimize energy efficiency across heterogeneous edge environments. In the session, we will demonstrate a use case that builds and integrates Kepler and Open Horizon on the RISC-V platform, and monitors and optimizes a distributed, heterogeneous system to build a greener and more resilient cloud-edge computing paradigm.

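As a small illustration of the monitoring side, the sketch below reads Kepler's per-container energy counters from Prometheus; the Prometheus URL is a placeholder, and the metric name (kepler_container_joules_total) should be verified against your Kepler version.

```python
# Sketch: read Kepler's per-container energy counters from Prometheus.
# Verify the metric name against your Kepler release.
import requests

PROM = "http://prometheus.example.com:9090"  # placeholder Prometheus URL
query = 'sum by (container_name) (rate(kepler_container_joules_total[5m]))'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    name = series["metric"].get("container_name", "<unknown>")
    watts = float(series["value"][1])  # joules per second, i.e. watts
    print(f"{name}: {watts:.2f} W")
```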
Speakers
Peng Hui Jiang

Architect, IBM
Peng Hui Jiang works for IBM as a Senior Software Engineer building and operating public cloud services. He has rich experience in cloud, database, and security. He is a CNCF Kepler maintainer, an Apache CouchDB committer, and an IBM Master Inventor holding more than 200 patents or...

David Yao

Program Director, IBM Cloud Platform, IBM
David Yao is the Program Director of IBM Cloud Platform in the IBM China Development Lab, developing and managing the entire product development lifecycle and team for the dynamic cloud and edge environment. Passionate about learning open technology, building and transforming an open and...
Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Observability

17:15 HKT

Addressing the #1 Threat to the Web: Authorization | 应对网络的头号威胁:授权 - Jimmy Zelinskie, authzed
Thursday August 22, 2024 17:15 - 17:50 HKT
As more folks deploy cloud-native architectures and technologies, store ever larger amounts of data, and build ever more complex software suites, authorizing requests correctly and securely only becomes exponentially more difficult. Broken authorization now tops OWASP's Top 10 Security Risks for Web Apps. Their recommendation? Adopt an ABAC or ReBAC authorization model. This talk establishes the problems with the status quo, explains the core concepts behind ReBAC, and introduces SpiceDB, a widely adopted open source system inspired by Zanzibar, the system internally powering Google.

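To preview the core ReBAC concept, here is a toy permission check over Zanzibar-style relation tuples; it illustrates the idea only, while real systems such as SpiceDB add a schema language, userset indirection and a distributed datastore.

```python
# Toy ReBAC check over Zanzibar-style relation tuples (object, relation,
# subject). Conceptual illustration only, not SpiceDB's API.
RELATION_TUPLES = {
    ("doc:readme", "owner", "user:alice"),
    ("doc:readme", "parent", "folder:eng"),
    ("folder:eng", "viewer", "user:bob"),
}

def check(obj, relation, subject, depth=10):
    """Does `subject` have `relation` on `obj`? Owners imply viewers,
    and viewer permission is inherited from parent folders."""
    if depth == 0:
        return False
    if (obj, relation, subject) in RELATION_TUPLES:
        return True
    if relation == "viewer":
        # owner is strictly stronger than viewer
        if check(obj, "owner", subject, depth - 1):
            return True
        # walk up the object hierarchy
        for (o, r, parent) in RELATION_TUPLES:
            if o == obj and r == "parent":
                if check(parent, "viewer", subject, depth - 1):
                    return True
    return False

assert check("doc:readme", "viewer", "user:alice")  # via ownership
assert check("doc:readme", "viewer", "user:bob")    # via parent folder
assert not check("doc:readme", "owner", "user:bob")
```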
Speakers
Jimmy Zelinskie

Cofounder, authzed
Jimmy Zelinskie is a software engineer and product leader with a goal of democratizing software via open source development. He's currently CPO of authzed, where he's focused on bringing hyperscaler best practices in authorization to the industry at large. At CoreOS, he helped pioneer...
Thursday August 22, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, Security

17:15 HKT

Working with Raw Disk Drives in Kubernetes — YDB's Experience | 在Kubernetes中使用原始磁盘驱动器——YDB的经验 - Ivan Blinkov, YDB
Thursday August 22, 2024 17:15 - 17:50 HKT
YDB is an open-source distributed database management system that, for performance reasons, uses raw disk drives (block devices) to store all data, without any filesystem. Managing such a setup was relatively straightforward in the bare-metal world of the past, but the dynamic nature of cloud-native environments introduced new challenges to keeping this performance benefit. In this talk, we'll explore how to leverage Kubernetes and the Operator design pattern to modernize how stateful distributed database clusters are managed, without changing the primary approach to how the data is physically stored.

Speakers
Ivan Blinkov

VP, Product and Open-Source, YDB
Ivan Blinkov is a seasoned technical leader specializing in data storage and processing. Over the last decade, he was involved in the development of several database management systems, two of which are open-source: ClickHouse in the past and, more recently, YDB.
Thursday August 22, 2024 17:15 - 17:50 HKT
Level 2 | Grand Ballroom 1-2
 
Friday, August 23
 

09:05 HKT

Keynote: Deploying LLM Workloads on Kubernetes by WasmEdge and Kuasar | 主论坛演讲: 使用WasmEdge和Kuasar在Kubernetes上部署LLM工作负载 - Tianyang Zhang, Huawei Cloud & Xiaowei Hu, Second State
Friday August 23, 2024 09:05 - 09:20 HKT
LLMs are powerful artificial intelligence models capable of comprehending and generating natural language. However, the conventional methods for running LLMs pose significant challenges, including complex package installations, GPU devices compatibility concerns, inflexible scaling, limited resource monitoring and statistics, and security vulnerabilities on native platforms. WasmEdge introduces a solution enabling the development of swift, agile, resource-efficient, and secure LLMs applications. Kuasar enables running applications on Kubernetes with faster container startup and reduced management overheads. This session will demonstrate running Llama3-8B on a Kubernetes cluster using WasmEdge and Kuasar as container runtimes. Attendees will explore how Kubernetes enhances efficiency, scalability, and stability in LLMs deployment and operations.

Speakers
Vivian Hu

Product Manager, Second State
Vivian Hu is a Product Manager at Second State and a columnist at InfoQ. She is a founding member of the WasmEdge project. She organizes Rust and WebAssembly community events in Asia.

Tianyang Zhang

Software Engineer, Huawei Cloud
Tianyang Zhang works on container runtimes at Huawei Cloud. He is a maintainer of Kuasar and a reviewer of the Containerd rust-extension repository.
Friday August 23, 2024 09:05 - 09:20 HKT
Level 2 | Grand Ballroom 1-2
  Keynote Sessions | 主论坛演讲, AI + ML

10:35 HKT

Deep Dive Into Windows CSI Driver HostProcess Containers | 深入探讨Windows CSI驱动程序HostProcess容器 - Andy Zhang (OSTC) & Weizhi Chen, Microsoft
Friday August 23, 2024 10:35 - 11:10 HKT
Currently, most Windows CSI drivers depend on the Windows csi-proxy because various privileged operations cannot be performed from a containerized application running on a Windows node. Beginning with Kubernetes 1.23, HostProcess containers are supported and can run directly on the host as regular processes. Switching to HostProcess container deployment makes Windows CSI driver development and deployment easier. This session will cover the history and implementation details of the Windows csi-proxy project, why csi-proxy has been needed by Windows CSI drivers since Kubernetes 1.18, and why we removed the csi-proxy dependency in Kubernetes 1.26. We will explore the key learnings and gotchas we resolved while migrating Windows CSI driver development from csi-proxy-dependent deployment to HostProcess container deployment. After attending this session, you will understand why and how to migrate your Windows applications to gain the benefits of HostProcess containers.

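For orientation, here is a hedged sketch of what a HostProcess deployment looks like, built with the Kubernetes Python client; per the upstream Kubernetes docs, hostProcess requires hostNetwork and a host user identity, and the image name below is a placeholder.

```python
# Sketch of a Windows HostProcess pod built with the Kubernetes Python
# client; hostProcess requires hostNetwork and a host user such as
# NT AUTHORITY\SYSTEM. Image name is a placeholder.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="csi-node-win"),
    spec=client.V1PodSpec(
        host_network=True,  # required for HostProcess containers
        node_selector={"kubernetes.io/os": "windows"},
        security_context=client.V1PodSecurityContext(
            windows_options=client.V1WindowsSecurityContextOptions(
                host_process=True,
                run_as_user_name="NT AUTHORITY\\SYSTEM",
            )
        ),
        containers=[client.V1Container(
            name="csi-driver",
            image="registry.example.com/csi-driver-windows:latest",
        )],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="kube-system", body=pod)
```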
Speakers
Andy Zhang (OSTC)

Principal Software Engineer, Microsoft
Andy Zhang is the storage lead in the Azure Kubernetes Service team at Microsoft and a maintainer of multiple Kubernetes projects, including the Windows csi-proxy project, the Azure CSI drivers, and the SMB, NFS and iSCSI CSI drivers. Andy focuses on improving the experience of using storage in Kuberne...

Weizhi Chen

Senior Software Engineer, Microsoft
Weizhi works on the Microsoft AKS team on Kubernetes, focusing on Kubernetes storage drivers on Azure.
Friday August 23, 2024 10:35 - 11:10 HKT
Level 2 | Grand Ballroom 1-2

10:35 HKT

Empower WebAssembly and Container Both on RISC-V | 在RISC-V上加强WebAssembly和容器 - Tiejun Chen, VMware
Friday August 23, 2024 10:35 - 11:10 HKT
RISC-V has clearly attracted attention from many areas, but in the real world there are still challenges to running workloads on RISC-V-based targets. From cloud to edge, the trend is to deploy workloads on sandboxed microservice platforms: containers, Kubernetes, and so on. The underlying sandbox technologies are also evolving, with newcomers like WebAssembly being considered the future of computing, and in practice WebAssembly is starting to run as a lightweight alternative runtime side by side with containers and VMs. In this talk, we'd like to review whether and how we can build such a multi-runtime platform on RISC-V, where WebAssembly and containers coexist. We will deploy WebAssembly and Docker to RISC-V Linux running on a real RISC-V target, and further port other open source utilities to the RISC-V Linux distribution, to help fit workloads into WebAssembly and containers on RISC-V and to explore accelerating the open software ecosystem on RISC-V.

Speakers
Tiejun Chen

Sr. Technical Lead, VMware
Tiejun Chen is a senior technical lead. He has worked at several tech companies, including VMware, Intel and Wind River Systems, on cloud native, edge computing, ML/AI, RISC-V, WebAssembly, and more. He has presented at AI.Dev NA 2023, KubeCon China 2021, Kube...
Friday August 23, 2024 10:35 - 11:10 HKT
Level 1 | Hung Hom Room 5

11:25 HKT

Evolution of SPDK Vhost-FS Solution to Accelerate File Access in VMs and Secure Containers | SPDK Vhost-FS解决方案的演进,加速虚拟机中的文件访问并保护容器 - Changpeng Liu, Intel
Friday August 23, 2024 11:25 - 12:00 HKT
Virtio-fs is a file system shared between virtual machines or secure containers and the host; Storage Performance Development Kit (SPDK) vhost-fs is a userspace backend implementation of virtio-fs. In this presentation, we will summarize typical storage solutions that use SPDK vhost-fs and its components to build the storage stack, then walk through the evolution of SPDK vhost-fs from BlobFS to the latest FSDEV module. Advanced features of SPDK vhost-fs, such as interrupt mode and thread modeling for data processing, are also covered.

Speakers
Changpeng Liu

Cloud Solution Architect, Intel
Changpeng is a Cloud Solution Architect at Intel. He has been working on the Storage Performance Development Kit since 2014 and is currently a core maintainer of SPDK. His areas of expertise include NVMe, I/O virtualization, and storage offload on IPU.
Friday August 23, 2024 11:25 - 12:00 HKT
Level 2 | Grand Ballroom 1-2

13:20 HKT

Build Container Runtime Based on Sandbox API of Containerd | 基于Containerd的Sandbox API构建容器运行时 - Shaobao Feng, Huawei Cloud & Cai Wei, DaoCloud
Friday August 23, 2024 13:20 - 13:55 HKT
The Sandbox API was released in containerd 1.7 and will be stable in containerd 2.0. It provides a clean way to implement a sandbox-oriented container runtime. With the introduction of different kinds of isolation techniques as sandboxes, a container is now more a set of API specifications than a single technology, so we need a clear, abstract definition of the Sandbox API to make it easy to integrate different sandboxing techniques into a container runtime. In this sharing, we will: 1. Introduce the Sandbox API of containerd and why we need it. 2. Show how we build our container runtimes based on the Sandbox API and the benefits that come with it. 3. Demonstrate different kinds of sandboxed containers created by Kuasar, a container runtime framework based on the new Sandbox API that currently supports VMM, user-mode kernel, WebAssembly and runc sandboxes.

Speakers
Wei Cai (Iceber Gu)

Software Engineer, DaoCloud
Wei is a senior open source enthusiast focused on cloud runtimes, multi-cloud, and WASM. He is a CNCF Ambassador who founded Clusterpedia and promoted it as a CNCF Sandbox project. He also created KasmCloud to promote the integration of WASM with Kubernetes and contribute it to the WasmCloud...

Shaobao Feng

Principal Engineer, Huawei Cloud
Shaobao is a Principal Engineer at Huawei Cloud, focusing on serverless platforms. He has been a leader in building the secure container runtime of the first serverless Kubernetes on public cloud. He is the main code contributor and maintainer of the open source...
Friday August 23, 2024 13:20 - 13:55 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Platform Engineering

13:20 HKT

What if Your System Experiences an Outage? Let's Build a Resilient Systems with Chaos Engineering | 如果您的系统遇到故障怎么办?让我们通过混沌工程构建弹性系统 - NamKyu Park, LitmusChaos
Friday August 23, 2024 13:20 - 13:55 HKT
This session explores how LitmusChaos improves the resilience of cloud-native applications by injecting chaos. It also showcases the streamlined management of chaos engineering software through Backstage. Cloud-native applications can be complex to navigate and secure. Our session will present strategies to identify vulnerabilities using GitOps and monitoring, integrated seamlessly into your system. Learn how Backstage and LitmusChaos can enhance your application's resilience with ease! The session starts with chaos orchestration and analysis using LitmusChaos, followed by a live demo highlighting the utilization of LitmusChaos' Backstage plugin and others like Prometheus and ArgoCD. Learn how these plugins, when integrated with Backstage, effectively manage all components necessary for executing chaos engineering.

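A hedged sketch of the chaos-orchestration step: submitting a pod-delete experiment as a LitmusChaos ChaosEngine custom resource via the Kubernetes API. The field names follow the litmuschaos.io/v1alpha1 CRD as documented; verify them against your LitmusChaos release.

```python
# Hedged sketch: submit a pod-delete experiment as a LitmusChaos
# ChaosEngine custom resource. Verify fields against your release.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "web-chaos", "namespace": "default"},
    "spec": {
        "appinfo": {"appns": "default", "applabel": "app=web", "appkind": "deployment"},
        "engineState": "active",
        "chaosServiceAccount": "litmus-admin",
        "experiments": [{
            "name": "pod-delete",
            "spec": {"components": {"env": [
                {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
                {"name": "PODS_AFFECTED_PERC", "value": "50"},
            ]}},
        }],
    },
}

api.create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="default", plural="chaosengines", body=engine,
)
```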
Speakers
Namkyu Park

Maintainer, LitmusChaos
Namkyu Park is a CNCF Ambassador and a software developer. He worked at several startups in South Korea, completed the Linux Foundation Mentorship Programme (LitmusChaos) as a mentee, and is currently a mentor and maintainer of LitmusChaos. He has previously spoken at GopherCon Korea...
Friday August 23, 2024 13:20 - 13:55 HKT
Level 1 | Hung Hom Room 7

15:15 HKT

Detecting and Overcoming GPU Failures During ML Training | 在ML训练过程中检测和克服GPU故障 - Ganeshkumar Ashokavardhanan, Microsoft & Sarah Belghiti, Wayve
Friday August 23, 2024 15:15 - 15:50 HKT
Scaling ML training demands powerful GPU infrastructure, and as model sizes and training scale increases, GPU failures become an expensive risk. From outright hardware faults to subtle performance degradation, undetected GPU problems can sabotage training jobs, inflating costs and slowing development. This talk dives into GPU failure challenges in the context of ML training, particularly distributed training. We will explore the spectrum of GPU issues, and why even minor performance drops can cripple large jobs. Learn how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks. Understand principles of fault-tolerant distributed training to mitigate GPU failure fallout. Drawing on cloud provider and autonomous vehicle company experience, we will share best practices for efficient identification, remediation, and prevention of GPU failures. We will also explore cutting-edge ideas like CRIU and task pre-emption for GPU workloads.

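The session is built around NVIDIA DCGM; as a lighter stand-in, the sketch below polls similar health signals (temperature, throttle reasons, uncorrected ECC errors) through the NVML Python bindings (pip install nvidia-ml-py) on ECC-capable data-center GPUs.

```python
# Stand-in for a DCGM-style health check, using NVML bindings instead.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        throttle = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
        ecc = pynvml.nvmlDeviceGetTotalEccErrors(
            h, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)
        print(f"gpu{i}: {temp}C throttle_mask={throttle:#x} uncorrected_ecc={ecc}")
        # Flag subtle degradation, not just hard faults: sustained thermal or
        # hardware throttling silently slows every worker in a distributed job.
        if throttle & (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
                       | pynvml.nvmlClocksThrottleReasonHwSlowdown):
            print(f"gpu{i}: WARNING - throttling detected")
finally:
    pynvml.nvmlShutdown()
```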
Speakers
Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft
Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this Kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine...

Sarah Belghiti

ML Platform Engineer, Wayve
Sarah Belghiti is an ML Platform Engineer at Wayve, a leading developer of embodied intelligence for autonomous vehicles. She works on the infrastructure, scheduling and monitoring of ML workloads. With GPUs becoming an increasingly scarce resource, her focus has been on building...
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 3

15:15 HKT

Expanding Cloud Native Capabilities with WASM: A Case Study of Harbor and WASM Integration | 通过WASM扩展云原生能力:Harbor和WASM集成案例研究 - Chenyu Zhang, AntGroup & Yan Wang, Broadcom
Friday August 23, 2024 15:15 - 15:50 HKT
In the cloud-native realm, eBPF's versatility has led to scalable solutions in observability and security by attaching to system event checkpoints without kernel code modification. This concept has paved the way for extending business applications non-invasively and flexibly without altering the original code. In this session, we'll use Harbor, the cloud-native artifact registry, to showcase how WASM (WebAssembly) extends Harbor's functionalities without code modification. Here, Harbor is analogous to the Linux kernel, and WASM to user-provided eBPF programs. Harbor provides mounting points for various events, such as pre-pull requests, enabling users to filter requests with custom WASM programs. This facilitates fine-grained permission control and artifact security auditing before a user pulls the artifacts, with more features to discover.

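The sketch below is a toy Python version of the extension-point idea (not Harbor's implementation): user-supplied filters attach to a "pre-pull" mount point and can reject a pull request, analogous to loading WASM programs into Harbor's event hooks.

```python
# Toy sketch of event mount points with pluggable filters. In Harbor's
# model the filter would be a user-provided WASM module; here it is
# plain Python for illustration.
class Registry:
    def __init__(self):
        self.hooks = {"pre-pull": []}

    def on(self, event, fn):
        """Attach a filter to an event mount point."""
        self.hooks[event].append(fn)

    def pull(self, user, artifact):
        # Run every pre-pull filter before serving the artifact.
        for fn in self.hooks["pre-pull"]:
            ok, reason = fn(user, artifact)
            if not ok:
                return f"denied: {reason}"
        return f"{artifact} pulled by {user}"

registry = Registry()
registry.on("pre-pull", lambda user, artifact:
            (not artifact.endswith(":unscanned"), "artifact failed security audit"))

print(registry.pull("alice", "library/nginx:1.27"))
print(registry.pull("bob", "library/nginx:unscanned"))
```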
Speakers
Yan Wang

Staff Engineer, Broadcom
Yan Wang is a Staff Engineer working at VMware. As one of the core maintainers of the CNCF project Harbor and a maintainer of the CNCF project Distribution, his main work focuses on technology research and innovation in the cloud native field.

Chenyu Zhang

Software Engineer, AntGroup
Chenyu Zhang is a software engineer, currently mainly responsible for the development and maintenance of the Harbor project; he also has experience in DevOps and cloud native related technology stacks.
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Platform Engineering

15:15 HKT

The Experience of ChillyRoom Developing & Managing Session-Based Game on K8s with OpenKruiseGame | 在K8s上使用OpenKruiseGame开发和管理基于会话的游戏的ChillyRoom经验 - Qiuyang Liu, Alibaba Cloud & Xinhao Liu, ChillyRoom
Friday August 23, 2024 15:15 - 15:50 HKT
In the era of traditional game operation and maintenance, session-based games face huge challenges in terms of delivery efficiency and resource costs. Cloud native technology brings exactly the flexibility and highly automated capabilities that session-based games need. However, because game servers are strongly stateful, there are also various difficulties in running games on Kubernetes. This talk will focus on the characteristics of session-based games and describe how ChillyRoom uses OpenKruiseGame, a subproject of the CNCF incubating project OpenKruise, to develop and manage session-based games on Kubernetes, providing developers in the game industry with cloud native implementation experience in automatic network access, elastic scaling of game servers, matching logic development, room status management, and more.

Speakers
Qiuyang Liu

Senior R&D Engineer, Alibaba Cloud
Qiuyang Liu is the head of cloud native gaming at Alibaba Cloud Container Service and a maintainer of the kruise-game project. He has long been engaged in cloud native research and development in the gaming field and is committed to promoting the implementation of cloud native in the...

Xinhao Liu

Engineer, ChillyRoom
Xinhao Liu is an engineer with one year of experience in game server development at ChillyRoom and three years of experience in Linux OS and cloud core network software development. He has a passion for creating flexible, high-performance, highly available and easy-to-maintain game...
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 7

16:05 HKT

Boosting LLM Development and Training Efficiency: Automated Parallelization with MindSpore | 提升LLM开发和培训效率:MindSpore自动并行化 - Yufeng Lyu, Huawei Technologies Co., Ltd
Friday August 23, 2024 16:05 - 16:40 HKT
With the popularity of LLM, large-scale pre-training has become an indispensable step in AI research and implementation. However, large-scale distributed parallel training requires developers to consider various factors affecting the efficiency of model development and training, such as partitioning and communication, and then modify the model accordingly. In this presentation, we will demonstrate an automatic parallelization approach that allows developers to focus on algorithm research without the need for intrusive model modifications. Distributed training on a large-scale cluster can be achieved simply by configuring strategies. Developers can also utilize MindSpore's hyperparameter search model to automatically find the best parallelization strategy. The parallel strategy obtained through search can achieve 90%-110% of the expert tuning performance, significantly reducing the time required for model modifications while efficiently accelerating LLM training.

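As a flavor of the configure-only workflow, here is a hedged sketch of turning on MindSpore's automatic parallelization; the API names follow MindSpore 2.x and should be checked against your version.

```python
# Sketch of enabling MindSpore's automatic parallelization (MindSpore 2.x
# API names; verify against your installed version).
import mindspore as ms
from mindspore.communication import init

init()  # set up the distributed collective communication backend
ms.set_context(mode=ms.GRAPH_MODE)

# Let the framework search a sharding strategy instead of hand-partitioning
# the model; "sharding_propagation" propagates a few user hints to all ops.
ms.set_auto_parallel_context(
    parallel_mode=ms.ParallelMode.AUTO_PARALLEL,
    search_mode="sharding_propagation",
    device_num=8,
)

# ... define the network and train as usual; no intrusive model changes:
# model = ms.Model(net, loss_fn, optimizer)
# model.train(epochs, dataset)
```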
Speakers
Yufeng Lyu

Senior Engineer, Huawei Technologies Co., Ltd
Lyu Yufeng, a technical architect at MindSpore and maintainer of the MindNLP framework, focuses his research on natural language processing and distributed parallelism for LLM. He possesses extensive experience in the development and implementation of LLM solutions.
Friday August 23, 2024 16:05 - 16:40 HKT
Level 1 | Hung Hom Room 3

16:05 HKT

Unlocking LLM Performance with EBPF: Optimizing Training and Inference Pipelines | 通过eBPF解锁LLM性能:优化训练和推理管道 - Yang Xiang, Yunshan Networks, Inc.
Friday August 23, 2024 16:05 - 16:40 HKT
The training and inference processes of Large Language Models (LLMs) involve handling vast amounts of model data and training data, and consume significant GPU compute resources. However, enhancing GPU utilization becomes extremely challenging in the absence of observability. This presentation will introduce how to achieve observability in LLM training and inference processes with zero disruption using eBPF. This includes utilizing Memory Profiling to understand the loading performance of models and training data, Network Profiling to comprehend the data exchange performance, and GPU Profiling to analyze GPU's MFU (Model FLOPs Utilization) and performance bottlenecks. Additionally, we will share the practical effects of implementing observability in a PyTorch LLM application and the llm.c project using eBPF, aiming to enhance training and inference performance.

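To illustrate the zero-disruption approach, the sketch below uses BCC/eBPF to count CUDA kernel launches per process by attaching a uprobe to cudaLaunchKernel; the libcudart path is an assumption to adjust for your installation, and it is a simplification of what a full GPU profiler collects.

```python
# Sketch of zero-instrumentation visibility with eBPF: count CUDA kernel
# launches per process via a uprobe on cudaLaunchKernel. Requires root
# and the bcc toolchain; the library path below is an assumption.
from bcc import BPF
import time

prog = r"""
BPF_HASH(launches, u32, u64);   // pid -> cudaLaunchKernel calls

int on_launch(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    launches.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",
                sym="cudaLaunchKernel", fn_name="on_launch")

print("counting CUDA kernel launches... Ctrl-C to stop")
try:
    while True:
        time.sleep(5)
        for pid, n in b["launches"].items():
            print(f"pid={pid.value} kernel_launches={n.value}")
        b["launches"].clear()  # report per-interval rates
except KeyboardInterrupt:
    pass
```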
Speakers
Yang Xiang

VP of Engineering, Yunshan Networks, Inc.
Yang Xiang received a Ph.D. from Tsinghua University and currently serves as VP of Engineering at Yunshan Networks and head of the DeepFlow open-source community. He has presented academic papers on topics such as application observability and network measurement at top international academic...
Friday August 23, 2024 16:05 - 16:40 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Observability
 
