In-person
21-23 August, 2024

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Level 1 | Hung Hom Room 2
Wednesday, August 21
 

11:00 HKT

How to Increase the Throughput of Kubernetes Scheduler by Tens of Times | 如何将Kubernetes调度器的吞吐量提高数十倍 - Yuquan Ren & Bing Li, ByteDance
Wednesday August 21, 2024 11:00 - 11:35 HKT
Currently, the various Kubernetes-based task schedulers popular in the community have limited performance, which restricts the cluster scale they can handle. This limit makes it difficult to improve resource utilization through large-scale colocation, and running more clusters brings a greater operational burden. 1. Due to bottlenecks in the scheduler and related components, the maximum cluster scale cannot exceed 5k nodes; 2. In clusters with more than 5k nodes, scheduling throughput cannot exceed 100 pods/s. Godel Scheduler is a distributed, high-performance scheduler based on Kubernetes, and it is now open source. In this talk, we will go deep into the godel scheduler's performance optimizations: 1. Optimizing scheduling algorithms and refactoring data structures; 2. Implementing optimistic concurrency under a multi-shard architecture to achieve parallel computation; 3. Abstracting "batch" scheduling to fully reuse scheduling computation results.

Speakers
Yuquan Ren

Cloud Native Architect, ByteDance
Yuquan Ren has 10+ years of working experience in the cloud-native field, contributing extensively to open-source projects such as Kubernetes. Currently, he is a tech leader at ByteDance, primarily focusing on the field of orchestration and scheduling.
Bing Li

Senior Software Engineer, ByteDance
Bing Li has participated in the open source community for nearly 3 years. Currently, he is a senior software engineer at ByteDance, focusing on scheduling system performance optimization and system evolution.
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Operations + Performance

11:50 HKT

Implementing Fine-Grained and Pluggable Container Resource Management Leveraging NRI | 基于 NRI 实现精细化且可插拔的容器资源管理 - Qiang Ren, Intel & He Cao, ByteDance
Wednesday August 21, 2024 11:50 - 12:25 HKT
To overcome Kubernetes' limitations in resource management, ByteDance developed Katalyst, a resource management system. Katalyst employs a range of methodologies, including colocation, node over-commitment, specification recommendation, and tidal colocation, aimed at optimizing cluster resource utilization.

Initially, Katalyst introduced a QoS Resource Manager (QRM) framework within kubelet, facilitating versatile container resource allocation through a plugin architecture. Presently, the Node Resource Interface (NRI) presents a refined alternative.

This session elucidates how Katalyst leverages NRI for fine-grained and adaptable container resource management, ensuring efficiency without intrusive modifications of upstream components. This novel architecture allows Katalyst to seamlessly integrate with native Kubernetes, offering a user-friendly and easily maintainable solution.

Speakers
Qiang Ren

Software Engineer, Intel
Ren Qiang works as a Cloud Orchestration Software Engineer in SATG, Intel. He mainly focuses on Cloud Native technologies in the runtime. At the same time, he actively participates in open-source projects and is committed to promoting the development of runtime and resource isola... Read More →
He Cao

Senior Software Engineer, ByteDance
He Cao is a senior software engineer on the Cloud Native team at ByteDance, a maintainer of Katalyst and KubeZoo, and a member of Istio. He has 5+ years of experience in the cloud native area. Since joining ByteDance, he has designed and implemented several critical systems for VKE... Read More →
Wednesday August 21, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 2

13:50 HKT

Kubespray Unleashed: Navigating Bare Metal Services in Kubernetes for LLM and RAG | Kubespray大放异彩:在Kubernetes中为LLM和RAG部署裸金属服务 - Kay Yan, DaoCloud & Alan Leung, Equinix
Wednesday August 21, 2024 13:50 - 14:25 HKT
Kubespray, popular within the SIG-Cluster-Lifecycle of Kubernetes, is celebrated for deploying production-ready Kubernetes clusters, particularly on bare metal, which boosts performance for AI workloads like LLM and RAG. This session will explore using Kubespray in bare metal settings, addressing challenges, and sharing best practices. The first part of the talk will show Kubespray's key features and provide practical tips. The latter half will focus on swiftly deploying AI using Retrieval-Augmented Generation (RAG), demonstrating how Kubespray facilitates setting up Kubernetes clusters on bare metal. This setup enhances AI applications by integrating continuous knowledge updates and domain-specific information via RAG, improving the accuracy and credibility of the AI systems. The session will conclude with discussions on community engagement and future advancements, followed by a Q&A period to address participant queries.

Speakers
Kay Yan

Principal Software Engineer, DaoCloud
Kay Yan is a Kubespray maintainer and a containerd/nerdctl maintainer. He is a Principal Software Engineer at DaoCloud, where he has been developing the DaoCloud Enterprise Kubernetes Platform since 2016.
Alan Leung

Digital Technical Specialist, Equinix
Alan is a Digital Technical Specialist at Equinix, focused on enabling customers, prospects, and partners to develop innovative solutions that solve business challenges at the digital edge.
Wednesday August 21, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 2

14:40 HKT

Scaling Kubernetes: Best Practices for Managing Large-Scale Batch Jobs with Spark and Argo Workflow | 扩展Kubernetes:管理大规模批处理作业的最佳实践与Spark和Argo工作流 - Yu Zhuang & Liu Jiaxu, Alibaba Cloud
Wednesday August 21, 2024 14:40 - 15:15 HKT
Are you managing large-scale batch jobs on Kubernetes, like data processing with Spark applications or genomics computing with Argo Workflows? To complete these jobs promptly, a significant number of pods have to be scaled out/in quickly for parallel computation, which puts heavy pressure on the Kubernetes control plane. In this talk, we will use Spark and Argo Workflows as examples, guiding you through building a Kubernetes cluster that supports frequently creating/deleting 20,000 pods. Our focus will be on tuning the Kubernetes control plane, including optimizing the list-watch mechanism, service broadcasting, environment variable attachment, and API server configuration. Additionally, we'll share some best practices for configuring the Spark operator and the Argo Workflows controller.

Speakers
Liu Jiaxu

Senior Engineer, Alibaba Cloud
Jiaxu Liu is a Senior Engineer on the Container Service Team at Alibaba Cloud. He specializes in observability enhancement and large-scale cluster management and optimization for Alibaba Cloud's container service offerings. Before joining Alibaba Cloud, he worked at Nokia as a Senior... Read More →
Wednesday August 21, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 2

15:35 HKT

Tackling Operational Time-to-Market Decelerators in AI/ML Projects | 应对人工智能/机器学习项目中的运营时间市场减速器 - Adrian Matei & Andreea Munteanu, Canonical
Wednesday August 21, 2024 15:35 - 16:10 HKT
In the competitive AI market, Time To Market (TTM) is crucial for success. Ensuring secure, scalable, and compliant ML infrastructures often slows TTM due to the complexities of updates, patches, monitoring, and security enforcement. This leads to decreases in ROI, profitability, reproducibility, and competitive edge. To address this, companies can engage Managed Service Providers (MSPs) to offload operational burdens and focus on innovation, yet selecting the right MSP requires consideration of expertise, automation capabilities, and compliance adherence. This presentation explores the AI operational landscape, highlighting indicators and challenges in MSP collaboration. We will focus on the management of open source tools like Kubeflow and MLflow across hybrid and multicloud environments. By understanding operational excellence in AI and available options to achieve it, attendees will gain insights into choosing an approach that aligns with their greater objectives.

Speakers
Andreea Munteanu

AI Product Manager, Canonical
Andreea Munteanu is a Product Manager at Canonical, leading the MLOps area. With a background in Data Science in various industries, she used AI techniques to enable enterprises to benefit from their initiatives and make data-driven decisions. Nowadays, Andreea is looking to help... Read More →
Adrian Matei

Product Manager, Canonical
With a degree in Information Management for Business, Adrian is now guiding Canonical’s open-source operational management toolset as Product Manager. He has been working in open source operations for the past two years, having previously accumulated experience in technology consulting... Read More →
Wednesday August 21, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 2

16:25 HKT

Unleashing the Power of Cluster API: Extensibility and Customization | 释放Cluster API的力量:可扩展性和定制化 - Zain Malik, CityStorageSystems & Nibir Bora, Startup
Wednesday August 21, 2024 16:25 - 17:00 HKT
Cluster API, designed with extensibility at its core, has revolutionized Kubernetes cluster management. Its open and pluggable architecture empowers providers to implement custom solutions tailored to their unique requirements. In this session, we will explore how Cluster API's extension-by-design philosophy has opened new horizons for organizations seeking to create bespoke Kubernetes clusters. Managing Kubernetes clusters at scale presents unique operational challenges that cannot be tamed with manual operations. Through real-world examples and lessons learned, we will demonstrate how Cluster API's flexibility allows for the integration of diverse infrastructure providers and the implementation of organization-specific customizations. Attendees will gain insights into best practices for extending Cluster API, including developing custom controllers, integrating third-party tools, and creating bespoke workflows.

Speakers
Zain Malik

Staff Software Engineer, CityStorageSystems
Zain Malik serves as a tech lead on the compute team at a startup, where he has contributed significantly to cost-saving and reliability projects and helped mature cluster lifecycle management. Before this role, Zain was a product owner and staff software engineer at the... Read More →
Nibir Bora

Engineering Manager, Startup
Nibir is an Engineering Manager in charge of Core Infrastructure at a stealth startup, where he is responsible for the company's Kubernetes infrastructure running hundreds of clusters globally.
Wednesday August 21, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Operations + Performance

17:15 HKT

How to Manage Database Clusters Without a Dedicated Operator | 如何在没有专门Operator的情况下管理数据库集群 - Shanshan Ying, ApeCloud & Shun Ding, China Mobile Cloud
Wednesday August 21, 2024 17:15 - 17:50 HKT
As Kubernetes becomes integral to cloud-native environments, more organizations are deploying database services on K8S, facing significant challenges. Integrating new database engines typically requires developing a dedicated Kubernetes operator that manages not only resource provisioning but also essential maintenance tasks like high availability, backup & restore, and configuration management. This session introduces a universal operator framework that supports various database engines, enabling rapid, minimal-code integration. We will present a case study from China Mobile Cloud on integrating a new cloud-native database engine into K8S using this framework, achieved with minimal coding and reduced time investment, bypassing the extensive Golang coding usually required for developing a dedicated operator.

Speakers
Shanshan Ying

Maintainer, ApeCloud
Shanshan is currently a maintainer of KubeBlocks by ApeCloud. Before joining ApeCloud, she worked in Aliyun Database Group for years. She received her PhD degree from National University of Singapore.
Shun Ding

Senior Systems Architect, China Mobile Cloud
Shun is a Senior Systems Architect at China Mobile Cloud, leading the design, development, and deployment of next-generation Kubernetes-based large-scale database managing service. With over a decade of experience in cloud computing and database technologies, Shun has extensive expertise... Read More →
Wednesday August 21, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Operations + Performance
 
Thursday, August 22
 

11:00 HKT

A Story of Managing Kubernetes Watch Events End-to End Flow in Extremely Large Clusters | 在极大规模集群中管理Kubernetes watch事件端到端流程的故事 - Bo Tang, Ant Group
Thursday August 22, 2024 11:00 - 11:35 HKT
The K8s watch mechanism has not been given the attention it deserves for an extended period. However, it is critical to both the stability and performance of a K8s cluster, and watch latency is an excellent indicator of cluster health. This talk begins by introducing how watch-event latency is measured, then defines watch SLI and SLO metrics. Using the watch SLO as a guide, the talk will show how watch bottlenecks are identified. It will then describe optimizations made to the apiserver, etcd, kubelet, controller-runtime, and clients such as controllers and schedulers in various aspects of watching, including watch latency, pod provisioning time, bandwidth, and CPU/memory usage. With these optimizations, daily P99 watch latency has improved by over 90% in large clusters (~20K nodes), affecting billions of watch events. Pod provisioning time has improved by over 60%, apiserver bandwidth has decreased by 50%, and the overall stability of the K8s clusters has improved greatly.

Speakers
Bo Tang

Senior Engineer, Ant Group
Bo Tang is a senior engineer in Ant Group. He is currently working on scalability and performance optimization of Kubernetes clusters.
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 2

11:50 HKT

Building a High-Performance Time Series Database from Scratch: Optimization Strategies | 从零开始构建高性能时序数据库:优化策略 - Aliaksandr Valialkin, VictoriaMetrics
Thursday August 22, 2024 11:50 - 12:25 HKT
Application performance monitoring and Kubernetes monitoring are, in their current state, pretty expensive. The average VictoriaMetrics installation processes 2-4 million samples/s on the ingestion path and 20-40 million samples/s on the read path; the biggest installations ingest 100 million samples/s. This requires being very clever with data pipelines to keep them efficient and scalable as resources are added. In this session, we'll explore essential optimizations for maintaining database speed, such as string interning, caching results, goroutine management, and using sync.Pool for efficient resource management. These techniques help strike a balance between performance and resource consumption. The talk focuses on practical strategies for enhancing database speed.

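Two of the techniques named here, string interning and sync.Pool buffer reuse, look roughly like the following Go sketch (simplified for illustration; VictoriaMetrics' actual implementation additionally bounds and rotates its intern cache):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// internCache maps a string to its canonical copy, so that repeated
// label values (e.g. metric names) share one allocation instead of
// millions. A production version would bound the cache size.
var internCache sync.Map

func intern(s string) string {
	if v, ok := internCache.Load(s); ok {
		return v.(string)
	}
	v, _ := internCache.LoadOrStore(s, s)
	return v.(string)
}

// bufPool recycles scratch buffers on hot paths, keeping GC pressure
// flat regardless of request rate.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// render formats a label pair using a pooled buffer.
func render(name, value string) string {
	b := bufPool.Get().(*bytes.Buffer)
	defer func() { b.Reset(); bufPool.Put(b) }()
	b.WriteString(intern(name))
	b.WriteByte('=')
	b.WriteString(value)
	return b.String()
}

func main() {
	fmt.Println(render("job", "victoriametrics")) // job=victoriametrics
}
```

The point of both tricks is the same: on a path handling millions of samples per second, avoiding one allocation per sample is the difference between a flat and a runaway heap.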
Speakers
Hui Wang

Software Engineer, VictoriaMetrics
I'm working on monitoring at VictoriaMetrics. My passion is cloud-native technologies and open source.
Aliaksandr Valialkin

CTO, VictoriaMetrics
Aliaksandr is a co-founder and the principal architect of VictoriaMetrics. He is also a well-known author of the popular performance-oriented libraries: fasthttp, fastcache and quicktemplate. He holds a Master’s Degree in Computer Software Engineering. He decided to found VictoriaMetrics... Read More →
Thursday August 22, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 2

13:50 HKT

Choose Your Own Adventure: The Struggle for Security | 选择你的冒险:安全之战 - Whitney Lee, VMware Tanzu & Viktor Farcic, Upbound
Thursday August 22, 2024 13:50 - 14:25 HKT
Our hero, a running application in a Kubernetes production environment, knows they are destined for greater things! They are serving end users, but currently, they are also endangering those users, the system, and themselves! But the struggle for security is HARD, filled with system design choices concerning secrets management; cluster-level and runtime policies; and securing pod-to-pod communications. It is up to you, the audience, to guide our hero and help them grow from a vulnerable, unprotected application to their final form: an app that is more secure against invasion. In their third ‘Choose Your Own Adventure’-style talk, Whitney and Viktor will present choices that an anthropomorphized app must make as they try to protect themselves against every kind of exploit. Throughout the presentation, the audience (YOU!) will vote to decide our hero app's path! Can we navigate CNCF projects to safeguard our app, system, and users against attack before the session time elapses?

Speakers
Viktor Farcic

Developer Advocate, Upbound
Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.
Whitney Lee

Developer Advocate, VMware Tanzu
Whitney is a lovable goofball and a CNCF Ambassador who enjoys understanding and using tools in the cloud native landscape. Creative and driven, Whitney recently pivoted from an art-related career to one in tech. You can catch her lightboard streaming show ⚡️ Enlightning on Tanzu.TV... Read More →
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Cloud Native Novice

14:40 HKT

Find Your Own Personal Tutor for the Study of Kubernetes | 为学习Kubernetes找到适合您的个人导师 - Hoon Jo, Megazone
Thursday August 22, 2024 14:40 - 15:15 HKT
Kubernetes novices ask questions on Stack Overflow, in community channels, or of friends :) when they encounter a problem. But each question requires explaining their environment and background information, and an answer is not guaranteed. This talk suggests using K8sGPT with Ollama to bridge that knowledge gap. K8sGPT provides an interactive mode that allows follow-up questions until the answer is sufficient, and it can help those who are not comfortable asking in English (often a big concern at the beginning). I highly recommend K8sGPT to newcomers for a soft landing into the Kubernetes world.

Speakers
Hoon Jo

Cloud Solutions Architect | Cloud Native Engineer, Megazone
Hoon Jo is a Cloud Solutions Architect and Cloud Native Engineer at Megazone. He has spoken many times on cloud-native technologies and works to make cloud native ubiquitous around the world. He wrote 『Python for System/Network Administrators』 (Wikibooks, 2017... Read More →
Thursday August 22, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Cloud Native Novice

15:35 HKT

Kubernetes Community Panel: A Decade of Evolution and Future Trends | Kubernetes维护者圆桌:十年演变与未来趋势 - Paco Xu & Mengjiao Liu, DaoCloud; Qiming Teng, Freelance; Klaus Ma, Nvidia; Pengfei Ni, Microsoft
Thursday August 22, 2024 15:35 - 16:10 HKT
Join us in celebrating the 10th anniversary of Kubernetes with a panel featuring some of the community's most influential contributors and maintainers from China. Over the past decade, Kubernetes has grown into the cornerstone of cloud-native infrastructure, thanks to the dedication and innovation of its community members. In this panel, we will talk about our journeys with Kubernetes, share stories and experience, and discuss the future of Kubernetes in the next decade. Our panelists include current and former owners, tech leads, and maintainers. Feel free to join the panel to share your perspectives on the past and next decade of the Kubernetes community and ask anything about the community.

Speakers
Pengfei Ni

Principal Software Engineer, Microsoft
Pengfei Ni is a Principal Software Engineer at Microsoft Azure and a maintainer of the Kubernetes project. With extensive experience in Cloud Computing, Kubernetes, and Software Defined Networking (SDN), he has delivered presentations at various conferences, including KubeCon, ArchSummit... Read More →
徐俊杰 Paco

Open Source Team Lead, DaoCloud
Paco is co-chair of KubeCon + CloudNativeCon China 2024 and a member of the Kubernetes Steering Committee. He leads the open-source team at DaoCloud. He was also a KCD Chengdu 2022 organizer and a speaker at KubeCon EU 2023 & 2024 and KubeCon China 2021. Paco is a kubeadm maintainer... Read More →
Qiming Teng

Architect, Freelance
Qiming has been a passionate open source contributor for more than 10 years. He was an active contributor to the OpenInfra community and the CNCF community. His interest spans from operating systems, programming languages to cloud platforms. His current research fields include the... Read More →
Mengjiao Liu

Software Engineer, DaoCloud
Mengjiao Liu is a Software Engineer at DaoCloud. She contributes to Kubernetes and serves as the WG Structured Logging Lead and SIG Instrumentation Reviewer, focusing on enhancing logging quality. Additionally, she actively participates in SIG Docs as a Chinese owner and English reviewer... Read More →
Klaus Ma

Principal Software Engineer, Nvidia
Team leader, system architect, designer, and software developer with 10+ years of experience across a variety of industries and technology bases, including cloud computing, machine learning, big data, and financial services. Founder of Volcano & kube-batch, Kubernetes SIG-Scheduling co-Leader... Read More →
Thursday August 22, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 2

16:25 HKT

A Decade of Cloud-Native Journey: The Evolution of Container Technology and the Kubernetes Ecosystem | 十年云原生之旅:容器技术和Kubernetes生态系统的演变 - Jintao Zhang, Kong Inc.
Thursday August 22, 2024 16:25 - 17:00 HKT
Over the past decade, cloud-native technologies have revolutionized software development, deployment, and operations. Container technology and the Kubernetes ecosystem, as transformation leaders, have enhanced development agility, and provided enterprises with unmatched scalability, flexibility, and efficiency. This talk navigates the evolution of these technologies, highlighting their impact on the cloud-native landscape. Starting my journey in 2014, I will share insights into the decade-long evolution of Kubernetes, its community, and technology stacks, alongside personal experiences. Attendees will learn about successes, challenges, and future trends, gaining knowledge to navigate their cloud-native transformations.

Speakers
Jintao Zhang

Sr. SE, Kong
Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes ingress-nginx maintainer. He specializes in cloud-native technology and the Azure technology stack, and works at Kong Inc.
Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Cloud Native Novice

17:15 HKT

KubeEdge DeepDive: Extending Kubernetes to the Edge with Real-World Industry Use Case | KubeEdge深入探讨:将Kubernetes扩展到边缘,实现真实行业用例 - Yue Bao, Huawei Cloud Computing Technology Co., Ltd. & Hongbing Zhang, DaoCloud
Thursday August 22, 2024 17:15 - 17:50 HKT
In this session, KubeEdge project maintainers will provide an overview of KubeEdge's architecture and explore how KubeEdge addresses industry-specific use cases. The session will kick off with a brief introduction to edge computing and its growing importance in IoT and distributed systems. The maintainers will then delve into the core components and architecture of KubeEdge, showcasing how it extends the capabilities of Kubernetes to manage edge computing workloads efficiently. Drawing on a range of industry use cases, including smart cities, industrial IoT, edge AI, robotics, and retail, the maintainers will share success stories and insights from organizations that have deployed KubeEdge in their edge environments, highlighting the tangible benefits and transformational possibilities it offers. The session will also provide a detailed introduction to the certified KubeEdge conformance test, and the maintainers will share advancements in KubeEdge's technology and community governance.

Speakers
Yue Bao

Senior Software Engineer, Huawei Cloud Computing Technology Co., Ltd.
Yue Bao serves as a software engineer at Huawei Cloud. She now works 100% on open source as a member of the KubeEdge maintainers, focusing on lightweight edge and the edge api-server for KubeEdge. Before that, Yue worked on Huawei Cloud's Intelligent EdgeFabric Service and participated... Read More →
Hongbing Zhang

Chief Operating Officer, DaoCloud
Hongbing Zhang is Chief Operating Officer of DaoCloud. He is a veteran in open source: he founded the IBM China Linux team in 2011 and led it to make significant contributions to the Linux kernel, OpenStack, and Hadoop projects. He now focuses on the cloud-native domain, leading... Read More →
Thursday August 22, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 2
 
Friday, August 23
 

10:35 HKT

Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano | 通过Volcano增强的智能基础设施优化LLM工作流程 - Xin Li, qihoo360 & William Wang, Huawei Cloud Technologies Co., LTD
Friday August 23, 2024 10:35 - 11:10 HKT
As Large Language Models (LLMs) revolutionize various aspects of our lives, many companies are building cloud-native AI platforms to train and fine-tune LLMs. However, managing large-scale LLM training and inference platforms presents even more critical challenges, such as training efficiency, fault tolerance, resource fragmentation, operational costs, and topology-aware scheduling across racks and supernodes. In this session, the speakers will share insights from their experience using a Kubernetes-based smart infrastructure, enhanced by Volcano, to manage thousands of GPUs and handle monthly workloads of thousands of LLM training and inference jobs at qihoo360. This talk will cover: fault detection, fast job recovery, and self-healing that drastically improve efficiency; dealing with long downtime in LLM training on heterogeneous GPUs; intelligent GPU workload scheduling to reduce resource fragmentation and costs; and topology-aware scheduling on racks/supernodes to accelerate LLM training.

Speakers
avatar for Xin Li

Xin Li

Senior Engineer of Server Development, qihoo360
Xin Li is a seasoned senior back-end developer and an approver for the Volcano project, with a keen focus on Kubernetes and AI. The infrastructure he is responsible for supports the training and inference of 360GPT. Moreover, Xin Li delves deeply into optimizing distributed... Read More →
Friday August 23, 2024 10:35 - 11:10 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

11:25 HKT

New Advances for Cross-Platform AI Applications in Docker | Docker中跨平台AI应用程序的新进展 - Michael Yuan, Second State
Friday August 23, 2024 11:25 - 12:00 HKT
The talk proposes to delve into novel methods for enhancing cross-platform GPU/AI workloads within container ecosystems, with a specific emphasis on Docker's incorporation of the WebGPU standard. This standard empowers containerized applications to utilize host GPUs and additional AI accelerators via a flexible API. Consequently, there's no longer a necessity to construct Docker images tailored to individual GPU vendors and their proprietary drivers. The presentation will feature a demonstration highlighting how the WasmEdge project capitalizes on the WebGPU standard to craft portable LLM inference applications in Rust. Additionally, Docker's seamless management and orchestration of these applications will be showcased.

Speakers
avatar for Michael Yuan

Michael Yuan

Product Manager, Second State
Dr. Michael Yuan is a maintainer of WasmEdge Runtime (a project under CNCF) and a co-founder of Second State. He is the author of 5 books on software engineering published by Addison-Wesley, Prentice-Hall, and O'Reilly. Michael is a long-time open-source developer and contributor... Read More →
Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

13:20 HKT

Constructing the 10x Efficiency of Cloud-Native AI Infrastructure | 如何让你的 AI 底座效能提升 10 倍? - Peter Pan & Qiuping Dai, DaoCloud
Friday August 23, 2024 13:20 - 13:55 HKT
Enterprises keep investing in AI. But once GPUs are installed in a data center, a challenge arises: how to construct an "AI cloud" atop bare metal. Even when Kubernetes is recognized as the foundational infrastructure for AI, it is merely the initial step. Organizations may face challenges: - Maximizing GPU utilization - Unifying multi-arch accelerators/GPUs (k8s DRA) - Organization quotas and cost management - Resource isolation among organizations - Smarter scheduling, tiered GPU allocation, task prioritization... - Sharing GPU clusters between VMs & containers - Harnessing the full potential of high-speed networks, storage optimization, and dataset orchestration. Leveraging open source stacks from the Linux Foundation and CNCF, we have experience building AI clouds for IDCs and internal usage. We will share that experience to empower communities on their journey towards constructing 10x-efficiency cloud-native AI. Refer to the `Additional resources` chapter for more details.

Speakers
avatar for Peter Pan

Peter Pan

VP of R&D Engineering, DaoCloud
DaoCloud R&D Engineering VP; CNCF wg-AI (AI Working Group) member; maintainer of a few CNCF projects (GitHub ID: panpan0000): CloudTTY, KuBean, HwameiStor. Public tech events: 2023 KubeCon SH speaker (https://sched.co/1PTFI), 2023 KubeCon EU Program Committee... Read More →
avatar for Qiuping Dai

Qiuping Dai

Product Manager, DaoCloud
QiuPing Dai has been a senior technology product manager at DaoCloud for 5 years, involved in cloud computing (including Kubernetes compute, storage, and network) development work. Before that, QiuPing worked at IBM on cloud computing. QiuPing is interested in storage, network, scheduling... Read More →
Friday August 23, 2024 13:20 - 13:55 HKT
Level 1 | Hung Hom Room 2

14:10 HKT

Model Service Mesh: A New Paradigm for Large-Scale AI Model Service Deployment and Management | 模型服务网格:大规模AI模型服务部署和管理的新范式 - Xi Ning Wang, Alibaba Cloud & Huailong Zhang, Intel China
Friday August 23, 2024 14:10 - 14:45 HKT
As AI/ML models grow in scale and complexity, how to efficiently deploy and manage model services in cloud-native environments has become a significant challenge. This proposal will introduce the Model Service Mesh (MSM), an emerging architectural paradigm designed specifically for large-scale AI model service deployment and management, to address the challenge. This new paradigm focuses on: 1. How to build a highly scalable and reliable model delivery system, with key features including dynamic model service routing, unified management of multiple models within a single endpoint, an optimized caching layer, and cache-aware scheduling. 2. How to leverage the MSM to optimize AI model services in lifecycle management, resource utilization improvement, security enhancement, and observability and resilience assurance. In essence, this architecture ensures a scalable, secure, and efficient model service in cloud native environments.
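To make the cache-aware routing idea concrete, here is a minimal sketch (my illustration under assumed semantics, not the MSM implementation): route a request to a replica that already has the model warm in its cache, falling back to the least-loaded replica otherwise:

```python
class Replica:
    def __init__(self, name, cached_models=()):
        self.name = name
        self.cached = set(cached_models)  # models already resident in memory
        self.inflight = 0                 # number of active requests


def route(model, replicas):
    """Cache-aware routing: prefer replicas with the model warm in cache
    (avoiding a costly model load); break ties and fall back by load."""
    warm = [r for r in replicas if model in r.cached]
    pool = warm if warm else replicas
    best = min(pool, key=lambda r: r.inflight)
    best.inflight += 1
    best.cached.add(model)  # after serving once, the model is cached there
    return best.name


replicas = [Replica("r1", ["llama3"]), Replica("r2", ["qwen2"])]
print(route("qwen2", replicas))  # → r2 (the warm replica)
```

A real mesh would also bound each replica's cache and evict cold models, which is where the abstract's "optimized caching layer" comes in.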

Speakers
avatar for Xi Ning Wang

Xi Ning Wang

Technical Leader, Alibaba Cloud
Xi Ning Wang, senior technical expert at Alibaba Cloud, is the technical leader of ACK (Kubernetes)/ASM (Service Mesh), focusing on Kubernetes, service mesh, and other cloud native fields. He previously worked at IBM as a tech architect focusing on SOA/Cloud and served as the chairman of the... Read More →
avatar for Huailong Zhang

Huailong Zhang

Cloud Software Engineer, Intel China
Steve (Huailong) Zhang has worked for Alcatel-Lucent, Baidu, and IBM on cloud computing research and development. Huailong is currently working at Intel China as a cloud-native software engineer, focusing on cloud-native technical fields such as Kubernetes and service mesh... Read More →
Friday August 23, 2024 14:10 - 14:45 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

15:15 HKT

No More Runtime Setup! Let's Bundle, Distribute, Deploy, Scale LLMs Seamlessly with Ollama Operator | 无需运行时设置!让我们使用Ollama Operator轻松捆绑、分发、部署、扩展LLMs - Fanshi Zhang, DaoCloud
Friday August 23, 2024 15:15 - 15:50 HKT
Seeking a way to ship LLMs more seamlessly? Is it too complicated to manage, compose, and set up a runtime with Python, C++, CUDA, and GPUs when deploying LLMs? Tired of fighting against dependencies, model sizes, and syncing deliverable model images across nodes? It's true that people often find it hard to bundle, distribute, deploy, and scale their own LLM workloads. No worries: here is Ollama Operator, a scheduler and utilizer for LLM models powered by the Modelfile introduced by Ollama. You can now enjoy a unified, bundled runtime powered by llama.cpp with a few lines of CRD definition, or with a single command of the natively included kollama CLI; bundling, distributing, deploying, and scaling LLMs can now be accomplished easily and seamlessly across OSes and environments. Let's dive in and find out what Ollama Operator with Ollama can do to deploy our own large language models, and how we can combine these features with the Modelfile and bring them into the Kubernetes world!

Speakers
avatar for Neko Ayaka

Neko Ayaka

Software Engineer, DaoCloud
Cloud native developer, AI researcher, Gopher with 5 years of experience in loads of development fields across AI, data science, backend, frontend. Co-founder of https://github.com/nolebase
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

16:05 HKT

Unlocking LLM Performance with EBPF: Optimizing Training and Inference Pipelines | 通过eBPF解锁LLM性能:优化训练和推理管道 - Yang Xiang, Yunshan Networks, Inc.
Friday August 23, 2024 16:05 - 16:40 HKT
The training and inference processes of Large Language Models (LLMs) involve handling vast amounts of model data and training data, and consume significant GPU compute resources. However, enhancing GPU utilization becomes extremely challenging in the absence of observability. This presentation will introduce how to achieve observability in LLM training and inference processes with zero disruption using eBPF. This includes utilizing Memory Profiling to understand the loading performance of models and training data, Network Profiling to comprehend the data exchange performance, and GPU Profiling to analyze GPU's MFU (Model FLOPs Utilization) and performance bottlenecks. Additionally, we will share the practical effects of implementing observability in a PyTorch LLM application and the llm.c project using eBPF, aiming to enhance training and inference performance.
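For reference, the MFU metric the abstract analyzes is a simple ratio: achieved model FLOP/s divided by hardware peak FLOP/s. A sketch of the calculation, using the common approximation of ~6 FLOPs per parameter per training token for dense transformers (all numbers below are hypothetical, for illustration only):

```python
def train_flops(num_params, num_tokens):
    """Approximate training FLOPs for a dense transformer:
    ~6 FLOPs per parameter per token (forward + backward pass)."""
    return 6 * num_params * num_tokens


def mfu(num_params, tokens_per_second, peak_flops_per_second):
    """Model FLOPs Utilization: achieved model FLOP/s over hardware peak."""
    return train_flops(num_params, tokens_per_second) / peak_flops_per_second


# Example: a 7B-parameter model at 1,000 tokens/s per GPU, on a GPU with
# a 312 TFLOP/s peak (hypothetical figures).
print(f"{mfu(7e9, 1000, 312e12):.1%}")  # → 13.5%
```

Profilers like the eBPF-based ones described in the talk supply the measured side of this ratio (tokens/s and where the time actually goes), which is what makes the low-MFU bottlenecks visible.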

Speakers
avatar for Yang Xiang

Yang Xiang

VP of Engineering, Yunshan Networks, Inc.
He received a Ph.D. from Tsinghua University and currently serves as VP of Engineering at Yunshan Networks and head of the DeepFlow open-source community. He has presented academic papers on topics such as application observability and network measurement at top international academic... Read More →
Friday August 23, 2024 16:05 - 16:40 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Observability
 
