KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024

In-person
21-23 August, 2024

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: All times in this schedule are shown in Hong Kong Standard Time (UTC+8). The schedule is subject to change, and session seating is available on a first-come, first-served basis.

11:00 HKT

How to Increase the Throughput of Kubernetes Scheduler by Tens of Times | 如何将Kubernetes调度器的吞吐量提高数十倍 - Yuquan Ren & Bing Li, ByteDance

Wednesday August 21, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 2

Currently, the Kubernetes-based task schedulers popular in the community have limited performance, which restricts the cluster scale they can handle. Limited cluster scale makes it difficult to improve resource utilization through large-scale colocation, and running more clusters brings a greater operational burden:

1. Due to bottlenecks in the scheduler and related components, the maximum cluster scale cannot exceed 5k nodes.
2. In clusters with more than 5k nodes, scheduling throughput cannot exceed 100 Pods/s.

Godel Scheduler is a distributed, high-performance scheduler based on Kubernetes, and it is now open source. In this talk, we will go deep into the performance optimization methods of Godel Scheduler:

1. Optimize scheduling algorithms and refactor data structures.
2. Implement optimistic concurrency under a multi-shard architecture to achieve parallel computation.
3. Abstract "batch" scheduling to fully reuse scheduling computation results.
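The optimistic concurrency idea mentioned in the abstract can be illustrated with a small sketch. This is not Godel Scheduler's code: the `Node`, `Proposal`, and `Cluster` types, the per-node version counter, and the "most free CPU" placement rule are all assumptions made for illustration. Several scheduler shards decide against a possibly stale snapshot, and a single commit step rejects any proposal whose node changed in the meantime.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: int
    version: int = 0  # bumped on every committed placement (hypothetical field)


@dataclass
class Proposal:
    pod: str
    node: str
    cpu: int
    seen_version: int  # node version the shard observed when it decided


@dataclass
class Cluster:
    nodes: dict = field(default_factory=dict)

    def snapshot(self):
        # Shards work on a copy, so their view may be stale by commit time.
        return {n.name: Node(n.name, n.free_cpu, n.version) for n in self.nodes.values()}

    def commit(self, p: Proposal) -> bool:
        # Accept only if the node is unchanged since the shard looked at it.
        node = self.nodes[p.node]
        if node.version != p.seen_version or node.free_cpu < p.cpu:
            return False
        node.free_cpu -= p.cpu
        node.version += 1
        return True


def propose(snapshot, pod, cpu):
    # Pick the node with the most free CPU; return None if nothing fits.
    fits = [n for n in snapshot.values() if n.free_cpu >= cpu]
    if not fits:
        return None
    best = max(fits, key=lambda n: (n.free_cpu, n.name))
    return Proposal(pod, best.name, cpu, best.version)


def schedule(cluster, pod, cpu, max_retries=3):
    # On a lost race, retry against a fresh snapshot.
    for _ in range(max_retries):
        p = propose(cluster.snapshot(), pod, cpu)
        if p is None:
            return None
        if cluster.commit(p):
            return p.node
    return None
```

The strict version check is deliberately coarse: it rejects a proposal even when the node would still have had room. A real system would likely use a finer-grained conflict test to reduce retries.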

Speakers

Yuquan Ren

Cloud Native Architect, ByteDance

Yuquan Ren has 10+ years of working experience in the cloud-native field, contributing extensively to open-source projects such as Kubernetes. Currently, he is a tech leader at ByteDance, primarily focusing on the field of orchestration and scheduling.

Bing Li

Senior Software Engineer, ByteDance

Bing Li has participated in the open source community for nearly 3 years. Currently, he is a senior software engineer at ByteDance, focusing on scheduling system performance optimization and system evolution.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Beginner
Language: Chinese

11:50 HKT

Implementing Fine-Grained and Pluggable Container Resource Management Leveraging NRI | 基于 NRI 实现精细化且可插拔的容器资源管理 - Qiang Ren, Intel & He Cao, ByteDance

Wednesday August 21, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 2

To overcome Kubernetes' limitations in resource management, ByteDance developed Katalyst, a resource management system. Katalyst employs a range of methodologies, including colocation, node over-commitment, specification recommendation, and tidal colocation, aimed at optimizing cluster resource utilization.

Initially, Katalyst introduced a QoS Resource Manager (QRM) framework within kubelet, facilitating versatile container resource allocation through a plugin architecture. Presently, the Node Resource Interface (NRI) presents a refined alternative.

This session elucidates how Katalyst leverages NRI for fine-grained and adaptable container resource management, ensuring efficiency without intrusive modifications of upstream components. This novel architecture allows Katalyst to seamlessly integrate with native Kubernetes, offering a user-friendly and easily maintainable solution.

Speakers

Qiang Ren

Software Engineer, Intel

Ren Qiang works as a Cloud Orchestration Software Engineer in SATG, Intel. He mainly focuses on Cloud Native technologies in the runtime. At the same time, he actively participates in open-source projects and is committed to promoting the development of runtime and resource isola...

He Cao

Senior Software Engineer, ByteDance

He Cao is a senior software engineer on the Cloud Native team at ByteDance, a maintainer of Katalyst and KubeZoo, and a member of Istio. He has 5+ years of experience in the cloud native area. Since joining ByteDance, he has designed and implemented several critical systems for VKE...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Intermediate
Language: English

13:50 HKT

Kubespray Unleashed: Navigating Bare Metal Services in Kubernetes for LLM and RAG | Kubespray大放异彩：在Kubernetes中为LLM和RAG部署裸金属服务 - Kay Yan, DaoCloud & Alan Leung, Equinix

Wednesday August 21, 2024 13:50 - 14:25 HKT

Level 1 | Hung Hom Room 2

Kubespray, popular within the SIG-Cluster-Lifecycle of Kubernetes, is celebrated for deploying production-ready Kubernetes clusters, particularly on bare metal, which boosts performance for AI workloads like LLM and RAG. This session will explore using Kubespray in bare metal settings, addressing challenges, and sharing best practices. The first part of the talk will show Kubespray's key features and provide practical tips. The latter half will focus on swiftly deploying AI using Retrieval-Augmented Generation (RAG), demonstrating how Kubespray facilitates setting up Kubernetes clusters on bare metal. This setup enhances AI applications by integrating continuous knowledge updates and domain-specific information via RAG, improving the accuracy and credibility of the AI systems. The session will conclude with discussions on community engagement and future advancements, followed by a Q&A period to address participant queries.

Speakers

Kay Yan

Principal Software Engineer, DaoCloud

Kay Yan is a maintainer of Kubespray and of containerd/nerdctl. He is a Principal Software Engineer at DaoCloud and has been developing the DaoCloud Enterprise Kubernetes Platform since 2016.

Alan Leung

Digital Technical Specialist, Equinix

Alan is the Digital Technical Specialist at Equinix, with a focus on enabling customers, prospects, and partners to develop innovative solutions to business challenges at the digital edge.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Intermediate
Language: English

14:40 HKT

Scaling Kubernetes: Best Practices for Managing Large-Scale Batch Jobs with Spark and Argo Workflow | 扩展Kubernetes：管理大规模批处理作业的最佳实践与Spark和Argo工作流 - Yu Zhuang & Liu Jiaxu, Alibaba Cloud

Wednesday August 21, 2024 14:40 - 15:15 HKT

Level 1 | Hung Hom Room 2

Are you managing large-scale batch jobs on Kubernetes, such as data processing with Spark applications or genomics computing with Argo workflows? To complete these jobs promptly, a large number of pods must be scaled out and in quickly for parallel computation, which puts heavy pressure on the Kubernetes control plane. In this talk, we will use Spark and Argo workflows as examples to show how to build a Kubernetes cluster that supports frequently creating and deleting 20,000 pods. Our focus will be on tuning the Kubernetes control plane, including optimizing the list-watch mechanism, service broadcasting, environment variable attachments, and API server configurations. Additionally, we'll share some best practices for configuring the Spark operator and the Argo Workflows controller.
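One of the knobs listed above, environment variable attachments, can be made concrete with a hedged sketch. The assumption here is that the phrase refers to the Service environment variables Kubernetes injects into each container, which the PodSpec field `enableServiceLinks` controls. The abstract does not say which setting the speakers actually use, and the helper names and the per-service variable count below are illustrative only.

```python
import copy


def without_service_links(pod_manifest: dict) -> dict:
    """Return a copy of a Pod manifest with Service env-var injection turned off."""
    patched = copy.deepcopy(pod_manifest)
    # enableServiceLinks is a PodSpec field; it defaults to true when unset.
    patched.setdefault("spec", {})["enableServiceLinks"] = False
    return patched


def injected_env_vars(pods: int, services: int, vars_per_service: int) -> int:
    """Rough count of env vars injected across all pods (illustrative arithmetic)."""
    return pods * services * vars_per_service


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "spark-executor-0"},  # hypothetical pod name
    "spec": {"containers": [{"name": "executor", "image": "example/spark:latest"}]},
}
patched = without_service_links(pod)
```

With, say, 20,000 pods in a namespace holding 500 Services and an assumed 7 variables per Service, `injected_env_vars(20000, 500, 7)` gives 70,000,000 injected variables, which is why this setting tends to matter for batch-heavy clusters.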

Speakers

Liu Jiaxu

Senior Engineer, Alibaba Cloud

Jiaxu Liu is a Senior Engineer on the Container Service Team at Alibaba Cloud. He specializes in observability enhancement and large-scale cluster management and optimization for Alibaba Cloud's container service offerings. Before joining Alibaba Cloud, he worked at Nokia as a Senior...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Intermediate
Language: English

15:35 HKT

Tackling Operational Time-to-Market Decelerators in AI/ML Projects | 应对人工智能/机器学习项目中的运营时间市场减速器 - Adrian Matei & Andreea Munteanu, Canonical

Wednesday August 21, 2024 15:35 - 16:10 HKT

Level 1 | Hung Hom Room 2

In the competitive AI market, Time To Market (TTM) is crucial for success. Ensuring secure, scalable, and compliant ML infrastructures often slows TTM due to the complexities of updates, patches, monitoring, and security enforcement. This leads to decreases in ROI, profitability, reproducibility, and competitive edge. To address this, companies can engage Managed Service Providers (MSPs) to offload operational burdens and focus on innovation, yet selecting the right MSP requires consideration of expertise, automation capabilities, and compliance adherence. This presentation explores the AI operational landscape, highlighting indicators and challenges in MSP collaboration. We will focus on the management of open source tools like Kubeflow and MLflow across hybrid and multicloud environments. By understanding operational excellence in AI and available options to achieve it, attendees will gain insights into choosing an approach that aligns with their greater objectives.

Speakers

Andreea Munteanu

AI Product Manager, Canonical

Andreea Munteanu is a Product Manager at Canonical, leading the MLOps area. With a background in Data Science in various industries, she used AI techniques to enable enterprises to benefit from their initiatives and make data-driven decisions. Nowadays, Andreea is looking to help...

Adrian Matei

Product Manager, Canonical

With a degree in Information Management for Business, Adrian is now guiding Canonical’s open-source operational management toolset as Product Manager. He has been working in open source operations for the past two years, having previously accumulated experience in technology consulting...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Intermediate
Language: English

16:25 HKT

Unleashing the Power of Cluster API: Extensibility and Customization | 释放Cluster API的力量：可扩展性和定制化 - Zain Malik, CityStorageSystems & Nibir Bora, Startup

Wednesday August 21, 2024 16:25 - 17:00 HKT

Level 1 | Hung Hom Room 2

Cluster API, designed with extensibility at its core, has revolutionized Kubernetes cluster management. Its open and pluggable architecture empowers providers to implement custom solutions tailored to their unique requirements. In this session, we will explore how Cluster API's extension-by-design philosophy has opened new horizons for organizations seeking to create bespoke Kubernetes clusters. Managing Kubernetes clusters at scale presents unique operational challenges that cannot be tamed with manual operations. Through real-world examples and lessons learned, we will demonstrate how Cluster API's flexibility allows for the integration of diverse infrastructure providers and the implementation of organization-specific customizations. Attendees will gain insights into best practices for extending Cluster API, including developing custom controllers, integrating third-party tools, and creating bespoke workflows.

Speakers

Zain Malik

Staff Software Engineer, CityStorageSystems

Zain Malik serves as a tech lead in the compute team for a startup, where he has contributed significantly to projects related to cost saving and reliability and has helped mature cluster lifecycle management. Before this role, Zain was a product owner and staff software engineer in the...

Nibir Bora

Engineering Manager, Startup

Nibir is an Engineering Manager in charge of Core Infrastructure at a Stealth Startup, where he is responsible for the company's Kubernetes infrastructure running 100s of clusters globally.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Advanced
Language: English

17:15 HKT

How to Manage Database Clusters Without a Dedicated Operator | 如何在没有专门Operator的情况下管理数据库集群 - Shanshan Ying, ApeCloud & Shun Ding, China Mobile Cloud

Wednesday August 21, 2024 17:15 - 17:50 HKT

Level 1 | Hung Hom Room 2

As Kubernetes becomes integral to cloud-native environments, more organizations are deploying database services on K8S, facing significant challenges. Integrating new database engines typically requires developing a dedicated Kubernetes operator that manages not only resource provisioning but also essential maintenance tasks like high availability, backup & restore, and configuration management. This session introduces a universal operator framework that supports various database engines, enabling rapid, minimal-code integration. We will present a case study from China Mobile Cloud on integrating a new cloud-native database engine into K8S using this framework, achieved with minimal coding and reduced time investment, bypassing the extensive Golang coding usually required for developing a dedicated operator.
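To make the "universal operator" idea more tangible, here is a minimal sketch in which a database engine is described as data and one generic code path renders its workload and maintenance actions. It is not the API of the framework presented in the talk: `EngineDefinition`, its fields, `render_workload`, and `backup_action` are hypothetical names, and the image and backup command are placeholders. Only the StatefulSet manifest fields are standard Kubernetes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineDefinition:
    """Declarative description of one database engine (hypothetical schema)."""
    name: str
    image: str
    port: int
    backup_command: tuple


def render_workload(engine: EngineDefinition, cluster_name: str, replicas: int) -> dict:
    """Generic path: turn any engine definition into a StatefulSet manifest."""
    labels = {"app": cluster_name, "engine": engine.name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": cluster_name, "labels": labels},
        "spec": {
            "serviceName": cluster_name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": engine.name,
                        "image": engine.image,
                        "ports": [{"containerPort": engine.port}],
                    }]
                },
            },
        },
    }


def backup_action(engine: EngineDefinition, pod: str) -> dict:
    """Generic path: describe the command to run in a pod to take a backup."""
    return {"pod": pod, "exec": list(engine.backup_command)}


# Adding an engine means adding data, not writing a new controller.
mydb = EngineDefinition(
    name="mydb",
    image="example.com/mydb:1.0",             # placeholder image
    port=5432,
    backup_command=("mydb-dump", "--all"),    # placeholder command
)
manifest = render_workload(mydb, "orders-db", replicas=3)
```

The design choice this illustrates is the split between engine-specific data and engine-agnostic lifecycle logic, which is what lets a new engine be integrated with little code.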

Speakers

Shanshan Ying

Maintainer, ApeCloud

Shanshan is currently a maintainer of KubeBlocks by ApeCloud. Before joining ApeCloud, she worked in the Aliyun Database Group for years. She received her PhD from the National University of Singapore.

Shun Ding

Senior Systems Architect, China Mobile Cloud

Shun is a Senior Systems Architect at China Mobile Cloud, leading the design, development, and deployment of next-generation Kubernetes-based large-scale database managing service. With over a decade of experience in cloud computing and database technologies, Shun has extensive expertise...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Beginner
Language: English

11:00 HKT

A Story of Managing Kubernetes Watch Events End-to-End Flow in Extremely Large Clusters | 在极大规模集群中管理Kubernetes watch事件端到端流程的故事 - Bo Tang, Ant Group

Thursday August 22, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 2

The K8s watch mechanism has long been denied the attention it deserves. However, it is critical to both the stability and the performance of a K8s cluster, and watch latency is an excellent indicator of cluster health. This talk begins by introducing how watch event latency is measured and then defines watch SLI and SLO metrics. Using the watch SLO as a guide, the talk will show the process of identifying watch bottlenecks. It will then describe the watch-related optimizations made to apiserver, etcd, kubelet, controller-runtime, and clients such as controllers and schedulers, covering watch latency, pod provisioning time, bandwidth, CPU and memory usage, and more. With these optimizations, daily P99 watch latency has improved by over 90% in large clusters (~20K nodes), impacting billions of watch events. Pod provisioning time has improved by over 60%, and apiserver bandwidth has decreased by 50%. The overall stability of the K8s cluster has improved greatly.
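A watch latency SLI and its SLO check can be sketched in a few lines. The abstract does not give the speaker's exact definitions, so this assumes that latency is the time from an object change being committed to a watching client receiving the event, that the SLI is the P99 of those latencies, and that the SLO is a fixed threshold. The function names and the 1000 ms default are illustrative.

```python
import math


def watch_latencies_ms(events):
    """events: iterable of (committed_ms, received_ms) pairs, one per watch event."""
    return [received - committed for committed, received in events]


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(events, pct=99, threshold_ms=1000):
    """True when the pct-th percentile watch latency is within the threshold."""
    return percentile(watch_latencies_ms(events), pct) <= threshold_ms


# 99 fast deliveries and one slow outlier: P99 is still the fast value.
healthy = [(t, t + 100) for t in range(99)] + [(99, 99 + 5000)]
# Two slow deliveries out of 100 push P99 up to the slow value.
degraded = [(t, t + 100) for t in range(98)] + [(98, 98 + 5000), (99, 99 + 5000)]
```

Here `meets_slo(healthy)` holds while `meets_slo(degraded)` does not, which shows why a tail percentile reacts to a small fraction of delayed events that an average would hide.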

Speakers

Bo Tang

Senior Engineer, Ant Group

Bo Tang is a senior engineer in Ant Group. He is currently working on scalability and performance optimization of Kubernetes clusters.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Intermediate
Language: Chinese

11:50 HKT

Building a High-Performance Time Series Database from Scratch: Optimization Strategies | 从零开始构建高性能时序数据库：优化策略 - Aliaksandr Valialkin, VictoriaMetrics

Thursday August 22, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 2

Application Performance Monitoring and Kubernetes monitoring in their current state are pretty expensive. The average VictoriaMetrics installation is processing 2-4 million samples/s on the ingestion path, and 20-40 million samples/s on the read path. The biggest installations account for 100 million samples/s on the ingestion path. This requires being very clever with data pipelines to keep them efficient and scalable by adding more resources. In this session, we'll explore essential optimizations to maintain database speed such as string interning, caching results, goroutine management and utilizing sync.Pool for efficient resource management. These techniques help strike a balance between performance and resource consumption. This talk focuses on practical strategies for enhancing database speed.
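The techniques named in the abstract (string interning, result caching, and object pooling) can be illustrated briefly. VictoriaMetrics is written in Go and `sync.Pool` is a Go standard library type. The sketch below uses Python analogues instead (`sys.intern`, `functools.lru_cache`, and a hand-written `BufferPool`), so it shows the ideas rather than the project's implementation. The selector format and all function names are assumptions.

```python
import sys
from functools import lru_cache


def intern_labels(labels: dict) -> dict:
    """Store one shared copy of each repeated label name and value."""
    return {sys.intern(k): sys.intern(v) for k, v in labels.items()}


@lru_cache(maxsize=4096)
def parse_selector(selector: str) -> tuple:
    """Parse 'k="v",k2="v2"' once; repeated queries reuse the cached result."""
    pairs = []
    for part in selector.split(","):
        key, _, value = part.partition("=")
        pairs.append((sys.intern(key.strip()), sys.intern(value.strip().strip('"'))))
    return tuple(pairs)


class BufferPool:
    """Tiny free list of reusable buffers, a rough analogue of Go's sync.Pool."""

    def __init__(self):
        self._free = []

    def get(self) -> bytearray:
        # Reuse a returned buffer when one is available, else allocate.
        return self._free.pop() if self._free else bytearray()

    def put(self, buf: bytearray) -> None:
        buf.clear()  # drop old contents before the buffer is reused
        self._free.append(buf)
```

Unlike this free list, Go's `sync.Pool` may discard pooled objects during garbage collection, so it bounds memory on its own. A long-lived Python pool like this one would need an explicit size cap.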

Speakers

Hui Wang

Software Engineer, VictoriaMetrics

I'm working on monitoring at VictoriaMetrics. My passion is cloud-native technologies and open source.

Aliaksandr Valialkin

CTO, VictoriaMetrics

Aliaksandr is a co-founder and the principal architect of VictoriaMetrics. He is also a well-known author of the popular performance-oriented libraries: fasthttp, fastcache and quicktemplate. He holds a Master’s Degree in Computer Software Engineering. He decided to found VictoriaMetrics...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level: Any
Language: English