In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

亲临现场
2024年8月21-23日
了解更多并注册参加

Sched应用程序允许您创建自己的日程安排,但不能替代您的活动注册。您必须注册参加KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024,才能参加会议。如果您尚未注册但希望加入我们,请访问活动注册页面购买注册。

请注意:本日程自动显示为香港标准时间(UTC +8)。要查看您偏好的时区的日程,请从右侧“按日期筛选”上方的下拉菜单中选择。日程可能会有变动,会议席位先到先得。
中文 (Chinese) clear filter
Wednesday, August 21
 

09:20 HKT

Keynote: Accelerating Electric Vehicle Innovation with Cloud Native Technologies | 主论坛演讲: 使用云原生技术加速电动汽车创新 - Kevin Wang, Huawei & Saint Jiang, NIO
Wednesday August 21, 2024 09:20 - 09:35 HKT
The electric vehicle (EV) industry is rapidly advancing towards a future where intelligence and connectivity are paramount. As we embrace this new era, the challenges in automotive software development escalate: software consistency, testing efficiency, and data utilization between simulated environments and real-world vehicle runtime environments, among others. In this session, discover how NIO, an innovator in the global EV sphere, harnesses the power of cloud native technologies such as containerd, Kubernetes, KubeEdge, and AI cloud-edge collaboration. Learn about NIO's journey to improve the development efficiency and quality of EV software, propelling us towards the zenith of vehicular intelligence. Delve into the transformative impact and future prospects of cloud native solutions in revolutionizing the EV landscape.

电动汽车(EV)行业正迅速向着智能和连接至关重要的未来发展。随着我们迎接这个新时代,汽车软件开发中的挑战不断升级,例如在模拟环境和真实车辆运行环境之间的软件一致性、测试效率、数据利用等等。 在这场演讲中,探索全球EV领域的创新者NIO如何利用云原生技术,如Containerd、Kubernetes、KubeEdge和AI云边协作。了解NIO如何提高EV软件开发效率和质量,推动我们走向车辆智能的巅峰。深入探讨云原生解决方案在革新EV领域中的转变影响和未来前景。
Speakers
Kevin Wang

Lead of Cloud Native Open Source Team, Huawei
Kevin Wang has been an outstanding contributor in the CNCF community since its beginning and leads the cloud native open source team at Huawei. Kevin has contributed critical enhancements to Kubernetes and led the incubation of the KubeEdge, Volcano, and Karmada projects in CNCF... Read More →
Saint Jiang

NIO
Saint Jiang has over 10 years of experience in automotive software development. He is currently responsible for the software platform development in the intelligent cockpit domain at NIO, a global leader in electric vehicles. Prior to that, he was the system manager of the software... Read More →
Wednesday August 21, 2024 09:20 - 09:35 HKT
Level 2 | Grand Ballroom 1-2

11:00 HKT

SIG-Multicluster Intro and Deep Dive | SIG-Multicluster介绍和深入探讨 - Jeremy Olmsted-Thompson, Google; Hongcai Ren, Huawei; Jian Qiu, Red Hat
Wednesday August 21, 2024 11:00 - 11:35 HKT
SIG-Multicluster is focused on solving common challenges related to the management of many Kubernetes clusters, and applications deployed across many clusters, or even across cloud providers. In this session, we'll give attendees an overview of the current status of the multi-cluster problem space in Kubernetes and of the SIG. We’ll discuss current thinking around best practices for multi-cluster deployments and what it means to be part of a ClusterSet. Then we’ll highlight current SIG projects, focused use cases, and ideas for what’s next. Most importantly, we’ll provide information on how you can get involved either as a contributor or as a user who wants to provide feedback about the SIG's current efforts and future direction. Bring your questions, problems, and ideas - help us expand the multi-cluster Kubernetes landscape!

SIG-Multicluster专注于解决与管理许多Kubernetes集群和部署在许多集群甚至跨云提供商的应用程序相关的常见挑战。在本场演讲中,我们将向与会者概述Kubernetes中多集群问题空间的当前状态和SIG。我们将讨论关于多集群部署最佳实践的当前思考以及成为ClusterSet的一部分意味着什么。然后,我们将重点介绍当前的SIG项目、关注的用例和下一步的想法。最重要的是,我们将提供有关如何参与其中的信息,无论是作为贡献者还是作为希望就SIG当前工作和未来方向提供反馈的用户。带上你的疑问、问题和想法 - 帮助我们扩展多集群Kubernetes领域!
Speakers
Jeremy Olmsted-Thompson

Principal Engineer, Google
Jeremy is a software engineer who works on Google Kubernetes Engine. His main focus is on simplifying the Kubernetes experience, and making it as easy as possible to deploy applications both within a cluster with things like GKE Autopilot, and across clusters with multi-cluster solutions... Read More →
Hongcai Ren

Senior Software Engineer, Huawei
Hongcai Ren (@RainbowMango) is a CNCF Ambassador who has been working on Kubernetes and other CNCF projects since 2019, and is a maintainer of the Kubernetes and Karmada projects.
Jian Qiu

Senior Principal Software Engineer, Red Hat
Qiu Jian is a developer at Red Hat, mainly focusing on multi-cluster management.
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 6

11:00 HKT

Accelerating Serverless AI Large Model Inference with Functionalized Scheduling and RDMA | 通过功能化调度和RDMA加速无服务器AI大模型推理 - Yiming Li, Tianjin University & Chenglong Wang, Jinan Inspur Data Technology Co., Ltd.
Wednesday August 21, 2024 11:00 - 11:35 HKT
The deployment of AI large models on standard Serverless inference platforms like KServe is gaining popularity due to its ability to improve resource utilization and reduce costs. However, existing large model inference faces significant scheduling and communication bottlenecks, making it challenging to meet low-latency and high-throughput demands. The centralized control plane of Kubernetes leads to low scheduling efficiency, unable to achieve second-level response to large-scale burst requests. Additionally, the large model inference needs to transfer GB-level KV cache for each request, resulting in high communication overhead. So, we have developed a highly elastic functionalized scheduling framework to guarantee second-level scheduling for thousands of Serverless AI large model inference task instances. Additionally, we leverage RDMA technology to achieve high-speed KV cache migration, avoiding the high overhead caused by traditional network protocol stacks.

AI大模型在像KServe这样的标准无服务器推理平台上的部署越来越受欢迎,因为它能够提高资源利用率并降低成本。然而,现有的大模型推理面临着重要的调度和通信瓶颈,使得满足低延迟和高吞吐量需求变得具有挑战性。Kubernetes的集中式控制平面导致低调度效率,无法实现对大规模突发请求的秒级响应。此外,大模型推理需要为每个请求传输GB级别的KV缓存,导致高通信开销。因此,我们开发了一个高度弹性的功能化调度框架,以确保对数千个无服务器AI大模型推理任务实例进行秒级调度。此外,我们利用RDMA技术实现高速KV缓存迁移,避免传统网络协议栈引起的高开销。
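The GB-level KV cache figure above is easy to sanity-check with back-of-the-envelope arithmetic. A rough sketch in Python (the model dimensions below are illustrative, not taken from the talk):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2, batch=1):
    """Transformer KV cache size: two tensors (K and V) per layer, each of
    shape [batch, heads, seq_len, head_dim], stored at dtype_bytes each."""
    return 2 * layers * batch * heads * seq_len * head_dim * dtype_bytes

# A hypothetical large model (80 layers, 64 attention heads, head_dim 128,
# no grouped-query sharing) serving one 4096-token request in fp16 holds
# 10 GiB of KV cache for that single request, which is what makes
# per-request cache migration so expensive over ordinary TCP stacks.
print(kv_cache_bytes(layers=80, heads=64, head_dim=128, seq_len=4096) / 2**30)
```

Numbers at this scale explain why the talk reaches for RDMA: moving the cache is bounded by raw link bandwidth, so protocol-stack overhead dominates otherwise.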
Speakers
Cookie

Senior Software Engineer, Jinan Inspur Data Technology Co., Ltd.
I work at Inspur, mainly on container-computing development, and am familiar with container networks, especially Calico and Cilium. I'm also a contributor to the OpenYurt community, mainly participating in the development of the Raven project.
Yiming Li

PhD candidate, Tianjin University
Yiming Li received the bachelor’s and master’s degrees from Tianjin University, China, in 2017 and 2019, respectively. He is currently pursuing the Ph.D. degree with the College of Intelligence and Computing, Tianjin University, China. His research interests include cloud com... Read More →
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

11:00 HKT

How to Increase the Throughput of Kubernetes Scheduler by Tens of Times | 如何将Kubernetes调度器的吞吐量提高数十倍 - Yuquan Ren & Bing Li, ByteDance
Wednesday August 21, 2024 11:00 - 11:35 HKT
Currently, various Kubernetes-based task schedulers popular in the community have limited performance, which restricts the cluster scale they can handle. Due to this limitation, it is difficult to improve resource utilization through large-scale colocation, and more clusters also bring greater operational burdens. 1. Due to bottlenecks in the scheduler and related components, the maximum cluster scale cannot exceed 5k nodes; 2. In clusters with more than 5k nodes, scheduling throughput cannot exceed 100 Pods/s. Godel Scheduler is a distributed high-performance scheduler based on Kubernetes, and it is now open source. In this talk, we will go deep into the performance optimization methods of Godel Scheduler: 1. Optimize scheduling algorithms and refactor data structures; 2. Implement optimistic concurrency under a multi-shard architecture to achieve parallel computation; 3. Abstract "batch" scheduling to fully reuse scheduling computation results.

目前,社区中流行的基于Kubernetes的各种任务调度器在性能方面存在一定限制,这限制了它们能处理的集群规模。由于集群规模的限制,通过大规模的共存难以提高资源利用率,而且更多的集群也会带来更大的运维负担。1. 由于调度器及相关组件的瓶颈,最大集群规模无法超过5k个节点;2. 在超过5k个节点的集群中,调度吞吐量无法超过100个Pod/s。 Godel Scheduler是一个基于Kubernetes的分布式高性能调度器,现已开源。在本次演讲中,我们将深入探讨godel调度器的性能优化方法:1. 优化调度算法并进行数据结构重构;2. 在多分片架构下实现乐观并发以实现并行计算;3. 抽象“批量”调度以充分重用调度计算结果。
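Point 3 above, the "batch" abstraction, can be illustrated with a toy sketch: pods with identical resource requests form an equivalence class, so node feasibility is computed once per class rather than once per pod. This is only an illustration of the idea, not Godel Scheduler's actual code:

```python
from itertools import groupby

def schedule_batch(pods, node_free):
    """Toy batch scheduler: pods with identical (cpu, mem) requests form an
    equivalence class, so node feasibility is computed once per class
    instead of once per pod. Mutates node_free as pods are placed."""
    placements = {}
    filter_runs = 0
    for request, group in groupby(sorted(pods, key=lambda p: p[1]),
                                  key=lambda p: p[1]):
        filter_runs += 1  # one feasibility pass for the whole class
        feasible = [n for n, free in node_free.items()
                    if free[0] >= request[0] and free[1] >= request[1]]
        for name, _ in group:
            if not feasible:
                placements[name] = None  # pod stays pending
                continue
            node = feasible[0]
            placements[name] = node
            cpu, mem = node_free[node]
            node_free[node] = (cpu - request[0], mem - request[1])
            if node_free[node][0] < request[0] or node_free[node][1] < request[1]:
                feasible.pop(0)  # node can no longer host this class
    return placements, filter_runs

nodes = {"n1": (2, 2), "n2": (4, 4)}
pods = [("p1", (1, 1)), ("p2", (1, 1)), ("p3", (2, 2))]
placements, filter_runs = schedule_batch(pods, nodes)
print(placements, filter_runs)  # three pods placed with only 2 filter passes
```

The win grows with batch size: a thousand identical training pods cost one filter pass instead of a thousand, which is the kind of reuse the talk credits for its throughput gains.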
Speakers
Yuquan Ren

Cloud Native Architect, ByteDance
Yuquan Ren has 10+ years of working experience in the cloud-native field, contributing extensively to open-source projects such as Kubernetes. Currently, he is a tech leader at ByteDance, primarily focusing on the field of orchestration and scheduling.
Bing Li

Senior Software Engineer, Bytedance
Bing Li has participated in the open source community for nearly 3 years. Currently, he is a senior software engineer at ByteDance, focusing on scheduling system performance optimization and system evolution.
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Operations + Performance

11:00 HKT

A New Choice for Istio Data Plane: Architectural Innovation for a Brand-New Performance Experience | Istio数据平面的新选择:全新性能体验的架构创新 - Zhonghu Xu, Huawei
Wednesday August 21, 2024 11:00 - 11:35 HKT
With the deployment of service mesh technologies like Istio, reducing the latency overhead caused by the data plane proxy architecture has become a critical concern for mesh providers. In this session, Zhong Hu and Song Yang will propose a fresh solution for the service mesh data plane from an operating system perspective. By leveraging eBPF + kernel enhancements, they enable native traffic governance capabilities in the OS. Unlike other solutions, this approach significantly simplifies the forwarding path of the mesh data plane, resulting in a 60%+ reduction in data plane forwarding latency. In addition, it features low resource overhead and secure isolation. The project redefines the mesh data plane, with Istiod as the control plane, and Huawei is currently conducting internal verification. Furthermore, they will discuss the future evolution of service mesh and explore the potential of sidecarless architecture in diverse deployment scenarios.

随着像Istio这样的服务网格技术的部署,减少由数据平面代理架构引起的延迟开销已成为网格提供商的一个关键关注点。在本场演讲中,钟虎和宋洋将从操作系统的角度提出一种全新的服务网格数据平面解决方案。通过利用eBPF +内核增强功能,他们在操作系统中实现了原生流量治理能力。与其他解决方案不同,这种方法显著简化了网格数据平面的转发路径,导致数据平面转发延迟降低了60%以上。此外,它具有低资源开销和安全隔离的特点。该项目重新定义了网格数据平面,以Istiod作为控制平面,华为目前正在进行内部验证。 此外,他们将讨论服务网格的未来演变,并探索在不同部署场景中无边车架构的潜力。
Speakers
Zhonghu Xu

Principal Engineer, Huawei
Zhonghu is an open-source enthusiast and has focused on OSS since 2017. In 2023, Zhonghu was awarded the Google Open Source Peer Bonus. He has worked on Istio for more than 6 years and is a core Istio maintainer and one of its top 3 contributors. He has been continuously serving as Istio... Read More →
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 5
  Open Source Summit Sessions, Networking + Edge Computing

11:50 HKT

AI Inference Performance Acceleration: Methods, Tools, and Deployment Workflows | AI推理性能加速:方法、工具和部署工作流程 - Yifei Zhang & 磊 钱, Bytedance
Wednesday August 21, 2024 11:50 - 12:25 HKT
As AI rapidly evolves and embraces cloud-native technologies, inference performance has become crucial for application value. GPU selection, serving framework configuration, and model/data loading significantly impact inference efficiency. We'll focus on cloud-native solutions to storage performance issues and tools for evaluating inference performance across configurations, offering optimal deployment setups integrated into cloud-native workflows. We'll discuss inference performance's impact on user experience and how optimization can reduce costs and improve efficiency. Using technologies like Fluid and model optimization, we'll share strategies to enhance inference performance. Based on performance and cost analysis of various GPUs, we'll guide AI engineers in hardware selection. Additionally, we'll introduce a performance testing tool to evaluate and recommend the best model, hardware, and acceleration scheme combinations, aligning with deployment workflows based on test results.

随着人工智能的快速发展和对云原生技术的采用,推理性能对应用价值变得至关重要。 GPU选择、服务框架配置以及模型/数据加载对推理效率有着重大影响。我们将专注于云原生解决方案,解决存储性能问题,并提供评估不同配置下推理性能的工具,为云原生工作流程提供最佳部署设置。 我们将讨论推理性能对用户体验的影响,以及优化如何降低成本并提高效率。利用Fluid和模型优化等技术,我们将分享增强推理性能的策略。基于各种GPU的性能和成本分析,我们将指导人工智能工程师进行硬件选择。此外,我们将介绍一种性能测试工具,评估并推荐最佳模型、硬件和加速方案组合,根据测试结果与部署工作流程相匹配。
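The hardware-selection guidance described above ultimately reduces to comparing measured throughput against hourly price. A minimal sketch of that comparison (the GPU names and numbers are placeholders, not benchmark results from the talk):

```python
def best_value(benchmarks):
    """Return the GPU giving the most tokens per second per dollar-hour."""
    return max(benchmarks,
               key=lambda g: benchmarks[g]["tps"] / benchmarks[g]["usd_per_hour"])

benchmarks = {
    "gpu-a": {"tps": 1800, "usd_per_hour": 2.0},  # 900 tok/s per $/h
    "gpu-b": {"tps": 2600, "usd_per_hour": 4.5},  # ~578 tok/s per $/h
}
print(best_value(benchmarks))
```

The faster card is not automatically the better buy; a testing tool like the one described feeds real measurements into exactly this kind of ranking before wiring the winner into the deployment workflow.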
Speakers
Yifei Zhang

Software Engineer, Bytedance
Yifei Zhang, Software Engineer at Volcengine, focuses on technical research and product development in Kubernetes and AI, and has rich experience in public cloud, and is now fully working on VKE (Volcengine Kubernetes Engine), which is the managed Kubernetes product in Volcengine... Read More →
钱磊

Software Engineer, Bytedance
A Kubernetes developer at ByteDance, focused on building a stable Kubernetes engine on public cloud.
Wednesday August 21, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

11:50 HKT

Extend Kubernetes to Edge Using Event-Based Transport | 使用基于事件的传输将Kubernetes扩展到边缘 - Longlong Cao & Meng Yan, Red Hat
Wednesday August 21, 2024 11:50 - 12:25 HKT
Struggling with extensive edge cluster management? Kubernetes adoption brings new challenges, especially in sectors like telecom, retail, and manufacturing. The surge in clusters highlights Kubernetes' limitations, worsened by unreliable networks between data centers and edge clusters. Without scalable control, organizations resort to sending engineers to maintain thousands or even millions of edge clusters, slowing progress. But, we have a solution: connecting Kubernetes and edge clusters via event-based transport, utilizing standard open-source protocols like Kafka, MQTT, and NATS. This enhances Kubernetes-style events, making them resilient to network delays or disconnects. With these capabilities, we can effortlessly construct a central control plane scalable to millions of edge clusters. Join us for an intuitive control plane, handling a million edge clusters across regions. Learn an approach that can be adapted to your edge management infrastructure today.

您是否正在为庞大的边缘集群管理而苦恼?Kubernetes的采用带来了新的挑战,尤其是在电信、零售和制造等行业。集群数量的激增凸显了Kubernetes的局限性,而数据中心和边缘集群之间不稳定的网络使问题更加严重。在缺乏可扩展控制的情况下,组织不得不派遣工程师去维护成千上万甚至数百万个边缘集群,从而拖慢了进展。但是,我们有解决方案:通过基于事件的传输将Kubernetes和边缘集群连接起来,利用标准的开源协议如Kafka、MQTT和NATS。这样可以增强Kubernetes风格的事件,使其能够抵御网络延迟或断开连接。有了这些功能,我们可以轻松构建一个可扩展到数百万个边缘集群的中央控制平台。加入我们,体验一个直观的控制平台,可以跨区域管理数百万个边缘集群。学习一种可以立即应用于您的边缘管理基础设施的方法。
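The resilience to delays and disconnects described above usually comes down to sequence numbers plus replay on reconnect. A minimal stdlib sketch of that idea (illustrative only, not the actual transport layer of these projects):

```python
class EventLog:
    """Broker-side log: each event gets a monotonically increasing sequence
    number, so a consumer that drops offline can resync what it missed."""

    def __init__(self):
        self.events = []

    def publish(self, payload):
        self.events.append((len(self.events) + 1, payload))

    def replay(self, last_seen):
        """Everything published after the consumer's last acknowledged event."""
        return [(seq, p) for seq, p in self.events if seq > last_seen]

log = EventLog()
for status in ["cluster-1: Ready", "cluster-1: NotReady", "cluster-1: Ready"]:
    log.publish(status)

# An edge hub that disconnected after seq 1 catches up on reconnect:
print(log.replay(last_seen=1))
```

Kafka offsets, MQTT persistent sessions, and NATS JetStream consumers all provide this cursor-and-replay primitive out of the box, which is why those protocols suit flaky edge links better than a watch over a direct apiserver connection.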
Speakers
Longlong Cao

Senior Software Engineer, Red Hat
Longlong Cao currently works as a cloud engineer at Red Hat. He is also a maintainer of the Istio project and a member of the Kubernetes SIGs. He is passionate about open source projects and has extensive experience in Docker, Kubernetes, and Service Mesh. He writes blogs/articles and... Read More →
Meng Yan

Software Engineer, Red Hat
Meng Yan currently works as a software engineer at Red Hat, mainly on the management of large-scale clusters. He has contributed to open source projects such as multicluster-global-hub and multicluster-controlplane, and also participates in improving CloudEvents.
Wednesday August 21, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

13:50 HKT

⚡ Lightning Talk: Continuously Profile Your Applications in Kubernetes with Pyroscope | ⚡ 闪电演讲: 使用Pyroscope在Kubernetes中持续对应用程序进行性能分析 - Kerrigan Lin, Amazon Web Services
Wednesday August 21, 2024 13:50 - 13:55 HKT
Explore performance optimization in Kubernetes using Pyroscope. This Lightning Talk will cover advanced strategies to uncover and resolve performance bottlenecks, enhancing application efficiency and reliability. Tailored for developers and SRE engineers, the session will highlight case studies and demonstrate practical applications of these technologies in real-world scenarios. Attendees will leave with actionable insights for effective performance tuning of containerized applications in Kubernetes.

探索使用Pyroscope进行Kubernetes性能优化。这场闪电演讲将涵盖发现和解决性能瓶颈的高级策略,提升应用程序的效率和可靠性。本场演讲专为开发人员和SRE工程师打造,将重点介绍案例研究,并演示这些技术在实际场景中的实际应用。与会者将获得有关在Kubernetes中对容器化应用程序进行有效性能调优的可操作见解。
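Continuous profilers like Pyroscope work by periodically sampling stack traces and aggregating them into flame graphs. The core idea fits in a few lines of stdlib Python (a toy sketch of the sampling principle, nothing like Pyroscope's actual agent):

```python
import collections
import sys
import threading
import time

def sample_stacks(duration=0.2, interval=0.005):
    """Toy sampling profiler: every `interval` seconds, record the current
    call stack of every thread, collapsed into a 'leaf;caller;...' string."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for frame in sys._current_frames().values():
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            counts[";".join(stack)] += 1
        time.sleep(interval)
    return counts  # flame-graph input: stack string -> sample count

# Profile a deliberately hot function for a moment:
stop = threading.Event()
def busy_loop():
    while not stop.is_set():
        sum(range(100))

worker = threading.Thread(target=busy_loop)
worker.start()
counts = sample_stacks(duration=0.3)
stop.set()
worker.join()
hot = any("busy_loop" in stack for stack in counts)
print(hot)
```

Aggregated stack counts like these are what a profiling backend renders as flame graphs; the hot function dominates the samples, which is how bottlenecks surface without instrumenting the code.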
Speakers
Kerrigan Lin

Solutions Architect, Amazon Web Services
Kerrigan Lin brings over 14 years of experience in the information technology industry, with a background in software development and architecture. Currently, he serves as a Solutions Architect at AWS, where he helps clients build cloud-native systems.
Wednesday August 21, 2024 13:50 - 13:55 HKT
Level 1 | Hung Hom Room 1
  ⚡ Lightning Talks | ⚡ 闪电演讲, Observability

13:50 HKT

Boundaryless Computing: Optimizing LLM Performance, Cost, and Efficiency in Multi-Cloud Architecture | 无边界计算:在多云架构中优化LLM性能、成本和效率 - Jian Zhu, Red Hat & Kai Zhang, Alibaba Cloud Intelligence
Wednesday August 21, 2024 13:50 - 14:25 HKT
For large language model (LLM) inference, GPU resources within a single data center or cloud region often cannot meet all user demands. Additionally, for the end-users, deploying across multiple geographic regions is necessary to provide an optimal user experience. However, managing model distribution, synchronization, and consistency across multiple regions presents new challenges. To address this, the OCM and Fluid communities have collaborated to automate the multi-region distribution of inference applications through OCM's multi-cluster application deployment capabilities, combined with Fluid's data orchestration capabilities. This automation facilitates the cross-regional distribution and pre-warming of large models, enhancing the efficiency of model deployment and upgrades.

对于大型语言模型(LLM)推理,单个数据中心或云区域内的GPU资源通常无法满足所有用户需求。此外,对于最终用户来说,跨多个地理区域部署是为了提供最佳用户体验。然而,在多个地区管理模型分发、同步和一致性会带来新的挑战。为了解决这个问题,OCM和Fluid社区合作,通过OCM的多集群应用部署能力和Fluid的数据编排能力自动化实现推理应用的多地区分发。这种自动化促进了大型模型的跨地区分发和预热,提高了模型部署和升级的效率。
Speakers
Kai Zhang

Senior Staff Engineer, Alibaba
Kai Zhang is a Senior Staff Engineer at Alibaba Cloud Intelligence, where he has been part of the team developing the Alibaba Cloud container service for Kubernetes (ACK) for over 6 years. He currently leads ACK’s Cloud native AI product and solution offerings. Before this, he spent... Read More →
Jian Zhu

Senior Software Engineer, RedHat
Zhu Jian is a senior software engineer at Red Hat and a core contributor to the Open Cluster Management project. Jian enjoys solving multi-cluster workload distribution problems and extending OCM with add-ons.
Wednesday August 21, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

13:55 HKT

⚡ Lightning Talk: Discussion on CNAI Widely Used in Education | ⚡ 闪电演讲: 教育中广泛使用的CNAI讨论 - Chen Lin, VMware by Broadcom
Wednesday August 21, 2024 13:55 - 14:00 HKT
This lightning talk will discuss Cloud Native Artificial Intelligence (CNAI) in education from three aspects. First, it introduces the current state of CNAI applied to children's education. Second, it demos a kid-friendly prototype of an AI training process on cloud native infrastructure. Third, it talks about more possibilities for CNAI in pre-school and in-school education (children's enlightenment, student assignment correction, AI teaching...), and also brings up foreseeable malicious abuse of CNAI.

这个闪电演讲将从三个方面讨论在教育领域中使用的云原生人工智能(CNAI)。首先,介绍目前在儿童教育中应用CNAI的现状。其次,演示一个儿童友好的AI培训过程原型在云原生基础设施上的应用。第三,讨论CNAI在学前和学校教育中的更多可能性(儿童启蒙,学生作业批改,AI教学...),同时提出CNAI可能存在的恶意滥用问题。
Speakers
Chen Lin

Software Engineer, VMware by Broadcom
Chen Lin joined VMware in 2019 and has 5 years of cloud native experience. Chen worked on the PKS, Tanzu, and TKGs products, focusing on networking and production CI/CD. Chen is also a member of the Kubernetes community and the maintainer of cloud-provider-vsphere.
Wednesday August 21, 2024 13:55 - 14:00 HKT
Level 1 | Hung Hom Room 1

14:05 HKT

⚡ Lightning Talk: How Prometheus AI Agent Helps Build Interactive Monitoring? | ⚡ 闪电演讲: Prometheus AI代理如何帮助构建交互式监控? - Zhihao Liu, Quwan
Wednesday August 21, 2024 14:05 - 14:10 HKT
In day-to-day work, both SREs and developers often struggle with observability tools like Prometheus, mainly due to the complex PromQL syntax and disorganized metrics. This talk will showcase how to build such an agent. It will have the ability to think, act, and analyze like a human, and it will solve user issues through conversation. This talk presents two main standout ideas: 1. Leveraging RAG technology, it performs multi-path retrieval from local metric knowledge, the Prometheus API, request logs, and public domain knowledge to produce a consolidated answer. 2. Using the ReAct method, it engages in multi-round dialogues to refine and generate the correct PromQL, call APIs, and render the returned dashboard. From this talk, we hope the audience will learn: 1. How to integrate LLMs effectively within the observability space. 2. The steps to create an easy-to-use and practical Prometheus AI Agent. 3. Experience and insights from practical examples of the Prometheus AI Agent.

在日常工作中,SRE和开发人员在使用像Prometheus这样的可观察性工具时经常遇到困难,主要是由于复杂的PromQL语法和混乱的指标。本次演讲将展示如何构建Agent。它将具有像人类一样思考、行动和分析的能力,并通过对话解决用户问题。 本次演讲提出了两个主要的突出想法: 1. 利用RAG技术,从本地度量知识、Prometheus API、请求日志和公共领域知识中进行多路径检索,以生成一个整合的答案。 2. 使用ReAct方法,进行多轮对话以完善和生成正确的PromQL,调用api,并呈现仪表板返回。 通过本次演讲,我们希望观众能学到: 1. 如何在可观察性领域有效地整合LLM。 2. 创建一个易于使用和实用的Prometheus人工智能Agent的步骤。 3. 从Prometheus人工智能Agent的实际示例中获得经验和见解。
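Idea 2 above, the ReAct-style refinement loop, is essentially propose-validate-retry. A minimal sketch with a stubbed model and validator (the function names and the toy validation rule are hypothetical; a real agent would call an LLM and the Prometheus query API here):

```python
def react_promql_agent(question, llm, validate, max_rounds=3):
    """Propose a PromQL query, validate it, feed the error back, retry."""
    feedback = ""
    for _ in range(max_rounds):
        query = llm(question, feedback)   # Thought + Action
        ok, error = validate(query)       # Observation
        if ok:
            return query
        feedback = error                  # refine in the next round
    return None

# Stub: the first proposal is missing a range selector, the retry fixes it.
attempts = iter(["rate(http_requests_total)",
                 "rate(http_requests_total[5m])"])
llm = lambda question, feedback: next(attempts)
validate = lambda q: ("[" in q, "rate() expects a range vector, e.g. [5m]")
result = react_promql_agent("What is the QPS of my service?", llm, validate)
print(result)
```

The observation step is what separates this from one-shot generation: a real validator can run the query against Prometheus and return the parser error or an empty-result hint as the next round's feedback.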
Speakers
Zhihao Liu

Senior Devops Engineer, Quwan
Three years of experience in the observability field; involved in the development of the company's observability platform.
Wednesday August 21, 2024 14:05 - 14:10 HKT
Level 1 | Hung Hom Room 1
  ⚡ Lightning Talks | ⚡ 闪电演讲, Observability

14:40 HKT

⚡ Lightning Talk: Kubernetes Raises Questions. Can a PaaS Answer Them? | ⚡ 闪电演讲: Kubernetes引发了问题。 PaaS能解答吗? - Ram Iyengar, Cloud Foundry Foundation
Wednesday August 21, 2024 14:40 - 14:45 HKT
The enormous success of the CNCF Landscape has produced an overwhelming number of options in the space, and organizations struggle to establish their platforms quickly. This talk will help guide the community through the thought process of building these platforms, explore some examples of what a healthy source-driven platform ecosystem looks like, and showcase the power that a good cloud native platform delivers to an organization. Though there are variations of platforms (e.g. data, application, machine learning), many run into the same problems. These include artifact management, secrets management, TLS certificates, cloud permissions, and the list goes on. Providing turnkey solutions for platforms that can be ready in minutes adds much velocity to engineering teams across organizations that adopt the platform engineering model.

CNCF景观的巨大成功在该领域产生了大量的选择,组织往往难以快速建立自己的平台。本次演讲将帮助指导社区通过构建这些平台的思考过程,探讨健康的源驱动平台生态系统的一些示例,并展示一个优秀的云原生平台将为组织带来的力量。 尽管平台有各种变化(如数据、应用程序、机器学习等),许多开始出现相同的问题。这些问题包括工件管理、密钥管理、TLS证书、云权限等等。为平台提供即插即用的解决方案,可以在几分钟内准备就绪,为采用平台工程模型的组织的工程团队带来更大的速度。
Speakers
Ram Iyengar

Chief Evangelist, Cloud Foundry Foundation
Ram Iyengar is an engineer by practice and an educator at heart. He was (cf) pushed into technology evangelism along his journey as a developer and hasn’t looked back since! He enjoys helping engineering teams around the world discover new and creative ways to work. He is a proponent... Read More →
Wednesday August 21, 2024 14:40 - 14:45 HKT
Level 1 | Hung Hom Room 1

14:40 HKT

Best Practice: Karmada & Istio Improve Workload & Traffic Resilience of Production Distributed Cloud | 最佳实践:Karmada和Istio提高生产分布式云的工作负载和流量弹性 - Chaomeng Zhang, Huawei
Wednesday August 21, 2024 14:40 - 15:15 HKT
The distributed cloud offers better resilience by providing redundancy, scalability, and flexibility, especially for cloud native applications. However, the complexity of multi-cluster workload and traffic management in hybrid or multi-cloud environments brings huge challenges in practice; for example, the overall number of multi-cluster workload instances serving customer requests drops when unhealthy instances are isolated during failures. In this talk, Chaomeng introduces a production practice in which Karmada and Istio work together to improve the resilience of multi-cluster applications: how Karmada and Istio policies configured in a centralized control plane automatically control both replica and traffic distribution across clusters; how, in case of failures, Istio's failover removes unhealthy endpoints from the global load balancing pool; and how Karmada rebuilds the corresponding number of instances in other healthy clusters, ensuring multi-cluster instances always meet the capacity design.

分布式云通过提供冗余、可伸缩性和灵活性,特别是对于云原生应用程序,提供了更好的弹性。然而,在混合或多云环境中的多集群工作负载和流量管理的复杂性在实践中带来了巨大挑战,例如当一些不健康的实例在故障情况下被隔离时,为客户请求提供服务的整体多集群工作负载实例数量减少。 在这次演讲中,Chaomeng介绍了Karmada和Istio共同推动多集群应用程序弹性的生产实践。Karmada和Istio策略如何在集中控制平面中配置,自动控制跨集群的副本和流量分发。在发生故障时,Istio的故障转移如何从全局负载均衡池中移除不健康的端点,以及Karmada如何在其他健康集群中重新构建相应数量的实例,确保多集群实例始终满足容量设计。
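The rebuild step described above, recreating the lost capacity in the surviving clusters, can be pictured as weighted redistribution over the healthy set. A simplified illustration, not Karmada's actual replica-divider algorithm:

```python
def redistribute(total, weights, healthy):
    """Recompute per-cluster replicas when clusters fail: the full replica
    budget is spread over the remaining healthy clusters by static weight."""
    live = {c: w for c, w in weights.items() if c in healthy}
    budget = sum(live.values())
    out = {c: total * w // budget for c, w in live.items()}
    # hand out the rounding remainder deterministically, heaviest first
    for c in sorted(live, key=live.get, reverse=True):
        if sum(out.values()) == total:
            break
        out[c] += 1
    return out

# Cluster c fails: its share is rebuilt across a and b automatically,
# so total serving capacity stays at the designed 6 replicas.
print(redistribute(6, {"a": 1, "b": 1, "c": 1}, healthy={"a", "b"}))
```

Istio's failover handles the traffic half of the same story: while the replicas are being rebuilt, requests are steered only to endpoints that remain in the global load balancing pool.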
Speakers
Chaomeng Zhang

Architect of UCS (HUAWEI Distributed Cloud Native), Huawei
Zhang Chaomeng is the architect of UCS (HUAWEI Distributed Cloud Native) and has 9 years of cloud computing design and development experience in HUAWEI Cloud, covering service mesh, Kubernetes, microservices, cloud service catalog, big data, APM, cloud computing reliability and... Read More →
Wednesday August 21, 2024 14:40 - 15:15 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Connectivity

14:40 HKT

Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience | 连接点:走向统一的多集群AI/ML体验 - Qing Hao, RedHat & Chen Yu, Microsoft
Wednesday August 21, 2024 14:40 - 15:15 HKT
Today, cloud-native infrastructure is vital for AI/ML; administrative complexity and the growing demand for compute resources drive developers towards multi-cluster patterns. Batch scheduling projects like Kueue are valuable for efficient AI/ML training in a single Kubernetes cluster. Multi-cluster management platforms like OCM and Fleet simplify cluster management and provide advanced scheduling features. We hope to bridge the best of both worlds to simplify user operations and reduce confusion between different systems. In this talk, we will showcase how SIG Multicluster's newly proposed API, ClusterProfile, combined with OCM, Fleet, and Kueue, addresses these challenges. We will demonstrate that MultiKueue setup can be easily automated with the help of the ClusterProfile API; with a few tweaks, users can use OCM and Fleet's advanced scheduling features through MultiKueue to intelligently place AI/ML jobs across clusters, maximizing utilization of resources like GPUs to save costs.

今天,云原生基础设施对于人工智能/机器学习、管理复杂性以及对计算资源需求不断增长至关重要,这推动开发人员转向多集群模式。像Kueue这样的批处理调度项目对于在单个Kubernetes集群中高效进行人工智能/机器学习训练非常有价值。OCM和Fleet等多集群管理平台简化了集群管理,并提供了高级调度功能。我们希望将两者的优势结合起来,简化用户操作,减少不同系统之间的混乱。 在本次演讲中,我们将展示如何借助Sig Multi-Cluster最新提出的API - ClusterProfile,结合OCM、Fleet和Kueue来解决这些挑战。我们将演示如何通过ClusterProfile API轻松自动化MultiKueue设置;通过一些调整,用户可以利用OCM和Fleet的高级调度功能,通过MultiKueue智能地在集群之间放置人工智能/机器学习作业,以最大化资源利用率,如GPU,以节省成本。
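The smart-placement step can be pictured as a tiny dispatch decision: admit the job wherever the most GPUs are free. This is an illustrative stand-in for the kind of choice a MultiKueue-style dispatcher makes, not real API usage:

```python
def pick_cluster(free_gpus, gpus_needed):
    """Choose the cluster with the most free GPUs that still fits the job;
    None means the job stays queued until capacity frees up."""
    fitting = {name: free for name, free in free_gpus.items()
               if free >= gpus_needed}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_cluster({"east": 4, "west": 8}, gpus_needed=6))   # west
print(pick_cluster({"east": 4, "west": 8}, gpus_needed=16))  # None
```

The hard part in practice is the input: an inventory API such as ClusterProfile is what lets the dispatcher know each member cluster's identity and properties without bespoke per-platform glue.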
Speakers
Qing Hao

Senior Software Engineer, RedHat
Qing Hao is a senior software engineer at Red Hat, where she works as a maintainer of Open Cluster Management. Qing is interested in solving complex problems in the multi-cluster area, e.g., application scheduling and rolling upgrades of management components. Prior to Red Hat, she worked... Read More →
Chen Yu

Senior Software Engineer, Microsoft
Chen Yu is a senior software engineer at Microsoft with a keen interest in cloud-native computing. He is currently working on Multi-Cluster Kubernetes and contributing to the Fleet project open-sourced by Azure Kubernetes Service.
Wednesday August 21, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

14:45 HKT

⚡ Lightning Talk: Rocket Power Your Kubernetes Career with Kubestronaut Program | ⚡ 闪电演讲: 用Kubestronaut计划提升您的Kubernetes职业生涯火力 - Giorgi Keratishvili, EPAM Systems
Wednesday August 21, 2024 14:45 - 14:50 HKT
Are you someone who wants to fly high and conquer the mountain of Kubernetes certifications? Then this talk is for you. Giorgi will share all the details of the Kubestronaut program and the benefits it gives, along with his own certification journey: he holds all 5 and even more certificates from the CNCF, and has been a beta tester and exam developer for some of them...

您是想要飞得更高的人吗?征服 Kubernetes 认证的高山?那么这个讲座适合您。Giorgi 将分享 kubestronaut 计划的所有细节,以及它对个人和他的认证之旅带来的好处。他拥有 CNCF 颁发的所有 5 个甚至更多证书,并且还曾担任其中一些证书的测试人员和考试开发人员...
Speakers
Giorgi Keratishvili

Lead System Engineer (DevOps), EPAM Systems
Giorgi has been in the IT field for a decade, during which he has been exposed to most areas of development and operations, from bare-metal infrastructure to higher-level automation. Outside working hours, Giorgi actively participates in the community. He plays a role... Read More →
Wednesday August 21, 2024 14:45 - 14:50 HKT
Level 1 | Hung Hom Room 1

14:50 HKT

⚡ Lightning Talk: Running Native WebAssembly AI Applications Everywhere | ⚡ 闪电演讲: 在任何地方运行原生WebAssembly人工智能应用程序 - Tiejun Chen, VMware
Wednesday August 21, 2024 14:50 - 14:55 HKT
In recent years, WASM has been one of the hottest topics in the world of computing due to its portability, small size, fast loading, and compatibility. Given these advantages, WebAssembly is an ideal sandbox-based technology for modern applications, including ML/AI. Beyond the browser, however, WebAssembly can currently leverage mostly the CPU to accelerate ML/AI. Here we offer a flexible way to run ML/AI on WebAssembly over a variety of AI accelerators by empowering WASM with a transparent backend interposer. With this, your native ML/AI WebAssembly workloads can seamlessly enjoy the underlying AI accelerators, such as CPU, GPU, FPGA, and so on, with best performance. During this presentation we will also show our latest implementation, with demos to help users get direct insight into running ML/AI with WebAssembly on AI accelerators.

近年来,由于其可移植性、体积小、加载速度快和兼容性等优势,WASM已成为计算领域最热门的话题之一。鉴于这些优势,WebAssembly是基于沙箱方案的现代应用程序,包括ML/AI的理想技术。但除了浏览器之外,目前WebAssembly只能利用CPU来加速大部分ML/AI。在这里,我们提供了一种灵活的方式,通过为WASM赋予一个透明的后端插入器,使其能够在各种AI加速器上运行ML/AI。借助这一技术,您的本地ML/AI WebAssembly工作负载可以无缝地享受CPU、GPU、FPGA等底层AI加速器的最佳性能。在本次演示中,我们还将展示我们最新的实现,并通过演示帮助用户直观了解在AI加速器上运行ML/AI的WebAssembly。
Speakers
Tiejun Chen

Sr. Technical Lead, VMware
Tiejun Chen is a Sr. Technical Lead. He has worked at several tech companies, such as VMware, Intel, and Wind River Systems, and has been involved in cloud native, edge computing, ML/AI, RISC-V, WebAssembly, etc. He has given many presentations, including at AI.Dev NA 2023, KubeCon China 2021, Kube... Read More →
Wednesday August 21, 2024 14:50 - 14:55 HKT
Level 1 | Hung Hom Room 1

14:55 HKT

⚡ Lightning Talk: Tips and Tricks to (Right) Size Your Kubernetes Cluster for Efficiency and Cost Saving | ⚡ 闪电演讲: 为了提高效率和节约成本,调整Kubernetes集群大小的技巧和窍门 - Daniele Polencic, Learnk8s
Wednesday August 21, 2024 14:55 - 15:00 HKT
In this session, you will learn how Kubernetes allocates resources on worker nodes and how to get the most out of them by choosing the right limits and requests for your workloads. We will cover practical tips for allocating the right number of nodes and resources to your cluster: - Should you have larger or smaller nodes? - How do reservations affect efficiency and cost savings? - How to "defrag" your cluster to optimize allocations. And more.

在这场演讲中,您将学习Kubernetes如何在工作节点中分配资源,以及如何通过为工作负载选择正确的限制和请求来充分利用它们。 您将学习一些实用的技巧,来为您的集群分配正确数量的节点和资源: - 您应该选择更大还是更小的节点? - 预留资源如何影响效率和节约成本? - 如何“整理”您的集群以优化分配 等等。
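The requests/limits trade-off at the heart of this talk can be sketched with a minimal, hypothetical pod manifest (all names and values below are illustrative placeholders, not recommendations from the speaker):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sizing-demo            # hypothetical example name
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:              # what the scheduler reserves on a node
          cpu: "250m"
          memory: "256Mi"
        limits:                # hard ceiling enforced at runtime
          cpu: "500m"
          memory: "512Mi"
```

The gap between requests and limits largely determines how tightly a node can be packed.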
Speakers
avatar for Daniele Polencic

Daniele Polencic

Instructor, Learnk8s
Daniele teaches containers and Kubernetes at Learnk8s. Daniele is a certified Kubernetes administrator by the Linux Foundation. In the last decade, Daniele trained developers for companies in the e-commerce, finance and public sector.
Wednesday August 21, 2024 14:55 - 15:00 HKT
Level 1 | Hung Hom Room 1

15:00 HKT

⚡ Lightning Talk: Use Keycloak to Build an Authentication System for Cloud-Native Application | ⚡ 闪电演讲: 使用Keycloak为云原生应用构建身份验证系统 - Yiting Jiang, DaoCloud
Wednesday August 21, 2024 15:00 - 15:05 HKT
Identity authentication is the most basic function for applications, especially for enterprise-level management systems, which usually need to implement functions such as identity management, single sign-on, and security policy settings. Keycloak is an open source identity and access management (IAM) solution; it can be easily deployed on Kubernetes and provides applications with features such as centralized authentication. This talk will explain how our cloud native management system makes full use of Keycloak's powerful and comprehensive features to implement enterprise-level identity and secure access management. To meet our own requirements, we also created some Keycloak plugins to extend its IdP and Event functions, which serve as a good example to learn from when customization is needed.

身份认证机制是应用程序最基本的功能,尤其对于企业级管理系统而言。它们通常需要实现身份管理、单点登录和安全策略设置等功能。Keycloak 是一个开源的身份和访问管理(IAM)解决方案,可以轻松部署在 Kubernetes 上,为应用程序提供集中认证等功能。本次演讲将解释我们的云原生管理系统如何充分利用 Keycloak 强大而全面的功能来实现企业级身份和安全访问管理功能。为了满足我们的需求,我们还创建了一些 Keycloak 插件来扩展其身份提供者(IDP)和事件功能,当需要定制化时,这些插件是很好的学习例子。
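As a rough sketch of how easily Keycloak can be deployed on Kubernetes (a minimal dev-mode deployment, not the speaker's production setup; names and versions are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak               # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: keycloak
  template:
    metadata:
      labels:
        app: keycloak
    spec:
      containers:
        - name: keycloak
          image: quay.io/keycloak/keycloak:24.0
          args: ["start-dev"]  # dev mode only; use "start" with TLS in production
          env:
            - name: KEYCLOAK_ADMIN
              value: admin
            - name: KEYCLOAK_ADMIN_PASSWORD
              value: change-me # store in a Secret for real deployments
          ports:
            - containerPort: 8080
```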
Speakers
avatar for Yiting Jiang

Yiting Jiang

Dev Manager, DaoCloud
Graduated from Tongji University with a Master's degree, majoring in Computer Software and Theory. Previously worked at EMC, VMware, and DellEMC.
Wednesday August 21, 2024 15:00 - 15:05 HKT
Level 1 | Hung Hom Room 1
  ⚡ Lightning Talks | ⚡ 闪电演讲, Security

15:35 HKT

How Fast Can Your Model Composition Run in Serverless Inference? | 您的模型组合在无服务器推理中可以运行多快? - Fog Dong, BentoML & Wenbo Qi, Ant Group
Wednesday August 21, 2024 15:35 - 16:10 HKT
Are you struggling with slow deployment times, high operational costs, or scalability issues when serving your ML models? Now, imagine the added complexity when typical AI apps require not just one, but an interconnected suite of models. In this session, discover how the integration of BentoML with Dragonfly effectively addresses these challenges, transforming the landscape of multi-model composition and inference within serverless Kubernetes envs. Join the co-presentation by the BentoML and Dragonfly communities to explore a compelling case study: a RAG app that combines 3 models—LLM, embedding, and OCR. Learn how our framework not only packages these diverse models efficiently but also utilizes Dragonfly's innovative P2P network for swift distribution. We'll further delve into how other open-source technologies like JuiceFS and VLLM have enabled us to achieve remarkable deployment times of just 40 seconds and establish a scalable blueprint for multi-model composition deployments.

您是否在为机器学习模型的部署时间慢、运营成本高或可扩展性问题而苦恼?现在,想象一下当典型的人工智能应用程序不仅需要一个模型,而是一个相互连接的模型套件时所增加的复杂性。在本场演讲中,了解BentoML与Dragonfly的集成如何有效解决这些挑战,改变了无服务器Kubernetes环境中多模型组合和推理的格局。 加入BentoML和Dragonfly社区的联合演示,探索一个引人注目的案例研究:一个结合了LLM、嵌入和OCR三个模型的RAG应用程序。了解我们的框架不仅高效打包这些多样化的模型,还利用Dragonfly创新的P2P网络进行快速分发。我们还将深入探讨其他开源技术,如JuiceFS和VLLM,如何帮助我们实现仅需40秒的部署时间,并为多模型组合部署建立可扩展的蓝图。
Speakers
avatar for Wenbo Qi

Wenbo Qi

Senior Software Engineer, Ant Group
Wenbo Qi is a software engineer at Ant Group working on Dragonfly. He is a maintainer of Dragonfly. He hopes to make positive contributions to open source software and believes that fear springs from ignorance.
avatar for Fog Dong

Fog Dong

Senior Software Engineer, BentoML
Fog Dong, a Senior Engineer at BentoML, KubeVela maintainer, CNCF Ambassador, and LFAPAC Evangelist, has a rich background in cloud native. Previously instrumental in developing Alibaba's large-scale Serverless Application Engine workflows and Bytedance's cloud-native CI/CD platform... Read More →
Wednesday August 21, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

15:35 HKT

Implementing Seamless Connectivity and Service Governance in Multi Kubernetes Cluster with ZTM | 在多个Kubernetes集群中使用ZTM实现无缝连接和服务治理 - Xiaohui Zhang, Flomesh
Wednesday August 21, 2024 15:35 - 16:10 HKT
In the evolving cloud-native ecosystem, Kubernetes is vital for microservices. As enterprises adopt multi-cluster Kubernetes setups, securely managing cross-cluster communications becomes challenging due to the limitations of traditional gateways and Ingress solutions. This session explores how ZTM (Zero Trusted Mesh) acts as a bridge across K8s clusters, bypassing traditional gateways and network constraints, thus ensuring zero exposure and boosting security. ZTM uses an HTTP/2-based tunneling mechanism with end-to-end encryption, minimizing public exposure and securing data during transmission. Its design enables quick deployment of cross-cluster communications without altering existing networks or applications, easing management. Furthermore, ZTM integrates with service mesh technologies to provide a secure framework for microservices, supporting service discovery, load balancing, and advanced routing policies, allowing flexible and secure cross-cluster service management.

在不断发展的云原生生态系统中,Kubernetes 对于微服务至关重要。随着企业采用多集群 Kubernetes 设置,由于传统网关和入口解决方案的限制,安全地管理跨集群通信变得具有挑战性。 本场演讲探讨了 ZTM(Zero Trusted Mesh)如何作为跨 K8s 集群的桥梁,绕过传统网关和网络限制,从而确保零暴露并提升安全性。 ZTM 使用基于 HTTP/2 的隧道机制进行端到端加密,最大程度减少公开暴露并在传输过程中保护数据安全。其设计能够快速部署跨集群通信,而无需改变现有网络或应用程序,简化管理。 此外,ZTM 还与服务网格技术集成,为微服务提供安全框架,支持服务发现、负载均衡和高级路由策略,实现灵活且安全的跨集群服务管理。
Speakers
avatar for AddoZhang

AddoZhang

Cloud Native Architect, Flomesh
Senior programmer, LFAPAC open source evangelist, CNCF Ambassador, Microsoft MVP, author of the WeChat public account "云原生指北". Years of practical experience in microservices and cloud-native, the main work involves microservices, containers, Kubernetes, DevOps, etc.
Wednesday August 21, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

16:25 HKT

Istio and Modern API Gateways: Navigating the Future of Service Meshes | Istio和现代API网关:引领服务网格的未来 - Jimmy Song & Jianpeng He, Tetrate; Jiaqi Zhang, Alibaba Cloud; Jintao Zhang, Kong Inc.; Xunzhuo Liu, Tencent
Wednesday August 21, 2024 16:25 - 17:00 HKT
Join our esteemed panel of experts as they delve into the latest advancements and integrations in the world of Istio and API gateways. This discussion, led by Jimmy Song from Tetrate and founder of the China Cloud Native Community, will feature insights from core contributors and thought leaders including Jianpeng He (Tetrate), Jintao Zhang (Kong), Xunzhuo Liu (Tencent) and Zhang Jiaqi (Alibaba Cloud). The panel will explore Istio's recent developments such as Ambient Mesh, sidecar-less architectures, and the application of eBPF, along with the evolving role of Envoy Gateway. Participants will gain an in-depth understanding of how API gateways are blending with service meshes to create more dynamic, efficient, and secure cloud-native environments.

加入我们尊贵的专家小组,他们将深入探讨 Istio 和 API 网关领域的最新进展和集成。这次讨论由 Tetrate 的 Jimmy Song 主持,他是中国云原生社区的创始人,将邀请核心贡献者和思想领袖,包括 Jianpeng He(Tetrate)、Jintao Zhang(Kong)、Xunzhuo Liu(腾讯)和张佳琦(阿里云)分享见解。小组将探讨 Istio 的最新发展,如环境网格、无边车架构以及 eBPF 的应用,以及 Envoy 网关的不断演变角色。参与者将深入了解 API 网关如何与服务网格融合,创造更具动态、高效和安全的云原生环境。
Speakers
avatar for Jintao Zhang

Jintao Zhang

Sr. SE, Kong
Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes Ingress-NGINX maintainer. He is skilled in cloud-native technology and the Azure technology stack, and works for Kong Inc.
avatar for Jimmy Song

Jimmy Song

Developer Advocate, Tetrate
Jimmy Song is a developer advocate at Tetrate, CNCF Ambassador, Cloud Native Community founder. He is an outstanding translator, author, and producer of PHEI. Early adopters and evangelists of Kubernetes and Istio. Previously, he worked at iFlytek, TalkingData, and Ant Group.
avatar for Xunzhuo

Xunzhuo

Software Engineer, Tencent
Xunzhuo Liu, Software Engineer working at Tencent Kubernetes Engine Team. He is an Open Source Enthusiast, focusing on API Gateway, Service Mesh, and Kubernetes Networking. He is the steering committee member, core maintainer of Envoy Gateway, also maintaining a couple of CNCF projects... Read More →
avatar for Jianpeng He

Jianpeng He

Software Engineer, Tetrate
Jianpeng is a core maintainer of Istio and co-leader of the Extensions and Telemetry working group. He has been working on Istio for almost 3 years and is a maintainer of Envoy Gateway.
avatar for Jiaqi Zhang

Jiaqi Zhang

software engineer, Alibaba Cloud
Zhang Jiaqi works on Alibaba Cloud Service Mesh as a software engineer, focusing on traffic management and telemetry, after graduating from the School of Computer Science, Peking University. Participated in several software and computer academic conferences, and keen... Read More →
Wednesday August 21, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

16:25 HKT

Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training | 利用拓扑建模和拓扑感知调度加速LLM训练 - Yang Wang, Huawei
Wednesday August 21, 2024 16:25 - 17:00 HKT
In the LLM training and inference era, the bottleneck has shifted from computing to networking. Many high-throughput, low-latency interconnect technologies are widely used, e.g. NVLink and NVSwitch, to build hyper computers such as the NVIDIA super pod, Google multi-slice, and AWS placement groups. However, Kubernetes has not yet addressed topology awareness efficiently, resulting in low performance when sub-optimal resources are provisioned. This talk will explore inter-node communication and the interconnect of resources within a node, and analyze how these two topological factors impact the runtime performance of AI workloads, especially large language model training. The talk will cover: - How to model the topology of underlying resources like NUMA, Rack, Super Pod, Hyper Computer - How to make the scheduler topology-aware and make the best scheduling decisions - How to coordinate topology-aware scheduling with DRA on the node

在LLM训练和推断时代,瓶颈已经从计算转变为网络。许多高吞吐量和低延迟的互连技术被广泛使用,例如nvlink、nvswitch用于构建超级计算机,如nvidia超级Pod、谷歌多片、AWS放置组。 然而,Kubernetes尚未有效地解决拓扑意识问题,导致在资源配置不佳时性能较低。 本次演讲将探讨节点间通信和节点内部资源的互连。还将分析这两个拓扑因素如何影响AI工作负载的运行性能,特别是对于大型语言模型训练。 演讲内容包括: - 如何对底层资源(如NUMA、机架、超级计算机)建模拓扑 - 如何使调度程序意识到拓扑并进行最佳调度 - 如何协调拓扑感知调度与节点上的DRA
Speakers
avatar for Yang Wang

Yang Wang

Senior engineer and maintainer of Volcano, Huawei Cloud Technologies Co., LTD
Volcano maintainer and speaker at KCD and GOTC. Focused on cloud native scheduling and multi-cluster management.
Wednesday August 21, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

17:15 HKT

Enhancing Application Delivery with KubeVela: Introducing the New Cuex Feature | 通过KubeVela增强应用交付:介绍新的Cuex功能 - Fog Dong, BentoML
Wednesday August 21, 2024 17:15 - 17:50 HKT
As the pace of software development accelerates, the need for more dynamic and flexible application delivery systems becomes crucial. KubeVela, as a modern application delivery system, has continuously evolved to meet these demands. In this session, the maintainers are excited to present the latest advancements in KubeVela, focusing on the introduction of our groundbreaking feature: Cuex. Cuex is designed to revolutionize the way developers interact with KubeVela by simplifying the process of writing and managing application definitions. This innovative feature enhances the core capabilities of KubeVela, allowing users not only to write definitions more effectively but also to extend the platform's functionality by registering their own custom functions as Cuex actions. Join us to explore how KubeVela is setting new standards in application delivery and how you can be a part of this evolving journey.

随着软件开发速度加快,对更具动态性和灵活性的应用交付系统的需求变得至关重要。作为现代应用交付系统,KubeVela不断发展以满足这些需求。在本场演讲中,维护人员很高兴地介绍KubeVela的最新进展,重点介绍我们的突破性功能Cuex的介绍。 Cuex旨在通过简化编写和管理应用程序定义的过程,彻底改变开发人员与KubeVela互动的方式。这一创新功能增强了KubeVela的核心功能,使用户不仅可以更有效地编写定义,还可以通过注册自定义函数作为Cuex操作来扩展平台的功能。 加入我们,探索KubeVela如何在应用交付领域树立新的标准,以及您如何成为这一不断发展之旅的一部分。
Speakers
avatar for Fog Dong

Fog Dong

Senior Software Engineer, BentoML
Fog Dong, a Senior Engineer at BentoML, KubeVela maintainer, CNCF Ambassador, and LFAPAC Evangelist, has a rich background in cloud native. Previously instrumental in developing Alibaba's large-scale Serverless Application Engine workflows and Bytedance's cloud-native CI/CD platform... Read More →
Wednesday August 21, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 6

17:15 HKT

Multi-Cluster Networking and Service Discovery Leveraging NRI | 利用NRI的多集群网络和服务发现 - LingMing Xia, Purple Mountain Laboratories & Di Xu, Xiaohongshu
Wednesday August 21, 2024 17:15 - 17:50 HKT
Connection and service discovery are usually key challenges for multi-cluster management; existing solutions such as Submariner introduce preconditions of public IPs and specific CNIs. This is problematic for projects like the "East-to-West Computing Resource Transfer Project", where clusters lack public IPs and have diverse CNIs due to different ownership. This session introduces a solution that establishes an independent, unified parallel network for east-west traffic across clusters based on the Node Resource Interface (NRI), avoiding intrusive modifications to clusters and limitations on the CNI. A hybrid approach is provided for inter-cluster traffic: clusters can communicate through a hub cluster with a public IP, or connect directly if a public IP is available. Moreover, cross-cluster service discovery follows the MCS standard to ensure seamless service access. All functionality remains agnostic to Kubernetes and applications. A live demo will be shown in this session.

连接和服务发现通常是多集群管理的关键挑战,现有解决方案如Submariner引入了公共IP和特定CNI的先决条件。这对于像“东西计算资源转移项目”这样的项目是有问题的,因为集群缺乏公共IP并且由于不同所有权而具有不同的CNI。 本场演讲介绍了一种解决方案,基于节点资源接口(NRI)建立一个独立和统一的跨集群东西流量网络,以避免对集群进行侵入性修改和对CNI的限制。提供了一种混合方法用于集群间流量:集群可以通过具有公共IP的中心集群进行通信,或者如果具有公共IP则可以直接连接。此外,跨集群服务发现遵循MCS标准,以确保无缝的服务访问。所有功能都与Kubernetes和应用程序无关。 本场演讲将展示现场演示。
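The MCS standard mentioned above exposes a Service to peer clusters through a ServiceExport object; a minimal sketch (the name and namespace are hypothetical and must match an existing Service):

```yaml
# Created in the cluster that owns the Service
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-svc                 # must match the exported Service's name
  namespace: demo
```

Consuming clusters can then resolve the service at `my-svc.demo.svc.clusterset.local` per the MCS DNS convention.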
Speakers
avatar for Di Xu

Di Xu

Principle Software Engineer, Xiaohongshu
Currently, he serves as a Tech Lead at Xiaohongshu, where he leads a team focused on building a highly reliable and scalable container platform. He is the founder of CNCF Sandbox Project Clusternet. Also, he is a top 50 code contributor in Kubernetes community. He had spoken many... Read More →
avatar for Lingming

Lingming

Researcher, Purple Mountain Laboratories
Focusing on subjects such as cloud-native and distributed clouds. I am currently working as a researcher in the New Computing Architecture Research group of Purple Mountain Laboratories.
Wednesday August 21, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

17:15 HKT

Time Series Database on Kubernetes: Efficient Management of Massive Internet of Vehicles Data | Kubernetes上的时序数据库:高效管理海量物联网车辆数据 - Vicky Lee, Huawei Cloud Computing Technology Co., Ltd.
Wednesday August 21, 2024 17:15 - 17:50 HKT
Today, more and more car companies are building a new generation of Internet of Vehicles platforms based on cloud-native technology stacks such as Kubernetes. However, as more cars are produced, they generate hundreds of GB of data every second, making it difficult to store massive data in real time and to control storage costs. This requires the platform's underlying database to be low-cost, high-performance, and efficient. openGemini is a cloud-native distributed time series database with high performance and low cost. For data writing, we provide a dedicated high-performance write component that supports Arrow Flight. For data storage, we provide specialized data compression algorithms and support both local storage and object storage. This talk will introduce how to build Internet of Vehicles platforms based on cloud-native technology stacks and share technical practices for efficiently managing massive vehicle data.

今天,越来越多的汽车公司正在基于Kubernetes等云原生技术堆栈构建新一代车联网平台。然而,随着汽车的生产越来越多,它们每秒产生数百GB的数据,使得实时存储海量数据变得困难,存储成本难以控制。这就要求平台的底层数据库要低成本、高性能和高效。openGemini是一个具有高性能和低成本的云原生分布式时间序列数据库。在数据写入方面,我们提供了支持Arrow Flight的专用高性能数据写入组件。在数据存储方面,我们提供了专门的数据压缩算法,并支持本地数据存储和对象存储。 本次演讲将介绍如何基于云原生技术堆栈构建车联网平台,并分享如何有效管理海量车辆数据的技术实践。
Speakers
avatar for Vicky Lee

Vicky Lee

Engineer, Huawei Cloud Computing Technology Co., Ltd.
Vicky Lee, a Time-series database expert in the HUAWEI CLOUD Database Innovation Lab and the Co-founder of the openGemini community, has been engaged in distributed databases and NoSQL databases as a cloud service for many years. Currently, mainly dedicated to openGemini developm... Read More →
Wednesday August 21, 2024 17:15 - 17:50 HKT
Level 2 | Grand Ballroom 1-2
 
Thursday, August 22
 

11:00 HKT

Enhancing Security and Software Supply Chain: Recent and Upcoming Features in Harbor | 增强安全性和软件供应链:Harbor 中的最新和即将推出的功能 - Stone Zhang, Broadcom
Thursday August 22, 2024 11:00 - 11:35 HKT
In 2023 to 2024, we released Harbor 2.9 and 2.10, and we will release 2.11 soon. In these releases, we have mainly focused on adding or enhancing features related to security and the software supply chain. These features include the Security Hub, which analyzes vulnerability information in artifacts across different dimensions, and the SBOM generation feature, which can create SBOMs manually or automatically. We have also improved the performance of the garbage collector by implementing parallel processing and ensured alignment with the latest OCI specifications. In future releases, we will continue to explore the potential of using SBOMs to secure the software supply chain and facilitate AI model distribution in cloud-native applications. We welcome software engineers and DevOps professionals to join our community and explore the new features of Harbor together. Let's work together to make Harbor even better!

在2023年至2024年,我们发布了Harbor 2.9和2.10版本,很快将发布2.11版本。在这些版本中,我们主要专注于添加或增强与安全和软件供应链相关的功能。这些功能包括Security Hub,它可以分析不同维度中的构件中的漏洞信息,以及SBOM生成功能,可以手动或自动创建SBOM。我们还通过实现并行处理改进了垃圾收集器的性能,并确保与最新的OCI规范保持一致。在未来的版本中,我们将继续探索使用SBOM来保护软件供应链并促进在云原生应用中分发AI模型的潜力。我们欢迎软件工程师和DevOps专业人士加入我们的社区,一起探索Harbor的新功能。让我们共同努力,使Harbor变得更好!
Speakers
SZ

Stone Zhang

Staff Engineer, Broadcom
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 6

11:00 HKT

A Story of Managing Kubernetes Watch Events End-to End Flow in Extremely Large Clusters | 在极大规模集群中管理Kubernetes watch事件端到端流程的故事 - Bo Tang, Ant Group
Thursday August 22, 2024 11:00 - 11:35 HKT
The K8s watch mechanism has not been given the attention it deserves for an extended period. However, it is critical to the K8s cluster in both stability and performance aspects, and watch latency is a perfect indicator of cluster health. This talk begins by introducing the measurement of watch event latency and then defines watch SLI and SLO metrics. Using the watch SLO as a guide, the talk will show the bottleneck identification process for watching. It will then describe the optimizations made to the apiserver, etcd, kubelet, controller-runtime, and clients such as controllers and schedulers in various aspects of watching, including watch latency, pod provisioning time, bandwidth, CPU/memory, etc. With these optimizations, daily P99 watch latency has improved by over 90% in large clusters (~20K nodes), impacting billions of watch events. Pod provisioning time has improved by over 60%. Apiserver bandwidth has decreased by 50%. The overall stability of the K8s cluster has improved greatly.

K8s观察机制长期以来并未得到应有的重视。然而,它对于K8s集群的稳定性和性能至关重要,观察延迟是集群健康的完美指标。 本次演讲将首先介绍观察事件延迟的测量,然后定义观察SLI和SLO指标。通过观察SLO作为指导,演讲将展示观察瓶颈识别过程。演讲将描述在观察方面对apiserver、etcd、kubelet、controller-runtime和客户端(如控制器和调度器)进行的各种优化,包括观察延迟、Pod提供时间、带宽、CPU/内存等方面。 通过这些优化,大型集群(~20K节点)中每日P99观察延迟已经提高了超过90%,影响了数十亿次观察事件。Pod提供时间已经提高了超过60%。Apiserver带宽减少了50%。K8s集群的整体稳定性得到了极大的改善。
Speakers
avatar for Bo Tang

Bo Tang

Senior Engineer, Ant Group
Bo Tang is a senior engineer in Ant Group. He is currently working on scalability and performance optimization of Kubernetes clusters.
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 2

11:00 HKT

OpenYurt & Dragonfly: Enhancing Efficient Distribution of LLMs in Cloud-Edge Collaborative Scenarios | OpenYurt和Dragonfly:增强云边协作场景中LLM的高效分发 - Linbo He, alibaba cloud & Jim Ma, Ant Group
Thursday August 22, 2024 11:00 - 11:35 HKT
As LLMs continue to grow in size, their deployment and delivery in cloud-edge environments are faced with substantial challenges, especially within edge computing settings that encompass multiple sites with thousands of edge nodes. In this presentation, we will explore how to efficiently distribute LLM applications across dispersed edge nodes using OpenYurt. We will also delve into how Dragonfly’s P2P image distribution technology can address the issue of public network bandwidth consumption encountered during cross-site transmission, reducing public network traffic consumption by up to 90% compared to conventional LLM distribution, and achieving rapid and efficient sharing of LLMs in physically isolated environments. During this presentation, container service experts from Alibaba Cloud and Ant Group will share this solution and introduce the practical application of combining OpenYurt with Dragonfly in edge computing scenarios for LLMs.

随着LLM的规模不断增长,它们在云边缘环境中的部署和交付面临着重大挑战,特别是在涵盖数千个边缘节点的边缘计算环境中。在本次演讲中,我们将探讨如何使用OpenYurt在分散的边缘节点上高效分发LLM应用程序。我们还将深入探讨Dragonfly的P2P图像分发技术如何解决跨站点传输中遇到的公共网络带宽消耗问题,与传统的LLM分发相比,将公共网络流量消耗降低高达90%,实现在物理隔离环境中LLM的快速高效共享。 在本次演示中,来自阿里巴巴云和蚂蚁集团的容器服务专家将分享这一解决方案,并介绍在LLM的边缘计算场景中将OpenYurt与Dragonfly结合应用的实际应用。
Speakers
avatar for Jim Ma

Jim Ma

Senior Engineer, Ant Group
Kubernetes enthusiast at Ant Group, diving deep into Kubernetes CSI storage, OCI image distribution and maintaining CNCF Dragonfly.
avatar for Linbo He

Linbo He

senior software engineer, alibaba cloud
I am a member of the Alibaba Cloud Container Service team and one of the founding contributors to the OpenYurt project. Since 2015, I have been actively engaged in the design, development, and open-source initiatives related to Kubernetes. I have taken on responsibilities in a variety... Read More →
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

11:00 HKT

Revolutionizing Service Mesh with Kernel-Native Sidecarless Architecture | 用内核原生无边车架构彻底改变服务网格 - ChangYe Wu, Huawei Technologies Co., Ltd.
Thursday August 22, 2024 11:00 - 11:35 HKT
Service mesh technology has revolutionized service governance among microservices, but as clusters expand, challenges arise. Proxy programs can strain resources, with memory consumption reaching GB levels and CPU overhead peaking at 30%. Furthermore, this expansion often leads to noticeable delays in microservice access. This session addresses these challenges head-on by exploring innovative solutions within the kernel-native sidecarless service mesh framework, covering: Efficient resource management: novel strategies to minimize memory consumption and CPU overhead in proxy programs, ensuring optimal resource utilization. Latency optimization: techniques to reduce microservice access latency without compromising service governance effectiveness. Real-world implementations: case studies and examples showcasing successful deployments of kernel-native sidecarless service mesh in diverse environments.

服务网格技术已经在微服务之间的服务治理方面发生了革命,但随着集群的扩大,也带来了挑战。代理程序可能会消耗资源,内存消耗可能达到GB级别,CPU开销可能达到30%。此外,这种扩展通常会导致微服务访问出现明显的延迟。 本次征集旨在通过探索内核本地无边车服务网格框架中的创新解决方案来直面这些挑战。 我们邀请提交以下内容的提案: 高效资源管理:采用新颖策略来最小化代理程序的内存消耗和CPU开销,确保资源的最佳利用。 延迟优化:通过技术手段减少微服务访问延迟,同时不影响服务治理的有效性。 实际应用:展示在不同环境中成功部署内核本地无边车服务网格的案例研究或示例。
Speakers
avatar for ChangYe Wu

ChangYe Wu

Senior software engineer, Huawei Technologies Co., Ltd.
10+ years of OS and network experience, with extensive interest in the kernel protocol stack, cloud native, service mesh, and eBPF technologies.
Thursday August 22, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 5
  Open Source Summit Sessions, Operating Systems

11:50 HKT

How Volcano Enable Next Wave of Intelligent Applications | 如何让 Volcano 激活下一波智能应用 - William Wang, Huawei Cloud Technologies Co., LTD
Thursday August 22, 2024 11:50 - 12:25 HKT
According to Gartner's prediction, 30% of new applications will use AI technology by 2026. However, the popularization of AI applications also faces challenges. This talk will introduce the challenges and solutions, and show how to leverage Volcano to enable intelligent applications. Volcano is a cloud native batch platform and the CNCF's first container batch computing project. It is optimized for AI and big data by providing the following capabilities: - Full lifecycle management for jobs - Scheduling policies for batch workloads - Support for heterogeneous hardware - Performance optimization for high performance workloads This year Volcano contributors have made great progress in helping users address the challenges of intelligent applications. A number of new features are on the way to accelerate GPU/Ascend NPU training efficiency, optimize resource utilization for large-scale clusters, and provide fine-grained scheduling.

根据Gartner的预测,到2026年将有30%的新应用程序将使用人工智能技术。然而,人工智能应用的普及也面临挑战。 本次讲座将介绍这些挑战、解决方案,并展示如何利用Volcano实现智能应用。 Volcano是一个云原生批处理平台,也是CNCF的第一个容器批处理计算项目。它通过提供以下功能来优化人工智能和大数据: - 作业的全生命周期管理 - 批处理工作负载的调度策略 - 支持异构硬件 - 高性能工作负载的性能优化 今年,Volcano的贡献者取得了巨大进展,帮助用户解决智能应用的挑战。许多新功能正在开发中,以加速GPU/Ascend NPU训练效率,优化大规模集群的资源利用率,并提供细粒度调度。
Speakers
avatar for William Wang

William Wang

Architect, Huawei Cloud Technologies Co., LTD
William (LeiBo) Wang is an architect at Huawei Cloud, responsible for planning and implementing the cloud native scheduling system on HUAWEI CLOUD. He is also the tech lead of the CNCF Volcano project, focusing on large-scale cluster resource management, batch scheduling, BigData... Read More →
Thursday August 22, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 6

11:50 HKT

Beyond the Basics: Towards Making Thanos Production-Ready | 超越基础:朝着使Thanos达到生产就绪状态的方向前进 - Benjamin Huo & Junhao Zhang, QingCloud Technologies
Thursday August 22, 2024 11:50 - 12:25 HKT
As one of the most popular and powerful Prometheus long-term storage projects, Thanos is widely adopted by the community. But to use Thanos in production, there are still a lot of day-2 operations that need to be automated. In this talk, KubeSphere maintainers will share their experiences in using and maintaining Thanos in production including: - Kubernetes native definition of all Thanos components - Tenant isolation of ingestion, rule evaluation, compaction - Tenant-based autoscaling mechanism of Thanos Ingester, Ruler, and Compactor - The time-based partition of Thanos store - Tenant-based data lifetime management - The sharding mechanism of the global ruler to handle massive recording rules and alerting rules evaluation workload - The gateway & agent proxy mechanism for read/write with tenant access control - The basic_auth, built-in query UI, and external remote write and query support of the gateway - The tls support between Thanos components - The 3-tier config management

作为最受欢迎和强大的Prometheus长期存储项目之一,Thanos被社区广泛采用。但要在生产环境中使用Thanos,仍然需要自动化许多第二天的运维工作。在这次演讲中,KubeSphere的维护者将分享他们在生产环境中使用和维护Thanos的经验,包括: - 所有Thanos组件的Kubernetes本地定义 - 数据摄入、规则评估、压缩的租户隔离 - 基于租户的Thanos Ingester、Ruler和Compactor的自动扩展机制 - Thanos存储的基于时间的分区 - 基于租户的数据生命周期管理 - 全局规则分片机制,用于处理大量录制规则和警报规则评估工作负载 - 用于读写的网关和代理机制,带有租户访问控制 - 网关的basic_auth、内置查询UI以及外部远程写入和查询支持 - Thanos组件之间的tls支持 - 三层配置管理
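The time-based partitioning of Thanos Store listed above is typically configured with the `--min-time`/`--max-time` flags; a hedged sketch of two store shards (the 8-week split point and container names are arbitrary illustrations):

```yaml
containers:
  - name: store-recent         # serves only blocks newer than 8 weeks
    image: quay.io/thanos/thanos:v0.35.0
    args: ["store", "--min-time=-8w"]
  - name: store-historical     # serves only blocks older than 8 weeks
    image: quay.io/thanos/thanos:v0.35.0
    args: ["store", "--max-time=-8w"]
```

Both flags accept either RFC3339 timestamps or durations relative to now, which is what makes a rolling recent/historical split possible.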
Speakers
avatar for Benjamin Huo

Benjamin Huo

Manager of the Architect and Observability Team, QingCloud Technologies
Benjamin Huo leads QingCloud Technologies' Architect team and Observability Team. He is the founding member of KubeSphere and the co-author of Fluent Operator, Kube-Events, Notification Manager, OpenFunction, and most recently eBPFConductor. He loves cloud-native technologies especially... Read More →
avatar for Junhao Zhang

Junhao Zhang

Senior Software Engineer, QingCloud Technologies
Junhao Zhang, Senior Development Engineer at QingCloud Technologies, is responsible for the research and development of container platform monitoring, alerting, and other cloud-native services. With many years of industry experience, he has previously held positions at companies such... Read More →
Thursday August 22, 2024 11:50 - 12:25 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

11:50 HKT

Security Threat Model Analysis and Protection Practice in Edge Computing Scenarios | 边缘计算场景中的安全威胁模型分析和保护实践 - Yue Bao, Huawei & Huan Wei, HarmonyCloud
Thursday August 22, 2024 11:50 - 12:25 HKT
Cloud native is rapidly developing towards multi-cloud, hybrid cloud, and edge computing, which are becoming key trends in cloud native development. However, in edge computing scenarios, the traditional VPC-based security model struggles to ensure safe production. The challenges are growing, including weak edge security mechanisms, vulnerable service interfaces exposed to the outside network, vulnerable end-device access protocols, and supply chain security risks. In 2023, KubeEdge completed its security audit. This talk will present the work around the audit, including the threat model, fuzzing efforts, and tips on how to get started with contributing to KubeEdge's continued security. Since the completion of the audit, KubeEdge has worked on several initiatives to improve the security of its consumers, and the talk will cover these. One of these initiatives was SLSA L3 compliance, and the presentation will show what has been done and how it helps the community.

云原生正迅速发展为多云、混合云和边缘计算,这些正在成为云原生开发的关键趋势。然而,在边缘计算场景中,传统的基于VPC的安全模型很难确保安全生产。面临的挑战越来越多,包括边缘安全机制薄弱、暴露于外部网络的易受攻击的服务接口、易受攻击的终端设备访问协议以及供应链安全风险。 2023年,KubeEdge完成了安全审计。本次演讲将介绍围绕审计的工作,包括威胁模型、模糊测试工作以及如何开始为KubeEdge持续安全做出贡献的提示。 自完成审计以来,KubeEdge已经开展了多项改进其消费者安全性的倡议,本次演讲将涵盖这些内容。其中一个倡议是SLSA L3合规性,演示将展示已经完成的工作以及它如何帮助社区。
Speakers
avatar for Huan Wei

Huan Wei

Chief Architect, HarmonyCloud
Chief architect of HarmonyCloud. He designs and implements private cloud construction for many large enterprise customers. Huan has 10+ years of experience in software design and development across a variety of industries and technology bases, including cloud computing, micro service... Read More →
avatar for Yue Bao

Yue Bao

Senior Software Engineer, Huawei Cloud Computing Technology Co., Ltd.
Yue Bao serves as a software engineer at Huawei Cloud. She now works 100% on open source and is a member of the KubeEdge maintainers, focusing on lightweight edge and the edge api-server for KubeEdge. Before that, Yue worked on the Huawei Cloud Intelligent EdgeFabric service and participated... Read More →
Thursday August 22, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 5
  Open Source Summit Sessions, Supply Chain Security

13:50 HKT

Gateway API and Beyond: Introducing Envoy Gateway's Gateway API Extensions | 网关API及更多:介绍Envoy网关的网关API扩展 - Huabing Zhao, Tetrate
Thursday August 22, 2024 13:50 - 14:25 HKT
Envoy Gateway, a new member of the Envoy project family, efficiently manages Envoy-based application gateways. Strictly adhering to the Kubernetes Gateway API, it extends its functionality through custom resource definitions (CRDs) in areas the Gateway API hasn't yet covered. This presentation will delve into Envoy Gateway's Gateway API extensions, specifically focusing on ClientTrafficPolicy, BackendTrafficPolicy, and SecurityPolicy. We'll explore their practical applications in managing and securing edge traffic for cloud-native applications. Additionally, we'll discuss a strategic approach for potentially integrating these extensions into the formal Gateway API specification.

Envoy Gateway是Envoy项目家族的新成员,有效地管理基于Envoy的应用网关。严格遵循Kubernetes Gateway API,通过利用自定义资源定义(CRDs),在Gateway API尚未涉足的领域增强其功能。本次演示将深入探讨Envoy Gateway的Gateway API扩展,特别关注ClientTrafficPolicy、BackendTrafficPolicy和SecurityPolicy。我们将探讨它们在管理和保护云原生应用程序边缘流量方面的实际应用。此外,我们将讨论一个战略方法,可能将这些扩展集成到正式的Gateway API规范中。
Speakers
Huabing Zhao

Engineer, Tetrate
Huabing Zhao is a software engineer at Tetrate and a CNCF ambassador. He has developed a managed service mesh product on the cloud and assisted many users in deploying the Istio service mesh in production. He also founded Aeraki Mesh, a CNCF sandbox project that facilitates non-HTTP... Read More →
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 6

13:50 HKT

Implement Auto Instrumentation Under GraalVM Static Compilation on OTel Java Agent | GraalVM 静态编译下 OTel Java Agent 的自动增强方案与实现 - Zihao Rao & Ziyi Lin, Alibaba Cloud
Thursday August 22, 2024 13:50 - 14:25 HKT
GraalVM static compilation significantly improves Java application startup speed and runtime memory usage, which is very valuable for Java to flourish in the cloud-native ecosystem. However, the automatic instrumentation originally provided via the Java Agent becomes invalid after static compilation. We designed a static instrumentation solution in GraalVM to solve this problem. This talk will introduce the overall design of the solution and related test results in the OTel Java Agent.

GraalVM静态编译对于提升Java应用的启动速度和运行时内存占用有着显著的效果,对于Java在云生态中的蓬勃发展有着十分宝贵的价值。然而,原本基于Java Agent提供的自动插桩功能在静态编译之后将会失效。针对上述问题我们在GraalVM中设计了静态插桩方案,本演讲将介绍该方案的整体设计思路以及在OTel Java Agent中的相关测试结果。
Speakers
Zihao Rao

Software Engineer, Alibaba Cloud
Zihao is a software engineer at Alibaba Cloud. Over the past few years, he has participated in several well-known open source projects; he is a steering committee member of the Spring Cloud Alibaba project and is now a triager for OpenTelemetry Java Instrumentation.
Ziyi Lin

Senior Software Engineer, Alibaba Cloud
Author of the book "Static compilation for Java in GraalVM: the principles and practice". ACM SIGSOFT Distinguished Paper Award winner (ICSE'23). Committer of the Apache incubating Teaclave Java TEE SDK (https://github.com/apache/incubator-teaclave-java-tee-sdk). Active contributor to GraalVM (https://github.com/pulls?q=is%3Apr+org%3Aoracle+author%3Aziyilin... Read More →
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

13:50 HKT

Unified Management, Continuity, Compliance in Multi-Clouds with Service Mesh | 在多云环境中通过服务网格实现统一管理、连续性和合规性 - Kebe Liu, DaoCloud
Thursday August 22, 2024 13:50 - 14:25 HKT
In multi-cloud and hybrid cloud architectures, enterprises face challenges such as inter-cloud communication, traffic management, application orchestration, data security, and compliance. Service mesh technology offers a unified approach to managing service interactions, enhancing security, and ensuring data compliance. Istio, a leading service mesh project, is particularly effective in multi-cloud and hybrid cloud environments. It provides seamless network connectivity across various architectures, ensuring reliable and secure communication. Additionally, integrating Istio with Karmada enables efficient application scheduling across these complex environments. Karmada allows smooth orchestration of workloads across different cloud platforms, enhancing the flexibility and scalability of cloud-native applications. I aim to share practical insights and experiences, especially from China, to inspire and provide strategic perspectives for navigating this technological landscape.

在多云和混合云架构中,企业面临诸如云间通信、流量管理、应用编排、数据安全和合规性等挑战。服务网格技术提供了统一的管理服务交互方式,增强安全性,并确保数据合规性。 作为领先的服务网格项目,Istio在多云和混合云环境中特别有效。它提供了跨不同架构的无缝网络连接,确保可靠和安全的通信。此外,将Istio与Karmada集成,可以实现在这些复杂环境中高效的应用调度。Karmada允许在不同云平台上平稳地编排工作负载,增强云原生应用的灵活性和可扩展性。 我旨在分享实用的见解和经验,特别是来自中国,以激发并提供在这些技术领域中导航的战略视角。
Speakers
Kebe Liu

Senior software engineer, DaoCloud
Member of the Istio Steering Committee, focused in recent years on cloud native, Istio, eBPF, and other areas. Founder of the Merbridge project.
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Connectivity

13:50 HKT

OS Migration Solution on Cloud | 云上操作系统迁移解决方案 - Jianlin Lv, eBay
Thursday August 22, 2024 13:50 - 14:25 HKT
Each Linux distribution has a lifecycle; it ends when the OS developers stop providing updates or any form of support. Continuing to use an EOL Linux poses risks such as security vulnerabilities, compatibility issues, and lack of official support. Cloud providers face the challenge of quickly and safely migrating the OS to a supported distribution. The migration process involves several challenges: 1. Ensuring the safety of application data, which is especially significant during OS migrations between different Linux distributions; 2. Customizing the OS based on the Linux distribution, including changes to the kernel, deb packages, specific configurations, and tools; 3. Quickly rolling out the new OS to the production environment, with the goal of transitioning over 100,000 physical nodes each month without affecting customer operations while minimizing node downtime. This talk will detail the issues encountered in OS migration and the proposed solutions.

每个Linux发行版都有一个生命周期;这指的是当操作系统开发者停止提供更新或任何形式的支持时。继续使用EOL Linux会带来风险,如安全漏洞、兼容性问题和缺乏官方支持。 云服务提供商面临着快速且安全地将操作系统迁移到受支持的发行版的挑战。 在迁移操作系统的过程中涉及到几个挑战: 1. 确保应用数据的安全性,在不同Linux发行版之间迁移操作系统时尤为重要; 2. 根据Linux发行版定制操作系统,包括对内核、deb软件包、特定配置和工具的更改; 3. 如何快速将新操作系统推出到生产环境。实现每月迁移超过10万个物理节点的目标,同时不影响客户运营并最小化节点停机时间。 本次演讲将详细介绍操作系统迁移中遇到的问题和提出的解决方案。
Speakers
Jianlin Lv

Senior Linux Kernel Development Engineer, eBay
Jianlin Lv currently works at eBay CCOE as a Senior Kernel Engineer, responsible for the maintenance and release of eBay TessOS. He has long been involved in the development and maintenance of open-source software and operating systems and has contributed code to multiple open-source... Read More →
Thursday August 22, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 5
  Open Source Summit Sessions, Operating Systems

14:40 HKT

Panel: Fragmentation of the Scheduling in Kubernetes and Challenges for AI/ML Workloads | 圆桌:Kubernetes调度社区碎片化现状及如何应对AI/ML工作负载带来的挑战 - Kante Yin, DaoCloud; Li Tao, Independent; William Wang, Huawei Cloud Technologies Co., LTD; Qiuping Dai, DaoCloud; Yuquan Ren, ByteDance
Thursday August 22, 2024 14:40 - 15:15 HKT
The scheduler is one of the most frequently customized components in Kubernetes, owing to its extensibility. However, too many schedulers lead to decision paralysis among users, which has been discussed extensively at past KubeCons. To help mitigate this confusion, four maintainers from various communities (Godel-Scheduler, Koordinator, Kubernetes SIG-Scheduling, and Volcano) are invited to profile the background and use cases behind these projects. The panel will also discuss the gap between upstream Kubernetes and downstream projects and try to abstract the common patterns or functionalities that could be pushed upstream to avoid reinventing the wheel, as well as what should remain loosely defined to preserve extensibility. Moreover, with the rise of AI, scheduling AI workloads in Kubernetes poses a significant challenge; the panel will discuss where we are right now and where we are headed, as well as opportunities for cooperation.

调度器是Kubernetes中最经常定制的组件之一,这归功于其可扩展性。然而,过多的调度器会导致用户决策瘫痪,这在过去的KubeCon中已经被广泛讨论过。为了帮助减轻用户的困惑,我们邀请了来自各个社区(Godel-Scheduler、Koordinator、Kubernetes SIG-Scheduling和Volcano)的四位维护者来介绍这些项目背后的背景和用例。 此外,本小组讨论将探讨上游Kubernetes和下游项目之间的差距,并尝试提炼出可以推送到上游的常见模式或功能,以避免重新实现轮子,以及什么应该保持松散定义以保留可扩展性。 此外,随着人工智能的兴起,在Kubernetes中调度AI工作负载面临着重大挑战,本小组讨论将探讨我们目前的状况以及我们未来的发展方向,以及合作的机会。
Speakers
Yuquan Ren

Cloud Native Architect, ByteDance
Yuquan Ren has 10+ years of working experience in the cloud-native field, contributing extensively to open-source projects such as Kubernetes. Currently, he is a tech leader at ByteDance, primarily focusing on the field of orchestration and scheduling.
Kante Yin

Senior Software Engineer, DaoCloud
Kante is a senior software engineer and an open source enthusiast. He's currently working at the Kubernetes platform team at DaoCloud based in Shanghai, mostly around scheduling, resource management and inference. He also works on upstream Kubernetes as SIG-Scheduling Maintainer and... Read More →
Tao Li

Koordinator Co-founder & Maintainer, Independent
Tao Li is a seasoned Senior Software Engineer with a specialization in K8s scheduling. With extensive practical experience in large-scale K8s cluster scheduling technology, Tao has been deeply involved in the research and development of K8s scheduling systems both within Alibaba... Read More →
Qiuping Dai (秋萍 戴)

Product Manager, DaoCloud
QiuPing Dai has been a senior Technology Product Manager at DaoCloud for 5 years, involved in cloud computing (including Kubernetes computing, storage, and network) development work. Before that, QiuPing worked on cloud computing at IBM. QiuPing is interested in storage, network, scheduling... Read More →
Thursday August 22, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Emerging + Advanced

14:40 HKT

WebAssembly on the Server | 服务端的WebAssembly - Vivian Hu, Second State
Thursday August 22, 2024 14:40 - 15:15 HKT
As the CNCF Annual Survey 2022 key findings described, "Containers are the new normal, and WebAssembly (Wasm) is the future." Wasm is playing an important role in the cloud native area. Before Wasm, Linux containers were commonly used to run compiled applications in the cloud; for example, a Rust or C++ app is compiled to x86_64 machine code and runs inside a Linux container. Wasm provides a more secure, much lighter, faster, and more portable alternative to Linux containers for this type of performance-minded server-side application. Currently, CNCF hosts three Wasm-focused projects: WasmEdge, wasmCloud, and runwasi. This talk will discuss WebAssembly on the server side. You will learn about the integration between Wasm and existing container tools, and about use cases of WebAssembly on the server side. Going forward, we will also discuss the role of Wasm in LLM applications.

根据CNCF年度调查2022的关键发现,“容器是新常态,WebAssembly(Wasm)是未来。” Wasm在云原生领域发挥着重要作用。 在Wasm出现之前,Linux容器通常用于在云中运行这些编译应用程序 - 例如,Rust或C++应用程序被编译为x86_64机器代码,并在Linux容器内运行。相比于Linux容器,Wasm为这类性能导向的服务器端应用程序提供了更安全、更轻量、更快速和更可移植的替代方案。 目前,CNCF托管了三个以Wasm为重点的项目,如WasmEdge、WasmCould和runwasi。本次演讲将讨论服务器端的WebAssembly。您将了解Wasm与现有容器工具的集成,以及服务器端WebAssembly的用例。此外,我们还将讨论Wasm在LLM应用程序中的作用。
Speakers
Vivian Hu

Product Manager, Second State
Vivian Hu is a Product Manager at Second State and a columnist at InfoQ. She is a founding member of the WasmEdge project. She organizes Rust and WebAssembly community events in Asia.
Thursday August 22, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 6
  KubeCon + CloudNativeCon Sessions, Cloud Native Novice

15:35 HKT

Empower Large Language Models (LLMs) Serving in Production with Cloud Native AI Technologies | 利用云原生人工智能技术在生产环境中赋能大型语言模型(LLMs) - Lize Cai, SAP & Yang Che, Alibaba Cloud Intelligence
Thursday August 22, 2024 15:35 - 16:10 HKT
LLMs have heightened public expectations of generative models. However, as noted in the Gartner report, running AI applications in production poses significant challenges. To tackle them, we have redesigned and optimized the software capabilities of cloud native AI technologies. By extending KServe to handle OpenAI's streaming requests, it can accommodate the inference load of LLMs. With Fluid and Vineyard, Llama-30B model loading time drops from 10 minutes to under 25 seconds. The optimizations do not stop there: since LLM loading is not a high-frequency operation, it is crucial to use cronHPA for timed auto-scaling in order to balance cost and performance, and to evaluate the cost-effectiveness of the scaling process. As reviewers and maintainers of KServe and Fluid, we share our insights on these challenges in the session. We will showcase effective use of cloud native AI and share our experiences in production.

LLM让公众对生成式大模型的期望提高。然而,正如Gartner报告所指出的,将AI应用程序投入生产中存在重大挑战。为了解决这些挑战,我们重新设计和优化了云原生AI技术的软件能力。通过扩展KServe以处理OpenAI的流式请求,它可以容纳LLM的推理负载。通过Fluid和Vineyard,我们成功将Llama-30B模型的加载时间从10分钟缩短到不到25秒。然而,上述优化并不止于此。由于LLM加载不是高频操作,利用cronHPA进行定时自动扩展至关重要,以实现成本和性能之间的平衡,并评估扩展过程的成本效益。作为KServe和Fluid的审阅者和维护者,我们在本场演讲中分享了对挑战的见解。我们将展示云原生AI的有效使用,并分享我们在生产中的经验。
Speakers
Yang Che

senior engineer, Alibaba Cloud Intelligence
Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor... Read More →
Lize Cai

Senior Software Engineer, SAP
Lize is a senior software engineer at SAP, based in Singapore. With a strong product mindset, Lize has extensive experience in building enterprise-grade machine learning platforms. A passionate advocate for open source technology, Lize actively contributes to various projects, including... Read More →
Thursday August 22, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 3

15:35 HKT

Kubernetes Community Panel: A Decade of Evolution and Future Trends | Kubernetes维护者圆桌:十年演变与未来趋势 - Paco Xu & Mengjiao Liu, DaoCloud; Qiming Teng, Freelance; Klaus Ma, Nvidia; Pengfei Ni, Microsoft
Thursday August 22, 2024 15:35 - 16:10 HKT
Join us in celebrating the 10th anniversary of Kubernetes with a panel featuring some of the community's most influential contributors and maintainers from China. Over the past decade, Kubernetes has grown into the cornerstone of cloud-native infrastructure, thanks to the dedication and innovation of its community members. In this panel, we will talk about our journeys with Kubernetes, share stories and experience, and discuss the future of Kubernetes in the next decade. Our panelists include current and former owners, tech leads, and maintainers. Feel free to join the panel to share your perspectives on the past and next decade of the Kubernetes community and ask anything about the community.

加入我们,与中国社区最具影响力的贡献者和维护者一起庆祝Kubernetes的十周年。在过去的十年里,由于社区成员的奉献和创新,Kubernetes已经发展成为云原生基础设施的基石。在这个专题讨论中,我们将谈论与Kubernetes的旅程,分享故事和经验,并讨论Kubernetes在未来十年的发展。我们的专题讨论嘉宾包括现任和前任所有者、技术负责人和维护者。欢迎加入专题讨论,分享您对Kubernetes社区过去和未来十年的看法,并提出任何关于社区的问题。
Speakers
Pengfei Ni

Principal Software Engineer, Microsoft
Pengfei Ni is a Principal Software Engineer at Microsoft Azure and a maintainer of the Kubernetes project. With extensive experience in Cloud Computing, Kubernetes, and Software Defined Networking (SDN), he has delivered presentations at various conferences, including KubeCon, ArchSummit... Read More →
徐俊杰 Paco

Open Source Team Lead, DaoCloud
Paco is co-chair of KubeCon+CloudNativeCon China 2024, and a member of Kubernetes Steering Committee. He is the leader of open-source team in DaoCloud. He is also a KCD Chengdu 2022 organizer, and a speaker in KubeCon EU 2023 & 2024, and KubeCon China 2021. Paco is a kubeadm maintainer... Read More →
Qiming Teng

Architect, Freelance
Qiming has been a passionate open source contributor for more than 10 years. He was an active contributor to the OpenInfra community and the CNCF community. His interest spans from operating systems, programming languages to cloud platforms. His current research fields include the... Read More →
Mengjiao Liu

Software Engineer, DaoCloud
Mengjiao Liu is a Software Engineer at DaoCloud. She contributes to Kubernetes and serves as the WG Structured Logging Lead and SIG Instrumentation Reviewer, focusing on enhancing logging quality. Additionally, she actively participates in SIG Docs as a Chinese owner and English reviewer... Read More →
Klaus Ma

Principal Software Engineer, Nvidia
Team leader, system architect, designer, and software developer with 10+ years of experience across a variety of industries and technology bases, including cloud computing, machine learning, big data, and financial services. Founder of Volcano & kube-batch, Kubernetes SIG-Scheduling co-Leader... Read More →
Thursday August 22, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 2

15:35 HKT

KubeSkoop: Deal with the Complexity of Network Issues and Monitoring with eBPF | KubeSkoop:使用eBPF处理网络问题和监控的复杂性 - Yutong Li, Alibaba Cloud & Bingshen Wang, AlibabaCloud
Thursday August 22, 2024 15:35 - 16:10 HKT
Troubleshooting network issues has always been one of the most difficult tasks, especially on Kubernetes. Containerization and microservices result in a denser network topology and more dependencies on the various layers of the network stack, and the new network technologies and architectures introduced by AI also pose a significant challenge to observability and diagnosis. We developed KubeSkoop, a network monitoring and diagnosis suite for Kubernetes. Using eBPF, it provides deep monitoring and tracing of the Kubernetes network to help users quickly locate network jitter problems in the cluster. It also provides a network connectivity check capability, which helps users solve network connectivity issues with one click. This talk will cover: ● What makes Kubernetes networking complex. ● An introduction to KubeSkoop. ● How we use eBPF to monitor container networking. ● KubeSkoop in practice in large-scale production environments.

网络问题的故障排除一直是最困难的部分之一,尤其是在Kubernetes上。容器化和微服务导致了更密集的网络拓扑结构,以及对各个网络堆栈模块的更多依赖,人工智能引入的新网络技术和架构也在可观察性和诊断方面提出了重大挑战。 我们开发了KubeSkoop,这是专为Kubernetes设计的网络监控和诊断套件。利用eBPF技术,它提供了对Kubernetes网络的深度监控和跟踪,帮助用户快速定位集群中发生的网络抖动问题。它还提供了网络连接性检查功能,可以帮助用户通过一键解决网络连接问题。 本主题将介绍以下内容: ● 什么使Kubernetes网络变得复杂。 ● KubeSkoop的介绍。 ● 我们如何使用eBPF来监控容器网络。 ● KubeSkoop在大规模生产环境中的实践。
Speakers
Bingshen Wang

Senior Engineer, AlibabaCloud
Bingshen Wang is a Senior Engineer at Alibaba Cloud, a maintainer of KubeSkoop/Terway/OpenYurt, and a contributor to Kubernetes/Containerd. He mainly focuses on container networking and runtimes, and has many years of experience managing Alibaba Cloud Kubernetes clusters. He... Read More →
Tony Li

Software Engineer, Alibaba Cloud
Yutong Li is a Software Engineer at Alibaba Cloud. He is working on designing and maintaining container network for Alibaba Cloud Container Service, and open source Kubernetes networking diagnose tool KubeSkoop.
Thursday August 22, 2024 15:35 - 16:10 HKT
Level 2 | Grand Ballroom 1-2
  KubeCon + CloudNativeCon Sessions, Observability

16:25 HKT

KubeVirt Community Update | KubeVirt社区更新 - Haolin Zhang, Arm
Thursday August 22, 2024 16:25 - 17:00 HKT
KubeVirt has been going through some growth spurts of late. As the project matures, so too must the community. We'll go through some of these changes, such as building out our SIG process and how it supports our contributor ladder, changes to our design proposal process, and how we track the lifecycle of a feature and accountability. As we move towards graduation, it's certainly an interesting time to be part of the community. We've also had three big releases since the last time we met, so we will walk through some of their key features. And for those of you who are still learning about what KubeVirt is, we'll run a demo to show some of the basic uses, and how running virtual machines natively alongside your containers is easier than you think.

KubeVirt最近经历了一些增长阶段。随着项目的成熟,社区也必须发展壮大。我们将讨论一些这些变化,比如最近我们改进了SIG流程以及这如何帮助我们的贡献者阶梯,设计提案流程的变化以及如何跟踪功能的生命周期和责任。随着我们向毕业迈进,成为社区的一部分绝对是一个有趣的时刻。 自上次见面以来,我们还发布了三个重要版本,因此我们将介绍其中一些关键功能。对于那些仍在了解KubeVirt的人,我们将进行演示,展示一些基本用途,以及如何在容器旁边本地运行虚拟机比你想象的更容易。
Speakers
Haolin Zhang

Senior Software Engineer, Arm
Haolin Zhang is a senior engineer at Arm. With expertise in cloud computing, he brings deep knowledge of virtualization, containers, and container orchestration. He actively contributes to open-source projects, particularly the KubeVirt project, where he focuses on enabling... Read More →
Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 6

16:25 HKT

A Decade of Cloud-Native Journey: The Evolution of Container Technology and the Kubernetes Ecosystem | 十年云原生之旅:容器技术和Kubernetes生态系统的演变 - Jintao Zhang, Kong Inc.
Thursday August 22, 2024 16:25 - 17:00 HKT
Over the past decade, cloud-native technologies have revolutionized software development, deployment, and operations. Container technology and the Kubernetes ecosystem, as transformation leaders, have enhanced development agility, and provided enterprises with unmatched scalability, flexibility, and efficiency. This talk navigates the evolution of these technologies, highlighting their impact on the cloud-native landscape. Starting my journey in 2014, I will share insights into the decade-long evolution of Kubernetes, its community, and technology stacks, alongside personal experiences. Attendees will learn about successes, challenges, and future trends, gaining knowledge to navigate their cloud-native transformations.

在过去的十年里,云原生技术已经彻底改变了软件开发、部署和运营。容器技术和Kubernetes生态系统作为变革的领导者,提升了开发的灵活性,并为企业提供了无与伦比的可扩展性、灵活性和效率。本次演讲将探讨这些技术的演变,突出它们对云原生领域的影响。 从2014年开始我的旅程,我将分享关于Kubernetes、其社区和技术堆栈十年演变的见解,以及个人经验。与会者将了解成功、挑战和未来趋势,获得知识来引领他们的云原生转型。
Speakers
Jintao Zhang

Sr. SE, Kong
Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes Ingress-NGINX maintainer. He specializes in cloud-native technology and the Azure technology stack, and works for Kong Inc.
Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Cloud Native Novice

16:25 HKT

Uniting Sustainability and Edge Computing: Kepler & Open Horizon on RISC-V and Heterogeneous System | 团结可持续性和边缘计算:Kepler和Open Horizon在RISC-V和异构系统上 - Peng Hui Jiang & David Yao, IBM
Thursday August 22, 2024 16:25 - 17:00 HKT
The dynamic landscape of cloud-edge computing demands solutions that mitigate energy consumption and promote sustainability. Our proposal advocates integrating Kepler and Open Horizon with the CNCF and LF Edge ecosystems to address diverse hardware requirements in cloud and edge deployments, including x86, arm, s390, and the emerging RISC-V architectures. Notably, the Chinese market, characterized by edge devices in the manufacturing, retail, and surveillance domains, stands to benefit significantly from this initiative. By using Kepler's sophisticated energy estimation capabilities and Open Horizon's autonomous workload management features, this proposal endeavors to optimize energy efficiency across heterogeneous edge environments. In the session, we will demonstrate a use case that builds and integrates Kepler and Open Horizon on the RISC-V platform, and monitors and optimizes a distributed, heterogeneous system to build a greener and more resilient cloud-edge computing paradigm.

云边计算的动态景观需要解决能源消耗问题并促进可持续发展。我们的提案主张将Kepler和Open Horizon与CNCF和LF Edge生态系统整合,以解决云和边缘部署中多样化的硬件需求,包括x86、arm、s390和新兴的RISC-V架构。值得注意的是,中国市场以制造、零售和监控领域的边缘设备为特征,这一举措将使其受益匪浅。通过利用Kepler的先进能源估算能力和Open Horizon的自主工作负载管理功能,本提案旨在优化异构边缘环境的能源效率。 在本场演讲中,我们将演示一个使用案例,展示如何构建和整合Kepler和Open Horizon在RISC-V平台上运行,并监控和优化分布式和异构系统,以构建更环保、更具弹性的云边计算范式。
Speakers
Peng Hui Jiang

Architect, IBM
Peng Hui Jiang works for IBM as a Senior Software Engineer building and operating public cloud services. He has rich experience in cloud, database, and security. He is a CNCF Kepler maintainer, an Apache CouchDB committer, and an IBM Master Inventor holding more than 200 patents or... Read More →
David Yao (勇 姚)

Program Director, IBM Cloud Platform, IBM
David Yao is the Program Director of IBM Cloud Platform at the IBM China Development Lab, developing and managing the entire product development lifecycle and team for the dynamic cloud and edge environment. Passionate about learning open technology, building and transforming an open and... Read More →
Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Observability

17:15 HKT

Understanding the Buzz Around Cilium: Introduction and in Production at Alibaba | 深入了解Cilium背后的热潮:在阿里巴巴的介绍和生产中 - Bo Kang Li, AlibabaCloud & Liyi Huang, Cisco
Thursday August 22, 2024 17:15 - 17:50 HKT
Cilium is CNCF's most widely adopted CNI and the default choice for all major cloud providers. You'll hear how the project grew from a simple networking CNI to cover observability and security too. This talk dives into the bytecode behind all of the buzz around the project. It will cover: ● An introduction to Cilium and how it works ● A deep dive into network policy, kube-proxy replacement, and the bandwidth manager ● Why and how Alibaba Cloud integrated Cilium as their CNI ● The observability and security capabilities Hubble and Tetragon bring to the project ● Where Cilium is heading next. By weaving together theoretical knowledge and hands-on experience running Cilium in production, the audience will walk away with a strong understanding of what Cilium provides for networking, observability, and security.

Cilium是CNCF最广泛采用的CNI,是所有主要云提供商的默认选择。您将了解到该项目是如何从一个简单的网络CNI发展到覆盖可观察性和安全性的。本次演讲将深入探讨该项目背后的所有字节码。内容包括: Cilium的介绍及其工作原理 网络策略、kube代理替换和带宽管理器深入挖掘 阿里云为何以及如何集成Cilium作为他们的CNI Hubble和Tetragon为该项目带来的可观察性和安全性能力 Cilium未来的发展方向 通过将理论知识和实际运行Cilium的经验结合起来,观众将对Cilium在网络、可观察性和安全性方面提供的功能有深入的了解。
Speakers
BoKang Li

Senior Engineer, AlibabaCloud
BoKang Li is a senior engineer on the Container Service for Kubernetes team at Alibaba Cloud. He has a primary focus on container networks and connectivity solutions, and has gained extensive experience in container networking.
Liyi Huang

solution architect, Isovalent at Cisco
Solution architect at Isovalent, part of Cisco.
Thursday August 22, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 6
 
Friday, August 23
 

09:05 HKT

Keynote: Deploying LLM Workloads on Kubernetes by WasmEdge and Kuasar | 主论坛演讲: 使用WasmEdge和Kuasar在Kubernetes上部署LLM工作负载 - Tianyang Zhang, Huawei Cloud & Xiaowei Hu, Second State
Friday August 23, 2024 09:05 - 09:20 HKT
LLMs are powerful artificial intelligence models capable of comprehending and generating natural language. However, conventional methods for running LLMs pose significant challenges, including complex package installations, GPU device compatibility concerns, inflexible scaling, limited resource monitoring and statistics, and security vulnerabilities on native platforms. WasmEdge introduces a solution enabling the development of swift, agile, resource-efficient, and secure LLM applications. Kuasar enables running applications on Kubernetes with faster container startup and reduced management overhead. This session will demonstrate running Llama3-8B on a Kubernetes cluster using WasmEdge and Kuasar as container runtimes. Attendees will explore how Kubernetes enhances efficiency, scalability, and stability in LLM deployment and operations.

LLM是强大的人工智能模型,能够理解和生成自然语言。然而,传统的运行LLM的方法存在重大挑战,包括复杂的软件包安装、GPU设备兼容性问题、不灵活的扩展性、有限的资源监控和统计,以及在本地平台上的安全漏洞。 WasmEdge提出了一种解决方案,可以开发快速、灵活、资源高效和安全的LLM应用程序。Kuasar使应用程序能够在Kubernetes上运行,具有更快的容器启动速度和减少的管理开销。本场演讲将演示如何使用WasmEdge和Kuasar作为容器运行时,在Kubernetes集群上运行Llama3-8B。与会者将探索Kubernetes如何提高LLM部署和运营的效率、可扩展性和稳定性。
Speakers
Vivian Hu

Product Manager, Second State
Vivian Hu is a Product Manager at Second State and a columnist at InfoQ. She is a founding member of the WasmEdge project. She organizes Rust and WebAssembly community events in Asia.
Tianyang Zhang

Software Engineer, Huawei Cloud
He works on container runtimes at Huawei Cloud. He is a maintainer of Kuasar and a reviewer of the Containerd rust-extension repository.
Friday August 23, 2024 09:05 - 09:20 HKT
Level 2 | Grand Ballroom 1-2
  Keynote Sessions | 主论坛演讲, AI + ML

10:35 HKT

Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano | 通过Volcano增强的智能基础设施优化LLM工作流程 - Xin Li, qihoo360 & William Wang, Huawei Cloud Technologies Co., LTD
Friday August 23, 2024 10:35 - 11:10 HKT
As Large Language Models (LLMs) revolutionize various aspects of our lives, many companies build cloud native AI platforms to train and fine-tune LLMs. However, managing large-scale LLM training and inference platforms presents even more critical challenges, such as training efficiency, fault tolerance, resource fragmentation, operational costs, and topology-aware scheduling on racks and supernodes. In this session, the speaker will share insights from their experience using a Kubernetes-based smart infrastructure, enhanced by Volcano, to manage thousands of GPUs and handle monthly workloads involving thousands of LLM training and inference jobs at qihoo360. This talk will cover: ● Fault detection, fast job recovery, and self-healing that drastically improve efficiency. ● Dealing with long downtime in LLM training on heterogeneous GPUs. ● Intelligent GPU workload scheduling to reduce resource fragmentation and costs. ● Topology-aware scheduling on racks/supernodes to accelerate LLM training.

随着大型语言模型(LLMs)革新我们生活的各个方面,许多公司构建他们的云原生人工智能平台来训练和微调LLM。然而,管理大规模LLM训练和推理平台面临更为关键的挑战,如训练效率、容错性、资源碎片化、运营成本和机架和超级节点上的拓扑感知调度。在这场演讲上,演讲者将分享他们在使用基于Kubernetes的智能基础设施(由Volcano增强)管理数千个GPU并处理qihoo360中涉及数千个LLM训练和推理作业的月度工作负载的经验。本次演讲将涵盖:故障检测、快速作业恢复和自愈大幅提高效率。处理异构GPU上LLM训练的长时间停机。智能GPU工作负载调度以减少资源碎片化和成本。机架/超级节点上的拓扑感知调度以加速LLM训练。
Speakers
Xin Li

Senior Engineer of Server Development, qihoo360
Xin Li is a seasoned senior back-end developer and an approver for the Volcano project, with a keen focus on Kubernetes and AI. The infrastructure he is responsible for supports the training and inference of 360GPT. Moreover, Xin delves deeply into optimizing distributed... Read More →
Friday August 23, 2024 10:35 - 11:10 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

11:25 HKT

Next Steps for the Ingress-NGINX Project | Ingress-NGINX项目的下一步计划 - Jintao Zhang, Kong Inc.
Friday August 23, 2024 11:25 - 12:00 HKT
The Ingress-NGINX project is the most widely used ingress controller globally. As a community-maintained open-source project, and with the release of the Gateway API GA, more and more people are asking what the Ingress-NGINX project plans next and what updates have been made recently.

Ingress-NGINX项目是全球范围内最广泛使用的Ingress控制器项目。作为一个由社区维护的开源项目,随着Gateway API GA的发布,越来越多的人开始关注Ingress-NGINX项目的下一个计划以及最近的更新。
Speakers
Jintao Zhang

Sr. SE, Kong
Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes Ingress-NGINX maintainer. He specializes in cloud-native technology and the Azure technology stack, and works for Kong Inc.
Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 6

11:25 HKT

Beyond Statefulset: Containerize Your Enterprise Stateful Applications in Practice | 超越StatefulSet:实践中将企业有状态应用容器化 - Mingshan Zhao, Alibaba Cloud & Vec Sun, xiaohongshu
Friday August 23, 2024 11:25 - 12:00 HKT
Kubernetes provides StatefulSet to manage stateful services, but in practice it is far from enough to run enterprise stateful applications. For example: how does ZooKeeper accomplish leader election, and how does MQ implement configuration hot loading? How do you handle daily operation and maintenance of a database? Many practitioners resort to operators that manage pods directly (e.g. KubeBlocks) for specific applications such as databases, yet these are not general enough for other stateful applications. OpenKruise provides several stateful features missing from native StatefulSet, such as in-place resource and volume resizing, progressive ConfigMap & Secret hot updates, and a container operation channel. Teams from Alibaba and Xiaohongshu will share their lessons building operators and platforms for general stateful apps and containerizing databases and middleware at a scale of hundreds of thousands of pods.

Kubernetes提供了StatefulSet来管理有状态服务,但实际上要运行企业级有状态应用还远远不够。例如:Zookeeper如何完成领导者选举,MQ如何实现配置热加载?如何进行数据库的日常运维?许多从业者借助直接管理pod的运营商,例如KubeBlocks,针对特定应用程序,例如数据库,但它们并不足够通用以适用于其他有状态应用程序。 OpenKruise提供了一些在原生StatefulSet中缺失的有状态功能,例如原地资源和卷大小调整,渐进式Configmap和Secret热更新以及容器操作通道。来自阿里巴巴和小红书的团队将分享他们构建运营商和平台以适用于通用有状态应用程序,并将数据库和中间件容器化的经验,规模达数十万个pod。
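The progressive ConfigMap/Secret hot update mentioned above can be modeled as a batched rollout that halts on the first unhealthy batch. A minimal sketch with invented names (the real feature is driven by OpenKruise CRDs, not this API):

```python
# Toy model of a progressive config hot update: apply the new config
# batch by batch and stop as soon as a batch turns unhealthy, leaving
# the remaining pods on the old config.

def progressive_update(replicas, new_config, batch_size, healthy):
    """replicas: list of {'name', 'config'} dicts; healthy: pod -> bool."""
    updated = []
    for i in range(0, len(replicas), batch_size):
        batch = replicas[i:i + batch_size]
        for pod in batch:
            pod["config"] = new_config
        updated.extend(pod["name"] for pod in batch)
        if not all(healthy(pod) for pod in batch):
            return updated, False  # halt rollout; later pods untouched
    return updated, True
```

The design choice mirrored here is blast-radius control: a bad config change only reaches one batch of pods before the rollout pauses.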
Speakers
avatar for Mingshan Zhao

Mingshan Zhao

Senior R&D Engineer, Alibaba Cloud
Senior R&D Engineer of AliCloud, Maintainer of OpenKruise community, has long been engaged in the research and development of cloud native, containers, scheduling and other fields; core R&D member of Alibaba's one million container scheduling system, and many years of experience in... Read More →
avatar for Vec Sun

Vec Sun

software engineer, xiaohongshu
Sun Weixiang previously worked on the Alibaba Cloud container team as a software engineer and is a contributor to the OpenKruise, Karmada, and other communities. He is deeply involved in container application orchestration and multi-cluster management.
Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Platform Engineering

11:25 HKT

Evolution of SPDK Vhost-FS Solution to Accelerate File Access in VMs and Secure Containers | SPDK Vhost-FS解决方案的演进,加速虚拟机中的文件访问并保护容器 - Changpeng Liu, Intel
Friday August 23, 2024 11:25 - 12:00 HKT
Virtio-fs is a shared file system between virtual machines or secure containers and the host; Storage Performance Development Kit (SPDK) vhost-fs is the userspace backend implementation of virtio-fs. In this presentation, we will summarize typical storage solutions that use SPDK vhost-fs and its components to build the storage stack, then walk through the evolution of SPDK vhost-fs from BlobFS to the latest FSDEV module. Advanced features such as interrupt mode and the threading model for data processing in SPDK vhost-fs are also covered.

Virtio-fs是虚拟机或安全容器与主机之间共享文件系统,Storage Performance Development Kit(SPDK) vhost-fs是virtio-fs在用户空间的后端实现。在本次演讲中,我们将总结使用SPDK vhost-fs和组件构建存储栈的典型存储解决方案,然后介绍SPDK vhost-fs从BlobFS到最新的FSDEV模块的演变过程,还将涵盖SPDK vhost-fs中用于数据处理的高级功能,如中断模式和线程建模。
Speakers
avatar for Changpeng Liu

Changpeng Liu

Cloud Solution Architect, Intel
Changpeng is a Cloud Solution Architect at Intel. He has been working on Storage Performance Development Kit since 2014. Currently, Changpeng is a core maintainer for the SPDK. His areas of expertise include NVMe, I/O Virtualization, and storage offload on IPU.
Friday August 23, 2024 11:25 - 12:00 HKT
Level 2 | Grand Ballroom 1-2

11:25 HKT

Rollout Patterns: Smoothly Migrating and Rolling Out Your Microservices | 部署模式:平稳迁移和部署您的微服务 - Tim Xiao, DaoCloud & Wu Chenhui, AS.Watson TechLab
Friday August 23, 2024 11:25 - 12:00 HKT
At Watsons, most of their services are built on Dubbo. Now, they aim to utilize delivery tools like Argo CD and Argo Rollouts to automatically and securely deliver their services. However, they have encountered complexities beyond what Argo Rollouts assumes. We will summarize these patterns and demonstrate how to handle them, including: - Pattern 1: One service at a time. - Pattern 2: Multiple services, each forward-compatible. - Pattern 3: Multiple services with version dependency.

在Watsons,他们的大多数服务都是基于Dubbo构建的。现在,他们希望利用Argo CD和Argo Rollouts等交付工具来自动和安全地交付他们的服务。然而,他们遇到了超出Argo Rollouts假设的复杂性。我们将总结这些模式,并演示如何处理它们,包括: - 模式1:一次一个服务。 - 模式2:多个服务,每个都是向前兼容的。 - 模式3:具有版本依赖性的多个服务。
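For "Pattern 3: multiple services with version dependency", a safe rollout order is essentially a topological sort of the dependency graph. A sketch using Python's standard library (a hypothetical helper, not part of Argo Rollouts):

```python
# Compute a rollout order in which every service's dependencies are
# rolled out before the service itself.

from graphlib import TopologicalSorter

def rollout_order(depends_on):
    """depends_on: service -> set of services that must roll out first."""
    return list(TopologicalSorter(depends_on).static_order())
```

For a chain like frontend -> api -> db, this yields db first and frontend last, which is the ordering a forward-compatible multi-service rollout needs.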
Speakers
avatar for 旸 肖

旸 肖

Developer, DaoCloud
Principal Engineer for the DevOps platform at DaoCloud, participant in community projects including argo-cd, argo-rollouts, and kubevela, with more than 5 years of Kubernetes platform development experience.
avatar for Wu Chenhui

Wu Chenhui

architecture, AS.Watson TechLab
I have nearly 30 years of experience in software development and architecture design, and 5 years of experience with Kubernetes, responsible for Kubernetes-related architecture design at Watsons Group.
Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 7

11:25 HKT

How Does KubeEdge Build the Tunnel Which Is Secure, Trusted, and Adaptable to Edge Networks | KubeEdge如何构建适应边缘网络的安全可信隧道 - Wei Hu, DaoCloud
Friday August 23, 2024 11:25 - 12:00 HKT
Edge computing makes connections broader, faster, and more agile, but it also brings the threat of cyberattacks to the edge of the network, which puts forward higher security requirements on the edge side. In addition, because access can take many forms (Internet, 5G, Wi-Fi, and others), the network environment in edge scenarios is complex and its quality cannot be guaranteed. Supporting weak network environments is therefore another challenge at edge sites. KubeEdge is a cloud-edge collaborative architecture project for Kubernetes-native edge computing. KubeEdge uses its own trusted tunnel to ensure the security of data transmission: it verifies, encrypts, and authenticates all communications in this tunnel. The tunnel ensures data accessibility through QoS and provides the QUIC protocol to improve performance under network reordering in weak networks. We will share how the KubeEdge tunnel achieves these goals in this session.

边缘计算使连接更广泛、更快速、更灵活,同时也将网络威胁带到了边缘,这也对边缘安全提出了更高的要求。此外,由于互联网、5G、WIFI等各种形式可能存在,边缘场景中的网络环境将变得复杂,质量无法保证。因此,支持弱网络环境也是边缘场景中的一个挑战。 KubeEdge是一个针对Kubernetes原生边缘计算的云边协作架构项目。KubeEdge使用自己的可信隧道来确保数据传输的安全性,它验证、加密和认证该隧道中的所有通信。该隧道通过QoS确保数据可访问性,并提供QUIC协议来改善弱网络中的网络重排序性能。在本场演讲中,我们将分享KubeEdge隧道如何实现这些目标。
Speakers
avatar for 炜 胡

炜 胡

Senior Software Engineer, DaoCloud
Wei Hu is a Senior Software Engineer at DaoCloud, currently working on Edge Computing Team. He is a maintainer of KubeEdge project and a regular contributor to it. He has rich experience in cloud-edge collaboration. He has given several speeches on the topic of edge computing at other... Read More →
Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 5
  Open Source Summit Sessions, Networking + Edge Computing

13:20 HKT

Build Container Runtime Based on Sandbox API of Containerd | 基于Containerd的Sandbox API构建容器运行时 - Shaobao Feng, Huawei Cloud & Cai Wei, DaoCloud
Friday August 23, 2024 13:20 - 13:55 HKT
The Sandbox API was released in containerd 1.7 and will be stable in containerd 2.0. It provides a clean way to implement a sandbox-oriented container runtime. A container is now more a set of API specifications than a single technology; with the introduction of different kinds of isolation techniques as sandboxes, we need a clear and abstract definition of the Sandbox API to make it easy to integrate different sandboxing techniques into a container runtime. In this sharing, we will: 1. Introduce the Sandbox API of containerd and why we need it. 2. Show how we build our container runtimes based on the Sandbox API and the benefits that come with it. 3. Demonstrate different kinds of sandboxed containers created by Kuasar, a container runtime framework based on the new Sandbox API that currently supports VMM, user-mode kernel, WebAssembly, and runc sandboxes.

Sandbox API在containerd 1.7中发布,并将在containerd 2.0中稳定。它提供了一种清晰的方式来实现面向沙箱的容器运行时。随着不同类型的隔离技术(如沙箱)的引入,容器现在更多地是一组API规范,而不是单一技术。我们需要对Sandbox API进行清晰和抽象的定义,以便轻松集成不同类型的沙箱技术,使其成为容器运行时。 在这次分享中,我们将: 1. 介绍containerd的Sandbox API,以及为什么我们需要它。 2. 展示我们如何基于Sandbox API构建我们的容器运行时以及带来的好处。 3. 我们将展示由基于新Sandbox API的容器运行时框架Kuasar创建的不同类型的沙箱容器的演示,目前支持VMM、UserMode Kernel、WebAssembly和Runc的沙箱。
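The core idea of a sandbox-oriented runtime is a small, abstract controller contract that different isolation techniques implement. A toy rendition in Python (method names are simplified stand-ins, not containerd's actual Go API):

```python
# Toy "Sandbox API": one abstract controller, many isolation backends
# (VMM, user-mode kernel, WebAssembly, runc, ...).

from abc import ABC, abstractmethod

class SandboxController(ABC):
    @abstractmethod
    def start(self, sandbox_id):
        """Bring up the sandbox and return its address."""

    @abstractmethod
    def stop(self, sandbox_id):
        """Tear the sandbox down."""

class WasmSandbox(SandboxController):
    """One hypothetical backend among several."""
    def __init__(self):
        self.running = set()

    def start(self, sandbox_id):
        self.running.add(sandbox_id)
        return f"wasm://{sandbox_id}"

    def stop(self, sandbox_id):
        self.running.discard(sandbox_id)
```

The point of the abstraction is that the runtime above this interface never needs to know which isolation technique is underneath.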
Speakers
avatar for Wei Cai(Iceber Gu)

Wei Cai(Iceber Gu)

Software Engineer, DaoCloud
Senior open source enthusiast, focused on cloud runtime, multi-cloud and WASM. I am a CNCF Ambassador and founded Clusterpedia and promoted it as a CNCF Sandbox project. I also created KasmCloud to promote the integration of WASM with Kubernetes and contribute it to the WasmCloud... Read More →
avatar for Shaobao Feng

Shaobao Feng

Principal Engineer, Huawei Cloud
Shaobao is Principal Engineer working on Huawei Cloud, with his work focusing on the Serverless Platforms. He has been a leader in building secure container runtime of the first Serverless Kubernetes on public cloud. He is the main code contributor and maintainer of the open source... Read More →
Friday August 23, 2024 13:20 - 13:55 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Platform Engineering

13:20 HKT

JuiceFS CSI in Multi-Thousand Node Kubernetes Clusters for LLM Pre-Training | JuiceFS CSI在LLM预训练中用于几千节点Kubernetes集群 - Weiwei Zhu, juicedata
Friday August 23, 2024 13:20 - 13:55 HKT
The rapid advancement of artificial intelligence technologies, especially the development of large language models (LLMs), has led to a sharp increase in the amount of data that enterprises need to process. Managing large-scale data clusters in Kubernetes environments presents several challenges, including storage performance, complex access control management, and system stability. JuiceFS is a distributed POSIX file system designed for the cloud, open-sourced in 2021 (9.8k stars). To deliver an optimal experience in Kubernetes, JuiceFS developed the JuiceFS CSI Driver. In addition, JuiceFS CSI introduced several new designs to support large-scale, complex AI training tasks, such as the mount pod mode and the sidecar mode for serverless environments. Outline: - LLM storage challenges - JuiceFS CSI Driver architecture - Mount pod mode / sidecar mode - Practical experience - Future

人工智能技术的快速发展,特别是大型语言模型(LLMs)的发展,导致企业需要处理的数据量急剧增加。在Kubernetes环境中管理大规模数据集群面临着多个挑战,包括存储性能、复杂的访问控制管理和系统稳定性。 JuiceFS是一种为云设计的分布式POSIX文件系统。它于2021年开源(拥有9.8k星)。为了在Kubernetes中提供最佳体验,JuiceFS开发了JuiceFS CSI驱动程序。此外,JuiceFS CSI引入了几项新设计,以支持大规模、复杂的人工智能训练任务,如挂载Pod模式和用于无服务器环境的Sidecar模式。 大纲: - LLM存储挑战 - JuiceFS CSI驱动程序架构 - 挂载Pod模式\Sidecar模式 - 实践经验 - 未来
Speakers
avatar for Weiwei Zhu

Weiwei Zhu

Full stack engineer, juicedata
She is a full-stack engineer at Juicedata Inc. and a maintainer of the JuiceFS CSI Driver and Fluid. She is responsible for the development and maintenance of JuiceFS in the cloud-native ecosystem, completed the implementation and practice of JuiceFS in Kubernetes, and continues to improve the... Read More →
Friday August 23, 2024 13:20 - 13:55 HKT
Level 2 | Grand Ballroom 1-2

14:10 HKT

Ensuring Success with Kyverno in Production: What You Must Know | 在生产环境中确保Kyverno成功:你必须知道的事项 - Shuting Zhao, Nirmata
Friday August 23, 2024 14:10 - 14:45 HKT
Deploying Kyverno in a production environment requires careful planning and consideration of key factors to ensure success. In this talk, we will delve into the essential aspects that organizations must understand when bringing Kyverno into production. From policy creation to enforcement, scalability, performance optimization, and integration with existing workflows, attendees will gain valuable insights into effectively leveraging Kyverno in a live environment. Moreover, the session will cover strategies for maintaining stability, resilience, and security when using Kyverno at scale. Participants will learn about common pitfalls to avoid, tips for troubleshooting, and proactive measures to enhance the overall production readiness of Kyverno deployments. Whether you are new to Kyverno or looking to optimize your existing implementation, this talk will equip you with the knowledge and guidance needed to successfully navigate the complexities of deploying Kyverno in a production setting.

在生产环境中部署Kyverno需要仔细规划和考虑关键因素,以确保成功。在这次讨论中,我们将深入探讨组织在将Kyverno引入生产环境时必须了解的基本方面。从策略创建到执行、可扩展性、性能优化和与现有工作流程的集成,与会者将获得有关如何在实际环境中有效利用Kyverno的宝贵见解。此外,本场演讲还将涵盖在规模化使用Kyverno时保持稳定性、弹性和安全性的策略。参与者将了解要避免的常见陷阱、故障排除的技巧以及增强Kyverno部署整体生产准备性的积极措施。无论您是初次接触Kyverno还是希望优化现有实施,本次讨论将为您提供成功部署Kyverno在生产环境中的复杂性所需的知识和指导。
Speakers
avatar for Shuting Zhao

Shuting Zhao

Staff Engineer, Nirmata
Shuting Zhao is a Kyverno maintainer and a Staff Engineer at Nirmata. Her passion for open source extends beyond her professional role, as she has also taken on the role of mentor for several LXF mentorship programs since March 2021, she enjoys helping others contribute to open source... Read More →
Friday August 23, 2024 14:10 - 14:45 HKT
Level 1 | Hung Hom Room 6

14:10 HKT

KuaiShou's 100% Resource Utilization Boost: 100K Redis Migration from Bare Metal to Kubernetes | 快手的100%资源利用率提升:从裸机迁移100K Redis到Kubernetes - XueQiang Wu, ApeCloud & YuXing Liu, Kuaishou
Friday August 23, 2024 14:10 - 14:45 HKT
In the past year, Kuaishou successfully migrated nearly 100,000 Redis instances from traditional bare metal environments to the Kubernetes platform, achieving a significant doubling of resource utilization. While ensuring business stability, this large-scale migration faced numerous challenges, including smooth migration execution, finding a balance between increasing deployment density (resource utilization) and ensuring system stability, avoiding interference with other services during coexistence, and addressing specific issues associated with stateful services like databases (including data management, configuration management, ensuring high availability, cross-cluster disaster recovery, etc.). This session will share Kuaishou's large-scale practical experience in Redis cloud-native transformation, in collaboration with the open-source project KubeBlocks, covering aspects such as smooth migration, resource efficiency improvement, and efficient database management.

在过去的一年中,快手成功将近10万个Redis实例从传统裸机环境迁移到Kubernetes平台,实现资源利用率显著翻倍。在确保业务稳定性的同时,这一大规模迁移面临诸多挑战,包括顺利执行迁移、在增加部署密度(资源利用率)和确保系统稳定性之间找到平衡、在共存期间避免与其他服务的干扰,以及解决与数据库等有状态服务相关的特定问题(包括数据管理、配置管理、确保高可用性、跨集群灾难恢复等)。 本场演讲将分享快手在Redis云原生转型方面的大规模实践经验,与开源项目KubeBlocks合作,涵盖顺利迁移、资源效率提升和高效数据库管理等方面。
Speakers
avatar for yuxing liu

yuxing liu

senior software engineer, Kuaishou
I have worked in the cloud-native teams of Alibaba Cloud and Kuaishou, focusing on the cloud-native field and gaining experience in open source, commercialization, and scaling of cloud-native technologies. I am one of the maintainers of the CNCF/Dragonfly project and also one of the... Read More →
avatar for XueQiang Wu

XueQiang Wu

Director of Research and Development, ApeCloud
Former tech leader at Alibaba Cloud PolarDB-X, a cloud-native distributed database, with a wide range of interests and expertise in operating systems, cryptography, distributed systems, and more. Joined the PolarDB-X team in 2017, focusing on the development of high-concurrency, low-latency... Read More →
Friday August 23, 2024 14:10 - 14:45 HKT
Level 2 | Grand Ballroom 1-2

14:10 HKT

Model Service Mesh: A New Paradigm for Large-Scale AI Model Service Deployment and Management | 模型服务网格:大规模AI模型服务部署和管理的新范式 - Xi Ning Wang, Alibaba Cloud & Huailong Zhang, Intel China
Friday August 23, 2024 14:10 - 14:45 HKT
As AI/ML models grow in scale and complexity, how to efficiently deploy and manage model service in cloud-native environments has become a significant challenge. This proposal will introduce the Model Service Mesh (MSM), an emerging architectural paradigm designed specifically for large-scale AI model service deployment and management, to address the challenge. This new paradigm focuses on: 1. How to build a highly scalable and reliable model delivery system and the key features include dynamic model service routing, unified management for multi-models within single endpoint, an optimized caching layer, and cache-aware scheduling,etc. 2. How to leverage the MSM to optimize AI models service in lifecycle management, resource utilization improvement, security enhancement, and observability and resilience insurance. In essence, this architecture ensures a scalable, secure, and efficient model service in cloud native environment.

随着人工智能/机器学习模型规模和复杂性的增长,如何在云原生环境中高效部署和管理模型服务已成为一个重大挑战。本提案将介绍模型服务网格(MSM),这是一种专门为大规模人工智能模型服务部署和管理而设计的新兴架构范式,旨在解决这一挑战。这种新范式关注以下几点: 1. 如何构建一个高度可扩展和可靠的模型交付系统,关键特性包括动态模型服务路由、单个端点内多模型的统一管理、优化缓存层和缓存感知调度等。 2. 如何利用MSM优化人工智能模型服务的生命周期管理、资源利用率改善、安全增强以及可观察性和弹性保障。 总的来说,这种架构确保了在云原生环境中可扩展、安全和高效的模型服务。
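The cache-aware scheduling mentioned above comes down to routing a request to a replica that already holds the model locally, so the expensive model load is skipped. An illustrative sketch (hypothetical data shapes, not the MSM implementation):

```python
# Cache-aware routing: prefer replicas with the model already cached,
# then break ties by current load; fall back to least-loaded overall.

def route(replicas, model):
    """replicas: list of {'name', 'cached' (set), 'load' (int)} dicts."""
    warm = [r for r in replicas if model in r["cached"]]
    pool = warm or replicas  # no warm replica -> any replica must load it
    return min(pool, key=lambda r: r["load"])["name"]
```

A warm replica wins even when a cold one is less loaded, because re-loading a large model usually costs far more than a slightly longer queue.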
Speakers
avatar for 王夕宁

王夕宁

Technical Leader, Alibaba Cloud
Wang Xining, senior technical expert of Alibaba Cloud, technical leader of ACK(Kubernetes)/ASM(Service Mesh) , focusing on Kubernetes, service mesh and other cloud native fields. Previously worked in the IBM as tech architect focusing on SOA/Cloud and served as the chairman of the... Read More →
avatar for Huailong Zhang

Huailong Zhang

Cloud Software Engineer, Intel China
Steve(Huailong) Zhang has worked for Alcatel-Lucent, Baidu and IBM to engage in cloud computing research and development. Huailong is currently working for Intel China as a cloud-native software engineer, focusing on cloud-native technical fields, such as kubernetes and service mesh... Read More →
Friday August 23, 2024 14:10 - 14:45 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

15:15 HKT

Dragonfly: Intro, Updates and Ant Group's Practice of Accelerating Model Distribution in Ray Serving | Dragonfly:介绍、更新和蚂蚁集团在Ray Serving中加速模型分发的实践 - Wenbo Qi, Ant Group & Qixiang Chen, AntGroup
Friday August 23, 2024 15:15 - 15:50 HKT
Dragonfly provides efficient, stable, and secure file distribution and image acceleration based on P2P technology, and has become a best-practice, standard solution in cloud native architectures. This talk gives an introduction to Dragonfly and the features of its latest version, along with AI model distribution practice in AI inference. Additionally, Ray uses Dragonfly as its file distribution solution for large-scale clusters. We will then introduce practical problems of model distribution in LLM and multimedia services, and how Ray solves them in Ant Group's production environment.

Dragonfly提供了基于P2P技术的高效、稳定和安全的文件分发和图像加速,成为云原生架构中的最佳实践和标准解决方案。在本次讨论中,将介绍Dragonfly及最新版本的特性,以及在AI推理中的AI模型分发实践。此外,Ray将Dragonfly作为其大规模集群的文件分发解决方案。随后,我们将介绍LLM和多媒体服务中模型分发的实际问题,以及Ray在蚂蚁集团生产环境中如何解决这些问题。
Speakers
avatar for Wenbo Qi

Wenbo Qi

Senior Software Engineer, Ant Group
Wenbo Qi is a software engineer at Ant Group working on Dragonfly. He is a maintainer of Dragonfly. He hopes to make positive contributions to open source software and believes that fear springs from ignorance.
avatar for Qixiang Chen

Qixiang Chen

Engineer, AntGroup
Qixiang Chen is a software engineer at Ray team of Ant Group. His research interests include distributed systems and System4ML, and he published several papers in academic conferences and journals. Based on the research experience, he is the main author of Rayfed - a distributed federated... Read More →
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 6

15:15 HKT

Expanding Cloud Native Capabilities with WASM: A Case Study of Harbor and WASM Integration | 通过WASM扩展云原生能力:Harbor和WASM集成案例研究 - Chenyu Zhang, AntGroup & Yan Wang, Broadcom
Friday August 23, 2024 15:15 - 15:50 HKT
In the cloud-native realm, eBPF's versatility has led to scalable solutions in observability and security by attaching to system event checkpoints without kernel code modification. This concept has paved the way for extending business applications non-invasively and flexibly without altering the original code. In this session, we'll use Harbor, the cloud-native artifact registry, to showcase how WASM (WebAssembly) extends Harbor's functionalities without code modification. Here, Harbor is analogous to the Linux kernel, and WASM to user-provided eBPF programs. Harbor provides mounting points for various events, such as pre-pull requests, enabling users to filter requests with custom WASM programs. This facilitates fine-grained permission control and artifact security auditing before a user pulls the artifacts, with more features to discover.

在云原生领域,eBPF 的多功能性使得它能够通过附加到系统事件检查点而无需修改内核代码,从而实现可扩展的可观测性和安全性解决方案。这一概念为在不改变原始代码的情况下非侵入性和灵活地扩展业务应用程序铺平了道路。 在本场演讲中,我们将使用 Harbor,云原生制品注册表,展示如何使用 WASM(WebAssembly)在不修改代码的情况下扩展 Harbor 的功能。在这里,Harbor 类似于 Linux 内核,而 WASM 则类似于用户提供的 eBPF 程序。Harbor 提供了各种事件的挂载点,例如预拉取请求,使用户能够使用自定义的 WASM 程序过滤请求。这有助于在用户拉取制品之前进行细粒度的权限控制和制品安全审计,还有更多功能等待您去发现。
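The "mount point" idea above, where user-supplied WASM programs filter requests at checkpoints such as pre-pull, can be modeled as a pluggable filter chain. A toy sketch with invented names (the real integration runs WASM modules, not Python callables):

```python
# Toy pre-pull gate: registered filters run at the checkpoint and any
# one of them may reject the request, enabling fine-grained permission
# control and auditing before an artifact is pulled.

class PullGate:
    def __init__(self):
        self.filters = []

    def register(self, filter_fn):
        """filter_fn(user, artifact) -> bool (True allows the pull)."""
        self.filters.append(filter_fn)

    def allow(self, user, artifact):
        return all(f(user, artifact) for f in self.filters)
```

Like eBPF programs attached to kernel hooks, filters are added without modifying the host application at all.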
Speakers
avatar for Yan Wang

Yan Wang

Staff engineer, Broadcom
Yan Wang is a Staff Engineer at Broadcom (VMware). As one of the core maintainers of the CNCF project Harbor and a maintainer of the CNCF project Distribution, his main work focuses on technology research and innovation in the cloud native field.
avatar for Chenyu Zhang

Chenyu Zhang

Software Engineer, AntGroup
Chenyu Zhang is a software engineer, currently mainly responsible for the development and maintenance of Project Harbor, with experience in DevOps and cloud native technology stacks.
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Platform Engineering

15:15 HKT

The Challenges of Kubernetes Data Protection - Real Examples and Solutions with Velero | Kubernetes数据保护的挑战- Velero的真实案例和解决方案 - Wenkai Yin, Broadcom & Bruce Zou, Shanghai Jibu Tech
Friday August 23, 2024 15:15 - 15:50 HKT
The distributed and dynamic nature of Kubernetes makes data protection challenging when guaranteeing data availability and durability. Below is a summary of the issues we encountered in real customer environments: 1. Application definition and resource capture 2. Application data consistency 3. Application restore in heterogeneous and cross-cloud environments We provide a detailed description of these issues in the "Additional resources" section due to the character limit of the "Description".

Kubernetes的分布式和动态特性使得数据保护变得具有挑战性,以确保数据的可用性和持久性。以下是我们在真实客户环境中遇到的问题摘要: 1. 应用程序定义和资源捕获 2. 应用程序数据一致性 3. 跨异构和跨云环境的应用程序恢复 由于“描述”部分的字符限制,我们将在“附加资源”部分提供这些问题的详细描述。
Speakers
avatar for Bruce Zou

Bruce Zou

Co-founder and Development Director, Shanghai Jibu Tech
Over 10 years of storage development and architecture experience at the IBM storage system lab, with 15+ disclosures and publications; supported 10+ major accounts on critical high-end storage system issues. Rich experience in building highly available storage systems, leading... Read More →
avatar for Wenkai Yin

Wenkai Yin

Staff Software Engineer, Broadcom
Staff software engineer focused on cloud-native development. Core maintainer of the open source projects Harbor and Velero.
Friday August 23, 2024 15:15 - 15:50 HKT
Level 2 | Grand Ballroom 1-2

15:15 HKT

The Experience of ChillyRoom Developing & Managing Session-Based Game on K8s with OpenKruiseGame | 在K8s上使用OpenKruiseGame开发和管理基于会话的游戏的ChillyRoom经验 - Qiuyang Liu, Alibaba Cloud & Xinhao Liu, ChillyRoom
Friday August 23, 2024 15:15 - 15:50 HKT
In the era of traditional game operation and maintenance, session-based games face huge challenges in delivery efficiency and resource costs. Cloud native technology brings exactly the flexibility and highly automated capabilities that session-based games need. However, because game servers are strongly stateful, implementing games on Kubernetes also involves various difficulties. This talk will focus on the characteristics of session-based games and describe how ChillyRoom uses OpenKruiseGame, a subproject of the CNCF incubating project OpenKruise, to develop and manage session-based games on Kubernetes, providing developers in the game industry with cloud native implementation experience in automatic network access, elastic scaling of game servers, matching logic development, room status management, and more.

在传统游戏运维时代,基于会话的游戏在交付效率和资源成本方面面临巨大挑战。云原生技术正好为会话型游戏带来了灵活性和高度自动化能力。然而,由于游戏服务器具有强烈的有状态特性,在实现游戏在 Kubernetes 上的过程中也存在各种困难。 本次演讲将重点关注会话型游戏的特点,并描述 ChillyRoom 如何使用 OpenKruise 的子项目 OpenKruiseGame 来开发和管理基于会话的游戏在 Kubernetes 上,为游戏行业的开发人员提供云原生实现经验,包括自动网络访问、游戏服务器的弹性扩展、匹配逻辑开发和房间状态管理等。
Speakers
avatar for Qiuyang Liu

Qiuyang Liu

Senior R&D Engineer, Alibaba Cloud
Qiuyang Liu, head of cloud native game at Alibaba Cloud Container Service and maintainer of the kruise-game project. He has long been engaged in the research and development of cloud native in the gaming field and is committed to promoting the implementation of cloud native in the... Read More →
avatar for Xinhao Liu

Xinhao Liu

Engineer, ChillyRoom
Xinhao Liu, an engineer with one year experience in game server development at ChillyRoom and three years experience in Linux OS and cloud core network software development in industry. He has a passion for creating flexible, high-performance, high-available and easy-to-maintain game... Read More →
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 7

16:05 HKT

CubeFS Boosts Efficiency of AI Production | CubeFS提高了AI生产的效率 - Chi He, OPPO
Friday August 23, 2024 16:05 - 16:40 HKT
With the booming development of AI, the scale of data required for AI model training has been increasing. As one of the fundamental infrastructures for AI, distributed file storage faces significant challenges, such as scalability and the need to provide high performance and stable storage while considering cost-effectiveness. This presentation will mainly share the practical experience and reflections of CubeFS in addressing these challenges.

随着人工智能的蓬勃发展,用于AI模型训练的数据规模不断增加。作为AI的基础设施之一,分布式文件存储面临着诸多挑战,如可扩展性和在考虑成本效益的同时提供高性能和稳定存储。本次演讲将主要分享CubeFS在应对这些挑战方面的实践经验和反思。
Speakers
avatar for chi he

chi he

senior engineer, OPPO
CubeFS committer, responsible for the design and development of the CubeFS storage engine, including features such as hybrid cloud and caching acceleration.
Friday August 23, 2024 16:05 - 16:40 HKT
Level 1 | Hung Hom Room 6

16:05 HKT

JD Cloud's Large-Scale Serverless Practice : APP Management and Elastic Scaling on Karmada | 京东云的大规模无服务器实践:在Karmada上的应用管理和弹性扩展 - XiaoFei Wang & Chen Yanying, JDCloud
Friday August 23, 2024 16:05 - 16:40 HKT
At JD Cloud, the federated Serverless service is based on a federated management model and a Serverless application model, providing JDOS application container control services for federated application container deployment, elastic scaling, and fault migration. It manages multiple clusters with over 10,000 nodes, unifying the management of multiple sub-clusters to improve overall resource utilization and reducing the complexity of multi-cluster management, scheduling, and distribution on the platform. End users can use our platform just like the native Kubernetes API. Along the way, we addressed numerous technical challenges, including: 1. Multi-cluster management and distribution practice 2. An efficient cross-cluster elastic scaling solution 3. Problems encountered in production, and the lessons learned

在京东云中,联邦Serverless服务基于联邦管理模型和Serverless应用模型,为联邦应用容器部署、弹性扩展和故障迁移提供JDOS应用容器控制服务。它管理超过10,000个节点的多个集群。统一管理多个子集群,提高整体资源利用率。减少平台上多集群管理、调度和分发的复杂性。最终用户可以像使用本机Kubernetes API一样使用我们的平台。在整个过程中,我们将解决许多技术挑战,包括: 1. 多集群管理和分发实践 2. 高效的跨集群弹性扩展解决方案 3. 在生产和分享中遇到的问题
Speakers
avatar for Chen Yanying

Chen Yanying

Cloud Native Engineer, JDCloud
Engaged in the construction and internal promotion of basic platforms such as Federated Clusters, Serverless, Service Mesh and some middleware, based on JD's large-scale Kubernetes clusters
avatar for XiaoFei Wang

XiaoFei Wang

CloudNativeEngineer, JDCloud
As a software engineer, he is responsible for cluster deployment, multi-cluster management, and federated clusters. He has participated in JD.com's 618 and 11.11 events and has rich practical experience in cloud native.
Friday August 23, 2024 16:05 - 16:40 HKT
Level 1 | Hung Hom Room 1
  KubeCon + CloudNativeCon Sessions, Platform Engineering

16:05 HKT

TiDB: Your Next MySQL Is Not a MySQL | TiDB:你的下一个 MySQL 何必是 MySQL - Qizhi Wang, PingCAP
Friday August 23, 2024 16:05 - 16:40 HKT
You might have heard of TiDB, a distributed open-source database known for its virtually limitless horizontal scalability, capable of handling both online transactional processing and analytical workloads while being compatible with the MySQL protocol. Traditionally, different databases have been employed to handle various workloads in our application architecture designs. Commonly, relational databases are used for online transaction processing, with data asynchronously distributed to analytical databases, document stores, and cache databases. With the rise of AI, an additional type of database needs consideration: the vector database. But introducing this type of database can add unnecessary complexity to your technology stack. In this talk we will discuss how TiDB integrates multiple functionalities, such as real-time transaction processing, online analytics, a sharding-free architecture, and vector type computations, all aimed at reducing the cognitive load for developers.

您可能已经听说过 TiDB,这是一个分布式开源数据库,以其几乎无限的水平扩展性而闻名,能够处理在线事务处理和分析工作负载,同时兼容 MySQL 协议。 传统上,在我们的应用架构设计中,通常会使用不同的数据库来处理各种工作负载。通常情况下,关系数据库用于在线事务处理,数据会异步分布到分析数据库、文档存储和缓存数据库。随着人工智能的兴起,还需要考虑一种额外的数据库类型 —— 向量数据库。但引入这种类型的数据库可能会给您的技术堆栈增加不必要的复杂性。 在本次演讲中,我们将讨论 TiDB 如何集成多种功能,如实时事务处理、在线分析、无分片架构和向量类型计算,所有这些都旨在减少开发人员的认知负荷。
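The "vector type computations" mentioned above boil down to similarity search over embeddings. A dependency-free model of cosine-similarity nearest-neighbour lookup (TiDB itself exposes this through SQL; this sketch only illustrates the computation):

```python
# Cosine-similarity top-1 search over stored embedding vectors.

import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top1(query, rows):
    """rows: list of (id, embedding); return the id most similar to query."""
    return max(rows, key=lambda r: cosine(query, r[1]))[0]
```

Keeping this lookup inside the transactional database is exactly the stack-simplification argument the abstract makes: no separate vector store to operate and synchronize.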
Speakers
avatar for Qizhi Wang

Qizhi Wang

TiDB Ecosystem Software Architect and Senior Developer Advocate at PingCAP, PingCAP
Qizhi is a TiDB Ecosystem Software Architect & Senior Developer Advocate at PingCAP, the company behind TiDB. In this role, He focuses on EcoSystem development and has been instrumental in integrating TiDB with various platforms such as AWS, GORM, MySQL Connector/J, Hibernate, DBeaver... Read More →
Friday August 23, 2024 16:05 - 16:40 HKT
Level 2 | Grand Ballroom 1-2

16:05 HKT

Unlocking LLM Performance with EBPF: Optimizing Training and Inference Pipelines | 通过eBPF解锁LLM性能:优化训练和推理管道 - Yang Xiang, Yunshan Networks, Inc.
Friday August 23, 2024 16:05 - 16:40 HKT
The training and inference processes of Large Language Models (LLMs) involve handling vast amounts of model data and training data, and consume significant GPU compute resources. However, enhancing GPU utilization becomes extremely challenging in the absence of observability. This presentation will introduce how to achieve observability in LLM training and inference processes with zero disruption using eBPF. This includes utilizing Memory Profiling to understand the loading performance of models and training data, Network Profiling to comprehend the data exchange performance, and GPU Profiling to analyze GPU's MFU (Model FLOPs Utilization) and performance bottlenecks. Additionally, we will share the practical effects of implementing observability in a PyTorch LLM application and the llm.c project using eBPF, aiming to enhance training and inference performance.

大型语言模型(LLMs)的训练和推断过程涉及处理大量的模型数据和训练数据,并消耗大量的GPU计算资源。然而,在缺乏可观察性的情况下,提高GPU利用率变得极具挑战性。 本次演讲将介绍如何利用eBPF在LLM训练和推理过程中实现零中断的可观察性。这包括利用内存分析来了解模型和训练数据的加载性能,网络分析来理解数据交换性能,以及GPU分析来分析GPU的MFU(模型FLOPs利用率)和性能瓶颈。 此外,我们将分享在PyTorch LLM应用程序和llm.c项目中使用eBPF实现可观察性的实际效果,旨在提高训练和推理性能。
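MFU, one of the GPU metrics mentioned above, is simply achieved training FLOPs divided by the hardware's peak FLOPs. A back-of-the-envelope sketch using the common 6 * params FLOPs-per-token approximation for a transformer forward plus backward pass (an approximation, not DeepFlow's exact computation):

```python
# Model FLOPs Utilization (MFU) estimate from training throughput.

def mfu(params, tokens_per_s, peak_flops):
    """params: model parameter count; tokens_per_s: measured training
    throughput; peak_flops: aggregate peak FLOP/s of the GPUs used."""
    achieved = 6 * params * tokens_per_s  # ~6 FLOPs per parameter per token
    return achieved / peak_flops
```

A 1B-parameter model training at 10k tokens/s against 100 TFLOP/s of peak compute comes out to an MFU of 0.6; values far below hardware expectations are the signal to profile memory, network, and kernel behavior as the talk describes.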
Speakers
avatar for Yang Xiang

Yang Xiang

VP of Engineering, Yunshan Networks, Inc.
Received a Ph.D. from Tsinghua University, and currently serving as VP of Engineering at Yunshan Networks and the head of the DeepFlow open-source community. He has presented academic papers on topics such as application observability and network measurement at top international academic... Read More →
Friday August 23, 2024 16:05 - 16:40 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, Observability
 
