KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024

In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

亲临现场

2024年8月21-23日

了解更多并注册参加

Sched应用程序允许您创建自己的日程安排，但不能替代您的活动注册。您必须注册参加KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024，才能参加会议。如果您尚未注册但希望加入我们，请访问活动注册页面购买注册。

请注意：本日程自动显示为香港标准时间（UTC +8）。要查看您偏好的时区的日程，请从右侧“按日期筛选”上方的下拉菜单中选择。日程可能会有变动，会议席位先到先得。

11:00 HKT

Accelerating Serverless AI Large Model Inference with Functionalized Scheduling and RDMA | 通过功能化调度和RDMA加速无服务器AI大模型推理 - Yiming Li, Tianjin University & Chenglong Wang, Jinan Inspur Data Technology Co., Ltd.

Wednesday August 21, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 7

The deployment of large AI models on standard serverless inference platforms like KServe is gaining popularity because it improves resource utilization and reduces costs. However, existing large model inference faces significant scheduling and communication bottlenecks, making it challenging to meet low-latency and high-throughput demands. The centralized control plane of Kubernetes leads to low scheduling efficiency and cannot respond within seconds to large-scale bursts of requests. Additionally, large model inference needs to transfer a GB-level KV cache for each request, resulting in high communication overhead. To address this, we have developed a highly elastic functionalized scheduling framework that guarantees second-level scheduling for thousands of serverless AI large model inference task instances. We also leverage RDMA technology to achieve high-speed KV cache migration, avoiding the high overhead caused by traditional network protocol stacks.

AI大模型在像KServe这样的标准无服务器推理平台上的部署越来越受欢迎，因为它能够提高资源利用率并降低成本。然而，现有的大模型推理面临着重要的调度和通信瓶颈，使得满足低延迟和高吞吐量需求变得具有挑战性。Kubernetes的集中式控制平面导致低调度效率，无法实现对大规模突发请求的秒级响应。此外，大模型推理需要为每个请求传输GB级别的KV缓存，导致高通信开销。因此，我们开发了一个高度弹性的功能化调度框架，以确保对数千个无服务器AI大模型推理任务实例进行秒级调度。此外，我们利用RDMA技术实现高速KV缓存迁移，避免传统网络协议栈引起的高开销。

Speakers

Cookie

Senior Software Engineer, Jinan Inspur Data Technology Co., Ltd.

I work at Inspur. I mainly do container computing development and am familiar with container networks, especially Calico and Cilium. I'm also a contributor to the OpenYurt community and mainly participate in the development of the Raven project.

Yiming Li

PhD candidate, Tianjin University

Yiming Li received the bachelor’s and master’s degrees from Tianjin University, China, in 2017 and 2019, respectively. He is currently pursuing the Ph.D. degree with the College of Intelligence and Computing, Tianjin University, China. His research interests include cloud com...

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

How to Increase the Throughput of Kubernetes Scheduler by Tens of Times | 如何将Kubernetes调度器的吞吐量提高数十倍 - Yuquan Ren & Bing Li, ByteDance

Wednesday August 21, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 2

Currently, the various Kubernetes-based task schedulers popular in the community have limited performance, which restricts the cluster scale they can handle. With cluster scale limited, it is difficult to improve resource utilization through large-scale colocation, and running more clusters brings a greater operational burden. Two limits stand out: 1. Due to bottlenecks in the scheduler and related components, the maximum cluster scale cannot exceed 5k nodes; 2. In clusters with more than 5k nodes, scheduling throughput cannot exceed 100 Pods/s. Godel Scheduler is a distributed high-performance scheduler based on Kubernetes, and it is now open source. In this talk, we will go deep into the performance optimization methods of Godel Scheduler: 1. Optimizing scheduling algorithms and refactoring data structures; 2. Implementing optimistic concurrency under a multi-shard architecture to achieve parallel computation; 3. Abstracting "batch" scheduling to fully reuse scheduling computation results.

目前，社区中流行的基于Kubernetes的各种任务调度器在性能方面存在一定限制，这限制了它们能处理的集群规模。由于集群规模的限制，通过大规模的共存难以提高资源利用率，而且更多的集群也会带来更大的运维负担。1. 由于调度器及相关组件的瓶颈，最大集群规模无法超过5k个节点；2. 在超过5k个节点的集群中，调度吞吐量无法超过100个Pod/s。 Godel Scheduler是一个基于Kubernetes的分布式高性能调度器，现已开源。在本次演讲中，我们将深入探讨godel调度器的性能优化方法：1. 优化调度算法并进行数据结构重构；2. 在多分片架构下实现乐观并发以实现并行计算；3. 抽象“批量”调度以充分重用调度计算结果。
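As a rough illustration of the optimistic concurrency idea named in the abstract above, the sketch below lets several scheduler shards plan against private snapshots and commit only if the chosen node has not changed in the meantime. It is a toy model under stated assumptions, not Godel Scheduler's actual code: the names `Node`, `Cluster`, `pick_node`, and `schedule` are hypothetical, and the scoring rule (most free CPU) is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: int
    version: int = 0  # bumped on every successful commit


@dataclass
class Cluster:
    nodes: dict = field(default_factory=dict)

    def snapshot(self):
        # Each scheduler shard plans against a private copy of cluster state.
        return {n.name: Node(n.name, n.free_cpu, n.version) for n in self.nodes.values()}

    def commit(self, node_name, cpu, seen_version):
        # Optimistic check: accept only if the node is unchanged since the snapshot.
        node = self.nodes[node_name]
        if node.version != seen_version or node.free_cpu < cpu:
            return False
        node.free_cpu -= cpu
        node.version += 1
        return True


def pick_node(snapshot, cpu):
    # Hypothetical scoring: the fitting node with the most free CPU wins.
    fitting = [n for n in snapshot.values() if n.free_cpu >= cpu]
    return max(fitting, key=lambda n: (n.free_cpu, n.name), default=None)


def schedule(cluster, cpu, max_retries=3):
    # Plan on a snapshot, try to commit, and re-plan if another shard won the race.
    for _ in range(max_retries):
        snap = cluster.snapshot()
        choice = pick_node(snap, cpu)
        if choice is None:
            return None
        if cluster.commit(choice.name, cpu, choice.version):
            return choice.name
    return None
```

In this model, two shards that snapshot the cluster at the same time may both choose the same node; the first commit succeeds, the second is rejected by the version check, and the losing shard simply re-plans on fresh state rather than holding a global lock.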

Speakers

Yuquan Ren

Cloud Native Architect, ByteDance

Yuquan Ren has 10+ years of working experience in the cloud-native field, contributing extensively to open-source projects such as Kubernetes. Currently, he is a tech leader at ByteDance, primarily focusing on the field of orchestration and scheduling.

Bing Li

Senior Software Engineer, ByteDance

Bing Li has participated in the open source community for nearly 3 years. Currently, he is a senior software engineer at ByteDance, focusing on scheduling system performance optimization and system evolution.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 中文 (Chinese)

Securing the Supply Chain: A Practical Guide to SLSA Compliance from Build to Runtime | 保障供应链安全：从构建到运行的SLSA合规实用指南 - Enguerrand Allamel, Ledger

Wednesday August 21, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 1

Navigating the complexities of supply chain security might seem intimidating, especially with evolving frameworks like SLSA (Supply-chain Levels for Software Artifacts). This talk introduces beginners to the foundational practices required to secure software from build to runtime using CNCF tools. We'll explore how GitHub Actions can automate build processes, integrate with Cosign for keyless artifact signing, and use Kyverno for runtime policy enforcement. Additionally, we'll discuss how tools like in-toto and Kubescape help manage and verify artifact integrity, providing a holistic view of SLSA compliance in the Kubernetes ecosystem. To enhance security further, we will also briefly discuss the potential integration of Hardware Security Modules (HSMs) into the supply chain. HSMs can offer an added layer of security for key management operations critical to signing processes, ensuring that cryptographic keys are managed securely and are resilient against attack.

供应链安全的复杂性可能看起来令人望而却步，尤其是随着像SLSA（软件构件供应链级别）这样不断发展的框架。本次演讲将向初学者介绍使用CNCF工具来确保软件从构建到运行时的基本实践。我们将探讨GitHub Actions如何自动化构建流程，与Cosign集成进行无密钥构件签名，以及使用Kyverno进行运行时策略执行。此外，我们还将讨论像in-toto和Kubescape这样的工具如何帮助管理和验证构件完整性，为Kubernetes生态系统中的SLSA合规性提供全面视角。为了进一步增强安全性，我们还将简要讨论将硬件安全模块（HSMs）集成到供应链中的潜在可能性。HSMs可以为关键管理操作提供额外的安全层，这对签名过程至关重要，确保加密密钥得到安全管理，并且具有抵御攻击的弹性。

Speakers

Enguerrand Allamel

Senior Cloud Security Engineer, Ledger

Enguerrand is a Senior Cloud Security Engineer with experience in Site Reliability Engineering at Ledger since 2022. His work focuses on the security of scalable and reliable cloud systems, leveraging his knowledge of hybrid computing technologies and container orchestration with...

KubeCon + CloudNativeCon Sessions, Security

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 英语 (English)

11:50 HKT

AI Inference Performance Acceleration: Methods, Tools, and Deployment Workflows | AI推理性能加速：方法、工具和部署工作流程 - Yifei Zhang & 钱磊, ByteDance

Wednesday August 21, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 7

As AI rapidly evolves and embraces cloud-native technologies, inference performance has become crucial for application value. GPU selection, serving framework configuration, and model/data loading significantly impact inference efficiency. We'll focus on cloud-native solutions to storage performance issues and tools for evaluating inference performance across configurations, offering optimal deployment setups integrated into cloud-native workflows. We'll discuss inference performance's impact on user experience and how optimization can reduce costs and improve efficiency. Using technologies like Fluid and model optimization, we'll share strategies to enhance inference performance. Based on performance and cost analysis of various GPUs, we'll guide AI engineers in hardware selection. Additionally, we'll introduce a performance testing tool to evaluate and recommend the best model, hardware, and acceleration scheme combinations, aligning with deployment workflows based on test results.

随着人工智能的快速发展和对云原生技术的采用，推理性能对应用价值变得至关重要。 GPU选择、服务框架配置以及模型/数据加载对推理效率有着重大影响。我们将专注于云原生解决方案，解决存储性能问题，并提供评估不同配置下推理性能的工具，为云原生工作流程提供最佳部署设置。我们将讨论推理性能对用户体验的影响，以及优化如何降低成本并提高效率。利用Fluid和模型优化等技术，我们将分享增强推理性能的策略。基于各种GPU的性能和成本分析，我们将指导人工智能工程师进行硬件选择。此外，我们将介绍一种性能测试工具，评估并推荐最佳模型、硬件和加速方案组合，根据测试结果与部署工作流程相匹配。

Speakers

Yifei Zhang

Software Engineer, ByteDance

Yifei Zhang, Software Engineer at Volcengine, focuses on technical research and product development in Kubernetes and AI and has rich experience in public cloud, now working full time on VKE (Volcengine Kubernetes Engine), which is the managed Kubernetes product in Volcengine...

钱磊

Software Engineer, ByteDance

A Kubernetes developer at ByteDance, focused on building a stable Kubernetes engine on public cloud.

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

Extend Kubernetes to Edge Using Event-Based Transport | 使用基于事件的传输将Kubernetes扩展到边缘 - Longlong Cao & Meng Yan, Red Hat

Wednesday August 21, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 1

Struggling with extensive edge cluster management? Kubernetes adoption brings new challenges, especially in sectors like telecom, retail, and manufacturing. The surge in clusters highlights Kubernetes' limitations, worsened by unreliable networks between data centers and edge clusters. Without scalable control, organizations resort to sending engineers to maintain thousands or even millions of edge clusters, slowing progress. But, we have a solution: connecting Kubernetes and edge clusters via event-based transport, utilizing standard open-source protocols like Kafka, MQTT, and NATS. This enhances Kubernetes-style events, making them resilient to network delays or disconnects. With these capabilities, we can effortlessly construct a central control plane scalable to millions of edge clusters. Join us for an intuitive control plane, handling a million edge clusters across regions. Learn an approach that can be adapted to your edge management infrastructure today.

您是否正在为庞大的边缘集群管理而苦恼？Kubernetes的采用带来了新的挑战，尤其是在电信、零售和制造等行业。集群数量的激增凸显了Kubernetes的局限性，加剧了数据中心和边缘集群之间不稳定网络的问题。在缺乏可扩展控制的情况下，组织不得不派遣工程师去维护成千上万甚至数百万个边缘集群，从而拖慢了进展。但是，我们有解决方案：通过基于事件的传输将Kubernetes和边缘集群连接起来，利用标准的开源协议如Kafka、MQTT和NATS。这样可以增强Kubernetes风格的事件，使其能够抵御网络延迟或断开连接。有了这些功能，我们可以轻松构建一个可扩展到数百万个边缘集群的中央控制平台。加入我们，体验一个直观的控制平台，可以跨区域管理数百万个边缘集群。学习一种可以立即应用于您的边缘管理基础设施的方法。

Speakers

Longlong Cao

Senior Software Engineer, Red Hat

Long Long Cao currently works as a cloud engineer at Red Hat. He is also a maintainer of the Istio project and a member of the Kubernetes SIGs. He is passionate about open source projects and has extensive experience in Docker, Kubernetes and Service Mesh. He writes blogs/articles and...

Meng Yan

Software Engineer, Red Hat

Meng Yan currently works as a software engineer at Red Hat, where he mainly works on the management of large-scale clusters. He has mainly contributed to open source projects such as multicluster-global-hub and multicluster-controlplane, and he also participates in improving CloudEvents.

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

Implementing Fine-Grained and Pluggable Container Resource Management Leveraging NRI | 基于 NRI 实现精细化且可插拔的容器资源管理 - Qiang Ren, Intel & He Cao, ByteDance

Wednesday August 21, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 2

To overcome Kubernetes' limitations in resource management, ByteDance developed Katalyst, a resource management system. Katalyst employs a range of methodologies, including colocation, node over-commitment, specification recommendation, and tidal colocation, aimed at optimizing cluster resource utilization.

Initially, Katalyst introduced a QoS Resource Manager (QRM) framework within kubelet, facilitating versatile container resource allocation through a plugin architecture. Presently, the Node Resource Interface (NRI) presents a refined alternative.

This session elucidates how Katalyst leverages NRI for fine-grained and adaptable container resource management, ensuring efficiency without intrusive modifications of upstream components. This novel architecture allows Katalyst to seamlessly integrate with native Kubernetes, offering a user-friendly and easily maintainable solution.

为了克服 Kubernetes 在资源管理方面的局限性，字节跳动构建了一个资源管理系统 Katalyst，通过在离线业务常态混部、资源超分、规格推荐、潮汐混部等方式，提升集群的资源利用率。最初，Katalyst 在 kubelet 中引入了一个 QoS Resource Manager（QRM）框架，通过插件化的方式来扩展容器的资源分配策略；当前，Node Resource Interface（NRI）提供了一个原生的替代方案。

本次演讲将介绍 Katalyst 如何通过 NRI 实现精细化且可插拔的容器资源管理，在不对上游组件进行侵入性修改的情况下，提升资源利用率并保证业务的 SLO 不受影响。这种全新的架构使 Katalyst 能够与原生 Kubernetes 无缝集成，提供了一种易于使用和维护的解决方案。

Speakers

Qiang Ren

Software Engineer, Intel

Ren Qiang works as a Cloud Orchestration Software Engineer in SATG, Intel. He mainly focuses on Cloud Native technologies in the runtime. At the same time, he actively participates in open-source projects and is committed to promoting the development of runtime and resource isola...

He Cao

Senior Software Engineer, ByteDance

He Cao is a senior software engineer on the Cloud Native team at ByteDance, a maintainer of Katalyst and KubeZoo, and a member of Istio. He has 5+ years of experience in the cloud native area. Since joining ByteDance, he has designed and implemented several critical systems for VKE...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

13:50 HKT

Boundaryless Computing: Optimizing LLM Performance, Cost, and Efficiency in Multi-Cloud Architecture | 无边界计算：在多云架构中优化LLM性能、成本和效率 - Jian Zhu, Red Hat & Kai Zhang, Alibaba Cloud Intelligence

Wednesday August 21, 2024 13:50 - 14:25 HKT

Level 1 | Hung Hom Room 7

For large language model (LLM) inference, GPU resources within a single data center or cloud region often cannot meet all user demands. Additionally, for the end-users, deploying across multiple geographic regions is necessary to provide an optimal user experience. However, managing model distribution, synchronization, and consistency across multiple regions presents new challenges. To address this, the OCM and Fluid communities have collaborated to automate the multi-region distribution of inference applications through OCM's multi-cluster application deployment capabilities, combined with Fluid's data orchestration capabilities. This automation facilitates the cross-regional distribution and pre-warming of large models, enhancing the efficiency of model deployment and upgrades.

对于大型语言模型（LLM）推理，单个数据中心或云区域内的GPU资源通常无法满足所有用户需求。此外，对于最终用户来说，跨多个地理区域部署是为了提供最佳用户体验。然而，在多个地区管理模型分发、同步和一致性会带来新的挑战。为了解决这个问题，OCM和Fluid社区合作，通过OCM的多集群应用部署能力和Fluid的数据编排能力自动化实现推理应用的多地区分发。这种自动化促进了大型模型的跨地区分发和预热，提高了模型部署和升级的效率。

Speakers

Kai Zhang

Senior Staff Engineer, Alibaba

Kai Zhang is a Senior Staff Engineer at Alibaba Cloud Intelligence, where he has been part of the team developing the Alibaba Cloud container service for Kubernetes (ACK) for over 6 years. He currently leads ACK’s Cloud native AI product and solution offerings. Before this, he spent...

Jian Zhu

Senior Software Engineer, Red Hat

Zhu Jian is a senior software engineer at Red Hat and a core contributor to the Open Cluster Management project. Jian enjoys solving multi-cluster workload distribution problems and extending OCM with add-ons.

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

Enhancing Cyber Resilience Through Zero Trust Chaos Experiments in Cloud Native Environments | 通过在云原生环境中进行零信任混沌实验来增强网络安全弹性 - Sayan Mondal, Harness & Rafik Harabi, Sysdig

Wednesday August 21, 2024 13:50 - 14:25 HKT

Level 2 | Grand Ballroom 1-2

Cyber-attacks against cloud-native infrastructure are increasing in frequency and sophistication. The complexity of modern cloud-native systems and the speed at which technology is developing have outpaced cloud security solutions. On the flip side, cyber-criminals are taking advantage of these developments to launch successful cloud attacks. This session delves into the paradigm of Zero Trust Chaos Experiments, exploring how intentional disruptions and simulated cyber threats can uncover vulnerabilities and enhance cyber resilience. Through practical insights, we will illustrate the transformative impact of Zero Trust Chaos Experiments on organizations' ability to detect and mitigate cyber incidents. By the end of the session, participants will be equipped with actionable strategies and a better understanding of how Zero Trust Chaos Experiments can elevate cyber resilience in cloud-native environments.

针对云原生基础设施的网络攻击频率和复杂性正在增加。现代云原生系统的复杂性和技术发展速度已经超过了云安全解决方案。与此同时，网络犯罪分子正在利用这些发展来发动成功的云攻击。本场演讲将深入探讨零信任混沌实验的范式，探讨有意的干扰和模拟网络威胁如何揭示漏洞并增强网络安全弹性。通过实用的见解，我们将阐明零信任混沌实验对组织检测和缓解网络事件能力的转变影响。在会议结束时，参与者将掌握可操作的策略，并更好地了解零信任混沌实验如何提升云原生环境中的网络安全弹性。

Speakers

Rafik Harabi

Senior Solutions Architect, Sysdig

Rafik has more than 15 years of tech and internet industry experience. Currently, he is a Senior Solution Architect devoted to helping customers secure their cloud native platforms and applications. Before joining Sysdig, he was responsible for executing go-to cloud programmes in...

Sayan Mondal

Senior Software Engineer 2, Harness

Sayan Mondal is a Senior Software Engineer II at Harness, building their Chaos Engineering platform and helping them shape the customer experience market. He's the maintainer of a few open-source libraries and is also a maintainer of LitmusChaos (the Incubating CNCF project). Sayan's...

KubeCon + CloudNativeCon Sessions, Security

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

Kubespray Unleashed: Navigating Bare Metal Services in Kubernetes for LLM and RAG | Kubespray大放异彩：在Kubernetes中为LLM和RAG部署裸金属服务 - Kay Yan, DaoCloud & Alan Leung, Equinix

Wednesday August 21, 2024 13:50 - 14:25 HKT

Level 1 | Hung Hom Room 2

Kubespray, popular within the SIG-Cluster-Lifecycle of Kubernetes, is celebrated for deploying production-ready Kubernetes clusters, particularly on bare metal, which boosts performance for AI workloads like LLM and RAG. This session will explore using Kubespray in bare metal settings, addressing challenges, and sharing best practices. The first part of the talk will show Kubespray's key features and provide practical tips. The latter half will focus on swiftly deploying AI using Retrieval-Augmented Generation (RAG), demonstrating how Kubespray facilitates setting up Kubernetes clusters on bare metal. This setup enhances AI applications by integrating continuous knowledge updates and domain-specific information via RAG, improving the accuracy and credibility of the AI systems. The session will conclude with discussions on community engagement and future advancements, followed by a Q&A period to address participant queries.

Kubespray在Kubernetes的SIG-Cluster-Lifecycle中备受推崇，以在裸金属上部署可用于生产的Kubernetes集群而闻名，特别是对于像LLM和RAG这样的AI工作负载，可以提高性能。本场演讲将探讨在裸金属环境中使用Kubespray，解决挑战，并分享最佳实践。演讲的第一部分将展示Kubespray的关键特性并提供实用技巧。后半部分将重点介绍如何使用检索增强生成（RAG）快速部署AI，演示Kubespray如何在裸金属上设置Kubernetes集群。通过RAG集成持续的知识更新和领域特定信息，这种设置可以提升AI应用程序的性能，提高AI系统的准确性和可信度。本场演讲将以社区参与和未来发展的讨论结束，随后进行问答环节以解答参与者的疑问。

Speakers

Kay Yan

Principal Software Engineer, DaoCloud

Kay Yan is a maintainer of Kubespray and of containerd/nerdctl. He is a Principal Software Engineer at DaoCloud and has been developing the DaoCloud Enterprise Kubernetes Platform since 2016.

Alan Leung

Digital Technical Specialist, Equinix

Alan is the Digital Technical Specialist at Equinix with focus on enabling customers, prospects and partners to develop innovative solutions to solve business challenges at the digital edge.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

14:40 HKT

Best Practice: Karmada & Istio Improve Workload & Traffic Resilience of Production Distributed Cloud | 最佳实践：Karmada和Istio提高生产分布式云的工作负载和流量弹性 - Chaomeng Zhang, Huawei

Wednesday August 21, 2024 14:40 - 15:15 HKT

Level 2 | Grand Ballroom 1-2

Distributed cloud offers better resilience by providing redundancy, scalability, and flexibility, especially for cloud native applications. However, the complexity of multi-cluster workload and traffic management in hybrid or multi-cloud environments brings huge challenges in practice. For example, when unhealthy instances are isolated after a failure, the overall number of multi-cluster workload instances serving customer requests decreases. In this talk, Chaomeng introduces a production practice in which Karmada and Istio work together to improve the resilience of multi-cluster applications. It covers how Karmada and Istio policies configured in a centralized control plane automatically control both replica and traffic distribution across clusters. It also covers how, in case of failures, Istio’s failover removes unhealthy endpoints from the global load balancing pool and how Karmada rebuilds the corresponding number of instances in other healthy clusters, ensuring that multi-cluster instances always meet the designed capacity.

分布式云通过提供冗余、可伸缩性和灵活性，特别是对于云原生应用程序，提供了更好的弹性。然而，在混合或多云环境中的多集群工作负载和流量管理的复杂性在实践中带来了巨大挑战，例如当一些不健康的实例在故障情况下被隔离时，为客户请求提供服务的整体多集群工作负载实例数量减少。在这次演讲中，Chaomeng介绍了Karmada和Istio共同推动多集群应用程序弹性的生产实践。Karmada和Istio策略如何在集中控制平面中配置，自动控制跨集群的副本和流量分发。在发生故障时，Istio的故障转移如何从全局负载均衡池中移除不健康的端点，以及Karmada如何在其他健康集群中重新构建相应数量的实例，确保多集群实例始终满足容量设计。
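The capacity-preserving step described in the abstract above can be pictured with a small arithmetic sketch: replicas lost with an unhealthy cluster are re-created on the healthy ones so the total stays at the designed capacity. This is only an illustration under stated assumptions, not Karmada's actual rescheduling algorithm; the function `rebalance`, the cluster names, and the even-split rule are all hypothetical.

```python
def rebalance(replicas, healthy):
    """Return a new replica map with the same total, placed only on healthy clusters.

    replicas: cluster name -> current replica count
    healthy:  cluster name -> True if the cluster passed its health check
    """
    survivors = {c: n for c, n in replicas.items() if healthy.get(c, False)}
    if not survivors:
        raise ValueError("no healthy cluster left")
    # Replicas that were running on clusters now considered unhealthy.
    lost = sum(replicas.values()) - sum(survivors.values())
    names = sorted(survivors)
    # Hypothetical policy: split the lost replicas as evenly as possible.
    share, extra = divmod(lost, len(names))
    for i, c in enumerate(names):
        survivors[c] += share + (1 if i < extra else 0)
    return survivors


if __name__ == "__main__":
    before = {"east": 4, "west": 4, "north": 2}
    after = rebalance(before, {"east": True, "west": False, "north": True})
    print(after)
```

A real deployment would weight the split by each cluster's free capacity and by the propagation policy in force; the even split here only keeps the example short.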

Speakers

Chaomeng Zhang

Architect of UCS (HUAWEI Distributed Cloud Native), Huawei

Zhang Chaomeng is the architect of UCS (HUAWEI Distributed Cloud Native) and has 9 years of cloud computing design and development experience at HUAWEI Cloud, including service mesh, Kubernetes, microservices, cloud service catalog, big data, APM, cloud computing reliability and...

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience | 连接点：走向统一的多集群AI/ML体验 - Qing Hao, Red Hat & Chen Yu, Microsoft

Wednesday August 21, 2024 14:40 - 15:15 HKT

Level 1 | Hung Hom Room 7

Today, cloud-native infrastructure is vital for AI/ML, and administrative complexity and the growing demand for compute resources drive developers towards multi-cluster patterns. Batch scheduling projects like Kueue are valuable for efficient AI/ML training in a single Kubernetes cluster. Multi-cluster management platforms like OCM and Fleet simplify cluster management and provide advanced scheduling features. We hope to combine the best of both worlds to simplify user operations and reduce confusion between different systems. In this talk, we will showcase how SIG Multi-Cluster's newly proposed ClusterProfile API, combined with OCM, Fleet, and Kueue, addresses these challenges. We will demonstrate that MultiKueue setup can be easily automated with the help of the ClusterProfile API; with a few tweaks, users can use OCM and Fleet's advanced scheduling features through MultiKueue to place AI/ML jobs smartly across clusters, maximizing utilization of resources such as GPUs and saving costs.

今天，云原生基础设施对于人工智能/机器学习、管理复杂性以及对计算资源需求不断增长至关重要，这推动开发人员转向多集群模式。像Kueue这样的批处理调度项目对于在单个Kubernetes集群中高效进行人工智能/机器学习训练非常有价值。OCM和Fleet等多集群管理平台简化了集群管理，并提供了高级调度功能。我们希望将两者的优势结合起来，简化用户操作，减少不同系统之间的混乱。在本次演讲中，我们将展示如何借助Sig Multi-Cluster最新提出的API - ClusterProfile，结合OCM、Fleet和Kueue来解决这些挑战。我们将演示如何通过ClusterProfile API轻松自动化MultiKueue设置；通过一些调整，用户可以利用OCM和Fleet的高级调度功能，通过MultiKueue智能地在集群之间放置人工智能/机器学习作业，以最大化资源利用率，如GPU，以节省成本。

Speakers

Qing Hao

Senior Software Engineer, Red Hat

Qing Hao is a senior software engineer at Red Hat, where she works as a maintainer of Open Cluster Management. Qing is interested in solving complex problems in the multi-cluster area, e.g., application scheduling and rolling upgrades of management components. Prior to Red Hat, she worked...

Chen Yu

Senior Software Engineer, Microsoft

Chen Yu is a senior software engineer at Microsoft with a keen interest in cloud-native computing. He is currently working on Multi-Cluster Kubernetes and contributing to the Fleet project open-sourced by Azure Kubernetes Service.

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

Scaling Kubernetes: Best Practices for Managing Large-Scale Batch Jobs with Spark and Argo Workflow | 扩展Kubernetes：管理大规模批处理作业的最佳实践与Spark和Argo工作流 - Yu Zhuang & Liu Jiaxu, Alibaba Cloud

Wednesday August 21, 2024 14:40 - 15:15 HKT

Level 1 | Hung Hom Room 2

Are you managing large-scale batch jobs on Kubernetes, like data processing with Spark applications or genomics computing with Argo workflows? To complete these jobs promptly, a significant number of pods have to be scaled out and in quickly for parallel computation, which puts heavy pressure on the Kubernetes control plane. In this talk, we will use Spark and Argo workflows as examples to show you how to build a Kubernetes cluster that supports frequently creating and deleting 20,000 pods. Our focus will be on tuning the Kubernetes control plane, including optimizing the list-watch mechanism, service broadcasting, environment variable attachments, and API server configurations. Additionally, we'll share some best practices for configuring the Spark operator and the Argo Workflows controller.

您是否正在Kubernetes上管理大规模的批处理作业，比如使用Spark应用程序进行数据处理或使用Argo工作流进行基因组计算？为了及时完成这些作业，需要快速地扩展/缩减大量的Pod以进行并行计算，这给Kubernetes控制平面带来了巨大压力。在本次演讲中，我们将以Spark和Argo工作流为例，指导您如何构建一个支持频繁创建/删除20000个Pod的Kubernetes集群。我们将重点放在调优Kubernetes控制平面上，包括优化列表-观察机制、服务广播、环境变量附加、API服务器配置等。此外，我们还将分享一些配置Spark操作员和Argo工作流控制器的最佳实践。

Speakers

Liu Jiaxu

Senior Engineer, Alibaba Cloud

Jiaxu Liu is a Senior Engineer on the Container Service Team at Alibaba Cloud. He specializes in observability enhancement and large-scale cluster management and optimization for Alibaba Cloud's container service offerings. Before joining Alibaba Cloud, he worked at Nokia as a Senior...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

15:35 HKT

How Fast Can Your Model Composition Run in Serverless Inference? | 您的模型组合在无服务器推理中可以运行多快？ - Fog Dong, BentoML & Wenbo Qi, Ant Group

Wednesday August 21, 2024 15:35 - 16:10 HKT

Level 1 | Hung Hom Room 7

Are you struggling with slow deployment times, high operational costs, or scalability issues when serving your ML models? Now, imagine the added complexity when typical AI apps require not just one, but an interconnected suite of models. In this session, discover how the integration of BentoML with Dragonfly effectively addresses these challenges, transforming the landscape of multi-model composition and inference within serverless Kubernetes environments. Join the co-presentation by the BentoML and Dragonfly communities to explore a compelling case study: a RAG app that combines 3 models: LLM, embedding, and OCR. Learn how our framework not only packages these diverse models efficiently but also utilizes Dragonfly's innovative P2P network for swift distribution. We'll further delve into how other open-source technologies like JuiceFS and vLLM have enabled us to achieve remarkable deployment times of just 40 seconds and establish a scalable blueprint for multi-model composition deployments.

您是否在为机器学习模型的部署时间慢、运营成本高或可扩展性问题而苦恼？现在，想象一下当典型的人工智能应用程序不仅需要一个模型，而是一个相互连接的模型套件时所增加的复杂性。在本场演讲中，了解BentoML与Dragonfly的集成如何有效解决这些挑战，改变了无服务器Kubernetes环境中多模型组合和推理的格局。加入BentoML和Dragonfly社区的联合演示，探索一个引人注目的案例研究：一个结合了LLM、嵌入和OCR三个模型的RAG应用程序。了解我们的框架不仅高效打包这些多样化的模型，还利用Dragonfly创新的P2P网络进行快速分发。我们还将深入探讨其他开源技术，如JuiceFS和VLLM，如何帮助我们实现仅需40秒的部署时间，并为多模型组合部署建立可扩展的蓝图。

Speakers

Wenbo Qi

Senior Software Engineer, Ant Group

Wenbo Qi is a software engineer at Ant Group working on Dragonfly, and he is a maintainer of the project. He hopes to make positive contributions to open source software and believes that fear springs from ignorance.

Fog Dong

Senior Software Engineer, BentoML

Fog Dong, a Senior Engineer at BentoML, KubeVela maintainer, CNCF Ambassador, and LFAPAC Evangelist, has a rich background in cloud native. Previously instrumental in developing Alibaba's large-scale Serverless Application Engine workflows and Bytedance's cloud-native CI/CD platform...

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 高级 (Advanced)
Language | 语言 中文 (Chinese)

Implementing Seamless Connectivity and Service Governance in Multi Kubernetes Cluster with ZTM | 在多个Kubernetes集群中使用ZTM实现无缝连接和服务治理 - Xiaohui Zhang, Flomesh

Wednesday August 21, 2024 15:35 - 16:10 HKT

Level 1 | Hung Hom Room 1

In the evolving cloud-native ecosystem, Kubernetes is vital for microservices. As enterprises adopt multi-cluster Kubernetes setups, securely managing cross-cluster communications becomes challenging due to the limitations of traditional gateways and Ingress solutions. This session explores how ZTM (Zero Trusted Mesh) acts as a bridge across K8s clusters, bypassing traditional gateways and network constraints, thus ensuring zero exposure and boosting security. ZTM uses an HTTP/2-based tunneling mechanism with end-to-end encryption, minimizing public exposure and securing data during transmission. Its design enables quick deployment of cross-cluster communications without altering existing networks or applications, easing management. Furthermore, ZTM integrates with service mesh technologies to provide a secure framework for microservices, supporting service discovery, load balancing, and advanced routing policies, allowing flexible and secure cross-cluster service management.

在不断发展的云原生生态系统中，Kubernetes 对于微服务至关重要。随着企业采用多集群 Kubernetes 设置，由于传统网关和入口解决方案的限制，安全地管理跨集群通信变得具有挑战性。本场演讲探讨了 ZTM（Zero Trusted Mesh）如何作为跨 K8s 集群的桥梁，绕过传统网关和网络限制，从而确保零暴露并提升安全性。 ZTM 使用基于 HTTP/2 的隧道机制进行端到端加密，最大程度减少公开暴露并在传输过程中保护数据安全。其设计能够快速部署跨集群通信，而无需改变现有网络或应用程序，简化管理。此外，ZTM 还与服务网格技术集成，为微服务提供安全框架，支持服务发现、负载均衡和高级路由策略，实现灵活且安全的跨集群服务管理。

Speakers

AddoZhang

Cloud Native Architect, Flomesh

Senior programmer, LFAPAC open source evangelist, CNCF Ambassador, Microsoft MVP, and author of the WeChat public account "云原生指北". He has years of practical experience in microservices and cloud native; his main work involves microservices, containers, Kubernetes, DevOps, etc.

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 中文 (Chinese)

15:35 HKT

Strengthening Container Security: A Collaborative Journey | 加强容器安全性：共同的旅程 - Yi Zha, Microsoft & Beltran Rueda Borrego, VMware (part of Broadcom)

Wednesday August 21, 2024 15:35 - 16:10 HKT

Level 2 | Grand Ballroom 1-2

Ensuring the integrity and authenticity of container images is critical in securing the container supply chain. As developers increasingly use images from external sources, questions arise: How can we verify that these images originate from trusted vendors? How do we guarantee they have not been altered since their creation? In this session, you will learn from the real-world experience of VMware Bitnami, who partnered with the Notary Project community to implement image signing and verification. Bitnami will show you how they use Notary Project signatures to ensure the integrity and authenticity of images from Docker Hub. Don't miss this opportunity to gain practical insights into container security with Notary Project within your CI/CD pipelines and during Kubernetes deployments! Additionally, we’ll explore future enhancements, including attestation support, empowering users to verify images from various perspectives such as provenance, vulnerability assessment, and software compliance.

确保容器镜像的完整性和真实性对于保护容器供应链至关重要。随着开发人员越来越多地使用来自外部来源的镜像，一些问题浮出水面：我们如何验证这些镜像来自可信赖的供应商？我们如何确保它们自创建以来没有被篡改？在这场演讲中，您将从VMware Bitnami的实际经验中学习，他们与Notary项目社区合作实施了镜像签名和验证。Bitnami将向您展示他们如何使用Notary项目签名来确保来自Docker Hub的镜像的完整性和真实性。不要错过这个机会，在您的CI/CD流水线和Kubernetes部署中通过Notary项目获得容器安全的实用见解！此外，我们将探讨未来的增强功能，包括证明支持，使用户能够从各种角度验证镜像，如来源、漏洞评估和软件合规性。

Speakers

Yi Zha

Senior Product Manager, Microsoft

Yi is a senior product manager on the Azure Container Upstream team at Microsoft and is responsible for container supply chain security for Azure services and customers. He is also a maintainer of the CNCF project Notary, and a contributor to CNCF ORAS and the OSS project Ratify.

KubeCon + CloudNativeCon Sessions, Security

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

15:35 HKT

Tackling Operational Time-to-Market Decelerators in AI/ML Projects | 应对人工智能/机器学习项目中的运营时间市场减速器 - Adrian Matei & Andreea Munteanu, Canonical

Wednesday August 21, 2024 15:35 - 16:10 HKT

Level 1 | Hung Hom Room 2

In the competitive AI market, Time To Market (TTM) is crucial for success. Ensuring secure, scalable, and compliant ML infrastructures often slows TTM due to the complexities of updates, patches, monitoring, and security enforcement. This leads to decreases in ROI, profitability, reproducibility, and competitive edge. To address this, companies can engage Managed Service Providers (MSPs) to offload operational burdens and focus on innovation, yet selecting the right MSP requires consideration of expertise, automation capabilities, and compliance adherence. This presentation explores the AI operational landscape, highlighting indicators and challenges in MSP collaboration. We will focus on the management of open source tools like Kubeflow and MLflow across hybrid and multicloud environments. By understanding operational excellence in AI and available options to achieve it, attendees will gain insights into choosing an approach that aligns with their greater objectives.

在竞争激烈的人工智能市场中，上市时间对于成功至关重要。确保安全、可扩展和合规的机器学习基础设施通常会因更新、补丁、监控和安全执行的复杂性而减慢上市时间，导致投资回报率、盈利能力、可复制性和竞争优势下降。为了解决这个问题，公司可以与托管服务提供商（MSPs）合作，减轻运营负担，专注于创新，但选择合适的MSP需要考虑专业知识、自动化能力和合规性。本次演讲探讨了人工智能运营领域，重点介绍了MSP合作中的指标和挑战。我们将重点关注在混合和多云环境中管理开源工具如Kubeflow和MLflow。通过了解人工智能运营卓越性以及实现卓越性的可用选项，与会者将获得选择与其更大目标一致的方法的见解。

Speakers

Andreea Munteanu

AI Product Manager, Canonical

Andreea Munteanu is a Product Manager at Canonical, leading the MLOps area. With a background in Data Science in various industries, she used AI techniques to enable enterprises to benefit from their initiatives and make data-driven decisions. Nowadays, Andreea is looking to help...

Adrian Matei

Product Manager, Canonical

With a degree in Information Management for Business, Adrian is now guiding Canonical’s open-source operational management toolset as Product Manager. He has been working in open source operations for the past two years, having previously accumulated experience in technology consulting...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

16:25 HKT

Istio and Modern API Gateways: Navigating the Future of Service Meshes | Istio和现代API网关：引领服务网格的未来 - Jimmy Song & Jianpeng He, Tetrate; Jiaqi Zhang, Alibaba Cloud; Jintao Zhang, Kong Inc.; Xunzhuo Liu, Tencent

Wednesday August 21, 2024 16:25 - 17:00 HKT

Level 1 | Hung Hom Room 1

Join our esteemed panel of experts as they delve into the latest advancements and integrations in the world of Istio and API gateways. This discussion, led by Jimmy Song of Tetrate, founder of the China Cloud Native Community, will feature insights from core contributors and thought leaders including Jianpeng He (Tetrate), Jintao Zhang (Kong), Xunzhuo Liu (Tencent), and Jiaqi Zhang (Alibaba Cloud). The panel will explore Istio's recent developments such as Ambient Mesh, sidecar-less architectures, and the application of eBPF, along with the evolving role of Envoy Gateway. Participants will gain an in-depth understanding of how API gateways are blending with service meshes to create more dynamic, efficient, and secure cloud-native environments.

加入我们尊贵的专家小组，他们将深入探讨 Istio 和 API 网关领域的最新进展和集成。这次讨论由 Tetrate 的 Jimmy Song 主持，他是中国云原生社区的创始人，将邀请核心贡献者和思想领袖，包括 Jianpeng He（Tetrate）、Jintao Zhang（Kong）、Xunzhuo Liu（腾讯）和张佳琦（阿里云）分享见解。小组将探讨 Istio 的最新发展，如环境网格、无边车架构以及 eBPF 的应用，以及 Envoy 网关的不断演变角色。参与者将深入了解 API 网关如何与服务网格融合，创造更具动态、高效和安全的云原生环境。

Speakers

Jintao Zhang

Sr. SE, Kong

Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes Ingress-NGINX maintainer. He specializes in cloud-native technology and the Azure technology stack. He works at Kong Inc.

Jimmy Song

Developer Advocate, Tetrate

Jimmy Song is a developer advocate at Tetrate, a CNCF Ambassador, and the founder of the Cloud Native Community. He is an outstanding translator, author, and producer of PHEI, and an early adopter and evangelist of Kubernetes and Istio. Previously, he worked at iFlytek, TalkingData, and Ant Group.

Xunzhuo

Software Engineer, Tencent

Xunzhuo Liu is a Software Engineer on the Tencent Kubernetes Engine team. He is an open source enthusiast, focusing on API Gateway, Service Mesh, and Kubernetes Networking. He is a steering committee member and core maintainer of Envoy Gateway, also maintaining a couple of CNCF projects...

Jianpeng He

Software Engineer, Tetrate

Jianpeng is a core maintainer of Istio and co-leader of the Extensions and Telemetry working group. He has been working on Istio for almost 3 years and is also a maintainer of Envoy Gateway.

Jiaqi Zhang

Software Engineer, Alibaba Cloud

Zhang Jiaqi is a software engineer working on Alibaba Cloud Service Mesh, focusing on traffic management and telemetry, after graduating from the School of Computer Science, Peking University. Participated in several academic computer software conferences, and keen...

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

16:25 HKT

Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training | 利用拓扑建模和拓扑感知调度加速LLM训练 - Yang Wang, Huawei

Wednesday August 21, 2024 16:25 - 17:00 HKT

Level 1 | Hung Hom Room 7

In the era of LLM training and inference, the bottleneck has shifted from compute to the network. High-throughput, low-latency interconnect technologies such as NVLink and NVSwitch are widely used to build hypercomputers such as NVIDIA SuperPOD, Google multislice, and AWS placement groups. However, Kubernetes has not yet addressed topology awareness efficiently, which results in poor performance when sub-optimal resources are provisioned. This talk will explore inter-node communication and the interconnect between resources within a node, and analyze how these two topological factors affect the runtime performance of AI workloads, especially large language model training. The talk will cover:
- How to model the topology of underlying resources such as NUMA, rack, super pod, and hypercomputer
- How to make the scheduler topology-aware so that it makes the best scheduling decisions
- How to coordinate topology-aware scheduling with DRA on the node

在LLM训练和推断时代，瓶颈已经从计算转变为网络。许多高吞吐量和低延迟的互连技术被广泛使用，例如nvlink、nvswitch用于构建超级计算机，如nvidia超级Pod、谷歌多片、AWS放置组。然而，Kubernetes尚未有效地解决拓扑意识问题，导致在资源配置不佳时性能较低。本次演讲将探讨节点间通信和节点内部资源的互连。还将分析这两个拓扑因素如何影响AI工作负载的运行性能，特别是对于大型语言模型训练。演讲内容包括： - 如何对底层资源（如NUMA、机架、超级计算机）建模拓扑 - 如何使调度程序意识到拓扑并进行最佳调度 - 如何协调拓扑感知调度与节点上的DRA

Speakers

Yang Wang

Senior engineer and maintainer of Volcano, Huawei Cloud Technologies Co., LTD

Volcano maintainer and speaker at KCD and GOTC. Focuses on cloud native scheduling and multi-cluster management.

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

16:25 HKT

Staying Ahead of Fast-Moving Attackers | 保持领先于快速移动的攻击者 - Aizhamal Nurmamat kyzy, Sysdig

Wednesday August 21, 2024 16:25 - 17:00 HKT

Level 2 | Grand Ballroom 1-2

How do we find the right balance between convenience, operational efficiency, and a strong security policy in a world of ephemeral containers? And how can we ensure security at a time when Advanced Persistent Threats (APTs) are more prevalent? In this talk we will present the latest Cloud Native Security & Usage Report findings on critical vulnerabilities inherent in today’s container security practices. We will also demonstrate how a compromised, short-lived container can be an insidious security risk, and what we can do to detect and mitigate those risks in real time using cloud native open source tools.

在一个短暂容器世界中，如何在便利性、运营效率和强大安全政策之间找到合适的平衡？在APT（高级持续性威胁）更加普遍的时代，我们如何确保安全？在这次演讲中，我们将介绍最新的云原生安全和使用报告发现，揭示当今容器安全实践中存在的关键漏洞。我们还将演示一个被入侵的短暂容器如何成为一个隐蔽的安全风险，以及我们如何使用云原生开源工具实时检测和减轻这些风险。

Speakers

Aizhamal Nurmamat kyzy

Director, DevRel, Sysdig

Aizhamal is a Director of DevRel at Sysdig where she focuses on education around security and open source. Previously she worked at Google's OSPO where she helped build open source communities in cloud native and data analytics ecosystems.

KubeCon + CloudNativeCon Sessions, Security

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 英语 (English)

16:25 HKT

Unleashing the Power of Cluster API: Extensibility and Customization | 释放Cluster API的力量：可扩展性和定制化 - Zain Malik, CityStorageSystems & Nibir Bora, Startup

Wednesday August 21, 2024 16:25 - 17:00 HKT

Level 1 | Hung Hom Room 2

Cluster API, designed with extensibility at its core, has revolutionized Kubernetes cluster management. Its open and pluggable architecture empowers providers to implement custom solutions tailored to their unique requirements. In this session, we will explore how Cluster API's extension-by-design philosophy has opened new horizons for organizations seeking to create bespoke Kubernetes clusters. Managing Kubernetes clusters at scale presents unique operational challenges that cannot be tamed with manual operations. Through real-world examples and lessons learned, we will demonstrate how Cluster API's flexibility allows for the integration of diverse infrastructure providers and the implementation of organization-specific customizations. Attendees will gain insights into best practices for extending Cluster API, including developing custom controllers, integrating third-party tools, and creating bespoke workflows.

Cluster API是以可扩展性为核心设计的，已经彻底改变了Kubernetes集群管理。其开放和可插拔的架构赋予提供者实施定制解决方案的能力，以满足其独特需求。在本场演讲中，我们将探讨Cluster API的“通过设计进行扩展”的理念如何为寻求创建定制化Kubernetes集群的组织开辟了新的视野。在规模化管理Kubernetes集群时，会面临无法通过手动操作解决的独特运营挑战。通过现实世界的例子和经验教训，我们将演示Cluster API的灵活性如何允许集成各种基础设施提供者，并实施组织特定的定制化。与会者将获得有关扩展Cluster API的最佳实践的见解，包括开发自定义控制器、集成第三方工具和创建定制工作流程。

Speakers

Zain Malik

Staff Software Engineer, CityStorageSystems

Zain Malik serves as a tech lead in the compute team for a startup, where he has significantly contributed to projects related to cost saving and reliability and helped mature cluster lifecycle management. Before this role, Zain was a product owner and staff software engineer in the...

Nibir Bora

Engineering Manager, Startup

Nibir is an Engineering Manager in charge of Core Infrastructure at a Stealth Startup, where he is responsible for the company's Kubernetes infrastructure running hundreds of clusters globally.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 高级 (Advanced)
Language | 语言 英语 (English)

17:15 HKT

How to Manage Database Clusters Without a Dedicated Operator | 如何在没有专门Operator的情况下管理数据库集群 - Shanshan Ying, ApeCloud & Shun Ding, China Mobile Cloud

Wednesday August 21, 2024 17:15 - 17:50 HKT

Level 1 | Hung Hom Room 2

As Kubernetes becomes integral to cloud-native environments, more organizations are deploying database services on K8S, facing significant challenges. Integrating new database engines typically requires developing a dedicated Kubernetes operator that manages not only resource provisioning but also essential maintenance tasks like high availability, backup & restore, and configuration management. This session introduces a universal operator framework that supports various database engines, enabling rapid, minimal-code integration. We will present a case study from China Mobile Cloud on integrating a new cloud-native database engine into K8S using this framework, achieved with minimal coding and reduced time investment, bypassing the extensive Golang coding usually required for developing a dedicated operator.

随着Kubernetes成为云原生环境中不可或缺的一部分，越来越多的组织在K8S上部署数据库服务，面临着重大挑战。集成新的数据库引擎通常需要开发一个专门的Kubernetes operator，管理资源提供以及高可用性、备份和恢复、配置管理等重要维护任务。本场演讲将介绍一个支持各种数据库引擎的通用operator框架，实现快速、最小代码集成。我们将从中国移动云的一个案例研究中介绍如何使用这个框架将新的云原生数据库引擎集成到K8S中，通过最小的编码和减少时间投入来实现，避免通常需要开发专门operator所需的大量Golang编码。

Speakers

Shanshan Ying

Maintainer, ApeCloud

Shanshan is currently a maintainer of KubeBlocks by ApeCloud. Before joining ApeCloud, she worked in the Aliyun Database Group for years. She received her PhD from the National University of Singapore.

Shun Ding

Senior Systems Architect, China Mobile Cloud

Shun is a Senior Systems Architect at China Mobile Cloud, leading the design, development, and deployment of a next-generation, Kubernetes-based, large-scale database management service. With over a decade of experience in cloud computing and database technologies, Shun has extensive expertise...

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 英语 (English)

17:15 HKT

Leveraging Wasm for Portable AI Inference Across GPUs, CPUs, OS & Cloud-Native Environments | 利用Wasm在GPU、CPU、操作系统和云原生环境中进行可移植的AI推理 - Miley Fu & Hung-Ying Tai, Second State

Wednesday August 21, 2024 17:15 - 17:50 HKT

Level 1 | Hung Hom Room 7

This talk will focus on the advantages of using WebAssembly (Wasm) for running AI inference tasks in a cloud-native ecosystem. We will explore how Wasm empowers developers to develop on their own PC and have their AI inference performed uniformly across different hardware (including GPUs and CPUs), operating systems, edge clouds, and more. We'll discuss how Wasm and Wasm runtimes facilitate seamless integration into cloud-native frameworks, enhancing the deployment and scalability of AI applications. This presentation will specifically highlight how Wasm provides a flexible, efficient solution suitable for diverse cloud-native architectures, including Kubernetes, to allow developers to fully tap the potential of LLMs, especially open source LLMs. The session offers insights into maximizing the potential of AI applications by leveraging the cross-platform capabilities of Wasm, ensuring consistency, low cost, and efficiency in AI inference across different computing environments.

本次演讲将重点介绍在云原生生态中运行AI推理任务时使用WebAssembly（Wasm）的优势。我们将探讨如何使用Wasm使开发者能够在自己的个人电脑上开发，并在不同硬件（包括GPU和CPU）、操作系统、边缘云等上统一执行他们的AI推理。我们将讨论Wasm和Wasm运行时如何实现无缝集成到云原生框架中，增强AI应用程序的部署和可扩展性。本次演示将重点展示Wasm如何提供灵活、高效的解决方案，适用于各种云原生架构，包括Kubernetes，以帮助开发者充分发挥大语言模型的潜力，特别是开源大语言模型。将深入探讨通过利用Wasm的跨平台能力来最大限度地发挥AI应用的潜力，确保在不同计算环境中实现AI推理的一致性、低成本和高效性。

Speakers

Hung-Ying Tai

Software Engineer, Second State

Hung-Ying is a maintainer of the WasmEdge project and a pioneer in compiler optimization and virtual machine design. He is a prolific open-source contributor, participating in many open-source projects, including go-ethereum, solidity, SOLL, crun, and WasmEdge.

Miley Fu

CNCF Ambassador, Founding member at WasmEdge, Second State Inc

Miley is a Developer Advocate with a passion for empowering devs to build and contribute to open source. With over 5 years of experience working on the WasmEdge runtime in the CNCF sandbox as a founding member, she has spoken at KubeCon, KCD Shenzhen, CloudDay Italy, DevRelCon, Open Source...

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

17:15 HKT

Multi-Cluster Networking and Service Discovery Leveraging NRI | 利用NRI的多集群网络和服务发现 - LingMing Xia, Purple Mountain Laboratories & Di Xu, Xiaohongshu

Wednesday August 21, 2024 17:15 - 17:50 HKT

Level 1 | Hung Hom Room 1

Connectivity and service discovery are usually key challenges in multi-cluster management. Existing solutions such as Submariner introduce preconditions such as public IPs and specific CNIs. This is problematic for projects like the "East-to-West Computing Resource Transfer Project", where clusters lack public IPs and have diverse CNIs due to different ownership. This session introduces a solution based on the Node Resource Interface (NRI) that establishes an independent and unified parallel network for east-west traffic across clusters, avoiding intrusive modifications to clusters and limitations on the CNI. A hybrid approach is provided for inter-cluster traffic: clusters can communicate through a hub cluster with a public IP, or connect directly if they have public IPs. Moreover, cross-cluster service discovery follows the MCS standard to ensure seamless service access. All functionality remains agnostic to Kubernetes and applications. A live demo will be shown in this session.

连接和服务发现通常是多集群管理的关键挑战，现有解决方案如Submariner引入了公共IP和特定CNI的先决条件。这对于像“东西计算资源转移项目”这样的项目是有问题的，因为集群缺乏公共IP并且由于不同所有权而具有不同的CNI。本场演讲介绍了一种解决方案，基于节点资源接口（NRI）建立一个独立和统一的跨集群东西流量网络，以避免对集群进行侵入性修改和对CNI的限制。提供了一种混合方法用于集群间流量：集群可以通过具有公共IP的中心集群进行通信，或者如果具有公共IP则可以直接连接。此外，跨集群服务发现遵循MCS标准，以确保无缝的服务访问。所有功能都与Kubernetes和应用程序无关。本场演讲将展示现场演示。

Speakers

Di Xu

Principal Software Engineer, Xiaohongshu

Currently, he serves as a Tech Lead at Xiaohongshu, where he leads a team focused on building a highly reliable and scalable container platform. He is the founder of the CNCF Sandbox project Clusternet. He is also a top 50 code contributor in the Kubernetes community. He has spoken many...

Lingming

Researcher, Purple Mountain Laboratories

I focus on subjects such as cloud-native technologies and distributed clouds. I am currently working as a researcher in the New Computing Architecture Research group at Purple Mountain Laboratories.

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

17:15 HKT

Time Series Database on Kubernetes: Efficient Management of Massive Internet of Vehicles Data | Kubernetes上的时序数据库：高效管理海量物联网车辆数据 - Vicky Lee, Huawei Cloud Computing Technology Co., Ltd.

Wednesday August 21, 2024 17:15 - 17:50 HKT

Level 2 | Grand Ballroom 1-2

Today, more and more car companies are building a new generation of Internet of Vehicles platforms on cloud-native technology stacks such as Kubernetes. However, as more and more cars are produced, they generate hundreds of GB of data every second, which makes it difficult to store massive data in real time and to keep storage costs under control. This requires the platform's underlying database to be low-cost, high-performance, and efficient. openGemini is a cloud-native distributed time series database with high performance and low cost. For data writing, we provide a dedicated high-performance data writing component that supports Arrow Flight. For data storage, we provide specialized data compression algorithms and support both local data storage and object storage. This talk will introduce how to build Internet of Vehicles platforms on cloud-native technology stacks and share technical practices for efficiently managing massive vehicle data.

今天，越来越多的汽车公司正在基于Kubernetes等云原生技术堆栈构建新一代车联网平台。然而，随着汽车的生产越来越多，它们每秒产生数百GB的数据，使得实时存储海量数据变得困难，存储成本难以控制。这就要求平台的底层数据库要低成本、高性能和高效。openGemini是一个具有高性能和低成本的云原生分布式时间序列数据库。在数据写入方面，我们提供了支持Arrow Flight的专用高性能数据写入组件。在数据存储方面，我们提供了专门的数据压缩算法，并支持本地数据存储和对象存储。本次演讲将介绍如何基于云原生技术堆栈构建车联网平台，并分享如何有效管理海量车辆数据的技术实践。

Speakers

Vicky Lee

Engineer, Huawei Cloud Computing Technology Co., Ltd.

Vicky Lee, a Time-series database expert in the HUAWEI CLOUD Database Innovation Lab and the Co-founder of the openGemini community, has been engaged in distributed databases and NoSQL databases as a cloud service for many years. Currently, mainly dedicated to openGemini developm...

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

11:00 HKT

A Story of Managing Kubernetes Watch Events End-to-End Flow in Extremely Large Clusters | 在极大规模集群中管理Kubernetes watch事件端到端流程的故事 - Bo Tang, Ant Group

Thursday August 22, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 2

The K8s watch mechanism has long not been given the attention it deserves. However, it is critical to both the stability and performance of a K8s cluster, and watch latency is a perfect indicator of cluster health. This talk begins by introducing the measurement of watch event latency and then defines watch SLI and SLO metrics. Using the watch SLO as a guide, the talk will show the process of identifying watch bottlenecks. The talk will then describe the watch-related optimizations made to apiserver, etcd, kubelet, controller-runtime, and clients such as controllers and schedulers, covering watch latency, pod provisioning time, bandwidth, CPU/memory usage, and more. With these optimizations, daily P99 watch latency has improved by over 90% in large clusters (~20K nodes), impacting billions of watch events. Pod provisioning time has improved by over 60%. Apiserver bandwidth has decreased by 50%. The overall stability of the K8s cluster has improved greatly.

K8s观察机制长期以来并未得到应有的重视。然而，它对于K8s集群的稳定性和性能至关重要，观察延迟是集群健康的完美指标。本次演讲将首先介绍观察事件延迟的测量，然后定义观察SLI和SLO指标。通过观察SLO作为指导，演讲将展示观察瓶颈识别过程。演讲将描述在观察方面对apiserver、etcd、kubelet、controller-runtime和客户端（如控制器和调度器）进行的各种优化，包括观察延迟、Pod提供时间、带宽、CPU/内存等方面。通过这些优化，大型集群（~20K节点）中每日P99观察延迟已经提高了超过90%，影响了数十亿次观察事件。Pod提供时间已经提高了超过60%。Apiserver带宽减少了50%。K8s集群的整体稳定性得到了极大的改善。

Speakers

Bo Tang

Senior Engineer, Ant Group

Bo Tang is a senior engineer in Ant Group. He is currently working on scalability and performance optimization of Kubernetes clusters.

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

11:00 HKT

Dollars and PPM's - Carbon Emissions and Cloud Spend | 美元和PPM - 碳排放和云支出 - Bryan Oliver, Thoughtworks

Thursday August 22, 2024 11:00 - 11:35 HKT

Level 2 | Grand Ballroom 1-2

Cloud carbon emissions are unfortunately not the priority of most enterprises. Costs, however, are. In the Cloud Native space, there is an ever-growing list of spend tracking and reduction tools. In this talk, we'll discuss several strategies you can adopt to unify the prioritization of cloud costs and carbon impact. We want to show how you can align with your business goal of simultaneously reducing cloud spend and overall carbon emissions.

云计算的碳排放很可惜并不是大多数企业的首要任务。成本，然而，是。在云原生领域，有越来越多的支出跟踪和降低工具。在这次讨论中，我们将讨论几种您可以采用的策略，统一云成本和碳影响的优先级。我们希望展示如何与您同时降低云支出和整体碳排放的业务目标保持一致。

Speakers

Bryan Oliver

Principal, Thoughtworks

Bryan is an experienced engineer and leader who designs and builds complex distributed systems. He has spent his career developing mobile and back-end systems whilst building autonomous teams. More recently he has been focused on delivery and cloud native at Thoughtworks. In his free...

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

11:00 HKT

OpenYurt & Dragonfly: Enhancing Efficient Distribution of LLMs in Cloud-Edge Collaborative Scenarios | OpenYurt和Dragonfly：增强云边协作场景中LLM的高效分发 - Linbo He, Alibaba Cloud & Jim Ma, Ant Group

Thursday August 22, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 1

As LLMs continue to grow in size, their deployment and delivery in cloud-edge environments are faced with substantial challenges, especially within edge computing settings that encompass multiple sites with thousands of edge nodes. In this presentation, we will explore how to efficiently distribute LLM applications across dispersed edge nodes using OpenYurt. We will also delve into how Dragonfly’s P2P image distribution technology can address the issue of public network bandwidth consumption encountered during cross-site transmission, reducing public network traffic consumption by up to 90% compared to conventional LLM distribution, and achieving rapid and efficient sharing of LLMs in physically isolated environments. During this presentation, container service experts from Alibaba Cloud and Ant Group will share this solution and introduce the practical application of combining OpenYurt with Dragonfly in edge computing scenarios for LLMs.

随着LLM的规模不断增长，它们在云边缘环境中的部署和交付面临着重大挑战，特别是在涵盖数千个边缘节点的边缘计算环境中。在本次演讲中，我们将探讨如何使用OpenYurt在分散的边缘节点上高效分发LLM应用程序。我们还将深入探讨Dragonfly的P2P图像分发技术如何解决跨站点传输中遇到的公共网络带宽消耗问题，与传统的LLM分发相比，将公共网络流量消耗降低高达90％，实现在物理隔离环境中LLM的快速高效共享。在本次演示中，来自阿里巴巴云和蚂蚁集团的容器服务专家将分享这一解决方案，并介绍在LLM的边缘计算场景中将OpenYurt与Dragonfly结合应用的实际应用。

Speakers

Jim Ma

Senior Engineer, Ant Group

Kubernetes enthusiast at Ant Group, diving deep into Kubernetes CSI storage, OCI image distribution and maintaining CNCF Dragonfly.

Linbo He

Senior Software Engineer, Alibaba Cloud

I am a member of the Alibaba Cloud Container Service team and one of the founding contributors to the OpenYurt project. Since 2015, I have been actively engaged in the design, development, and open-source initiatives related to Kubernetes. I have taken on responsibilities in a variety...

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

11:00 HKT

The Journey of Next-Gen FinTech IDP at China Merchants Bank | 中国招商银行下一代金融科技IDP之旅 - Jiahang Xu, China Merchants Bank

Thursday August 22, 2024 11:00 - 11:35 HKT

Level 1 | Hung Hom Room 7

Explore the transformative journey of China Merchants Bank (CMB), one of China's largest retail banks, through cloud migration, cloud-native transformation, and platform engineering over the past three years. Despite challenges such as increased complexity in cloud technology and management, and potential risks to developer productivity and the continuous assurance of financial services, CMB successfully leveraged KubeVela, OpenFeature, Envoy, Cilium, and OpenTelemetry to build the Next-Gen FinTech IDP. As a result, 70% of applications were brought under management within a year and the developer experience improved for thousands of R&D engineers. We'll discuss the strategic thinking, 'Golden Path' implementation, struggles, trade-offs, and key success metrics, along with a platform engineering maturity model. This session provides a blueprint and reference architecture for financial organizations undergoing similar transformations.

探索中国招商银行（CMB）作为中国最大的零售银行之一，在过去三年中通过云迁移、云原生转型和平台工程的变革之旅。尽管面临诸如云技术和管理复杂性增加、开发人员生产力和金融服务持续保障的潜在风险等挑战，CMB成功利用KubeVela、OpenFeature、Envoy、Cilium和OpenTelemetry构建了下一代金融科技IDP。这导致了一年内管理了70%的应用程序，并改善了开发人员体验，涵盖了数千名研发工程师。我们将讨论战略思维、“黄金路径”实施、挣扎、权衡和关键成功指标，以及平台工程成熟度模型。本场演讲提供了金融机构进行类似转型的蓝图和参考架构。

Speakers

Jiahang Xu

System Architect, China Merchants Bank

Jiahang Xu is a System Architect at China Merchants Bank. He has over 14 years of unique cross-domain experience spanning telecom, automotive, and the financial industry, and has also been a startup co-founder and a KubeVela maintainer. He's mainly focused on cloud-native application technology and platform...

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 高级 (Advanced)
Language | 语言 英语 (English)

11:50 HKT

Beyond the Basics: Towards Making Thanos Production-Ready | 超越基础：朝着使Thanos达到生产就绪状态的方向前进 - Benjamin Huo & Junhao Zhang, QingCloud Technologies

Thursday August 22, 2024 11:50 - 12:25 HKT

Level 2 | Grand Ballroom 1-2

As one of the most popular and powerful Prometheus long-term storage projects, Thanos is widely adopted by the community. But to use Thanos in production, there are still a lot of day-2 operations that need to be automated. In this talk, KubeSphere maintainers will share their experiences in using and maintaining Thanos in production including:
- Kubernetes native definition of all Thanos components
- Tenant isolation of ingestion, rule evaluation, compaction
- Tenant-based autoscaling mechanism of Thanos Ingester, Ruler, and Compactor
- The time-based partition of Thanos store
- Tenant-based data lifetime management
- The sharding mechanism of the global ruler to handle massive recording rules and alerting rules evaluation workload
- The gateway & agent proxy mechanism for read/write with tenant access control
- The basic_auth, built-in query UI, and external remote write and query support of the gateway
- The TLS support between Thanos components
- The 3-tier config management

作为最受欢迎和强大的Prometheus长期存储项目之一，Thanos被社区广泛采用。但要在生产环境中使用Thanos，仍然需要自动化许多第二天的运维工作。在这次演讲中，KubeSphere的维护者将分享他们在生产环境中使用和维护Thanos的经验，包括： - 所有Thanos组件的Kubernetes本地定义 - 数据摄入、规则评估、压缩的租户隔离 - 基于租户的Thanos Ingester、Ruler和Compactor的自动扩展机制 - Thanos存储的基于时间的分区 - 基于租户的数据生命周期管理 - 全局规则分片机制，用于处理大量录制规则和警报规则评估工作负载 - 用于读写的网关和代理机制，带有租户访问控制 - 网关的basic_auth、内置查询UI以及外部远程写入和查询支持 - Thanos组件之间的tls支持 - 三层配置管理

Speakers

Benjamin Huo

Manager of the Architect and Observability Team, QingCloud Technologies

Benjamin Huo leads QingCloud Technologies' Architect Team and Observability Team. He is a founding member of KubeSphere and the co-author of Fluent Operator, Kube-Events, Notification Manager, OpenFunction, and most recently eBPFConductor. He loves cloud-native technologies especially...

Junhao Zhang

Senior Software Engineer, QingCloud Technologies

Junhao Zhang, Senior Development Engineer at QingCloud Technologies, is responsible for the research and development of container platform monitoring, alerting, and other cloud-native services. With many years of industry experience, he has previously held positions at companies such...

Thursday August 22, 2024 11:50 - 12:25 HKT
Level 2 | Grand Ballroom 1-2

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

11:50 HKT

Building a High-Performance Time Series Database from Scratch: Optimization Strategies | 从零开始构建高性能时序数据库：优化策略 - Aliaksandr Valialkin, VictoriaMetrics

Thursday August 22, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 2

Application Performance Monitoring and Kubernetes monitoring in their current state are pretty expensive. The average VictoriaMetrics installation is processing 2-4 million samples/s on the ingestion path, and 20-40 million samples/s on the read path. The biggest installations account for 100 million samples/s on the ingestion path. This requires being very clever with data pipelines to keep them efficient and scalable by adding more resources. In this session, we'll explore essential optimizations to maintain database speed such as string interning, caching results, goroutine management and utilizing sync.Pool for efficient resource management. These techniques help strike a balance between performance and resource consumption. This talk focuses on practical strategies for enhancing database speed.

在当前状态下，应用程序性能监控和Kubernetes监控非常昂贵。平均VictoriaMetrics安装在摄入路径上处理2-4百万样本/秒，在读取路径上处理20-40百万样本/秒。最大的安装在摄入路径上占据了1亿样本/秒。这需要通过对数据管道进行非常聪明的优化，通过增加更多资源来保持其高效和可扩展性。在本场演讲中，我们将探讨保持数据库速度的基本优化，如字符串内部化、缓存结果、goroutine管理和利用sync.Pool进行有效的资源管理。这些技术有助于在性能和资源消耗之间取得平衡。本次演讲侧重于增强数据库速度的实用策略。
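Two of the optimizations named in the abstract, string interning and `sync.Pool`, can be illustrated with a minimal Go sketch. It is not taken from the VictoriaMetrics code base: the `interner` type, `formatSample`, and the metric name are hypothetical names chosen for illustration, and a production implementation would likely also bound the interning map.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// interner deduplicates equal strings so that repeated values
// (for example metric names) share a single allocation.
// Hypothetical helper type, for illustration only.
type interner struct {
	mu sync.Mutex
	m  map[string]string
}

func newInterner() *interner { return &interner{m: make(map[string]string)} }

// intern returns the canonical string for b, allocating only on first sight.
func (in *interner) intern(b []byte) string {
	in.mu.Lock()
	defer in.mu.Unlock()
	// The standard Go compiler avoids an allocation for this lookup form.
	if s, ok := in.m[string(b)]; ok {
		return s
	}
	s := string(b)
	in.m[s] = s
	return s
}

// size reports how many distinct strings have been interned.
func (in *interner) size() int {
	in.mu.Lock()
	defer in.mu.Unlock()
	return len(in.m)
}

// bufPool reuses buffers across calls instead of allocating a new one each time.
var bufPool = sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}

// formatSample renders "name value" using a pooled buffer.
func formatSample(name string, value float64) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	fmt.Fprintf(buf, "%s %g", name, value)
	return buf.String() // String copies, so returning the buffer to the pool is safe
}

func main() {
	in := newInterner()
	a := in.intern([]byte("http_requests_total"))
	b := in.intern([]byte("http_requests_total"))
	fmt.Println(a == b, in.size()) // true 1
	fmt.Println(formatSample(a, 42)) // http_requests_total 42
}
```

Both techniques trade a little bookkeeping for fewer allocations on hot paths, which is the balance between performance and resource consumption the abstract refers to.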

Speakers

Hui Wang

Software Engineer, VictoriaMetrics

I'm working on monitoring at VictoriaMetrics. My passion is cloud-native technologies and open source.

Aliaksandr Valialkin

CTO, VictoriaMetrics

Aliaksandr is a co-founder and the principal architect of VictoriaMetrics. He is also a well-known author of the popular performance-oriented libraries: fasthttp, fastcache and quicktemplate. He holds a Master’s Degree in Computer Software Engineering. He decided to found VictoriaMetrics... Read More →

KubeCon + CloudNativeCon Sessions, Operations + Performance

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

11:50 HKT

Redefining Service Mesh: Leveraging eBPF to Optimize Istio Ambient Architecture and Performance | 重新定义服务网格：利用eBPF优化Istio环境架构和性能 - Yuxing Zeng, Alibaba Cloud

Thursday August 22, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 1

Istio Ambient separates the L4 and L7 functions found in the traditional sidecar model and introduces the ztunnel component, which implements L4 network load balancing and zero-trust security. However, because ztunnel is deployed at the node level as a DaemonSet, any malfunction or anomaly in ztunnel may affect the traffic of all mesh-related pods on that node. Furthermore, performance tests of Ambient Mesh have not delivered the anticipated results; ztunnel often becomes a performance bottleneck. These factors make it challenging to apply Ambient Mesh in production environments, and a more optimized and practical implementation appears to be needed. This session will share: 1. An introduction to the architecture of Istio Ambient Mesh, along with known issues in the existing implementation. 2. Using eBPF to implement zero-trust and L4 network traffic capabilities, enhancing the stability of the mesh network and significantly improving overall performance.

Istio Ambient将传统的边车模型中发现的L4/L7功能分离，并引入了ztunnel组件，实现了L4网络负载均衡和安全的零信任。然而，由于ztunnel部署在节点级别的DaemonSet上，ztunnel中的任何故障或异常可能会影响该节点下所有与网格相关的Pod的流量。此外，Ambient Mesh的性能测试并未达到预期的结果；ztunnel经常成为性能瓶颈。这些因素使得在生产环境中应用Ambient Mesh变得具有挑战性。看起来我们需要一个更优化和实用的实现解决方案。本次会话将分享： 1. Istio Ambient Mesh架构的介绍，以及现有实现中已知的问题。 2. 使用eBPF实现零信任和L4网络流量功能，增强Mesh网络的稳定性，并显著提高整体性能。

Speakers

Jesse Zeng

Technical Expert, Alibaba Cloud

Yuxing Zeng is a technical expert on the Container Service team at Alibaba Cloud. He is also an Istio member and Envoy contributor, with extensive experience in cloud native technologies such as Kubernetes, Istio, and Envoy.

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 英语 (English)

11:50 HKT

Unlocking Scalability and Simplifying Multi-Cloud Management with Karmada and PipeCD | 使用Karmada和PipeCD解锁可扩展性并简化多云管理 - Khanh Tran, CyberAgent, Inc. & Hongcai Ren, Huawei

Thursday August 22, 2024 11:50 - 12:25 HKT

Level 1 | Hung Hom Room 7

In the coming age of AI, it has become inevitable for organizations to embrace a multi-cloud approach. Managing applications across multiple clouds can present various challenges, including resilience, performance, security, cost, and deployment management. How well have you prepared yourself and your services for this new age? This presentation will introduce Karmada and PipeCD, two powerful tools designed to help organizations address these challenges effectively and achieve seamless multi-cloud management. Karmada is a multi-cloud container orchestration system, while PipeCD is a multi-cloud continuous delivery solution. Both tools are built on extensive experience in managing applications at scale across multiple clouds. We will delve into the key features and benefits of Karmada and PipeCD and show how they can simplify multi-cloud management. Together, we can unlock the true potential of multi-cloud systems and empower organizations to thrive in the era of AI.

在新的人工智能时代，任何组织都不可避免地需要采用多云方法。在多个云上管理应用程序可能会带来各种挑战，包括弹性、性能、安全性、成本和部署管理。您为新时代做好了多少准备？本次演讲将介绍Karmada和PipeCD，这两款强大的工具旨在支持组织有效应对这些挑战，实现无缝的多云管理。Karmada是一个多云容器编排工具，而PipeCD是一个多云持续交付解决方案。这两款工具都是基于在多个云上管理应用程序的丰富经验构建的。我们将深入探讨Karmada和PipeCD的关键特性和优势，以及它们如何简化多云管理。让我们一起释放多云系统的真正潜力，赋予组织在人工智能时代蓬勃发展的力量。

Speakers

Hongcai Ren

Senior Software Engineer, Huawei

Hongcai Ren (@RainbowMango) is a CNCF Ambassador who has been working on Kubernetes and other CNCF projects since 2019 and is a maintainer of the Kubernetes and Karmada projects.

Khanh Tran

Software Engineer, CyberAgent, Inc.

Khanh is a maintainer of the PipeCD project. He is currently employed at CyberAgent Inc, and responsible for the CI/CD system across the organization. As a member of the developer productivity team, his primary focus is on automation and anything that enhances the development process... Read More →

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

13:50 HKT

Choose Your Own Adventure: The Struggle for Security | 选择你的冒险：安全之战 - Whitney Lee, VMware Tanzu & Viktor Farcic, Upbound

Thursday August 22, 2024 13:50 - 14:25 HKT

Level 1 | Hung Hom Room 2

Our hero, a running application in a Kubernetes production environment, knows they are destined for greater things! They are serving end users, but currently, they are also endangering those users, the system, and themselves! But the struggle for security is HARD, filled with system design choices concerning secrets management; cluster-level and runtime policies; and securing pod-to-pod communications. It is up to you, the audience, to guide our hero and help them grow from a vulnerable, unprotected application to their final form: an app that is more secure against invasion. In their third ‘Choose Your Own Adventure’-style talk, Whitney and Viktor will present choices that an anthropomorphized app must make as they try to protect themselves against every kind of exploit. Throughout the presentation, the audience (YOU!) will vote to decide our hero app's path! Can we navigate CNCF projects to safeguard our app, system, and users against attack before the session time elapses?

我们的英雄是一个在Kubernetes生产环境中运行的应用程序，他知道自己注定要成为更伟大的存在！他正在为最终用户提供服务，但目前却也在危及这些用户、系统和自己！但是安全的斗争是艰难的，充满了关于秘钥管理、集群级别和运行时策略以及保护Pod之间通信的系统设计选择。观众们，你们将扮演引导我们英雄并帮助他们从一个脆弱、无保护的应用程序成长为更加安全抵御入侵的终极形态的角色。在这场第三场“选择你自己的冒险”风格的演讲中，Whitney和Viktor将呈现一个拟人化应用程序必须做出的选择，以试图保护自己免受各种利用。在整个演示过程中，观众（就是你！）将投票决定我们英雄应用程序的道路！在演讲结束之前，我们能否通过探索CNCF项目来保护我们的应用程序、系统和用户免受攻击呢？

Speakers

Viktor Farcic

Developer Advocate, Upbound

Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.

Whitney Lee

Developer Advocate, VMware Tanzu

Whitney is a lovable goofball and a CNCF Ambassador who enjoys understanding and using tools in the cloud native landscape. Creative and driven, Whitney recently pivoted from an art-related career to one in tech. You can catch her lightboard streaming show ⚡️ Enlightning on Tanzu.TV... Read More →

KubeCon + CloudNativeCon Sessions, Cloud Native Novice

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

13:50 HKT

Implement Auto Instrumentation Under GraalVM Static Compilation on OTel Java Agent | GraalVM 静态编译下 OTel Java Agent 的自动增强方案与实现 - Zihao Rao & Ziyi Lin, Alibaba Cloud

Thursday August 22, 2024 13:50 - 14:25 HKT

Level 2 | Grand Ballroom 1-2

GraalVM static compilation significantly improves Java application startup speed and runtime memory usage, which is very valuable for Java to flourish in the cloud native ecosystem. However, the automatic instrumentation originally provided by the Java agent becomes invalid after static compilation. We designed a static instrumentation solution in GraalVM to solve this problem. This talk will introduce the overall design of the solution and related test results with the OTel Java Agent.

GraalVM静态编译对于提升Java应用的启动速度和运行时内存占用有着显著的效果，对于Java在云生态中的蓬勃发展有着十分宝贵的价值。然而，原本基于Java Agent提供的自动插桩功能在静态编译之后将会失效。针对上述问题我们在GraalVM中设计了静态插桩方案，本演讲将介绍该方案的整体设计思路以及在OTel Java Agent中的相关测试结果。

Speakers

Zihao Rao

Software Engineer, Alibaba Cloud

Zihao is a software engineer at Alibaba Cloud. Over the past few years, he has participated in several well-known open source projects. He is a steering committee member of the Spring Cloud Alibaba project and is currently a triager for OpenTelemetry Java Instrumentation.

Ziyi Lin

Senior Software Engineer, Alibaba Cloud

Author of the book "Static compilation for Java in GraalVM: the principles and practice". ACM SIGSOFT Distinguished Paper Award winner (ICSE'23). Committer of the Apache incubating Teaclave Java TEE SDK (https://github.com/apache/incubator-teaclave-java-tee-sdk). Active contributor to GraalVM (https://github.com/pulls?q=is%3Apr+org%3Aoracle+author%3Aziyilin... Read More →

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

13:50 HKT

Testing and Release Patterns for Crossplane | 跨平面的测试和发布模式 - Yury Tsarev & Steven Borrelli, Upbound

Thursday August 22, 2024 13:50 - 14:25 HKT

Level 1 | Hung Hom Room 7

Crossplane has become the foundation of many Internal Developer Platforms (IDPs). A requirement for any IDP in production is the ability to make changes and upgrades to the platform with confidence. This talk will cover testing and release patterns based on our experience building production-ready environments across a range of Crossplane users. We’ll cover the lifecycle of a Crossplane Composition upgrade, from local commit to pull request to target customer environment, end-to-end testing tools, handling API changes, and how to control updates to customer environments. For quite a while, testing Crossplane Compositions meant relying exclusively on costly end-to-end layers. In this talk, we're unveiling new unit testing capabilities that allow you to evaluate and test your Composition code in complete isolation.

Crossplane已成为许多内部开发者平台（IDPs）的基础。在生产中，任何IDP的要求都是能够有信心地对平台进行更改和升级。本次演讲将涵盖基于我们在跨多个Crossplane用户构建生产就绪环境的经验，讨论测试和发布模式。我们将介绍Crossplane Composition升级的生命周期，从本地提交到拉取请求再到目标客户环境，端到端测试工具，处理API更改以及如何控制对客户环境的更新。相当长一段时间以来，测试Crossplane Compositions意味着完全依赖昂贵的端到端层。在本次演讲中，我们将揭示新的单元测试功能，使您能够在完全隔离的环境中评估和测试您的Composition代码。

Speakers

Steven Borrelli

Principal Solutions Architect, Upbound

Steven is a Principal Solutions Architect for Upbound, where he helps customers adopt Crossplane.

Yury Tsarev

Principal Solutions Architect, Upbound

Yury is an experienced software engineer who strongly focuses on open-source, software quality and distributed systems. As the creator of k8gb (https://www.k8gb.io) and active contributor to the Crossplane ecosystem, he frequently speaks at conferences covering topics such as Control... Read More →

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

13:50 HKT

Unified Management, Continuity, Compliance in Multi-Clouds with Service Mesh | 在多云环境中通过服务网格实现统一管理、连续性和合规性 - Kebe Liu, DaoCloud

Thursday August 22, 2024 13:50 - 14:25 HKT

Level 1 | Hung Hom Room 1

In multi-cloud and hybrid cloud architectures, enterprises face challenges like inter-cloud communication, traffic management, application orchestration, data security, and compliance. Service mesh technology offers a unified approach for managing service interactions, enhancing security, and ensuring data compliance. Istio, a leading service mesh project, is particularly effective in multi-cloud and hybrid cloud environments. It provides seamless network connectivity across various architectures, ensuring reliable and secure communication. Additionally, integrating Istio with Karmada enables efficient application scheduling across these complex environments. Karmada allows for smooth orchestration of workloads across different cloud platforms, enhancing the flexibility and scalability of cloud-native applications. I aim to share practical insights and experiences, especially from China, to inspire and provide strategic perspectives in navigating these technological landscapes.

在多云和混合云架构中，企业面临诸如云间通信、流量管理、应用编排、数据安全和合规性等挑战。服务网格技术提供了统一的管理服务交互方式，增强安全性，并确保数据合规性。作为领先的服务网格项目，Istio在多云和混合云环境中特别有效。它提供了跨不同架构的无缝网络连接，确保可靠和安全的通信。此外，将Istio与Karmada集成，可以实现在这些复杂环境中高效的应用调度。Karmada允许在不同云平台上平稳地编排工作负载，增强云原生应用的灵活性和可扩展性。我旨在分享实用的见解和经验，特别是来自中国，以激发并提供在这些技术领域中导航的战略视角。

Speakers

Kebe Liu

Senior software engineer, DaoCloud

Member of Istio Steering Committee, focused on cloud-native and Istio, eBPF and other areas in recent years. Founder of Merbridge project.

KubeCon + CloudNativeCon Sessions, Connectivity

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 中文 (Chinese)

14:40 HKT

Find Your Own Personal Tutor for the Study of Kubernetes | 为学习Kubernetes找到适合您的个人导师 - Hoon Jo, Megazone

Thursday August 22, 2024 14:40 - 15:15 HKT

Level 1 | Hung Hom Room 2

Kubernetes novices often turn to Stack Overflow, the community, or friends when they run into a problem. However, asking for help means explaining their environment and background information, and even then an answer is not guaranteed. I therefore suggest using K8sGPT with Ollama to fill such knowledge gaps. K8sGPT also provides an interactive mode that lets users keep asking follow-up questions until they have enough answers. In addition, it can answer in other languages, which helps users who are not comfortable with English (often a big concern at the beginning). I highly recommend K8sGPT to newcomers who want a soft landing in the Kubernetes world.

在KubeCon上，我们将讨论Kubernetes新手用户在遇到问题时通常会向stackoverflow、社区或朋友提问的情况。然而，我们需要解释我的环境和背景信息。虽然并不能保证会得到答案，但我建议使用K8sGPT与ollama来弥补当前知识的不足。此外，k8sGPT提供交互模式，可以持续提问直到我得到足够的答案。此外，对于不熟悉英语的人来说，询问其他语言可能会有所帮助（这在刚开始阶段时是一个大问题）。我强烈推荐使用K8sGPT来帮助新手顺利进入Kubernetes世界。

Speakers

Hoon Jo

Cloud Solutions Architect | Cloud Native Engineer, Megazone

Hoon Jo is a Cloud Solutions Architect and Cloud Native Engineer at Megazone. He has spoken many times about cloud native technologies and works to make cloud native ubiquitous around the world. He wrote 『Python for System/Network Administrators』 (Wikibooks, 2017... Read More →

KubeCon + CloudNativeCon Sessions, Cloud Native Novice

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 英语 (English)

14:40 HKT

Kelemetry: Global Control Plane Tracing for Kubernetes | Kelemetry：面向Kubernetes控制面的全局追踪系统 - Wei Shao & Jonathan Chan, ByteDance

Thursday August 22, 2024 14:40 - 15:15 HKT

Level 2 | Grand Ballroom 1-2

Debugging Kubernetes system issues is complicated: different controllers manipulate objects independently, sometimes triggering changes in other controllers. Unlike traditional RPC-based services, the relationship between components is not explicit; identifying which component causes an issue could be like finding a needle in a haystack. Components expose their own fragmented data, which is often limited to the lifecycle of a single request and fails to illustrate the bigger picture of asynchronous causal events. This talk introduces Kelemetry, a global tracing system for the Kubernetes control plane using scattered data sources from audit logs, events, informers and component traces. Through several demonstrations of troubleshooting online problems, we will see how Kelemetry reveals the state transition of related objects over a long timespan and reconstructs the causal hierarchy of events to provide intuitive insight into the What, When and Why of everything going on in a Kubernetes system.

调试Kubernetes系统问题是复杂的：不同的控制器独立地操作对象，有时会触发其他控制器的变化。与传统的基于RPC的服务不同，组件之间的关系并不明确；确定哪个组件引起了问题就像在一堆草堆中找针一样困难。组件展示它们自己的碎片化数据，通常仅限于单个请求的生命周期，并未展示异步因果事件的整体情况。本次演讲介绍了Kelemetry，这是一个利用审计日志、事件、通知器和组件跟踪的分散数据源的Kubernetes控制平面全局跟踪系统。通过几次在线问题排查演示，我们将看到Kelemetry如何揭示相关对象在长时间跨度内的状态转换，并重建事件的因果层次结构，以提供对Kubernetes系统中发生的一切的直观洞察。

Speakers

Wei Shao

Senior Software Engineer, ByteDance

Wei Shao is a tech lead on the Orchestration & Scheduling team at ByteDance, and a maintainer of KubeWharf projects. Wei has 6+ years of experience in the cloud native area, focusing on resource management and performance-enhanced systems in K8s. Wei led the development of multiple... Read More →

Jonathan Chan

Software engineer, ByteDance

Jonathan is a software engineer at ByteDance working on Kubernetes related infrastructure such as observability systems and cluster federation. He is also a passionate contributor to a number of open source projects.

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

14:40 HKT

NanoVisor: Revolutionizing FaaS Cold Start Performance with Secure, Lightweight Container Runtime | NanoVisor：通过安全、轻量级容器运行时改变FaaS冷启动性能 - Tianyu Zhou, Ant Group

Thursday August 22, 2024 14:40 - 15:15 HKT

Level 1 | Hung Hom Room 7

Function as a Service (FaaS) is booming, but cold start time, the time it takes to create a new container for a function, remains a significant bottleneck. This not only degrades the user experience with noticeable delays but also incurs unnecessary costs through wasted resources. NanoVisor, a groundbreaking container runtime built on gVisor, tackles slow cold starts in FaaS through a series of optimizations designed specifically for FaaS: lightweight containerd interaction for faster setup, a read-only filesystem for enhanced efficiency, and a sandbox fork mechanism that replaces heavyweight container creation for significant performance gains. These optimizations enable NanoVisor to create secure, sandboxed containers ready for function execution within an astonishing 5 ms, with a memory overhead of less than 1 MB per instance and 1.5K QPS per node. It has been successfully applied in Ant Group's ecosystem, including Alipay Cloud Base and SOFA Function, as well as CI/CD acceleration.

Function as a Service（FaaS）正在蓬勃发展，但冷启动时间，即为函数创建新容器所需的时间，仍然是一个重要的瓶颈。这不仅影响用户体验，导致明显的延迟，还因浪费资源而产生不必要的成本。NanoVisor是一种基于gVisor构建的开创性容器运行时，解决了FaaS中慢冷启动时间的挑战。它通过一系列专为FaaS设计的优化来实现：轻量级的containerd交互以加快设置速度，只读文件系统以提高效率，以及一个替代繁重容器创建的沙箱分叉机制，以获得显著的性能提升。这些优化使NanoVisor能够在惊人的5毫秒内创建安全的、沙箱化的容器，每个实例的内存开销不到1MB，每个节点的QPS为1.5K。它已成功应用于蚂蚁集团的生态系统，包括支付宝云基地和SOFA Function，以及CI/CD加速。

Speakers

Tianyu Zhou

System Engineer, Ant Group

I am Tianyu Zhou, a system engineer at Ant Group. I graduated from Zhejiang University with a master's degree in cyberspace security. My research interests include the kernel, system security, and container security.

KubeCon + CloudNativeCon Sessions, Emerging + Advanced

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

14:40 HKT

Panel: Fragmentation of the Scheduling in Kubernetes and Challenges for AI/ML Workloads | 圆桌：Kubernetes调度社区碎片化现状及如何应对AI/ML工作负载带来的挑战 - Kante Yin, DaoCloud; Li Tao, Independent; William Wang, Huawei Cloud Technologies Co., LTD; 秋萍戴, daocloud; Yuquan Ren, B

Thursday August 22, 2024 14:40 - 15:15 HKT

Level 1 | Hung Hom Room 1

The scheduler is one of the most frequently customized components in Kubernetes, owing to its extensibility. However, too many schedulers lead to decision paralysis among users, which has been discussed extensively at past KubeCons. To help reduce this confusion, four maintainers from different communities (Godel-Scheduler, Koordinator, Kubernetes SIG-Scheduling, and Volcano) are invited to outline the background and use cases behind these projects. The panel will also discuss the gap between upstream Kubernetes and downstream projects, try to abstract the common patterns or functionality that can be pushed upstream to avoid reinventing the wheel, and consider what should remain loosely defined to preserve extensibility. Moreover, with the rise of AI, scheduling AI workloads in Kubernetes poses a significant challenge. The panel will discuss where we are now and where we are headed, as well as opportunities for cooperation.

调度器是Kubernetes中最经常定制的组件之一，这归功于其可扩展性。然而，过多的调度器会导致用户决策瘫痪，这在过去的KubeCon中已经被广泛讨论过。为了帮助减轻用户的困惑，我们邀请了来自各个社区（Godel-Scheduler、Koordinator、Kubernetes SIG-Scheduling和Volcano）的四位维护者来介绍这些项目背后的背景和用例。此外，本小组讨论将探讨上游Kubernetes和下游项目之间的差距，并尝试提炼出可以推送到上游的常见模式或功能，以避免重新实现轮子，以及什么应该保持松散定义以保留可扩展性。此外，随着人工智能的兴起，在Kubernetes中调度AI工作负载面临着重大挑战，本小组讨论将探讨我们目前的状况以及我们未来的发展方向，以及合作的机会。

Speakers

Yuquan Ren

Cloud Native Architect, ByteDance

Kante Yin

Senior Software Engineer, DaoCloud

Kante is a senior software engineer and an open source enthusiast. He's currently working at the Kubernetes platform team at DaoCloud based in Shanghai, mostly around scheduling, resource management and inference. He also works on upstream Kubernetes as SIG-Scheduling Maintainer and... Read More →

Tao Li

Koordinator Co-founder&Maintainer, N/A

Tao Li is a seasoned Senior Software Engineer specializing in K8s scheduling. With extensive practical experience in large-scale K8s cluster scheduling technology, Tao has been deeply involved in the research and development of K8s scheduling systems both within Alibaba... Read More →

秋萍戴

Product Manager, DaoCloud

QiuPing Dai has been a senior Technology Product Manager at DaoCloud for 5 years, working on cloud computing development (including Kubernetes compute, storage, and networking). Before that, QiuPing worked on cloud computing at IBM. QiuPing is interested in Storage, Network, Scheduling... Read More →

KubeCon + CloudNativeCon Sessions, Emerging + Advanced

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

14:40 HKT

WebAssembly on the Server | 服务端的WebAssembly - Vivian Hu, Second State

Thursday August 22, 2024 14:40 - 15:15 HKT

Level 1 | Hung Hom Room 6

As the key findings of the CNCF Annual Survey 2022 put it, “Containers are the new normal, and WebAssembly (Wasm) is the future.” Wasm is playing an important role in the cloud native area. Before Wasm, Linux containers were commonly used to run compiled applications in the cloud; for example, a Rust or C++ app is compiled to x86_64 machine code and runs inside a Linux container. Wasm provides a more secure, much lighter, faster, and more portable alternative to Linux containers for this type of performance-minded server-side application. Currently, CNCF hosts three Wasm-focused projects: WasmEdge, wasmCloud, and runwasi. This talk will discuss WebAssembly on the server side. You will learn about the integration between Wasm and existing container tools and about use cases for WebAssembly on the server side. Going forward, we will also discuss the role of Wasm in LLM applications.

根据CNCF年度调查2022的关键发现，“容器是新常态，WebAssembly（Wasm）是未来。” Wasm在云原生领域发挥着重要作用。在Wasm出现之前，Linux容器通常用于在云中运行这些编译应用程序 - 例如，Rust或C++应用程序被编译为x86_64机器代码，并在Linux容器内运行。相比于Linux容器，Wasm为这类性能导向的服务器端应用程序提供了更安全、更轻量、更快速和更可移植的替代方案。目前，CNCF托管了三个以Wasm为重点的项目，如WasmEdge、WasmCould和runwasi。本次演讲将讨论服务器端的WebAssembly。您将了解Wasm与现有容器工具的集成，以及服务器端WebAssembly的用例。此外，我们还将讨论Wasm在LLM应用程序中的作用。

Speakers

Xiaowei

Product Manager, Second State

Vivian Hu is a Product Manager at Second State and a columnist at InfoQ. She is a founding member of the WasmEdge project. She organizes Rust and WebAssembly community events in Asia.

KubeCon + CloudNativeCon Sessions, Cloud Native Novice

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

15:35 HKT

KubeSkoop: Deal with the Complexity of Network Issues and Monitoring with eBPF | KubeSkoop：使用eBPF处理网络问题和监控的复杂性 - Yutong Li, Alibaba Cloud & Bingshen Wang, AlibabaCloud

Thursday August 22, 2024 15:35 - 16:10 HKT

Level 2 | Grand Ballroom 1-2

Troubleshooting network issues has always been one of the most difficult tasks, especially on Kubernetes. Containerization and microservices result in a denser network topology and more dependencies on the various layers of the network stack, and the new network technologies and architectures introduced by AI also pose a significant challenge for observability and diagnosis. We developed KubeSkoop, a network monitoring and diagnosis suite for Kubernetes. Using eBPF, it provides deep monitoring and tracing of the Kubernetes network to help users quickly locate network jitter problems in the cluster. It also provides a network connectivity check that can help users resolve connectivity issues with one click. This talk will cover:
● What makes Kubernetes networking complex.
● An introduction to KubeSkoop.
● How we use eBPF to monitor container networking.
● KubeSkoop practices in large-scale production environments.

网络问题的故障排除一直是最困难的部分之一，尤其是在Kubernetes上。容器化和微服务导致了更密集的网络拓扑结构，以及对各个网络堆栈模块的更多依赖，人工智能引入的新网络技术和架构也在可观察性和诊断方面提出了重大挑战。我们开发了KubeSkoop，这是专为Kubernetes设计的网络监控和诊断套件。利用eBPF技术，它提供了对Kubernetes网络的深度监控和跟踪，帮助用户快速定位集群中发生的网络抖动问题。它还提供了网络连接性检查功能，可以帮助用户通过一键解决网络连接问题。本主题将介绍以下内容： ● 什么使Kubernetes网络变得复杂。 ● KubeSkoop的介绍。 ● 我们如何使用eBPF来监控容器网络。 ● KubeSkoop在大规模生产环境中的实践。

Speakers

wang bingshen

Senior Engineer, Alibaba Cloud

Bingshen Wang is a Senior Engineer at Alibaba Cloud, a maintainer of KubeSkoop/Terway/OpenYurt, and a contributor to Kubernetes/Containerd. He mainly focuses on container networking and runtime, and has many years of experience around managing Alibaba Cloud Kubernetes clusters. He... Read More →

Tony Li

Software Engineer, Alibaba Cloud

Yutong Li is a Software Engineer at Alibaba Cloud. He works on designing and maintaining the container network for Alibaba Cloud Container Service and on the open source Kubernetes network diagnosis tool KubeSkoop.

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

15:35 HKT

OpAMP: Scaling OpenTelemetry with Flexibility | OpAMP：灵活扩展OpenTelemetry - Husni Alhamdani, Censhare & Herbert Sianturi, Krom Bank

Thursday August 22, 2024 15:35 - 16:10 HKT

Level 1 | Hung Hom Room 6

In this session, we will delve into how OpAMP (Open Agent Management Protocol) revolutionizes the management of large fleets of data collection Agents and its pivotal role in scaling OpenTelemetry deployments with unparalleled flexibility. Discover how OpAMP empowers organizations to remotely manage diverse Agents, irrespective of vendor, through its vendor-agnostic protocol. Learn how OpAMP facilitates status reporting, telemetry reporting, centralized management, allowing for tailored configurations and efficient monitoring of individual Agents or types of Agents, management of downloadable Agent-specific packages, and robust connection credentials management. Join us to unleash the potential of OpAMP and revolutionize your OpenTelemetry scalability strategy.

在这场演讲中，我们将深入探讨OpAMP（开放式代理管理协议）如何革新大规模数据收集代理的管理，并在扩展OpenTelemetry部署中发挥关键作用，具有无与伦比的灵活性。发现OpAMP如何赋予组织远程管理各种代理的能力，无论供应商如何，通过其供应商无关的协议。了解OpAMP如何促进状态报告、遥测报告、集中管理，允许定制配置和有效监控单个代理或代理类型，管理可下载的特定代理软件包，以及强大的连接凭证管理。加入我们，释放OpAMP的潜力，革新您的OpenTelemetry可扩展性策略。

Speakers

Husni Alhamdani

Senior Site Reliability Engineer, Censhare

Husni is a CNCF Ambassador, and a Site Reliability Engineer at Censhare, where he is responsible for building and maintaining infrastructure platforms. In addition to these responsibilities, he primarily focuses on architecting Cloud-Native solutions. He also graduated from the LFX... Read More →

Herbert Sianturi

Senior DevOps Engineer, Krom Bank

Herbert Sianturi serves as a Senior DevOps Engineer at Krom Bank Indonesia, where he spearheads efforts to enhance the quality of the end-to-end application lifecycle and to adopt open source platforms as a base. With years of expertise in container orchestration and cloud computing... Read More →

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

15:35 HKT

Optimize and Accelerate Cloud AI Infrastructure with Autoscaling | 通过自动缩放优化和加速云AI基础设施 - Yuan Mo, Alibaba Cloud

Thursday August 22, 2024 15:35 - 16:10 HKT

Level 1 | Hung Hom Room 7

With the rise of generative AI technology, more and more applications are starting to integrate generative AI capabilities. However, the high costs of training and inference can be daunting for developers. In this talk, we will discuss the issues and solutions that need additional consideration when using elastic scaling in generative AI scenarios, including:
● How to enhance the elastic startup efficiency of generative AI
● How to address inference efficiency when compute and storage are separated in generative AI
● How to reduce the costs of training and inference
● How to solve the interruption problem in AI training scenarios that use Spot instances
● How to address capacity elasticity in LLM scenarios
Finally, we will introduce the practical experience of a world-leading generative AI service provider, HaiYi (seaart.ai), to help more developers understand the architectural methods of elastic cloud AI infrastructure.

随着生成式人工智能技术的兴起，越来越多的应用程序开始与生成式人工智能的能力集成。然而，训练和推理的高成本可能会让开发人员望而却步。在这次演讲中，我们将讨论在生成式人工智能场景中使用弹性扩展时需要额外考虑的问题和解决方案，包括： ● 如何提高生成式人工智能的弹性启动效率 ● 如何在生成式人工智能中分离计算和存储时解决推理效率的问题 ● 如何降低训练和推理的成本 ● 如何使用Spot实例解决AI训练场景中的中断问题 ● 如何解决LLM场景中的容量弹性问题最后，我们将介绍世界领先的生成式人工智能服务提供商海艺（seaart.ai）的实际经验，让更多开发人员了解弹性云AI基础设施的架构方法。

Speakers

Yuan Mo

Staff Engineer, Alibaba Cloud

Senior technical expert at Alibaba Cloud, maintainer of the Kubernetes elastic component autoscaler, and founder of the cloud-native gaming community and OpenKruiseGame. He has given several talks at KubeCon. Focus on the cloud-native transformation of the gaming industry... Read More →

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 高级 (Advanced)
Language | 语言 英语 (English)

15:35 HKT

Revolutionizing Scientific Simulations with Argo Workflows | 用Argo工作流彻底改变科学模拟 - ShaungKun Tian, Alibaba Cloud & 建翔孙, 北京深势科技有限公司

Thursday August 22, 2024 15:35 - 16:10 HKT

Level 1 | Hung Hom Room 1

DP Technology provides scientific simulation platforms for research in biomedicine, energy, materials, and other industries. Scientific simulation workflows are inherently complex and resource-intensive, and manual deployment is often prone to errors. After adopting Argo Workflows to orchestrate scientific simulations, we achieved a 100% improvement in productivity. In this talk, we will introduce why we chose Argo Workflows, how to orchestrate large-scale scientific simulation tasks, and how to make the whole system scalable and reliable. In particular, we will share best practices on how to manage very large workflows (thousands of tasks), how to retry workflows sensibly, how to use memoization to reduce runtime and compute cost, and how to interact with HPC systems. We have also made contributions to the Argo community to enhance functionality and improve reliability. Additionally, we'll introduce DFlow, our open-source Python SDK designed for the seamless orchestration of scientific simulations with Argo Workflows.

DP Technology为生物医药、能源、材料等行业的研究提供科学模拟平台。科学模拟工作流程本质上复杂且资源密集，手动部署往往容易出错。采用Argo工作流程来编排科学模拟后，我们的生产力提高了100%。在本次演讲中，我们将介绍为什么选择Argo工作流程，如何编排大规模科学模拟任务，如何实现整个系统的可扩展性和可靠性。特别是，我们将分享如何管理超大型工作流程（数千个任务），如何合理重试工作流程，如何使用记忆化来减少运行时间和计算成本，如何与HPC系统交互。我们还为Argo社区做出了贡献，以增强功能性和提高可靠性。此外，我们还将介绍DFlow，我们的开源Python SDK，旨在与Argo工作流程无缝协同编排科学模拟。

Speakers

建翔孙

Software Engineer, 北京深势科技有限公司

I once built a machine learning platform at Kuaishou, and currently, I am involved in scheduling scientific computing tasks at DP Technology, as well as constructing workflow platforms. I specialize in the field of cloud-native development.

Thursday August 22, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 1

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 英语 (English)

16:25 HKT

A Decade of Cloud-Native Journey: The Evolution of Container Technology and the Kubernetes Ecosystem | 十年云原生之旅：容器技术和Kubernetes生态系统的演变 - Jintao Zhang, Kong Inc.

Thursday August 22, 2024 16:25 - 17:00 HKT

Level 1 | Hung Hom Room 2

Over the past decade, cloud-native technologies have revolutionized software development, deployment, and operations. Container technology and the Kubernetes ecosystem, as transformation leaders, have enhanced development agility, and provided enterprises with unmatched scalability, flexibility, and efficiency. This talk navigates the evolution of these technologies, highlighting their impact on the cloud-native landscape. Starting my journey in 2014, I will share insights into the decade-long evolution of Kubernetes, its community, and technology stacks, alongside personal experiences. Attendees will learn about successes, challenges, and future trends, gaining knowledge to navigate their cloud-native transformations.

在过去的十年里，云原生技术已经彻底改变了软件开发、部署和运营。容器技术和Kubernetes生态系统作为变革的领导者，提升了开发的灵活性，并为企业提供了无与伦比的可扩展性、灵活性和效率。本次演讲将探讨这些技术的演变，突出它们对云原生领域的影响。从2014年开始我的旅程，我将分享关于Kubernetes、其社区和技术堆栈十年演变的见解，以及个人经验。与会者将了解成功、挑战和未来趋势，获得知识来引领他们的云原生转型。

Speakers

Jintao Zhang

Sr. SE, Kong

Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes Ingress-NGINX maintainer. He specializes in cloud-native technology and the Azure technology stack. He works at Kong Inc.

Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 2

KubeCon + CloudNativeCon Sessions, Cloud Native Novice

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

16:25 HKT

Observability Supercharger: Build the Traffic Topology Map for Millions of Containers with Zero Code | 可观测性超级增强器：使用零代码为数百万个容器构建流量拓扑图 - Sheng Wei & Teck Chuan Lim, Shopee

Thursday August 22, 2024 16:25 - 17:00 HKT

Level 2 | Grand Ballroom 1-2

Kubernetes makes container orchestration and management simple. However, with the surge of applications and middleware onboarded to Kubernetes, it is difficult to analyze and identify the relationships and dependencies among the huge number of services and middleware. The most common approach requires the business side to make code changes to expose more information, which cannot realistically cover all applications. In this session, we will share:
* How does Shopee leverage eBPF to build a universal map for a million containers in production environments?
* How do we implement distributed tracing for arbitrary third-party middleware with different protocols and usage patterns?
* How do we optimize eBPF code and the Linux kernel to minimize the impact on injected containers?
* How did we integrate with the big data and AI stack to fully utilize the data for anomaly detection and incident troubleshooting?

Kubernetes使容器编排和管理变得简单易行。然而，随着应用程序和中间件在Kubernetes上的激增，分析和识别大量服务和中间件之间的关系和依赖关系变得困难。最常见的方法需要业务方进行代码更改以公开更多信息，这对所有应用程序来说是不可能覆盖的。在本场演讲中，我们将分享： *Shopee如何利用eBPF在生产环境中为百万个容器构建通用映射？ *我们如何为具有不同协议和使用模式的任意第三方中间件实现分布式跟踪？ *我们如何优化eBPF代码和Linux内核以最小化对注入容器的影响？ *我们如何与大数据和人工智能堆栈集成，充分利用数据进行异常检测和故障排除？

Speakers

Teck Chuan Lim

Engineer, Shopee

I have been working at Shopee since graduating in 2018. I am a long-standing core member of the engineering infrastructure team and took charge of driving Shopee's engineering infrastructure ecosystem from DevOps to DataOps. As of the moment, I am taking charge to drive forward towards...

Sheng Wei

Shopee

Thursday August 22, 2024 16:25 - 17:00 HKT
Level 2 | Grand Ballroom 1-2

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

16:25 HKT

The Two Sides of the Kubernetes Enhancement Proposals (KEPs) | Kubernetes Enhancement Proposals（KEPs）的两面性 - Rayan Das, OneTrust LLC & Sreeram Venkitesh, BigBinary

Thursday August 22, 2024 16:25 - 17:00 HKT

Level 1 | Hung Hom Room 7

Kubernetes Enhancement Proposals (KEPs) are pivotal in proposing, communicating, and coordinating new efforts within the Kubernetes project. As members of the Release Team (the team responsible for releasing the next version of Kubernetes), and especially of the Enhancements Team under SIG-Release, we play a vital role in maintaining the active status of enhancements and facilitating communication between stakeholders, whether for a deprecation or a feature update. In this talk, we look at the KEP lifecycle from the perspective of the release team, exploring the process (enhancements freeze, code freeze, and the exception process), major themes, and more. Additionally, we will discuss the developer's viewpoint on KEPs, highlighting the process, deadlines, and best practices for proposing, reviewing, and implementing KEPs effectively. Join us to learn how KEPs drive innovation and collaboration within the Kubernetes community, empowering contributors to shape the future of Kubernetes development.

Kubernetes Enhancement Proposals（KEPs）在Kubernetes项目中提出、沟通和协调新工作方面起着关键作用。作为发布团队的成员（负责发布下一个版本的Kubernetes的团队），特别是在SIG-Release下的Enhancements团队，我们在维护增强功能的活跃状态和促进利益相关者之间的沟通方面发挥着重要作用，无论是废弃还是功能更新。在这次演讲中，我们将从发布团队的角度看待KEP的生命周期，探讨过程（增强功能冻结、代码冻结和异常处理过程）、主要主题等。此外，我们还将讨论开发人员对KEP的观点，重点介绍提出、审查和有效实施KEP的过程、截止日期和最佳实践。加入我们，了解KEP如何推动Kubernetes社区内的创新和协作，赋予贡献者塑造Kubernetes开发未来的能力。

Speakers

Rayan Das

Senior Site Reliability Engineer, OneTrust LLC

As a Senior Site Reliability Engineer, I devote my expertise to the infrastructure of OneTrust Privacy Software. Within the Kubernetes community, I served as a SIG-Release Enhancements Shadow for Kubernetes v1.29, and I applied to be a release shadow for v1.31 as well. Beyond...

Sreeram Venkitesh

Software Engineer, BigBinary

Sreeram Venkitesh is a Software Engineer at BigBinary and is an active contributor to Kubernetes. He is active in the Kubernetes release team, where he served as a shadow in the enhancements team from v1.29-v1.30 and is the enhancements sub-team lead for v1.31. He also helps write...

Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 7

KubeCon + CloudNativeCon Sessions, Cloud Native Experience

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

16:25 HKT

Uniting Sustainability and Edge Computing: Kepler & Open Horizon on RISC-V and Heterogeneous System | 团结可持续性和边缘计算：Kepler和Open Horizon在RISC-V和异构系统上 - Peng Hui Jiang & David Yao, IBM

Thursday August 22, 2024 16:25 - 17:00 HKT

Level 1 | Hung Hom Room 1

The dynamic landscape of cloud-edge computing demands solutions to mitigate energy consumption and promote sustainability. Our proposal advocates for the integration of Kepler and Open Horizon with the CNCF and LF Edge ecosystems to address diverse hardware requirements in cloud and edge deployments, including x86, arm, s390, and the emerging RISC-V architectures. Notably, the Chinese market, characterized by edge devices in the manufacturing, retail, and surveillance domains, stands to benefit significantly from this initiative. By using Kepler’s sophisticated energy estimation capabilities and Open Horizon’s autonomous workload management features, this proposal endeavors to optimize energy efficiency across heterogeneous edge environments. In the session, we will demonstrate one use case that builds and integrates Kepler and Open Horizon to run on the RISC-V platform, and that monitors and optimizes distributed and heterogeneous systems to build a greener and more resilient cloud-edge computing paradigm.

云边计算的动态景观需要解决能源消耗问题并促进可持续发展。我们的提案主张将Kepler和Open Horizon与CNCF和LF Edge生态系统整合，以解决云和边缘部署中多样化的硬件需求，包括x86、arm、s390和新兴的RISC-V架构。值得注意的是，中国市场以制造、零售和监控领域的边缘设备为特征，这一举措将使其受益匪浅。通过利用Kepler的先进能源估算能力和Open Horizon的自主工作负载管理功能，本提案旨在优化异构边缘环境的能源效率。在本场演讲中，我们将演示一个使用案例，展示如何构建和整合Kepler和Open Horizon在RISC-V平台上运行，并监控和优化分布式和异构系统，以构建更环保、更具弹性的云边计算范式。

Speakers

Peng Hui Jiang

Architect, IBM

Peng Hui Jiang works at IBM as a Senior Software Engineer, building and operating public cloud services. He has rich experience in cloud, database, and security. He is a CNCF Kepler maintainer, an Apache CouchDB committer, and a Master Inventor at IBM holding more than 200 patents or...

勇姚

Program Director, IBM Cloud Platform, IBM

David Yao is the Program Director of IBM Cloud Platform in IBM China Development Lab, developing and managing the entire product development lifecycle and team for the dynamic cloud and edge environment. Passionate about learning open technology, building and transforming an open and...

Thursday August 22, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 1

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

17:15 HKT

Addressing the #1 Threat to the Web: Authorization | 应对网络的头号威胁：授权 - Jimmy Zelinskie, authzed

Thursday August 22, 2024 17:15 - 17:50 HKT

Level 1 | Hung Hom Room 7

As more folks deploy cloud-native architectures and technologies, store ever larger amounts of data, and build ever more complex software suites, correctly and securely authorizing requests becomes exponentially more difficult. Broken authorization now tops OWASP's Top 10 Security Risks for Web Apps. Their recommendation? Adopt an ABAC or ReBAC authorization model. This talk establishes the problems with the status quo, explains the core concepts behind ReBAC, and introduces SpiceDB, a widely adopted open source system inspired by Zanzibar, the system that internally powers authorization at Google.

随着越来越多的人部署云原生架构和技术，存储越来越多的数据，并构建越来越复杂的软件套件，正确和安全地授权请求所需的复杂性变得指数级增加。破解授权现在已经成为OWASP Web应用程序安全风险前十名之首。他们的建议是采用ABAC或ReBAC授权模型。本次演讲将阐明现状存在的问题，解释ReBAC背后的核心概念，并介绍SpiceDB，这是一个广泛采用的开源系统，受到Google内部系统Zanzibar的启发。

Speakers

Jimmy Zelinskie

cofounder, authzed

Jimmy Zelinskie is a software engineer and product leader with a goal of democratizing software via open source development. He's currently CPO of authzed where he's focused on bringing hyperscaler best-practices in authorization to the industry at large. At CoreOS, he helped pioneer...

Thursday August 22, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 7

KubeCon + CloudNativeCon Sessions, Security

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

17:15 HKT

OpenTelemetry Amplified: Full Observability with eBPF-Enabled Distributed Tracing | OpenTelemetry放大：使用eBPF启用的分布式跟踪实现全面的可观测性 - Kai Liu, Alibaba Cloud & Wanqi Yang, Sun Yat

Thursday August 22, 2024 17:15 - 17:50 HKT

Level 1 | Hung Hom Room 1

Within the cloud-native ecosystem, OpenTelemetry (otel) has established itself as the de facto standard for cross-language and cross-platform observability. By providing comprehensive tracing, metrics, and logging solutions for various programming languages, otel has empowered developers and operators with deep insights into complex systems. In recent years, otel has further expanded its observability frontiers by introducing innovative capabilities in the Linux kernel space using eBPF. However, this innovative journey has encountered new challenges, particularly in reducing the invasiveness in certain programming languages and correlating observability data between kernel and user spaces. This session chronicles Alibaba Cloud’s journey through these challenges. By leveraging eBPF technology, we've pioneered innovative solutions that redefine the landscape of system observability, presenting an integrated, less invasive approach for real-time insights into distributed systems.

在云原生生态系统中，OpenTelemetry（otel）已经成为跨语言和跨平台可观测性的事实标准。通过为各种编程语言提供全面的跟踪、度量和日志解决方案，otel为开发人员和运维人员提供了对复杂系统的深入洞察。近年来，otel通过在Linux内核空间引入eBPF的创新能力，进一步拓展了其可观测性边界。然而，这种创新之旅遇到了新的挑战，特别是在减少某些编程语言中的侵入性和在内核和用户空间之间相关联可观测性数据方面。本场演讲将记录阿里云在这些挑战中的旅程。通过利用eBPF技术，我们开创了重新定义系统可观测性景观的创新解决方案，提供了一种集成的、不那么侵入性的方法，实时洞察分布式系统。

Speakers

Kai Liu

Senior Software Developer, Alibaba Cloud

Liu Kai, a senior software development engineer in the Cloud Native Observability team of Alibaba Cloud. With years of practical experience and insights in the field of monitoring and observability, Liu Kai continuously delves into the realm of observability solutions, including architectural...

Wanqi Yang

Student, Sun Yat-sen University

Wanqi Yang received the B.S. degree in Computer Science and Technology from Sun Yat-Sen University, Guangzhou, China. She is currently working toward the PhD degree in Computer Science and Technology at School of Computer Science and Engineering, Sun Yat-Sen University. Her research...

Thursday August 22, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 1

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

17:15 HKT

Working with Raw Disk Drives in Kubernetes — YDB's Experience | 在Kubernetes中使用原始磁盘驱动器——YDB的经验 - Ivan Blinkov, YDB

Thursday August 22, 2024 17:15 - 17:50 HKT

Level 2 | Grand Ballroom 1-2

YDB is an open-source distributed database management system that, for performance reasons, uses raw disk drives (block devices) to store all data, without any filesystem. It was relatively straightforward to manage such a setup in the bare-metal world of the past, but the dynamic nature of cloud-native environments has introduced new challenges to keeping this performance benefit. In this talk, we'll explore how to leverage Kubernetes and the Operator design pattern to modernize how stateful distributed database clusters are managed without changing the primary approach to how the data is physically stored.

YDB是一个开源的分布式数据库管理系统，为了性能考虑，使用原始磁盘驱动器（块设备）存储所有数据，而不使用任何文件系统。在过去的裸金属世界中管理这样的设置相对比较简单，但云原生环境的动态特性引入了新的挑战，以保持这种性能优势。在这次演讲中，我们将探讨如何利用Kubernetes和运算符设计模式来现代化管理有状态的分布式数据库集群，而不改变数据物理存储的主要方法。

Speakers

Ivan Blinkov

VP, Product and Open-Source, YDB

Ivan Blinkov is a seasoned technical leader specializing in data storage and processing. Over the last decade, he was involved in the development of several database management systems, two of which are open-source: ClickHouse in the past and, more recently, YDB.

Thursday August 22, 2024 17:15 - 17:50 HKT
Level 2 | Grand Ballroom 1-2

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

10:35 HKT

A Year in the Life of a Developer in the Era of Developer Portals: Navigating Backstage | 开发者在开发者门户时代的一年生活：导航Backstage - Helen Greul, Spotify

Friday August 23, 2024 10:35 - 11:10 HKT

Level 1 | Hung Hom Room 1

In today's rapidly evolving landscape of software development, the role of developer portals has become indispensable. This presentation delves into the experiences of developers over the course of a year, exploring the transformative impact of Backstage developer portal on their workflows, collaboration, and overall productivity based on case studies from existing adopters of Backstage. Through a comprehensive exploration of real-world scenarios, this talk offers insights into the daily challenges faced by developers and how Backstage empowers them to overcome these hurdles. From streamlined onboarding processes to simplified access to internal services and documentation, attendees will gain a deeper understanding of the multifaceted benefits that Backstage brings to developer teams. Moreover, we'll discuss best practices for leveraging Backstage to foster a culture of innovation, collaboration, and continuous improvement.

在当今快速发展的软件开发领域，开发者门户的作用变得不可或缺。本次演讲将深入探讨开发者在一年时间内的经验，通过现有Backstage采用者的案例研究，探讨Backstage开发者门户对他们的工作流程、协作和整体生产力的转变影响。通过对现实场景的全面探讨，本次演讲将为参与者提供洞察开发者面临的日常挑战，以及Backstage如何赋予他们克服这些障碍的能力。从简化入职流程到简化访问内部服务和文档，参与者将更深入地了解Backstage为开发团队带来的多方面好处。此外，我们还将讨论利用Backstage促进创新、协作和持续改进文化的最佳实践。

Speakers

Helen Greul

Head of Engineering for Backstage, Spotify

Helen is an engineering leader, speaker and a strong advocate for creating developer ecosystems that empower teams to thrive. Her journey has taken her from hands-on coding to steering engineering and platform teams, providing her with a holistic perspective on the challenges and...

Friday August 23, 2024 10:35 - 11:10 HKT
Level 1 | Hung Hom Room 1

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 英语 (English)

10:35 HKT

Deep Dive Into Windows CSI Driver HostProcess Containers | 深入探讨Windows CSI驱动程序HostProcess容器 - Andy Zhang (OSTC) & Weizhi Chen, Microsoft

Friday August 23, 2024 10:35 - 11:10 HKT

Level 2 | Grand Ballroom 1-2

Currently, most Windows CSI drivers depend on Windows csi-proxy because various privileged operations cannot be performed from a containerized application running on a Windows node. Beginning in Kubernetes 1.23, HostProcess containers are supported; they run directly on the host as regular processes. Switching to HostProcess container deployment will make Windows CSI driver development and deployment easier. This session will cover the history and implementation details of the Windows csi-proxy project, why csi-proxy has been needed by Windows CSI drivers since Kubernetes 1.18, and why we removed this csi-proxy dependency in Kubernetes 1.26. We will explore the key learnings and gotchas we resolved while migrating Windows CSI driver development from csi-proxy-dependent deployment to HostProcess container deployment. After attending this session, you will understand why and how to migrate your Windows applications to gain the benefits of using HostProcess containers.

目前，大多数Windows CSI驱动程序依赖于Windows csi-proxy，因为各种特权操作无法从在Windows节点上运行的容器化应用程序中执行。从Kubernetes 1.23开始，支持HostProcess容器，它可以直接在主机上作为常规进程运行。切换到HostProcess容器部署将使Windows CSI驱动程序的开发和部署变得更加简单。本场演讲将涵盖Windows csi-proxy项目的历史和实施细节，解释为什么从Kubernetes 1.18开始在Windows CSI驱动程序中需要csi-proxy，以及为什么我们在Kubernetes 1.26中删除了这种csi-proxy依赖性。我们将探讨在将Windows CSI驱动程序开发从依赖于csi-proxy的部署迁移到HostProcess容器部署时解决的关键问题和注意事项。参加本场演讲后，您将了解为什么以及如何将您的Windows应用程序迁移到使用HostProcess容器以获得更多好处。

Speakers

Andy Zhang (OSTC)

Principal Software Engineer, Microsoft

Andy Zhang is the storage lead in Azure Kubernetes Service team at Microsoft, maintainer of multiple Kubernetes projects, including Windows csi-proxy project, Azure CSI drivers, SMB, NFS, iSCSI CSI drivers, etc. Andy focuses on improving the experience of using storage in Kuberne...

Weizhi Chen

Senior Software Engineer, Microsoft

Work at Microsoft AKS team on Kubernetes. Focus on k8s storage drivers on Azure.

Friday August 23, 2024 10:35 - 11:10 HKT
Level 2 | Grand Ballroom 1-2

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

10:35 HKT

Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano | 通过Volcano增强的智能基础设施优化LLM工作流程 - Xin Li, qihoo360 & William Wang, Huawei Cloud Technologies Co., LTD

Friday August 23, 2024 10:35 - 11:10 HKT

Level 1 | Hung Hom Room 2

As Large Language Models (LLMs) revolutionize various aspects of our lives, many companies are building their own cloud native AI platforms to train and fine-tune LLMs. However, managing large-scale LLM training and inference platforms presents even more critical challenges, such as training efficiency, fault tolerance, resource fragmentation, operational costs, and topology-aware scheduling on racks and supernodes. In this session, the speaker will share insights from their experience using a Kubernetes-based smart infrastructure, enhanced by Volcano, to manage thousands of GPUs and handle monthly workloads involving thousands of LLM training and inference jobs at qihoo360. This talk will cover:
- Fault detection, fast job recovery, and self-healing that drastically improve efficiency.
- Dealing with long downtime in LLM training on heterogeneous GPUs.
- Intelligent GPU workload scheduling to reduce resource fragmentation and costs.
- Topology-aware scheduling on racks/supernodes to accelerate LLM training.

随着大型语言模型（LLMs）革新我们生活的各个方面，许多公司构建他们的云原生人工智能平台来训练和微调LLM。然而，管理大规模LLM训练和推理平台面临更为关键的挑战，如训练效率、容错性、资源碎片化、运营成本和机架和超级节点上的拓扑感知调度。在这场演讲上，演讲者将分享他们在使用基于Kubernetes的智能基础设施（由Volcano增强）管理数千个GPU并处理qihoo360中涉及数千个LLM训练和推理作业的月度工作负载的经验。本次演讲将涵盖：故障检测、快速作业恢复和自愈大幅提高效率。处理异构GPU上LLM训练的长时间停机。智能GPU工作负载调度以减少资源碎片化和成本。机架/超级节点上的拓扑感知调度以加速LLM训练。

Speakers

Xin Li

Senior Engineer of Server Development, qihoo360

Xin Li is a seasoned senior back-end developer and an approver for the Volcano project, with a keen focus on Kubernetes and AI. The infrastructure he is responsible for provides support for the training and inference of 360GPT. Moreover, Li Xin delves deeply into optimizing distributed...

Friday August 23, 2024 10:35 - 11:10 HKT
Level 1 | Hung Hom Room 2

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

11:25 HKT

Beyond Statefulset: Containerize Your Enterprise Stateful Applications in Practice | 超越StatefulSet：实践中将企业有状态应用容器化 - Mingshan Zhao, Alibaba Cloud & Vec Sun, xiaohongshu

Friday August 23, 2024 11:25 - 12:00 HKT

Level 1 | Hung Hom Room 1

Kubernetes provides StatefulSet to manage stateful services, but it is far from enough to run enterprise stateful applications in practice. For example: how does ZooKeeper accomplish leader election, and how does an MQ implement configuration hot loading? How should daily operation and maintenance of a database be done? Many practitioners resort to operators that manage pods directly (e.g. KubeBlocks) for specific applications (e.g. databases), yet these are not general enough for other stateful applications. OpenKruise provides several stateful features that are missing in native StatefulSet, such as in-place resource and volume resizing, progressive ConfigMap and Secret hot updates, and a container operation channel. Teams from Alibaba and Xiaohongshu will share their lessons from building operators and platforms for general stateful apps and from containerizing databases and middleware at a scale of hundreds of thousands of pods.

Kubernetes提供了StatefulSet来管理有状态服务，但实际上要运行企业级有状态应用还远远不够。例如：Zookeeper如何完成领导者选举，MQ如何实现配置热加载？如何进行数据库的日常运维？许多从业者借助直接管理pod的运营商，例如KubeBlocks，针对特定应用程序，例如数据库，但它们并不足够通用以适用于其他有状态应用程序。 OpenKruise提供了一些在原生StatefulSet中缺失的有状态功能，例如原地资源和卷大小调整，渐进式Configmap和Secret热更新以及容器操作通道。来自阿里巴巴和小红书的团队将分享他们构建运营商和平台以适用于通用有状态应用程序，并将数据库和中间件容器化的经验，规模达数十万个pod。

Speakers

Mingshan Zhao

Senior R&D Engineer, Alibaba Cloud

Senior R&D Engineer of AliCloud, Maintainer of OpenKruise community, has long been engaged in the research and development of cloud native, containers, scheduling and other fields; core R&D member of Alibaba's one million container scheduling system, and many years of experience in...

Vec Sun

software engineer, xiaohongshu

Sunweixiang previously worked in the Alibaba Cloud container team as a software engineer and is a contributor to OpenKruise, Karmada, and other communities. He is deeply involved in container application orchestration and multi-cluster.

Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 1

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 中文 (Chinese)

11:25 HKT

Evolution of SPDK Vhost-FS Solution to Accelerate File Access in VMs and Secure Containers | SPDK Vhost-FS解决方案的演进，加速虚拟机中的文件访问并保护容器 - Changpeng Liu, Intel

Friday August 23, 2024 11:25 - 12:00 HKT

Level 2 | Grand Ballroom 1-2

Virtio-fs is a shared file system between virtual machines or secure containers and the host. Storage Performance Development Kit (SPDK) vhost-fs is the backend implementation of virtio-fs in userspace. In this presentation, we will summarize typical storage solutions that use SPDK vhost-fs and components to build the storage stack, then go through the evolution of SPDK vhost-fs from BlobFS to the latest FSDEV module. Advanced features such as interrupt mode and thread modeling for data processing in SPDK vhost-fs are also covered.

Virtio-fs是虚拟机或安全容器与主机之间共享文件系统，Storage Performance Development Kit(SPDK) vhost-fs是virtio-fs在用户空间的后端实现。在本次演讲中，我们将总结使用SPDK vhost-fs和组件构建存储栈的典型存储解决方案，然后介绍SPDK vhost-fs从BlobFS到最新的FSDEV模块的演变过程，还将涵盖SPDK vhost-fs中用于数据处理的高级功能，如中断模式和线程建模。

Speakers

Changpeng Liu

Cloud Solution Architect, Intel

Changpeng is a Cloud Solution Architect at Intel. He has been working on Storage Performance Development Kit since 2014. Currently, Changpeng is a core maintainer for the SPDK. His areas of expertise include NVMe, I/O Virtualization, and storage offload on IPU.

Friday August 23, 2024 11:25 - 12:00 HKT
Level 2 | Grand Ballroom 1-2

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

11:25 HKT

LLM's Anywhere: Browser Deployment with Wasm & WebGPU | LLM随处可用：使用Wasm和WebGPU进行浏览器部署 - Joinal Ahmed, Navatech Group & Nikhil Rana, Google Cloud

Friday August 23, 2024 11:25 - 12:00 HKT

Level 1 | Hung Hom Room 3

In today's interconnected world, deploying and accessing machine learning (ML) models efficiently poses significant challenges. Traditional methods rely on cloud GPU clusters and constant internet connectivity. However, WebAssembly (Wasm) and WebGPU technologies are revolutionizing this landscape. This talk explores leveraging Wasm and WebGPU for deploying Single Layer Models (SLMs) directly within web browsers, eliminating the need for extensive cloud GPU clusters and reducing reliance on constant internet access. We showcase practical examples and discuss how Wasm enables efficient cross-platform ML model execution, while WebGPU optimizes parallel computation within browsers. Join us to discover how this fusion empowers developers and users alike with unprecedented ease and efficiency in browser-based ML, while reducing dependence on centralized cloud infrastructure and internet connectivity constraints.

在当今互联世界中，高效部署和访问机器学习（ML）模型面临着重大挑战。传统方法依赖于云GPU集群和持续的互联网连接。然而，WebAssembly（Wasm）和WebGPU技术正在彻底改变这一局面。本次演讲探讨了如何利用Wasm和WebGPU在Web浏览器中直接部署单层模型（SLMs），消除了对庞大云GPU集群的需求，减少了对持续互联网访问的依赖。我们展示了实际示例，并讨论了Wasm如何实现高效的跨平台ML模型执行，以及WebGPU如何优化浏览器内的并行计算。加入我们，发现这种融合如何赋予开发人员和用户在基于浏览器的ML中前所未有的便利和效率，同时减少对集中式云基础设施和互联网连接的依赖。

Speakers

Joinal Ahmed

AI Architect, Navatech Group

Joinal is a seasoned Data Science expert passionate about rapid prototyping, community involvement, and driving technology adoption. With a robust technical background, he excels in leading diverse teams through ML projects, recruiting and mentoring talent, optimizing workflows, and...

Nikhil Rana

AI Consultant, Google Cloud

Nikhil is an applied data science professional with over a decade of experience in developing and implementing Machine learning, Deep Learning, and NLP-based solutions for a variety of industries like Finance, FMCG, etc. He is a passionate advocate for the use of data science to solve...

Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 3

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

11:25 HKT

New Advances for Cross-Platform AI Applications in Docker | Docker中跨平台AI应用程序的新进展 - Michael Yuan, Second State

Friday August 23, 2024 11:25 - 12:00 HKT

Level 1 | Hung Hom Room 2

The talk proposes to delve into novel methods for enhancing cross-platform GPU/AI workloads within container ecosystems, with a specific emphasis on Docker's incorporation of the WebGPU standard. This standard empowers containerized applications to utilize host GPUs and additional AI accelerators via a flexible API. Consequently, there's no longer a necessity to construct Docker images tailored to individual GPU vendors and their proprietary drivers. The presentation will feature a demonstration highlighting how the WasmEdge project capitalizes on the WebGPU standard to craft portable LLM inference applications in Rust. Additionally, Docker's seamless management and orchestration of these applications will be showcased.

本次演讲旨在探讨增强容器生态系统中跨平台GPU/AI工作负载的新方法，特别强调Docker对WebGPU标准的整合。该标准使容器化应用程序能够通过灵活的API利用主机GPU和额外的AI加速器。因此，不再需要构建针对个别GPU供应商及其专有驱动程序的Docker镜像。演示将展示WasmEdge项目如何利用WebGPU标准在Rust中创建可移植的LLM推理应用程序。此外，还将展示Docker对这些应用程序的无缝管理和编排能力。

Speakers

Michael Yuan

Product Manager, Second State

Dr. Michael Yuan is a maintainer of WasmEdge Runtime (a project under CNCF) and a co-founder of Second State. He is the author of 5 books on software engineering published by Addison-Wesley, Prentice-Hall, and O'Reilly. Michael is a long-time open-source developer and contributor...

Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 2

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

11:25 HKT

Rollout Patterns: Smoothly Migrating and Rolling Out Your Microservices | 部署模式：平稳迁移和部署您的微服务 - Tim Xiao, DaoCloud & Wu Chenhui, AS.Watson TechLab

Friday August 23, 2024 11:25 - 12:00 HKT

Level 1 | Hung Hom Room 7

At Watsons, most of their services are built on Dubbo. Now, they aim to utilize delivery tools like Argo CD and Argo Rollouts to automatically and securely deliver their services. However, they have encountered complexities beyond what Argo Rollouts assumes. We will summarize these patterns and demonstrate how to handle them, including:
- Pattern 1: One service at a time.
- Pattern 2: Multiple services, each forward-compatible.
- Pattern 3: Multiple services with version dependency.

在Watsons，他们的大多数服务都是基于Dubbo构建的。现在，他们希望利用Argo CD和Argo Rollouts等交付工具来自动和安全地交付他们的服务。然而，他们遇到了超出Argo Rollouts假设的复杂性。我们将总结这些模式，并演示如何处理它们，包括： - 模式1：一次一个服务。 - 模式2：多个服务，每个都是向前兼容的。 - 模式3：具有版本依赖性的多个服务。

Speakers

旸肖

Developer, DaoCloud

Served as DevOps platform Principal Engineer at DaoCloud, participated in community projects including argo-cd, argo-rollouts, and kubevela, and has more than 5 years of Kubernetes platform development experience.

Wu Chenhui

architecture, AS.Watson TechLab

I have nearly 30 years of experience in software development and architecture design, and 5 years of experience in k8s. I am responsible for the k8s-related architecture design of Watsons Group.

Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 7

KubeCon + CloudNativeCon Sessions, SDLC (Software Development Lifecycle)

Experience Level | 内容经验水平 初级 (Beginner)
Language | 语言 中文 (Chinese)

13:20 HKT

Build Container Runtime Based on Sandbox API of Containerd | 基于Containerd的Sandbox API构建容器运行时 - Shaobao Feng, Huawei Cloud & Cai Wei, DaoCloud

Friday August 23, 2024 13:20 - 13:55 HKT

Level 1 | Hung Hom Room 1

The Sandbox API was released in containerd 1.7 and will be stable in containerd 2.0. It provides a clean way to implement a sandbox-oriented container runtime. With the introduction of different kinds of isolation techniques as sandboxes, a container is now more a set of API specifications than a single technology. We need a clear and abstract definition of the Sandbox API to make it easy to integrate different kinds of sandboxing techniques into a container runtime. In this session, we will: 1. Introduce the Sandbox API of containerd and explain why we need it. 2. Show how we build our container runtimes on the Sandbox API and the benefits that come with it. 3. Demonstrate different kinds of sandboxed containers created by Kuasar, a container runtime framework based on the new Sandbox API that currently supports VMM, UserMode Kernel, WebAssembly, and Runc sandboxes.

在KubeCon的会议描述中，我们将介绍Sandbox API在containerd 1.7中发布，并将在containerd 2.0中稳定。它提供了一种清晰的方式来实现面向沙箱的容器运行时。随着不同类型的隔离技术（如沙箱）的引入，容器现在更多地是一组API规范，而不是单一技术。我们需要对Sandbox API进行清晰和抽象的定义，以便轻松集成不同类型的沙箱技术，使其成为容器运行时。在这次分享中，我们将： 1. 介绍containerd的Sandbox API，以及为什么我们需要它。 2. 展示我们如何基于Sandbox API构建我们的容器运行时以及带来的好处。 3. 我们将展示由基于新Sandbox API的容器运行时框架Kuasar创建的不同类型的沙箱容器的演示，目前支持VMM、UserMode Kernel、WebAssembly和Runc的沙箱。

Speakers

Wei Cai (Iceber Gu)

Software Engineer, DaoCloud

Senior open source enthusiast, focused on cloud runtime, multi-cloud and WASM. I am a CNCF Ambassador and founded Clusterpedia and promoted it as a CNCF Sandbox project. I also created KasmCloud to promote the integration of WASM with Kubernetes and contribute it to the WasmCloud...

Shaobao Feng

Principal Engineer, Huawei Cloud

Shaobao is a Principal Engineer at Huawei Cloud whose work focuses on serverless platforms. He has been a leader in building the secure container runtime of the first Serverless Kubernetes on public cloud. He is the main code contributor and maintainer of the open source...

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

13:20 HKT

JuiceFS CSI in Multi-Thousand Node Kubernetes Clusters for LLM Pre-Training | JuiceFS CSI在LLM预训练中用于几千节点Kubernetes集群 - Weiwei Zhu, juicedata

Friday August 23, 2024 13:20 - 13:55 HKT

Level 2 | Grand Ballroom 1-2

The rapid advancement of artificial intelligence technologies, especially the development of large language models (LLMs), has led to a sharp increase in the amount of data that enterprises need to process. Managing large-scale data clusters in Kubernetes environments presents several challenges, including storage performance, complex access control management, and system stability. JuiceFS is a distributed POSIX file system designed for the cloud. It was open-sourced in 2021 (9.8k stars). To deliver an optimal experience in Kubernetes, JuiceFS developed the JuiceFS CSI Driver. In addition, JuiceFS CSI introduced several new designs to support large-scale, complex AI training tasks, such as the mount pod mode and the sidecar mode for serverless environments. Outline: - LLM storage challenges - JuiceFS CSI Driver architecture - Mount pod mode / sidecar mode - Practical experience - Future

人工智能技术的快速发展，特别是大型语言模型（LLMs）的发展，导致企业需要处理的数据量急剧增加。在Kubernetes环境中管理大规模数据集群面临着多个挑战，包括存储性能、复杂的访问控制管理和系统稳定性。 JuiceFS是一种为云设计的分布式POSIX文件系统。它于2021年开源（拥有9.8k星）。为了在Kubernetes中提供最佳体验，JuiceFS开发了JuiceFS CSI驱动程序。此外，JuiceFS CSI引入了几项新设计，以支持大规模、复杂的人工智能训练任务，如挂载Pod模式和用于无服务器环境的Sidecar模式。大纲： - LLM存储挑战 - JuiceFS CSI驱动程序架构 - 挂载Pod模式\Sidecar模式 - 实践经验 - 未来

Speakers

Weiwei Zhu

Full stack engineer, juicedata

She is a full-stack engineer at Juicedata, Inc. and a maintainer of the JuiceFS CSI driver and Fluid. She is responsible for the development and maintenance of JuiceFS in the cloud-native ecosystem, completed the implementation and practice of JuiceFS in Kubernetes, and continued to improve the...

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

13:20 HKT

What if Your System Experiences an Outage? Let's Build a Resilient Systems with Chaos Engineering | 如果您的系统遇到故障怎么办？让我们通过混沌工程构建弹性系统 - NamKyu Park, LitmusChaos

Friday August 23, 2024 13:20 - 13:55 HKT

Level 1 | Hung Hom Room 7

This session explores how LitmusChaos improves the resilience of cloud-native applications by injecting chaos. It also showcases the streamlined management of chaos engineering software through Backstage. Cloud-native applications can be complex to navigate and secure. Our session will present strategies to identify vulnerabilities using GitOps and monitoring, integrated seamlessly into your system. Learn how Backstage and LitmusChaos can enhance your application's resilience with ease! The session starts with chaos orchestration and analysis using LitmusChaos, followed by a live demo highlighting the utilization of LitmusChaos' Backstage plugin and others like Prometheus and ArgoCD. Learn how these plugins, when integrated with Backstage, effectively manage all components necessary for executing chaos engineering.

本场演讲探讨了LitmusChaos如何通过注入混沌来提高云原生应用程序的弹性。它还展示了通过Backstage简化混沌工程软件的管理。云原生应用程序可能很复杂，难以导航和保护。我们的会议将介绍使用GitOps和监控来识别漏洞的策略，无缝集成到您的系统中。了解如何使用Backstage和LitmusChaos轻松增强您的应用程序的弹性！本场演讲从使用LitmusChaos进行混沌编排和分析开始，然后展示了使用LitmusChaos的Backstage插件以及其他插件如Prometheus和ArgoCD的实时演示。了解这些插件与Backstage集成后，如何有效管理执行混沌工程所需的所有组件。

Speakers

Namkyu Park

Maintainer, LitmusChaos

Namkyu Park is a CNCF Ambassador and a software developer. He has worked at several startups in South Korea. He completed the Linux Foundation Mentorship Programme (LitmusChaos) as a mentee and is currently a mentor and maintainer of LitmusChaos. He has previously spoken at GopherCon Korea...

KubeCon + CloudNativeCon Sessions, SDLC (Software Development Lifecycle)

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

14:10 HKT

Developing a Standard Multi-Cluster Inventory API | 开发标准的多集群Inventory API - Zhiying Lin & Chen Yu, Microsoft; Hongcai Ren, Huawei; Di Xu, Xiaohongshu; Jian Qiu, Redhat

Friday August 23, 2024 14:10 - 14:45 HKT

Level 1 | Hung Hom Room 1

After a year of effort, the Kubernetes community has made great progress toward final approval of the cluster inventory API project. The project has gained a lot of attention and interest from different companies and open source projects, with many new use cases being explored. This panel discussion brings together the maintainers from different multicluster management projects who bootstrapped this project. We will share what the cluster inventory API is and how we got here. We will also introduce the ongoing work and emerging use cases for this project, and our vision for the future. During the panel discussion, attendees will gain a comprehensive understanding of the use cases, e.g., how to support multi-cluster AI workload scheduling using the inventory API, and the challenges, e.g., how to migrate seamlessly from one cluster manager tool to another. We will shed light on the collaborative efforts to standardize cluster inventory APIs and how the project evolved from a small group discussion into a community effort.

经过一年的努力，Kubernetes社区在最终批准集群清单API项目方面取得了巨大进展。该项目受到了不同公司和开源项目的关注和兴趣，许多新的用例正在被探索。本次小组讨论将汇集来自不同多集群管理项目的维护者，他们启动了这个项目。我们将分享什么是集群清单API，以及我们是如何实现的。我们还将介绍该项目的正在进行的工作和新兴用例，以及我们对未来计划的愿景。在小组讨论期间，与会者将全面了解用例，例如如何使用清单API支持多集群AI工作负载调度，以及挑战，例如如何无缝迁移集群管理工具。我们将阐明协作努力以标准化集群清单API，并介绍它是如何从一个小组讨论演变为社区努力的。

Speakers

Di Xu

Principal Software Engineer, Xiaohongshu

Chen Yu

Senior Software Engineer, Microsoft

Zhiying Lin

Principal Software Engineer, Microsoft

I'm a Principal Software Engineer at Microsoft, and my main contribution is the Azure Kubernetes Fleet Manager product. I'm one of the main maintainers of the open source projects Azure/fleet and Azure/fleet-networking.

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

14:10 HKT

KuaiShou's 100% Resource Utilization Boost: 100K Redis Migration from Bare Metal to Kubernetes | 快手的100%资源利用率提升：从裸机迁移100K Redis到Kubernetes - XueQiang Wu, ApeCloud & YuXing Liu, Kuaishou

Friday August 23, 2024 14:10 - 14:45 HKT

Level 2 | Grand Ballroom 1-2

In the past year, Kuaishou successfully migrated nearly 100,000 Redis instances from traditional bare metal environments to the Kubernetes platform, achieving a significant doubling of resource utilization. While ensuring business stability, this large-scale migration faced numerous challenges, including smooth migration execution, finding a balance between increasing deployment density (resource utilization) and ensuring system stability, avoiding interference with other services during coexistence, and addressing specific issues associated with stateful services like databases (including data management, configuration management, ensuring high availability, cross-cluster disaster recovery, etc.). This session will share Kuaishou's large-scale practical experience in Redis cloud-native transformation, in collaboration with the open-source project KubeBlocks, covering aspects such as smooth migration, resource efficiency improvement, and efficient database management.

在过去的一年中，快手成功将近10万个Redis实例从传统裸机环境迁移到Kubernetes平台，实现资源利用率显著翻倍。在确保业务稳定性的同时，这一大规模迁移面临诸多挑战，包括顺利执行迁移、在增加部署密度（资源利用率）和确保系统稳定性之间找到平衡、在共存期间避免与其他服务的干扰，以及解决与数据库等有状态服务相关的特定问题（包括数据管理、配置管理、确保高可用性、跨集群灾难恢复等）。本场演讲将分享快手在Redis云原生转型方面的大规模实践经验，与开源项目KubeBlocks合作，涵盖顺利迁移、资源效率提升和高效数据库管理等方面。

Speakers

YuXing Liu

Senior Software Engineer, Kuaishou

I have worked in the cloud-native teams of Alibaba Cloud and Kuaishou, focusing on the cloud-native field and gaining experience in open source, commercialization, and scaling of cloud-native technologies. I am one of the maintainers of the CNCF/Dragonfly project and also one of the...

XueQiang Wu

Director of Research and Development, ApeCloud

Former tech leader at Alibaba Cloud PolarDB-X, a cloud-native distributed database, with a wide range of interests and expertise in operating systems, cryptography, distributed systems, and more. Joined the PolarDB-X team in 2017, focusing on the development of high-concurrency, low-latency...

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

14:10 HKT

Model Service Mesh: A New Paradigm for Large-Scale AI Model Service Deployment and Management | 模型服务网格：大规模AI模型服务部署和管理的新范式 - Xi Ning Wang, Alibaba Cloud & Huailong Zhang, Intel China

Friday August 23, 2024 14:10 - 14:45 HKT

Level 1 | Hung Hom Room 2

As AI/ML models grow in scale and complexity, efficiently deploying and managing model services in cloud-native environments has become a significant challenge. This session introduces the Model Service Mesh (MSM), an emerging architectural paradigm designed specifically for large-scale AI model service deployment and management, to address that challenge. The new paradigm focuses on: 1. How to build a highly scalable and reliable model delivery system, with key features including dynamic model service routing, unified management of multiple models within a single endpoint, an optimized caching layer, and cache-aware scheduling. 2. How to leverage MSM to optimize AI model services in terms of lifecycle management, resource utilization, security, observability, and resilience. In essence, this architecture ensures a scalable, secure, and efficient model service in cloud native environments.

随着人工智能/机器学习模型规模和复杂性的增长，如何在云原生环境中高效部署和管理模型服务已成为一个重大挑战。本提案将介绍模型服务网格（MSM），这是一种专门为大规模人工智能模型服务部署和管理而设计的新兴架构范式，旨在解决这一挑战。这种新范式关注以下几点： 1. 如何构建一个高度可扩展和可靠的模型交付系统，关键特性包括动态模型服务路由、单个端点内多模型的统一管理、优化缓存层和缓存感知调度等。 2. 如何利用MSM优化人工智能模型服务的生命周期管理、资源利用率改善、安全增强以及可观察性和弹性保障。总的来说，这种架构确保了在云原生环境中可扩展、安全和高效的模型服务。

Speakers

王夕宁

Technical Leader, Alibaba Cloud

Wang Xining, senior technical expert at Alibaba Cloud and technical leader of ACK (Kubernetes) and ASM (Service Mesh), focusing on Kubernetes, service mesh, and other cloud native fields. Previously worked at IBM as a tech architect focusing on SOA/Cloud and served as the chairman of the...

Huailong Zhang

Cloud Software Engineer, Intel China

Steve (Huailong) Zhang has worked for Alcatel-Lucent, Baidu, and IBM on cloud computing research and development. Huailong currently works for Intel China as a cloud-native software engineer, focusing on cloud-native technical fields such as Kubernetes and service mesh...

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

14:10 HKT

Opportunities and Challenges of Cloud Native Technology in US Healthtech | 美国健康科技中云原生技术的机遇与挑战 - Katerina Arzhayev, SUSE

Friday August 23, 2024 14:10 - 14:45 HKT

Level 1 | Hung Hom Room 7

In this session I will share the strategic roadmap for Cloud Native Technology companies eyeing expansion into the intricate US healthcare market. Delving into the multifaceted landscape of American healthcare, the session navigates through its complexities, from the dichotomy of public and private sectors to the nuanced regulatory framework dominated by HIPAA and FDA regulations. By illuminating Cloud Native Technology's transformative potential, particularly in fostering interoperability, enhancing telehealth capabilities, and empowering data analytics, the session showcases how innovation can meet the industry's pressing needs. Moreover, it sheds light on the indispensable considerations for market entry, emphasizing regulatory compliance, trust-building with healthcare stakeholders, and the imperative of market localization. Attendees will be equipped with a strategic playbook to navigate the intricate terrain of US healthtech.

在这场演讲上，我将分享云原生技术公司进军美国复杂医疗市场的战略路线。深入探讨美国医疗保健的多层面景观，本场演讲将引导参与者了解其复杂性，从公共和私营部门的对立到以HIPAA和FDA法规为主导的细致监管框架。通过阐明云原生技术的变革潜力，特别是在促进互操作性、增强远程医疗能力和赋能数据分析方面，本场演讲展示了创新如何满足行业迫切需求。此外，它还揭示了进入市场的不可或缺的考虑因素，强调了监管合规性、与医疗保健利益相关者建立信任以及市场本地化的必要性。与会者将获得一份战略指南，帮助他们在美国医疗科技领域的复杂地形中航行。

Speakers

Katerina Arzhayev

Director of Product Management, Healthcare Edge, SUSE

Katerina Arzhayev is experienced in cross-cultural collaboration and technology strategy. She has a proven track record of driving business results through effective communication and strategic planning. Katerina's expertise lies in making highly complicated topics accessible to non-technical...

KubeCon + CloudNativeCon Sessions, Cloud Native Experience

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

15:15 HKT

Expanding Cloud Native Capabilities with WASM: A Case Study of Harbor and WASM Integration | 通过WASM扩展云原生能力：Harbor和WASM集成案例研究 - Chenyu Zhang, AntGroup & Yan Wang, Broadcom

Friday August 23, 2024 15:15 - 15:50 HKT

Level 1 | Hung Hom Room 1

In the cloud-native realm, eBPF's versatility has led to scalable solutions in observability and security by attaching to system event checkpoints without kernel code modification. This concept has paved the way for extending business applications non-invasively and flexibly without altering the original code. In this session, we'll use Harbor, the cloud-native artifact registry, to showcase how WASM (WebAssembly) extends Harbor's functionalities without code modification. Here, Harbor is analogous to the Linux kernel, and WASM to user-provided eBPF programs. Harbor provides mounting points for various events, such as pre-pull requests, enabling users to filter requests with custom WASM programs. This facilitates fine-grained permission control and artifact security auditing before a user pulls the artifacts, with more features to discover.

在云原生领域，eBPF 的多功能性使得它能够通过附加到系统事件检查点而无需修改内核代码，从而实现可扩展的可观测性和安全性解决方案。这一概念为在不改变原始代码的情况下非侵入性和灵活地扩展业务应用程序铺平了道路。在本场演讲中，我们将使用 Harbor，云原生制品注册表，展示如何使用 WASM（WebAssembly）在不修改代码的情况下扩展 Harbor 的功能。在这里，Harbor 类似于 Linux 内核，而 WASM 则类似于用户提供的 eBPF 程序。Harbor 提供了各种事件的挂载点，例如预拉取请求，使用户能够使用自定义的 WASM 程序过滤请求。这有助于在用户拉取制品之前进行细粒度的权限控制和制品安全审计，还有更多功能等待您去发现。

Speakers

Yan Wang

Staff engineer, Broadcom

Yan Wang is a Staff Engineer working at VMware. As one of the core maintainers of the CNCF project Harbor and a maintainer of the CNCF project Distribution, his main work focuses on technology research and innovation in the cloud native field.

Chenyu Zhang

Software Engineer, AntGroup

Chenyu Zhang is a software engineer, currently mainly responsible for the development and maintenance of the Harbor project, and also has some experience in DevOps and cloud native technology stacks.

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

15:15 HKT

No More Runtime Setup! Let's Bundle, Distribute, Deploy, Scale LLMs Seamlessly with Ollama Operator | 无需运行时设置！让我们使用Ollama Operator轻松捆绑、分发、部署、扩展LLMs - Fanshi Zhang, DaoCloud

Friday August 23, 2024 15:15 - 15:50 HKT

Level 1 | Hung Hom Room 2

Seeking a way to ship LLMs more seamlessly? Is it way too complicated to manage, compose, and set up a runtime with Python, C++, CUDA, and GPUs when deploying LLMs? Tired of fighting dependencies, model sizes, and syncing deliverable model images across nodes? People often find it hard to bundle, distribute, deploy, and scale their own LLM workloads, but no worries: here is Ollama Operator, a scheduler and utilizer for LLM models, powered by the Modelfile format introduced by Ollama. You can now enjoy a unified, bundled runtime powered by llama.cpp with a few simple lines of CRD definition, or with a single command using the natively included kollama CLI. Bundling, distributing, deploying, and scaling LLMs across operating systems and environments has never been easier or more seamless. Let's dive in and find out what Ollama Operator and Ollama can do to deploy our own large language models, and how we can combine these features with Modelfile and bring them into the Kubernetes world!

寻找一种更无缝地运输LLM的方式？在部署LLM时，使用Python、C++、CUDA、GPU设置运行时太复杂？厌倦了与依赖、模型大小、在节点间同步可交付模型图像等问题作斗争？人们常常发现很难捆绑、分发、部署和扩展自己的LLM工作负载，但不用担心，这里有Ollama Operator，一个由Ollama引入的基于Modelfile的LLM模型调度器和利用者。现在，您可以通过简单的CRD定义行或内置的kollama CLI命令行，享受由llama.cpp提供支持的统一捆绑运行时，轻松实现LLM的捆绑、分发、部署和扩展，跨操作系统和环境都可以轻松实现。让我们深入了解一下Ollama Operator与Ollama能够做些什么来部署我们自己的大型语言模型，我们可以如何结合这些功能与Modelfile，然后将它们带入Kubernetes世界！

Speakers

Neko Ayaka

Software Engineer, DaoCloud

Cloud native developer, AI researcher, Gopher with 5 years of experience in loads of development fields across AI, data science, backend, frontend. Co-founder of https://github.com/nolebase

KubeCon + CloudNativeCon Sessions, AI + ML

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 英语 (English)

15:15 HKT

The Challenges of Kubernetes Data Protection - Real Examples and Solutions with Velero | Kubernetes数据保护的挑战- Velero的真实案例和解决方案 - Wenkai Yin, Broadcom & Bruce Zou, Shanghai Jibu Tech

Friday August 23, 2024 15:15 - 15:50 HKT

Level 2 | Grand Ballroom 1-2

The distributed and dynamic nature of Kubernetes makes it challenging for data protection to guarantee data availability and durability. Below is a summary of the issues we encountered in real customer environments: 1. Application definition and resource capture 2. Application data consistency 3. Application restore in heterogeneous and cross-cloud environments. We provide a detailed description of these issues in the "Additional resources" section due to the character limit of the "Description".

Kubernetes的分布式和动态特性使得数据保护变得具有挑战性，以确保数据的可用性和持久性。以下是我们在真实客户环境中遇到的问题摘要： 1. 应用程序定义和资源捕获 2. 应用程序数据一致性 3. 跨异构和跨云环境的应用程序恢复由于“描述”部分的字符限制，我们将在“附加资源”部分提供这些问题的详细描述。

Speakers

Bruce Zou

Co-founder and Development Director, Shanghai Jibu Tech

Over 10 years of storage development and architecture experience at the IBM storage system lab; submitted 15+ disclosures and publications; supported 10+ major accounts on critical high-end storage system issues. Rich experience in building highly available storage systems, leading...

Wenkai Yin

Staff Software Engineer, Broadcom

Staff software engineer focused on cloud-native development. Core maintainer of the open source projects Harbor and Velero.

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

15:15 HKT

The Experience of ChillyRoom Developing & Managing Session-Based Game on K8s with OpenKruiseGame | 在K8s上使用OpenKruiseGame开发和管理基于会话的游戏的ChillyRoom经验 - Qiuyang Liu, Alibaba Cloud & Xinhao Liu, ChillyRoom

Friday August 23, 2024 15:15 - 15:50 HKT

Level 1 | Hung Hom Room 7

In the era of traditional game operation and maintenance, session-based games face huge challenges in terms of delivery efficiency and resource costs. Cloud native technology brings exactly the flexibility and highly automated capabilities that session-based games need. However, because game servers are strongly stateful, running games on Kubernetes also involves various difficulties. This talk will focus on the characteristics of session-based games and describe how ChillyRoom uses OpenKruiseGame, a subproject of the CNCF incubating project OpenKruise, to develop and manage session-based games on Kubernetes. It provides developers in the game industry with cloud native implementation experience in automatic network access, elastic scaling of game servers, matching logic development, room status management, and more.

在传统游戏运维时代，基于会话的游戏在交付效率和资源成本方面面临巨大挑战。云原生技术正好为会话型游戏带来了灵活性和高度自动化能力。然而，由于游戏服务器具有强烈的有状态特性，在实现游戏在 Kubernetes 上的过程中也存在各种困难。本次演讲将重点关注会话型游戏的特点，并描述 ChillyRoom 如何使用 OpenKruise 的子项目 OpenKruiseGame 来开发和管理基于会话的游戏在 Kubernetes 上，为游戏行业的开发人员提供云原生实现经验，包括自动网络访问、游戏服务器的弹性扩展、匹配逻辑开发和房间状态管理等。

Speakers

Qiuyang Liu

Senior R&D Engineer, Alibaba Cloud

Qiuyang Liu is head of cloud native gaming at Alibaba Cloud Container Service and a maintainer of the kruise-game project. He has long been engaged in the research and development of cloud native in the gaming field and is committed to promoting the implementation of cloud native in the...

Xinhao Liu

Engineer, ChillyRoom

Xinhao Liu is an engineer with one year of experience in game server development at ChillyRoom and three years of experience in Linux OS and cloud core network software development in industry. He has a passion for creating flexible, high-performance, highly available, and easy-to-maintain game...

KubeCon + CloudNativeCon Sessions, Cloud Native Experience

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)

16:05 HKT

JD Cloud's Large-Scale Serverless Practice : APP Management and Elastic Scaling on Karmada | 京东云的大规模无服务器实践：在Karmada上的应用管理和弹性扩展 - XiaoFei Wang & Chen Yanying, JDCloud

Friday August 23, 2024 16:05 - 16:40 HKT

Level 1 | Hung Hom Room 1

At JDCloud, the federated Serverless service is based on the federated management model and the Serverless application model. It provides JDOS application container control services for federated application container deployment, elastic scaling, and fault migration, and it manages multiple clusters with over 10,000 nodes. It unifies management of multiple sub-clusters to improve overall resource utilization and reduces the complexity of multi-cluster management, scheduling, and distribution on the platform. End users can use our platform just like the native Kubernetes API. In this session, we will address numerous technical challenges, including: 1. Multi-cluster management and distribution practice 2. An efficient cross-cluster elastic scaling solution 3. Problems encountered in production and lessons learned

在京东云中，联邦Serverless服务基于联邦管理模型和Serverless应用模型，为联邦应用容器部署、弹性扩展和故障迁移提供JDOS应用容器控制服务。它管理超过10,000个节点的多个集群。统一管理多个子集群，提高整体资源利用率。减少平台上多集群管理、调度和分发的复杂性。最终用户可以像使用本机Kubernetes API一样使用我们的平台。在整个过程中，我们将解决许多技术挑战，包括： 1. 多集群管理和分发实践 2. 高效的跨集群弹性扩展解决方案 3. 在生产和分享中遇到的问题

Speakers

Chen Yanying

Cloud Native Engineer, JDCloud

Engaged in the construction and internal promotion of basic platforms such as Federated Clusters, Serverless, Service Mesh and some middleware, based on JD's large-scale Kubernetes clusters

XiaoFei Wang

Cloud Native Engineer, JDCloud

As a software engineer, he is responsible for cluster deployment, multi-cluster management, and federated clusters. He has participated in JD.com's 618 and 11.11 shopping events and has rich practical experience in cloud native.

KubeCon + CloudNativeCon Sessions, Platform Engineering

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

16:05 HKT

TiDB: Your Next MySQL Is Not a MySQL | TiDB：你的下一个 MySQL 何必是 MySQL - Qizhi Wang, PingCAP

Friday August 23, 2024 16:05 - 16:40 HKT

Level 2 | Grand Ballroom 1-2

You might have heard of TiDB, a distributed open-source database known for its virtually limitless horizontal scalability, capable of handling both online transactional processing and analytical workloads while being compatible with the MySQL protocol. Traditionally, different databases have been employed to handle various workloads in our application architecture designs. Commonly, relational databases are used for online transaction processing, with data asynchronously distributed to analytical databases, document stores, and cache databases. With the rise of AI, an additional type of database needs consideration: the vector database. But introducing this type of database can add unnecessary complexity to your technology stack. In this talk, we will discuss how TiDB integrates multiple functionalities such as real-time transaction processing, online analytics, sharding-free architecture, and vector type computations, all aimed at reducing the cognitive load for developers.

您可能已经听说过 TiDB，这是一个分布式开源数据库，以其几乎无限的水平扩展性而闻名，能够处理在线事务处理和分析工作负载，同时兼容 MySQL 协议。传统上，在我们的应用架构设计中，通常会使用不同的数据库来处理各种工作负载。通常情况下，关系数据库用于在线事务处理，数据会异步分布到分析数据库、文档存储和缓存数据库。随着人工智能的兴起，还需要考虑一种额外的数据库类型 —— 向量数据库。但引入这种类型的数据库可能会给您的技术堆栈增加不必要的复杂性。在本次演讲中，我们将讨论 TiDB 如何集成多种功能，如实时事务处理、在线分析、无分片架构和向量类型计算，所有这些都旨在减少开发人员的认知负荷。

Speakers

Qizhi Wang

TiDB Ecosystem Software Architect and Senior Developer Advocate, PingCAP

Qizhi is a TiDB Ecosystem Software Architect & Senior Developer Advocate at PingCAP, the company behind TiDB. In this role, he focuses on ecosystem development and has been instrumental in integrating TiDB with various platforms such as AWS, GORM, MySQL Connector/J, Hibernate, DBeaver...

KubeCon + CloudNativeCon Sessions, Data Processing + Storage

Experience Level | 内容经验水平 任意程度 (Any)
Language | 语言 中文 (Chinese)

16:05 HKT

Unlocking LLM Performance with eBPF: Optimizing Training and Inference Pipelines | 通过eBPF解锁LLM性能：优化训练和推理管道 - Yang Xiang, Yunshan Networks, Inc.

Friday August 23, 2024 16:05 - 16:40 HKT

Level 1 | Hung Hom Room 2

The training and inference processes of Large Language Models (LLMs) involve handling vast amounts of model data and training data, and consume significant GPU compute resources. However, enhancing GPU utilization becomes extremely challenging in the absence of observability. This presentation will introduce how to achieve observability in LLM training and inference processes with zero disruption using eBPF. This includes utilizing Memory Profiling to understand the loading performance of models and training data, Network Profiling to comprehend the data exchange performance, and GPU Profiling to analyze GPU's MFU (Model FLOPs Utilization) and performance bottlenecks. Additionally, we will share the practical effects of implementing observability in a PyTorch LLM application and the llm.c project using eBPF, aiming to enhance training and inference performance.

大型语言模型（LLMs）的训练和推断过程涉及处理大量的模型数据和训练数据，并消耗大量的GPU计算资源。然而，在缺乏可观察性的情况下，提高GPU利用率变得极具挑战性。本次演讲将介绍如何利用eBPF在LLM训练和推理过程中实现零中断的可观察性。这包括利用内存分析来了解模型和训练数据的加载性能，网络分析来理解数据交换性能，以及GPU分析来分析GPU的MFU（模型FLOPs利用率）和性能瓶颈。此外，我们将分享在PyTorch LLM应用程序和llm.c项目中使用eBPF实现可观察性的实际效果，旨在提高训练和推理性能。
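For readers unfamiliar with MFU (Model FLOPs Utilization), the metric is the ratio of the FLOPs a job actually performs per second to the hardware's theoretical peak. The sketch below is only an illustration under stated assumptions, not the speaker's eBPF-based measurement: it uses the common approximation of about 6 FLOPs per parameter per training token (which ignores attention FLOPs), and the function name and all input figures are hypothetical.

```python
def model_flops_utilization(num_params, tokens_per_second,
                            peak_flops_per_second, num_gpus=1):
    """Estimate training MFU as achieved FLOPs/s divided by peak FLOPs/s.

    Assumes roughly 6 FLOPs per parameter per token (forward plus backward
    pass), a common approximation that ignores attention FLOPs.
    """
    achieved_flops_per_second = 6 * num_params * tokens_per_second
    return achieved_flops_per_second / (peak_flops_per_second * num_gpus)


# Hypothetical example: a 7B-parameter model processing 2,000 tokens/s
# on one accelerator with an assumed peak of 312 TFLOPS.
mfu = model_flops_utilization(
    num_params=7e9,
    tokens_per_second=2_000,
    peak_flops_per_second=312e12,
)
print(f"MFU: {mfu:.1%}")
```

A low MFU on its own does not say where time is lost, which is why the abstract pairs it with memory, network, and GPU profiling.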

Speakers

Yang Xiang

VP of Engineering, Yunshan Networks, Inc.

Yang Xiang received a Ph.D. from Tsinghua University and currently serves as VP of Engineering at Yunshan Networks and head of the DeepFlow open-source community. He has presented academic papers on topics such as application observability and network measurement at top international academic...

KubeCon + CloudNativeCon Sessions, Observability

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 中文 (Chinese)