Attending this event?
In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

KubeCon + CloudNativeCon Sessions
Wednesday, August 21
 

11:00 HKT

Accelerating Serverless AI Large Model Inference with Functionalized Scheduling and RDMA | 通过功能化调度和RDMA加速无服务器AI大模型推理 - Yiming Li, Tianjin University & Chenglong Wang, Jinan Inspur Data Technology Co., Ltd.
Wednesday August 21, 2024 11:00 - 11:35 HKT
Deploying large AI models on standard serverless inference platforms like KServe is gaining popularity because it improves resource utilization and reduces costs. However, existing large-model inference faces significant scheduling and communication bottlenecks, making it difficult to meet low-latency and high-throughput demands. Kubernetes' centralized control plane limits scheduling efficiency and cannot respond within seconds to large-scale bursts of requests. Additionally, large-model inference needs to transfer a GB-scale KV cache for each request, resulting in high communication overhead. To address this, we have developed a highly elastic functionalized scheduling framework that guarantees second-level scheduling for thousands of serverless AI large-model inference task instances, and we leverage RDMA technology for high-speed KV cache migration, avoiding the high overhead of traditional network protocol stacks.
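For context on the communication bottleneck the abstract mentions, here is a minimal back-of-the-envelope sketch (not from the talk) of why the per-request KV cache reaches GB scale; the model dimensions are assumptions roughly in the range of a 70B-parameter transformer.

```python
# Rough KV-cache sizing; all numbers are illustrative assumptions, not from the talk.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x accounts for the separate key and value tensors kept per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"KV cache for one 4K-token request: {size / 1024**3:.2f} GiB")  # ~1.25 GiB
```

Moving that much state per request is why the abstract reaches for RDMA rather than the regular network protocol stack.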

Speakers

Cookie

Senior Software Engineer, Jinan Inspur Data Technology Co., Ltd.
I'm employed at Inspur. I mainly do container computing related development and am familiar with container networks, especially Calico and Cilium. I'm also a contributor to the OpenYurt community and mainly participate in the development of the Raven project.

Yiming Li

PhD candidate, Tianjin University
Yiming Li received his bachelor's and master's degrees from Tianjin University, China, in 2017 and 2019, respectively. He is currently pursuing a Ph.D. with the College of Intelligence and Computing, Tianjin University, China. His research interests include cloud com...
Wednesday August 21, 2024 11:00 - 11:35 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

11:50 HKT

AI Inference Performance Acceleration: Methods, Tools, and Deployment Workflows | AI推理性能加速:方法、工具和部署工作流程 - Yifei Zhang & 磊 钱, Bytedance
Wednesday August 21, 2024 11:50 - 12:25 HKT
As AI rapidly evolves and embraces cloud-native technologies, inference performance has become crucial for application value. GPU selection, serving framework configuration, and model/data loading significantly impact inference efficiency. We'll focus on cloud-native solutions to storage performance issues and tools for evaluating inference performance across configurations, offering optimal deployment setups integrated into cloud-native workflows. We'll discuss inference performance's impact on user experience and how optimization can reduce costs and improve efficiency. Using technologies like Fluid and model optimization, we'll share strategies to enhance inference performance. Based on performance and cost analysis of various GPUs, we'll guide AI engineers in hardware selection. Additionally, we'll introduce a performance testing tool to evaluate and recommend the best model, hardware, and acceleration scheme combinations, aligning with deployment workflows based on test results.
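As one concrete illustration of the cloud-native storage angle, here is a hedged sketch of registering a model repository as a Fluid Dataset so weights can be cached near the GPU nodes. The field names follow the data.fluid.io/v1alpha1 CRDs as commonly documented, and the bucket path and namespace are made up; verify against your installed Fluid version.

```python
# Hedged sketch: declare model weights as a Fluid Dataset so they can be cached in-cluster.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

dataset = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "llm-weights", "namespace": "inference"},
    "spec": {
        # Assumed: model weights live in an S3-compatible bucket (placeholder path)
        "mounts": [{"mountPoint": "s3://models/llama", "name": "llama"}],
    },
}

api.create_namespaced_custom_object(
    group="data.fluid.io", version="v1alpha1",
    namespace="inference", plural="datasets", body=dataset,
)
```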

Speakers

Yifei Zhang

Software Engineer, Bytedance
Yifei Zhang, Software Engineer at Volcengine, focuses on technical research and product development in Kubernetes and AI. He has rich experience in public cloud and is now fully working on VKE (Volcengine Kubernetes Engine), the managed Kubernetes product of Volcengine...

钱磊

Software Engineer, Bytedance
A Kubernetes developer at Bytedance, focused on building a stable Kubernetes engine on public cloud.
Wednesday August 21, 2024 11:50 - 12:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

13:50 HKT

Boundaryless Computing: Optimizing LLM Performance, Cost, and Efficiency in Multi-Cloud Architecture | 无边界计算:在多云架构中优化LLM性能、成本和效率 - Jian Zhu, Red Hat & Kai Zhang, Alibaba Cloud Intelligence
Wednesday August 21, 2024 13:50 - 14:25 HKT
For large language model (LLM) inference, GPU resources within a single data center or cloud region often cannot meet all user demands. Additionally, for the end-users, deploying across multiple geographic regions is necessary to provide an optimal user experience. However, managing model distribution, synchronization, and consistency across multiple regions presents new challenges. To address this, the OCM and Fluid communities have collaborated to automate the multi-region distribution of inference applications through OCM's multi-cluster application deployment capabilities, combined with Fluid's data orchestration capabilities. This automation facilitates the cross-regional distribution and pre-warming of large models, enhancing the efficiency of model deployment and upgrades.
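To make the multi-cluster mechanics more tangible, below is a hedged sketch (not the speakers' implementation) of an OCM Placement that selects GPU clusters across regions; the inference workload and prewarmed data would then be fanned out to the selected clusters via ManifestWork. The labels and cluster count are hypothetical.

```python
# Hedged sketch: an OCM Placement that picks two GPU-capable member clusters.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

placement = {
    "apiVersion": "cluster.open-cluster-management.io/v1beta1",
    "kind": "Placement",
    "metadata": {"name": "llm-inference-regions", "namespace": "default"},
    "spec": {
        "numberOfClusters": 2,
        "predicates": [{
            "requiredClusterSelector": {
                # Assumed label applied to GPU clusters; adjust to your environment
                "labelSelector": {"matchLabels": {"accelerator": "nvidia-gpu"}},
            },
        }],
    },
}

api.create_namespaced_custom_object(
    group="cluster.open-cluster-management.io", version="v1beta1",
    namespace="default", plural="placements", body=placement,
)
```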

Speakers

Kai Zhang

Senior Staff Engineer, Alibaba
Kai Zhang is a Senior Staff Engineer at Alibaba Cloud Intelligence, where he has been part of the team developing the Alibaba Cloud container service for Kubernetes (ACK) for over 6 years. He currently leads ACK’s Cloud native AI product and solution offerings. Before this, he spent...

Jian Zhu

Senior Software Engineer, Red Hat
Jian Zhu is a senior software engineer at Red Hat and a core contributor to the Open Cluster Management project. Jian enjoys solving multi-cluster workload distribution problems and extending OCM with add-ons.
Wednesday August 21, 2024 13:50 - 14:25 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

14:40 HKT

Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience | 连接点:走向统一的多集群AI/ML体验 - Qing Hao, Red Hat & Chen Yu, Microsoft
Wednesday August 21, 2024 14:40 - 15:15 HKT
Today, cloud-native infrastructure is vital for AI/ML; administrative complexity and the growing demand for compute resources drive developers towards multi-cluster patterns. Batch scheduling projects like Kueue are valuable for efficient AI/ML training in a single Kubernetes cluster. Multi-cluster management platforms like OCM and Fleet simplify cluster management and provide advanced scheduling features. We hope to bridge the best of both worlds to simplify user operations and reduce confusion between different systems. In this talk, we will show how SIG Multi-Cluster's newly proposed ClusterProfile API, combined with OCM, Fleet, and Kueue, addresses these challenges. We will demonstrate that MultiKueue setup can be easily automated with the ClusterProfile API; with a few tweaks, users can apply OCM and Fleet's advanced scheduling features through MultiKueue to intelligently place AI/ML jobs across clusters, maximizing utilization of resources such as GPUs and reducing costs.
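A hedged sketch of the automation idea: enumerate ClusterProfile objects and create one MultiKueueCluster per member cluster so MultiKueue can dispatch jobs there. The API groups, versions, and the kubeconfig-Secret naming convention are assumptions based on the upstream proposals, not the speakers' code.

```python
# Hedged sketch: turn registered ClusterProfiles into MultiKueueCluster objects.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

profiles = api.list_cluster_custom_object(
    group="multicluster.x-k8s.io", version="v1alpha1", plural="clusterprofiles")

for profile in profiles.get("items", []):
    name = profile["metadata"]["name"]
    mk_cluster = {
        "apiVersion": "kueue.x-k8s.io/v1alpha1",
        "kind": "MultiKueueCluster",
        "metadata": {"name": name},
        "spec": {
            # Assumed convention: one kubeconfig Secret per member cluster
            "kubeConfig": {"locationType": "Secret", "location": f"{name}-kubeconfig"},
        },
    }
    api.create_cluster_custom_object(
        group="kueue.x-k8s.io", version="v1alpha1",
        plural="multikueueclusters", body=mk_cluster)
```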

Speakers

Qing Hao

Senior Software Engineer, Red Hat
Qing Hao is a senior software engineer at Red Hat, where she is a maintainer of Open Cluster Management. Qing is interested in solving complex problems in the multi-cluster area, e.g., application scheduling and rolling upgrades of management components. Prior to Red Hat, she worked...

Chen Yu

Senior Software Engineer, Microsoft
Chen Yu is a senior software engineer at Microsoft with a keen interest in cloud-native computing. He is currently working on Multi-Cluster Kubernetes and contributing to the Fleet project open-sourced by Azure Kubernetes Service.
Wednesday August 21, 2024 14:40 - 15:15 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

15:35 HKT

How Fast Can Your Model Composition Run in Serverless Inference? | 您的模型组合在无服务器推理中可以运行多快? - Fog Dong, BentoML & Wenbo Qi, Ant Group
Wednesday August 21, 2024 15:35 - 16:10 HKT
Are you struggling with slow deployment times, high operational costs, or scalability issues when serving your ML models? Now, imagine the added complexity when typical AI apps require not just one, but an interconnected suite of models. In this session, discover how the integration of BentoML with Dragonfly effectively addresses these challenges, transforming the landscape of multi-model composition and inference within serverless Kubernetes environments. Join the co-presentation by the BentoML and Dragonfly communities to explore a compelling case study: a RAG app that combines 3 models: LLM, embedding, and OCR. Learn how our framework not only packages these diverse models efficiently but also utilizes Dragonfly's innovative P2P network for swift distribution. We'll further delve into how other open-source technologies like JuiceFS and vLLM have enabled us to achieve remarkable deployment times of just 40 seconds and establish a scalable blueprint for multi-model composition deployments.
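To show what a three-model composition looks like in code, here is a small hypothetical sketch of the OCR, embedding, and LLM request path; the callables and vector lookup are stand-ins, not BentoML or Dragonfly APIs.

```python
# Hypothetical RAG composition: OCR -> embedding/retrieval -> LLM, behind one endpoint.
from dataclasses import dataclass

@dataclass
class RAGPipeline:
    ocr: callable        # image bytes -> text
    embed: callable      # text -> vector
    retrieve: callable   # vector -> list of context passages
    llm: callable        # prompt -> answer

    def answer(self, scanned_doc: bytes, question: str) -> str:
        text = self.ocr(scanned_doc)                # model 1: OCR
        passages = self.retrieve(self.embed(text))  # model 2: embedding + lookup
        context = "\n".join(passages)
        prompt = f"Context:\n{context}\n\nQ: {question}\nA:"
        return self.llm(prompt)                     # model 3: LLM
```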

Speakers

Wenbo Qi

Senior Software Engineer, Ant Group
Wenbo Qi is a software engineer at Ant Group working on Dragonfly. He is a maintainer of Dragonfly. He hopes to make positive contributions to open source software and believes that fear springs from ignorance.

Fog Dong

Senior Software Engineer, BentoML
Fog Dong, a Senior Engineer at BentoML, KubeVela maintainer, CNCF Ambassador, and LFAPAC Evangelist, has a rich background in cloud native. Previously instrumental in developing Alibaba's large-scale Serverless Application Engine workflows and Bytedance's cloud-native CI/CD platform...
Wednesday August 21, 2024 15:35 - 16:10 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

16:25 HKT

Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training | 利用拓扑建模和拓扑感知调度加速LLM训练 - Yang Wang, Huawei
Wednesday August 21, 2024 16:25 - 17:00 HKT
In the era of LLM training and inference, the bottleneck has shifted from computing to the network. Many high-throughput, low-latency interconnect technologies are widely used, e.g., NVLink and NVSwitch, to build hyper-computers such as NVIDIA SuperPOD, Google multi-slice, and AWS placement groups. However, Kubernetes has not yet addressed topology awareness efficiently, resulting in low performance when sub-optimal resources are provisioned. This talk will explore inter-node communication and intra-node resource interconnects, and analyze how these two topological factors impact the runtime performance of AI workloads, especially large language model training. The talk will cover: - How to model the topology of underlying resources like NUMA, Rack, Super Pod, and Hyper Computer - How to make the scheduler topology-aware and produce the best scheduling decisions - How to coordinate topology-aware scheduling with DRA on the node
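A hedged, self-contained sketch of the topology-modeling idea (not Volcano's or the kube-scheduler's implementation): record each candidate node's position in a hierarchy and prefer placements whose nodes share the narrowest domain. All labels below are hypothetical.

```python
# Toy topology model: narrower shared domains imply better interconnect locality.
LEVELS = ["numa", "node", "rack", "superpod"]  # assumed hierarchy, narrowest first

def common_level(node_labels: list[dict]) -> str:
    """Return the narrowest topology level shared by every node in the placement."""
    for level in LEVELS:
        if len({labels[level] for labels in node_labels}) == 1:
            return level
    return "cluster"

# Two hypothetical placements for a 4-node training job:
same_rack = [
    {"numa": f"n{i}-numa0", "node": f"n{i}", "rack": "rack-1", "superpod": "sp-1"}
    for i in range(4)
]
cross_rack = [
    {"numa": f"n{i}-numa0", "node": f"n{i}", "rack": f"rack-{i}", "superpod": "sp-1"}
    for i in range(4)
]

print(common_level(same_rack))   # "rack": all traffic stays inside one rack
print(common_level(cross_rack))  # "superpod": more cross-rack traffic, slower training
```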

Speakers

Yang Wang

Senior engineer and maintainer of Volcano, Huawei Cloud Technologies Co., LTD
Volcano maintainer and speaker at KCD and GOTC. Focused on cloud native scheduling and multi-cluster management.
Wednesday August 21, 2024 16:25 - 17:00 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML

17:15 HKT

Leveraging Wasm for Portable AI Inference Across GPUs, CPUs, OS & Cloud-Native Environments | 利用Wasm在GPU、CPU、操作系统和云原生环境中进行可移植的AI推理 - Miley Fu & Hung-Ying Tai, Second State
Wednesday August 21, 2024 17:15 - 17:50 HKT
This talk will focus on the advantages of using WebAssembly (Wasm) for running AI inference tasks in a cloud-native ecosystem. We will explore how Wasm empowers developers to develop on their own PC and have their AI inference run uniformly across different hardware, including GPUs and CPUs, operating systems, edge clouds, etc. We'll discuss how Wasm and Wasm runtimes facilitate seamless integration into cloud-native frameworks, enhancing the deployment and scalability of AI applications. This presentation will specifically highlight how Wasm provides a flexible, efficient solution suitable for diverse cloud-native architectures, including Kubernetes, allowing developers to fully tap the potential of LLMs, especially open source LLMs. The session offers insights into maximizing the potential of AI applications by leveraging the cross-platform capabilities of Wasm, ensuring consistency, low cost, and efficiency in AI inference across different computing environments.
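One way to picture the portability claim: the same client code can talk to a Wasm-based LLM server (for example, a llama-api-server running under WasmEdge) on a laptop, an edge box, or a Kubernetes pod, because such servers typically expose an OpenAI-compatible HTTP API. The URL, port, and model name below are assumptions for a local setup, not values from the talk.

```python
# Hedged sketch: query a locally running, Wasm-hosted LLM server over its HTTP API.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local endpoint
    json={
        "model": "llama-3-8b-instruct",  # placeholder model name
        "messages": [{"role": "user", "content": "Say hello from Wasm"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```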

Speakers

Hung-Ying Tai

Software Engineer, Second State
Hung-Ying is a maintainer of the WasmEdge project and a pioneer in compiler optimization and virtual machine design. He is a prolific open-source contributor, participating in many open-source projects, including go-ethereum, solidity, SOLL, crun, and WasmEdge.

Miley Fu

CNCF Ambassador, Founding member at WasmEdge, Second State Inc
Miley is a Developer Advocate with a passion for empowering developers to build and contribute to open source. With over 5 years of experience working on the WasmEdge runtime in the CNCF sandbox as a founding member, she has spoken at KubeCon, KCD Shenzhen, CloudDay Italy, DevRelCon, Open Source...
Wednesday August 21, 2024 17:15 - 17:50 HKT
Level 1 | Hung Hom Room 7
  KubeCon + CloudNativeCon Sessions, AI + ML
 
Friday, August 23
 

10:35 HKT

Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano | 通过Volcano增强的智能基础设施优化LLM工作流程 - Xin Li, qihoo360 & William Wang, Huawei Cloud Technologies Co., LTD
Friday August 23, 2024 10:35 - 11:10 HKT
As Large Language Models (LLMs) revolutionize various aspects of our lives, many companies build their own cloud native AI platforms to train and fine-tune LLMs. However, managing large-scale LLM training and inference platforms presents even more critical challenges, such as training efficiency, fault tolerance, resource fragmentation, operational costs, and topology-aware scheduling on racks and supernodes. In this session, the speakers will share insights from their experience using a Kubernetes-based smart infrastructure, enhanced by Volcano, to manage thousands of GPUs and handle monthly workloads involving thousands of LLM training and inference jobs at qihoo360. This talk will cover: fault detection, fast job recovery, and self-healing that drastically improve efficiency; dealing with long downtime in LLM training on heterogeneous GPUs; intelligent GPU workload scheduling to reduce resource fragmentation and costs; and topology-aware scheduling on rack/supernode to accelerate LLM training.
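For readers unfamiliar with Volcano, here is a hedged sketch of submitting a gang-scheduled training job with a restart policy, the kind of primitive the fault-recovery and scheduling work described above builds on. Field names follow batch.volcano.sh/v1alpha1 as commonly documented; the image, queue, and resource numbers are placeholders, not the qihoo360 configuration.

```python
# Hedged sketch: a gang-scheduled Volcano Job that restarts when a worker pod is evicted.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

vcjob = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "llm-pretrain", "namespace": "training"},
    "spec": {
        "schedulerName": "volcano",
        "queue": "llm",
        "minAvailable": 8,  # gang scheduling: start only when all 8 workers can run
        "policies": [{"event": "PodEvicted", "action": "RestartJob"}],
        "tasks": [{
            "name": "worker",
            "replicas": 8,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": "registry.example.com/llm-trainer:latest",  # placeholder
                        "resources": {"limits": {"nvidia.com/gpu": 8}},
                    }],
                },
            },
        }],
    },
}

api.create_namespaced_custom_object(
    group="batch.volcano.sh", version="v1alpha1",
    namespace="training", plural="jobs", body=vcjob)
```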

Speakers

Xin Li

Senior Engineer of Server Development, qihoo360
Xin Li is a seasoned senior back-end developer and an approver for the Volcano project, with a keen focus on Kubernetes and AI. The infrastructure he is responsible for supports the training and inference of 360GPT. Moreover, Xin Li delves deeply into optimizing distributed...
Friday August 23, 2024 10:35 - 11:10 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

11:25 HKT

LLM's Anywhere: Browser Deployment with Wasm & WebGPU | LLM随处可用:使用Wasm和WebGPU进行浏览器部署 - Joinal Ahmed, Navatech Group & Nikhil Rana, Google Cloud
Friday August 23, 2024 11:25 - 12:00 HKT
In today's interconnected world, deploying and accessing machine learning (ML) models efficiently poses significant challenges. Traditional methods rely on cloud GPU clusters and constant internet connectivity. However, WebAssembly (Wasm) and WebGPU technologies are revolutionizing this landscape. This talk explores leveraging Wasm and WebGPU for deploying Single Layer Models (SLMs) directly within web browsers, eliminating the need for extensive cloud GPU clusters and reducing reliance on constant internet access. We showcase practical examples and discuss how Wasm enables efficient cross-platform ML model execution, while WebGPU optimizes parallel computation within browsers. Join us to discover how this fusion empowers developers and users alike with unprecedented ease and efficiency in browser-based ML, while reducing dependence on centralized cloud infrastructure and internet connectivity constraints.

Speakers

Joinal Ahmed

AI Architect, Navatech Group
Joinal is a seasoned Data Science expert passionate about rapid prototyping, community involvement, and driving technology adoption. With a robust technical background, he excels in leading diverse teams through ML projects, recruiting and mentoring talent, optimizing workflows, and...

Nikhil Rana

AI Consultant, Google Cloud
Nikhil is an applied data science professional with over a decade of experience in developing and implementing Machine learning, Deep Learning, and NLP-based solutions for a variety of industries like Finance, FMCG, etc. He is a passionate advocate for the use of data science to solve...
Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 3
  KubeCon + CloudNativeCon Sessions, AI + ML

11:25 HKT

New Advances for Cross-Platform AI Applications in Docker | Docker中跨平台AI应用程序的新进展 - Michael Yuan, Second State
Friday August 23, 2024 11:25 - 12:00 HKT
The talk proposes to delve into novel methods for enhancing cross-platform GPU/AI workloads within container ecosystems, with a specific emphasis on Docker's incorporation of the WebGPU standard. This standard empowers containerized applications to utilize host GPUs and additional AI accelerators via a flexible API. Consequently, there's no longer a necessity to construct Docker images tailored to individual GPU vendors and their proprietary drivers. The presentation will feature a demonstration highlighting how the WasmEdge project capitalizes on the WebGPU standard to craft portable LLM inference applications in Rust. Additionally, Docker's seamless management and orchestration of these applications will be showcased.

Speakers

Michael Yuan

Product Manager, Second State
Dr. Michael Yuan is a maintainer of WasmEdge Runtime (a project under CNCF) and a co-founder of Second State. He is the author of 5 books on software engineering published by Addison-Wesley, Prentice-Hall, and O'Reilly. Michael is a long-time open-source developer and contributor...
Friday August 23, 2024 11:25 - 12:00 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

14:10 HKT

Model Service Mesh: A New Paradigm for Large-Scale AI Model Service Deployment and Management | 模型服务网格:大规模AI模型服务部署和管理的新范式 - Xi Ning Wang, Alibaba Cloud & Huailong Zhang, Intel China
Friday August 23, 2024 14:10 - 14:45 HKT
As AI/ML models grow in scale and complexity, how to efficiently deploy and manage model service in cloud-native environments has become a significant challenge. This proposal will introduce the Model Service Mesh (MSM), an emerging architectural paradigm designed specifically for large-scale AI model service deployment and management, to address the challenge. This new paradigm focuses on: 1. How to build a highly scalable and reliable model delivery system and the key features include dynamic model service routing, unified management for multi-models within single endpoint, an optimized caching layer, and cache-aware scheduling,etc. 2. How to leverage the MSM to optimize AI models service in lifecycle management, resource utilization improvement, security enhancement, and observability and resilience insurance. In essence, this architecture ensures a scalable, secure, and efficient model service in cloud native environment.
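As a small illustration of the cache-aware scheduling mentioned in point 1, here is a hypothetical routing sketch that prefers replicas already holding the requested model and falls back to the least-loaded one; it is not the MSM implementation.

```python
# Hypothetical cache-aware router: warm replicas first, then least-loaded.
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    cached_models: set = field(default_factory=set)
    inflight: int = 0

def pick_replica(replicas: list[Replica], model: str) -> Replica:
    warm = [r for r in replicas if model in r.cached_models]
    pool = warm or replicas                      # prefer cache hits, else any replica
    return min(pool, key=lambda r: r.inflight)   # break ties by current load

replicas = [
    Replica("a", {"llama-7b"}, inflight=3),
    Replica("b", {"qwen-14b"}, inflight=1),
    Replica("c", set(), inflight=0),
]
print(pick_replica(replicas, "llama-7b").name)  # "a": a warm cache wins despite load
```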

Speakers

王夕宁

Technical Leader, Alibaba Cloud
Wang Xining, senior technical expert at Alibaba Cloud and technical leader of ACK (Kubernetes)/ASM (Service Mesh), focuses on Kubernetes, service mesh, and other cloud native fields. He previously worked at IBM as a tech architect focusing on SOA/Cloud and served as the chairman of the...

Huailong Zhang

Cloud Software Engineer, Intel China
Steve (Huailong) Zhang has worked at Alcatel-Lucent, Baidu, and IBM on cloud computing research and development. Huailong is currently working at Intel China as a cloud-native software engineer, focusing on cloud-native technical fields such as Kubernetes and service mesh...
Friday August 23, 2024 14:10 - 14:45 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML

15:15 HKT

No More Runtime Setup! Let's Bundle, Distribute, Deploy, Scale LLMs Seamlessly with Ollama Operator | 无需运行时设置!让我们使用Ollama Operator轻松捆绑、分发、部署、扩展LLMs - Fanshi Zhang, DaoCloud
Friday August 23, 2024 15:15 - 15:50 HKT
Seeking a way to ship LLMs more seamlessly? Is it far too complicated to manage, compose, and set up a runtime with Python, C++, CUDA, and GPUs when deploying LLMs? Tired of fighting dependencies, model sizes, and syncing deliverable model images across nodes? It's true that people often find it hard to bundle, distribute, deploy, and scale their own LLM workloads, but no worries: here is Ollama Operator, a scheduler and utilizer for LLM models powered by the Modelfile format introduced by Ollama. You can now enjoy a unified, bundled runtime powered by llama.cpp with a few lines of CRD definition, or with a single command of the natively included kollama CLI; bundling, distributing, deploying, and scaling LLMs can now be accomplished easily and seamlessly across operating systems and environments. Let's dive in and find out what Ollama Operator and Ollama can do to deploy our own large language models, and how we can combine these features with Modelfile and bring them into the Kubernetes world!
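For a sense of what "a few lines of CRD definition" can look like, here is a hedged sketch that creates a Model custom resource through the Kubernetes Python client. The ollama.ayaka.io/v1 schema shown is an assumption drawn from the project's documentation, so verify it against the installed CRD, or simply use the kollama CLI instead.

```python
# Hedged sketch: deploy an Ollama model via the operator's assumed Model CRD.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

model = {
    "apiVersion": "ollama.ayaka.io/v1",  # assumed group/version
    "kind": "Model",
    "metadata": {"name": "phi", "namespace": "default"},
    "spec": {"image": "phi"},  # assumed field: the Ollama model/Modelfile image to serve
}

api.create_namespaced_custom_object(
    group="ollama.ayaka.io", version="v1",
    namespace="default", plural="models", body=model)
```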

Speakers

Neko Ayaka

Software Engineer, DaoCloud
Cloud native developer, AI researcher, and Gopher with 5 years of experience across many development fields spanning AI, data science, backend, and frontend. Co-founder of https://github.com/nolebase
Friday August 23, 2024 15:15 - 15:50 HKT
Level 1 | Hung Hom Room 2
  KubeCon + CloudNativeCon Sessions, AI + ML
 
