In-person
21-23 August, 2024

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Deploying large AI models on standard serverless inference platforms such as KServe is gaining popularity because it improves resource utilization and reduces costs. However, existing large-model inference faces significant scheduling and communication bottlenecks, making it difficult to meet low-latency, high-throughput demands. The centralized Kubernetes control plane limits scheduling efficiency and cannot respond within seconds to large-scale bursts of requests. In addition, large-model inference must transfer gigabytes of KV cache for each request, incurring high communication overhead. To address this, we have developed a highly elastic, functionalized scheduling framework that guarantees second-level scheduling for thousands of serverless AI large-model inference task instances. We also leverage RDMA to achieve high-speed KV cache migration, avoiding the heavy overhead of traditional network protocol stacks.

Speakers

Cookie

Senior Software Engineer, Jinan Inspur Data Technology Co., Ltd.
I'm employed at Inspur. I mainly do container-computing-related development and am familiar with container networks, especially Calico and Cilium. I'm also a contributor to the OpenYurt community, participating mainly in the development of the Raven project.

Yiming Li

PhD candidate, Tianjin University
Yiming Li received the bachelor’s and master’s degrees from Tianjin University, China, in 2017 and 2019, respectively. He is currently pursuing the Ph.D. degree with the College of Intelligence and Computing, Tianjin University, China. His research interests include cloud com...
Wednesday August 21, 2024 11:00am - 11:35am HKT
Level 1 | Hung Hom Room 7
KubeCon + CloudNativeCon Sessions, AI + ML
