Name: Simplify AI Infrastructure with Kubernetes Operators | 使用Kubernetes Operators简化AI基础设施 - Ganeshkumar Ashokavardhanan, Microsoft & Tariq Ibrahim US, NVIDIA
Start: 2024-08-21T16:25:00+0800
End: 2024-08-21T17:00:00+0800

In-person
21-23 August, 2024
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.

亲临现场

2024年8月21-23日

了解更多并注册参加

Sched应用程序允许您创建自己的日程安排，但不能替代您的活动注册。您必须注册参加KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024，才能参加会议。如果您尚未注册但希望加入我们，请访问活动注册页面购买注册。

请注意：本日程自动显示为香港标准时间（UTC +8）。要查看您偏好的时区的日程，请从右侧“按日期筛选”上方的下拉菜单中选择。日程可能会有变动，会议席位先到先得。

Wednesday August 21, 2024 4:25pm - 5:00pm HKT

Level 1 | Hung Hom Room 3

ML applications often require specialized hardware and additional configuration to run efficiently and reliably on Kubernetes. However, managing the cluster lifecycle, the diversity and complexity of hardware configuration across nodes can be challenging. How can we simplify and automate this process to ensure a smooth experience for kubernetes users? Kubernetes Operators offer a great solution. In this session, we will go over operators and demonstrate how they can help automate the installation, configuration, and lifecycle management of AI-ready infra end to end from cluster provisioning and k8s node configuration to deep learning model deployments. We will demo a fine-tuning LLM workload, to showcase how existing operators in the ecosystem such as Cluster API Operator, GPU Operator, Network Operator, and the Kubernetes AI Toolchain Operator, can be used to simplify the infra. Finally, we will discuss challenges and best practices of using operators in production.

ML 应用通常需要专门的硬件和额外的配置才能在 Kubernetes 上高效可靠地运行。然而，管理集群生命周期、节点间硬件配置的多样性和复杂性可能具有挑战性。我们如何简化和自动化这个过程，以确保 Kubernetes 用户的顺畅体验？ Kubernetes 运算符提供了一个很好的解决方案。在本场演讲中，我们将介绍运算符，并演示它们如何帮助自动化 AI-ready 基础架构的安装、配置和生命周期管理，从集群提供和 k8s 节点配置到深度学习模型部署。我们将演示一个微调 LLM 工作负载，展示生态系统中现有运算符（如 Cluster API Operator、GPU Operator、Network Operator 和 Kubernetes AI Toolchain Operator）如何简化基础架构。最后，我们将讨论在生产环境中使用运算符的挑战和最佳实践。

Speakers

Ganeshkumar Ashokavardhanan

Software Engineer, Microsoft

Ganesh is a Software Engineer on the Azure Kubernetes Service team at Microsoft, working on node lifecycle, and is the lead for the GPU workload experience on this kubernetes platform. He collaborates with partners in the ecosystem like NVIDIA to support operator models for machine... Read More →

Tariq Ibrahim US

Senior Cloud Platform Engineer, NVIDIA

Tariq Ibrahim is a Senior Cloud Platform Engineer on the Cloud Native team at NVIDIA where he works on enabling GPUs in containers and Kubernetes. He is a maintainer of the NVIDIA GPU Operator. He has also contributed to several cloud native OSS projects like kube-state-metrics, Istio... Read More →

Wednesday August 21, 2024 4:25pm - 5:00pm HKT
Level 1 | Hung Hom Room 3

AI_dev: Open Source GenAI & ML Summit Sessions, Foundations + Frameworks + Tools for Machine Learning

Experience Level | 内容经验水平 中级 (Intermediate)
Language | 语言 英语 (English)

KubeCon + CloudNativeCon + Open Source Summit + AI_dev China 2024

Ganeshkumar Ashokavardhanan

Tariq Ibrahim US

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!