In-person
21-23 August, 2024

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit + AI_Dev China 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 

Today's era of giant LLMs calls for distributed training. Although countless distributed training frameworks have been published over the past decade, few have excelled in real industrial production, because the quality valued most is often Ease of Use rather than raw performance. Ease of Use rests on two essentials, PyTorch and automatic parallelism: i) the PyTorch ecosystem dominates, accounting for 92% of models on HuggingFace, and ii) giant models cannot be trained without complex nD parallelism. Today this Ease of Use is "broken" in industry-level frameworks, which are either not PyTorch-native (TensorFlow/JAX) or not fully automated (Megatron/DeepSpeed/torch). We propose a novel framework that combines PyTorch nativeness and automatic parallelism to scale LLM training with Ease of Use. Developers write only single-device torch code, and the framework automatically parallelizes it into nD parallelism, handling all the heavy lifting transparently.
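
To make the "single-device code in, nD parallelism out" idea concrete, below is a minimal, purely illustrative sketch; it assumes nothing about the proposed framework's actual API. It takes an ordinary single-device torch module and manually column-shards one nn.Linear across several simulated shards, which corresponds to the tensor-parallel axis of the nD parallelism described above. The TinyMLP model and the shard_linear_columnwise helper are hypothetical names introduced only for this example.

```python
# Toy illustration only -- not the framework's API. It shows the kind of
# transformation an automatic-parallelism layer performs on ordinary
# single-device PyTorch code (here: column-sharding one nn.Linear, i.e.
# tensor parallelism, one axis of nD parallelism). Runs on CPU.
import torch
import torch.nn as nn

# 1) What the developer writes: plain single-device torch code.
class TinyMLP(nn.Module):
    def __init__(self, d_in=8, d_hidden=16, d_out=4):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# 2) Hypothetical helper: split fc1's weight row-wise (i.e. by output
#    columns), compute each shard's partial matmul, then concatenate.
#    Same math as the unsharded layer, but each shard could live on a
#    different device in a real tensor-parallel setup.
def shard_linear_columnwise(linear: nn.Linear, x: torch.Tensor, world_size: int):
    w_shards = torch.chunk(linear.weight, world_size, dim=0)
    b_shards = torch.chunk(linear.bias, world_size, dim=0)
    partials = [x @ w.t() + b for w, b in zip(w_shards, b_shards)]
    return torch.cat(partials, dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    model, x = TinyMLP(), torch.randn(2, 8)
    ref = torch.relu(model.fc1(x))                                  # single-device result
    sharded = torch.relu(shard_linear_columnwise(model.fc1, x, 4))  # "parallelized" result
    print(torch.allclose(ref, sharded, atol=1e-6))                  # True: same math, sharded execution
```

The point of the sketch is that the developer-facing code (TinyMLP) stays plain single-device torch; deciding how to shard, placing the shards, and inserting the required communication is exactly the heavy lifting the abstract says should be handled automatically and transparently.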

Speakers

Hongyu Zhu

Machine Learning System Software Engineer, ByteDance
Hongyu is a Machine Learning System Engineer in the ByteDance AML group, working on systems and compilers for training workloads. He received his PhD from the University of Toronto, where he worked with Professor Gennady Pekhimenko. He is generally interested in machine learning compilers...
Thursday August 22, 2024 11:50am - 12:25pm HKT
Level 1 | Hung Hom Room 3
