Virtual Event
November 17–November 20, 2020

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2020 - Virtual to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Wednesday, November 18 • 3:00pm - 3:35pm
High Performance Networking for Distributed DL Training in Production K8s - Nivedita Viswanath & Vatsan Kasturi, NVIDIA


Distributed DL training requires high-performance networks connecting tens, hundreds, or, for certain natural language processing models, even thousands of GPUs. Running these workloads on Kubernetes clusters of GPU-enhanced servers requires careful engineering to avoid bottlenecks at the NICs and in the switching fabric that interconnects the nodes. In this presentation we will describe the design and architecture of an 800-GPU cluster interconnected over a RoCE fabric to achieve line-rate performance between communicating containers in a multi-node job. Topics we will cover include a scalable, cookie-cutter POD design for the DC; a low-latency, one-hop network design that enables NCCL rings to avoid output port congestion; and K8s integration with a multi-homed network for optimal GPU utilization. We will share performance numbers for training workloads from our production clusters.

Speakers
Nivedita Viswanath

Software Engineer, NVIDIA
Nivedita Viswanath is a Senior Systems Software Engineer at Nvidia and is a member of the Nvidia GPU Cloud team. During her time at Nvidia, Nivedita has worked on orchestrating execution of DL workloads on GPU clusters managed by Mesos and Kubernetes, with a focus on batch job scheduling...
Vatsan Kasturi

Architect, Nvidia Inc
Vatsan Kasturi is an Architect at Nvidia and is a member of the Nvidia GPU Cloud team. Vatsan's focus areas are DCs, networking, systems and clusters, and AI/ML workloads, along with work on infrastructure projects.



Wednesday November 18, 2020 3:00pm - 3:35pm EST
Intrado Virtual Event Platform