Multi-GPU Training with PyTorch: Data and Model Parallelism

About

The material in this repo demonstrates multi-GPU training using PyTorch. Part 1 covers how to optimize single-GPU training. The later parts show the code changes needed to enable multi-GPU training using the data-parallel and model-parallel approaches. This workshop aims to prepare researchers to use the new H100 GPU nodes as part of Princeton Language and Intelligence.
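
As a rough illustration of the data-parallel idea, the sketch below uses plain Python rather than PyTorch so that it runs without a GPU. It is not code from this repo: the function names (`local_gradient`, `all_reduce_mean`, `data_parallel_step`) and the toy model y = w * x are assumptions made for the example. Each simulated worker computes a gradient on its own shard of the batch, the gradients are averaged (the role an all-reduce plays on real hardware), and every worker applies the same update.

```python
# Conceptual sketch of data parallelism in plain Python (no PyTorch, no GPUs).
# All names here are hypothetical and chosen only for illustration.

def local_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: every worker receives the average."""
    return sum(values) / len(values)

def data_parallel_step(w, batch, world_size, lr=0.01):
    """One SGD step with the batch split evenly across `world_size` workers."""
    assert len(batch) % world_size == 0, "this sketch assumes equal-sized shards"
    # Worker `rank` takes every world_size-th sample, starting at index `rank`.
    shards = [batch[rank::world_size] for rank in range(world_size)]
    # On real hardware these gradients would be computed in parallel.
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w_single = data_parallel_step(0.0, batch, world_size=1)
w_multi = data_parallel_step(0.0, batch, world_size=2)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so `w_single` and `w_multi` agree up to floating-point rounding. This is why a data-parallel run can reproduce a single-GPU run with the same global batch size.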

Setup

Make sure you can run Python on Adroit:

$ ssh <YourNetID>@adroit.princeton.edu  # VPN required if off-campus
$ git clone https://github.com/PrincetonUniversity/multi_gpu_training.git
$ cd multi_gpu_training
$ module load anaconda3/2023.9
(base) $ python --version
Python 3.11.5
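
Beyond checking the version, a short script can report whether PyTorch is importable and how many GPUs it can see. The helper below is a hypothetical sketch, not part of the repo. Whether the loaded environment provides PyTorch is not stated here, so the sketch handles both cases.

```python
# Hypothetical helper, not part of the repo: summarize the Python environment.
import importlib.util
import sys

def describe_environment():
    """Return the Python version and, if PyTorch is installed, the visible GPU count."""
    info = {
        "python": ".".join(str(n) for n in sys.version_info[:3]),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "gpu_count": None,  # stays None when PyTorch is unavailable
    }
    if info["torch_installed"]:
        try:
            import torch
            info["gpu_count"] = torch.cuda.device_count()
        except Exception:
            pass  # a broken install is reported as gpu_count=None
    return info

print(describe_environment())
```

On a login node without GPUs, a `gpu_count` of 0 or `None` is expected. The count is only meaningful inside a job that has been allocated GPUs.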

Getting Help

If you encounter any difficulties with the material in this guide, please send an email to [email protected] or attend a help session.

Authorship

This guide was created by Mengzhou Xia, Alexander Wettig and Jonathan Halverson. Members of Princeton Research Computing made contributions to this material.

Folders and files

01_single_gpu
02_pytorch_ddp
03_pytorch_lightning
04_model_parallel_with_fsdp
tensorflow
README.md
