- This repository provides an example of reading from a single shared memory tensor from multiple processes (e.g., with DDP).
- Useful for loading a large tensor (e.g., the entire dataset) to the CPU to speed up I/O without incurring Nx memory usage where N is the number of GPUs/processes
- We use the standard `torch.utils.data.DataLoader`, which might make it easier for you to use this in your own code
- Works with `torchrun`
- Does not depend on `detectron2`
- We did not test this script in the multi-node setting; it probably would not work.
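The core idea can be sketched as follows. This is a minimal illustration, not the repo's actual `main-multigpu-shared.py`; the helper names (`_read`, `demo`) are hypothetical. A tensor moved into shared memory with `share_memory_()` is mapped, not copied, by child processes:

```python
import torch
import torch.multiprocessing as mp

def _read(shared, out_q):
    # Child reads directly from the shared storage; no per-process copy is made
    out_q.put(float(shared.sum()))

def demo(num_procs=2):
    # One tensor backed by shared memory, visible to every child process
    data = torch.arange(8, dtype=torch.float32)
    data.share_memory_()          # move the storage to shared memory in-place
    assert data.is_shared()

    ctx = mp.get_context("fork")  # fork is acceptable here since Linux is a requirement
    q = ctx.Queue()
    procs = [ctx.Process(target=_read, args=(data, q)) for _ in range(num_procs)]
    for p in procs:
        p.start()
    sums = [q.get() for _ in procs]
    for p in procs:
        p.join()
    return sums

if __name__ == "__main__":
    print(demo())
```

With DDP/`torchrun` the processes are launched for you, but the mechanism is the same: each rank maps the one shared copy instead of materializing its own.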
(`N` is the number of GPUs/processes)

- Run `torchrun --standalone --nproc_per_node=N main-multigpu-naive.py`
- Look at the memory usage.
- Run `torchrun --standalone --nproc_per_node=N main-multigpu-shared.py`
- Look at the memory usage again.
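When comparing the two runs, note that RSS counts shared pages once in every process, so it overstates usage in the shared case; USS (unique set size) better reflects actual duplication. A minimal way to check per-process numbers with `psutil` (which the repo installs); the helper name `mem_mb` is hypothetical:

```python
import os
import psutil

def mem_mb(pid=None):
    """Return (rss, uss) of a process in MiB.

    RSS includes pages shared with other processes, so it is inflated
    in the shared-memory case; USS counts only memory unique to the
    process. memory_full_info() needs Linux (it reads /proc/<pid>/smaps).
    """
    p = psutil.Process(pid if pid is not None else os.getpid())
    info = p.memory_full_info()
    return info.rss / 2**20, info.uss / 2**20

if __name__ == "__main__":
    rss, uss = mem_mb()
    print(f"RSS={rss:.1f} MiB, USS={uss:.1f} MiB")
```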
- Python >= 3.7
- Linux
- PyTorch >= 1.10
```
pip install psutil tabulate tensordict
```
Inspired by and modified from https://github.com/ppwwyyxx/RAM-multiprocess-dataloader
See also: