
Pipeline Parallelism (Supported? How to?) #827

Open
casper-hansen opened this issue Nov 14, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@casper-hansen

🚀 Feature Request

Supporting TP and SP seems straightforward with the `replication` parameter:

replication = tp * sp

I have tried various ways to enable PP without success. I tried including pp in the equation when computing replication and num_canonical_nodes, but I cannot get it to work correctly: the loss is unexpectedly high.
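For reference, this is roughly how I wire it up for TP + SP today (a minimal sketch; the dataset paths and degree variables below are placeholders, not my actual config):

```python
from streaming import StreamingDataset

tp_degree = 2  # tensor parallel degree (placeholder)
sp_degree = 2  # sequence parallel degree (placeholder)

# Every rank inside the same TP x SP group must see identical samples,
# so the number of ranks sharing a sample stream is tp * sp.
replication = tp_degree * sp_degree

dataset = StreamingDataset(
    remote='s3://my-bucket/tokenized',  # placeholder location
    local='/tmp/streaming_cache',
    batch_size=8,
    shuffle=True,
    replication=replication,
)
```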

Motivation

I want to use the mosaicml streaming library with 4D parallel. Specifically, I rely on TorchTitan as my training tool and have simply swapped in the mosaicml streaming library by modifying the StreamingTextDataset implementation from LLM Foundry.

@ethantang-db
Contributor

We can look into this in more detail. In the meantime, have you tried using mosaicml/composer for training? Are there specific features in TorchTitan that you rely on?

@casper-hansen
Author

casper-hansen commented Nov 15, 2024

I would really appreciate it if you could look into it! TorchTitan uses torch.distributed.pipelining, most of which is only available from PyTorch 2.5.0 onward or in nightly builds.

There are many key features, like FSDP2, 4D parallelism, FP8, and torch.compile, that make Llama models scale well in pretraining. You also get full control over the training loop, which is desirable when you want to experiment.

@snarayan21
Collaborator

@casper-hansen So StreamingDataset's replication argument assumes that the ranks that have replicated samples are in contiguous blocks of global rank indices. Concretely, suppose on 16 GPUs, I have a replication factor of 2. Then StreamingDataset will replicate the same samples on GPU ranks 0 and 1, 2 and 3, 4 and 5, and so on. In the 4D parallelism case, you likely have ranks that are not contiguous, but still want to replicate samples over these ranks (as in, using the previous example, you may want GPU ranks 0, 1, 8, and 9 to see the same samples).
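To make the contiguity assumption concrete, here is a small illustrative sketch (the grouping arithmetic below is just for illustration, not Streaming's internal code):

```python
# With replication=r, the ranks that share samples are contiguous blocks
# of size r in global rank order.
world_size = 16
replication = 2

groups = {}
for global_rank in range(world_size):
    group = global_rank // replication  # ranks with the same group share samples
    groups.setdefault(group, []).append(global_rank)

print(groups)
# {0: [0, 1], 1: [2, 3], 2: [4, 5], ..., 7: [14, 15]}
# A 4D-parallel mesh may instead need a non-contiguous set like {0, 1, 8, 9}
# to see the same samples, which this contiguous-block scheme cannot express.
```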

We currently enable replication through the World object's replicate function (called here), which sets the correct global node and rank indices over which the sample partition is constructed and samples are retrieved. If you want to try enabling 4D parallelism yourself, I would look at the replicate function here and allow it to create a new World object with the right information for your sharding & parallelism strategy.
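As a starting point, something like the sketch below computes the grouping information a modified replicate would need. The mesh ordering and helper name are hypothetical, not Streaming's actual World API:

```python
# Hypothetical helper (not part of Streaming): given a flat global rank and a
# 4D mesh with dimension order (pp, dp, tp, sp), sp varying fastest, return
# which unique sample stream the rank should read and how many streams exist.
def sample_stream_for_rank(global_rank: int, pp: int, dp: int, tp: int, sp: int):
    assert 0 <= global_rank < pp * dp * tp * sp
    # Decompose the flat rank into its data-parallel coordinate (assumed ordering).
    dp_idx = (global_rank // (sp * tp)) % dp
    # Only the data-parallel coordinate decides which samples a rank sees;
    # all pp/tp/sp peers in the same dp group replicate the same data.
    return dp_idx, dp

# Example: pp = dp = tp = sp = 2 on 16 GPUs -> 2 unique streams; ranks 0-3 and
# 8-11 all map to stream 0, a non-contiguous set of global ranks.
```

A modified replicate would presumably then translate that stream index and stream count into the node and rank fields that the partitioning code expects.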
