[python] Balance distributed partition size more evenly #1135
Fixes #1119
Generate the partitions by splitting on obs_joinids instead of on soma chunks.
Previously, when the number of soma chunks was not evenly divisible by the partition count ("world size"), partition sizes could become drastically imbalanced.
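For illustration (the round-robin assignment and the numbers here are assumed, not taken from the issue), a sketch of the failure mode:

```python
# 5 soma chunks of 1000 rows dealt out to 4 ranks leaves rank 0 with an
# entire extra chunk's worth of rows:
chunk_sizes = [1000] * 5
world_size = 4
rows_per_rank = [sum(chunk_sizes[rank::world_size]) for rank in range(world_size)]
print(rows_per_rank)  # [2000, 1000, 1000, 1000]
```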
The new algorithm for producing shuffled, partitioned IDs is sketched below.
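A minimal numpy sketch, reconstructing the five steps from the references to "step 1" and "step 5" in the notes that follow; `shuffled_partitioned_ids`, its parameters, and the step numbering are illustrative assumptions rather than the actual implementation, and any within-chunk shuffling done at read time is omitted:

```python
import numpy as np

def shuffled_partitioned_ids(
    obs_joinids: np.ndarray,
    soma_chunk_size: int,
    world_size: int,
    rank: int,
    seed: int = 0,
) -> list[np.ndarray]:
    """Return this rank's partition of obs_joinids as a list of read chunks (sketch)."""
    rng = np.random.default_rng(seed)

    # Step 1: split obs_joinids into fixed-size soma chunks.
    soma_chunks = [
        obs_joinids[i : i + soma_chunk_size]
        for i in range(0, len(obs_joinids), soma_chunk_size)
    ]

    # Step 2: shuffle the order of the soma chunks (ids within a chunk
    # stay together, preserving contiguous on-disk reads).
    order = rng.permutation(len(soma_chunks))

    # Step 3: concatenate the shuffled chunks into one id sequence.
    ids = np.concatenate([soma_chunks[i] for i in order])

    # Step 4: split the ids across ranks; np.array_split spreads the
    # remainder, so partition sizes differ by at most 1 row.
    partition = np.array_split(ids, world_size)[rank]

    # Step 5: re-chunk this rank's partition for reading. These chunks are
    # generally not aligned with the step-1 chunks (see the notes below).
    return [
        partition[i : i + soma_chunk_size]
        for i in range(0, len(partition), soma_chunk_size)
    ]
```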
When performing partitioning without shuffling, a soma chunk from step 1 may end up split across two partitions. This introduces a minor read-performance hit, since each soma chunk that is split across partitions results in 2 read operations instead of 1.
When performing partitioning with shuffling, each soma chunk from step 1 may end up split across two chunks in step 5, since the step 5 chunks are not "aligned" with the step 1 chunks when the partition sizes are not evenly divisible by the chunk size. This introduces a 2x read-performance hit in the worst case: each partition-local chunk may be composed of two original chunks that are not stored contiguously on disk. This could be addressed by explicitly aligning the step 5 chunks with the step 1 chunks, but that is not implemented in this PR.
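To make the misalignment concrete using the hypothetical sketch above (illustrative numbers):

```python
# 20 ids in 5 soma chunks of 4 rows, split across 2 ranks: each rank gets
# a 10-row partition, re-chunked as [4, 4, 2].
chunks = shuffled_partitioned_ids(
    np.arange(20), soma_chunk_size=4, world_size=2, rank=1
)
print([len(c) for c in chunks])  # [4, 4, 2]
# Because 10 is not a multiple of 4, rank 1's first read chunk spans the
# tail of one step-1 chunk and the head of another -- two non-contiguous
# regions on disk, hence 2 reads where 1 sufficed before.
```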
Now, the worst-case imbalance is that any two partitions differ in size by at most 1 row, because `numpy.array_split()` distributes the remainder evenly across the splits it produces. Since only size-1 imbalances can occur, no partition should need rows dropped or padded.
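For example:

```python
import numpy as np

# np.array_split spreads the remainder over the leading splits, so any
# two partitions differ in size by at most 1 row:
print([len(p) for p in np.array_split(np.arange(10), 4)])  # [3, 3, 2, 2]
```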
New unit test cases are included in `test_distributed__returns_data_partition_for_rank`, which now exercises the imbalanced cases as well. I've also added `test_distributed__returns_data_partition_for_rank_globally_shuffled` to demonstrate that global shuffling is maintained.