I'm opening this issue to keep track of the work needed to port the content of PR #996 to the main branch.
The idea is to split that PR (which is huge and based on a quite old version of the codebase) and, starting from the current state of the main branch, port its main elements in smaller PRs.
I'll keep this issue updated as I work on this.
Many changes are not strictly related to supporting distributed training but may benefit Avalanche in general.
I'm starting by porting the modernized object detection/segmentation dataset, strategies, and metrics. I'll also port the generalized batch collate functionality.
Changes in Distributed Training PR #996:

Legend:

Base elements
- `supports_distributed` plugin field (Add base elements to support distributed comms. Add supports_distributed plugin flag. #1370)
- `_distributed_check` strategy field and related `_check_distributed_training_compatibility()` check (#1370) (see the compatibility-check sketch after this list)
- `wrap_distributed_model` strategy lifecycle method. Called from ..._observation.py
- `_obtain_common_dataloader_parameters` strategy method (unrelated to distributed training) (#1370)
- `_obtain_common_dataloader_parameters` (unrelated to distributed training) (#1370)

Strategy and plugins
- 🔲 Strategy templates: wrap the various lifecycle methods to allow for seamless support of distributed training. Implementations should now be in `_backward()`, `_forward()`, ..., while wrapping happens in `backward`, `forward`. Wrapper methods should be final, but Python is not strict on this (flexibility). (See the wrapping-pattern sketch after this list.)

Models
- `avalanche_forward`: generalize using `is_multi_task_module` to take DDP wrapping into account (#1370) (see the DDP-unwrapping sketch after this list)

Detection

Data Loader

Loggers and metrics
- `evaluator` constructor parameter (`evaluator=default_evaluator()` -> `evaluator=default_evaluator`) (#1370) (see the factory sketch after this list)
- `evaluator` parameter value to use a factory (#1370)

Unit tests

Typing
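Below is a hypothetical sketch of what the `supports_distributed` flag and the `_check_distributed_training_compatibility()` check could look like. The warning behavior and the `SketchPlugin`/`_FakeStrategy` names are assumptions for illustration, not Avalanche's actual implementation.

```python
# Hypothetical sketch: plugins opt in to distributed training via a
# `supports_distributed` class attribute, and the strategy warns when an
# attached plugin has not been ported yet. Illustrative only.
import warnings


class SketchPlugin:
    # Unported plugins default to False and trigger a warning below.
    supports_distributed: bool = False


def _check_distributed_training_compatibility(strategy) -> bool:
    unsupported = [
        type(p).__name__
        for p in strategy.plugins
        if not getattr(p, "supports_distributed", False)
    ]
    if unsupported:
        warnings.warn(
            "These plugins may not support distributed training: "
            + ", ".join(unsupported)
        )
    return len(unsupported) == 0


class _FakeStrategy:
    def __init__(self, plugins):
        self.plugins = plugins


_check_distributed_training_compatibility(_FakeStrategy([SketchPlugin()]))
```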
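A minimal sketch of the wrapping pattern described under "Strategy and plugins": subclasses keep overriding the underscore-prefixed hooks, while the public methods act as conceptually final wrappers where distributed-specific handling can live. Everything besides the `forward`/`_forward` and `backward`/`_backward` naming is illustrative.

```python
# Sketch of the lifecycle-method wrapping pattern (illustrative names).
import torch


class SketchStrategyTemplate:
    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.mb_x = None       # current minibatch input
        self.mb_output = None
        self.loss = None

    # --- public wrappers: conceptually final, not meant to be overridden ---
    def forward(self):
        # A distributed-aware template can add DDP handling here
        # (e.g., unwrapping the model) without touching subclass code.
        return self._forward()

    def backward(self):
        # Gradient-sync concerns (no_sync() contexts, etc.) would be
        # managed at this level in a distributed setting.
        self._backward()

    # --- implementation hooks: what subclasses actually override ---
    def _forward(self):
        self.mb_output = self.model(self.mb_x)
        return self.mb_output

    def _backward(self):
        self.loss.backward()
```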
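A hedged sketch of the `avalanche_forward`/`is_multi_task_module` item: when the model is wrapped in `DistributedDataParallel`, the multi-task check should look at the wrapped `.module`. The `MultiTaskModule` stand-in and the `unwrap_model` helper are assumptions for illustration; Avalanche's real helpers may differ.

```python
# Illustrative sketch: make the multi-task check DDP-aware by unwrapping
# DistributedDataParallel before the isinstance test, while still calling
# the wrapper so DDP's gradient hooks keep working.
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


class MultiTaskModule(nn.Module):
    """Stand-in for a multi-task base module (takes task labels in forward)."""

    def forward(self, x, task_labels):
        raise NotImplementedError


def unwrap_model(model: nn.Module) -> nn.Module:
    # Hypothetical helper: return the underlying module if DDP-wrapped.
    return model.module if isinstance(model, DistributedDataParallel) else model


def is_multi_task_module(model: nn.Module) -> bool:
    # Check the unwrapped model so DDP wrapping does not hide the type.
    return isinstance(unwrap_model(model), MultiTaskModule)


def avalanche_forward(model: nn.Module, x, task_labels):
    # Route the call based on the unwrapped type, but invoke `model` itself.
    if is_multi_task_module(model):
        return model(x, task_labels)
    return model(x)
```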
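A small sketch of why the `evaluator=default_evaluator()` -> `evaluator=default_evaluator` change matters: a default argument is evaluated once at import time, so every strategy would share the same evaluator instance, which can be problematic in general and especially across distributed processes; accepting a factory builds a fresh one per strategy. `EvaluationPlugin` and `SketchStrategy` here are simplified stand-ins, not Avalanche's actual classes.

```python
# Sketch of the evaluator-as-factory change (simplified stand-in classes).
from typing import Callable, Union


class EvaluationPlugin:
    def __init__(self):
        self.metrics = []


def default_evaluator() -> EvaluationPlugin:
    return EvaluationPlugin()


class SketchStrategy:
    def __init__(
        self,
        evaluator: Union[
            EvaluationPlugin, Callable[[], EvaluationPlugin]
        ] = default_evaluator,
    ):
        # Accept either an instance (backward compatible) or a factory.
        self.evaluator = evaluator() if callable(evaluator) else evaluator


# With the factory default, each strategy gets its own EvaluationPlugin.
s1, s2 = SketchStrategy(), SketchStrategy()
assert s1.evaluator is not s2.evaluator
```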