# Distributed Utils

!!! note

    These functionalities are available via the `Lux.DistributedUtils` module.

```@meta
CurrentModule = Lux
```

## Index

```@index
Pages = ["distributed_utils.md"]
```

## [Backends](@id communication-backends)

```@docs
MPIBackend
NCCLBackend
```

## Initialization

```@docs
DistributedUtils.initialize
DistributedUtils.initialized
DistributedUtils.get_distributed_backend
```

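A typical startup sequence looks like the sketch below (assuming the processes are launched via an MPI launcher such as `mpiexec`, and that `MPI.jl` is installed; `initialize` must be called before any other `DistributedUtils` function):

```julia
using Lux, MPI

# Initialize the backend once per process. NCCLBackend works analogously for
# CUDA GPUs, with NCCL.jl and CUDA.jl loaded.
DistributedUtils.initialize(MPIBackend)
backend = DistributedUtils.get_distributed_backend(MPIBackend)
```
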
## Helper Functions

```@docs
DistributedUtils.local_rank
DistributedUtils.total_workers
```

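A common use of these helpers is restricting logging to a single process. A minimal sketch, assuming `backend` was obtained via `DistributedUtils.get_distributed_backend`:

```julia
rank = DistributedUtils.local_rank(backend)        # 0-based rank of this process
nworkers = DistributedUtils.total_workers(backend)

# Only log from rank 0 to avoid duplicated output
rank == 0 && @info "Training on $nworkers workers"
```
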
## Communication Primitives

```@docs
DistributedUtils.allreduce!
DistributedUtils.bcast!
DistributedUtils.reduce!
DistributedUtils.synchronize!!
```

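As a rough sketch of how these primitives combine in a training step (assuming `backend`, parameters `ps`, states `st`, and a gradient buffer `gs` already exist; the `!!` suffix signals that the returned value must be used, since some inputs cannot be mutated in-place):

```julia
# Broadcast the initial parameters/states from rank 0 to all workers
ps = DistributedUtils.synchronize!!(backend, ps)
st = DistributedUtils.synchronize!!(backend, st)

# Sum the gradients across workers in-place; divide by
# DistributedUtils.total_workers(backend) to get the average
DistributedUtils.allreduce!(backend, gs, +)
```
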
## Optimisers.jl Integration

```@docs
DistributedUtils.DistributedOptimizer
```

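`DistributedOptimizer` wraps an optimizer from [Optimisers.jl](https://github.com/FluxML/Optimisers.jl) so that gradients are averaged across workers before being applied. A minimal sketch, assuming `backend` and parameters `ps` already exist:

```julia
using Optimisers

opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))
opt_state = Optimisers.setup(opt, ps)

# The optimizer state itself should also be synchronized across workers
opt_state = DistributedUtils.synchronize!!(backend, opt_state)
```
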
## MLUtils.jl Integration

```@docs
DistributedUtils.DistributedDataContainer
```
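`DistributedDataContainer` shards a dataset so that each worker sees a disjoint subset. A minimal sketch with a toy array dataset (the data and batch size here are hypothetical placeholders):

```julia
using MLUtils

data = randn(Float32, 2, 128)  # 128 toy samples
ddata = DistributedUtils.DistributedDataContainer(backend, data)

# Each worker iterates only over its own shard
dataloader = MLUtils.DataLoader(ddata; batchsize=16, shuffle=true)
```
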
# Distributed Data Parallel Training

!!! tip

    For a fully functional example, see the
    [ImageNet Training Example](https://github.com/LuxDL/Lux.jl/tree/main/examples/ImageNet).

DDP training using `Lux.DistributedUtils` is a spiritual successor to
[FluxMPI.jl](https://github.com/avik-pal/FluxMPI.jl), but has some key differences.

## Guide to Integrating DistributedUtils into your code

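At a high level, the integration boils down to initializing a backend, synchronizing the model state, and wrapping the optimizer and data. The sketch below condenses this flow; the training loop is elided, and the model and hyperparameters are placeholders:

```julia
using Lux, MPI, Optimisers, Random

# 1. Initialize the backend and retrieve it
DistributedUtils.initialize(MPIBackend)
backend = DistributedUtils.get_distributed_backend(MPIBackend)

# 2. Set up the model, then synchronize parameters and states across workers
model = Dense(4 => 2)
ps, st = Lux.setup(Random.default_rng(), model)
ps = DistributedUtils.synchronize!!(backend, ps)
st = DistributedUtils.synchronize!!(backend, st)

# 3. Wrap the optimizer (averages gradients) and the dataset (shards it)
opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))
# data = DistributedUtils.DistributedDataContainer(backend, dataset)
```
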
## [GPU-Aware MPI](@id gpu-aware-mpi)

If you are using a custom MPI build that supports CUDA or ROCm, you can set the following
preferences with [Preferences.jl](https://github.com/JuliaPackaging/Preferences.jl):

1. `LuxDistributedMPICUDAAware` - Set this to `true` if your MPI build is CUDA-aware.
2. `LuxDistributedMPIROCMAware` - Set this to `true` if your MPI build is ROCm-aware.

By default, both of these preferences are set to `false`.

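For example, marking the MPI build as CUDA-aware could look like the sketch below. Only do this if your MPI build genuinely is CUDA-aware, and restart Julia afterwards for the preference to take effect:

```julia
using Preferences, Lux

set_preferences!(Lux, "LuxDistributedMPICUDAAware" => true; force=true)
```
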
## Migration Guide from `FluxMPI.jl`

Let's compare the changes we need to make relative to the
[FluxMPI.jl integration guide](https://avik-pal.github.io/FluxMPI.jl/dev/guide/):

1. `FluxMPI.Init` is now [`DistributedUtils.initialize`](@ref).
2. `FluxMPI.synchronize!(x)` needs to be changed to
   `x_new = DistributedUtils.synchronize!!(backend, x)`.
3. [`DistributedUtils.DistributedDataContainer`](@ref),
   [`DistributedUtils.local_rank`](@ref), and
   [`DistributedUtils.DistributedOptimizer`](@ref) now take `backend` as their first
   argument.

And that's pretty much it!

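Side by side, a minimal migration might look like this (a sketch; `ps` stands for a parameter NamedTuple):

```julia
# Before (FluxMPI.jl):
#   FluxMPI.Init()
#   FluxMPI.synchronize!(ps)

# After (Lux.DistributedUtils):
DistributedUtils.initialize(MPIBackend)
backend = DistributedUtils.get_distributed_backend(MPIBackend)
ps = DistributedUtils.synchronize!!(backend, ps)
```
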
### Removed Functionality

1. `FluxMPI.allreduce_gradients` no longer exists. It was previously needed when CUDA
   communication was flaky; with `NCCL.jl` this is no longer the case.
2. `FluxMPIFluxModel` has been removed. `DistributedUtils` no longer works with `Flux`.

### Key Differences

1. `FluxMPI.synchronize!` is now `DistributedUtils.synchronize!!`, highlighting the fact
   that some of the inputs are not updated in-place.
2. All of the functions now require a [communication backend](@ref communication-backends)
   as input.
3. We don't automatically determine whether the MPI implementation is CUDA- or ROCm-aware.
   See [GPU-aware MPI](@ref gpu-aware-mpi) for more information.
4. Older [`Lux.gpu`](@ref) implementations used to "just work" with `FluxMPI.jl`. We expect
   [`gpu_device`](@ref) to continue working as expected; however, we recommend calling
   [`gpu_device`](@ref) after [`DistributedUtils.initialize`](@ref) to avoid any mismatch
   between the device set via `DistributedUtils` and the device stored in `LuxCUDADevice`
   or `LuxAMDGPUDevice`.