Support MPI #752

Open

mofeing wants to merge 47 commits into main from ss/mpi

Conversation

@mofeing (Collaborator) commented Feb 15, 2025

This PR...

  • Registers the MPI routine symbol addresses when MPI.jl gets loaded
  • Specializes MPI.jl methods so they can be traced by Reactant (a rough sketch of both pieces is below)
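
Purely as illustration (not the actual code in this PR), the two bullets could look roughly like this on the Julia side; register_symbol! and emit_mpi_send! are hypothetical placeholders for whatever Reactant uses internally, and MPI.API.libmpi as the library path is also an assumption:

using Libdl, MPI

# (1) Look up the address of an MPI routine from the libmpi that MPI.jl loaded,
#     so lowered custom calls resolve against the same library.
#     (MPI.API.libmpi as the library path is an assumption.)
const libmpi_handle = Libdl.dlopen(MPI.API.libmpi)
const mpi_send_ptr  = Libdl.dlsym(libmpi_handle, :MPI_Send)
# register_symbol!("MPI_Send", mpi_send_ptr)   # hypothetical registration hook

# (2) Specialize an MPI.jl method so that, on traced arrays, it records an op in
#     the IR instead of calling MPI eagerly (tracing internals are made up here).
# function MPI.Send(buf::Reactant.TracedRArray, comm::MPI.Comm; dest, tag=0)
#     emit_mpi_send!(buf, dest, tag)            # hypothetical IR emitter
#     return nothing
# end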

unresolved questions

  • how can we represent MPI_Request with tensor and stablehlo types?
  • mmm, stablehlo.custom_call has a backend attribute that could be useful during lowering; e.g. if we want to lower to NCCL instead of MPI (both have similar APIs), we could add our own custom C functions that call NCCL but adapt them to an MPI-like API
  • @wsmoses can we create @cfunctions in Julia and pass them to the symbol table? some MPI routines might need a bit of adaptation, and writing the adaptors in Julia would be easier and faster (and would also use the correct symbols from the libmpi loaded by MPI.jl); see the sketch after this list
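
As a point of reference for the @cfunction question, an adaptor written in Julia could look roughly like this; the signature, the names, and the idea of handing the resulting pointer to the symbol table are illustrative assumptions rather than what this PR does:

using MPI

# Hypothetical adaptor with a C-compatible signature (illustrative only).
function mpi_send_adaptor(buf::Ptr{Cvoid}, count::Cint, dest::Cint, tag::Cint)::Cint
    data = unsafe_wrap(Array, Ptr{UInt8}(buf), count)   # reinterpret the raw buffer
    MPI.Send(data, MPI.COMM_WORLD; dest=dest, tag=tag)  # uses MPI.jl's own libmpi
    return Cint(0)
end

# C-callable function pointer that could then be passed to the symbol table.
const mpi_send_adaptor_ptr =
    @cfunction(mpi_send_adaptor, Cint, (Ptr{Cvoid}, Cint, Cint, Cint))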

tested

to do

  • MPI communicators
  • sharding
  • more MPI routines
  • custom reduction operators

cc @JBlaschke @hhkit

@wsmoses (Member) commented Feb 15, 2025

you won't; instead you'll emit something like


function send_wrap(%arg : memref<axb>) {
    mpi.send %arg
}

function main() {
    ...
    enzymexla.jit_call @send_wrap(%x : tensor<...>)
}

And then the lower-jit pass will convert it into a custom call. However, you will need to define a lowering of mpi.send into a corresponding MPI_Send call [which will use the symbol you just registered here]

Re CUDA, though, we also need to ensure we are synced with respect to the current CUDA stream, which you can get via enzymexla.get_stream

@mofeing (Collaborator, Author) commented Feb 16, 2025

mmm, from our last discussion on this a couple of weeks ago, I understood that we would emit this

function main() {
    ...
    mpi.send(%arg0, ...)
    ...
}

and it would get lowered to

function send_wrap(%arg : memref<axb>) {
    llvm.call <0xffff> (%arg)
}

function main() {
    ...
    enzymexla.jit_call @send_wrap(%x : tensor<...>)
    ...
}

which will finally lower to the following with the enzymexla.jit pass

function main() {
    ...
    stablehlo.custom_call @mpi_send_wrap(%x : tensor<...>)
    ...
}

is this correct or do we need to emit the enzymexla.jit_call directly from Reactant?

ahh or do you mean that any wrapping we need to do around MPI should be done in this way?

Re CUDA, though, we also need to ensure we are synced with respect to the current CUDA stream, which you can get via enzymexla.get_stream

okay, this will probably be required for NCCL

@mofeing force-pushed the ss/mpi branch 2 times, most recently from 2744d58 to 19c0eca on March 16, 2025 07:33
mofeing and others added 4 commits March 16, 2025 09:08
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@mofeing requested a review from wsmoses March 16, 2025 09:06
@mofeing marked this pull request as ready for review March 16, 2025 09:06
@mofeing (Collaborator, Author) commented Mar 16, 2025

The PR is ready for review. The missing MPI routines are waiting on other PRs or need some fixes, but they can be added later.

@giordano the MPI testset results get printed multiple times because the tests run on all ranks, but I guess that's not a problem?

@giordano (Member) left a comment

the MPI testset results get printed multiple times because the tests run on all ranks, but I guess that's not a problem?

That's unfortunate. Just don't use @testset? We don't in https://github.com/JuliaParallel/MPI.jl/tree/5ef7fef6d6c3e2ab2ad380f346c77235f47213bf/test
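
For context, the pattern in the linked MPI.jl tests is roughly: the driver launches a per-rank script under mpiexec, and the script uses bare @test assertions instead of @testset, so any failure throws, the worker exits non-zero, and the outer run(...) call fails. A minimal sketch of such a per-rank script (file name and contents are illustrative, not what this PR ships):

# integration/mpi.jl -- illustrative per-rank script
using Test, MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Bare @test with no enclosing @testset: a failure throws, the worker exits
# non-zero, and that makes the driver's run(`mpiexec ...`) call error out.
@test MPI.Comm_size(comm) >= 2
@test 0 <= MPI.Comm_rank(comm) < MPI.Comm_size(comm)

MPI.Finalize()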

@safetestset "MPI" begin
using MPI
nranks = 2
run(`$(mpiexec()) -n $nranks $(Base.julia_cmd()) integration/mpi.jl`)
Suggested change
run(`$(mpiexec()) -n $nranks $(Base.julia_cmd()) integration/mpi.jl`)
run(`$(mpiexec()) -n $nranks $(Base.julia_cmd()) --startup-file=no $(joinpath(@__DIR__, "integration", "mpi.jl"))`)
