
Using Multi GPU for near_global_ocean_simulation.jl #225

Open
sb4233 opened this issue Nov 9, 2024 · 22 comments

@sb4233 commented Nov 9, 2024

Hi, I am running near_global_ocean_simulation.jl on a GPU. It works fine on a single GPU, but what changes do I have to make to the code to run it on multiple GPUs?
I am new to ClimaOcean, so any help would be appreciated. Thanks!

@glwagner (Member) commented Nov 9, 2024

The most important change is to use arch = Distributed(GPU()) rather than just arch = GPU(). However, other changes may be needed. Let us know what you find.
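
A minimal sketch of that change, assuming a script along the lines of the ClimaOcean example (the grid parameters below are placeholders, not the example's actual values) and an MPI launcher such as mpiexec -n 2 julia --project script.jl:

using Oceananigans
using Oceananigans.DistributedComputations

arch = Distributed(GPU())   # was: arch = GPU()

# Placeholder grid; the rest of the example script stays the same.
grid = LatitudeLongitudeGrid(arch;
                             size = (360, 160, 10),
                             longitude = (0, 360),
                             latitude = (-80, 80),
                             z = (-1000, 0))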

@sb4233 (Author) commented Nov 10, 2024

I tried this but got the error below (full log attached as output.log):

[ Info: Regridding bathymetry from existing file /g/data/er50/sb4233/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/Bathymetry/ETOPO_2022_v1_60s_N90W180_surface.nc.
ERROR: LoadError: type FixedTimeStepSize has no field averaging_weights
Stacktrace:

It points to somewhere around here:

ocean = ocean_simulation(grid)

@simone-silvestri (Collaborator)

The default configuration of ocean_simulation is not yet working with distributed grids.
As a quick fix, you can do

free_surface = SplitExplicitFreeSurface(grid; substeps = 70)
ocean = ocean_simulation(grid; free_surface)

I will start changing our defaults to allow a Distributed grid.

@glwagner (Member)

We need tests for a wide variety of expected configurations.

@glwagner (Member)

@sb4233 thank you for finding this bug

@sb4233 (Author) commented Nov 11, 2024

> @sb4233 thank you for finding this bug

Happy to get involved :)

@sb4233 (Author) commented Nov 12, 2024

> The default configuration of ocean_simulation is not yet working with distributed grids. As a quick fix, you can do
>
> free_surface = SplitExplicitFreeSurface(grid; substeps = 70)
> ocean = ocean_simulation(grid; free_surface)
>
> I will start changing our defaults to allow a Distributed grid.

I got a segmentation fault when using this along with arch = Distributed(GPU()) (full log attached as output.log).

I think it has something to do with accessing ECCO data for ocean.model:

set!(ocean.model, T=ECCOMetadata(:temperature; dates=date), S=ECCOMetadata(:salinity; dates=date))

@simone-silvestri (Collaborator)

It looks like you do not have CUDA-aware MPI enabled, which is required to run simulations on multiple GPUs.
Enabling CUDA-aware MPI is somewhat system-dependent, but if you have a CUDA-aware MPI installation on your system, most of the time it is enough to do:

using Pkg
Pkg.add("MPIPreferences")
using MPIPreferences
MPIPreferences.use_system_binary()
# Restart the julia session

You can look at these docs for more in-depth instructions:
https://juliaparallel.org/MPI.jl/stable/configuration/
https://juliaparallel.org/MPI.jl/stable/usage/#CUDA-aware-MPI-support
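
After restarting, one way to check whether the MPI build MPI.jl is using advertises CUDA support is MPI.has_cuda() (note this query is only reliable for Open MPI; other implementations may report false even when they are CUDA-aware):

using MPI
MPI.Init()
@show MPI.identify_implementation()   # which MPI library and version MPI.jl is using
@show MPI.has_cuda()                  # true if the build reports CUDA support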

@glwagner (Member)

@sb4233 try this and let us know what you find:

https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2
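
For reference, a minimal test in the same spirit as that gist (a sketch, not the gist itself): each rank sends a CuArray to a neighbour with MPI.Sendrecv!, which typically crashes or errors if the MPI library is not CUDA-aware.

using MPI, CUDA
MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
dst = mod(rank + 1, nprocs)
src = mod(rank - 1, nprocs)
println("rank=$rank, size=$nprocs, dst=$dst, src=$src")

send = CUDA.fill(Float64(rank), 1024)   # device buffer to send
recv = CUDA.zeros(Float64, 1024)        # device buffer to receive into
MPI.Sendrecv!(send, recv, comm; dest = dst, source = src)
MPI.Barrier(comm)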

@glwagner (Member)

Also @sb4233, it would be more convenient if you could paste your error here in the thread so we don't have to download a file to see it!

@glwagner (Member)

@sb4233 can you let us know what is on line 89 of your file near_global_ocean_simulation.jl? I am trying to interpret your error which contains

in expression starting at /g/data/er50/sb4233/Oceananigans/test_julia/near_global_ocean_simulation.jl:89

@glwagner (Member) commented Nov 12, 2024

@simone-silvestri do you understand how the error can come from here:

#set!#20 at /g/data/er50/sb4233/.julia/packages/ClimaOcean/D5bjt/src/DataWrangling/ECCO/ECCO.jl:242
set! at /g/data/er50/sb4233/.julia/packages/ClimaOcean/D5bjt/src/DataWrangling/ECCO/ECCO.jl:226

If CUDA-aware MPI is not available, shouldn't we get an error earlier? For example, during the construction of HydrostaticFreeSurfaceModel (where we call fill_halo_regions!)?

@sb4233 it will help if you let us know what version you're using, because this is not main as far as I can tell.

@simone-silvestri (Collaborator) commented Nov 12, 2024

Hmmm, right. I think it is CUDA-aware, because the error appears when trying to MPI.Allreduce a CUDA array. It is weird that the error does not appear earlier.
It could be a problem of different sizes in the broadcast operation.

That piece of code has changed in the new main, so using main might give a different error.

@sb4233 (Author) commented Nov 12, 2024

> @sb4233 can you let us know what is on line 89 of your file near_global_ocean_simulation.jl? I am trying to interpret your error which contains
>
> in expression starting at /g/data/er50/sb4233/Oceananigans/test_julia/near_global_ocean_simulation.jl:89

It's this line: set!(ocean.model, T=ECCOMetadata(:temperature; dates=date), S=ECCOMetadata(:salinity; dates=date))

The error comes after it processes the ECCO temperature data.

@glwagner (Member) commented Nov 12, 2024

I'm getting the sense that CUDA-aware MPI issues are going to be a significant source of pain for us so I'm trying to think how to improve the situation...

One thought is to throw some kind of test of CUDA-aware MPI at the right location (eg at the moment we know we are going to need it). If the test fails, we can print info to help us debug, eg the MPI version, maybe some info about the configuration, etc.

I thought this could logically go into the model constructor, before we call update_state!. Then we know that it will be used. But in this case, the error comes later...
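
A minimal sketch of what such a check could look like (a hypothetical helper, not part of ClimaOcean, built on MPI.jl's MPI.has_cuda, which only reports reliably for Open MPI):

using MPI

# Hypothetical helper: warn early when the MPI build does not advertise CUDA
# support, and log info that helps debugging.
function check_cuda_aware_mpi()
    MPI.Initialized() || MPI.Init()
    impl, version = MPI.identify_implementation()
    if !MPI.has_cuda()
        @warn "MPI does not report CUDA support; multi-GPU runs require CUDA-aware MPI." impl version
    end
    return nothing
end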

@sb4233 how many GPUs are you trying to use? What is grid.architecture?

@sb4233 (Author) commented Nov 12, 2024

> I'm getting the sense that CUDA-aware MPI issues are going to be a significant source of pain for us so I'm trying to think how to improve the situation...
>
> One thought is to throw some kind of test of CUDA-aware MPI at the right location (eg at the moment we know we are going to need it). If the test fails, we can print info to help us debug, eg the MPI version, maybe some info about the configuration, etc.
>
> I thought this could logically go into the model constructor, before we call update_state!. Then we know that it will be used. But in this case, the error comes later...
>
> @sb4233 how many GPUs are you trying to use? What is grid.architecture?

At the moment I am trying with just 2 GPUs to test the parallelisation. arch is set to Distributed(GPU()).

@glwagner (Member)

Can you show us grid.architecture, eg by putting a line

@show grid.architecture

@sb4233 (Author) commented Nov 12, 2024

> Can you show us grid.architecture, eg by putting a line
>
> @show grid.architecture

Sure! I will give this a try in the morning and post it here.

@sb4233 (Author) commented Nov 13, 2024

> Can you show us grid.architecture, eg by putting a line
>
> @show grid.architecture

I got this:

grid.architecture = Distributed{GPU} across 1 rank:
├── local_rank: 0 of 0-0
└── local_index: [1, 1, 1]

Does this mean it is using only one GPU?

Also, what's the purpose of the partition argument? Like here -

arch = GPU() #Distributed(GPU(), partition = Partition(2))

@sb4233 (Author) commented Nov 13, 2024

> @simone-silvestri do you understand how the error can come from here:
>
> #set!#20 at /g/data/er50/sb4233/.julia/packages/ClimaOcean/D5bjt/src/DataWrangling/ECCO/ECCO.jl:242
> set! at /g/data/er50/sb4233/.julia/packages/ClimaOcean/D5bjt/src/DataWrangling/ECCO/ECCO.jl:226
>
> If CUDA-aware MPI is not available, shouldn't we get an error earlier? For example, during the construction of HydrostaticFreeSurfaceModel (where we call fill_halo_regions!)?
>
> @sb4233 it will help if you let us know what version you're using, because this is not main as far as I can tell.

I am using the latest version of ClimaOcean.jl:

Status `/g/data/er50/sb4233/.julia/environments/v1.11/Project.toml`
  [0376089a] ClimaOcean v0.2.2 `https://github.com/CliMA/ClimaOcean.jl.git#main`

@sb4233 (Author) commented Nov 13, 2024

> @sb4233 try this and let us know what you find:
>
> https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2

rank=0, size=1, dst=0, src=0

[1232240] signal 11 (2): Segmentation fault
in expression starting at /g/data/er50/sb4233/Oceananigans/test_julia/test_CUDA_MPI.jl:15
__memcpy_evex_unaligned_erms at /lib64/libc.so.6 (unknown line)
MPIDI_CH3U_Buffer_copy at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPIDI_Isend_self at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPID_Isend at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPIR_Sendrecv_impl at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Sendrecv at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Sendrecv at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/api/generated_api.jl:2268
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:234 [inlined]
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:239 [inlined]
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:244 [inlined]
#Sendrecv!#102 at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:225 [inlined]
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:225 [inlined]
Sendrecv! at ./deprecated.jl:105
unknown function (ip: 0x150c4c1a099f)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2628
_include at ./loading.jl:2688
include at ./Base.jl:557
jfptr_include_46600.1 at /apps/julia/1.11.0/lib/julia/sys.so (unknown line)
exec_options at ./client.jl:323
_start at ./client.jl:531
jfptr__start_72051.1 at /apps/julia/1.11.0/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:1059
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 14959334 (Pool: 14958854; Big: 480); GC: 12
Segmentation fault

@glwagner (Member)

Ok @sb4233, I think that means you don't have CUDA-aware MPI. You need CUDA-aware MPI to use multiple GPUs with ClimaOcean / Oceananigans.
