Using Multi GPU for near_global_ocean_simulation.jl #225
The most important change is to use a `Distributed` architecture.

output.log

It points to somewhere here -
The default configuration of the free surface is

free_surface = SplitExplicitFreeSurface(grid; substeps = 70)
ocean = ocean_simulation(grid; free_surface)

I will start changing our defaults to allow a Distributed grid.
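For context, here is a minimal sketch of how these pieces fit together in a setup like this one; the grid resolution and extents below are placeholders rather than the example's actual values, and a single GPU is assumed.

```julia
using Oceananigans
using ClimaOcean

arch = GPU()  # single GPU; the distributed case is what this issue is about

# Illustrative resolution and extents only.
grid = LatitudeLongitudeGrid(arch;
                             size = (360, 160, 10),
                             longitude = (0, 360),
                             latitude = (-80, 80),
                             z = (-1000, 0))

# The default free surface discussed above, passed explicitly to the simulation.
free_surface = SplitExplicitFreeSurface(grid; substeps = 70)
ocean = ocean_simulation(grid; free_surface)
```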
We need tests for a wide variety of expected configurations.

@sb4233 thank you for finding this bug!

Happy to get involved :)
I got a segmentation fault when using this along with … I think it has something to do with accessing the ECCO data for …
It looks like you do not have CUDA-aware MPI enabled, which is required to run simulations on multiple GPUs.

using Pkg
Pkg.add("MPIPreferences")
using MPIPreferences
MPIPreferences.use_system_binary()
# Restart the Julia session

You can look at these docs for more in-depth instructions:
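After restarting, a quick sanity check is to ask MPI.jl which library it picked up and whether that library reports CUDA support. Both calls below are part of MPI.jl, though `MPI.has_cuda()` can only query some implementations, so treat a `false` as a hint rather than proof.

```julia
using MPI

MPI.versioninfo()      # prints the binary/ABI selected via MPIPreferences
MPI.Init()
@show MPI.has_cuda()   # whether the MPI build advertises CUDA awareness
```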
@sb4233 try this and let us know what you find: https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2
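For reference, here is a rough sketch of the kind of point-to-point test such a script runs (not necessarily the gist's exact contents): hand GPU buffers directly to MPI and check that the data arrives. The file name is arbitrary; it would be launched with something like `mpiexec -n 2 julia --project cuda_mpi_test.jl`.

```julia
using MPI, CUDA

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Exchange data with the neighbouring ranks in a ring.
dst = mod(rank + 1, nranks)
src = mod(rank - 1, nranks)

send = CUDA.fill(Float64(rank), 4)  # device buffers handed straight to MPI
recv = CUDA.zeros(Float64, 4)

# This call crashes or errors if the MPI library is not CUDA-aware.
MPI.Sendrecv!(send, recv, comm; dest = dst, source = src)
println("rank=$rank received $(Array(recv)) from rank $src")

MPI.Finalize()
```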
Also @sb4233 it would be more convenient if you can post your error here in the chat so we don't have to download a file to see it!
@sb4233 can you let us know what is on line 89 of your file, i.e. the line pointed to by `in expression starting at /g/data/er50/sb4233/Oceananigans/test_julia/near_global_ocean_simulation.jl:89`?
@simone-silvestri do you understand how the error can come from here? If CUDA-aware MPI is not available, shouldn't we get an error earlier? For example, during the construction of …?

@sb4233 it will help if you let us know what version you're using, because this is not …
Hmmm right. I think it's a CUDA-awareness problem, because the error appears when trying to MPI.Allreduce a CUDA array. Weird that the error does not appear before. That piece of code has changed in the new main, so maybe using main might give a different error.
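For illustration, the failing pattern being described is roughly a reduction over a GPU array, as in the sketch below. Without CUDA-aware MPI the library treats the device pointer as a host pointer and copies from it, which is consistent with a segmentation fault rather than a clean error.

```julia
using MPI, CUDA

MPI.Init()
x = CUDA.ones(Float64, 8)

# Passing a CuArray directly to an MPI collective requires a CUDA-aware build.
total = MPI.Allreduce(x, +, MPI.COMM_WORLD)
```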
It's this line: … The error comes after it processes the ECCO temperature data.
I'm getting the sense that CUDA-aware MPI issues are going to be a significant source of pain for us, so I'm trying to think how to improve the situation... One thought is to throw some kind of test of CUDA-aware MPI at the right location (e.g. at the moment we know we are going to need it). If the test fails, we can print info to help us debug, e.g. the MPI version, maybe some info about the configuration, etc. I thought this could logically go into the model constructor, before we call … (see the sketch below).

@sb4233 how many GPUs are you trying to use? What is `arch` set to?
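A sketch of what such an early check could look like; `check_cuda_aware_mpi` is a hypothetical name rather than an existing ClimaOcean/Oceananigans function, and `MPI.has_cuda()` cannot detect every CUDA-aware build, so this is only one possible shape for the guard.

```julia
using MPI

# Hypothetical guard to run before constructing a distributed model on GPUs.
function check_cuda_aware_mpi()
    MPI.Initialized() || MPI.Init()
    if !MPI.has_cuda()
        error("""
              MPI does not appear to be CUDA-aware, which is required for multi-GPU runs.
              MPI library: $(MPI.Get_library_version())
              Consider selecting a CUDA-aware system MPI via MPIPreferences.use_system_binary().
              """)
    end
    return nothing
end
```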
At the moment I am trying with just 2 GPUs, just to test the parallelisation. Arch is set to `Distributed(GPU())`.
Can you show us the output of `@show grid.architecture`?
Sure! I will give this a try in the morning and post it here.
I got this:

grid.architecture = Distributed{GPU} across 1 rank:
├── local_rank: 0 of 0-0
└── local_index: [1, 1, 1]

Does this mean it is using only one GPU? Also, what's the purpose of the …?
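For reference, "across 1 rank" reflects the number of MPI ranks the script was launched with. Below is a sketch of explicitly requesting two ranks, assuming the `Partition` helper from Oceananigans.DistributedComputations (the exact call is illustrative); the script still has to be started under MPI, e.g. `mpiexec -n 2 julia --project script.jl`, for two GPUs to actually be used.

```julia
using MPI
using Oceananigans
using Oceananigans.DistributedComputations

MPI.Initialized() || MPI.Init()

# Two ranks side by side, each driving one GPU.
arch = Distributed(GPU(); partition = Partition(2))

# Illustrative grid only; sizes and extents are placeholders.
grid = LatitudeLongitudeGrid(arch;
                             size = (360, 160, 10),
                             longitude = (0, 360),
                             latitude = (-80, 80),
                             z = (-1000, 0))
```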
I am using the latest version of ClimaOcean:

Status `/g/data/er50/sb4233/.julia/environments/v1.11/Project.toml`
  [0376089a] ClimaOcean v0.2.2 `https://github.com/CliMA/ClimaOcean.jl.git#main`
rank=0, size=1, dst=0, src=0
[1232240] signal 11 (2): Segmentation fault
in expression starting at /g/data/er50/sb4233/Oceananigans/test_julia/test_CUDA_MPI.jl:15
__memcpy_evex_unaligned_erms at /lib64/libc.so.6 (unknown line)
MPIDI_CH3U_Buffer_copy at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPIDI_Isend_self at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPID_Isend at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPIR_Sendrecv_impl at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Sendrecv at /g/data/er50/sb4233/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Sendrecv at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/api/generated_api.jl:2268
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:234 [inlined]
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:239 [inlined]
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:244 [inlined]
#Sendrecv!#102 at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:225 [inlined]
Sendrecv! at /g/data/er50/sb4233/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:225 [inlined]
Sendrecv! at ./deprecated.jl:105
unknown function (ip: 0x150c4c1a099f)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
include_string at ./loading.jl:2628
_include at ./loading.jl:2688
include at ./Base.jl:557
jfptr_include_46600.1 at /apps/julia/1.11.0/lib/julia/sys.so (unknown line)
exec_options at ./client.jl:323
_start at ./client.jl:531
jfptr__start_72051.1 at /apps/julia/1.11.0/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:1059
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 14959334 (Pool: 14958854; Big: 480); GC: 12
Segmentation fault
Ok @sb4233, I think that means that you don't have CUDA-aware MPI. You have to have CUDA-aware MPI to use multiple GPUs with ClimaOcean / Oceananigans.
Hi, I am running the near_global_ocean_simulation.jl example on GPU. For a single GPU it works fine, but what changes do I have to make to the code for multi-GPU usage? I am new to ClimaOcean, so any help would be appreciated. Thanks!