MWE of problem on GPU with regional ACC code #186

Open
francispoulin opened this issue Sep 24, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@francispoulin

Following up on @142, this is an MWE of where this code fails on a GPU. The error and code are copied below.

The error occurs when either the salinity or the temperature forcing is included, or both.

@simone-silvestri and @glwagner

Error:

julia> include("acc_regional_simulation.jl")
Precompiling Oceananigans
  166 dependencies successfully precompiled in 109 seconds
Precompiling ClimaOcean
        Info Given ClimaOcean was explicitly requested, output will be shown live 
WARNING: using Units.day in module ECCO conflicts with an existing identifier.
  204 dependencies successfully precompiled in 143 seconds. 168 already precompiled.
  2 dependencies had output during precompilation:
┌ ClimaOcean
│  [Output was shown above]
└  
┌ Accessors → AccessorsUnitfulExt
│  [pid 760940] waiting for IO to finish:
│   Handle type        uv_handle_t->data
│   fs_event           0x25d5fe0->0x7fdd5f8fbeb0
│   timer              0x2385440->0x7fdd5f8fbee0
│  This means that a package has started a background task or event source that has not finished running. For precompilation to complete successfully, the event source needs to be closed explicitly. See the developer documentation on fixing precompilation hangs for more help.
└  
[ Info: Regridding bathymetry from existing file /u/fpoulin/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/Bathymetry/ETOPO_2022_v1_60s_N90W180_surface.nc.
┌ Warning: The westernmost meridian of `target_grid` 0.0 does not coincide with the closest meridian of the bathymetry grid, -1.4210854715202004e-14.
└ @ ClimaOcean.Bathymetry ~/software/ClimaOcean.jl/src/Bathymetry.jl:147
[ Info: In-painting ecco temperature
[ Info: In-painting ecco temperature
[ Info: In-painting ecco salinity
[ Info: In-painting ecco salinity
ERROR: a bounds error was thrown during kernel execution on thread (1, 1, 1) in block (3, 1, 1).
Stacktrace:
 [1] indexed_iterate at ./tuple.jl:92
 [2] indexed_iterate at ./tuple.jl:92
 [3] stateindex at /u/fpoulin/software/ClimaOcean.jl/src/ClimaOcean.jl:40
 [4] ECCORestoring at /u/fpoulin/software/ClimaOcean.jl/src/DataWrangling/ecco_restoring.jl:210
 [5] DiscreteForcing at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Forcings/discrete_forcing.jl:51
 [6] hydrostatic_free_surface_tracer_tendency at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_tendency_kernel_functions.jl:133
 [7] macro expansion at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/compute_hydrostatic_free_surface_tendencies.jl:240
 [8] gpu_compute_hydrostatic_free_surface_Gc! at /u/fpoulin/.julia/packages/KernelAbstractions/QE5mt/src/macros.jl:95
 [9] gpu_compute_hydrostatic_free_surface_Gc! at ./none:0

ERROR: a bounds error was thrown during kernel execution on thread (1, 1, 1) in block (67, 1, 1).
Stacktrace:
 [1] indexed_iterate at ./tuple.jl:92
 [2] indexed_iterate at ./tuple.jl:92
 [3] stateindex at /u/fpoulin/software/ClimaOcean.jl/src/ClimaOcean.jl:40
 [4] ECCORestoring at /u/fpoulin/software/ClimaOcean.jl/src/DataWrangling/ecco_restoring.jl:210
 [5] DiscreteForcing at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Forcings/discrete_forcing.jl:51
 [6] hydrostatic_free_surface_tracer_tendency at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_tendency_kernel_functions.jl:133
 [7] macro expansion at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/compute_hydrostatic_free_surface_tendencies.jl:240
Unhandled Task ERROR: KernelException: exception thrown during kernel execution on device NVIDIA A100-SXM4-40GB

Code

using Oceananigans
using Oceananigans.Units
using ClimaOcean
using ClimaOcean.OceanSeaIceModels.CrossRealmFluxes: LatitudeDependentAlbedo

using CFTime
using Dates

using ClimaOcean.ECCO

z_faces = exponential_z_faces(Nz=4, depth=6000)
Nx = 144
Ny = 60
Nz = length(z_faces) - 1

grid = LatitudeLongitudeGrid(GPU();
                             size = (Nx, Ny, Nz),
                             halo = (7, 7, 7),
                             z = z_faces,
                             latitude  = (-80, -20),
                             longitude = (0, 360))

bottom_height = regrid_bathymetry(grid;
                                  minimum_depth = 10,
                                  interpolation_passes = 5,
                                  connected_regions_allowed = 0)

grid = ImmersedBoundaryGrid(grid, GridFittedBottom(bottom_height), active_cells_map=true)

dates = DateTimeProlepticGregorian(1993, 1, 1) : Month(1) : DateTimeProlepticGregorian(1993, 5, 1)

temperature = ECCOMetadata(:temperature, dates, ECCO4Monthly())
salinity    = ECCOMetadata(:salinity,    dates, ECCO4Monthly())

@inline mask(λ, φ, z, t) = min(1, max(0, -(λ + 80)/10 + 1, (λ + 30)/10))

FT = ECCO_restoring_forcing(temperature; grid, architecture = GPU(), timescale = 2days, mask)
FS = ECCO_restoring_forcing(salinity;    grid, architecture = GPU(), timescale = 2days, mask)

forcing = (T=FT, S=FS)

ocean = ocean_simulation(grid; forcing)
model = ocean.model

set!(model,
     T = temperature[1],
     S = salinity[1])

@simone-silvestri
Collaborator

I tried it on the CPU with --check-bounds=yes and there was no problem. I think it is a problem with defining the mask as a function, but I am still not sure what causes this. I will try running with -g2 on the GPU and see what the error is.

@francispoulin
Author

I tried it twice with --check-bounds=yes and both times I obtained an error, but the error is a bit different; see below.

ERROR: a bounds error was thrown during kernel execution on thread (1, 1, 1) in block (60, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).

Any ideas why this would fail for me?

@glwagner
Member

@simone-silvestri said he tried it on the CPU, so the issue may be GPU-specific, right? @francispoulin maybe try running with julia -g2 ...

@francispoulin
Author

Good idea, @glwagner.

I tried it and got a slightly different output, see below. Hmm...

ERROR: a bounds error was thrown during kernel execution on thread (129, 1, 1) in block (46, 1, 1).
Stacktrace:
 [1] indexed_iterate at ./tuple.jl:92
 [2] indexed_iterate at ./tuple.jl:92
 [3] stateindex at /u/fpoulin/software/ClimaOcean.jl/src/ClimaOcean.jl:40
 [4] ECCORestoring at /u/fpoulin/software/ClimaOcean.jl/src/DataWrangling/ecco_restoring.jl:210
 [5] DiscreteForcing at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Forcings/discrete_forcing.jl:51
 [6] hydrostatic_free_surface_tracer_tendency at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_tendency_kernel_functions.jl:133
 [7] macro expansion at /u/fpoulin/.julia/packages/Oceananigans/dvdXO/src/Models/HydrostaticFreeSurfaceModels/compute_hydrostatic_free_surface_tendencies.jl:240
 [8] gpu_compute_hydrostatic_free_surface_Gc! at /u/fpoulin/.julia/packages/KernelAbstractions/QE5mt/src/macros.jl:95
 [9] gpu_compute_hydrostatic_free_surface_Gc! at ./none:0

Unhandled Task ERROR: KernelException: exception thrown during kernel execution on device NVIDIA A100-SXM4-40GB

@glwagner
Member

Great, that tells you exactly where the error comes from.

@glwagner
Member

It's here:

LX, LY, LZ = loc

So you need to figure out what this returns:

loc = location(p.ECCO_fts)

since loc is used here:

mask = stateindex(p.mask, i, j, k, grid, clock.time, loc)
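
As a minimal, library-free illustration of why the stack trace ends in `indexed_iterate` at `tuple.jl`: unpacking a tuple into three names throws a `BoundsError` when the tuple has fewer than three elements. The helper `unpack_location`, the `Center` stand-in type, and the empty tuple are all made up for this sketch; they are not what `location(p.ECCO_fts)` actually returns.

```julia
# Destructuring `a, b, c = t` calls Base.indexed_iterate on `t` once per name.
# For a tuple that is too short, this raises a BoundsError.

# Hypothetical helper (not part of ClimaOcean): unpack three locations.
function unpack_location(loc)
    LX, LY, LZ = loc
    return LX, LY, LZ
end

struct Center end   # stand-in type, unrelated to Oceananigans.Center

good = (Center, Center, Center)
bad  = ()           # assumed example of a malformed location

ok = unpack_location(good)

# Record whether unpacking the short tuple raises a BoundsError.
threw = try
    unpack_location(bad)
    false
catch err
    err isa BoundsError
end

println("three-element tuple unpacks: ", ok == good)
println("short tuple throws BoundsError: ", threw)
```

So printing `location(p.ECCO_fts)` on the GPU-adapted object and checking that it is a three-element tuple is probably the quickest check.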

@glwagner
Member

glwagner commented Sep 25, 2024

I believe the problem is that location is not defined for GPUAdaptedFieldTimeSeries?

There is a fallback which is maybe problematic...
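
A rough sketch of how a catch-all fallback can hide a missing method. Every name here (`ToySeries`, `ToyAdaptedNoMethod`, `ToyAdaptedWithMethod`, `toy_location`) is hypothetical, and the empty-tuple fallback is only an assumption for illustration; this is not the Oceananigans source.

```julia
# All names below are hypothetical stand-ins; none of this is Oceananigans source.

# A type that carries its three grid locations as type parameters.
struct ToySeries{LX, LY, LZ} end

# Two GPU-style wrappers around it: one without its own method, one with.
struct ToyAdaptedNoMethod{LX, LY, LZ} end
struct ToyAdaptedWithMethod{LX, LY, LZ} end

# Catch-all fallback. Returning an empty tuple is an assumption made for
# illustration; the real fallback may return something else.
toy_location(x) = ()

# Specific methods read the location from the type parameters.
toy_location(::ToySeries{LX, LY, LZ}) where {LX, LY, LZ} = (LX, LY, LZ)
toy_location(::ToyAdaptedWithMethod{LX, LY, LZ}) where {LX, LY, LZ} = (LX, LY, LZ)

original       = toy_location(ToySeries{:Center, :Center, :Center}())
missing_method = toy_location(ToyAdaptedNoMethod{:Center, :Center, :Center}())
with_method    = toy_location(ToyAdaptedWithMethod{:Center, :Center, :Center}())

println(length(original))        # three entries, safe to unpack
println(length(missing_method))  # fallback result, too short for three names
println(length(with_method))     # three entries once the method exists
```

In this toy setup the call on the wrapper without a method does not fail at dispatch; it silently returns the fallback value, and the error only appears later when the result is unpacked.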

@glwagner
Member

This will help, I think: CliMA/Oceananigans.jl#3790
