GH200 run on ALPS #7

Open · luraess opened this issue Jan 27, 2025 · 25 comments
Labels: bug (Something isn't working)

Comments

@luraess commented Jan 27, 2025

Reporting here about an attempt at running https://github.com/PRONTOLab/GB-25/blob/main/oceananigans-dynamical-core/super_simple_simulation.jl on a single NVIDIA GH200 on the ALPS infrastructure.

Config

julia> CUDA.versioninfo()
CUDA runtime 12.4, local installation
CUDA driver 12.6
NVIDIA driver 550.54.15

CUDA libraries: 
- CUBLAS: 12.4.2
- CURAND: 10.3.5
- CUFFT: 11.2.0
- CUSOLVER: 11.6.0
- CUSPARSE: 12.3.0
- CUPTI: 2024.1.0 (API 22.0.0)
- NVML: 12.0.0+550.54.15

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0
- CUDA_Runtime_Discovery: 0.3.5

Toolchain:
- Julia: 1.11.3
- LLVM: 16.0.6

Environment:
- JULIA_CUDA_MEMORY_POOL: none

Preferences:
- CUDA_Runtime_jll.version: 12.4
- CUDA_Runtime_jll.local: true

4 devices:
  0: NVIDIA GH200 120GB (sm_90, 93.134 GiB / 95.577 GiB available)
  1: NVIDIA GH200 120GB (sm_90, 94.094 GiB / 95.577 GiB available)
  2: NVIDIA GH200 120GB (sm_90, 94.094 GiB / 95.577 GiB available)
  3: NVIDIA GH200 120GB (sm_90, 94.095 GiB / 95.577 GiB available)

julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 288 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, neoverse-v2)
Threads: 1 default, 0 interactive, 1 GC (on 288 virtual cores)
Environment:
  JULIA_CUDA_MEMORY_POOL = none
  JULIA_DEPOT_PATH = /capstor/scratch/cscs/lraess/daint/juliaup/depot
  JULIA_ADIOS2_PATH = /user-environment/linux-sles15-neoverse_v2/gcc-13.2.0/adios2-2.10.0-baj4a2sfsk4fggnqfmd5ysdzaqoip4e3
  JULIA_LOAD_PATH = :/user-environment/juhpc_setup/julia_preferences
  JULIA_DEBUG = Reactant_jll

Output

julia> include("super_simple_simulation.jl")
AssertionError("Could not find registered platform with name: \"cuda\". Available platform names are: ")
┌ Debug: Detected CUDA Driver version 12.4.0
└ @ Reactant_jll /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant_jll/TDalS/.pkg/platform_augmentation.jl:60
Reactant_jll.cuDriverGetVersion(dlopen("libcuda.so")) = v"12.4.0"
┌ Warning: `Adapt.parent_type` is not implemented for Field{Center, Center, Face, Nothing, LatitudeLongitudeGrid{Float64, Periodic, Bounded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Float64, Float64}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, GPU}, Tuple{Colon, Colon, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}}. Assuming Field{Center, Center, Face, Nothing, LatitudeLongitudeGrid{Float64, Periodic, Bounded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Float64, Float64}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, GPU}, Tuple{Colon, Colon, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}} isn't a wrapped array.
└ @ Reactant /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/M4ejd/src/Reactant.jl:77
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (8.437 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (11.849 seconds).
[ Info: Simulation is stopping after running for 20.330 seconds.
[ Info: Model iteration 2 equals or exceeds stop iteration 2.
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (1.968 minutes)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (1.279 minutes).
[ Info: Simulation is stopping after running for 0 seconds.
[ Info: Model iteration 3 equals or exceeds stop iteration 2.
error: `ptxas` invocation failed. Log:
ptxas /tmp/mlir-gpumodname-nvptx64-nvidia-cuda-sm_90-ac1dae.ptx, line 5; fatal   : Unsupported .version 8.6; current version is '8.4'
ptxas fatal   : Ptx assembly aborted due to errors

error: An error happened while serializing the module.
[... the same `ptxas` failure ("Unsupported .version 8.6; current version is '8.4'") repeats for every remaining GPU module; 48 further identical error blocks omitted ...]
2025-01-27 14:04:33.155762: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 1417842432263479884
ERROR: LoadError: NOT_FOUND: No registered implementation for FFI custom call to enzymexla_compile_gpu for Host

Stacktrace:
 [1] reactant_err(msg::Cstring)
   @ Reactant.XLA /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/M4ejd/src/XLA.jl:120
 [2] Compile
   @ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/M4ejd/src/XLA.jl:514 [inlined]
 [3] compile_xla(f::Function, args::Tuple{Simulation{…}}; client::Nothing, optimize::Bool, no_nan::Bool)
   @ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/M4ejd/src/Compiler.jl:980
 [4] compile_xla
   @ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/M4ejd/src/Compiler.jl:954 [inlined]
 [5] compile(f::Function, args::Tuple{Simulation{…}}; client::Nothing, optimize::Bool, sync::Bool, no_nan::Bool)
   @ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/M4ejd/src/Compiler.jl:992
 [6] top-level scope
   @ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/M4ejd/src/Compiler.jl:677
 [7] include(fname::String)
   @ Main ./sysimg.jl:38
 [8] top-level scope
   @ REPL[1]:1
in expression starting at /capstor/scratch/cscs/lraess/GB-25/oceananigans-dynamical-core/super_simple_simulation.jl:39
Some type information was truncated. Use `show(err)` to see complete types.
@glwagner (Collaborator)

I think @wsmoses fixed a very similar error, but for "Unsupported .version 8.2" vs. 8.1 rather than 8.6 vs. 8.4.

@wsmoses (Member) commented Jan 27, 2025

What version of Reactant are you using? We released the fix in the latest patch.

@luraess (Author) commented Jan 27, 2025

(GB-25) pkg> st
Status `/capstor/scratch/cscs/lraess/GB-25/Project.toml`
  [6e4b80f9] BenchmarkTools v1.6.0
  [9e8cae18] Oceananigans v0.95.7 `https://github.com/CliMA/Oceananigans.jl.git#main`
  [3c362404] Reactant v0.2.21 `https://github.com/EnzymeAD/Reactant.jl.git#main`
  [0192cb87] Reactant_jll v0.0.48+0

@wsmoses (Member) commented Jan 27, 2025

Can you update to the latest release (0.2.22)?

Though FYI, if it's aarch64 you'll need the aarch64 CUDA JLL that @giordano is currently working on (not yet landed). The x86 CUDA JLL is already in the package manager.
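
For reference, a minimal sketch of what updating to the registered release could look like from the Pkg REPL (assuming the GB-25 environment is active; this forces the registered version instead of the repo-tracked one):

(GB-25) pkg> add Reactant@0.2.22
(GB-25) pkg> up Reactant_jll
(GB-25) pkg> st Reactant Reactant_jll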

@luraess (Author) commented Jan 27, 2025

Thanks, will update. And yes, it's aarch64, so I will have to wait until @giordano lands the JLL.

@giordano (Collaborator)

If/when JuliaPackaging/Yggdrasil#10313 is green.

@giordano (Collaborator)

Reactant#main now supports CUDA on aarch64-linux.

@luraess (Author) commented Feb 3, 2025

I gave the code https://github.com/PRONTOLab/GB-25/blob/glw/super-simple-distributed/oceananigans-dynamical-core/super_simple_simulation.jl from the glw/super-simple-distributed branch a try on a GH200, setting

# arch = Distributed(GPU(), partition=Partition(2, 2)) # distributed on 4 GPUs
arch = GPU()

and running it:

julia> simulation = Simulation(model, Δt=60, stop_iteration=2);

julia> run!(simulation)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (9.565 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (12.128 seconds).
[ Info: Simulation is stopping after running for 21.753 seconds.
[ Info: Model iteration 2 equals or exceeds stop iteration 2.

julia> r_simulation = Simulation(r_model, Δt=60, stop_iteration=2);

julia> pop!(r_simulation.callbacks, :nan_checker);

julia> r_run! = @compile sync = true run!(r_simulation);
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (1.858 minutes)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (1.687 minutes).
[ Info: Simulation is stopping after running for 0 seconds.
[ Info: Model iteration 3 equals or exceeds stop iteration 2.
ERROR: UNAVAILABLE: No PTX compilation provider is available. Neither ptxas/nvlink nor nvjtlink is available. As a fallback you can enable JIT compilation in the CUDA driver via the flag `--xla_gpu_unsafe_fallback_to_driver_on_ptxas_not_found`. Details: 
 - Has NvJitLink support: LibNvJitLink is not supported (disabled during compilation).
 - Has NvPtxCompiler support: LibNvPtxCompiler is not supported (disabled during compilation).
 - Parallel compilation support is desired: 0
 - ptxas_path: Couldn't find a suitable version of ptxas. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /user-environment/juhpc_setup/juliaup_wrapper/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/bin/ptxas, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/ptxas, /users/lraess/bin/ptxas, /usr/local/bin/ptxas, /usr/bin/ptxas, /bin/ptxas, /usr/lib/mit/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/ptxas, /capsto/cuda_nvcc/bin/ptxas, bin/ptxas, /usr/local/cuda/bin/ptxas, /opt/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/ptxas
 - ptxas_version: Couldn't find a suitable version of ptxas. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /user-environment/juhpc_setup/juliaup_wrapper/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/bin/ptxas, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/ptxas, /users/lraess/bin/ptxas, /usr/local/bin/ptxas, /usr/bin/ptxas, /bin/ptxas, /usr/lib/mit/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/ptxas, /capsto/cuda_nvcc/bin/ptxas, bin/ptxas, /usr/local/cuda/bin/ptxas, /opt/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/ptxas
 - nvlink_path: Couldn't find a suitable version of nvlink. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /user-environment/juhpc_setup/juliaup_wrapper/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/bin/nvlink, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/nvlink, /users/lraess/bin/nvlink, /usr/local/bin/nvlink, /usr/bin/nvlink, /bin/nvlink, /usr/lib/mit/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/nvlink, /capsto/cuda_nvcc/bin/nvlink, bin/nvlink, /usr/local/cuda/bin/nvlink, /opt/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/nvlink
 - nvlink_version: Couldn't find a suitable version of nvlink. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /user-environment/juhpc_setup/juliaup_wrapper/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/bin/nvlink, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/nvlink, /users/lraess/bin/nvlink, /usr/local/bin/nvlink, /usr/bin/nvlink, /bin/nvlink, /usr/lib/mit/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/nvlink, /capsto/cuda_nvcc/bin/nvlink, bin/nvlink, /usr/local/cuda/bin/nvlink, /opt/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/nvlink
 - Driver compilation is enabled: 0


Stacktrace:
 [1] reactant_err(msg::Cstring)
   @ Reactant.XLA /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/Vrbvs/src/XLA.jl:164
 [2] Compile(client::Reactant.XLA.Client, mod::Reactant.MLIR.IR.Module)
   @ Reactant.XLA /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/Vrbvs/src/XLA.jl:571
 [3] compile_xla(f::Function, args::Tuple{Simulation{…}}; client::Nothing, optimize::Bool, no_nan::Bool, device::Nothing)
   @ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/Vrbvs/src/Compiler.jl:1034
 [4] compile_xla
   @ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/Vrbvs/src/Compiler.jl:983 [inlined]
 [5] compile(f::Function, args::Tuple{…}; sync::Bool, kwargs::@Kwargs{…})
   @ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/Vrbvs/src/Compiler.jl:1052
 [6] top-level scope
   @ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/Vrbvs/src/Compiler.jl:706
Some type information was truncated. Use `show(err)` to see complete types.

julia> 

Env:

(GB-25) pkg> st
Status `/capstor/scratch/cscs/lraess/GB-25/Project.toml`
  [6e4b80f9] BenchmarkTools v1.6.0
  [9e8cae18] Oceananigans v0.95.8 `https://github.com/CliMA/Oceananigans.jl.git#main`
  [3c362404] Reactant v0.2.24 `https://github.com/EnzymeAD/Reactant.jl.git#main`
  [0192cb87] Reactant_jll v0.0.60+0
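
One possible workaround while this is being sorted out (a sketch, untested here): expose a system ptxas to XLA by prepending its directory to PATH before Reactant initializes (the candidate list above appears to include PATH entries), and/or enable the driver-JIT fallback that the error message itself suggests via XLA_FLAGS:

# assumes a CUDA installation with ptxas under /usr/local/cuda/bin; adjust to the local module path
ENV["PATH"] = "/usr/local/cuda/bin:" * ENV["PATH"]
# and/or fall back to driver JIT compilation, as suggested in the error above
ENV["XLA_FLAGS"] = "--xla_gpu_unsafe_fallback_to_driver_on_ptxas_not_found"
using Reactant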

@wsmoses (Member) commented Feb 3, 2025

Can you retry on the latest main? We just landed a hopeful fix for this. cc @giordano

@luraess (Author) commented Feb 3, 2025

OK, testing it right now. Besides the issue, there seems to be quite significant runtime overhead in the Reactant case.

The new run reports the following issue(s):

julia> include("super_simple_simulation.jl")
┌ Debug: Detected CUDA Driver version 12.4.0
└ @ Reactant_jll /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant_jll/qvFaw/.pkg/platform_augmentation.jl:60
Reactant_jll.cuDriverGetVersion(dlopen("libcuda.so")) = v"12.4.0"
┌ Warning: `Adapt.parent_type` is not implemented for Field{Center, Center, Face, Nothing, LatitudeLongitudeGrid{Float64, Periodic, Bounded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Float64, Float64}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, GPU}, Tuple{Colon, Colon, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}}. Assuming Field{Center, Center, Face, Nothing, LatitudeLongitudeGrid{Float64, Periodic, Bounded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Float64, Float64}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, GPU}, Tuple{Colon, Colon, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}} isn't a wrapped array.
└ @ Reactant /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Reactant.jl:39
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (9.342 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (12.374 seconds).
[ Info: Simulation is stopping after running for 21.765 seconds.
[ Info: Model iteration 2 equals or exceeds stop iteration 2.
[ Info: KA
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (1.207 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (1.944 ms).
[ Info: Simulation is stopping after running for 0 seconds.
[ Info: Model iteration 3 equals or exceeds stop iteration 2.
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (1.939 minutes)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (2.021 minutes).
[ Info: Simulation is stopping after running for 0 seconds.
[ Info: Model iteration 4 equals or exceeds stop iteration 2.
ERROR: LoadError: UNAVAILABLE: No PTX compilation provider is available. Neither ptxas/nvlink nor nvjtlink is available. As a fallback you can enable JIT compilation in the CUDA driver via the flag `--xla_gpu_unsafe_fallback_to_driver_on_ptxas_not_found`. Details: 
 - Has NvJitLink support: LibNvJitLink is not supported (disabled during compilation).
 - Has NvPtxCompiler support: LibNvPtxCompiler is not supported (disabled during compilation).
 - Parallel compilation support is desired: 0
 - ptxas_path: Couldn't find a suitable version of ptxas. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /user-environment/juhpc_setup/juliaup_wrapper/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/bin/ptxas, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/ptxas, /users/lraess/bin/ptxas, /usr/local/bin/ptxas, /usr/bin/ptxas, /bin/ptxas, /usr/lib/mit/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/ptxas, /capsto/cuda_nvcc/bin/ptxas, bin/ptxas, /usr/local/cuda/bin/ptxas, /opt/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/ptxas
 - ptxas_version: Couldn't find a suitable version of ptxas. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /user-environment/juhpc_setup/juliaup_wrapper/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/bin/ptxas, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/ptxas, /users/lraess/bin/ptxas, /usr/local/bin/ptxas, /usr/bin/ptxas, /bin/ptxas, /usr/lib/mit/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/ptxas, /capsto/cuda_nvcc/bin/ptxas, bin/ptxas, /usr/local/cuda/bin/ptxas, /opt/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/ptxas, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/ptxas
 - nvlink_path: Couldn't find a suitable version of nvlink. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /user-environment/juhpc_setup/juliaup_wrapper/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/bin/nvlink, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/nvlink, /users/lraess/bin/nvlink, /usr/local/bin/nvlink, /usr/bin/nvlink, /bin/nvlink, /usr/lib/mit/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/nvlink, /capsto/cuda_nvcc/bin/nvlink, bin/nvlink, /usr/local/cuda/bin/nvlink, /opt/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/nvlink
 - nvlink_version: Couldn't find a suitable version of nvlink. The following locations were considered: /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /user-environment/juhpc_setup/juliaup_wrapper/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/bin/nvlink, /users/lraess/.vscode-server/cli/servers/Stable-cd4ee3b1c348a13bafd8f9ad8060705f6d4b9cba/server/bin/remote-cli/nvlink, /users/lraess/bin/nvlink, /usr/local/bin/nvlink, /usr/bin/nvlink, /bin/nvlink, /usr/lib/mit/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/juliaup/julia-1.11.3+0.aarch64.linux.gnu/bin/julia.runfiles/cuda_nvcc/bin/nvlink, /capsto/cuda_nvcc/bin/nvlink, bin/nvlink, /usr/local/cuda/bin/nvlink, /opt/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../nvidia/cuda_nvcc/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../bin/nvlink, /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/../../../../../bin/nvlink
 - Driver compilation is enabled: 0


Stacktrace:
 [1] reactant_err(msg::Cstring)
   @ Reactant.XLA /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/XLA.jl:164
 [2] Compile(client::Reactant.XLA.Client, mod::Reactant.MLIR.IR.Module)
   @ Reactant.XLA /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/XLA.jl:571
 [3] compile_xla(f::Function, args::Tuple{Simulation{…}}; client::Nothing, optimize::Bool, no_nan::Bool, device::Nothing)
   @ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:1034
 [4] compile_xla
   @ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:983 [inlined]
 [5] compile(f::Function, args::Tuple{…}; sync::Bool, kwargs::@Kwargs{})
   @ Reactant.Compiler /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:1052
 [6] top-level scope
   @ /capstor/scratch/cscs/lraess/daint/juliaup/depot/packages/Reactant/7y9bj/src/Compiler.jl:706
 [7] include(fname::String)
   @ Main ./sysimg.jl:38
 [8] top-level scope
   @ REPL[3]:1
in expression starting at /capstor/scratch/cscs/lraess/GB-25/oceananigans-dynamical-core/super_simple_simulation.jl:42
Some type information was truncated. Use `show(err)` to see complete types.

julia> 

@wsmoses (Member) commented Feb 3, 2025

I mean, the supposed runtime that Oceananigans prints is actually all compile time at the moment (and regardless, something seems to be going awry; @giordano, can you take a look?).
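
A sketch of separating the two costs when benchmarking the Reactant path (assuming r_simulation is set up as in the script; the compiled function is then called like the original run!):

compile_time = @elapsed begin
    r_run! = @compile sync = true run!(r_simulation)   # tracing + XLA compilation
end
exec_time = @elapsed r_run!(r_simulation)               # actual execution of the compiled step
@info "Reactant timings" compile_time exec_time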

@giordano (Collaborator) commented Feb 3, 2025

@luraess have you ever tried this before? Seeing that ptxas can't be found anywhere (we only modified the first location searched) suggests to me that this would never have worked for you. Can you check whether ptxas is available at any of the printed locations?
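
A quick Julia-side check of the candidate locations could look like this (a sketch; the long artifact path is the first entry from the error above):

candidates = [
    "/capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas",
    "/usr/local/cuda/bin/ptxas",
    "/opt/cuda/bin/ptxas",
]
filter(isfile, candidates)   # which of the searched paths actually exist
Sys.which("ptxas")           # whatever is on PATH, if anything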

@luraess (Author) commented Feb 3, 2025

Could be a module issue. Now I am seeing:

lraess@nid007256:~/scratch/GB-25> which ptxas
/user-environment/linux-sles15-neoverse_v2/gcc-13.2.0/cuda-12.4.0-wjg6in2hqntqkxkvtcitw32w3iluoae3/bin/ptxas

let me try again

@wsmoses (Member) commented Feb 3, 2025

I mean, that's different anyway, as we should be using the ptxas shipped with the JLL. If you look in the JLL artifact path, is there a ptxas there?

@giordano (Collaborator) commented Feb 3, 2025

But does the file /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas really not exist? That's the first one in the list.

@luraess (Author) commented Feb 3, 2025

Seems it does exist

lraess@daint-ln001:~/scratch> ls /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/
fatbinary  ptxas

@wsmoses (Member) commented Feb 3, 2025

What is Reactant.XLA.CUDA_DATA_DIR? And what is dirname(dirname(Reactant_jll.ptxas_path))?

@giordano (Collaborator) commented Feb 3, 2025

So the question is why XLA doesn't seem to like it. Also, is there a ptxas in any of the other fallback locations? I had it under /usr/local/cuda/bin, and that was a system version, which was correctly picked up. If you have it there (or in any of the other locations) but XLA decides it doesn't like it, then that's a problem someone will have to debug.

@luraess (Author) commented Feb 3, 2025

What is Reactant.XLA.CUDA_DATA_DIR? And what is dirname(dirname(Reactant_jll.ptxas_path))?

julia> Reactant.XLA.CUDA_DATA_DIR
Base.RefValue{String}("/capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda")

julia> dirname(dirname(Reactant_jll.ptxas_path))
"/capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda"

@wsmoses (Member) commented Feb 3, 2025

Oh wait. @luraess, can you use the latest Reactant release and not a dev'd one? I bet you're accidentally on an old JLL (if you run st, for example). Julia won't necessarily update dependencies if you dev a main branch.

edit: Enzyme->Reactant in the text
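
For reference, switching off the repo-tracked copy could look like this (a sketch; `free` stops tracking the URL#main repo and lets Pkg resolve back to a registered release):

(GB-25) pkg> free Reactant
(GB-25) pkg> st Reactant Reactant_jll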

@giordano (Collaborator) commented Feb 3, 2025

Can you run the following?

/capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas --version

And I insist: do you have ptxas in any of the fallback system locations? If so, this seems to me like a problem with XLA being unable to accept ptxas from anywhere.

@luraess can you use latest Enzyme release and not dev'd. I bet you're accidentally on an old jll (if you run st for example). Julia won't update dependencies necessarily if you dev a main

The message above showed it's Reactant_jll v0.0.60+0, and that matches the hash of the artifact. I don't think that's the issue.

@wsmoses (Member) commented Feb 3, 2025

oh right....

@luraess (Author) commented Feb 3, 2025

Can you run

lraess@daint-ln001:~/scratch> /capstor/scratch/cscs/lraess/daint/juliaup/depot/artifacts/9a81e22975230cdad93b753211166f6e159cf9a2/lib/cuda/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:08_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

@giordano (Collaborator) commented Feb 3, 2025

Then what's wrong with XLA? 😄

@wsmoses (Member) commented Feb 3, 2025

@definelicht added the bug (Something isn't working) label on Feb 4, 2025