Compilation failure due to high register usage #214
Does Metal support register spilling? What happens when you exceed the total number of registers available (255 on CUDA, IIRC)?
Actually, this isn't a launch failure, it's a compilation failure, so it has nothing to do with the launch configuration. It also means that there's a hard limit on how many registers a kernel can use; however, there's no way to query either that limit or the number of registers a kernel uses. So I guess we can't do anything about this...
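For comparison, CUDA.jl does expose both numbers through its kernel introspection helpers. A minimal sketch (the kernel here is just a hypothetical placeholder, not from this issue):

```julia
using CUDA

# throwaway kernel, only needed so there is something to compile
function f(a)
    i = threadIdx().x
    @inbounds a[i] += 1f0
    return
end

a = CUDA.zeros(Float32, 32)
k = @cuda launch=false f(a)       # compile without launching

CUDA.registers(k)                 # registers used per thread
CUDA.maxthreads(k)                # max threads per block, limited by register pressure
CUDA.launch_configuration(k.fun)  # occupancy-based block/grid suggestion
```

As far as I can tell, Metal exposes nothing comparable beyond `maxTotalThreadsPerThreadgroup` on the pipeline state (see the end of this thread).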
I think it can spill though. Dummy kernel:

```julia
using Metal

function kernel(a::AbstractArray{<:NTuple{N, T}}) where {N, T}
    i = thread_position_in_grid_1d()
    @inbounds begin
        # load a large tuple
        x = a[i]
        # force all of the tuple to be available
        s = zero(T)
        for j in 1:N
            s += x[j]
        end
        y = let s = s
            ntuple(j -> x[j] + s, Val(N))
        end
        # write back out
        a[i] = y
    end
    return
end

function main(N=1)
    x = MtlArray{NTuple{N, Int}}(undef, 1)
    @metal threads=length(x) kernel(x)
end
```

If I have it load a large amount of data (1000 elements), the generated code starts by loading device memory into registers and spilling it immediately after; computing the sum and storing the resulting tuple then consists of a long sequence of accesses to those spill slots.
@ChrisRackauckas @utkarsh530 Do either of you remember where exactly this happened, and how to reproduce? I'd like to investigate, but that's hard without an MWE.
This came from the kernel-generating methods of DiffEqGPU.

```julia
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, Metal

function lorenz2(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz2, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
            trajectories = 10_000,
            saveat = 1.0f0)
```

This is the small case that should work, but if you pump it up to one of the bigger ODEs it should hit this. @utkarsh530 do you remember which ODE hit it?
I'm getting Float64 values with that MWE:

```julia
julia> sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
                   trajectories = 10_000,
                   saveat = 1.0f0)
ERROR: InvalidIRError: compiling MethodInstance for DiffEqGPU.gpu_ode_asolve_kernel(::KernelAbstractions.CompilerMetadata{…}, ::MtlDeviceVector{…}, ::GPUTsit5, ::MtlDeviceMatrix{…}, ::MtlDeviceMatrix{…}, ::Float32, ::CallbackSet{…}, ::Nothing, ::Float32, ::Float32, ::StepRangeLen{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported use of double value
Reason: unsupported use of double value
Reason: unsupported use of double value
Stacktrace:
  [1] Float64
    @ ./float.jl:159
  [2] convert
    @ ./number.jl:7
  [3] _promote
    @ ./promotion.jl:370
  [4] promote
    @ ./promotion.jl:393
  [5] *
    @ ./promotion.jl:423
  [6] unsafe_getindex
    @ ./range.jl:963
  [7] getindex
    @ ./range.jl:956
  [8] macro expansion
    @ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87
  [9] gpu_ode_asolve_kernel
    @ ~/.julia/packages/KernelAbstractions/zPAn3/src/macros.jl:95
 [10] gpu_ode_asolve_kernel
    @ ./none:0
```
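The stacktrace points at indexing a range inside the kernel. One plausible source (an assumption on my part, based on the `StepRangeLen{…}` argument in the signature above) is that Float32 ranges in Julia carry Float64 internals, so indexing them goes through Float64 arithmetic:

```julia
julia> ts = 0.0f0:1.0f0:10.0f0;  # e.g. the kind of time grid `saveat` could turn into

julia> typeof(ts)  # exact parameters depend on the Julia version
StepRangeLen{Float32, Float64, Float64, Int64}

julia> typeof(ts[3])  # elements are Float32, but ref/step are stored as Float64
Float32
```

Computing `ts[i]` multiplies an integer offset by that Float64 step, which is exactly the promotion chain (`*` → `promote` → `convert` → `Float64`) shown in the stacktrace, and double precision is what the Metal compiler rejects.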
Is that to be trusted? Because …
SciML/DiffEqGPU.jl#317
Any workaround?
As seen on DiffEqGPU.jl:
It's interesting because, IIUC, the dynamic workgroup size setting there should have used `maxTotalThreadsPerThreadgroup`, which in the case of CUDA takes register usage into account. Maybe there are additional limits we need to respect with Metal?
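For reference, a rough sketch of how that limit could be inspected from Metal.jl, assuming the compiled kernel object exposes its compute pipeline state as `kernel.pipeline` (the kernel itself is just a hypothetical placeholder):

```julia
using Metal

# throwaway kernel, only needed so there is a pipeline to query
function dummy(a)
    i = thread_position_in_grid_1d()
    @inbounds a[i] += 1f0
    return
end

a = MtlArray(zeros(Float32, 32))
kernel = @metal launch=false dummy(a)

# Per-pipeline upper bound on threads per threadgroup. On CUDA the analogous
# query shrinks with register pressure; whether Metal folds register usage
# into this number is unclear.
kernel.pipeline.maxTotalThreadsPerThreadgroup
```

Of course, if compiling the pipeline itself fails, as reported in this issue, there is nothing left to query.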