Compilation failure due to high register usage #214

Open
maleadt opened this issue Jun 24, 2023 · 9 comments
Labels
kernels Things about kernels and how they are compiled. upstream Out of our hands

Comments

@maleadt
Member

maleadt commented Jun 24, 2023

As seen on DiffEqGPU.jl:

  caused by: NSError: Compute function exceeds available temporary registers (AGXMetal13_3, code 3)
  Stacktrace:
    [1] MTLComputePipelineState(dev::Metal.MTL.MTLDeviceInstance, fun::Metal.MTL.MTLFunctionInstance)
      @ Metal.MTL /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/lib/mtl/compute_pipeline.jl:60
    [2] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry), Tuple{Vector{UInt8}, String}}; return_function::Bool)
      @ Metal /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/compiler/compilation.jl:71
    [3] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry), Tuple{Vector{UInt8}, String}})
      @ Metal /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/compiler/compilation.jl:66
    [4] actual_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, compiler::typeof(Metal.compile), linker::typeof(Metal.link))
      @ GPUCompiler /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/GPUCompiler/NVLGB/src/execution.jl:132
    [5] cached_compilation(cache::Dict{Any, Any}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.MetalCompilerTarget, Metal.MetalCompilerParams}, compiler::Function, linker::Function)
      @ GPUCompiler /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/GPUCompiler/NVLGB/src/execution.jl:103
    [6] macro expansion
      @ /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/compiler/execution.jl:162 [inlined]
    [7] macro expansion
      @ ./lock.jl:267 [inlined]
    [8] mtlfunction(f::typeof(DiffEqGPU.gpu_ode_asolve_kernel), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, MtlDeviceVector{ODEProblem{SVector{20, Float32}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(f_large), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, GPURosenbrock23{true}, MtlDeviceMatrix{SVector{20, Float32}, 1}, MtlDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, Nothing, Val{false}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ Metal /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/compiler/execution.jl:157
    [9] mtlfunction(f::typeof(DiffEqGPU.gpu_ode_asolve_kernel), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, MtlDeviceVector{ODEProblem{SVector{20, Float32}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(f_large), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, GPURosenbrock23{true}, MtlDeviceMatrix{SVector{20, Float32}, 1}, MtlDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, Nothing, Val{false}}})
      @ Metal /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/compiler/execution.jl:155
   [10] macro expansion
      @ /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/compiler/execution.jl:77 [inlined]
   [11] (::KernelAbstractions.Kernel{MetalBackend, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.gpu_ode_asolve_kernel)})(::MtlVector{ODEProblem{SVector{20, Float32}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(f_large), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, ::Vararg{Any}; ndrange::Int64, workgroupsize::Nothing)
      @ Metal.MetalKernels /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/MetalKernels.jl:105
   [12] Kernel
      @ /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/Metal/9shJi/src/MetalKernels.jl:101 [inlined]
   [13] #vectorized_asolve#166
      @ /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-3.0/build/default-macmini-aarch64-3-0/julialang/diffeqgpu-dot-jl/src/solve.jl:182 [inlined]

This is interesting because, IIUC, the dynamic workgroup size setting there should have used maxTotalThreadsPerThreadgroup, which in the case of CUDA takes register usage into account. Maybe there are additional limits we need to respect with Metal?

@vchuravy
Member

Does Metal support register spilling? What happens when you exceed the total number of registers available (on CUDA, 255 IIRC)?

@maleadt
Member Author

maleadt commented Jun 26, 2023

Actually, this isn't a launch failure, it's a compilation failure, so it has nothing to do with the launch configuration. It also means that there's a hard limit on how many registers a kernel can use. However, there's no way to query either that limit or the number of registers a kernel uses.

So I guess we can't do anything about this...

@maleadt maleadt changed the title Launch failure due to register usage Compilation failure due to register usage Jun 26, 2023
@maleadt maleadt added the upstream Out of our hands label Jun 26, 2023
@maleadt
Member Author

maleadt commented Jun 26, 2023

I think it can spill though. Dummy kernel:

using Metal

function kernel(a::AbstractArray{<:NTuple{N, T}}) where {N, T}
    i = thread_position_in_grid_1d()

    @inbounds begin
        # load a large tuple
        x = a[i]

        # force all of the tuple to be available
        s = zero(T)
        for i in 1:N
            s += x[i]
        end
        y = let s = s
            ntuple(i->x[i]+s, Val(N))
        end

        # write back out
        a[i] = y
    end

    return
end

function main(N=1)
    x = MtlArray{NTuple{N, Int}}(undef, 1)
    # launch one thread per element (`len` was undefined in the original snippet)
    @metal threads=length(x) kernel(x)
end
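
For checking what the dummy kernel is expected to write back, a CPU-only sketch of the same per-element computation may help. The function name `reference` is hypothetical and not part of the original snippet; it only mirrors the body of `kernel` for a single tuple, without any Metal dependency.

```julia
# CPU-only reference for one element of the dummy kernel:
# add the sum of all tuple entries to every entry.
# `reference` is a hypothetical helper name, not part of Metal.jl.
function reference(x::NTuple{N, T}) where {N, T}
    # sum all entries, as the kernel's loop does
    s = zero(T)
    for j in 1:N
        s += x[j]
    end
    # rebuild the tuple with the sum added to each entry
    return let s = s
        ntuple(j -> x[j] + s, Val(N))
    end
end

result = reference((1, 2, 3))
```

Comparing `reference(x)` against the tuple read back from the `MtlArray` would be one way to confirm that a kernel which spills still produces correct results.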

If I have it load a large amount of data (1000 elements), the generated code starts with:

   0: f2051004             get_sr           r1.cache, sr80 (thread_position_in_grid.x)
   4: 62f9000000000030     mov_imm          r126, 0
   c: 9e07c28610840100     imadd            r1_r2.cache, r1.discard.sx, u4l, u2
  14: 62fd000000000030     mov_imm          r127, 0
  1c: 0e09c46218000000     iadd             r2, r2.discard, u3
  24: 0529020500c8f200     device_load      0, i32, xyzw, r5_r6_r7_r8, r1_r2, 0, signed, lsl 2
  2c: 0549124500c8f200     device_load      1, i32, xyzw, r9_r10_r11_r12, r1_r2, 1, signed, lsl 2
  34: b500c1052a80000f     stack_store      i8, 1, 2, 0, 4012, 0
  3c: 3801                 wait             1
  3e: b54a00050cc0f23b     stack_store      i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15296, 0
  46: 0549224500c8f200     device_load      1, i32, xyzw, r9_r10_r11_r12, r1_r2, 2, signed, lsl 2
  4e: 3801                 wait             1
  50: b54a00050bc0f23b     stack_store      i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15280, 0
  58: 0549324500c8f200     device_load      1, i32, xyzw, r9_r10_r11_r12, r1_r2, 3, signed, lsl 2
  60: 3801                 wait             1
  62: b54a00050ac0f23b     stack_store      i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15264, 0
  6a: 0549424500c8f200     device_load      1, i32, xyzw, r9_r10_r11_r12, r1_r2, 4, signed, lsl 2
  72: 3801                 wait             1
  74: b54a000509c0f23b     stack_store      i32, 1, 0, xyzw, 4, r9_r10_r11_r12, 15248, 0
  7c: 0549524500c8f200     device_load      1, i32, xyzw, r9_r10_r11_r12, r1_r2, 5, signed, lsl 2
  84: 3801                 wait             1
...

i.e. loading device memory into registers, and spilling it immediately after. Computing the sum and storing the resulting tuple then consists of a sequence of:

 82c: 352a000506c0f205     stack_load       r5_r6_r7_r8, i32, 1, 0, xyzw, 4, 1376, 0
 834: 3800                 wait             0
 836: 8e1986c22c000000     iadd             r6.cache, r3.cache, r6.discard
 83e: 8e0546022d000000     iadd             r1.cache, r3, r8.discard
 846: 8e1584a22c000000     iadd             r5.cache, r2.cache, r5.discard
 84e: 8e1d84e22c000000     iadd             r7.cache, r2.cache, r7.discard
 856: f2211004             get_sr           r8.cache, sr80 (thread_position_in_grid.x)
 85a: 9e23d08610000000     imadd            r8_r9.cache, r8.discard.sx, u4l, 0
 862: 8e2986212d000000     iadd             r10.cache, u3, r9.discard
 86a: 8e5584012d200000     iadd             r85.cache, u2, r8.discard
 872: 92104a4228010130     icmpsel          ult, r4l.cache, r5, r2.cache, 1, 0
 87a: 92a06a42180101300008 icmpsel          ult, r8l.cache, r85, u2, 1, 0
 884: 92024e4224010130     icmpsel          ult, r0h.cache, r7, r2, 1, 0
 88c: 0e59d0402d200000     iadd             r86, r8l.discard, r10.discard
 894: 0e19c8c02c000000     iadd             r6, r4l.discard, r6.discard
 89c: 0e21c1202c000000     iadd             r8, r0h.discard, r1.discard
 8a4: 45290a05a0c8f200     device_store     0, i32, xyzw, r5_r6_r7_r8, r85_r86, 0, signed, lsl 2, 0
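
A rough back-of-the-envelope calculation suggests why spilling is unavoidable for this kernel. This is a sketch under a stated assumption: the per-thread register budget of 128 registers of 32 bits each is inferred only from `r127` being the highest register number in the listing above, and is not a documented limit.

```julia
# Size of the tuple each thread loads when N = 1000 and the element type is Int64.
tuple_bytes = sizeof(NTuple{1000, Int64})

# ASSUMPTION: 128 general-purpose registers of 4 bytes each per thread,
# guessed from r127 appearing as the highest register in the disassembly.
assumed_register_bytes = 128 * 4

# The tuple is far larger than the assumed register budget, so most of it
# has to live on the stack.
must_spill = tuple_bytes > assumed_register_bytes
println("tuple: ", tuple_bytes, " bytes, assumed registers: ", assumed_register_bytes, " bytes")
```

Under that assumption the tuple is more than an order of magnitude larger than the register budget, which is consistent with the load-then-`stack_store` pattern in the listing.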

@maleadt maleadt changed the title Compilation failure due to register usage Compilation failure due to high register usage Jun 26, 2023
@maleadt maleadt added bug kernels Things about kernels and how they are compiled. labels Feb 28, 2024
@maleadt
Member Author

maleadt commented Jun 20, 2024

@ChrisRackauckas @utkarsh530 Do either of you remember where exactly this happened, and how to reproduce? I'd like to investigate, but that's hard without a MWE.

@ChrisRackauckas
Member

This came from the kernel-generating methods of DiffEqGPU.

using DiffEqGPU, OrdinaryDiffEq, StaticArrays, Metal

function lorenz2(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz2, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
    trajectories = 10_000,
    saveat = 1.0f0)

That is the small case that should work, but if you bump it up to one of the bigger ODEs it should hit this. @utkarsh530 do you remember which ODE hit it?

@maleadt
Member Author

maleadt commented Jul 5, 2024

I'm getting Float64 values with that MWE:

julia> sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
           trajectories = 10_000,
           saveat = 1.0f0)
ERROR: InvalidIRError: compiling MethodInstance for DiffEqGPU.gpu_ode_asolve_kernel(::KernelAbstractions.CompilerMetadata{…}, ::MtlDeviceVector{…}, ::GPUTsit5, ::MtlDeviceMatrix{…}, ::MtlDeviceMatrix{…}, ::Float32, ::CallbackSet{…}, ::Nothing, ::Float32, ::Float32, ::StepRangeLen{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported use of double value
Reason: unsupported use of double value
Reason: unsupported use of double value
Stacktrace:
  [1] Float64
    @ ./float.jl:159
  [2] convert
    @ ./number.jl:7
  [3] _promote
    @ ./promotion.jl:370
  [4] promote
    @ ./promotion.jl:393
  [5] *
    @ ./promotion.jl:423
  [6] unsafe_getindex
    @ ./range.jl:963
  [7] getindex
    @ ./range.jl:956
  [8] macro expansion
    @ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87
  [9] gpu_ode_asolve_kernel
    @ ~/.julia/packages/KernelAbstractions/zPAn3/src/macros.jl:95
 [10] gpu_ode_asolve_kernel
    @ ./none:0

@ChrisRackauckas
Member

@ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87

Is that to be trusted? Because saveat = 1.0f0 and tspan = (0.0f0, 10.0f0) but the line is if tspan[1] == saveat[1]? That seems to definitely be Float32.
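
One possible explanation, offered as a sketch and not confirmed in this thread: the kernel signature above shows a `StepRangeLen{…}` argument, and a `Float32` range built with colon syntax stores its reference point and step as `Float64`, so indexing it performs `Float64` arithmetic even though the elements are `Float32`. The range `r` below is an illustrative stand-in for the `saveat` grid, not the value DiffEqGPU actually constructs.

```julia
# A Float32 range, similar in spirit to a saveat grid over
# tspan = (0.0f0, 10.0f0) with step 1.0f0 (illustrative stand-in only).
r = 0.0f0:1.0f0:10.0f0

# The element type is Float32 ...
println(eltype(r))
# ... but the internal reference point and step are stored as Float64,
# which is what getindex uses when computing r[i].
println(typeof(r))
println((typeof(r.ref), typeof(r.step)))

# Materializing the grid gives plain Float32 storage with no Float64 fields.
v = collect(r)
println(typeof(v))
```

This would match the `getindex` → `unsafe_getindex` → `*` → `Float64` frames in the stack trace, which originate in `range.jl` and not in a comparison of two `Float32` scalars.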

@utkarsh530

This might be related to SciML/DiffEqGPU.jl#317.

@maleadt
Member Author

maleadt commented Aug 6, 2024

Any workaround?


5 participants