Compilation failure due to high register usage #214
Does Metal support register spilling? What happens when you exceed the total number of registers available (255 on CUDA, IIRC)?
Actually, this isn't a launch failure, it's a compilation failure, so it has nothing to do with the launch configuration. It also means that there's a hard limit on how many registers a kernel can use; however, there's no way to query either that limit or the number of registers a kernel uses. So I guess we can't do anything about this...
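For comparison, CUDA.jl does expose both numbers through its kernel introspection helpers. A minimal sketch (the kernel here is just a hypothetical placeholder, not from this issue):

```julia
using CUDA

# throwaway kernel, only needed so there is something to compile
function f(a)
    i = threadIdx().x
    @inbounds a[i] += 1f0
    return
end

a = CUDA.zeros(Float32, 32)
k = @cuda launch=false f(a)       # compile without launching

CUDA.registers(k)                 # registers used per thread
CUDA.maxthreads(k)                # max threads per block, limited by register pressure
CUDA.launch_configuration(k.fun)  # occupancy-based block/grid suggestion
```

As far as I can tell, Metal exposes nothing comparable beyond `maxTotalThreadsPerThreadgroup` on the pipeline state (see the end of this thread).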
I think it can spill though. Dummy kernel:

```julia
using Metal

function kernel(a::AbstractArray{<:NTuple{N, T}}) where {N, T}
    i = thread_position_in_grid_1d()
    @inbounds begin
        # load a large tuple
        x = a[i]
        # force all of the tuple to be available
        s = zero(T)
        for j in 1:N
            s += x[j]
        end
        y = let s = s
            ntuple(j -> x[j] + s, Val(N))
        end
        # write back out
        a[i] = y
    end
    return
end

function main(N=1)
    x = MtlArray{NTuple{N, Int}}(undef, 1)
    @metal threads=length(x) kernel(x)
end
```

If I have it load a large amount of data (1000 elements), the generated code starts by loading device memory into registers and spilling it immediately after; computing the sum and storing the resulting tuple then consists of a long sequence of accesses to those spill slots.
@ChrisRackauckas @utkarsh530 Do either of you remember where exactly this happened, and how to reproduce? I'd like to investigate, but that's hard without an MWE.
This came from the kernel-generating methods of DiffEqGPU.

```julia
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, Metal

function lorenz2(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz2, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
            trajectories = 10_000,
            saveat = 1.0f0)
```

This is the small case that should work, but if you pump it up to one of the bigger ODEs it should hit this. @utkarsh530 do you remember which ODE hit it?
I'm getting Float64 values with that MWE:

```julia
julia> sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend()),
                   trajectories = 10_000,
                   saveat = 1.0f0)
ERROR: InvalidIRError: compiling MethodInstance for DiffEqGPU.gpu_ode_asolve_kernel(::KernelAbstractions.CompilerMetadata{…}, ::MtlDeviceVector{…}, ::GPUTsit5, ::MtlDeviceMatrix{…}, ::MtlDeviceMatrix{…}, ::Float32, ::CallbackSet{…}, ::Nothing, ::Float32, ::Float32, ::StepRangeLen{…}, ::Val{…}) resulted in invalid LLVM IR
Reason: unsupported use of double value
Reason: unsupported use of double value
Reason: unsupported use of double value
Stacktrace:
  [1] Float64
    @ ./float.jl:159
  [2] convert
    @ ./number.jl:7
  [3] _promote
    @ ./promotion.jl:370
  [4] promote
    @ ./promotion.jl:393
  [5] *
    @ ./promotion.jl:423
  [6] unsafe_getindex
    @ ./range.jl:963
  [7] getindex
    @ ./range.jl:956
  [8] macro expansion
    @ ~/.julia/packages/DiffEqGPU/I999k/src/ensemblegpukernel/kernels.jl:87
  [9] gpu_ode_asolve_kernel
    @ ~/.julia/packages/KernelAbstractions/zPAn3/src/macros.jl:95
 [10] gpu_ode_asolve_kernel
    @ ./none:0
```
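The stacktrace points at indexing a range inside the kernel. One plausible source (an assumption on my part, based on the `StepRangeLen{…}` argument in the signature above) is that Float32 ranges in Julia carry Float64 internals, so indexing them goes through Float64 arithmetic:

```julia
julia> ts = 0.0f0:1.0f0:10.0f0;  # e.g. the kind of time grid `saveat` could turn into

julia> typeof(ts)  # exact parameters depend on the Julia version
StepRangeLen{Float32, Float64, Float64, Int64}

julia> typeof(ts[3])  # elements are Float32, but ref/step are stored as Float64
Float32
```

Computing `ts[i]` multiplies an integer offset by that Float64 step, which is exactly the promotion chain (`*` → `promote` → `convert` → `Float64`) shown in the stacktrace, and double precision is what the Metal compiler rejects.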
Is that to be trusted? Because …
SciML/DiffEqGPU.jl#317
Any workaround?
As seen on DiffEqGPU.jl:
It's interesting because, IIUC, the dynamic workgroup size setting there should have used `maxTotalThreadsPerThreadgroup`, which in the case of CUDA takes register usage into account. Maybe there are additional limits we need to respect with Metal?
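For reference, a rough sketch of how that limit could be inspected from Metal.jl, assuming the compiled kernel object exposes its compute pipeline state as `kernel.pipeline` (the kernel itself is just a hypothetical placeholder):

```julia
using Metal

# throwaway kernel, only needed so there is a pipeline to query
function dummy(a)
    i = thread_position_in_grid_1d()
    @inbounds a[i] += 1f0
    return
end

a = MtlArray(zeros(Float32, 32))
kernel = @metal launch=false dummy(a)

# Per-pipeline upper bound on threads per threadgroup. On CUDA the analogous
# query shrinks with register pressure; whether Metal folds register usage
# into this number is unclear.
kernel.pipeline.maxTotalThreadsPerThreadgroup
```

Of course, if compiling the pipeline itself fails, as reported in this issue, there is nothing left to query.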