Add SimpleGMRES implementation #366
Conversation
AI-Maintainer Review for PR - Add SimpleGMRES implementation
Title and Description 👍
Scope of Changes 👍
Testing 👎
Documentation 👎
- `LinearSolveBlockDiagonalsExt`
- `LinearSolveNNlibExt`
- `SimpleGMRES`
- `SimpleGMRESCache`
- `init_cacheval`
- `_init_cacheval`
- `default_alias_A`
- `default_alias_b`
- `SciMLBase.solve!` (for `SimpleGMRESCache`)
- `SciMLBase.solve!` (for `SimpleGMRES`)
These entities should have docstrings added to describe their behavior, arguments, and return values.
Suggested Changes
- Please add docstrings to the functions, classes, or methods listed above. Each docstring should describe the behavior, arguments, and return values of the function, class, or method (an example sketch follows below).
- Please provide details about how you tested the changes. This could include the specific test cases you used, the platforms or environments you tested on, and any relevant test results or observations.
Reviewed with AI Maintainer
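For illustration, a docstring along the requested lines for `SimpleGMRES` might look like the following sketch. The keyword names come from their usage later in this thread; the defaults and wording here are assumptions, not the PR's actual API:

```julia
"""
    SimpleGMRES(; blocksize = nothing, restart = 20)

A simple GMRES implementation (illustrative docstring sketch only).

## Keyword Arguments

  - `blocksize`: if the operator is block diagonal with uniform square
    blocks of this size, the Krylov iteration is batched across the
    blocks. Results are incorrect if the operator is not truly block
    diagonal. (Assumed semantics based on this thread.)
  - `restart`: number of inner iterations before the method restarts.

## Returns

An algorithm struct to pass to `init`/`solve!` with a `LinearProblem`.
"""
```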
Codecov Report
```diff
@@            Coverage Diff             @@
##             main     #366      +/-   ##
===========================================
+ Coverage   41.83%   72.35%   +30.51%
===========================================
  Files          20       23        +3
  Lines        1446     1787      +341
===========================================
+ Hits          605     1293      +688
+ Misses        841      494      -347
```
... and 14 files with indirect coverage changes.
Simple Benchmark Script

```julia
using LinearAlgebra, LinearSolve, SciMLOperators
using BlockDiagonals, NNlib
using BenchmarkTools  # needed for @belapsed
using Plots

blocksizes_ = [1, 2, 4, 5, 10, 25, 50]
nblocks = 100 .÷ blocksizes_

simplegmres_timings = Vector{Float64}(undef, length(blocksizes_))
simplegmres_nobsize_timings = Vector{Float64}(undef, length(blocksizes_))
krylovjlgmres_timings = Vector{Float64}(undef, length(blocksizes_))

for (i, (bsize, N)) in enumerate(zip(blocksizes_, nblocks))
    A_ = Array(BlockDiagonal([randn(bsize, bsize) for _ in 1:N]))
    b_ = randn(size(A_, 1))
    lincache_ = init(LinearProblem(A_, b_), SimpleGMRES(; blocksize=bsize, restart = 20);
        maxiters = 1000)
    lincache_krylov_ = init(LinearProblem(A_, b_), KrylovJL_GMRES(; gmres_restart = 20);
        maxiters = 1000)
    lincache_simple_nobsize = init(LinearProblem(A_, b_), SimpleGMRES(; restart = 20);
        maxiters = 1000)

    simplegmres_timings[i] = @belapsed solve!($(deepcopy(lincache_)))
    krylovjlgmres_timings[i] = @belapsed solve!($(deepcopy(lincache_krylov_)))
    simplegmres_nobsize_timings[i] = @belapsed solve!($(deepcopy(lincache_simple_nobsize)))

    @info "Trial $i" bsize=bsize N=N SimpleGMRES=simplegmres_timings[i] KrylovJL_GMRES=krylovjlgmres_timings[i] SimpleGMRES_nobsize=simplegmres_nobsize_timings[i]
end

begin
    plt = plot(blocksizes_, simplegmres_timings, label = "SimpleGMRES", xlabel = "Block Size",
        ylabel = "Time (s)", title = "SimpleGMRES vs KrylovJL_GMRES", legend = :topleft,
        xscale = :log10, yscale = :log10)
    scatter!(plt, blocksizes_, simplegmres_timings; label = "")
    plot!(plt, blocksizes_, krylovjlgmres_timings, label = "KrylovJL_GMRES")
    scatter!(plt, blocksizes_, krylovjlgmres_timings; label = "")
    plot!(plt, blocksizes_, simplegmres_nobsize_timings, label = "SimpleGMRES (no blocksize)")
    scatter!(plt, blocksizes_, simplegmres_nobsize_timings; label = "")
    savefig(plt, "wip/simplegmres_vs_krylovjl_gmres.png")
end
```

Takeaways
Handling non-uniform blocks is a bit more finicky (not hard, just a bit irritating to handle). Unless we have a concrete use case for it, we can hold off and implement that in a later PR.
I will follow up with benchmarks in DEQs.jl and SciMLSensitivity.jl; my initial runs suggest massive improvements in backpropagation (simply because fewer iterations means fewer VJPs computed). Again, we can't default to the blocksize version: it will make code really fast for 99% of users, but for the remaining 1% (say, anyone using BatchNorm) it will give incorrect results. So I will add a doc entry in
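To make that caveat concrete, here is a minimal sketch (hypothetical matrices, not from this PR) of when the `blocksize` assumption holds and when it silently breaks:

```julia
using LinearAlgebra, BlockDiagonals

bsize, N = 4, 3
# Valid: the operator really is block diagonal with uniform square
# blocks, so SimpleGMRES(; blocksize = bsize) can solve each block's
# system independently (batched).
A_ok = Matrix(BlockDiagonal([randn(bsize, bsize) + 5I for _ in 1:N]))

# Invalid: any coupling across blocks, such as what a BatchNorm-style
# operation introduces across the batch, violates the assumption, and
# the blocksize path would return incorrect results without warning.
A_bad = copy(A_ok)
A_bad[1, bsize + 1] = 1.0  # a single cross-block entry is enough
```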
(I don't know why I can't seem to request reviews.) @ChrisRackauckas, this is ready for review.
(force-pushed from 91c0ea0 to f6155f6)
```julia
# Update the QR factorization of Hₖ₊₁.ₖ.
# Apply previous Givens reflections Ωᵢ.
# [cᵢ  sᵢ] [ r̄ᵢ.ₖ ] = [ rᵢ.ₖ ]
# [s̄ᵢ -cᵢ] [rᵢ₊₁.ₖ]   [r̄ᵢ₊₁.ₖ]
for i in 1:(inner_iter - 1)
    Rtmp = c[i] * R[nr + i] + s[i] * R[nr + i + 1]
    R[nr + i + 1] = conj(s[i]) * R[nr + i] - c[i] * R[nr + i + 1]
    R[nr + i] = Rtmp
end

# Compute and apply current Givens reflection Ωₖ.
# [cₖ  sₖ] [ r̄ₖ.ₖ ] = [rₖ.ₖ]
# [s̄ₖ -cₖ] [hₖ₊₁.ₖ]   [ 0 ]
(c[inner_iter], s[inner_iter], R[nr + inner_iter]) = _sym_givens(R[nr + inner_iter],
    Hbis)
```
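As a sanity check on the recurrence above, one pass of the loop body is just multiplication by the explicit 2×2 reflection from the comment. A standalone sketch with illustrative values (not PR code):

```julia
using LinearAlgebra

# Hypothetical stand-ins for c[i], s[i] and the two affected entries of R.
c1, s1 = 0.6, 0.8
r = [1.0, 2.0]

H1 = [c1 s1; conj(s1) -c1]   # the 2×2 Givens reflection
expected = H1 * r

Rtmp = c1 * r[1] + s1 * r[2]          # mirrors the loop body above
r[2] = conj(s1) * r[1] - c1 * r[2]
r[1] = Rtmp
@assert r ≈ expected
```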
Aren't there Base functions for this?
Seems like there is. I have opened an issue to track those #375
@avik-pal This is a 2×2 Givens reflection, not a 2×2 Givens rotation. They are not the same thing.
The advantage of reflections is that they are Hermitian, which makes it easier to perform products with Q'.
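For concreteness, here is a minimal real-valued sketch of the distinction (illustrative values, not code from this PR):

```julia
using LinearAlgebra

# Both transforms zero out b in [a; b], but only the reflection is its
# own adjoint, so Q' is applied by reusing the same 2×2 matrices.
a, b = 3.0, 4.0
ρ = hypot(a, b)      # 5.0
c, s = a / ρ, b / ρ

G = [c s; -s c]      # Givens rotation:   G * [a, b] ≈ [ρ, 0], but G' ≠ G
H = [c s;  s -c]     # Givens reflection: H * [a, b] ≈ [ρ, 0], and H' == H
```

(`LinearAlgebra.givens` in Base constructs the rotation form, which is presumably why it can't be dropped in here directly.)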
It's at least good enough to merge. I think some things could be simplified, but that can be done over time.
@ChrisRackauckas @avik-pal
That's fine, but it requires that the user do it themselves. Sometimes doing something for the user is simpler and removes restrictions.
@amontoison I was testing out whether we can use an efficient linear operator to get rid of this. Here's what I have:

```julia
julia> @benchmark CUDA.@sync solve(prob1, KrylovJL_GMRES())
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  462.166 ms … 480.612 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     476.150 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   473.402 ms ±   6.727 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▁ ▁▁ ▁ ▁ ▁ ▁ █▁
  █▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁█▁▁▁▁▁▁██ ▁
  462 ms          Histogram: frequency by time          481 ms <

 Memory estimate: 1.25 MiB, allocs estimate: 37978.

julia> @benchmark CUDA.@sync solve(prob2, KrylovJL_GMRES())
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  467.924 ms … 488.971 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     474.196 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   474.975 ms ±   6.809 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█ ▁ ▁ ▁▁ ▁ ▁ ▁ ▁
  ██▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁██▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  468 ms          Histogram: frequency by time          489 ms <

 Memory estimate: 1.09 MiB, allocs estimate: 36442.

julia> @benchmark CUDA.@sync solve(prob2, SimpleGMRES(; blocksize = size(A1.data, 1)))
BenchmarkTools.Trial: 27 samples with 1 evaluation.
 Range (min … max):  143.097 ms … 641.160 ms  ┊ GC (min … max): 0.00% … 8.66%
 Time  (median):     152.867 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   186.930 ms ± 125.551 ms  ┊ GC (mean ± σ):  2.32% ± 2.52%

  █▆
  ██▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃ ▁
  143 ms          Histogram: frequency by time          641 ms <

 Memory estimate: 10.75 MiB, allocs estimate: 289135.
```
@avik-pal I don't understand why the …
```julia
mul!(w, A, p)            # w ← ANvₖ
PlisI || ldiv!(q, Pl, w) # q ← MANvₖ
for i in 1:inner_iter
    sum!(R[nr + i]', __batch(V[i]) .* __batch(q))
```
@amontoison see this line. For the batched case (an N × B input size), the dot product needs to reduce only to 1 × B instead of reducing to a scalar.
I will share the profile in a couple of hours.
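As context for that comment, for inputs stored as N × B matrices the "dot products" in the Arnoldi process reduce over the state dimension only. A minimal CPU sketch (shapes and names are illustrative; `Vi` and `q` stand in for `__batch(V[i])` and `__batch(q)`):

```julia
using LinearAlgebra

N, B = 8, 4                  # state size N, batch size B (illustrative)
Vi = randn(Float32, N, B)    # stands in for __batch(V[i])
q  = randn(Float32, N, B)    # stands in for __batch(q)

# Batched "dot": reduce over the state dimension only, keeping one
# inner product per batch column, i.e. a 1×B row instead of a scalar.
r = sum(Vi .* q; dims = 1)

# An ordinary GMRES dot product would collapse the batch entirely:
r_scalar = dot(Vi, q)        # == sum(r) for real inputs
```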
```julia
julia> CUDA.@profile solve(prob2, SimpleGMRES(; blocksize=8))
Profiler ran for 5.86 ms, capturing 5269 events.
Host-side activity: calling CUDA APIs took 3.58 ms (61.09% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼─────────────────────────────┤
│ 25.14% │ 1.47 ms │ 469 │ 3.14 µs ± 26.0 ( 0.72 ‥ 535.73) │ cuMemAllocFromPoolAsync │
│ 16.65% │ 976.32 µs │ 417 │ 2.34 µs ± 0.96 ( 1.67 ‥ 14.07) │ cuLaunchKernel │
│ 5.96% │ 349.28 µs │ 2 │ 174.64 µs ± 240.57 ( 4.53 ‥ 344.75) │ cuMemcpyDtoDAsync │
│ 3.07% │ 179.77 µs │ 24 │ 7.49 µs ± 0.91 ( 6.2 ‥ 10.01) │ cuMemcpyDtoHAsync │
│ 1.32% │ 77.25 µs │ 2 │ 38.62 µs ± 43.5 ( 7.87 ‥ 69.38) │ cudaMemcpyAsync │
│ 0.56% │ 32.9 µs │ 11 │ 2.99 µs ± 1.13 ( 1.91 ‥ 5.01) │ cudaLaunchKernel │
│ 0.50% │ 29.09 µs │ 48 │ 605.98 ns ± 230.47 (238.42 ‥ 1192.09) │ cuStreamSynchronize │
│ 0.11% │ 6.68 µs │ 1 │ │ cuMemsetD32Async │
│ 0.07% │ 4.29 µs │ 1 │ │ cudaFuncGetAttributes │
│ 0.04% │ 2.62 µs │ 14 │ 187.33 ns ± 166.72 ( 0.0 ‥ 476.84) │ cudaGetLastError │
│ 0.04% │ 2.15 µs │ 1 │ │ cudaStreamSynchronize │
│ 0.02% │ 1.43 µs │ 5 │ 286.1 ns ± 261.17 ( 0.0 ‥ 715.26) │ cudaStreamGetCaptureInfo_v2 │
│ 0.02% │ 1.43 µs │ 1 │ │ cudaEventRecord │
│ 0.01% │ 476.84 ns │ 4 │ 119.21 ns ± 137.65 ( 0.0 ‥ 238.42) │ cuDeviceGet │
└──────────┴────────────┴───────┴───────────────────────────────────────┴─────────────────────────────┘
Device-side activity: GPU was busy for 679.49 µs (11.59% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 1.38% │ 80.82 µs │ 36 │ 2.25 µs ± 0.16 ( 1.91 ‥ 2.38) │ _Z22partial_mapreduce_grid8identity7add_sumv16CartesianIndicesILi2E5TupleI5OneToI5Int64ES3_IS4_EEES1_ILi2ES2_IS3_IS4_ES3_IS4_EEE3ValILitrueEE13ReshapedArrayI7Float32Li3E7AdjointIS7_13CuDeviceArrayI ⋯
│ 1.10% │ 64.37 µs │ 52 │ 1.24 µs ± 0.15 ( 0.95 ‥ 1.43) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE1_S5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_EES9_IS0 ⋯
│ 0.98% │ 57.46 µs │ 37 │ 1.55 µs ± 0.12 ( 1.43 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li2ELi1EE11BroadcastedI12CuArrayStyleILi2E12DeviceBufferE5TupleI5OneToI5Int64ES6_IS7_EE1_S5_I8ExtrudedIS0_IS1_Li2ELi1EES5_I4BoolS10_ES5_ ⋯
│ 0.94% │ 54.84 µs │ 36 │ 1.52 µs ± 0.13 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li2ELi1EE11BroadcastedI12CuArrayStyleILi2E12DeviceBufferE5TupleI5OneToI5Int64ES6_IS7_EE1_S5_I8ExtrudedI7AdjointIS1_S0_IS1_Li1ELi1EEES5_I ⋯
│ 0.92% │ 54.12 µs │ 45 │ 1.2 µs ± 0.14 ( 0.95 ‥ 1.43) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE8identityS5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_E ⋯
│ 0.70% │ 41.25 µs │ 28 │ 1.47 µs ± 0.15 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE1_S5_IS2_IS3_ILi1ES4_EvS8_S5_IS2_IS3_ILi1ES4_Ev4conjS5_I8 ⋯
│ 0.69% │ 40.53 µs │ 28 │ 1.45 µs ± 0.14 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE1_S5_IS2_IS3_ILi1ES4_EvS8_S5_I8ExtrudedIS0_IS1_Li1ELi1EES ⋯
│ 0.67% │ 39.1 µs │ 28 │ 1.4 µs ± 0.18 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE1_S5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_EES2_IS3 ⋯
│ 0.64% │ 37.43 µs │ 36 │ 1.04 µs ± 0.12 ( 0.95 ‥ 1.19) │ _Z2_615CuKernelContext7AdjointI7Float3213CuDeviceArrayIS1_Li1ELi1EEES1_ ⋯
│ 0.48% │ 28.13 µs │ 25 │ 1.13 µs ± 0.19 ( 0.95 ‥ 1.67) │ [copy device to pageable memory] ⋯
│ 0.38% │ 22.41 µs │ 15 │ 1.49 µs ± 0.17 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li2ELi1EE11BroadcastedI12CuArrayStyleILi2E12DeviceBufferE5TupleI5OneToI5Int64ES6_IS7_EE1_S5_I8ExtrudedIS0_IS1_Li2ELi1EES5_I4BoolS10_ES5_ ⋯
│ 0.36% │ 20.98 µs │ 9 │ 2.33 µs ± 0.11 ( 2.15 ‥ 2.38) │ _Z22partial_mapreduce_grid8identity7add_sum7Float3216CartesianIndicesILi2E5TupleI5OneToI5Int64ES4_IS5_EEES2_ILi2ES3_IS4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li3ELi1EE11BroadcastedI12CuArr ⋯
│ 0.31% │ 18.12 µs │ 8 │ 2.26 µs ± 0.18 ( 2.15 ‥ 2.62) │ _ZN8internal5gemvx6kernelIiiffffLb0ELb1ELb0ELb0ELi7ELb0E18cublasGemvParamsExIi30cublasGemvTensorStridedBatchedIKfES5_S3_IfEfEEENSt9enable_ifIXntT5_EvE4typeET11_ ⋯
│ 0.28% │ 16.69 µs │ 8 │ 2.09 µs ± 0.11 ( 1.91 ‥ 2.15) │ _Z22partial_mapreduce_grid8identity1_4Bool16CartesianIndicesILi1E5TupleI5OneToI5Int64EEES2_ILi1ES3_IS4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li2ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBuffe ⋯
│ 0.27% │ 15.74 µs │ 8 │ 1.97 µs ± 0.21 ( 1.67 ‥ 2.15) │ _Z22partial_mapreduce_grid8identity3max7Float3216CartesianIndicesILi1E5TupleI5OneToI5Int64EEES2_ILi1ES3_IS4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li2ELi1EE11BroadcastedI12CuArrayStyleILi1E12Device ⋯
│ 0.27% │ 15.74 µs │ 8 │ 1.97 µs ± 0.17 ( 1.67 ‥ 2.15) │ _Z22partial_mapreduce_grid8identity3max7Float3216CartesianIndicesILi1E5TupleI5OneToI5Int64EEES2_ILi1ES3_IS4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li2ELi1EE11BroadcastedI12CuArrayStyleILi1E12Device ⋯
│ 0.22% │ 12.87 µs │ 9 │ 1.43 µs ± 0.21 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li2ELi1EE11BroadcastedI12CuArrayStyleILi2E12DeviceBufferE5TupleI5OneToI5Int64ES6_IS7_EE4sqrtS5_I8ExtrudedIS0_IS1_Li2ELi1EES5_I4BoolS10_E ⋯
│ 0.21% │ 12.16 µs │ 8 │ 1.52 µs ± 0.18 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE6ifelseS5_IS2_IS3_ILi1ES4_Ev2__S5_IS2_IS3_ILi1ES4_Ev3absS ⋯
│ 0.20% │ 11.92 µs │ 8 │ 1.49 µs ± 0.21 ( 1.19 ‥ 1.91) │ _Z29gpu___fast_sym_givens_kernel_16CompilerMetadataI11DynamicSize12DynamicCheckv16CartesianIndicesILi1E5TupleI5OneToI5Int64EEE7NDRangeILi1ES0_S0_S2_ILi1ES3_IS4_IS5_EEES2_ILi1ES3_IS4_IS5_EEEEE13CuDe ⋯
│ 0.18% │ 10.73 µs │ 8 │ 1.34 µs ± 0.12 ( 1.19 ‥ 1.43) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE1_S5_IS2_IS3_ILi1ES4_Ev4conjS5_I8ExtrudedIS0_IS1_Li1ELi1E ⋯
│ 0.18% │ 10.73 µs │ 8 │ 1.34 µs ± 0.18 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI4BoolLi1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE2__S5_IS2_IS3_ILi1ES4_Ev3absS5_I8ExtrudedIS0_I7Float32Li1ELi ⋯
│ 0.10% │ 5.72 µs │ 2 │ 2.86 µs ± 0.0 ( 2.86 ‥ 2.86) │ _Z11nrm2_kernelIfffLi1ELi0ELi128EEv16cublasNrm2ParamsIiT_T1_E ⋯
│ 0.06% │ 3.58 µs │ 2 │ 1.79 µs ± 0.84 ( 1.19 ‥ 2.38) │ [copy device to device memory] ⋯
│ 0.03% │ 1.67 µs │ 2 │ 834.47 ns ± 168.59 (715.26 ‥ 953.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE8identityS5_IS1_EES7_ ⋯
│ 0.02% │ 1.43 µs │ 1 │ │ _Z15axpy_kernel_valIffEv19cublasAxpyParamsValIT_S1_T0_E ⋯
│ 0.01% │ 476.84 ns │ 1 │ │ [set device memory] ⋯
│ 0.01% │ 476.84 ns │ 1 │ │ [copy pageable to device memory] ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 column omitted
```

```julia
julia> CUDA.@profile solve(prob2, KrylovJL_GMRES())
Profiler ran for 672.18 ms, capturing 767034 events.
Host-side activity: calling CUDA APIs took 573.04 ms (85.25% of the trace)
┌──────────┬────────────┬────────┬──────────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼────────┼──────────────────────────────────────────────┼─────────────────────────────┤
│ 32.08% │ 215.65 ms │ 33412 │ 6.45 µs ± 1.86 ( 1.91 ‥ 108.48) │ cudaMemcpyAsync │
│ 30.94% │ 207.96 ms │ 99716 │ 2.09 µs ± 10.1 ( 1.43 ‥ 2249.96) │ cudaLaunchKernel │
│ 7.66% │ 51.47 ms │ 33154 │ 1.55 µs ± 0.94 ( 1.19 ‥ 113.49) │ cudaFuncGetAttributes │
│ 4.98% │ 33.45 ms │ 165770 │ 201.78 ns ± 164.18 ( 0.0 ‥ 12397.77) │ cudaStreamGetCaptureInfo_v2 │
│ 3.49% │ 23.49 ms │ 33154 │ 708.47 ns ± 11711.15 (476.84 ‥ 2.13193893e6) │ cudaStreamSynchronize │
│ 2.27% │ 15.25 ms │ 166280 │ 91.73 ns ± 5416.7 ( 0.0 ‥ 2.20823288e6) │ cudaGetLastError │
│ 1.78% │ 11.93 ms │ 33154 │ 359.97 ns ± 693.56 (238.42 ‥ 117301.94) │ cudaEventRecord │
│ 1.01% │ 6.81 ms │ 1 │ │ cuMemsetD32Async │
│ 0.16% │ 1.07 ms │ 262 │ 4.07 µs ± 2.01 ( 0.72 ‥ 16.21) │ cuMemAllocFromPoolAsync │
│ 0.12% │ 833.75 µs │ 278 │ 3.0 µs ± 1.41 ( 1.43 ‥ 16.69) │ cuLaunchKernel │
└──────────┴────────────┴────────┴──────────────────────────────────────────────┴─────────────────────────────┘
Device-side activity: GPU was busy for 163.36 ms (24.30% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 7.12% │ 47.83 ms │ 32896 │ 1.45 µs ± 0.17 ( 1.19 ‥ 1.91) │ _Z10dot_kernelIfLi128ELi0E15cublasDotParamsI16cublasGemvTensorIKfE30cublasGemvTensorStridedBatchedIfEEEvT2_ ⋯
│ 6.64% │ 44.64 ms │ 32896 │ 1.36 µs ± 0.16 ( 1.19 ‥ 1.67) │ _Z20reduce_1Block_kernelIfLi128ELi7E30cublasGemvTensorStridedBatchedIfES1_S1_EvPKT_S2_T2_iS4_S2_T3_T4_19cublasPointerMode_t18cublasLtEpilogue_tS0_IKN8biasTypeINS7_10value_typeES2_E4typeEE ⋯
│ 5.61% │ 37.68 ms │ 33152 │ 1.14 µs ± 0.17 ( 0.95 ‥ 1.43) │ _Z15axpy_kernel_valIffEv19cublasAxpyParamsValIT_S1_T0_E ⋯
│ 4.57% │ 30.74 ms │ 33154 │ 927.25 ns ± 353.61 (715.26 ‥ 12397.77) │ [copy device to pageable memory] ⋯
│ 0.21% │ 1.39 ms │ 516 │ 2.7 µs ± 0.24 ( 2.15 ‥ 3.1) │ _Z11nrm2_kernelIfffLi1ELi0ELi128EEv16cublasNrm2ParamsIiT_T1_E ⋯
│ 0.10% │ 641.11 µs │ 256 │ 2.5 µs ± 0.17 ( 2.15 ‥ 2.86) │ _ZN8internal5gemvx6kernelIiiffffLb0ELb1ELb0ELb0ELi7ELb0E18cublasGemvParamsExIi30cublasGemvTensorStridedBatchedIKfES5_S3_IfEfEEENSt9enable_ifIXntT5_EvE4typeET11_ ⋯
│ 0.04% │ 302.08 µs │ 256 │ 1.18 µs ± 0.16 ( 0.95 ‥ 1.43) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE1_S5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_EES1_E ⋯
│ 0.02% │ 103.95 µs │ 258 │ 402.91 ns ± 165.07 (238.42 ‥ 953.67) │ [copy pageable to device memory] ⋯
│ 0.00% │ 21.46 µs │ 21 │ 1.02 µs ± 0.11 ( 0.95 ‥ 1.19) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE8identityS5_IS1_EES7_ ⋯
│ 0.00% │ 1.43 µs │ 1 │ │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceBufferE5TupleI5OneToI5Int64EE8identityS5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7 ⋯
│ 0.00% │ 715.26 ns │ 1 │ │ [set device memory] ⋯
└──────────┴────────────┴───────┴────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 column omitted
```
For a sanity check, both algorithms do converge:

```julia
julia> solve(prob1, KrylovJL_GMRES()).resid
0.00023247192f0
julia> solve(prob2, KrylovJL_GMRES()).resid
0.00023807214f0
julia> solve(prob2, SimpleGMRES(; blocksize=8)).resid
1.1054129f-5
```
Non-Uniform Square Blocks