CUDA Support for ALS #37
base: master
Conversation
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##            master      #37      +/-   ##
===========================================
- Coverage   100.00%   90.55%    -9.45%
===========================================
  Files           12        8        -4
  Lines          258      233       -25
===========================================
- Hits           258      211       -47
- Misses           0       22       +22
===========================================
```

☔ View full report in Codecov by Sentry.
…khatri rao function from dahong67#34
```julia
# Random initialization
M0 = CPD(ones(T, r), rand.(T, size(X), r))
#M0norm = sqrt(mapreduce(abs2, +, M0[I] for I in CartesianIndices(size(M0))))
M0norm = sqrt(sum(abs2, M0[I] for I in CartesianIndices(size(M0))))
```
Added CUDA as an extension, where the extension has a gcp definition for CuArray input. Right now M0 is created and normalized on the CPU, then moved to the GPU for ALS (and moved back to the CPU to be returned at the end). Need to figure out how to rewrite line 24 without scalar indexing so M0 can be created directly as a CuArray; one possible approach is sketched below.

Some simple benchmarking of gcp calls shows a good speed-up for larger tensors:
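One hedged possibility (a sketch, not part of this PR): since M0 is a CPD, its Frobenius norm can be computed from the weights and factor matrices alone, with no scalar indexing, so M0 could be built directly on the device. This assumes the CPD fields are `λ` and `U` as in GCPDecompositions, and that the constructor accepts CuArray-backed weights and factors:

```julia
# Sketch: for a CPD with weights λ and factors U₁,…,U_N,
#   ‖M‖² = λ' * (U₁'U₁ .* U₂'U₂ .* … .* U_N'U_N) * λ,
# which uses only matrix-level operations that run on CuArrays.
using CUDA, LinearAlgebra

M0 = CPD(CUDA.ones(T, r), CUDA.rand.(T, size(X), r))  # assumes CPD accepts CuArrays
V0 = reduce(.*, Uk'Uk for Uk in M0.U)                 # Hadamard product of Gram matrices
M0norm = sqrt(M0.λ' * V0 * M0.λ)                      # scalar, computed on-device
```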
@alexmul1114 Looking great! Before I forget, since the code allows […], for simplicity let's just add […].
For some reason the CPU gcp calls on the Caviness node are much slower than the same calls on my laptop, and benchmarking gcp on the CPU for the larger tensors on Caviness is very slow. Using 32G of memory instead of 16 (the laptop has 32) reduces the difference somewhat, but it is still significant, so these comparisons are against my laptop CPU. Compared to my laptop, the GPU (T4) is slower for the smaller tensors but scales better and is faster for the larger ones. (The shape of the benchmark calls is sketched after the timings.)

Here are the GPU times:

And the CPU times on my laptop:
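The comparison above has roughly this shape (a sketch; the tensor size, rank, and the two-argument `gcp(X, r)` entry point are illustrative assumptions, not the PR's exact benchmark set):

```julia
using BenchmarkTools, CUDA, GCPDecompositions

# Illustrative size and rank (assumptions, not recorded in this thread)
X = randn(Float32, 200, 200, 200)
r = 10
X_gpu = CuArray(X)

@btime gcp($X, $r)                    # CPU path
@btime CUDA.@sync gcp($X_gpu, $r)     # GPU path added by the CUDA extension
```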
The gcp calls on the CPU and GPU are not returning quite the same thing. Stepping through gcp, it looks like the first place where they differ slightly is when computing V on line 167 in gcp-opt.jl (or line 37 in CUDAExt.jl):
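A minimal way to see the discrepancy (a sketch; the sizes, rank, and the `U_cpu`/`U_gpu` names are illustrative, not from the PR):

```julia
using CUDA, LinearAlgebra

N, n = 3, 1
U_cpu = [rand(Float32, 50, 10) for _ in 1:N]
U_gpu = CuArray.(U_cpu)

# V as computed in the ALS loop, once on each device
V_cpu = reduce(.*, U_cpu[i]'U_cpu[i] for i in setdiff(1:N, n))
V_gpu = reduce(.*, U_gpu[i]'U_gpu[i] for i in setdiff(1:N, n))

maximum(abs.(V_cpu .- Array(V_gpu)))   # small but nonzero at Float32
```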
It looks like the difference in the calculation of V comes from the `U[i]'U[i]` Gram computations and not the element-wise multiply. This is not too surprising: BLAS and CUBLAS can accumulate the same sums in different orders, and floating-point addition is not associative. Using the same set-up as above:

Also, when using Float64 instead of Float32, the final results are much closer:
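That matches a quick check in double precision (a sketch, reusing the illustrative names from the sketch above):

```julia
# Repeat the comparison in Float64; the CPU/GPU discrepancy in V
# typically drops by several orders of magnitude.
U64_cpu = [Float64.(u) for u in U_cpu]
U64_gpu = CuArray.(U64_cpu)

V64_cpu = reduce(.*, U64_cpu[i]'U64_cpu[i] for i in setdiff(1:N, n))
V64_gpu = reduce(.*, U64_gpu[i]'U64_gpu[i] for i in setdiff(1:N, n))

maximum(abs.(V64_cpu .- Array(V64_gpu)))
```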
ext/CUDAExt.jl (outdated)
```julia
for n in 1:N
    V = reduce(.*, U[i]'U[i] for i in setdiff(1:N, n))
    U[n] = GCPDecompositions.mttkrp(X, U, n) / V
    λ = CuArray(CUDA.norm.(eachcol(U[n])))
```
Looks like line 39 here is much slower because it has to move the data between devices:
```
julia> @btime norm.(eachcol(U_cpu[n]))
  1.408 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  440.045 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997
```
Rewriting `norm.(eachcol(U[n]))` as `vec(sqrt.(sum(abs2, U_gpu[n]; dims=1)))` prevents data from being transferred back to the CPU and gets a 5x speedup for the GPU version, with a similar time for the CPU version:
```
julia> @btime norm.(eachcol(U_cpu[n]))
  1.364 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime vec(sqrt.(sum(abs2, U_cpu[n]; dims=1)))
  1.212 μs (6 allocations: 304 bytes)
10-element Vector{Float32}:
 1.9417411
 2.0177264
 1.904886
 1.6364719
 1.408496
 0.9433015
 1.0931743
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  368.776 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997

julia> @btime CUDA.@sync vec(sqrt.(sum(abs2, U_gpu[n]; dims=1)))
  67.685 μs (112 allocations: 5.52 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.9432999
 1.093176
 1.6607368
 1.2730248
 1.8184998
```
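In the loop from CUDAExt.jl above, that change would look like this (a sketch of the excerpt with only the λ line replaced):

```julia
for n in 1:N
    V = reduce(.*, U[i]'U[i] for i in setdiff(1:N, n))
    U[n] = GCPDecompositions.mttkrp(X, U, n) / V
    λ = vec(sqrt.(sum(abs2, U[n]; dims=1)))   # column norms, computed entirely on-device
end
```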
Testing on a Caviness node with a T4 GPU and a CPU with 8 cores of 8GB memory each:
Fixes #36 (with a simple MTTKRP implementation for now).
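For reference, a "simple MTTKRP" of the kind mentioned here can be sketched as the mode-n unfolding of X times a Khatri-Rao product of the remaining factor matrices. This is only an illustrative sketch: `mttkrp_simple` and `khatrirao` are hypothetical names, the helper is assumed to match the column-wise khatri rao function referenced from #34, and the PR's actual implementation may differ.

```julia
using LinearAlgebra

# Column-wise Khatri-Rao product (hypothetical helper): column c is the
# Kronecker product A[:, c] ⊗ B[:, c], so B's row index varies fastest.
khatrirao(A, B) = reshape(reshape(A, 1, size(A, 1), :) .* reshape(B, size(B, 1), 1, :),
                          size(A, 1) * size(B, 1), :)

# Simple MTTKRP: mode-n unfolding of X times the Khatri-Rao product of all
# factor matrices except U[n], giving a size(X, n) × r result.
function mttkrp_simple(X::AbstractArray{T,N}, U, n) where {T,N}
    modes = setdiff(1:N, n)
    kr = reduce(khatrirao, (U[i] for i in reverse(modes)))   # U[N] ⊙ … ⊙ U[1], skipping n
    Xn = reshape(permutedims(X, [n; modes]), size(X, n), :)  # mode-n unfolding
    return Xn * kr
end
```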