CUDA Support for ALS #37

Open · wants to merge 24 commits into master

Conversation

alexmul1114
Contributor

Fixes #36 (with a simple MTTKRP implementation for now).


codecov bot commented Jan 8, 2024

Codecov Report

Attention: Patch coverage is 52.17391%, with 22 lines in your changes missing coverage. Please review.

Project coverage is 90.55%. Comparing base (bb701f2) to head (f4696c0).

❗ The current head f4696c0 differs from the pull request's most recent head d7f68ed. Consider uploading reports for commit d7f68ed to get more accurate results.

Files Patch % Lines
ext/CUDAExt.jl 0.00% 22 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##            master      #37      +/-   ##
===========================================
- Coverage   100.00%   90.55%   -9.45%     
===========================================
  Files           12        8       -4     
  Lines          258      233      -25     
===========================================
- Hits           258      211      -47     
- Misses           0       22      +22     


# Random initialization
M0 = CPD(ones(T, r), rand.(T, size(X), r))
#M0norm = sqrt(mapreduce(abs2, +, M0[I] for I in CartesianIndices(size(M0))))
M0norm = sqrt(sum(abs2, M0[I] for I in CartesianIndices(size(M0))))
Contributor Author

Added CUDA as an extension, where the extension has a gcp definition for CuArray input. Right now M0 is created and normalized on the CPU, then moved to the GPU for ALS (and moved back to the CPU to be returned at the end). Still need to figure out how to rewrite line 24 without scalar indexing so that M0 can be created directly as a CuArray.
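
One possible way around the scalar indexing (just a sketch, not tested here): the norm of a CPD can be computed directly from the factors via the identity ‖M‖² = λ' (U[1]'U[1] .* … .* U[N]'U[N]) λ, which only uses matrix products and broadcasts and so should also work when the factors are CuArrays. Assuming the fields are M.λ and M.U as in GCPDecompositions.CPD:

using LinearAlgebra

# Sketch: Frobenius norm of a CPD M = (λ; U[1], ..., U[N]) without scalar
# indexing, so the factor matrices can live on the GPU.
function cpd_norm(M)
    G = reduce(.*, U'U for U in M.U)  # Hadamard product of the Gram matrices U[k]'U[k]
    return sqrt(dot(M.λ, G * M.λ))    # λ' G λ
end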

@alexmul1114
Contributor Author

Some simple benchmarking with the gcp call shows a good speed-up for larger tensors:

julia> using CUDA

julia> using BenchmarkTools

julia> using GCPDecompositions

julia> X_gpu_small = CUDA.rand(40,40,40);

julia> X_cpu_small = Array(X_gpu_small);

julia> X_gpu_med = CUDA.rand(80,80,80);

julia> X_cpu_med = Array(X_gpu_med);

julia> X_gpu_large = CUDA.rand(120,120,120);

julia> X_cpu_large = Array(X_gpu_large);

julia> @btime gcp(X_cpu_small, 10);
  2.629 s (429623 allocations: 607.02 MiB)

julia> @btime CUDA.@sync gcp(X_gpu_small, 10);
  5.924 s (816637 allocations: 29.37 MiB)

julia> @btime gcp(X_cpu_med, 10);
  20.161 s (3117623 allocations: 4.09 GiB)

julia> @btime CUDA.@sync gcp(X_gpu_med, 10);
  5.285 s (3505787 allocations: 90.92 MiB)

julia> @btime gcp(X_cpu_large, 10);
  55.513 s (10413623 allocations: 13.13 GiB)

julia> @btime CUDA.@sync gcp(X_gpu_large, 10);
  6.846 s (10803600 allocations: 257.98 MiB)


@dahong67
Owner

@alexmul1114 Looking great! Before I forget, since the code allows AbstractArray inputs now, we should make sure we handle custom indices properly. Read this: https://docs.julialang.org/en/v1/devdocs/offset-arrays/#man-custom-indices

For simplicity, let's just add Base.require_one_based_indexing(arrays...) for now. We can enable support for other indices in a future PR.
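
For concreteness, a minimal sketch of that guard (the wrapper name and signature here are illustrative, not the package's actual method):

using GCPDecompositions

# Reject arrays with custom axes (e.g. OffsetArrays) up front, so the rest of
# the code can keep assuming 1-based indexing.
function checked_gcp(X::AbstractArray, r::Integer)
    Base.require_one_based_indexing(X)
    return gcp(X, r)
end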

@alexmul1114
Contributor Author

For some reason the CPU gcp calls on the Caviness node are much slower than the same calls on my laptop, so benchmarking gcp on the CPU for the larger tensors on Caviness is very slow. Requesting 32 GB of memory instead of 16 (my laptop has 32 GB) narrows the gap a bit, but it is still significant, so these comparisons are against my laptop CPU. Compared to my laptop, the GPU (a T4) is slower for the smaller tensors but scales better and is faster for the larger ones. Here are the GPU times:

julia> X_gpu_xlarge = CUDA.rand(180,180,180);

julia> @btime CUDA.@sync gcp(X_gpu_xlarge, 10);
  7.933 s (35696825 allocations: 832.98 MiB)

julia> X_gpu_2xlarge = CUDA.rand(240,240,240);

julia> @btime CUDA.@sync gcp(X_gpu_2xlarge, 10);
  8.871 s (83643936 allocations: 1.89 GiB)

julia> X_gpu_3xlarge = CUDA.rand(300,300,300);

julia> @btime CUDA.@sync gcp(X_gpu_3xlarge, 10);
  10.782 s (162679584 allocations: 3.65 GiB)

julia> X_gpu_4xlarge = CUDA.rand(400,400,400);

julia> @btime CUDA.@sync gcp(X_gpu_4xlarge, 10);
  12.952 s (384673404 allocations: 8.61 GiB)

julia> X_gpu_5xlarge = CUDA.rand(600,600,600);

julia> @btime CUDA.@sync gcp(X_gpu_5xlarge, 10);
  27.484 s (1296697109 allocations: 29.00 GiB)

julia> X_gpu_6xlarge = CUDA.rand(800,800,800);

julia> @btime CUDA.@sync gcp(X_gpu_6xlarge, 10);
  70.759 s (3072720821 allocations: 68.70 GiB)

And the CPU times on my laptop:

julia> X_cpu_xlarge = rand(180,180,180);

julia> @btime gcp(X_cpu_xlarge, 10);
  2.590 s (23370422 allocations: 2.59 GiB)

julia> X_cpu_2xlarge = rand(240,240,240);

julia> @btime gcp(X_cpu_2xlarge, 10);
  6.092 s (55341425 allocations: 4.94 GiB)

julia> X_cpu_3xlarge = rand(300,300,300);

julia> @btime gcp(X_cpu_3xlarge, 10);
  11.053 s (108045425 allocations: 8.26 GiB)

julia> X_cpu_4xlarge = rand(400,400,400);

julia> @btime gcp(X_cpu_4xlarge, 10);
  24.853 s (256045425 allocations: 16.32 GiB)

julia> X_cpu_5xlarge = rand(600,600,600);

julia> @btime gcp(X_cpu_5xlarge, 10);
  75.419 s (864048825 allocations: 44.14 GiB)

julia> X_cpu_6xlarge = rand(800,800,800);

julia> @btime gcp(X_cpu_6xlarge, 10);
  179.703 s (2048048825 allocations: 91.76 GiB)

@alexmul1114
Contributor Author

The gcp calls on the CPU and GPU are not returning quite the same thing. Stepping through gcp, it looks like the first place they differ slightly is when computing V on line 167 of gcp-opt.jl (line 37 of CUDAExt.jl):

julia> X_gpu = CUDA.rand(20,30,40);

julia> X_cpu = Array(X_gpu);

julia> Random.seed!(0);

julia> Mh_cpu = gcp(X_cpu, 10);

julia> typeof(Mh_cpu)
CPD{Float32, 3, Vector{Float32}, Matrix{Float32}}

julia> Random.seed!(0);

julia> Mh_gpu = gcp(X_gpu, 10);

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
0.0041753054f0

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu)) <= 1e-5
false

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
0.0041753054f0

julia> mean(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
0.0005158942f0

julia> T = Float32
Float32

julia> r = 10
10

julia> Random.seed!(0);

julia> M0_cpu = CPD(ones(T, r), rand.(T, size(X_cpu), r));

julia> Random.seed!(0);

julia> M0_gpu = CPD(ones(T, r), rand.(T, size(X_gpu), r));

julia> maximum(I -> abs(M0_cpu[I] - M0_gpu[I]), CartesianIndices(X_cpu))
0.0f0

julia> typeof(M0_cpu)
CPD{Float32, 3, Vector{Float32}, Matrix{Float32}}

julia> typeof(M0_gpu)
CPD{Float32, 3, Vector{Float32}, Matrix{Float32}}

julia> M0_cpu_norm = sqrt(sum(abs2, M0_cpu[I] for I in CartesianIndices(size(M0_cpu))))
219.85454f0

julia> M0_gpu_norm = sqrt(sum(abs2, M0_gpu[I] for I in CartesianIndices(size(M0_gpu))))
219.85454f0

julia> M0_cpu_norm == M0_gpu_norm
true

julia> Xnorm_cpu = sqrt(sum(abs2, skipmissing(X_cpu)))
89.38414f0

julia> Xnorm_gpu = sqrt(mapreduce(x -> isnan(x) ? 0 : abs2(x), +, X_gpu, init=0f0))
89.38414f0

julia> Xnorm_cpu == Xnorm_gpu
true

julia> N = 3;

julia> for k in Base.OneTo(N)
               M0_cpu.U[k] .*= (Xnorm_cpu / M0_cpu_norm)^(1 / N)
       end

julia> for k in Base.OneTo(N)
               M0_gpu.U[k] .*= (Xnorm_gpu / M0_gpu_norm)^(1 / N)
       end

julia> λ_cpu, U_cpu = M0_cpu.λ, collect(M0_cpu.U);

julia> λ_gpu, U_gpu = M0_gpu.λ, collect(M0_gpu.U);

julia> λ_gpu == λ_cpu
true

julia> U_cpu[1] == U_gpu[1]
true

julia> U_cpu[2] == U_gpu[2]
true

julia> U_cpu[3] == U_gpu[3]
true

julia> λ_gpu = CuArray(λ_gpu);

julia> U_gpu = [CuArray(U_gpu[i]) for i in 1:N];

julia> maximum(I -> abs(U_gpu[1][I] - U_cpu[1][I]), CartesianIndices(U_cpu[1]))
0.0f0

julia> maximum(I -> abs(U_gpu[2][I] - U_cpu[2][I]), CartesianIndices(U_cpu[2]))
0.0f0

julia> maximum(I -> abs(U_gpu[3][I] - U_cpu[3][I]), CartesianIndices(U_cpu[3]))
0.0f0

julia> maximum(I -> abs(U_gpu[3][I] - U_cpu[2][I]), CartesianIndices(U_cpu[3]));

julia> V_cpu = reduce(.*, U_cpu[i]'U_cpu[i] for i in setdiff(1:N, n));

julia> V_gpu = reduce(.*, U_gpu[i]'U_gpu[i] for i in setdiff(1:N, n));

julia> maximum(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
1.5258789f-5

julia> mean(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
3.7574769f-6

julia> minimum(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
0.0f0

@alexmul1114
Contributor Author

It looks like the difference in the calculation of V comes from the U[i]'U[i] products, not the element-wise multiply; most likely the GPU and CPU BLAS routines accumulate the Float32 products in different orders and so round slightly differently. Using the same setup as above:

julia> V_cpu_intermediate = [U_cpu[i]'U_cpu[i] for i in setdiff(1:N, n)];

julia> V_gpu_intermediate = [U_gpu[i]'U_gpu[i] for i in setdiff(1:N, n)];

julia> maximum(I -> abs(V_gpu_intermediate[1][I] - V_cpu_intermediate[1][I]), CartesianIndices(V_cpu_intermediate[1]))
9.536743f-7

julia> maximum(I -> abs(V_gpu_intermediate[2][I] - V_cpu_intermediate[2][I]), CartesianIndices(V_cpu_intermediate[2]))
1.4305115f-6

julia> V_gpu_intermediate[1] = CuArray(V_cpu_intermediate[1]);

julia> V_gpu_intermediate[2] = CuArray(V_cpu_intermediate[2]);

julia> maximum(I -> abs(V_gpu_intermediate[2][I] - V_cpu_intermediate[2][I]), CartesianIndices(V_cpu_intermediate[2]))
0.0f0

julia> maximum(I -> abs(V_gpu_intermediate[1][I] - V_cpu_intermediate[1][I]), CartesianIndices(V_cpu_intermediate[1]))
0.0f0

julia> V_cpu = reduce(.*, V_cpu_intermediate);

julia> V_gpu = reduce(.*, V_gpu_intermediate);

julia> maximum(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
0.0f0

Also, when using Float64 instead of Float32, the final results are much closer:

julia> X_gpu = CUDA.rand(Float64, 20,30,40);

julia> typeof(X_gpu)
CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}

julia> X_cpu = Array(X_gpu);

julia> typeof(X_cpu)
Array{Float64, 3}

julia> Random.seed!(0);

julia> Mh_cpu = gcp(X_cpu, 10);

julia> Random.seed!(0);

julia> Mh_gpu = gcp(X_gpu, 10);

julia> typeof(Mh_cpu)
CPD{Float64, 3, Vector{Float64}, Matrix{Float64}}

julia> typeof(Mh_gpu)
CPD{Float64, 3, Vector{Float64}, Matrix{Float64}}

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
5.079520137840632e-12
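
(For reference, a sketch of how a test could compare the CPU and GPU results given the Float32 rounding differences above; the tolerance values are illustrative, not settled:)

using Test

# Compare entry-wise with a tolerance tied to the element type rather than
# exact equality, since the Float32 matrix products round differently on the GPU.
tol(T) = T === Float32 ? 1f-2 : 1e-10
@test maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu)) <= tol(eltype(X_cpu))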

ext/CUDAExt.jl Outdated
for n in 1:N
V = reduce(.*, U[i]'U[i] for i in setdiff(1:N, n))
U[n] = GCPDecompositions.mttkrp(X, U, n) / V
λ = CuArray(CUDA.norm.(eachcol(U[n])))
@alexmul1114 (Contributor Author) commented on Jan 18, 2024

Looks like line 39 here is much slower when data has to move between devices:

julia> @btime norm.(eachcol(U_cpu[n]))
  1.408 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  440.045 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997

Contributor Author

Rewriting norm.(eachcol(U[n])) as vec(sqrt.(sum(abs2, U_gpu[n]; dims=1))) avoids transferring data back to the CPU and gives roughly a 5x speedup for the GPU version, with a similar time on the CPU:

julia> @btime norm.(eachcol(U_cpu[n]))
  1.364 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime vec(sqrt.(sum(abs2, U_cpu[n]; dims=1)))
  1.212 μs (6 allocations: 304 bytes)
10-element Vector{Float32}:
 1.9417411
 2.0177264
 1.904886
 1.6364719
 1.408496
 0.9433015
 1.0931743
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  368.776 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997

julia> @btime CUDA.@sync vec(sqrt.(sum(abs2, U_gpu[n]; dims=1)))
  67.685 μs (112 allocations: 5.52 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.9432999
 1.093176
 1.6607368
 1.2730248
 1.8184998

@alexmul1114
Contributor Author

Testing on a Caviness node with a T4 GPU and a CPU allocation of 8 cores with 8 GB each:

julia> using LinearAlgebra

julia> LinearAlgebra.peakflops()
1.7854915177491415e11

julia> using CUDA

julia> using BenchmarkTools

julia> using GCPDecompositions

julia> X_gpu_small = CUDA.rand(40,40,40);

julia> X_cpu_small = Array(X_gpu_small);

julia> X_gpu_med = CUDA.rand(80,80,80);

julia> X_cpu_med = Array(X_gpu_med);

julia> X_gpu_large = CUDA.rand(120,120,120);

julia> X_cpu_large = Array(X_gpu_large);

julia> @btime gcp($X_cpu_small, 10);
  41.223 ms (424623 allocations: 66.52 MiB)

julia> @btime CUDA.@sync gcp($X_gpu_small, 10);
  508.272 ms (1048058 allocations: 39.40 MiB)

julia> @btime gcp($X_cpu_med, 10);
  440.675 ms (3114623 allocations: 279.40 MiB)

julia> @btime CUDA.@sync gcp($X_gpu_med, 10);
  548.730 ms (3757483 allocations: 101.68 MiB)

julia> @btime gcp($X_cpu_large, 10);
  1.704 s (10410623 allocations: 695.66 MiB)

julia> @btime CUDA.@sync gcp($X_gpu_large, 10);
  703.406 ms (11063857 allocations: 269.05 MiB)

julia> X_gpu_xlarge = CUDA.rand(180,180,180);

julia> X_cpu_xlarge = Array(X_gpu_xlarge);

julia> @btime gcp($X_cpu_xlarge, 10);
  6.928 s (35034623 allocations: 1.77 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_xlarge, 10);
  1.143 s (35694525 allocations: 832.90 MiB)

julia> X_gpu_2xlarge = CUDA.rand(240,240,240);

julia> X_cpu_2xlarge = Array(X_gpu_2xlarge);

julia> @btime gcp($X_cpu_2xlarge, 10);
  12.863 s (82986623 allocations: 3.60 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_2xlarge, 10);
  2.075 s (83641618 allocations: 1.89 GiB)

julia> X_gpu_3xlarge = CUDA.rand(300,300,300);

julia> X_cpu_3xlarge = Array(X_gpu_3xlarge);

julia> @btime gcp($X_cpu_3xlarge, 10);
  34.129 s (162042623 allocations: 6.34 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_3xlarge, 10);
  3.838 s (162679111 allocations: 3.65 GiB)

julia> X_gpu_4xlarge = CUDA.rand(400,400,400);

julia> X_cpu_4xlarge = Array(X_gpu_4xlarge);

julia> @btime gcp($X_cpu_4xlarge, 10);
  47.972 s (384042623 allocations: 13.41 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_4xlarge, 10);
  8.170 s (384673070 allocations: 8.61 GiB)

julia> X_gpu_5xlarge = CUDA.rand(600,600,600);

julia> X_cpu_5xlarge = Array(X_gpu_5xlarge);

julia> @btime gcp($X_cpu_5xlarge, 10);
  164.451 s (1296048626 allocations: 39.78 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_5xlarge, 10);
  25.027 s (1296697062 allocations: 29.00 GiB)

julia> X_gpu_6xlarge = CUDA.rand(800,800,800);

julia> X_cpu_6xlarge = Array(X_gpu_6xlarge);

julia> @btime gcp($X_cpu_6xlarge, 10);
  204.891 s (3072048626 allocations: 87.84 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_6xlarge, 10);
  57.469 s (3072719936 allocations: 68.70 GiB)

@dahong67 changed the title from "CUDA Support" to "CUDA Support for ALS" on Feb 28, 2024
Successfully merging this pull request may close these issues.

Support CUDA arrays