CUDA Support for ALS #37

Open · wants to merge 24 commits into master

Conversation

alexmul1114
Contributor

Fixes #36 (with a simple MTTKRP implementation for now).


codecov bot commented Jan 8, 2024

Codecov Report

Attention: Patch coverage is 52.17391%, with 22 lines in your changes missing coverage. Please review.

Project coverage is 90.55%. Comparing base (bb701f2) to head (f4696c0).

❗ The current head f4696c0 differs from the pull request's most recent head d7f68ed. Consider uploading reports for commit d7f68ed to get more accurate results.

Files Patch % Lines
ext/CUDAExt.jl 0.00% 22 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##            master      #37      +/-   ##
===========================================
- Coverage   100.00%   90.55%   -9.45%     
===========================================
  Files           12        8       -4     
  Lines          258      233      -25     
===========================================
- Hits           258      211      -47     
- Misses           0       22      +22     


# Random initialization
M0 = CPD(ones(T, r), rand.(T, size(X), r))
#M0norm = sqrt(mapreduce(abs2, +, M0[I] for I in CartesianIndices(size(M0))))
M0norm = sqrt(sum(abs2, M0[I] for I in CartesianIndices(size(M0))))
Contributor Author

Added CUDA as an extension, where the extension has a gcp definition for CuArray input. Right now M0 is created and normalized on the CPU, then moved to the GPU for ALS (and moved back to the CPU to be returned at the end). Still need to figure out how to rewrite line 24 without scalar indexing so that M0 can be created directly as a CuArray.
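
One possible way around the scalar indexing (just a sketch, not tested here): the norm of a CPD can be computed directly from the factors via the identity ‖M‖² = λ' (U[1]'U[1] .* … .* U[N]'U[N]) λ, which only uses matrix products and broadcasts and so should also work when the factors are CuArrays. Assuming the fields are M.λ and M.U as in GCPDecompositions.CPD:

using LinearAlgebra

# Sketch: Frobenius norm of a CPD M = (λ; U[1], ..., U[N]) without scalar
# indexing, so the factor matrices can live on the GPU.
function cpd_norm(M)
    G = reduce(.*, U'U for U in M.U)  # Hadamard product of the Gram matrices U[k]'U[k]
    return sqrt(dot(M.λ, G * M.λ))    # λ' G λ
end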

@alexmul1114
Contributor Author

Some simple benchmarking with the gcp call shows a good speed-up for larger tensors:

julia> using CUDA

julia> using BenchmarkTools

julia> using GCPDecompositions

julia> X_gpu_small = CUDA.rand(40,40,40);

julia> X_cpu_small = Array(X_gpu_small);

julia> X_gpu_med = CUDA.rand(80,80,80);

julia> X_cpu_med = Array(X_gpu_med);

julia> X_gpu_large = CUDA.rand(120,120,120);

julia> X_cpu_large = Array(X_gpu_large);

julia> @btime gcp(X_cpu_small, 10);
  2.629 s (429623 allocations: 607.02 MiB)

julia> @btime CUDA.@sync gcp(X_gpu_small, 10);
  5.924 s (816637 allocations: 29.37 MiB)

julia> @btime gcp(X_cpu_med, 10);
  20.161 s (3117623 allocations: 4.09 GiB)

julia> @btime CUDA.@sync gcp(X_gpu_med, 10);
  5.285 s (3505787 allocations: 90.92 MiB)

julia> @btime gcp(X_cpu_large, 10);
  55.513 s (10413623 allocations: 13.13 GiB)

julia> @btime CUDA.@sync gcp(X_gpu_large, 10);
  6.846 s (10803600 allocations: 257.98 MiB)


@dahong67
Owner

@alexmul1114 Looking great! Before I forget, since the code allows AbstractArray inputs now, we should make sure we handle custom indices properly. Read this: https://docs.julialang.org/en/v1/devdocs/offset-arrays/#man-custom-indices

For simplicity, let's just add Base.require_one_based_indexing(arrays...) for now. We can enable support for other indices in a future PR.
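
For concreteness, a minimal sketch of that guard (the wrapper name and signature here are illustrative, not the package's actual method):

using GCPDecompositions

# Reject arrays with custom axes (e.g. OffsetArrays) up front, so the rest of
# the code can keep assuming 1-based indexing.
function checked_gcp(X::AbstractArray, r::Integer)
    Base.require_one_based_indexing(X)
    return gcp(X, r)
end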

@alexmul1114
Contributor Author

For some reason the CPU gcp calls on the Caviness node are much slower than the same calls on my laptop, so benchmarking gcp on the CPU for the larger tensors on Caviness is very slow. Requesting 32 GB of memory instead of 16 (my laptop has 32 GB) narrows the gap a bit, but it is still significant, so these comparisons are against my laptop CPU. Compared to my laptop, the GPU (a T4) is slower for the smaller tensors but scales better and is faster for the larger ones. Here are the GPU times:

julia> X_gpu_xlarge = CUDA.rand(180,180,180);

julia> @btime CUDA.@sync gcp(X_gpu_xlarge, 10);
  7.933 s (35696825 allocations: 832.98 MiB)

julia> X_gpu_2xlarge = CUDA.rand(240,240,240);

julia> @btime CUDA.@sync gcp(X_gpu_2xlarge, 10);
  8.871 s (83643936 allocations: 1.89 GiB)

julia> X_gpu_3xlarge = CUDA.rand(300,300,300);

julia> @btime CUDA.@sync gcp(X_gpu_3xlarge, 10);
  10.782 s (162679584 allocations: 3.65 GiB)

julia> X_gpu_4xlarge = CUDA.rand(400,400,400);

julia> @btime CUDA.@sync gcp(X_gpu_4xlarge, 10);
  12.952 s (384673404 allocations: 8.61 GiB)

julia> X_gpu_5xlarge = CUDA.rand(600,600,600);

julia> @btime CUDA.@sync gcp(X_gpu_5xlarge, 10);
  27.484 s (1296697109 allocations: 29.00 GiB)

julia> X_gpu_6xlarge = CUDA.rand(800,800,800);

julia> @btime CUDA.@sync gcp(X_gpu_6xlarge, 10);
  70.759 s (3072720821 allocations: 68.70 GiB)

And the CPU times on my laptop:

julia> X_cpu_xlarge = rand(180,180,180);

julia> @btime gcp(X_cpu_xlarge, 10);
  2.590 s (23370422 allocations: 2.59 GiB)

julia> X_cpu_2xlarge = rand(240,240,240);

julia> @btime gcp(X_cpu_2xlarge, 10);
  6.092 s (55341425 allocations: 4.94 GiB)

julia> X_cpu_3xlarge = rand(300,300,300);

julia> @btime gcp(X_cpu_3xlarge, 10);
  11.053 s (108045425 allocations: 8.26 GiB)

julia> X_cpu_4xlarge = rand(400,400,400);

julia> @btime gcp(X_cpu_4xlarge, 10);
  24.853 s (256045425 allocations: 16.32 GiB)

julia> X_cpu_5xlarge = rand(600,600,600);

julia> @btime gcp(X_cpu_5xlarge, 10);
  75.419 s (864048825 allocations: 44.14 GiB)

julia> X_cpu_6xlarge = rand(800,800,800);

julia> @btime gcp(X_cpu_6xlarge, 10);
  179.703 s (2048048825 allocations: 91.76 GiB)

@alexmul1114
Contributor Author

The gcp calls on the CPU and GPU are not returning quite the same thing. Stepping through gcp, it looks like the first place they differ slightly is when computing V on line 167 of gcp-opt.jl (line 37 of CUDAExt.jl):

julia> X_gpu = CUDA.rand(20,30,40);

julia> X_cpu = Array(X_gpu);

julia> Random.seed!(0);

julia> Mh_cpu = gcp(X_cpu, 10);

julia> typeof(Mh_cpu)
CPD{Float32, 3, Vector{Float32}, Matrix{Float32}}

julia> Random.seed!(0);

julia> Mh_gpu = gcp(X_gpu, 10);

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
0.0041753054f0

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu)) <= 1e-5
false

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
0.0041753054f0

julia> mean(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
0.0005158942f0

julia> T = Float32
Float32

julia> r = 10
10

julia> Random.seed!(0);

julia> M0_cpu = CPD(ones(T, r), rand.(T, size(X_cpu), r));

julia> Random.seed!(0);

julia> M0_gpu = CPD(ones(T, r), rand.(T, size(X_gpu), r));

julia> maximum(I -> abs(M0_cpu[I] - M0_gpu[I]), CartesianIndices(X_cpu))
0.0f0

julia> typeof(M0_cpu)
CPD{Float32, 3, Vector{Float32}, Matrix{Float32}}

julia> typeof(M0_gpu)
CPD{Float32, 3, Vector{Float32}, Matrix{Float32}}

julia> M0_cpu_norm = sqrt(sum(abs2, M0_cpu[I] for I in CartesianIndices(size(M0_cpu))))
219.85454f0

julia> M0_gpu_norm = sqrt(sum(abs2, M0_gpu[I] for I in CartesianIndices(size(M0_gpu))))
219.85454f0

julia> M0_cpu_norm == M0_gpu_norm
true

julia> Xnorm_cpu = sqrt(sum(abs2, skipmissing(X_cpu)))
89.38414f0

julia> Xnorm_gpu = sqrt(mapreduce(x -> isnan(x) ? 0 : abs2(x), +, X_gpu, init=0f0))
89.38414f0

julia> Xnorm_cpu == Xnorm_gpu
true

julia> N = 3;

julia> for k in Base.OneTo(N)
               M0_cpu.U[k] .*= (Xnorm_cpu / M0_cpu_norm)^(1 / N)
       end

julia> for k in Base.OneTo(N)
               M0_gpu.U[k] .*= (Xnorm_gpu / M0_gpu_norm)^(1 / N)
       end

julia> λ_cpu, U_cpu = M0_cpu.λ, collect(M0_cpu.U);

julia> λ_gpu, U_gpu = M0_gpu.λ, collect(M0_gpu.U);

julia> λ_gpu == λ_cpu
true

julia> U_cpu[1] == U_gpu[1]
true

julia> U_cpu[2] == U_gpu[2]
true

julia> U_cpu[3] == U_gpu[3]
true

julia> λ_gpu = CuArray(λ_gpu);

julia> U_gpu = [CuArray(U_gpu[i]) for i in 1:N];

julia> maximum(I -> abs(U_gpu[1][I] - U_cpu[1][I]), CartesianIndices(U_cpu[1]))
0.0f0

julia> maximum(I -> abs(U_gpu[2][I] - U_cpu[2][I]), CartesianIndices(U_cpu[2]))
0.0f0

julia> maximum(I -> abs(U_gpu[3][I] - U_cpu[3][I]), CartesianIndices(U_cpu[3]))
0.0f0

julia> maximum(I -> abs(U_gpu[3][I] - U_cpu[2][I]), CartesianIndices(U_cpu[3]));

julia> V_cpu = reduce(.*, U_cpu[i]'U_cpu[i] for i in setdiff(1:N, n));

julia> V_gpu = reduce(.*, U_gpu[i]'U_gpu[i] for i in setdiff(1:N, n));

julia> maximum(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
1.5258789f-5

julia> mean(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
3.7574769f-6

julia> minimum(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
0.0f0

@alexmul1114
Contributor Author

It looks like the difference in the calculation of V comes from the U[i]'U[i] products, not the element-wise multiply; most likely the GPU and CPU BLAS routines accumulate the Float32 products in different orders and so round slightly differently. Using the same setup as above:

julia> V_cpu_intermediate = [U_cpu[i]'U_cpu[i] for i in setdiff(1:N, n)];

julia> V_gpu_intermediate = [U_gpu[i]'U_gpu[i] for i in setdiff(1:N, n)];

julia> maximum(I -> abs(V_gpu_intermediate[1][I] - V_cpu_intermediate[1][I]), CartesianIndices(V_cpu_intermediate[1]))
9.536743f-7

julia> maximum(I -> abs(V_gpu_intermediate[2][I] - V_cpu_intermediate[2][I]), CartesianIndices(V_cpu_intermediate[2]))
1.4305115f-6

julia> V_gpu_intermediate[1] = CuArray(V_cpu_intermediate[1]);

julia> V_gpu_intermediate[2] = CuArray(V_cpu_intermediate[2]);

julia> maximum(I -> abs(V_gpu_intermediate[2][I] - V_cpu_intermediate[2][I]), CartesianIndices(V_cpu_intermediate[2]))
0.0f0

julia> maximum(I -> abs(V_gpu_intermediate[1][I] - V_cpu_intermediate[1][I]), CartesianIndices(V_cpu_intermediate[1]))
0.0f0

julia> V_cpu = reduce(.*, V_cpu_intermediate);

julia> V_gpu = reduce(.*, V_gpu_intermediate);

julia> maximum(I -> abs(V_gpu[I] - V_cpu[I]), CartesianIndices(V_cpu))
0.0f0

Also, when using Float64 instead of Float32, the final results are much closer:

julia> X_gpu = CUDA.rand(Float64, 20,30,40);

julia> typeof(X_gpu)
CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}

julia> X_cpu = Array(X_gpu);

julia> typeof(X_cpu)
Array{Float64, 3}

julia> Random.seed!(0);

julia> Mh_cpu = gcp(X_cpu, 10);

julia> Random.seed!(0);

julia> Mh_gpu = gcp(X_gpu, 10);

julia> typeof(Mh_cpu)
CPD{Float64, 3, Vector{Float64}, Matrix{Float64}}

julia> typeof(Mh_gpu)
CPD{Float64, 3, Vector{Float64}, Matrix{Float64}}

julia> maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu))
5.079520137840632e-12
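
(For reference, a sketch of how a test could compare the CPU and GPU results given the Float32 rounding differences above; the tolerance values are illustrative, not settled:)

using Test

# Compare entry-wise with a tolerance tied to the element type rather than
# exact equality, since the Float32 matrix products round differently on the GPU.
tol(T) = T === Float32 ? 1f-2 : 1e-10
@test maximum(I -> abs(Mh_cpu[I] - Mh_gpu[I]), CartesianIndices(X_cpu)) <= tol(eltype(X_cpu))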

ext/CUDAExt.jl Outdated
for n in 1:N
V = reduce(.*, U[i]'U[i] for i in setdiff(1:N, n))
U[n] = GCPDecompositions.mttkrp(X, U, n) / V
λ = CuArray(CUDA.norm.(eachcol(U[n])))
@alexmul1114 (Contributor Author) commented on Jan 18, 2024

Looks like line 39 here is much slower when data has to move between devices:

julia> @btime norm.(eachcol(U_cpu[n]))
  1.408 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  440.045 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997

Contributor Author

Rewriting norm.(eachcol(U[n])) as vec(sqrt.(sum(abs2, U_gpu[n]; dims=1))) avoids transferring data back to the CPU and gives roughly a 5x speedup for the GPU version, with a similar time on the CPU:

julia> @btime norm.(eachcol(U_cpu[n]))
  1.364 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime vec(sqrt.(sum(abs2, U_cpu[n]; dims=1)))
  1.212 μs (6 allocations: 304 bytes)
10-element Vector{Float32}:
 1.9417411
 2.0177264
 1.904886
 1.6364719
 1.408496
 0.9433015
 1.0931743
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  368.776 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997

julia> @btime CUDA.@sync vec(sqrt.(sum(abs2, U_gpu[n]; dims=1)))
  67.685 μs (112 allocations: 5.52 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.9432999
 1.093176
 1.6607368
 1.2730248
 1.8184998

@alexmul1114
Contributor Author

Testing on a Caviness node with a T4 GPU and a CPU allocation of 8 cores with 8 GB each:

julia> using LinearAlgebra

julia> LinearAlgebra.peakflops()
1.7854915177491415e11

julia> using CUDA

julia> using BenchmarkTools

julia> using GCPDecompositions

julia> X_gpu_small = CUDA.rand(40,40,40);

julia> X_cpu_small = Array(X_gpu_small);

julia> X_gpu_med = CUDA.rand(80,80,80);

julia> X_cpu_med = Array(X_gpu_med);

julia> X_gpu_large = CUDA.rand(120,120,120);

julia> X_cpu_large = Array(X_gpu_large);

julia> @btime gcp($X_cpu_small, 10);
  41.223 ms (424623 allocations: 66.52 MiB)

julia> @btime CUDA.@sync gcp($X_gpu_small, 10);
  508.272 ms (1048058 allocations: 39.40 MiB)

julia> @btime gcp($X_cpu_med, 10);
  440.675 ms (3114623 allocations: 279.40 MiB)

julia> @btime CUDA.@sync gcp($X_gpu_med, 10);
  548.730 ms (3757483 allocations: 101.68 MiB)

julia> @btime gcp($X_cpu_large, 10);
  1.704 s (10410623 allocations: 695.66 MiB)

julia> @btime CUDA.@sync gcp($X_gpu_large, 10);
  703.406 ms (11063857 allocations: 269.05 MiB)

julia> X_gpu_xlarge = CUDA.rand(180,180,180);

julia> X_cpu_xlarge = Array(X_gpu_xlarge);

julia> @btime gcp($X_cpu_xlarge, 10);
  6.928 s (35034623 allocations: 1.77 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_xlarge, 10);
  1.143 s (35694525 allocations: 832.90 MiB)

julia> X_gpu_2xlarge = CUDA.rand(240,240,240);

julia> X_cpu_2xlarge = Array(X_gpu_2xlarge);

julia> @btime gcp($X_cpu_2xlarge, 10);
  12.863 s (82986623 allocations: 3.60 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_2xlarge, 10);
  2.075 s (83641618 allocations: 1.89 GiB)

julia> X_gpu_3xlarge = CUDA.rand(300,300,300);

julia> X_cpu_3xlarge = Array(X_gpu_3xlarge);

julia> @btime gcp($X_cpu_3xlarge, 10);
  34.129 s (162042623 allocations: 6.34 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_3xlarge, 10);
  3.838 s (162679111 allocations: 3.65 GiB)

julia> X_gpu_4xlarge = CUDA.rand(400,400,400);

julia> X_cpu_4xlarge = Array(X_gpu_4xlarge);

julia> @btime gcp($X_cpu_4xlarge, 10);
  47.972 s (384042623 allocations: 13.41 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_4xlarge, 10);
  8.170 s (384673070 allocations: 8.61 GiB)

julia> X_gpu_5xlarge = CUDA.rand(600,600,600);

julia> X_cpu_5xlarge = Array(X_gpu_5xlarge);

julia> @btime gcp($X_cpu_5xlarge, 10);
  164.451 s (1296048626 allocations: 39.78 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_5xlarge, 10);
  25.027 s (1296697062 allocations: 29.00 GiB)

julia> X_gpu_6xlarge = CUDA.rand(800,800,800);

julia> X_cpu_6xlarge = Array(X_gpu_6xlarge);

julia> @btime gcp($X_cpu_6xlarge, 10);
  204.891 s (3072048626 allocations: 87.84 GiB)

julia> @btime CUDA.@sync gcp($X_gpu_6xlarge, 10);
  57.469 s (3072719936 allocations: 68.70 GiB)

@dahong67 changed the title from "CUDA Support" to "CUDA Support for ALS" on Feb 28, 2024
Successfully merging this pull request may close these issues.

Support CUDA arrays