CUDA Support for ALS #37

Open

wants to merge 24 commits into base: master

Changes from 22 of 24 commits

Commits
d48f0f1
Fixes #36 (with simple MTTKRP implementation)
alexmul1114 Jan 8, 2024
4596484
Change to Arrays AbstractArrays
alexmul1114 Jan 8, 2024
6e2a476
Replace skipmissing
alexmul1114 Jan 10, 2024
01e2ab9
Use ismissing in mapreduce instead of skipmissing
alexmul1114 Jan 10, 2024
1e6097e
Add new _gcp method for CuArray
alexmul1114 Jan 10, 2024
fd71ca4
Add CUDA gcp as extension, use simple MTTKRP implementation and khatr…
alexmul1114 Jan 10, 2024
e5b30ea
Add CUDA to deps for backward compatibility
alexmul1114 Jan 10, 2024
c34cc3c
Move CUDA to extras
alexmul1114 Jan 10, 2024
966e30b
Change CUDA version compatibility
alexmul1114 Jan 10, 2024
9938c46
Modify project.toml
alexmul1114 Jan 10, 2024
e8077a3
Test with new MTTKRP from #35
alexmul1114 Jan 11, 2024
fb1d0ce
Change .= back to =
alexmul1114 Jan 11, 2024
e368433
Fix typos
alexmul1114 Jan 11, 2024
614929d
Use latest MTTKRP from #35
alexmul1114 Jan 11, 2024
7854ec8
collect selectdim for now
alexmul1114 Jan 11, 2024
df803e1
Try reusing inner
alexmul1114 Jan 11, 2024
2a7f0c2
Use copy in MTTKRP middle mode Rn computation for now
alexmul1114 Jan 11, 2024
b72f6e9
Change abstract arrays back to arrays
alexmul1114 Jan 12, 2024
12f0004
Remove typo
alexmul1114 Jan 12, 2024
e9680f6
Switch khatrirao to accept abstractmatrix
alexmul1114 Jan 12, 2024
ebc5455
Fix typo
alexmul1114 Jan 12, 2024
ab40583
Temporarily use Float32 for comparison against GPU
alexmul1114 Jan 15, 2024
f4696c0
Replace norm function
alexmul1114 Jan 18, 2024
d7f68ed
Merge branch 'master' into pr/alexmul1114/37
dahong67 Mar 6, 2024
7 changes: 7 additions & 0 deletions Project.toml
@@ -4,22 +4,29 @@ authors = ["David Hong <[email protected]> and contributors"]
version = "0.1.2"

[deps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
IntervalSets = "8197267c-284f-5f27-9208-e0e47529a953"
LBFGSB = "5be7bae1-8223-5378-bac3-9e7378a2f6e6"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
LossFunctions = "30fc2ffe-d236-52d8-8643-a9d8f7c094a7"

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
LossFunctions = "30fc2ffe-d236-52d8-8643-a9d8f7c094a7"

[extensions]
CUDAExt = "CUDA"
LossFunctionsExt = "LossFunctions"

[compat]
CUDA = ">= 4.4.1"
ForwardDiff = "0.10.36"
IntervalSets = "0.7.7"
LBFGSB = "0.4.1"
LinearAlgebra = "1.6"
LossFunctions = "0.11.1"
julia = "1.6"

[extras]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
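With CUDA listed under [weakdeps] and [extensions] rather than [deps], the CUDAExt module only gets loaded once the user loads CUDA.jl themselves, so GCPDecompositions keeps no hard CUDA dependency; on Julia versions without package extensions, the conditional include in src/GCPDecompositions.jl below loads ext/CUDAExt.jl directly. A rough sketch of the loading behavior on Julia 1.9+ (illustrative REPL session, not taken from the PR):

julia> using GCPDecompositions        # CUDAExt is not loaded at this point

julia> Base.get_extension(GCPDecompositions, :CUDAExt)   # returns nothing

julia> using CUDA                     # loading the weak dependency triggers CUDAExt

julia> Base.get_extension(GCPDecompositions, :CUDAExt)   # now returns the extension module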
47 changes: 47 additions & 0 deletions ext/CUDAExt.jl
@@ -0,0 +1,47 @@
module CUDAExt

using GCPDecompositions, CUDA

GCPDecompositions.gcp(
X::CuArray,
r,
loss = LeastSquaresLoss();
constraints = (),
algorithm = GCPAlgorithms.ALS(),
) = _gcp(X, r, loss, constraints, algorithm)
function _gcp(
X::CuArray{TX,N},
r,
loss::LeastSquaresLoss,
constraints::Tuple{},
algorithm::GCPAlgorithms.ALS,
) where {TX<:Real,N}
T = promote_type(TX, Float32)

# Random initialization
M0 = CPD(ones(T, r), rand.(T, size(X), r))
M0norm = sqrt(sum(abs2, M0[I] for I in CartesianIndices(size(M0))))
Comment from alexmul1114 (Contributor Author):

Added CUDA as an extension, where the extension has a gcp definition for CuArray input. Right now M0 is created and normalized on the CPU and then moved to the GPU for ALS (and moved back to the CPU to be returned at the end). Need to figure out how to rewrite line 24 without scalar indexing so that M0 can be created directly as a CuArray.
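One standard way to avoid the scalar indexing in the M0norm line mentioned above is the CP norm identity: for M = CPD(λ, U), ||M||^2 = λ' * (U[1]'U[1] .* ... .* U[N]'U[N]) * λ, which needs only matrix products and an elementwise (Hadamard) product and therefore also works on CuArray factors. A minimal sketch under that assumption; the helper name is hypothetical and this is not code from the PR:

# Hypothetical sketch: CPD norm via factor Gram matrices, no scalar indexing
function cpd_norm_sketch(λ, U)
    G = reduce(.*, Uk'Uk for Uk in U)   # Hadamard product of the r × r Gram matrices
    return sqrt(λ' * G * λ)
end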

Xnorm = sqrt(mapreduce(x -> isnan(x) ? 0 : abs2(x), +, X, init=0f0))
for k in Base.OneTo(N)
M0.U[k] .*= (Xnorm / M0norm)^(1 / N)
end
λ, U = M0.λ, collect(M0.U)

# Move λ, U to gpu
λ = CuArray(λ)
U = [CuArray(U[i]) for i in 1:N]

# Inefficient but simple implementation
for _ in 1:algorithm.maxiters
for n in 1:N
V = reduce(.*, U[i]'U[i] for i in setdiff(1:N, n))
U[n] = GCPDecompositions.mttkrp(X, U, n) / V
λ = CuArray(CUDA.norm.(eachcol(U[n])))
Comment from alexmul1114 (Contributor Author), Jan 18, 2024:

Looks like line 39 here is much slower when the data has to be moved between devices:

julia> @btime norm.(eachcol(U_cpu[n]))
  1.408 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  440.045 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997

Comment from alexmul1114 (Contributor Author):

Rewriting norm.(eachcol(U[n])) as vec(sqrt.(sum(abs2, U_gpu[n]; dims=1))) prevents the data from being transferred back to the CPU and gives a 5x speedup for the GPU version, with similar time for the CPU:

julia> @btime norm.(eachcol(U_cpu[n]))
  1.364 μs (4 allocations: 192 bytes)
10-element Vector{Float32}:
 1.941741
 2.0177264
 1.904886
 1.636472
 1.4084961
 0.9433015
 1.0931745
 1.6607362
 1.273025
 1.8184978

julia> @btime vec(sqrt.(sum(abs2, U_cpu[n]; dims=1)))
  1.212 μs (6 allocations: 304 bytes)
10-element Vector{Float32}:
 1.9417411
 2.0177264
 1.904886
 1.6364719
 1.408496
 0.9433015
 1.0931743
 1.6607362
 1.273025
 1.8184978

julia> @btime CUDA.@sync CuArray(norm.(eachcol(U_gpu[n])))
  368.776 μs (173 allocations: 7.38 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.94329995
 1.0931759
 1.6607368
 1.2730248
 1.8184997

julia> @btime CUDA.@sync vec(sqrt.(sum(abs2, U_gpu[n]; dims=1)))
  67.685 μs (112 allocations: 5.52 KiB)
10-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.9417399
 2.0177276
 1.9048846
 1.6364713
 1.408497
 0.9432999
 1.093176
 1.6607368
 1.2730248
 1.8184998

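# Sketch (not part of this diff): the rewrite proposed in the comment above would
# keep λ on the GPU by replacing the CuArray(CUDA.norm.(eachcol(U[n]))) line with
#     λ = vec(sqrt.(sum(abs2, U[n]; dims=1)))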
U[n] = U[n] ./ permutedims(λ)
end
end

return CPD(Array(λ), Tuple([Array(U[i]) for i in 1:N]))
end

end
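With the extension in place, gcp can be called directly on a CuArray and, per the return statement above, hands back a CPD whose λ and U have been copied to the CPU. A hypothetical usage sketch (sizes, rank, and the random data are made up; loss and algorithm are the defaults from the method signature):

using GCPDecompositions, CUDA

X = CUDA.rand(Float32, 50, 40, 30)   # data tensor already on the GPU (illustrative)
M = gcp(X, 5)                        # dispatches to the CuArray method from CUDAExt
M.λ, M.U                             # returned factors are plain CPU Arrays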
1 change: 1 addition & 0 deletions src/GCPDecompositions.jl
@@ -39,6 +39,7 @@ include("gcp-opt.jl")

if !isdefined(Base, :get_extension)
include("../ext/LossFunctionsExt.jl")
include("../ext/CUDAExt.jl")
end

end
53 changes: 41 additions & 12 deletions src/gcp-opt.jl
@@ -150,8 +150,8 @@ function _gcp(
constraints::Tuple{},
algorithm::GCPAlgorithms.ALS,
) where {TX<:Real,N}
- T = promote_type(TX, Float64)

#T = promote_type(TX, Float64)
T = promote_type(TX, Float32)
# Random initialization
M0 = CPD(ones(T, r), rand.(T, size(X), r))
M0norm = sqrt(sum(abs2, M0[I] for I in CartesianIndices(size(M0))))
@@ -174,21 +174,50 @@ return CPD(λ, Tuple(U))
return CPD(λ, Tuple(U))
end

- # inefficient but simple
"""
mttkrp(X, U, n) -> Rn

Algorithm for computing one mode of MTTKRP is from "Fast Alternating LS Algorithms
for High Order CANDECOMP/PARAFAC Tensor Factorizations" by Phan et al., specifically
section III-B.
"""
function mttkrp(X, U, n)

# Dimensions
N, I, r = length(U), Tuple(size.(U, 1)), (only∘unique)(size.(U, 2))
(N == ndims(X) && I == size(X)) || throw(DimensionMismatch("`X` and `U` do not have matching dimensions"))

- # Matricized tensor (in mode n)
- Xn = reshape(permutedims(X, [n; setdiff(1:N, n)]), size(X, n), :)

- # Khatri-Rao product (in mode n)
- Zn = similar(U[1], prod(I[setdiff(1:N, n)]), r)
- for j in 1:r
- Zn[:, j] = reduce(kron, [view(U[i], :, j) for i in reverse(setdiff(1:N, n))])
# See section III-B from "Fast Alternating LS Algorithms for High Order CANDECOMP/PARAFAC Tensor Factorizations" by Phan et al.
Rn = similar(U[n])
Jn = prod(size(X)[1:n])
Kn = prod(size(X)[n+1:end])

# Special cases are n = 1 and n = N (n = 1 has no outer tensor-vector products,
# n = N has no inner tensor-vector products)
if n == 1
# Just inner tensor-vector products
kr_inner = khatrirao(U[reverse(2:N)]...)
mul!(Rn, reshape(X, size(X, 1), :), kr_inner)
elseif n == N
# Just outer tensor-vector products
kr_outer = khatrirao(U[reverse(1:N-1)]...)
mul!(Rn, transpose(reshape(X, prod(size(X)[1:N-1]), size(X)[N])), kr_outer)
else
kr_inner = khatrirao(U[reverse(n+1:N)]...)
kr_outer = khatrirao(U[reverse(1:n-1)]...)
inner = reshape(reshape(X, Jn, Kn) * kr_inner, (size(X)[1:n]..., r))
Jn_inner = prod(size(inner)[1:n-1])
Kn_inner = prod(size(inner)[n:end-1])
Rn = reduce(hcat, [copy(transpose(reshape(selectdim(inner, ndims(inner), j), Jn_inner, Kn_inner))) * kr_outer[:, j] for j in 1:r])
end
return Rn
end

- # MTTKRP (in mode n)
- return Xn * Zn
function khatrirao(A::Vararg{T,N}) where {T<:AbstractMatrix,N}
r = size(A[1],2)
R = ntuple(Val(N)) do k
dims = (ntuple(i->1,Val(N-k))..., :, ntuple(i->1,Val(k-1))..., r)
return reshape(A[k],dims)
end
return reshape(broadcast(*, R...),:,r)
end
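The new mttkrp is expected to agree with the naive definition it replaces, i.e. the mode-n matricization of X times the Khatri-Rao product of the remaining factors in reverse order. A small consistency check along those lines, written as a sketch (mttkrp and khatrirao are internal functions, hence the module-qualified names; sizes and rank are made up):

using GCPDecompositions

X = randn(10, 20, 30)
U = [randn(10, 5), randn(20, 5), randn(30, 5)]
n = 2

# Naive reference: mode-n unfolding times the Khatri-Rao product of the other factors
Xn = reshape(permutedims(X, [n; setdiff(1:3, n)]), size(X, n), :)
Zn = GCPDecompositions.khatrirao(U[reverse(setdiff(1:3, n))]...)
Rn_ref = Xn * Zn

Rn = GCPDecompositions.mttkrp(X, U, n)
Rn ≈ Rn_ref   # expected to hold up to floating-point roundoff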