Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve efficiency of sweep operator by recursive tiling #9

Open
3 tasks
Hua-Zhou opened this issue May 9, 2022 · 0 comments
Open
3 tasks

Improve efficiency of sweep operator by recursive tiling #9

Hua-Zhou opened this issue May 9, 2022 · 0 comments

Comments

@Hua-Zhou
Copy link

Hua-Zhou commented May 9, 2022

The block form of sweep is blessed by level-3 BLAS. Coupled with recursive tiling, it may improve the efficiency of matrix inversion. Below is some prototype code.

versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_NUM_THREADS = 8
using BenchmarkTools, LinearAlgebra, Random, SweepOperator
# create an nxn pd test matrix
Random.seed!(123)
n = 1024
A = randn(n, n)
A = A'A + I;

Method 1: invert by Cholesky

"""
    inv_by_chol!(A)

Invert a pd matrix `A` in-place by Cholesky decomposition.
"""
function inv_by_chol!(A::Matrix{T}) where T <: LinearAlgebra.BlasReal
    LAPACK.potrf!('U', A)
    LAPACK.potri!('U', A)
    A
end
inv_by_chol!
UpperTriangular(inv_by_chol!(copy(A)))  UpperTriangular(inv(A))
true

Method 2: invert by sweep

# sweep produce - A⁻¹
UpperTriangular(sweep!(copy(A), 1:n))  UpperTriangular(-inv(A))
true

Method 3: invert by tiled sweep

"""
    sweep_block_kernel!(A, krange, invsw)

Perform the block form of sweep on a contiguous range of indices `krange`.

[ Ak̲k̲ Ak̲k Ak̲k̅ ]          [ Ak̲k̲-Ak̲k*Akk⁻¹*Akk̲  Ak̲k*Akk⁻¹ Ak̲k̅-Ak̲k*Akk⁻¹*Akk̅ ] 
[  -  Akk Akk̅ ] -sweep-> [        -            -Akk⁻¹       Akk⁻¹*Akk̅     ]
[  -   -  Ak̅k̅ ]          [        -              -      Ak̅k̅-Ak̅k*Akk⁻¹*Akk̅ ]
"""
function sweep_block_kernel!(
        A      :: AbstractMatrix{T}, 
        krange :: AbstractUnitRange{<:Integer},
        invsw  :: Bool = false
        ) where {T <: LinearAlgebra.BlasReal}
    k̲range = 1:(krange[1] - 1)
    k̅range = (krange[end]+1):size(A, 2)
    Akk = view(A, krange, krange)
    Ak̲k = view(A, k̲range, krange)
    Ak̲k̲ = view(A, k̲range, k̲range)
    Akk̅ = view(A, krange, k̅range)
    Ak̅k̅ = view(A, k̅range, k̅range)
    Ak̲k̅ = view(A, k̲range, k̅range)
    # U = cholesky(Akk).U
    U, _ = LAPACK.potrf!('U', Akk)
    # Ak̲k = Ak̲k * U⁻¹
    BLAS.trsm!('R', 'U', 'N', 'N', one(T), U, Ak̲k)
    # Ak̲k̲ = Ak̲k̲ - Ak̲k * U⁻¹ * U⁻ᵀ * Akk̲
    BLAS.syrk!('U', 'N', -one(T), Ak̲k, one(T), Ak̲k̲)
    # Akk̅ = U⁻ᵀ * Akk̅
    BLAS.trsm!('L', 'U', 'T', 'N', one(T), U, Akk̅)
    # Ak̲k̅ = Ak̲k̅ - Ak̲k * U⁻¹ * U⁻ᵀ * Akk̅
    BLAS.gemm!('N', 'N', -one(T), Ak̲k, Akk̅, one(T), Ak̲k̅)
    # Ak̅k̅ = Ak̅k̅ - Ak̅k * U⁻¹ * U⁻ᵀ * Akk̅
    BLAS.syrk!('U', 'T', -one(T), Akk̅, one(T), Ak̅k̅)
    # Ak̲k = Ak̲k * Akk⁻¹ = Ak̲k * U⁻¹ * U⁻ᵀ
    s = ifelse(invsw, -one(T), one(T))
    BLAS.trsm!('R', 'U', 'T', 'N', s, Akk, Ak̲k)
    # Akk̅ = Akk⁻¹ * Akk̅ = U⁻¹ * U⁻ᵀ * Akk̅
    BLAS.trsm!('L', 'U', 'N', 'N', s, Akk, Akk̅)
    # Akk = Akk⁻¹ = U⁻ᵀ U⁻¹
    LAPACK.potri!('U', U)
    UpperTriangular(Akk) .*= -1
    A
end
sweep_block_kernel!
const TILE = Ref(512) # this is now a length, in bytes!

function tile_maxiter(::Type{<:AbstractArray{T}}) where {T}
    isbitstype(T) || return TILE[] ÷ 8
    max(TILE[] ÷ sizeof(T), 4)
end
tile_maxiter(::Type{AT}) where {AT} = TILE[] ÷ 8 # treat anything unkown like Float64

@inline function findcleft(r::UnitRange, step::Int)
    if length(r) >= 2*step
        minimum(r) - 1 + step * div(length(r), step * 2)
    else
        # minimum(r) - 1 + div(length(r), 2, RoundNearest) # not in Julia 1.3
        minimum(r) - 1 + round(Int, length(r)/2)
    end
end

@inline function cleave(range::UnitRange, step::Int=4)
    cleft = findcleft(range, step)
    first(range):cleft, cleft+1:last(range)
end

# recursive tiling
function sweep_block!(
        A      :: AbstractMatrix{T}, 
        krange :: AbstractUnitRange{<:Integer},
        invsw  :: Bool = false
        ) where {T <: LinearAlgebra.BlasReal}
    klen = length(krange)
    maxL = tile_maxiter(T)
    if klen > maxL
        k1s, k2s = cleave(krange)
        sweep_block!(A, k1s, invsw)
        sweep_block!(A, k2s, invsw)
    else
        sweep_block_kernel!(A, krange, invsw)
    end
    return A
end
sweep_block! (generic function with 2 methods)
# upper triangular part should be same as sweep!
UpperTriangular(sweep_block!(copy(A), 1:n))  UpperTriangular(-inv(A))
true

Benchmark

At this particular size and machine, we see 3x higher performance of the tiled sweep than Cholesky in LAPACK.

@btime sweep_block!($(copy(A)), $(1:n)) evals=1;
  7.184 ms (16 allocations: 512 bytes)
@btime sweep!($(copy(A)), $(1:n)) evals=1;
  238.108 ms (1 allocation: 8.12 KiB)
@btime inv_by_chol!($(copy(A))) evals=1;
  25.904 ms (0 allocations: 0 bytes)

TODO

  • More extensive benchmarking at various matrix sizes.

  • Implement sweep_block_kernel! using Tullio.jl, so it accommodates more data types (half precision, BigFloat, ...), can be audo-diffed, and can be compiled by CUDA.jl.

  • sweep function can output determinant instead of the matrix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant