Improve efficiency of sweep operator by recursive tiling #9

Hua-Zhou · 2022-05-09T17:39:16Z

The block form of sweep is blessed by level-3 BLAS. Coupled with recursive tiling, it may improve the efficiency of matrix inversion. Below is some prototype code.

versioninfo()

Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_NUM_THREADS = 8

using BenchmarkTools, LinearAlgebra, Random, SweepOperator

# create an nxn pd test matrix
Random.seed!(123)
n = 1024
A = randn(n, n)
A = A'A + I;

Method 1: invert by Cholesky

"""
    inv_by_chol!(A)

Invert a pd matrix `A` in-place by Cholesky decomposition.
"""
function inv_by_chol!(A::Matrix{T}) where T <: LinearAlgebra.BlasReal
    LAPACK.potrf!('U', A)
    LAPACK.potri!('U', A)
    A
end

inv_by_chol!

UpperTriangular(inv_by_chol!(copy(A))) ≈ UpperTriangular(inv(A))

true

Method 2: invert by sweep

# sweep produce - A⁻¹
UpperTriangular(sweep!(copy(A), 1:n)) ≈ UpperTriangular(-inv(A))

true

Method 3: invert by tiled sweep

"""
    sweep_block_kernel!(A, krange, invsw)

Perform the block form of sweep on a contiguous range of indices `krange`.

[ Ak̲k̲ Ak̲k Ak̲k̅ ]          [ Ak̲k̲-Ak̲k*Akk⁻¹*Akk̲  Ak̲k*Akk⁻¹ Ak̲k̅-Ak̲k*Akk⁻¹*Akk̅ ] 
[  -  Akk Akk̅ ] -sweep-> [        -            -Akk⁻¹       Akk⁻¹*Akk̅     ]
[  -   -  Ak̅k̅ ]          [        -              -      Ak̅k̅-Ak̅k*Akk⁻¹*Akk̅ ]
"""
function sweep_block_kernel!(
        A      :: AbstractMatrix{T}, 
        krange :: AbstractUnitRange{<:Integer},
        invsw  :: Bool = false
        ) where {T <: LinearAlgebra.BlasReal}
    k̲range = 1:(krange[1] - 1)
    k̅range = (krange[end]+1):size(A, 2)
    Akk = view(A, krange, krange)
    Ak̲k = view(A, k̲range, krange)
    Ak̲k̲ = view(A, k̲range, k̲range)
    Akk̅ = view(A, krange, k̅range)
    Ak̅k̅ = view(A, k̅range, k̅range)
    Ak̲k̅ = view(A, k̲range, k̅range)
    # U = cholesky(Akk).U
    U, _ = LAPACK.potrf!('U', Akk)
    # Ak̲k = Ak̲k * U⁻¹
    BLAS.trsm!('R', 'U', 'N', 'N', one(T), U, Ak̲k)
    # Ak̲k̲ = Ak̲k̲ - Ak̲k * U⁻¹ * U⁻ᵀ * Akk̲
    BLAS.syrk!('U', 'N', -one(T), Ak̲k, one(T), Ak̲k̲)
    # Akk̅ = U⁻ᵀ * Akk̅
    BLAS.trsm!('L', 'U', 'T', 'N', one(T), U, Akk̅)
    # Ak̲k̅ = Ak̲k̅ - Ak̲k * U⁻¹ * U⁻ᵀ * Akk̅
    BLAS.gemm!('N', 'N', -one(T), Ak̲k, Akk̅, one(T), Ak̲k̅)
    # Ak̅k̅ = Ak̅k̅ - Ak̅k * U⁻¹ * U⁻ᵀ * Akk̅
    BLAS.syrk!('U', 'T', -one(T), Akk̅, one(T), Ak̅k̅)
    # Ak̲k = Ak̲k * Akk⁻¹ = Ak̲k * U⁻¹ * U⁻ᵀ
    s = ifelse(invsw, -one(T), one(T))
    BLAS.trsm!('R', 'U', 'T', 'N', s, Akk, Ak̲k)
    # Akk̅ = Akk⁻¹ * Akk̅ = U⁻¹ * U⁻ᵀ * Akk̅
    BLAS.trsm!('L', 'U', 'N', 'N', s, Akk, Akk̅)
    # Akk = Akk⁻¹ = U⁻ᵀ U⁻¹
    LAPACK.potri!('U', U)
    UpperTriangular(Akk) .*= -1
    A
end

sweep_block_kernel!

const TILE = Ref(512) # this is now a length, in bytes!

function tile_maxiter(::Type{<:AbstractArray{T}}) where {T}
    isbitstype(T) || return TILE[] ÷ 8
    max(TILE[] ÷ sizeof(T), 4)
end
tile_maxiter(::Type{AT}) where {AT} = TILE[] ÷ 8 # treat anything unkown like Float64

@inline function findcleft(r::UnitRange, step::Int)
    if length(r) >= 2*step
        minimum(r) - 1 + step * div(length(r), step * 2)
    else
        # minimum(r) - 1 + div(length(r), 2, RoundNearest) # not in Julia 1.3
        minimum(r) - 1 + round(Int, length(r)/2)
    end
end

@inline function cleave(range::UnitRange, step::Int=4)
    cleft = findcleft(range, step)
    first(range):cleft, cleft+1:last(range)
end

# recursive tiling
function sweep_block!(
        A      :: AbstractMatrix{T}, 
        krange :: AbstractUnitRange{<:Integer},
        invsw  :: Bool = false
        ) where {T <: LinearAlgebra.BlasReal}
    klen = length(krange)
    maxL = tile_maxiter(T)
    if klen > maxL
        k1s, k2s = cleave(krange)
        sweep_block!(A, k1s, invsw)
        sweep_block!(A, k2s, invsw)
    else
        sweep_block_kernel!(A, krange, invsw)
    end
    return A
end

sweep_block! (generic function with 2 methods)

# upper triangular part should be same as sweep!
UpperTriangular(sweep_block!(copy(A), 1:n)) ≈ UpperTriangular(-inv(A))

true

Benchmark

At this particular size and machine, we see 3x higher performance of the tiled sweep than Cholesky in LAPACK.

@btime sweep_block!($(copy(A)), $(1:n)) evals=1;

  7.184 ms (16 allocations: 512 bytes)

@btime sweep!($(copy(A)), $(1:n)) evals=1;

  238.108 ms (1 allocation: 8.12 KiB)

@btime inv_by_chol!($(copy(A))) evals=1;

  25.904 ms (0 allocations: 0 bytes)

TODO

More extensive benchmarking at various matrix sizes.
Implement sweep_block_kernel! using Tullio.jl, so it accommodates more data types (half precision, BigFloat, ...), can be audo-diffed, and can be compiled by CUDA.jl.
sweep function can output determinant instead of the matrix.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve efficiency of sweep operator by recursive tiling #9

Improve efficiency of sweep operator by recursive tiling #9

Hua-Zhou commented May 9, 2022 •

edited

Loading

Improve efficiency of sweep operator by recursive tiling #9

Improve efficiency of sweep operator by recursive tiling #9

Comments

Hua-Zhou commented May 9, 2022 • edited Loading

Method 1: invert by Cholesky

Method 2: invert by sweep

Method 3: invert by tiled sweep

Benchmark

TODO

Hua-Zhou commented May 9, 2022 •

edited

Loading