Merge pull request #19 from JuliaMath/ff/prefetch
@explicit/@fusible + cache prefetching
ffevotte authored Jun 29, 2020
2 parents 974593c + a284b4c commit 1fa3f88
Showing 19 changed files with 15,621 additions and 1,449 deletions.
1 change: 1 addition & 0 deletions .travis.yml
@@ -5,6 +5,7 @@ os:
julia:
- 1.3
- 1.4
- 1.5
- nightly
matrix:
fast_finish: true
4 changes: 1 addition & 3 deletions Project.toml
@@ -2,7 +2,7 @@ name = "AccurateArithmetic"
uuid = "22286c92-06ac-501d-9306-4abd417d9753"
license = "MIT"
repo = "https://github.com/JuliaMath/AccurateArithmetic.jl.git"
version = "0.3.3"
version = "0.3.4"

[deps]
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
@@ -12,9 +12,7 @@ SIMDPirates = "21efa798-c60a-11e8-04d3-e1a92915a26a"
VectorizationBase = "3d5dd08c-fd9d-11e8-17fa-ed2836048c2f"

[compat]
BenchmarkTools = "0"
JSON = "0"
Plots = "0"
SIMDPirates = "0.2, 0.3, 0.5, 0.6.3, 0.7, 0.8"
VectorizationBase = "0"
julia = "1"
93 changes: 57 additions & 36 deletions README.md
@@ -56,39 +56,48 @@ julia> sum(big.(x))
# But the standard summation algorithm computes this sum very inaccurately
# (not even the sign is correct)
julia> sum(x)
-136.0
-8.0


# Compensated summation algorithms should compute this more accurately
julia> using AccurateArithmetic

# Algorithm by Ogita, Rump and Oishi
julia> sum_oro(x)
0.9999999999999716
1.0000000000000084

# Algorithm by Kahan, Babuska and Neumaier
julia> sum_kbn(x)
0.9999999999999716
1.0000000000000084
```


![](test/figs/qual.svg)
![](test/figs/sum_accuracy.svg)
![](test/figs/dot_accuracy.svg)

In the graph above, we see the relative error vary as a function of the
In the graphs above, we see the relative error vary as a function of the
condition number, in a log-log scale. Errors lower than ϵ are arbitrarily set to
ϵ; conversely, when the relative error is more than 100% (i.e. no digit is
correctly computed anymore), the error is capped there in order to avoid
affecting the scale of the graph too much. What we see is that the pairwise
affecting the scale of the graph too much. What we see on the left is that the pairwise
summation algorithm (as implemented in Base.sum) starts losing accuracy as soon
as the condition number increases, computing only noise when the condition
number exceeds 1/ϵ≃10¹⁶. In contrast, both compensated algorithms
number exceeds 1/ϵ≃10¹⁶. The same goes for the naive summation algorithm.
In contrast, both compensated algorithms
(Kahan-Babuska-Neumaier and Ogita-Rump-Oishi) still accurately compute the
result at this point, and start losing accuracy there, computing meaningless
results when the condition number reaches 1/ϵ²≃10³². In effect, these (simply)
compensated algorithms produce the same results as if a naive summation had been
performed with twice the working precision (128 bits in this case), and then
rounded to 64-bit floats.

The same comments can be made for the dot product implementations shown on the
right. Uncompensated algorithms, as implemented in
`AccurateArithmetic.dot_naive` or `Base.dot` (which internally calls BLAS in
this case), exhibit typical loss of accuracy. In contrast, the implementation of
Ogita, Rump & Oishi's compensated algorithm effectively doubles the working
precision.
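
To give a flavor of how compensated algorithms achieve this, here is a minimal
scalar sketch built on the `two_sum` error-free transformation (the package's
actual implementations are vectorized and unrolled; the function names below
are illustrative, not part of the package API):

```julia
# Error-free transformation: returns the floating-point sum s and the exact
# rounding error err, such that a + b == s + err holds exactly.
function two_sum_scalar(a::Float64, b::Float64)
    s  = a + b
    a′ = s - b
    b′ = s - a′
    s, (a - a′) + (b - b′)
end

# Compensated (Ogita-Rump-Oishi style) summation: accumulate the rounding
# errors in a separate variable and add them back at the end.
function sum_compensated(x)
    s = 0.0
    e = 0.0
    for xi in x
        s, err = two_sum_scalar(s, xi)
        e += err
    end
    s + e
end
```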

<br/>

Performancewise, compensated algorithms perform a lot better than alternatives
@@ -97,24 +106,31 @@ such as arbitrary precision (`BigFloat`) or rational arithmetic (`Rational`):
```julia
julia> using BenchmarkTools

julia> length(x)
10001

julia> @btime sum($x)
1.305 μs (0 allocations: 0 bytes)
-136.0
1.320 μs (0 allocations: 0 bytes)
-8.0

julia> @btime sum_naive($x)
1.026 μs (0 allocations: 0 bytes)
-1.121325337906356

julia> @btime sum_oro($x)
3.421 μs (0 allocations: 0 bytes)
0.9999999999999716
3.348 μs (0 allocations: 0 bytes)
1.0000000000000084

julia> @btime sum_kbn($x)
3.792 μs (0 allocations: 0 bytes)
0.9999999999999716
3.870 μs (0 allocations: 0 bytes)
1.0000000000000084

julia> @btime sum(big.($x))
874.203 μs (20006 allocations: 1.14 MiB)
julia> @btime sum($(big.(x)))
437.495 μs (2 allocations: 112 bytes)
1.0

julia> @btime sum(Rational{BigInt}.(x))
22.702 ms (582591 allocations: 10.87 MiB)
julia> @btime sum($(Rational{BigInt}.(x)))
10.894 ms (259917 allocations: 4.76 MiB)
1//1
```

@@ -124,32 +140,37 @@ than their naive floating-point counterparts. As such, they usually perform
worse. However, leveraging the power of modern architectures via vectorization,
the slow-down can be kept to a small value.

![](test/figs/perf.svg)

In the graph above, the time spent in the summation (renormalized per element)
is plotted against the vector size (the units in the y-axis label should be
“ns/elem”). What we see with the standard summation is that, once vectors start
having significant sizes (say, more than 1000 elements), the implementation is
memory bound (as expected of a typical BLAS1 operation). Which is why we see
significant decreases in the performance when the vector can’t fit into the L2
cache (around 30k elements, or 256kB on my machine) or the L3 cache (around 400k
elements, or 3MB on my machine).

The Ogita-Rump-Oishi algorithm, when implemented with a suitable unrolling level
(ushift=2, i.e. 2²=4 unrolled iterations), is CPU-bound when vectors fit inside
the cache. However, when vectors are too large to fit into the L3 cache, the
implementation becomes memory-bound again (on my system), which means we get the
same performance as the standard summation.
![](test/figs/sum_performance.svg)
![](test/figs/dot_performance.svg)

Benchmarks presented in the above graphs were obtained on an Intel® Xeon® Gold
6128 CPU @ 3.40GHz. The time spent in the summation (renormalized per element)
is plotted against the vector size. What we see with the standard summation is
that, once vectors reach significant sizes (say, more than a few thousand
elements), the implementation is memory-bound (as expected of a typical BLAS1
operation), which is why performance drops significantly whenever the vector
can’t fit into the L1, L2 or L3 cache.

On this AVX512-enabled system, the Kahan-Babuska-Neumaier implementation tends
to be a little more efficient than the Ogita-Rump-Oishi algorithm (it would
generally be the opposite on AVX2 systems). When implemented with a suitable
unrolling level and cache prefetching, these implementations are CPU-bound when
vectors fit inside the L1 or L2 cache. However, when vectors are too large to
fit into the L2 cache, the implementation becomes memory-bound again (on this
system), which means we get the same performance as the standard
summation. The same goes for dot product calculations (graph on the right),
where the implementations from `AccurateArithmetic.jl` compete against MKL's
dot product.

In other words, the improved accuracy is free for sufficiently large
vectors. For smaller vectors, the accuracy comes with a slow-down that can reach
values slightly above 3 for vectors which fit in the L2 cache.
vectors. For smaller vectors, the accuracy comes with a slow-down by a factor of
approximately 3 when the data fits in the L2 cache.
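
For users who want to experiment with the unrolling and prefetching parameters,
something along the following lines should work. This is a hedged sketch: the
`accumulate` function and the `Summation`/`EFT` submodule paths are taken from
`src/Summation.jl`, but they are internal and not part of the documented API:

```julia
using AccurateArithmetic
using AccurateArithmetic.Summation: compSumAcc
using AccurateArithmetic.EFT: two_sum

x = rand(1_000_000)

# Same accumulator as sum_oro(x), but with an unrolling level of 3 (2^3 = 8
# unrolled iterations) and a prefetch parameter of 40 (the defaults are 2 and 35):
AccurateArithmetic.Summation.accumulate((x,), compSumAcc(two_sum),
                                        Val(:scalar), Val(3), Val(40))
```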


### Tests

The graphs above can be reproduced using the `test/perftests.jl` script in this
repository. Before running them, be aware that it takes around one hour to
repository. Before running them, be aware that it takes around two hours to
generate the performance graph, during which the benchmark machine should be as
lightly loaded as possible in order to avoid perturbing the performance measurements.

95 changes: 83 additions & 12 deletions src/SIMDops.jl
@@ -1,6 +1,50 @@
module SIMDops
import SIMDPirates
using SIMDPirates: Vec, vload, vsum, vabs, vfma, vifelse, vless
using SIMDPirates: Vec, vload, vsum, vabs, vfma, vifelse, vless, vzero


# * Generic operations on SIMD packs

fma(x...) = Base.fma(x...)
fma(x::T, y::T, z::T) where {T<:NTuple} = SIMDPirates.vfma(x, y, z)

fptype(::Type{Vec{W, T}}) where {W, T} = T


# * Macros rewriting mathematical operations

# ** Helper functions

replace_simd_ops(x, _, _) = x

function replace_simd_ops(sym::Symbol, ops, updates)
# Replace e.g: +
# => eadd
for (op, replacement) in ops
sym === op && return replacement
end
sym
end

function replace_simd_ops(expr::Expr, ops, updates)
# Replace e.g: lhs += rhs
# => lhs = eadd(lhs, rhs)
for (op, replacement) in updates
if expr.head === op
lhs = expr.args[1]
rhs = replace_simd_ops(expr.args[2], ops, updates)
return :($lhs = $replacement($lhs, $rhs))
end
end

newexpr = Expr(replace_simd_ops(expr.head, ops, updates))
for arg in expr.args
push!(newexpr.args, replace_simd_ops(arg, ops, updates))
end
newexpr
end

# ** Explicit/exact operations (non-fusible)

eadd(x...) = Base.:+(x...)
eadd(x::T, y::T) where {T<:NTuple} = SIMDPirates.evadd(x, y)
@@ -13,20 +57,47 @@ esub(x::T, y::T) where {T<:NTuple} = SIMDPirates.evsub(x, y)
emul(x...) = Base.:*(x...)
emul(x::T, y::T) where {T<:NTuple} = SIMDPirates.evmul(x, y)

fma(x...) = Base.fma(x...)
fma(x::T, y::T, z::T) where {T<:NTuple} = SIMDPirates.vfma(x, y, z)
function explicit(expr::Expr)
replace_simd_ops(expr,
(:+ => eadd,
:- => esub,
:* => emul,
:fma => fma),
(:+= => eadd,
:-= => esub,
:*= => emul))
end

macro explicit()
quote
$(esc(:+)) = eadd
$(esc(:-)) = esub
$(esc(:*)) = emul
end
macro explicit(expr)
explicit(expr) |> esc
end

vzero(x) = zero(x)
vzero(::Type{Vec{W, T}}) where {W, T} = SIMDPirates.vbroadcast(Vec{W, T}, 0)

fptype(::Type{Vec{W, T}}) where {W, T} = T
# * Fast/fusible operations

fadd(x...) = Base.:+(x...)
fadd(x::T, y::T) where {T<:NTuple} = SIMDPirates.vadd(x, y)

fsub(x...) = Base.:-(x...)
fsub(x::NTuple) = broadcast(esub, x)
fsub(x::T, y::T) where {T<:NTuple} = SIMDPirates.vsub(x, y)

fmul(x...) = Base.:*(x...)
fmul(x::T, y::T) where {T<:NTuple} = SIMDPirates.vmul(x, y)

function fusible(expr::Expr)
replace_simd_ops(expr,
(:+ => fadd,
:- => fsub,
:* => fmul,
:fma => fma),
(:+= => fadd,
:-= => fsub,
:*= => fmul))
end

macro fusible(expr)
fusible(expr) |> esc
end
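
# Illustrative usage (not part of this diff; variable names are hypothetical).
# Both macros rewrite the expression they receive, replacing arithmetic
# operators with the corresponding SIMDPirates-backed functions:
#
#     @explicit acc += x * y    # becomes acc = eadd(acc, emul(x, y))
#     @fusible  acc += x * y    # becomes acc = fadd(acc, fmul(x, y))
#
# The "explicit" variants lower to exact (non-fusible) vector instructions,
# whereas the "fusible" variants allow the compiler to fuse operations
# (e.g. into FMAs) when profitable.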

end
63 changes: 54 additions & 9 deletions src/Summation.jl
@@ -6,6 +6,7 @@ import VectorizationBase

import ..SIMDops
using ..SIMDops: Vec, vload, vsum, vzero, fptype
using SIMDPirates

using ..EFT: two_sum, fast_two_sum, two_prod

@@ -19,10 +20,11 @@ include("accumulators/compDot.jl")
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
@generated function accumulate(x::NTuple{A, AbstractArray{T}},
accType::F,
rem_handling = Val(:scalar),
::Val{Ushift} = Val(2)
) where {F, A, T <: Union{Float32,Float64}, Ushift}
accType::F,
rem_handling = Val(:scalar),
::Val{Ushift} = Val(2),
::Val{Prefetch} = Val(0),
) where {F, A, T <: Union{Float32,Float64}, Ushift, Prefetch}
@assert 0 ≤ Ushift < 6
U = 1 << Ushift

@@ -43,6 +45,10 @@ include("accumulators/compDot.jl")
Nshift = N >> $(shift + Ushift)
offset = 0
for n in 1:Nshift
if $Prefetch > 0
SIMDPirates.prefetch.(px.+offset.+$(Prefetch*WT), Val(3), Val(0))
end

Base.Cartesian.@nexprs $U u -> begin
xi = vload.($V, px.+offset)
add!(acc_u, xi...)
@@ -85,11 +91,50 @@ include("accumulators/compDot.jl")
end
end

sum_naive(x) = accumulate((x,), sumAcc, Val(:scalar), Val(3))
sum_kbn(x) = accumulate((x,), compSumAcc(fast_two_sum), Val(:scalar), Val(2))
sum_oro(x) = accumulate((x,), compSumAcc(two_sum), Val(:scalar), Val(2))
# Default values for unrolling
@inline default_ushift(::SumAcc) = Val(3)
@inline default_ushift(::CompSumAcc) = Val(2)
@inline default_ushift(::DotAcc) = Val(3)
@inline default_ushift(::CompDotAcc) = Val(2)
# dispatch
# either default_ushift(x, acc)
# or default_ushift((x,), acc)
@inline default_ushift(x::AbstractArray, acc) = default_ushift(acc(eltype(x)))
@inline default_ushift(x::NTuple, acc) = default_ushift(first(x), acc)


# Default values for cache prefetching
@inline default_prefetch(::SumAcc) = Val(0)
@inline default_prefetch(::CompSumAcc) = Val(35)
@inline default_prefetch(::DotAcc) = Val(0)
@inline default_prefetch(::CompDotAcc) = Val(20)
# dispatch
# either default_prefetch(x, acc)
# or default_prefetch((x,), acc)
@inline default_prefetch(x::AbstractArray, acc) = default_prefetch(acc(eltype(x)))
@inline default_prefetch(x::NTuple, acc) = default_prefetch(first(x), acc)


@inline _sum(x, acc) = if length(x) < 500
# no cache prefetching for small vectors
accumulate((x,), acc, Val(:scalar), default_ushift(x, acc), Val(0))
else
accumulate((x,), acc, Val(:scalar), default_ushift(x, acc), default_prefetch(x, acc))
end

sum_naive(x) = _sum(x, sumAcc)
sum_kbn(x) = _sum(x, compSumAcc(fast_two_sum))
sum_oro(x) = _sum(x, compSumAcc(two_sum))


@inline _dot(x, y, acc) = if length(x) < 500
# no cache prefetching for small vectors
accumulate((x,y), acc, Val(:scalar), default_ushift(x, acc), Val(0))
else
accumulate((x,y), acc, Val(:scalar), default_ushift(x, acc), default_prefetch(x, acc))
end

dot_naive(x, y) = accumulate((x,y), dotAcc, Val(:scalar), Val(3))
dot_oro(x, y) = accumulate((x,y), compDotAcc, Val(:scalar), Val(2))
dot_naive(x, y) = _dot(x, y, dotAcc)
dot_oro(x, y) = _dot(x, y, compDotAcc)
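
# Illustrative note (not part of this diff): the exported entry points now pick
# the unrolling and prefetching defaults automatically. For a Float64 vector
# with at least 500 elements, e.g.
#
#     x = rand(10_000)
#     sum_oro(x)
#
# dispatches to accumulate((x,), compSumAcc(two_sum), Val(:scalar), Val(2),
# Val(35)), while shorter vectors skip cache prefetching (Val(0)).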

end

2 comments on commit 1fa3f88

@ffevotte
Contributor Author

@JuliaRegistrator register

Release notes:

This release introduces a `@fusible` counterpart to `SIMDops.@explicit`, which allows using SIMD instructions from SIMDPirates.jl in all cases, whether exact instructions are required or fused operations are allowed. EFTs now use this when possible. The API for `@explicit` changed: it now applies only to the expression that follows it (instead of the whole scope inside which it is placed).
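
As an illustration of the API change (a hedged sketch; the module path and variable names below are assumptions, not taken from the package documentation):

```julia
using AccurateArithmetic.SIMDops: @explicit, @fusible

acc, x, y = 0.0, 1.0, 2.0

# The macro now rewrites only the expression it receives, roughly to
# acc = eadd(acc, emul(x, y)), i.e. exact (non-fusible) operations:
@explicit acc += x * y

# The new @fusible counterpart performs the same rewrite with fadd/fmul,
# allowing fused multiply-adds:
@fusible acc += x * y
```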

This release also adds support for cache prefetching, which significantly improves performance for large vectors. This was actually the explanation for the performance discrepancies observed for large vectors and mentioned in #7. Default values for the cache prefetching mechanism are probably not optimal for all architectures; users interested in the last 10% of performance outside the cache are invited to customize this parameter.

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/17162

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.3.4 -m "<description of version>" 1fa3f88dd50eb9f9e467fb6e71dff5ce865bf9e9
git push origin v0.3.4
