Merge pull request #19 from JuliaMath/ff/prefetch
@explicit/@fusible + cache prefetching
ffevotte authored Jun 29, 2020
2 parents 974593c + a284b4c commit 1fa3f88
Showing 19 changed files with 15,621 additions and 1,449 deletions.
1 change: 1 addition & 0 deletions .travis.yml
@@ -5,6 +5,7 @@ os:
julia:
- 1.3
- 1.4
- 1.5
- nightly
matrix:
fast_finish: true
4 changes: 1 addition & 3 deletions Project.toml
@@ -2,7 +2,7 @@ name = "AccurateArithmetic"
uuid = "22286c92-06ac-501d-9306-4abd417d9753"
license = "MIT"
repo = "https://github.com/JuliaMath/AccurateArithmetic.jl.git"
version = "0.3.3"
version = "0.3.4"

[deps]
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
@@ -12,9 +12,7 @@ SIMDPirates = "21efa798-c60a-11e8-04d3-e1a92915a26a"
VectorizationBase = "3d5dd08c-fd9d-11e8-17fa-ed2836048c2f"

[compat]
BenchmarkTools = "0"
JSON = "0"
Plots = "0"
SIMDPirates = "0.2, 0.3, 0.5, 0.6.3, 0.7, 0.8"
VectorizationBase = "0"
julia = "1"
93 changes: 57 additions & 36 deletions README.md
@@ -56,39 +56,48 @@ julia> sum(big.(x))
# But the standard summation algorithm computes this sum very inaccurately
# (not even the sign is correct)
julia> sum(x)
-136.0
-8.0


# Compensated summation algorithms should compute this more accurately
julia> using AccurateArithmetic

# Algorithm by Ogita, Rump and Oishi
julia> sum_oro(x)
0.9999999999999716
1.0000000000000084

# Algorithm by Kahan, Babuska and Neumaier
julia> sum_kbn(x)
0.9999999999999716
1.0000000000000084
```


![](test/figs/qual.svg)
![](test/figs/sum_accuracy.svg)
![](test/figs/dot_accuracy.svg)

In the graph above, we see the relative error vary as a function of the
In the graphs above, we see the relative error vary as a function of the
condition number, in a log-log scale. Errors lower than ϵ are arbitrarily set to
ϵ; conversely, when the relative error is more than 100% (i.e. no digit is
correctly computed anymore), the error is capped there in order to avoid
affecting the scale of the graph too much. What we see is that the pairwise
affecting the scale of the graph too much. What we see on the left is that the pairwise
summation algorithm (as implemented in Base.sum) starts losing accuracy as soon
as the condition number increases, computing only noise when the condition
number exceeds 1/ϵ≃10¹⁶. In contrast, both compensated algorithms
number exceeds 1/ϵ≃10¹⁶. The same goes for the naive summation algorithm.
In contrast, both compensated algorithms
(Kahan-Babuska-Neumaier and Ogita-Rump-Oishi) still accurately compute the
result at this point, and start losing accuracy there, computing meaningless
results when the condition number reaches 1/ϵ²≃10³². In effect, these (simply)
compensated algorithms produce the same results as if a naive summation had been
performed with twice the working precision (128 bits in this case), and then
rounded to 64-bit floats.

The same comments can be made for the dot product implementations shown on the
right. Uncompensated algorithms, as implemented in
`AccurateArithmetic.dot_naive` or `Base.dot` (which internally calls BLAS in
this case), exhibit typical loss of accuracy. In contrast, the implementation of
Ogita, Rump & Oishi's compensated algorithm effectively doubles the working
precision.
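
To give a flavor of how compensated algorithms achieve this, here is a minimal
scalar sketch built on the `two_sum` error-free transformation (the package's
actual implementations are vectorized and unrolled; the function names below
are illustrative, not part of the package API):

```julia
# Error-free transformation: returns the floating-point sum s and the exact
# rounding error err, such that a + b == s + err holds exactly.
function two_sum_scalar(a::Float64, b::Float64)
    s  = a + b
    a′ = s - b
    b′ = s - a′
    s, (a - a′) + (b - b′)
end

# Compensated (Ogita-Rump-Oishi style) summation: accumulate the rounding
# errors in a separate variable and add them back at the end.
function sum_compensated(x)
    s = 0.0
    e = 0.0
    for xi in x
        s, err = two_sum_scalar(s, xi)
        e += err
    end
    s + e
end
```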

<br/>

Performancewise, compensated algorithms perform a lot better than alternatives
@@ -97,24 +106,31 @@ such as arbitrary precision (`BigFloat`) or rational arithmetic (`Rational`):
```julia
julia> using BenchmarkTools

julia> length(x)
10001

julia> @btime sum($x)
1.305 μs (0 allocations: 0 bytes)
-136.0
1.320 μs (0 allocations: 0 bytes)
-8.0

julia> @btime sum_naive($x)
1.026 μs (0 allocations: 0 bytes)
-1.121325337906356

julia> @btime sum_oro($x)
3.421 μs (0 allocations: 0 bytes)
0.9999999999999716
3.348 μs (0 allocations: 0 bytes)
1.0000000000000084

julia> @btime sum_kbn($x)
3.792 μs (0 allocations: 0 bytes)
0.9999999999999716
3.870 μs (0 allocations: 0 bytes)
1.0000000000000084

julia> @btime sum(big.($x))
874.203 μs (20006 allocations: 1.14 MiB)
julia> @btime sum($(big.(x)))
437.495 μs (2 allocations: 112 bytes)
1.0

julia> @btime sum(Rational{BigInt}.(x))
22.702 ms (582591 allocations: 10.87 MiB)
julia> @btime sum($(Rational{BigInt}.(x)))
10.894 ms (259917 allocations: 4.76 MiB)
1//1
```

@@ -124,32 +140,37 @@ than their naive floating-point counterparts. As such, they usually perform
worse. However, leveraging the power of modern architectures via vectorization,
the slow-down can be kept to a small value.

![](test/figs/perf.svg)

In the graph above, the time spent in the summation (renormalized per element)
is plotted against the vector size (the units in the y-axis label should be
“ns/elem”). What we see with the standard summation is that, once vectors start
having significant sizes (say, more than 1000 elements), the implementation is
memory bound (as expected of a typical BLAS1 operation). Which is why we see
significant decreases in the performance when the vector can’t fit into the L2
cache (around 30k elements, or 256kB on my machine) or the L3 cache (around 400k
elements, or 3MB on my machine).

The Ogita-Rump-Oishi algorithm, when implemented with a suitable unrolling level
(ushift=2, i.e. 2²=4 unrolled iterations), is CPU-bound when vectors fit inside
the cache. However, when vectors are too large to fit into the L3 cache, the
implementation becomes memory-bound again (on my system), which means we get the
same performance as the standard summation.
![](test/figs/sum_performance.svg)
![](test/figs/dot_performance.svg)

Benchmarks presented in the above graphs were obtained on an Intel® Xeon® Gold
6128 CPU @ 3.40GHz. The time spent in the summation (renormalized per element)
is plotted against the vector size. What we see with the standard summation is
that, once vectors reach significant sizes (say, more than a few thousand
elements), the implementation is memory-bound (as expected of a typical BLAS1
operation), which is why performance drops significantly whenever the vector
can’t fit into the L1, L2 or L3 cache.

On this AVX512-enabled system, the Kahan-Babuska-Neumaier implementation tends
to be a little more efficient than the Ogita-Rump-Oishi algorithm (it would
generally be the opposite on AVX2 systems). When implemented with a suitable
unrolling level and cache prefetching, these implementations are CPU-bound when
vectors fit inside the L1 or L2 cache. However, when vectors are too large to
fit into the L2 cache, the implementation becomes memory-bound again (on this
system), which means we get the same performance as the standard
summation. The same goes for dot product calculations (graph on the right),
where the implementations from `AccurateArithmetic.jl` compete against MKL's
dot product.

In other words, the improved accuracy is free for sufficiently large
vectors. For smaller vectors, the accuracy comes with a slow-down that can reach
values slightly above 3 for vectors which fit in the L2 cache.
vectors. For smaller vectors, the accuracy comes with a slow-down by a factor of
approximately 3 when the data fits in the L2 cache.
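
For users who want to experiment with the unrolling and prefetching parameters,
something along the following lines should work. This is a hedged sketch: the
`accumulate` function and the `Summation`/`EFT` submodule paths are taken from
`src/Summation.jl`, but they are internal and not part of the documented API:

```julia
using AccurateArithmetic
using AccurateArithmetic.Summation: compSumAcc
using AccurateArithmetic.EFT: two_sum

x = rand(1_000_000)

# Same accumulator as sum_oro(x), but with an unrolling level of 3 (2^3 = 8
# unrolled iterations) and a prefetch parameter of 40 (the defaults are 2 and 35):
AccurateArithmetic.Summation.accumulate((x,), compSumAcc(two_sum),
                                        Val(:scalar), Val(3), Val(40))
```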


### Tests

The graphs above can be reproduced using the `test/perftests.jl` script in this
repository. Before running them, be aware that it takes around one hour to
repository. Before running them, be aware that it takes around two hours to
generate the performance graph, during which the benchmark machine should be as
lightly loaded as possible in order to avoid perturbing the performance measurements.

95 changes: 83 additions & 12 deletions src/SIMDops.jl
@@ -1,6 +1,50 @@
module SIMDops
import SIMDPirates
using SIMDPirates: Vec, vload, vsum, vabs, vfma, vifelse, vless
using SIMDPirates: Vec, vload, vsum, vabs, vfma, vifelse, vless, vzero


# * Generic operations on SIMD packs

fma(x...) = Base.fma(x...)
fma(x::T, y::T, z::T) where {T<:NTuple} = SIMDPirates.vfma(x, y, z)

fptype(::Type{Vec{W, T}}) where {W, T} = T


# * Macros rewriting mathematical operations

# ** Helper functions

replace_simd_ops(x, _, _) = x

function replace_simd_ops(sym::Symbol, ops, updates)
# Replace e.g: +
# => eadd
for (op, replacement) in ops
sym === op && return replacement
end
sym
end

function replace_simd_ops(expr::Expr, ops, updates)
# Replace e.g: lhs += rhs
# => lhs = eadd(lhs, rhs)
for (op, replacement) in updates
if expr.head === op
lhs = expr.args[1]
rhs = replace_simd_ops(expr.args[2], ops, updates)
return :($lhs = $replacement($lhs, $rhs))
end
end

newexpr = Expr(replace_simd_ops(expr.head, ops, updates))
for arg in expr.args
push!(newexpr.args, replace_simd_ops(arg, ops, updates))
end
newexpr
end

# ** Explicit/exact operations (non-fusible)

eadd(x...) = Base.:+(x...)
eadd(x::T, y::T) where {T<:NTuple} = SIMDPirates.evadd(x, y)
@@ -13,20 +57,47 @@ esub(x::T, y::T) where {T<:NTuple} = SIMDPirates.evsub(x, y)
emul(x...) = Base.:*(x...)
emul(x::T, y::T) where {T<:NTuple} = SIMDPirates.evmul(x, y)

fma(x...) = Base.fma(x...)
fma(x::T, y::T, z::T) where {T<:NTuple} = SIMDPirates.vfma(x, y, z)
function explicit(expr::Expr)
replace_simd_ops(expr,
(:+ => eadd,
:- => esub,
:* => emul,
:fma => fma),
(:+= => eadd,
:-= => esub,
:*= => emul))
end

macro explicit()
quote
$(esc(:+)) = eadd
$(esc(:-)) = esub
$(esc(:*)) = emul
end
macro explicit(expr)
explicit(expr) |> esc
end

vzero(x) = zero(x)
vzero(::Type{Vec{W, T}}) where {W, T} = SIMDPirates.vbroadcast(Vec{W, T}, 0)

fptype(::Type{Vec{W, T}}) where {W, T} = T
# * Fast/fusible operations

fadd(x...) = Base.:+(x...)
fadd(x::T, y::T) where {T<:NTuple} = SIMDPirates.vadd(x, y)

fsub(x...) = Base.:-(x...)
fsub(x::NTuple) = broadcast(esub, x)
fsub(x::T, y::T) where {T<:NTuple} = SIMDPirates.vsub(x, y)

fmul(x...) = Base.:*(x...)
fmul(x::T, y::T) where {T<:NTuple} = SIMDPirates.vmul(x, y)

function fusible(expr::Expr)
replace_simd_ops(expr,
(:+ => fadd,
:- => fsub,
:* => fmul,
:fma => fma),
(:+= => fadd,
:-= => fsub,
:*= => fmul))
end

macro fusible(expr)
fusible(expr) |> esc
end
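
# Illustrative usage (not part of this diff; variable names are hypothetical).
# Both macros rewrite the expression they receive, replacing arithmetic
# operators with the corresponding SIMDPirates-backed functions:
#
#     @explicit acc += x * y    # becomes acc = eadd(acc, emul(x, y))
#     @fusible  acc += x * y    # becomes acc = fadd(acc, fmul(x, y))
#
# The "explicit" variants lower to exact (non-fusible) vector instructions,
# whereas the "fusible" variants allow the compiler to fuse operations
# (e.g. into FMAs) when profitable.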

end
63 changes: 54 additions & 9 deletions src/Summation.jl
@@ -6,6 +6,7 @@ import VectorizationBase

import ..SIMDops
using ..SIMDops: Vec, vload, vsum, vzero, fptype
using SIMDPirates

using ..EFT: two_sum, fast_two_sum, two_prod

@@ -19,10 +20,11 @@ include("accumulators/compDot.jl")
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
@generated function accumulate(x::NTuple{A, AbstractArray{T}},
accType::F,
rem_handling = Val(:scalar),
::Val{Ushift} = Val(2)
) where {F, A, T <: Union{Float32,Float64}, Ushift}
accType::F,
rem_handling = Val(:scalar),
::Val{Ushift} = Val(2),
::Val{Prefetch} = Val(0),
) where {F, A, T <: Union{Float32,Float64}, Ushift, Prefetch}
@assert 0 ≤ Ushift < 6
U = 1 << Ushift

@@ -43,6 +45,10 @@ include("accumulators/compDot.jl")
Nshift = N >> $(shift + Ushift)
offset = 0
for n in 1:Nshift
if $Prefetch > 0
SIMDPirates.prefetch.(px.+offset.+$(Prefetch*WT), Val(3), Val(0))
end

Base.Cartesian.@nexprs $U u -> begin
xi = vload.($V, px.+offset)
add!(acc_u, xi...)
@@ -85,11 +91,50 @@ include("accumulators/compDot.jl")
end
end

sum_naive(x) = accumulate((x,), sumAcc, Val(:scalar), Val(3))
sum_kbn(x) = accumulate((x,), compSumAcc(fast_two_sum), Val(:scalar), Val(2))
sum_oro(x) = accumulate((x,), compSumAcc(two_sum), Val(:scalar), Val(2))
# Default values for unrolling
@inline default_ushift(::SumAcc) = Val(3)
@inline default_ushift(::CompSumAcc) = Val(2)
@inline default_ushift(::DotAcc) = Val(3)
@inline default_ushift(::CompDotAcc) = Val(2)
# dispatch
# either default_ushift(x, acc)
# or default_ushift((x,), acc)
@inline default_ushift(x::AbstractArray, acc) = default_ushift(acc(eltype(x)))
@inline default_ushift(x::NTuple, acc) = default_ushift(first(x), acc)


# Default values for cache prefetching
@inline default_prefetch(::SumAcc) = Val(0)
@inline default_prefetch(::CompSumAcc) = Val(35)
@inline default_prefetch(::DotAcc) = Val(0)
@inline default_prefetch(::CompDotAcc) = Val(20)
# dispatch
# either default_prefetch(x, acc)
# or default_prefetch((x,), acc)
@inline default_prefetch(x::AbstractArray, acc) = default_prefetch(acc(eltype(x)))
@inline default_prefetch(x::NTuple, acc) = default_prefetch(first(x), acc)


@inline _sum(x, acc) = if length(x) < 500
# no cache prefetching for small vectors
accumulate((x,), acc, Val(:scalar), default_ushift(x, acc), Val(0))
else
accumulate((x,), acc, Val(:scalar), default_ushift(x, acc), default_prefetch(x, acc))
end

sum_naive(x) = _sum(x, sumAcc)
sum_kbn(x) = _sum(x, compSumAcc(fast_two_sum))
sum_oro(x) = _sum(x, compSumAcc(two_sum))


@inline _dot(x, y, acc) = if length(x) < 500
# no cache prefetching for small vectors
accumulate((x,y), acc, Val(:scalar), default_ushift(x, acc), Val(0))
else
accumulate((x,y), acc, Val(:scalar), default_ushift(x, acc), default_prefetch(x, acc))
end

dot_naive(x, y) = accumulate((x,y), dotAcc, Val(:scalar), Val(3))
dot_oro(x, y) = accumulate((x,y), compDotAcc, Val(:scalar), Val(2))
dot_naive(x, y) = _dot(x, y, dotAcc)
dot_oro(x, y) = _dot(x, y, compDotAcc)
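
# Illustrative note (not part of this diff): the exported entry points now pick
# the unrolling and prefetching defaults automatically. For a Float64 vector
# with at least 500 elements, e.g.
#
#     x = rand(10_000)
#     sum_oro(x)
#
# dispatches to accumulate((x,), compSumAcc(two_sum), Val(:scalar), Val(2),
# Val(35)), while shorter vectors skip cache prefetching (Val(0)).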

end

2 comments on commit 1fa3f88

@ffevotte
Contributor Author

@JuliaRegistrator register

Release notes:

This release introduces a `@fusible` counterpart to `SIMDops.@explicit`, which allows using SIMD instructions from SIMDPirates.jl in all cases, whether exact instructions are required or fused operations are allowed. EFTs now use this when possible. The API for `@explicit` changed: it now applies only to the expression that follows it (instead of the whole scope inside which it is placed).
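
As an illustration of the API change (a hedged sketch; the module path and variable names below are assumptions, not taken from the package documentation):

```julia
using AccurateArithmetic.SIMDops: @explicit, @fusible

acc, x, y = 0.0, 1.0, 2.0

# The macro now rewrites only the expression it receives, roughly to
# acc = eadd(acc, emul(x, y)), i.e. exact (non-fusible) operations:
@explicit acc += x * y

# The new @fusible counterpart performs the same rewrite with fadd/fmul,
# allowing fused multiply-adds:
@fusible acc += x * y
```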

This release also adds support for cache prefetching, which significantly improves performance for large vectors. This was actually the explanation for the performance discrepancies observed for large vectors and mentioned in #7. Default values for the cache prefetching mechanism are probably not optimal for all architectures; users interested in the last 10% of performance outside the cache are invited to customize this parameter.

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/17162

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.3.4 -m "<description of version>" 1fa3f88dd50eb9f9e467fb6e71dff5ce865bf9e9
git push origin v0.3.4
