Try VectorizedRNG.jl #304

Closed
wants to merge 2 commits into from

Conversation

ChrisRackauckas
Member

I just seem to get segfaults though.

@ChrisRackauckas
Member Author

using StochasticDiffEq
using BenchmarkTools

α = 1
β = 1
u0 = [1/2]
f(du,u,p,t) = (du .= u)
g(du,u,p,t) = (du .= u)
dt = 1//2^(4)
tspan = (0.0,1.0)
prob = SDEProblem(f,g,u0,tspan)
sol = solve(prob,EM(),dt=dt)
@code_warntype solve(prob,EM(),dt=dt)
@btime sol = solve(prob,EM(),dt=dt)

This is the test case.

@ChrisRackauckas
Member Author

Note this is on top of #291

@chriselrod

chriselrod commented May 18, 2020

I just seem to get segfaults though.

Yikes. I'll look into it.

EDIT: Travis says:

signal (11): Segmentation fault
in expression starting at /home/travis/build/SciML/StochasticDiffEq.jl/test/first_rand_test.jl:23
macro expansion at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:131 [inlined]
vload at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:106 [inlined]
vloada at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:295 [inlined]
getstate at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/xoshiro.jl:98 [inlined]
random_sample! at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/api.jl:140
randn! at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/api.jl:180 [inlined]
wiener_randn! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/wiener.jl:9 [inlined]
INPLACE_WHITE_NOISE_DIST at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/wiener.jl:42
calculate_step! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/noise_interfaces/simple_noise_process_interface.jl:49 [inlined]
setup_next_step! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/noise_interfaces/simple_noise_process_interface.jl:44 [inlined]
setup_next_step! at /home/travis/build/SciML/StochasticDiffEq.jl/src/integrators/integrator_utils.jl:2 [inlined]
#__init#48 at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:488
unknown function (ip: 0x7f60045dbaef)
__init##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:64 [inlined]
__init##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:64 [inlined]
#__solve#47 at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6
unknown function (ip: 0x7f60045d7595)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2158 [inlined]

@chriselrod

chriselrod commented May 18, 2020

What is _seed?

My fault for not documenting it:

help?> local_rng()
  No documentation found.

But local_rng's argument is the thread ID minus 1. If you pass it an integer greater than or equal to the number of threads (or less than 0), it'll segfault.

VectorizedRNG will allocate memory for Threads.nthreads() states, and then local_rng returns a state that just wraps a pointer:

julia> local_rng(11212351235412351)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x9f56a135d27e6d00)

julia> local_rng(0)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)

julia> local_rng()
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)

This is mostly to let it control alignment. It also means that, while the states are close together in memory, cache line boundaries should line up with the different thread-local parts, so there shouldn't be any false sharing.
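For illustration, a minimal bounds-checked wrapper one could put around it (checked_local_rng is hypothetical, not part of VectorizedRNG's API):

using VectorizedRNG

# Hypothetical guard: the states live in one contiguous buffer sized for
# Threads.nthreads(), and local_rng(i) just offsets a pointer into it, so
# an out-of-range index hands back a pointer to memory the RNG doesn't own.
function checked_local_rng(i::Integer = Threads.threadid() - 1)
    0 <= i < Threads.nthreads() ||
        throw(ArgumentError("local_rng index $i out of range"))
    return local_rng(i)
end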

@chriselrod

chriselrod commented May 18, 2020

If I remove the _seed argument as a hack, the "Convergence Test on 2D Linear" tests fail locally:

Test Summary:                    | Pass  Fail  Total
Two-dimensional Linear SDE Tests |    1    37     38

Benchmarks:
Latest release:

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  23.88 KiB
  allocs estimate:  118
  --------------
  minimum time:     5.051 μs (0.00% GC)
  median time:      5.311 μs (0.00% GC)
  mean time:        6.844 μs (20.10% GC)
  maximum time:     503.748 μs (96.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

This branch:

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.22 KiB
  allocs estimate:  95
  --------------
  minimum time:     3.863 μs (0.00% GC)
  median time:      4.177 μs (0.00% GC)
  mean time:        4.991 μs (14.98% GC)
  maximum time:     709.685 μs (99.14% GC)
  --------------
  samples:          10000
  evals/sample:     8

small_overhead (291):

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.17 KiB
  allocs estimate:  95
  --------------
  minimum time:     2.902 μs (0.00% GC)
  median time:      3.150 μs (0.00% GC)
  mean time:        3.923 μs (18.48% GC)
  maximum time:     636.300 μs (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     9

So it looks like this regresses performance.

What are the lengths of the involved arrays?
Any good way to investigate this regression?

@ChrisRackauckas
Member Author

Awesome, that got it working.

What are the lengths of the involved arrays?

Here, just 1. It's of course problem dependent, from 1 to 100,000.

Any good way to investigate this regression?

Let me get that small-optimizations branch merged and it should be easier. All this PR adds on top of it is swapping out the RNG, so it's really this vs Xorshifts.Xoroshiro128Plus.

@ChrisRackauckas
Member Author

I assume that VectorizedRNG.jl only supports Arrays of Float64? Could it also do out-of-place generation with static-array-like things?

@chriselrod

Awesome, that got it working.
All this PR adds to it though is just changing out the RNG, so it's really this vs Xorshifts.Xoroshiro128Plus

I'm worried about the test failure that I got. And while they're not failures, these warnings did not appear when testing the small_overhead branch:

Solve and Plot
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329

VectorizedRNG does reasonably well against the SmallCrush RNG test suite, but the sin approximation might need to be a little more accurate.
I'm sure a numerical analyst like @YingboMa can do better than the Remez algorithm. Or I could, given the time to find out how people normally come up with SIMD special-function implementations.
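For context, SIMD-friendly sin kernels are usually branch-free polynomial evaluations. A minimal sketch of the shape (Taylor coefficients only, to illustrate; a real kernel would substitute minimax-fitted coefficients and do range reduction first):

# Degree-7 odd polynomial for sin on a reduced range like [-π/4, π/4],
# evaluated via Horner's rule with fused multiply-adds so it vectorizes.
function sin_poly(x)
    x2 = x * x
    p = muladd(x2, -1 / 5040, 1 / 120)  # x⁶ and x⁴ coefficients
    p = muladd(x2, p, -1 / 6)           # x² coefficient
    return x * muladd(x2, p, 1.0)       # x·(1 - x²/6 + x⁴/120 - x⁶/5040)
end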

Also, it's using Xoshiro256+. Maybe I should implement Xoroshiro128Plus.
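For reference, the scalar xoroshiro128+ update is tiny. A sketch using the published Blackman–Vigna parameters (24, 16, 37), which may differ from the constants RandomNumbers.jl ships:

# One step of xoroshiro128+: returns the output and the updated state pair.
rotl(x::UInt64, k::Int) = (x << k) | (x >> (64 - k))

function xoroshiro128plus_step(s0::UInt64, s1::UInt64)
    result = s0 + s1
    s1 ⊻= s0
    new_s0 = rotl(s0, 24) ⊻ s1 ⊻ (s1 << 16)
    new_s1 = rotl(s1, 37)
    return result, new_s0, new_s1
end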

Here, just 1. It's of course problem dependent, from 1 to 100,000.

Ah. 1 is a problem for performance.

julia> rng = Xorshifts.Xoroshiro128Plus()
RandomNumbers.Xorshifts.Xoroshiro128Plus(0xef4459b15a5cab2a, 0x33d487d5296eb0e2)

julia> x = [0.0];

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.429 ns (0.00% GC)
  median time:      9.852 ns (0.00% GC)
  mean time:        9.880 ns (0.00% GC)
  maximum time:     41.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.980 ns (0.00% GC)
  median time:      22.052 ns (0.00% GC)
  mean time:        22.084 ns (0.00% GC)
  maximum time:     48.035 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.235 ns (0.00% GC)
  median time:      11.645 ns (0.00% GC)
  mean time:        11.665 ns (0.00% GC)
  maximum time:     21.735 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> x = Vector{Float64}(undef, 15);

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     50.670 ns (0.00% GC)
  median time:      52.744 ns (0.00% GC)
  mean time:        52.787 ns (0.00% GC)
  maximum time:     88.798 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.716 ns (0.00% GC)
  median time:      29.022 ns (0.00% GC)
  mean time:        29.019 ns (0.00% GC)
  maximum time:     52.158 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     48.997 ns (0.00% GC)
  median time:      50.915 ns (0.00% GC)
  mean time:        50.944 ns (0.00% GC)
  maximum time:     67.167 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     989

That should be addressable though, e.g. by checking the array size to generate smaller batches:
julia> @btime randn(local_rng(), NTuple{1,NTuple{2,Core.VecElement{Float64}}})
  13.260 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.057132286810525984), VecElement{Float64}(0.7160102533566447)),)

For very short arrays, it can generate a single uniform at a time:

julia> @btime rand(local_rng(), NTuple{1,NTuple{1,Core.VecElement{Float64}}})
  2.269 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.30066402503431744),),)

But that isn't an option right now for normals because of the Box-Muller algorithm. As Box-Muller is a really slow way to generate 2 random normals anyway, we should be falling back to the Ziggurat algorithm for small (non-SIMD) batches of normals.
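For reference, a minimal scalar Box-Muller sketch (not VectorizedRNG's SIMD code) shows why the normals come in pairs:

# Box-Muller: two uniforms map to two independent standard normals.
function box_muller(rng)
    u1 = rand(rng)
    u2 = rand(rng)
    r = sqrt(-2 * log(1 - u1))  # 1 - u1 ∈ (0, 1], avoiding log(0)
    return r * cos(2π * u2), r * sin(2π * u2)
end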

Alternatively, keep a buffer of perhaps 256 random numbers stored, and simply load from that buffer, refilling it every time it reaches the end.
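A hypothetical sketch of that buffering idea (BufferedNormals and next_normal! are illustrative names, not an actual VectorizedRNG API):

using Random

# Pre-generate a block of normals in one batched pass, then hand them
# out one at a time, refilling whenever the buffer is exhausted.
mutable struct BufferedNormals{R}
    rng::R
    buf::Vector{Float64}
    pos::Int
end

BufferedNormals(rng, n::Integer = 256) =
    BufferedNormals(rng, Vector{Float64}(undef, n), n + 1)  # start exhausted

function next_normal!(b::BufferedNormals)
    if b.pos > length(b.buf)
        randn!(b.rng, b.buf)  # one fast vectorized fill
        b.pos = 1
    end
    x = b.buf[b.pos]
    b.pos += 1
    return x
end

Each call then costs a load and an increment, with one batched randn! every 256 draws.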

I assume that VectorizedRNG.jl only supports Arrays of Float64? Could it also do out of place with static array like things?

Currently, it only supports DenseArrays that let you get pointers.

It typically generates NTuple{N,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}} where N is 1, 2, or 4, and then uses the pointer to store them into the array (masking off any remainder, which should be efficient with AVX [via vmaskmov] and especially AVX-512 [via opmask registers]).

julia> randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
((VecElement{Float64}(-0.35070219008784215), VecElement{Float64}(-1.1500644689621624), VecElement{Float64}(-1.6140966900130058), VecElement{Float64}(-0.13218650868339507), VecElement{Float64}(-1.145717731296182), VecElement{Float64}(-1.4488616660755809), VecElement{Float64}(-1.219831252656312), VecElement{Float64}(0.26954048537845754)), (VecElement{Float64}(-0.8649042896705302), VecElement{Float64}(1.1491223266582307), VecElement{Float64}(-0.8821573749577012), VecElement{Float64}(0.19394881631501515), VecElement{Float64}(1.957643704171434), VecElement{Float64}(-2.927466363335935), VecElement{Float64}(-0.14172465407366736), VecElement{Float64}(-1.5318948796751526)), (VecElement{Float64}(-0.34632071461043773), VecElement{Float64}(0.17695680200558128), VecElement{Float64}(0.34350477272227437), VecElement{Float64}(1.2389961224532435), VecElement{Float64}(1.1252296319202757), VecElement{Float64}(1.8952112260924336), VecElement{Float64}(0.7779111642563302), VecElement{Float64}(-1.0682395626203594)), (VecElement{Float64}(-0.40499368481842707), VecElement{Float64}(-0.5697825884055833), VecElement{Float64}(0.9163026476719056), VecElement{Float64}(-1.6460932788182563), VecElement{Float64}(-1.071672933005311), VecElement{Float64}(0.6215884503640314), VecElement{Float64}(-0.09387081715702705), VecElement{Float64}(0.07085548151904485)))

julia> @benchmark randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.995 ns (0.00% GC)
  median time:      36.122 ns (0.00% GC)
  mean time:        36.191 ns (0.00% GC)
  maximum time:     76.804 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark randn(local_rng(), NTuple{2,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.060 ns (0.00% GC)
  median time:      21.076 ns (0.00% GC)
  mean time:        21.101 ns (0.00% GC)
  maximum time:     30.223 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark randn(local_rng(), NTuple{1,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.668 ns (0.00% GC)
  median time:      16.484 ns (0.00% GC)
  mean time:        16.315 ns (0.00% GC)
  maximum time:     44.946 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

So for out-of-place support, you just need another way to get these into your preferred data structure.
For something like a StaticArray, that would probably mean getindex and .value as necessary to turn these into regular NTuples. At small enough sizes, the compiler should eliminate all of that and more or less generate the code we wanted anyway.
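A hypothetical sketch of that conversion (randn_svec is an illustrative name; it leans on the NTuple-returning randn shown above):

using StaticArrays, VectorizedRNG

# Unwrap the Core.VecElement values into a plain tuple and build an
# SVector; at these sizes the compiler should elide the shuffling.
function randn_svec()
    W = VectorizedRNG.W64
    batch = randn(local_rng(), NTuple{1,NTuple{W,Core.VecElement{Float64}}})
    return SVector(ntuple(i -> batch[1][i].value, W))
end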

@chriselrod

chriselrod commented Nov 17, 2023

VectorizedRNG is only really good at batched sampling.
Even though Base's default RNG now also offers a vectorized Xoshiro, VectorizedRNG.jl is still a decent bit faster:

julia> using VectorizedRNG, BenchmarkTools, Random

julia> x = Vector{Float64}(undef, 1024);

julia> @btime rand!($(Random.default_rng()), $x);
  274.723 ns (0 allocations: 0 bytes)

julia> @btime rand!($(VectorizedRNG.local_rng()), $x);
  178.311 ns (0 allocations: 0 bytes)

julia> @btime randn!($(Random.default_rng()), $x);
  1.660 μs (0 allocations: 0 bytes)

julia> @btime randn!($(VectorizedRNG.local_rng()), $x);
  1.213 μs (0 allocations: 0 bytes)

Here, just 1. It's of course problem dependent, from 1 to 100,000.

So if it makes sense to do larger batches, we could consider this again.

@ChrisRackauckas
Member Author

Yeah, this is just so stale, and it's not generally the right thing to do, so there's no reason to keep the branch around. A special noise process could be built around this if big SPDEs run into RNG sampling as the time-limiting part.
