Try VectorizedRNG.jl #304

Closed
wants to merge 2 commits into from

Conversation

ChrisRackauckas
Member

I just seem to get segfaults though.

@ChrisRackauckas
Member Author

using StochasticDiffEq
using BenchmarkTools

α = 1
β = 1
u0 = [1/2]
f(du,u,p,t) = (du .= u)
g(du,u,p,t) = (du .= u)
dt = 1//2^(4)
tspan = (0.0,1.0)
prob = SDEProblem(f,g,u0,tspan)
sol = solve(prob,EM(),dt=dt)
@code_warntype solve(prob,EM(),dt=dt)
@btime sol = solve(prob,EM(),dt=dt)

This is the test case.

@ChrisRackauckas
Member Author

Note this is on top of #291

@chriselrod

chriselrod commented May 18, 2020

I just seem to get segfaults though.

Yikes. I'll look into it.

EDIT: Travis says:

signal (11): Segmentation fault
in expression starting at /home/travis/build/SciML/StochasticDiffEq.jl/test/first_rand_test.jl:23
macro expansion at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:131 [inlined]
vload at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:106 [inlined]
vloada at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:295 [inlined]
getstate at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/xoshiro.jl:98 [inlined]
random_sample! at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/api.jl:140
randn! at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/api.jl:180 [inlined]
wiener_randn! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/wiener.jl:9 [inlined]
INPLACE_WHITE_NOISE_DIST at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/wiener.jl:42
calculate_step! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/noise_interfaces/simple_noise_process_interface.jl:49 [inlined]
setup_next_step! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/noise_interfaces/simple_noise_process_interface.jl:44 [inlined]
setup_next_step! at /home/travis/build/SciML/StochasticDiffEq.jl/src/integrators/integrator_utils.jl:2 [inlined]
#__init#48 at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:488
unknown function (ip: 0x7f60045dbaef)
__init##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:64 [inlined]
__init##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:64 [inlined]
#__solve#47 at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6
unknown function (ip: 0x7f60045d7595)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2158 [inlined]

@chriselrod

chriselrod commented May 18, 2020

What is _seed?

My fault for not documenting it:

help?> local_rng()
  No documentation found.

But local_rng's argument is the thread ID minus 1. If you pass it an integer greater than or equal to the number of threads (or less than 0), it'll segfault.

VectorizedRNG will allocate memory for Threads.nthreads() states, and then local_rng returns a state that just wraps a pointer:

julia> local_rng(11212351235412351)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x9f56a135d27e6d00)

julia> local_rng(0)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)

julia> local_rng()
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)

This is mostly to let it control alignment. It also means that, while the states are close together in memory, cache line boundaries should line up with the different thread-local parts, so there shouldn't be any false sharing.
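For illustration, a minimal bounds-checked wrapper one could put around it (checked_local_rng is hypothetical, not part of VectorizedRNG's API):

using VectorizedRNG

# Hypothetical guard: the states live in one contiguous buffer sized for
# Threads.nthreads(), and local_rng(i) just offsets a pointer into it, so
# an out-of-range index hands back a pointer to memory the RNG doesn't own.
function checked_local_rng(i::Integer = Threads.threadid() - 1)
    0 <= i < Threads.nthreads() ||
        throw(ArgumentError("local_rng index $i out of range"))
    return local_rng(i)
end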

@chriselrod

chriselrod commented May 18, 2020

If I remove the _seed argument as a hack, the "Convergence Test on 2D Linear" tests fail locally:

Test Summary:                    | Pass  Fail  Total
Two-dimensional Linear SDE Tests |    1    37     38

Benchmarks:
Latest release:

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  23.88 KiB
  allocs estimate:  118
  --------------
  minimum time:     5.051 μs (0.00% GC)
  median time:      5.311 μs (0.00% GC)
  mean time:        6.844 μs (20.10% GC)
  maximum time:     503.748 μs (96.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

This branch:

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.22 KiB
  allocs estimate:  95
  --------------
  minimum time:     3.863 μs (0.00% GC)
  median time:      4.177 μs (0.00% GC)
  mean time:        4.991 μs (14.98% GC)
  maximum time:     709.685 μs (99.14% GC)
  --------------
  samples:          10000
  evals/sample:     8

small_overhead (291):

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.17 KiB
  allocs estimate:  95
  --------------
  minimum time:     2.902 μs (0.00% GC)
  median time:      3.150 μs (0.00% GC)
  mean time:        3.923 μs (18.48% GC)
  maximum time:     636.300 μs (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     9

So it looks like this regresses performance.

What are the lengths of the involved arrays?
Any good way to investigate this regression?

@ChrisRackauckas
Member Author

Awesome, that got it working.

What are the lengths of the involved arrays?

Here, just 1. It's of course problem dependent, from 1 to 100,000.

Any good way to investigate this regression?

Let me get that small-optimizations branch merged and it should be easier. All this PR adds on top of it is swapping out the RNG, so it's really this vs Xorshifts.Xoroshiro128Plus.

@ChrisRackauckas
Member Author

I assume that VectorizedRNG.jl only supports Arrays of Float64? Could it also do out-of-place generation with static-array-like things?

@chriselrod

Awesome, that got it working.
All this PR adds to it though is just changing out the RNG, so it's really this vs Xorshifts.Xoroshiro128Plus

I'm worried about the test failure that I got. And while they're not failures, these warnings did not appear when testing the small_overhead branch:

Solve and Plot
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329

VectorizedRNG does reasonably well against the SmallCrush RNG test suite, but the sin approximation might need to be a little more accurate.
I'm sure a numerical analyst like @YingboMa can do better than the Remez algorithm. Or I could, given the time to find out how people normally come up with SIMD special-function implementations.
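For context, SIMD-friendly sin kernels are usually branch-free polynomial evaluations. A minimal sketch of the shape (Taylor coefficients only, to illustrate; a real kernel would substitute minimax-fitted coefficients and do range reduction first):

# Degree-7 odd polynomial for sin on a reduced range like [-π/4, π/4],
# evaluated via Horner's rule with fused multiply-adds so it vectorizes.
function sin_poly(x)
    x2 = x * x
    p = muladd(x2, -1 / 5040, 1 / 120)  # x⁶ and x⁴ coefficients
    p = muladd(x2, p, -1 / 6)           # x² coefficient
    return x * muladd(x2, p, 1.0)       # x·(1 - x²/6 + x⁴/120 - x⁶/5040)
end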

Also, it's using Xoshiro256+. Maybe I should implement Xoroshiro128Plus.
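For reference, the scalar xoroshiro128+ update is tiny. A sketch using the published Blackman–Vigna parameters (24, 16, 37), which may differ from the constants RandomNumbers.jl ships:

# One step of xoroshiro128+: returns the output and the updated state pair.
rotl(x::UInt64, k::Int) = (x << k) | (x >> (64 - k))

function xoroshiro128plus_step(s0::UInt64, s1::UInt64)
    result = s0 + s1
    s1 ⊻= s0
    new_s0 = rotl(s0, 24) ⊻ s1 ⊻ (s1 << 16)
    new_s1 = rotl(s1, 37)
    return result, new_s0, new_s1
end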

Here, just 1. It's of course problem dependent, from 1 to 100,000.

Ah. 1 is a problem for performance.

julia> rng = Xorshifts.Xoroshiro128Plus()
RandomNumbers.Xorshifts.Xoroshiro128Plus(0xef4459b15a5cab2a, 0x33d487d5296eb0e2)

julia> x = [0.0];

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.429 ns (0.00% GC)
  median time:      9.852 ns (0.00% GC)
  mean time:        9.880 ns (0.00% GC)
  maximum time:     41.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.980 ns (0.00% GC)
  median time:      22.052 ns (0.00% GC)
  mean time:        22.084 ns (0.00% GC)
  maximum time:     48.035 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.235 ns (0.00% GC)
  median time:      11.645 ns (0.00% GC)
  mean time:        11.665 ns (0.00% GC)
  maximum time:     21.735 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> x = Vector{Float64}(undef, 15);

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     50.670 ns (0.00% GC)
  median time:      52.744 ns (0.00% GC)
  mean time:        52.787 ns (0.00% GC)
  maximum time:     88.798 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.716 ns (0.00% GC)
  median time:      29.022 ns (0.00% GC)
  mean time:        29.019 ns (0.00% GC)
  maximum time:     52.158 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     48.997 ns (0.00% GC)
  median time:      50.915 ns (0.00% GC)
  mean time:        50.944 ns (0.00% GC)
  maximum time:     67.167 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     989

That should be addressable though, e.g. by checking the array size to generate smaller batches:
julia> @btime randn(local_rng(), NTuple{1,NTuple{2,Core.VecElement{Float64}}})
  13.260 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.057132286810525984), VecElement{Float64}(0.7160102533566447)),)

For very short arrays, it can generate a single uniform at a time:

julia> @btime rand(local_rng(), NTuple{1,NTuple{1,Core.VecElement{Float64}}})
  2.269 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.30066402503431744),),)

But that isn't an option right now for normals because of the Box-Muller algorithm. As Box-Muller is a really slow way to generate 2 random normals anyway, we should be falling back to the Ziggurat algorithm for small (non-SIMD) batches of normals.
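For reference, a minimal scalar Box-Muller sketch (not VectorizedRNG's SIMD code) shows why the normals come in pairs:

# Box-Muller: two uniforms map to two independent standard normals.
function box_muller(rng)
    u1 = rand(rng)
    u2 = rand(rng)
    r = sqrt(-2 * log(1 - u1))  # 1 - u1 ∈ (0, 1], avoiding log(0)
    return r * cos(2π * u2), r * sin(2π * u2)
end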

Alternatively, keep a buffer of perhaps 256 random numbers stored, and simply load from that buffer, refilling it every time it reaches the end.
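A hypothetical sketch of that buffering idea (BufferedNormals and next_normal! are illustrative names, not an actual VectorizedRNG API):

using Random

# Pre-generate a block of normals in one batched pass, then hand them
# out one at a time, refilling whenever the buffer is exhausted.
mutable struct BufferedNormals{R}
    rng::R
    buf::Vector{Float64}
    pos::Int
end

BufferedNormals(rng, n::Integer = 256) =
    BufferedNormals(rng, Vector{Float64}(undef, n), n + 1)  # start exhausted

function next_normal!(b::BufferedNormals)
    if b.pos > length(b.buf)
        randn!(b.rng, b.buf)  # one fast vectorized fill
        b.pos = 1
    end
    x = b.buf[b.pos]
    b.pos += 1
    return x
end

Each call then costs a load and an increment, with one batched randn! every 256 draws.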

I assume that VectorizedRNG.jl only supports Arrays of Float64? Could it also do out of place with static array like things?

Currently, it only supports DenseArrays that let you get pointers.

It typically generates NTuple{N,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}} where N is 1, 2, or 4, and then uses the pointer to store them into the array (masking off any remainder, which should be efficient with AVX [via vmaskmov] and especially AVX-512 [via opmask registers]).

julia> randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
((VecElement{Float64}(-0.35070219008784215), VecElement{Float64}(-1.1500644689621624), VecElement{Float64}(-1.6140966900130058), VecElement{Float64}(-0.13218650868339507), VecElement{Float64}(-1.145717731296182), VecElement{Float64}(-1.4488616660755809), VecElement{Float64}(-1.219831252656312), VecElement{Float64}(0.26954048537845754)), (VecElement{Float64}(-0.8649042896705302), VecElement{Float64}(1.1491223266582307), VecElement{Float64}(-0.8821573749577012), VecElement{Float64}(0.19394881631501515), VecElement{Float64}(1.957643704171434), VecElement{Float64}(-2.927466363335935), VecElement{Float64}(-0.14172465407366736), VecElement{Float64}(-1.5318948796751526)), (VecElement{Float64}(-0.34632071461043773), VecElement{Float64}(0.17695680200558128), VecElement{Float64}(0.34350477272227437), VecElement{Float64}(1.2389961224532435), VecElement{Float64}(1.1252296319202757), VecElement{Float64}(1.8952112260924336), VecElement{Float64}(0.7779111642563302), VecElement{Float64}(-1.0682395626203594)), (VecElement{Float64}(-0.40499368481842707), VecElement{Float64}(-0.5697825884055833), VecElement{Float64}(0.9163026476719056), VecElement{Float64}(-1.6460932788182563), VecElement{Float64}(-1.071672933005311), VecElement{Float64}(0.6215884503640314), VecElement{Float64}(-0.09387081715702705), VecElement{Float64}(0.07085548151904485)))

julia> @benchmark randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.995 ns (0.00% GC)
  median time:      36.122 ns (0.00% GC)
  mean time:        36.191 ns (0.00% GC)
  maximum time:     76.804 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark randn(local_rng(), NTuple{2,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.060 ns (0.00% GC)
  median time:      21.076 ns (0.00% GC)
  mean time:        21.101 ns (0.00% GC)
  maximum time:     30.223 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark randn(local_rng(), NTuple{1,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.668 ns (0.00% GC)
  median time:      16.484 ns (0.00% GC)
  mean time:        16.315 ns (0.00% GC)
  maximum time:     44.946 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

So for out-of-place support, you just need another way to get these into your preferred data structure.
For something like a StaticArray, that would probably mean getindex and .value as necessary to turn these into regular NTuples. At small enough sizes, the compiler should eliminate all of that and more or less generate the code we wanted anyway.
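A hypothetical sketch of that conversion (randn_svec is an illustrative name; it leans on the NTuple-returning randn shown above):

using StaticArrays, VectorizedRNG

# Unwrap the Core.VecElement values into a plain tuple and build an
# SVector; at these sizes the compiler should elide the shuffling.
function randn_svec()
    W = VectorizedRNG.W64
    batch = randn(local_rng(), NTuple{1,NTuple{W,Core.VecElement{Float64}}})
    return SVector(ntuple(i -> batch[1][i].value, W))
end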

@chriselrod

chriselrod commented Nov 17, 2023

VectorizedRNG is only really good at batched sampling.
Even though Base's default RNG now also offers a vectorized Xoshiro, VectorizedRNG.jl is still a decent bit faster:

julia> using VectorizedRNG, BenchmarkTools, Random

julia> x = Vector{Float64}(undef, 1024);

julia> @btime rand!($(Random.default_rng()), $x);
  274.723 ns (0 allocations: 0 bytes)

julia> @btime rand!($(VectorizedRNG.local_rng()), $x);
  178.311 ns (0 allocations: 0 bytes)

julia> @btime randn!($(Random.default_rng()), $x);
  1.660 μs (0 allocations: 0 bytes)

julia> @btime randn!($(VectorizedRNG.local_rng()), $x);
  1.213 μs (0 allocations: 0 bytes)

Here, just 1. It's of course problem dependent, from 1 to 100,000.

So if it makes sense to do larger batches, we could consider this again.

@ChrisRackauckas
Member Author

Yeah, this is just so stale, and it's not generally the right thing to do, so there's no reason to keep the branch around. A special noise process could be built around this if big SPDEs run into RNG sampling as the time-limiting part.
