
RKMilCommute constant cache perform step #383

Merged
14 commits merged on Jan 19, 2021

Conversation

frankschae
Copy link
Member

Solves #372 .

Also fixes two small bugs for RKMilGeneral.
Additionally, the iip_weak test file previously used a pre-defined SDE problem from DiffEqProblemLibrary, which is, however, defined as out-of-place (oop). This is now corrected as well.

RKMil constant cache perform step and tests
@ChrisRackauckas
Copy link
Member

Looks like the total test time is just over 2 hours.

@ChrisRackauckas
Copy link
Member

WeakConvergence just needs CPU, right? So we can put it in a different queue. All of the tests are already multithreaded?

@frankschae
Copy link
Member Author

All of the tests are already multithreaded?

Yes, I think all tests are already multithreaded, since they should all use EnsembleThreads() from
https://github.com/SciML/DiffEqDevTools.jl/blob/master/src/convergence.jl
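
For reference, a minimal sketch of that pattern on a toy scalar SDE (not the DiffEqDevTools internals):

using StochasticDiffEq

f(u, p, t) = 1.01u
g(u, p, t) = 0.87u
prob = SDEProblem(f, g, 0.5, (0.0, 1.0))

# EnsembleThreads() dispatches each trajectory to a separate Julia thread.
sol = solve(EnsembleProblem(prob), SRIW1(), EnsembleThreads(), trajectories = 100)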

WeakConvergence just needs CPU, right? So we can put it in a different queue.

Yes.

@ChrisRackauckas
Copy link
Member

https://buildkite.com/julialang/stochasticdiffeq-dot-jl/builds/40#52ddc367-5e7c-4deb-a320-193cf981b835/359-869

Looks like a random failure. FWIW, it's great that we're finding out how to make these stable and not rely on just a seed. On another note, damn, these take a lot of compute. I think we will want the semi-SIMD form @chriselrod mentioned before.

I'll look at what's going on with that GPU test. I think that's a regression.

@frankschae
Copy link
Member Author

frankschae commented Jan 15, 2021

Yeah, indeed, the tests get really expensive...

I'll look at what's going on with that GPU test. I think that's a regression.

Yes... I also wanted to ask you about that. I suspect it's something almost trivial, since the tests were passing before we added the weak convergence tests to the Buildkite queue.

@ChrisRackauckas
Copy link
Member

@vchuravy @jpsamaroo could you take a look at https://buildkite.com/julialang/stochasticdiffeq-dot-jl/builds/40#de077c9c-e46e-4a31-b7ac-1fdd10b45bf4/359-733 and see if you know what that is? I am confused since I can't recreate it. I locally grabbed the test and it ran fine:

using StochasticDiffEq, Test, Random
using DiffEqGPU
using CUDA

# Weak error: mean-squared distance between the Monte-Carlo expectation
# and the analytic expectation f_true over the save points in trange.
function weak_error(prob,alg,numtraj, batchsize, f_true,trange;abstol=1,reltol=0,ensemblealg=EnsembleCPUArray())
  sol = @time solve(prob,alg,ensemblealg,
    dt=0.01f0,adaptive=true,abstol=abstol,reltol=reltol,
    trajectories=numtraj,batch_size=batchsize,
    saveat = trange
    )
  computed_exp = (sol.u/numtraj)[1,:]
  true_exp = f_true.(trange)
  return sum((computed_exp-true_exp).^2)/length(trange)
end

# Re-seed each trajectory from the global `seeds` vector defined below.
function prob_func(prob, i, repeat)
    #(i%50000 == 0) && @show i
    remake(prob,seed=seeds[i])
end

# Accumulate batch sums so only the running sum is kept across batches.
function reduction(u,batch,I)
  u.+sum(batch),false
end

# Map each solution through the observable before reduction.
function output_func(sol,i)
  #h1(asinh(sol[end][1])),false
  h1.(asinh.(sol)),false
end

# prob 1
u₀ = [0.0f0]
tspan = (0.0f0,2.0f0)
h1(z) = z^3-6*z^2+8*z

function f1!(du,u,p,t)
 @inbounds begin
     du[1] = 1//2*u[1]+sqrt(u[1]^2 +1)
 end
 nothing
end

function g1!(du,u,p,t)
 @inbounds begin
     du[1] = sqrt(u[1]^2 +1)
 end
 nothing
end

f_true1(t) = t^3-3*t^2+2*t

prob1 = SDEProblem(f1!,g1!,u₀,tspan)
ensemble_prob1 = EnsembleProblem(prob1;
        output_func = output_func,
        prob_func = prob_func,
        reduction = reduction,
        u_init=Vector{eltype(prob1.u0)}([0.0])
        )


# prob 2
u₀ = [0.1f0,0.1f0]
function f2!(du,u,p,t)
  @inbounds begin
    du[1] = 3//2*u[1]
    du[2] = 3//2*u[2]
  end
  nothing
end
function g2!(du,u,p,t)
  @inbounds begin
    du[1] = 1//10*u[1]
    du[2] = 1//10*u[2]
  end
  nothing
end

f_true2(t) = 1//10*exp(3//2*t) #1//100*exp(301//100*t)

h2(z) = z #z^2

prob2 = SDEProblem(f2!,g2!,u₀,tspan)
ensemble_prob2 = EnsembleProblem(prob2;
        output_func = (sol,i) -> (h2.(sol),false),
        prob_func = prob_func,
        reduction = reduction,
        u_init=Vector{eltype(prob2.u0)}([0.0, 0.0])
        )

tsave = 0.0f0:0.05f0:2f0

probs = Vector{EnsembleProblem}(undef, 2)
probs[1] = ensemble_prob1
probs[2] = ensemble_prob2

ftrue = Vector{Function}(undef, 2)
ftrue[1] = f_true1
ftrue[2] = f_true2

numtraj = Int(1e5)
seed = 100
Random.seed!(seed)
seeds = rand(UInt, numtraj)

# Tightening the tolerances (err1 -> err3) should reduce the weak error.
for i in 1:2
  @show i

  err1 = weak_error(probs[i],DRI1NM(),numtraj,Int(1e4),ftrue[i],tsave,abstol=1f0,reltol=1f0, ensemblealg=EnsembleGPUArray())
  @show err1
  # err2 = weak_error(probs[i],DRI1NM(),numtraj,Int(1e4),ftrue[i],tsave,abstol=0.1f0,reltol=0.1f0, ensemblealg=EnsembleGPUArray())
  # @show err2
  err3 = weak_error(probs[i],DRI1NM(),numtraj,Int(1e4),ftrue[i],tsave,abstol=0.01f0,reltol=0.01f0, ensemblealg=EnsembleGPUArray())
  @show err3
  @test err1 > err3
  println("")
end

@chriselrod
Copy link

FWIW, it's great that we're finding out how to make these stable and not rely on just a seed. On another note, damn, these take a lot of compute. I think we will want the semi-SIMD form @chriselrod mentioned before.

What do you want to SIMD?

Also, I could revive the PR on using VectorizedRNG. I've added a scalar SIMD RNG:

julia> x = Vector{Float64}(undef, 1024);

julia> using VectorizedRNG, Random

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.298 μs (0.00% GC)
  median time:      1.344 μs (0.00% GC)
  mean time:        1.347 μs (0.00% GC)
  maximum time:     2.637 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.062 μs (0.00% GC)
  median time:      2.176 μs (0.00% GC)
  mean time:        2.182 μs (0.00% GC)
  maximum time:     3.321 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark randn(local_rng())
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.695 ns (0.00% GC)
  median time:      3.050 ns (0.00% GC)
  mean time:        3.029 ns (0.00% GC)
  maximum time:     5.762 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.353 ns (0.00% GC)
  median time:      6.712 ns (0.00% GC)
  mean time:        6.736 ns (0.00% GC)
  maximum time:     23.688 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

But this computer has AVX512. After the recent performance improvements in base randn!, it might not actually be faster anymore without it. If randn(local_rng()) isn't faster without AVX512, switching to the ziggurat algorithm that base randn() uses would probably make it faster, because rand holds a much larger advantage:

julia> @benchmark rand!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     168.105 ns (0.00% GC)
  median time:      168.276 ns (0.00% GC)
  mean time:        168.450 ns (0.00% GC)
  maximum time:     192.335 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     750

julia> @benchmark rand!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     594.022 ns (0.00% GC)
  median time:      597.654 ns (0.00% GC)
  mean time:        598.793 ns (0.00% GC)
  maximum time:     701.682 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     179

julia> @benchmark rand(local_rng())
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.855 ns (0.00% GC)
  median time:      1.927 ns (0.00% GC)
  mean time:        1.931 ns (0.00% GC)
  maximum time:     5.347 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rand()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.581 ns (0.00% GC)
  median time:      5.086 ns (0.00% GC)
  mean time:        5.099 ns (0.00% GC)
  maximum time:     17.237 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

and is thus probably faster on everything with AVX2.

But I wouldn't be surprised if RandomNumbers.jl wins if all you need is one scalar at a time.
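For illustration, here is one way normals can be built on top of a fast uniform draw (Box–Muller for brevity; base randn() actually uses the ziggurat, and none of this is VectorizedRNG's implementation):

using VectorizedRNG, Random

# Box–Muller on top of the SIMD uniform generator; illustrative only.
function randn_from_rand!(out::Vector{Float64})
    n = length(out)
    @assert iseven(n)
    u = Vector{Float64}(undef, n)
    rand!(local_rng(), u)            # one fast vectorized uniform draw
    @inbounds for i in 1:2:n
        r = sqrt(-2 * log1p(-u[i]))  # log1p(-u) keeps the log argument away from 0
        θ = 2π * u[i+1]
        out[i]   = r * cos(θ)
        out[i+1] = r * sin(θ)
    end
    return out
end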

@ChrisRackauckas
Copy link
Member

It's two pieces. One is the RNG. The other is that this is solving a ton of trajectories multithreaded, so instead of doing every one separately, we should do the semi-clumping we discussed before.
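
A hypothetical sketch of that clumping (names here are illustrative, not anything in this PR): run trajectories in chunks so each task amortizes its setup and can batch its RNG draws.

using Base.Threads, Random

# One task per chunk of trajectories instead of one per trajectory.
function solve_clumped(solve_one, ntraj; chunksize = 64)
    results = Vector{Any}(undef, ntraj)
    @threads for start in 1:chunksize:ntraj
        rng = MersenneTwister(start)  # per-chunk RNG, seeded deterministically
        for i in start:min(start + chunksize - 1, ntraj)
            results[i] = solve_one(i, rng)
        end
    end
    return results
end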

@ChrisRackauckas
Copy link
Member

We almost never need one scalar at a time, so if you wanted to revive that effort for VectorizedRNG, I would be super happy.
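
To make that concrete: the typical consumer here is a whole vector of Wiener increments per step, so a single batched draw amortizes the RNG cost (a sketch, assuming VectorizedRNG's randn! as benchmarked above):

using VectorizedRNG, Random

dt = 0.01
dW = Vector{Float64}(undef, 128)
randn!(local_rng(), dW)   # one batched draw fills every increment
dW .*= sqrt(dt)           # scale to Wiener increments for this dt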

@ChrisRackauckas
Copy link
Member

What's blocking this?

@frankschae
Copy link
Member Author

Mainly the CUDA issue in buildkite/stochasticdiffeq-dot-jl/weakadaptive.

@vchuravy
Copy link

I will take a look at the failure. It seems a change to Adapt.jl has broken @Const.
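
(For context, @Const marks read-only kernel arguments in KernelAbstractions.jl; a minimal kernel using it, unrelated to the failing test itself:)

using KernelAbstractions

# @Const declares x read-only inside the kernel, allowing extra optimizations.
@kernel function scale!(y, @Const(x), α)
    i = @index(Global)
    @inbounds y[i] = α * x[i]
end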

@ChrisRackauckas
Copy link
Member

I'm going to merge this to separate the concerns. But let's definitely continue this thread. I'll open two issues on the two points.
