
RKMilCommute constant cache perform step #383

Merged
14 commits merged on Jan 19, 2021

Conversation

frankschae
Copy link
Member

Solves #372 .

Also fixes two small bugs for RKMilGeneral.
Additionally, the iip_weak test file previously used a pre-defined SDE problem from DiffEqProblemLibrary, which is, however, defined as out-of-place (oop). This is now corrected as well.

RKMil constant cache perform step and tests
@ChrisRackauckas
Copy link
Member

Looks like the total test time is just over 2 hours.

@ChrisRackauckas
Copy link
Member

WeakConvergence just needs CPU, right? So we can put it in a different queue. All of the tests are already multithreaded?

@frankschae
Copy link
Member Author

All of the tests are already multithreaded?

Yes, I think all tests are already multithreaded, since they should all use EnsembleThreads() from
https://github.com/SciML/DiffEqDevTools.jl/blob/master/src/convergence.jl
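
For reference, a minimal sketch of that pattern on a toy scalar SDE (not the DiffEqDevTools internals):

using StochasticDiffEq

f(u, p, t) = 1.01u
g(u, p, t) = 0.87u
prob = SDEProblem(f, g, 0.5, (0.0, 1.0))

# EnsembleThreads() dispatches each trajectory to a separate Julia thread.
sol = solve(EnsembleProblem(prob), SRIW1(), EnsembleThreads(), trajectories = 100)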

WeakConvergence just needs CPU, right? So we can put it in a different queue.

Yes.

@ChrisRackauckas
Copy link
Member

https://buildkite.com/julialang/stochasticdiffeq-dot-jl/builds/40#52ddc367-5e7c-4deb-a320-193cf981b835/359-869

Looks like a random failure. FWIW, it's great that we're finding out how to make these stable and not rely on just a seed. On another note, damn, these take a lot of compute. I think we will want the semi-SIMD form @chriselrod mentioned before.

I'll look at what's going on with that GPU test. I think that's a regression.

@frankschae
Copy link
Member Author

frankschae commented Jan 15, 2021

Yeah, indeed, the tests get really expensive...

I'll look at what's going on with that GPU test. I think that's a regression.

Yes... I also wanted to ask you about that. I suspect it's something almost trivial, since the tests were passing before we added the weak convergence tests to the Buildkite queue.

@ChrisRackauckas
Copy link
Member

@vchuravy @jpsamaroo could you take a look at https://buildkite.com/julialang/stochasticdiffeq-dot-jl/builds/40#de077c9c-e46e-4a31-b7ac-1fdd10b45bf4/359-733 and see if you know what that is? I am confused since I can't recreate it. I locally grabbed the test and it ran fine:

using StochasticDiffEq, Test, Random
using DiffEqGPU
using CUDA

# Weak error: mean-squared distance between the Monte-Carlo expectation
# and the analytic expectation f_true over the save points in trange.
function weak_error(prob,alg,numtraj, batchsize, f_true,trange;abstol=1,reltol=0,ensemblealg=EnsembleCPUArray())
  sol = @time solve(prob,alg,ensemblealg,
    dt=0.01f0,adaptive=true,abstol=abstol,reltol=reltol,
    trajectories=numtraj,batch_size=batchsize,
    saveat = trange
    )
  computed_exp = (sol.u/numtraj)[1,:]
  true_exp = f_true.(trange)
  return sum((computed_exp-true_exp).^2)/length(trange)
end

# Re-seed each trajectory from the global `seeds` vector defined below.
function prob_func(prob, i, repeat)
    #(i%50000 == 0) && @show i
    remake(prob,seed=seeds[i])
end

# Accumulate batch sums so only the running sum is kept across batches.
function reduction(u,batch,I)
  u.+sum(batch),false
end

# Map each solution through the observable before reduction.
function output_func(sol,i)
  #h1(asinh(sol[end][1])),false
  h1.(asinh.(sol)),false
end

# prob 1
u₀ = [0.0f0]
tspan = (0.0f0,2.0f0)
h1(z) = z^3-6*z^2+8*z

function f1!(du,u,p,t)
 @inbounds begin
     du[1] = 1//2*u[1]+sqrt(u[1]^2 +1)
 end
 nothing
end

function g1!(du,u,p,t)
 @inbounds begin
     du[1] = sqrt(u[1]^2 +1)
 end
 nothing
end

f_true1(t) = t^3-3*t^2+2*t

prob1 = SDEProblem(f1!,g1!,u₀,tspan)
ensemble_prob1 = EnsembleProblem(prob1;
        output_func = output_func,
        prob_func = prob_func,
        reduction = reduction,
        u_init=Vector{eltype(prob1.u0)}([0.0])
        )


# prob 2
u₀ = [0.1f0,0.1f0]
function f2!(du,u,p,t)
  @inbounds begin
    du[1] = 3//2*u[1]
    du[2] = 3//2*u[2]
  end
  nothing
end
function g2!(du,u,p,t)
  @inbounds begin
    du[1] = 1//10*u[1]
    du[2] = 1//10*u[2]
  end
  nothing
end

f_true2(t) = 1//10*exp(3//2*t) #1//100*exp(301//100*t)

h2(z) = z #z^2

prob2 = SDEProblem(f2!,g2!,u₀,tspan)
ensemble_prob2 = EnsembleProblem(prob2;
        output_func = (sol,i) -> (h2.(sol),false),
        prob_func = prob_func,
        reduction = reduction,
        u_init=Vector{eltype(prob2.u0)}([0.0, 0.0])
        )

tsave = 0.0f0:0.05f0:2f0

probs = Vector{EnsembleProblem}(undef, 2)
probs[1] = ensemble_prob1
probs[2] = ensemble_prob2

ftrue = Vector{Function}(undef, 2)
ftrue[1] = f_true1
ftrue[2] = f_true2

numtraj = Int(1e5)
seed = 100
Random.seed!(seed)
seeds = rand(UInt, numtraj)

# Tightening the tolerances (err1 -> err3) should reduce the weak error.
for i in 1:2
  @show i

  err1 = weak_error(probs[i],DRI1NM(),numtraj,Int(1e4),ftrue[i],tsave,abstol=1f0,reltol=1f0, ensemblealg=EnsembleGPUArray())
  @show err1
  # err2 = weak_error(probs[i],DRI1NM(),numtraj,Int(1e4),ftrue[i],tsave,abstol=0.1f0,reltol=0.1f0, ensemblealg=EnsembleGPUArray())
  # @show err2
  err3 = weak_error(probs[i],DRI1NM(),numtraj,Int(1e4),ftrue[i],tsave,abstol=0.01f0,reltol=0.01f0, ensemblealg=EnsembleGPUArray())
  @show err3
  @test err1 > err3
  println("")
end

@chriselrod
Copy link

FWIW, it's great that we're finding out how to make these stable and not rely on just a seed. On another note, damn, these take a lot of compute. I think we will want the semi-SIMD form @chriselrod mentioned before.

What do you want to SIMD?

Also, I could revive the PR on using VectorizedRNG. I've added a scalar SIMD RNG:

julia> x = Vector{Float64}(undef, 1024);

julia> using VectorizedRNG, Random

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.298 μs (0.00% GC)
  median time:      1.344 μs (0.00% GC)
  mean time:        1.347 μs (0.00% GC)
  maximum time:     2.637 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.062 μs (0.00% GC)
  median time:      2.176 μs (0.00% GC)
  mean time:        2.182 μs (0.00% GC)
  maximum time:     3.321 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark randn(local_rng())
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.695 ns (0.00% GC)
  median time:      3.050 ns (0.00% GC)
  mean time:        3.029 ns (0.00% GC)
  maximum time:     5.762 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.353 ns (0.00% GC)
  median time:      6.712 ns (0.00% GC)
  mean time:        6.736 ns (0.00% GC)
  maximum time:     23.688 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

But this computer has AVX512. After the recent performance improvements in base randn!, it might not actually be faster anymore without it. If randn(local_rng()) isn't faster without AVX512, switching to the ziggurat algorithm that base randn() uses would probably make it faster, because rand holds a much larger advantage:

julia> @benchmark rand!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     168.105 ns (0.00% GC)
  median time:      168.276 ns (0.00% GC)
  mean time:        168.450 ns (0.00% GC)
  maximum time:     192.335 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     750

julia> @benchmark rand!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     594.022 ns (0.00% GC)
  median time:      597.654 ns (0.00% GC)
  mean time:        598.793 ns (0.00% GC)
  maximum time:     701.682 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     179

julia> @benchmark rand(local_rng())
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.855 ns (0.00% GC)
  median time:      1.927 ns (0.00% GC)
  mean time:        1.931 ns (0.00% GC)
  maximum time:     5.347 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rand()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.581 ns (0.00% GC)
  median time:      5.086 ns (0.00% GC)
  mean time:        5.099 ns (0.00% GC)
  maximum time:     17.237 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

and is thus probably faster on everything with AVX2.

But I wouldn't be surprised if RandomNumbers.jl wins if all you need is one scalar at a time.
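For illustration, here is one way normals can be built on top of a fast uniform draw (Box–Muller for brevity; base randn() actually uses the ziggurat, and none of this is VectorizedRNG's implementation):

using VectorizedRNG, Random

# Box–Muller on top of the SIMD uniform generator; illustrative only.
function randn_from_rand!(out::Vector{Float64})
    n = length(out)
    @assert iseven(n)
    u = Vector{Float64}(undef, n)
    rand!(local_rng(), u)            # one fast vectorized uniform draw
    @inbounds for i in 1:2:n
        r = sqrt(-2 * log1p(-u[i]))  # log1p(-u) keeps the log argument away from 0
        θ = 2π * u[i+1]
        out[i]   = r * cos(θ)
        out[i+1] = r * sin(θ)
    end
    return out
end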

@ChrisRackauckas
Copy link
Member

It's two pieces. One is the RNG. The other is that this is solving a ton of trajectories multithreaded, so instead of doing every one separately, we should do the semi-clumping we discussed before.
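
A hypothetical sketch of that clumping (names here are illustrative, not anything in this PR): run trajectories in chunks so each task amortizes its setup and can batch its RNG draws.

using Base.Threads, Random

# One task per chunk of trajectories instead of one per trajectory.
function solve_clumped(solve_one, ntraj; chunksize = 64)
    results = Vector{Any}(undef, ntraj)
    @threads for start in 1:chunksize:ntraj
        rng = MersenneTwister(start)  # per-chunk RNG, seeded deterministically
        for i in start:min(start + chunksize - 1, ntraj)
            results[i] = solve_one(i, rng)
        end
    end
    return results
end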

@ChrisRackauckas
Copy link
Member

We almost never need one scalar at a time, so if you wanted to revive that effort for VectorizedRNG, I would be super happy.
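
To make that concrete: the typical consumer here is a whole vector of Wiener increments per step, so a single batched draw amortizes the RNG cost (a sketch, assuming VectorizedRNG's randn! as benchmarked above):

using VectorizedRNG, Random

dt = 0.01
dW = Vector{Float64}(undef, 128)
randn!(local_rng(), dW)   # one batched draw fills every increment
dW .*= sqrt(dt)           # scale to Wiener increments for this dt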

@ChrisRackauckas
Copy link
Member

What's blocking this?

@frankschae
Copy link
Member Author

Mainly the CUDA issue in buildkite/stochasticdiffeq-dot-jl/weakadaptive.

@vchuravy
Copy link

I will take a look at the failure. It seems a change to Adapt.jl has broken @Const.
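
(For context, @Const marks read-only kernel arguments in KernelAbstractions.jl; a minimal kernel using it, unrelated to the failing test itself:)

using KernelAbstractions

# @Const declares x read-only inside the kernel, allowing extra optimizations.
@kernel function scale!(y, @Const(x), α)
    i = @index(Global)
    @inbounds y[i] = α * x[i]
end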

@ChrisRackauckas
Copy link
Member

I'm going to merge this to separate the concerns. But let's definitely continue this thread. I'll open two issues on the two points.
