odeint is slow #5
Comments
@modichirag Also, can you check if these help? The experimental odeint has more problems. I wanted to try replacing it with the diffrax package. |
diffrax is nice for sure, but I don't think it will make a big difference (I may be wrong). It does give you more control over the ODE solver, and it returns interesting info such as how many steps were actually needed. |
Okay, the reason rk4 is about 20x faster here could be that @modichirag is scanning rk4 over the nbody time integration steps (here probably 64 in total). I guess the bottleneck is the slowness of GPUs at solving small but sequential problems. With odeint I sometimes get NaN. I wanted to see whether diffrax is more stable. |
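For reference, "scanning rk4" over a fixed grid can be sketched roughly as below. This is only an illustration under assumptions, not the code from the attached script: the helper name rk4_scan and the toy right-hand side decay are hypothetical, and the toy equation dy/dt = -y stands in for the growth equations.

```python
import jax
import jax.numpy as jnp


def rk4_scan(f, y0, ts):
    """Integrate dy/dt = f(y, t) on the fixed grid ts with classical RK4.

    Hypothetical helper for illustration; the step count is len(ts) - 1.
    """

    def step(y, t_pair):
        t0, t1 = t_pair
        h = t1 - t0
        k1 = f(y, t0)
        k2 = f(y + 0.5 * h * k1, t0 + 0.5 * h)
        k3 = f(y + 0.5 * h * k2, t0 + 0.5 * h)
        k4 = f(y + h * k3, t1)
        y_next = y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        return y_next, y_next

    # lax.scan compiles the loop body once and runs it sequentially.
    _, ys = jax.lax.scan(step, y0, (ts[:-1], ts[1:]))
    return jnp.concatenate([y0[None], ys], axis=0)


def decay(y, t):
    # Toy right-hand side (assumption): exact solution is exp(-t).
    return -y


ts = jnp.linspace(0.0, 1.0, 65)  # 64 fixed steps
ys = jax.jit(lambda y0: rk4_scan(decay, y0, ts))(jnp.array([1.0]))
err = float(jnp.abs(ys[-1, 0] - jnp.exp(-1.0)))
```

The number of right-hand-side evaluations is fixed at four per step, which is part of why a scanned fixed-step scheme can be much cheaper than an adaptive solver with a tight tolerance, at the cost of having no error control.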
For odeint, I had tried loosening the tolerances (atol and rtol) from 1e-8 to 1e-3. It made a difference of a factor of 2, not 20. |
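As a rough sketch of what changing the tolerances looks like with jax.experimental.ode.odeint, which takes rtol and atol as keyword arguments. The toy equation dy/dt = -y and the 65-point output grid are assumptions for illustration, and 64-bit mode is enabled here only so that a 1e-8 tolerance is meaningful.

```python
import jax

# Assumption: double precision, so that rtol/atol = 1e-8 is above round-off.
jax.config.update("jax_enable_x64", True)

import jax.numpy as jnp
from jax.experimental.ode import odeint


def decay(y, t):
    # Toy right-hand side (assumption): exact solution is exp(-t).
    return -y


y0 = jnp.array([1.0])
ts = jnp.linspace(0.0, 1.0, 65)

# Same problem, tight versus loose tolerances.
solve_tight = jax.jit(lambda y: odeint(decay, y, ts, rtol=1e-8, atol=1e-8))
solve_loose = jax.jit(lambda y: odeint(decay, y, ts, rtol=1e-3, atol=1e-3))

ys_tight = solve_tight(y0)
ys_loose = solve_loose(y0)

exact = jnp.exp(-1.0)
err_tight = float(jnp.abs(ys_tight[-1, 0] - exact))
err_loose = float(jnp.abs(ys_loose[-1, 0] - exact))
```

Looser tolerances let the adaptive stepper take fewer, larger steps, so the accuracy of the result should be checked alongside the timing.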
What are the odeint timeit numbers for 1e-8 and 1e-3 now? Nice. I am surprised that 64 steps is good enough for growth factors. |
I set up diffrax. It seems to be slower than odeint. For 1e-3 tolerance, it takes 1.27s for Boltzmann. The times for 1e-8 and 1e-3 with odeint are now 300ms and 250ms respectively. I think these differ from the previous numbers because I had other jobs running on the GPU. Some other jobs are still running, so these numbers could go down a bit more on a free GPU, but it will certainly remain more time-consuming than the LPT step by a factor of 5-10x, which is bad. |
What's the correct rk4 timing you got? @modichirag |
@modichirag Actually I misremembered this number, 170ms is for (512^3) 2LPT, growth integration takes 10ms. |
Executable Attachment is removed due to company policy
Here is the script I have been using to set up different boltzmann
integrations and compare them.
There are 2 main functions:
test_growth() which compares the accuracy of rk4 integral
and test_pm() which does the time tests
Let me know if I am doing something wrong.
…On Wed, May 11, 2022 at 1:10 PM Yin Li ***@***.***> wrote:
For me with 1e-8 tol it was ~170ms, still a lot slower than your rk4
number.
@modichirag <https://github.com/modichirag> Actually I misremembered this
number, 170ms is for (512^3) 2LPT, growth integration takes 10ms.
But with perfect scaling to 128^3, lpt will still be a lot faster than
boltzmann.
--
Chirag Modi
Flatiron Research Fellow, Cosmology X Data Science Group
Center for Computational Astrophysics
Center for Computational Mathematics
Flatiron Institute, Simons Foundation
|
Executable Attachment is removed due to company policy
Actually sorry, that was the old version.
Attached is the new version of test_growth. It is compatible with the
attached conf.py.
The difference between this and your version is that I have also added a
growth_anum parameter in conf.py.
I had to do this to decouple the time steps of the rk4 integration from the timestamps
of the simulation; otherwise it takes many steps up to *a_start* and then a very small
number of steps from *a_start* to 1, which does not lead to a stable
integration.
If this is getting confusing, I can create another branch and push there so
that you can simply run this without having to track multiple files. Let me
know.
|
@modichirag Okay. Have you tried separating the jit time from the running time, as I suggested earlier? I simply did the following in Jupyter:
>>> %time boltzmann(cosmo).growth.block_until_ready()
CPU times: user 824 ms, sys: 22.4 ms, total: 846 ms
Wall time: 1.18 s
>>> %timeit boltzmann(cosmo).growth.block_until_ready()
10.2 ms ± 7.33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
I am not using Jupyter but a Python script, so there is no timeit; instead I ran a loop of 50 iterations and timed it with time.time(). Here are the relevant lines of code from the script I had sent earlier (I was not explicitly printing the compile time there, but was compiling nevertheless).
And the output is
Interestingly, block_until_ready() does not seem to make that big a difference: it's 1.4ms vs 1.6ms and 11.5ms vs 12.5ms. |
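For a plain script, separating the first (compiling) call from the steady-state calls can be sketched as follows. This is not the script referred to above: the function work is a hypothetical stand-in workload, and time_compile_and_run is a hypothetical helper.

```python
import time

import jax
import jax.numpy as jnp


@jax.jit
def work(x):
    # Stand-in workload (assumption); replace with the function being timed.
    return jnp.sum(jnp.sin(x) ** 2)


def time_compile_and_run(fn, *args, repeats=50):
    """Return (first-call time, mean time per later call) in seconds."""
    t0 = time.perf_counter()
    jax.block_until_ready(fn(*args))  # first call: trace + compile + run
    first_call = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(repeats):
        # Block each time so asynchronous dispatch does not hide the run time.
        jax.block_until_ready(fn(*args))
    per_call = (time.perf_counter() - t0) / repeats
    return first_call, per_call


x = jnp.arange(1024.0)
first_call, per_call = time_compile_and_run(work, x)
print(f"first call (includes compile): {first_call:.4f} s")
print(f"steady state: {per_call * 1e3:.3f} ms per call")
```

Blocking inside the loop matters most when the loop ends without anything consuming the results; if each iteration already forces the value (for example by printing it), the difference is expected to be small.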
Okay. So you also get boltzmann ~ 10ms. I think this will be fine for most of our target use cases. |
@Yucheng-Zhang found that it's 10x faster to |
|
To speed up growth function integration, I did the following test
def boltz_factory(backend):
    return jax.jit(boltzmann, device=jax.devices(backend)[0])
boltz_gpu = boltz_factory('gpu')
boltz_cpu = boltz_factory('cpu')
conf = Configuration(1., (2,) * 3)
cosmo = SimpleLCDM(conf)
cosmo = boltzmann(cosmo, conf)
%time jax.block_until_ready(boltz_gpu(cosmo, conf))
%timeit jax.block_until_ready(boltz_gpu(cosmo, conf))
%time jax.block_until_ready(jax.device_put(boltz_cpu(cosmo, conf), device=jax.devices('gpu')[0]))
%timeit jax.block_until_ready(jax.device_put(boltz_cpu(cosmo, conf), device=jax.devices('gpu')[0]))
The
So @adrianbayer you can probably do something like the following
boltz = jax.jit(boltzmann, device=jax.devices('cpu')[0])
cosmo = boltz(cosmo, conf)
cosmo = jax.device_put(cosmo, device=jax.devices('gpu')[0]) |
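The device argument of jax.jit has been documented as experimental, so it may not be available in every JAX release. A rough alternative sketch of the same idea relies on jitted computations running where their committed inputs live: commit the input to the CPU, run there, then move the small result over. The function small_sequential is a hypothetical stand-in for a small sequential solve, not pmwd's boltzmann, and target falls back to the CPU when no accelerator is present.

```python
import jax
import jax.numpy as jnp

cpu = jax.devices("cpu")[0]
target = jax.devices()[0]  # default backend: a GPU if available, else the CPU


@jax.jit
def small_sequential(y0):
    # Hypothetical stand-in for a small, sequential solve (64 dependent steps).
    def step(y, _):
        y = y - 0.01 * y
        return y, y

    _, ys = jax.lax.scan(step, y0, None, length=64)
    return ys


y0_cpu = jax.device_put(jnp.ones(()), cpu)  # commit the input to the CPU
ys_cpu = small_sequential(y0_cpu)           # runs on the CPU, following its input
ys = jax.device_put(ys_cpu, target)         # move the small result to the target device
```

The transfer cost is small as long as the result is small, which is the case for tabulated growth functions.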
I was experimenting with some time tests and found that using odeint to calculate the growth functions is quite slow.
As a hack, I replaced it with rk4 integration in the growth function itself, which seems to be much faster.
I then ran time tests for a 64^3 simulation, in which I pass the cosmology parameters and initial modes as input and measure the time for different outputs (just the boltzmann solve vs boltzmann + LPT).
The time taken for each of these is
Time taken for boltzmann: 0.5971660375595093
Time taken for boltzmann rk4: 0.007928729057312012
Time taken for LPT: 0.0041596412658691405
Time taken for simulation (Boltzmann + LPT): 0.463437557220459
Time taken for simulation rk4 (Boltzmann + LPT): 0.04284675121307373
rk4 seems to be much faster than using odeint to generate the growth rate.
If what I am doing in running the simulations is sensible and the timing numbers portray an accurate picture,
should we figure out a better (jaxified) way to code this?
I have attached the full script as a txt file (copy it into the pmwd/pmwd folder, convert it to .py, and it should run)
test_growth.txt