
Improved logic for deciding when to reorder atoms #3721

Merged 1 commit into openmm:master on Aug 12, 2022

Conversation

peastman (Member) commented Aug 1, 2022

This is a possible fix for choderalab/perses#613. See the discussion in choderalab/perses#613 (comment) and following.

Normally it reorders atoms every 250 steps. The change here is that if the size of the neighbor list has grown by more than 10% since the last reordering, it immediately forces a new reordering to try to bring it back down. However, it will still never reorder more often than every 25 steps, to reduce the risk of catastrophic slowdowns in pathological cases.
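
For readers following along, here is a minimal sketch of that heuristic. The names are hypothetical; the actual change is in OpenMM's C++ GPU code, which uses a forceNextReorder flag inside reorderAtoms().

REORDER_INTERVAL = 250      # normal reordering period, in steps
MIN_REORDER_SPACING = 25    # never reorder more often than this
GROWTH_THRESHOLD = 1.10     # force a reorder if the neighbor list grows by more than 10%

class ReorderScheduler:
    """Decides, once per step, whether atoms should be reordered."""
    def __init__(self):
        self.steps_since_reorder = 0
        self.size_at_last_reorder = None

    def should_reorder(self, neighbor_list_size):
        self.steps_since_reorder += 1
        if self.size_at_last_reorder is None:
            return self._do_reorder(neighbor_list_size)   # first call: establish a baseline
        if self.steps_since_reorder >= REORDER_INTERVAL:
            return self._do_reorder(neighbor_list_size)   # regular 250-step schedule
        grew_too_much = neighbor_list_size > GROWTH_THRESHOLD * self.size_at_last_reorder
        if grew_too_much and self.steps_since_reorder >= MIN_REORDER_SPACING:
            return self._do_reorder(neighbor_list_size)   # forced reorder, but never more often than every 25 steps
        return False

    def _do_reorder(self, neighbor_list_size):
        self.steps_since_reorder = 0
        self.size_at_last_reorder = neighbor_list_size
        return True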

This code should be considered experimental. Before we decide whether to merge it or try a different approach instead, we need to thoroughly test it to see how it affects performance on all the relevant use cases.

cc @zhang-ivy @ijpulidos @jchodera

ijpulidos (Contributor) commented:

@peastman I'm not able to build this branch/fork of openmm. I'm facing the following errors when running make:

[ 61%] Linking CXX executable ../../../TestReferenceAndersenThermostat
/home/user/miniconda3/envs/openmm-build-env/bin/ld: /home/user/miniconda3/envs/openmm-build-env/x86_64-conda-linux-gnu/sysroot/usr/lib/libdl.so: undefined reference to `_dl_vsym@GLIBC_PRIVATE'
/home/user/miniconda3/envs/openmm-build-env/bin/ld: /home/user/miniconda3/envs/openmm-build-env/x86_64-conda-linux-gnu/sysroot/usr/lib/libdl.so: undefined reference to `_dl_addr@GLIBC_PRIVATE'
/home/user/miniconda3/envs/openmm-build-env/bin/ld: /home/user/miniconda3/envs/openmm-build-env/x86_64-conda-linux-gnu/sysroot/usr/lib/libdl.so: undefined reference to `_dl_sym@GLIBC_PRIVATE'
collect2: error: ld returned 1 exit status
make[2]: *** [platforms/reference/tests/CMakeFiles/TestReferenceAndersenThermostat.dir/build.make:99: TestReferenceAndersenThermostat] Error 1
make[1]: *** [CMakeFiles/Makefile2:1671: platforms/reference/tests/CMakeFiles/TestReferenceAndersenThermostat.dir/all] Error 2

I am able to build the master branch of openmm just fine using the same commands and environment. For what it's worth, my build commands are the following:

cmake -DCMAKE_INSTALL_PREFIX=/home/user/workdir/software/openmm-install -DOPENMM_BUILD_AMOEBA_OPENCL_LIB=OFF -DOPENMM_BUILD_CUDA_TESTS=OFF -DOPENMM_BUILD_DRUDE_OPENCL_LIB=OFF -DOPENMM_BUILD_EXAMPLES=OFF -DOPENMM_BUILD_OPENCL_DOUBLE_PRE=OFF -DOPENMM_BUILD_OPENCL_LIB=OFF -DOPENMM_BUILD_OPENCL_TESTS=OFF -DOPENMM_BUILD_RPMD_OPENCL_LIB=OFF -DOPENMM_BUILD_STATIC_LIB=OFF -DOPENMM_GENERATE_API_DOCS=OFF ../
make -j 7

jchodera (Member) commented Aug 2, 2022

This looks like a promising solution!

@ijpulidos: That's a confusing error, but this page suggests it may be due to a relatively recent change in ld that requires the -ldl flag to be added to LDLIBS rather than LDFLAGS. This could be an issue with the CMake files, a build directory that needs to be regenerated from scratch, or a version of CMake that doesn't know about this change in ld. Does this happen in a fresh new environment configured with the latest build instructions?

ijpulidos (Contributor) commented:

I managed to build it now. There are some things we want to discuss about the building process; I think we are close to getting it "just right", especially when doing it with conda-forge packages and compilers. I'll raise an issue about this for further discussion.

Does this happen in a fresh new environment configured with the latest build instructions?

It wasn't a fresh environment, since I want it to live alongside other packages such as openmmtools. That makes me realize we may want to refine the instructions for doing this while also having a locally built version of OpenMM.

ijpulidos (Contributor) commented Aug 3, 2022

I ran the tests again and the results are as follows:
[plot: iteration times per iteration for each configuration]
The ones marked with peastman in the label/legend are the new results from the branch in this PR. The others are the previous results reported in choderalab/perses#613 (comment).
I took the liberty of checking the distributions and averages as well:
[plot: iteration-time distributions and averages]

# AVERAGE ITERATION TIMES (in seconds)
omm770-dev1-1GPU-localbuild	 9.0183
omm770-dev1-1GPU-localbuild-fix	 8.127790000000001
omm770-dev1-1GPU-localbuild-fix-blocking	 8.35379
omm770-dev1-2MPI	 9.308235
omm770-dev-1GPU-localbuild-peastman	 8.36401
omm770-dev-2MPI-localbuild-peastman	 8.94218

EDIT: Added previous data for 2MPI processes WITHOUT fix.

ijpulidos (Contributor) commented Aug 3, 2022

It seems that using 2 MPI processes is now a bit "worse" compared to the hotfix (triggering atom reordering every time) that I tried last time. Apologies for not having both plots there; I accidentally lost the previous data for 2 MPI processes. I did find earlier data for 2 MPI processes without the hotfix, and have updated the plots accordingly.

zhang-ivy (Contributor) commented Aug 3, 2022

@ijpulidos: In the experiments labeled peastman, are you using the blocking sync fix as well?

ijpulidos (Contributor) commented Aug 3, 2022

It seems that using 2 MPI processes is now a bit "worse" compared to the hotfix (triggering atom reordering every time) that I tried last time.

Actually, never mind; this seems consistent with the previous results, since that's just the difference from not using the blocking sync flag/option. Last time I didn't try the fix with blocking.

ijpulidos (Contributor) commented Aug 3, 2022

@ijpulidos: In the experiments labeled peastman, are you using the blocking sync fix as well?

@zhang-ivy Yes, that explains the difference; I am not using that in this case. See my previous comment.

EDIT: Sorry for the noise; just to be explicit, since it gets a bit confusing: these latest tests use platform.setPropertyDefaultValue('UseBlockingSync', 'false').
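
For anyone reproducing these runs, that property is set on the CUDA platform before the context is created; a minimal example, assuming the OpenMM 7.6+ import style:

from openmm import Platform

# Make new CUDA contexts default to non-blocking synchronization, as in the
# tests above. This must be done before the Context/Simulation is created.
platform = Platform.getPlatformByName('CUDA')
platform.setPropertyDefaultValue('UseBlockingSync', 'false')

# Alternatively, the property can be passed per context when constructing it,
# e.g. Context(system, integrator, platform, {'UseBlockingSync': 'false'}).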

ijpulidos (Contributor) commented:

Then I think it does mean that we are now getting slightly worse results for 2 MPI processes, as per #3721 (comment). But we do have to take into account that the conda-forge package openmm-7.7.0-dev1 used for the previous tests differs from the current master branch of OpenMM by several commits.

jchodera (Member) commented Aug 3, 2022

Thanks for the great data, @ijpulidos!

@mikemhenry suggested it might be useful to run the OpenMM examples/benchmark.py script as well to compare the various builds on a few test systems.

ijpulidos (Contributor) commented Aug 3, 2022

The results for the benchmark comparing the master branch with the changes in this PR (labelled peastman in the plots) are as follows:
[plot: benchmark performance (ns/day) for master vs. this PR]
Normalized performance values:
[plot: normalized benchmark performance]

EDIT: These results compare the main branch with the peastman fork, which is not a good comparison; please see the following comment.

mikemhenry (Contributor) commented:

I don't think that this should be a blocker for this issue, but I am interested in improving the performance of the neighbor list by doing some automated tuning. The MD engine I used a lot in graduate school, HOOMD-blue, does some neat things with the neighbor list (https://hoomd-blue.readthedocs.io/en/v3.3.0/module-md-nlist.html#md-nlist) to help tune the neighbor list rebuild time. So I think we should find a solution that is on the easy side to implement, but long term I think automated tuning will be the correct approach; a rough sketch of what such a tuning loop could look like is below.
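
Purely as an illustration of that idea (OpenMM exposes no tunable reorder interval today, and none of these names are real API), an auto-tuner could time short blocks of steps at a few candidate settings and keep the fastest:

import time

def tune_reorder_interval(run_steps, candidates=(25, 100, 250, 500), block=1000):
    """Hypothetical tuner: run_steps(interval, n) is a stand-in for
    'run n MD steps with the given reorder interval'."""
    timings = {}
    for interval in candidates:
        start = time.perf_counter()
        run_steps(interval, block)            # time a short block at this setting
        timings[interval] = time.perf_counter() - start
    best = min(timings, key=timings.get)      # fastest wall-clock time wins
    return best, timings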

jchodera (Member) commented Aug 4, 2022

The results for the benchmark comparing master branch with the changes in this PR (labelled peastman in the plots) are as follows

Wow! There is improvement in every benchmark! That's great! Thanks for the comprehensive assessment, @ijpulidos!

I am interested in improving the performance of the neighbor list by doing some automated tuning.

Sounds like it's worth opening a separate issue to discuss, @mikemhenry!

zhang-ivy (Contributor) commented:

Wow! There is improvement in every benchmark! That's great! Thanks for the comprehensive assessment, @ijpulidos!

@jchodera: The x-axis is performance (ns/day), so I think it's actually the opposite: this PR actually causes a bit of a slowdown for all systems?

peastman (Member, Author) commented Aug 4, 2022

I tried running a couple of benchmarks, both with this PR and with revision ae34183, the revision on the main branch that this branch was created from. I ran each benchmark three times. The speed in ns/day is shown below, first for apoa1pme.

This PR:

229.521
214.94
218.582

Main branch:

209.075
212.948
213.624

And for amoebapme:

This PR:

10.7994
10.9889
10.8875

Main branch:

10.9505
10.9559
10.8586

There are fluctuations from one run to another. Neither branch seems to be systematically faster than the other. And I can't see any reason there would be. I added a line to reorderAtoms() to print a message if forceNextReorder was ever set, and it never was. So there really shouldn't be any way for this change to have affected the speed.

Is it possible you're seeing some other effect? For example, the two sets of benchmarks were run on different nodes of a cluster, or something else was running in the background that slowed it down during one set?

ijpulidos (Contributor) commented Aug 4, 2022

Is it possible you're seeing some other effect? For example, the two sets of benchmarks were run on different nodes of a cluster, or something else was running in the background that slowed it down during one set?

These were all run locally on my laptop using an RTX A4000 external GPU. I didn't use the revision you mention, but whatever was latest on the main branch at the time.

peastman (Member, Author) commented Aug 4, 2022

Try using that revision to be sure no other change is affecting it. It's also good to switch back and forth a few times. Run one version a couple of times, then the other, then switch back to the first and try it again, and so on. That reduces the risk of results being influenced by what else was going on at the time. To get reproducible benchmarks, it's also good to quit other applications and check top and nvidia-smi just before each benchmark to make sure nothing else is using significant CPU or GPU time.

ijpulidos (Contributor) commented Aug 8, 2022

@peastman You were correct. The results are the same for all the benchmarks when running from revision ae34183 and with the changes in this PR.

[plot: benchmark performance, revision ae34183 vs. this PR]

My previous results were not wrong, but they were not measuring the right thing: the main branch appears to have picked up extra performance improvements independent of the changes in this PR.

peastman (Member, Author) commented Aug 9, 2022

Probably from #3720. Thanks!

peastman (Member, Author) commented Aug 9, 2022

Here are some other good benchmarks to try.

  • Measure the speed of LocalEnergyMinimizer. It repeatedly calls setPositions() to evaluate candidate conformations. Be aware that it isn't deterministic, so we should measure it several times to get the average and variation (see the timing sketch after this list).
  • Try loading a series of conformations saved from a simulation and computing the energy of each one. Try evaluating them both sequentially and in random order, since that may affect the speed.
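
A rough timing harness for the first benchmark could look like this. It is only a sketch: it assumes a context and starting positions already exist, and it uses LocalEnergyMinimizer with its default tolerance.

import statistics
import time

from openmm import LocalEnergyMinimizer

def time_minimizations(context, start_positions, repeats=10):
    # Reset to the same starting point each time, since minimization isn't deterministic.
    times = []
    for _ in range(repeats):
        context.setPositions(start_positions)
        start = time.perf_counter()
        LocalEnergyMinimizer.minimize(context)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)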

peastman (Member, Author) commented Aug 9, 2022

Here are some more benchmarks. I started by simulating ApoA1 for 10 ns, saving a conformation to a DCD file every 100 ps. I then loaded in the file and evaluated the energies of all 100 conformations, either in order or in a random order. I tried both for ae34183 and for this PR and repeated each measurement three times. Here is the total time (in seconds) for the 100 energy evaluations with the CUDA platform.
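
A sketch of that kind of measurement, assuming mdtraj is used to read the DCD back in (the exact tooling behind these numbers isn't stated) and that a context and topology file already exist from the ApoA1 setup:

import random
import time

import mdtraj

def time_energy_evaluations(context, dcd_file, topology_file, shuffle=False):
    traj = mdtraj.load_dcd(dcd_file, top=topology_file)
    frames = list(range(traj.n_frames))
    if shuffle:
        random.shuffle(frames)                  # random order instead of sequential
    start = time.perf_counter()
    for i in frames:
        context.setPositions(traj.openmm_positions(i))
        context.getState(getEnergy=True).getPotentialEnergy()
    return time.perf_counter() - start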

Random

ae34183:

0.8552842140197754
0.8469550609588623
0.8429384231567383

This PR:

0.8529636859893799
0.833075761795044
0.8250143527984619

In order

ae34183:

0.9200880527496338
0.9235060214996338
0.922003984451294

This PR:

0.6829240322113037
0.6808676719665527
0.6837430000305176

When evaluating them in random order, this PR doesn't seem to have any significant effect on speed. When evaluating them in order, the version with this PR gets faster, which is what I'd expect. Successive frames have related atom positions, so the atom ordering should be better. But for the existing code, evaluating frames in order actually makes it slightly slower. I can't think of any reason for that. But I switched back and forth a few times and it's reproducible. I also tried OpenCL and the results are very similar. In order is faster than random with this PR, but slightly slower for the existing code. I don't understand that.

As another benchmark, I tried running an energy minimization from the final state in the trajectory. I repeated it ten times for both versions of the code. Here are the times in seconds for the CUDA platform.

ae34183:

17.01730704307556
10.366745948791504
13.943104267120361
14.002901554107666
11.451032638549805
12.502070903778076
13.595231056213379
11.10424518585205
19.358876943588257
11.82084608078003

This PR:

12.159980535507202
12.579627513885498
15.853049516677856
16.603999614715576
11.795461416244507
17.72627019882202
12.364694118499756
19.035810947418213
14.046435356140137
13.135895252227783

The difference between them is not statistically significant.

zhang-ivy (Contributor) commented:

@peastman: Great to see! Any more experiments you'd like to see, or can we merge this?

zhang-ivy (Contributor) commented:

Also, just curious: what is your timeline for releasing OpenMM 8? I know the beta is already on conda-forge, but when do you think 8.0 itself will be released?

peastman (Member, Author) commented:

The original schedule was to have released the beta about a month ago, which would have meant doing a release candidate around now. But we've been stuck for weeks on trying to build conda packages for the plugins. :( Whenever we finally manage to get all of them built, I'll do a new updated beta build of OpenMM, then we can announce it and ask users to start testing it, and then follow with the release candidate about a month later.

Based on the tests above, I'm ok with merging this if everyone else is comfortable with it.

jchodera (Member) commented:

Based on the tests above, I'm ok with merging this if everyone else is comfortable with it.

No objections! This seems like a robust improvement!

peastman merged commit 48664a1 into openmm:master on Aug 12, 2022.
peastman deleted the reorder branch on August 12, 2022.
zhang-ivy (Contributor) commented:

I just tested the OpenMM nightly dev build (with this PR) on a large protein:protein system (RBD:ACE2), and the cycling problem is gone 🥳
[plot: iteration times for the RBD:ACE2 system, no more cycling]

The spikes every ~100 iterations are due to saving to the checkpoint file.

I will close the perses repex cycling issue now. Thanks everyone for your help in solving this!!

peastman (Member, Author) commented:

Yay!
