
Improved logic for deciding when to reorder atoms #3721

Merged 1 commit into openmm:master on Aug 12, 2022

Conversation

peastman (Member) commented Aug 1, 2022

This is a possible fix for choderalab/perses#613. See the discussion in choderalab/perses#613 (comment) and following.

Normally it reorders atoms every 250 steps. The change here is that if the size of the neighbor list has grown by more than 10% since the last reordering, it immediately forces a new reordering to try to bring it back down. However, it will still never reorder more often than every 25 steps, to reduce the risk of catastrophic slowdowns in pathological cases.
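
For readers following along, here is a minimal sketch of that heuristic. The names are hypothetical; the actual change is in OpenMM's C++ GPU code, which uses a forceNextReorder flag inside reorderAtoms().

REORDER_INTERVAL = 250      # normal reordering period, in steps
MIN_REORDER_SPACING = 25    # never reorder more often than this
GROWTH_THRESHOLD = 1.10     # force a reorder if the neighbor list grows by more than 10%

class ReorderScheduler:
    """Decides, once per step, whether atoms should be reordered."""
    def __init__(self):
        self.steps_since_reorder = 0
        self.size_at_last_reorder = None

    def should_reorder(self, neighbor_list_size):
        self.steps_since_reorder += 1
        if self.size_at_last_reorder is None:
            return self._do_reorder(neighbor_list_size)   # first call: establish a baseline
        if self.steps_since_reorder >= REORDER_INTERVAL:
            return self._do_reorder(neighbor_list_size)   # regular 250-step schedule
        grew_too_much = neighbor_list_size > GROWTH_THRESHOLD * self.size_at_last_reorder
        if grew_too_much and self.steps_since_reorder >= MIN_REORDER_SPACING:
            return self._do_reorder(neighbor_list_size)   # forced reorder, but never more often than every 25 steps
        return False

    def _do_reorder(self, neighbor_list_size):
        self.steps_since_reorder = 0
        self.size_at_last_reorder = neighbor_list_size
        return True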

This code should be considered experimental. Before we decide whether to merge it or try a different approach instead, we need to thoroughly test it to see how it affects performance on all the relevant use cases.

cc @zhang-ivy @ijpulidos @jchodera

ijpulidos (Contributor) commented:

@peastman I'm not able to build this branch/fork of openmm. I'm facing the following errors when running make:

[ 61%] Linking CXX executable ../../../TestReferenceAndersenThermostat
/home/user/miniconda3/envs/openmm-build-env/bin/ld: /home/user/miniconda3/envs/openmm-build-env/x86_64-conda-linux-gnu/sysroot/usr/lib/libdl.so: undefined reference to `_dl_vsym@GLIBC_PRIVATE'
/home/user/miniconda3/envs/openmm-build-env/bin/ld: /home/user/miniconda3/envs/openmm-build-env/x86_64-conda-linux-gnu/sysroot/usr/lib/libdl.so: undefined reference to `_dl_addr@GLIBC_PRIVATE'
/home/user/miniconda3/envs/openmm-build-env/bin/ld: /home/user/miniconda3/envs/openmm-build-env/x86_64-conda-linux-gnu/sysroot/usr/lib/libdl.so: undefined reference to `_dl_sym@GLIBC_PRIVATE'
collect2: error: ld returned 1 exit status
make[2]: *** [platforms/reference/tests/CMakeFiles/TestReferenceAndersenThermostat.dir/build.make:99: TestReferenceAndersenThermostat] Error 1
make[1]: *** [CMakeFiles/Makefile2:1671: platforms/reference/tests/CMakeFiles/TestReferenceAndersenThermostat.dir/all] Error 2

I am able to build the master branch of openmm just fine using the same commands and environment. For what it's worth, my build commands are the following:

cmake -DCMAKE_INSTALL_PREFIX=/home/user/workdir/software/openmm-install -DOPENMM_BUILD_AMOEBA_OPENCL_LIB=OFF -DOPENMM_BUILD_CUDA_TESTS=OFF -DOPENMM_BUILD_DRUDE_OPENCL_LIB=OFF -DOPENMM_BUILD_EXAMPLES=OFF -DOPENMM_BUILD_OPENCL_DOUBLE_PRE=OFF -DOPENMM_BUILD_OPENCL_LIB=OFF -DOPENMM_BUILD_OPENCL_TESTS=OFF -DOPENMM_BUILD_RPMD_OPENCL_LIB=OFF -DOPENMM_BUILD_STATIC_LIB=OFF -DOPENMM_GENERATE_API_DOCS=OFF ../
make -j 7

jchodera (Member) commented Aug 2, 2022

This looks like a promising solution!

@ijpulidos: That's a confusing error, but this page suggests it may be due to a relatively recent change in ld that requires the -ldl flag to be added to LDLIBS rather than LDFLAGS. This could be an issue with the CMake files, a build directory that needs to be regenerated from scratch, or a version of CMake that doesn't know about this change in ld. Does this happen in a fresh new environment configured with the latest build instructions?

ijpulidos (Contributor) commented:

I managed to build it now. There are some things we want to discuss about the building process; I think we are close to getting it "just right", especially when doing it with conda-forge packages and compilers. I'll raise an issue about this for further discussion.

Does this happen in a fresh new environment configured with the latest build instructions?

It wasn't a fresh environment, since I want it to live alongside other packages such as openmmtools. That makes me realize we may want to refine the instructions for doing this while also having a locally built version of OpenMM.

ijpulidos (Contributor) commented Aug 3, 2022

I ran the tests again and the results are as follows:
[plot: iteration times per iteration for each configuration]
The ones marked with peastman in the label/legend are the new results from the branch in this PR. The others are the previous results reported in choderalab/perses#613 (comment).
I took the liberty of checking the distributions and averages as well:
[plot: iteration-time distributions and averages]

# AVERAGE ITERATION TIMES (in seconds)
omm770-dev1-1GPU-localbuild	 9.0183
omm770-dev1-1GPU-localbuild-fix	 8.127790000000001
omm770-dev1-1GPU-localbuild-fix-blocking	 8.35379
omm770-dev1-2MPI	 9.308235
omm770-dev-1GPU-localbuild-peastman	 8.36401
omm770-dev-2MPI-localbuild-peastman	 8.94218

EDIT: Added previous data for 2MPI processes WITHOUT fix.

ijpulidos (Contributor) commented Aug 3, 2022

It seems that using 2 MPI processes is now a bit "worse" compared to the hotfix (triggering atom reordering every time) that I tried last time. Apologies for not having both plots there; I accidentally lost the previous data for 2 MPI processes. I did find earlier data for 2 MPI processes without the hotfix, and have updated the plots accordingly.

zhang-ivy (Contributor) commented Aug 3, 2022

@ijpulidos: In the experiments labeled peastman, are you using the blocking sync fix as well?

ijpulidos (Contributor) commented Aug 3, 2022

It seems that using 2 MPI processes is now a bit "worse" compared to the hotfix (triggering atom reordering every time) that I tried last time.

Actually, never mind; this seems consistent with the previous results, since that's just the difference from not using the blocking sync flag/option. Last time I didn't try the fix with blocking.

ijpulidos (Contributor) commented Aug 3, 2022

@ijpulidos: In the experiments labeled peastman, are you using the blocking sync fix as well?

@zhang-ivy Yes, that explains the difference; I am not using that in this case. See my previous comment.

EDIT: Sorry for the noise; just to be explicit, since it gets a bit confusing: these latest tests use platform.setPropertyDefaultValue('UseBlockingSync', 'false').
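
For anyone reproducing these runs, that property is set on the CUDA platform before the context is created; a minimal example, assuming the OpenMM 7.6+ import style:

from openmm import Platform

# Make new CUDA contexts default to non-blocking synchronization, as in the
# tests above. This must be done before the Context/Simulation is created.
platform = Platform.getPlatformByName('CUDA')
platform.setPropertyDefaultValue('UseBlockingSync', 'false')

# Alternatively, the property can be passed per context when constructing it,
# e.g. Context(system, integrator, platform, {'UseBlockingSync': 'false'}).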

ijpulidos (Contributor) commented:

Then I think it does mean that we are now getting slightly worse results for 2 MPI processes, as per #3721 (comment). But we do have to take into account that the conda-forge package openmm-7.7.0-dev1 used for the previous tests differs from the current master branch of OpenMM by several commits.

jchodera (Member) commented Aug 3, 2022

Thanks for the great data, @ijpulidos!

@mikemhenry suggested it might be useful to run the OpenMM examples/benchmark.py script as well to compare the various builds on a few test systems.

ijpulidos (Contributor) commented Aug 3, 2022

The results for the benchmark comparing the master branch with the changes in this PR (labelled peastman in the plots) are as follows:
[plot: benchmark performance (ns/day) for master vs. this PR]
Normalized performance values:
[plot: normalized benchmark performance]

EDIT: These results compare the main branch with the peastman fork, which is not a good comparison; please see the following comment.

mikemhenry (Contributor) commented:

I don't think that this should be a blocker for this issue, but I am interested in improving the performance of the neighbor list by doing some automated tuning. The MD engine I used a lot in graduate school, HOOMD-blue, does some neat things with the neighbor list (https://hoomd-blue.readthedocs.io/en/v3.3.0/module-md-nlist.html#md-nlist) to help tune the neighbor list rebuild time. So I think we should find a solution that is on the easy side to implement, but long term I think automated tuning will be the correct approach; a rough sketch of what such a tuning loop could look like is below.
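
Purely as an illustration of that idea (OpenMM exposes no tunable reorder interval today, and none of these names are real API), an auto-tuner could time short blocks of steps at a few candidate settings and keep the fastest:

import time

def tune_reorder_interval(run_steps, candidates=(25, 100, 250, 500), block=1000):
    """Hypothetical tuner: run_steps(interval, n) is a stand-in for
    'run n MD steps with the given reorder interval'."""
    timings = {}
    for interval in candidates:
        start = time.perf_counter()
        run_steps(interval, block)            # time a short block at this setting
        timings[interval] = time.perf_counter() - start
    best = min(timings, key=timings.get)      # fastest wall-clock time wins
    return best, timings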

jchodera (Member) commented Aug 4, 2022

The results for the benchmark comparing master branch with the changes in this PR (labelled peastman in the plots) are as follows

Wow! There is improvement in every benchmark! That's great! Thanks for the comprehensive assessment, @ijpulidos!

I am interested in improving the performance of the neighbor list by doing some automated tuning.

Sounds like it's worth opening a separate issue to discuss, @mikemhenry!

zhang-ivy (Contributor) commented:

Wow! There is improvement in every benchmark! That's great! Thanks for the comprehensive assessment, @ijpulidos!

@jchodera: The x-axis is performance (ns/day), so I think it's actually the opposite: this PR actually causes a bit of a slowdown for all systems?

peastman (Member, Author) commented Aug 4, 2022

I tried running a couple of benchmarks, both with this PR and with revision ae34183, the revision on the main branch that this branch was created from. I ran each benchmark three times. The speed in ns/day is shown below, first for apoa1pme.

This PR:

229.521
214.94
218.582

Main branch:

209.075
212.948
213.624

And for amoebapme:

This PR:

10.7994
10.9889
10.8875

Main branch:

10.9505
10.9559
10.8586

There are fluctuations from one run to another. Neither branch seems to be systematically faster than the other. And I can't see any reason there would be. I added a line to reorderAtoms() to print a message if forceNextReorder was ever set, and it never was. So there really shouldn't be any way for this change to have affected the speed.

Is it possible you're seeing some other effect? For example, the two sets of benchmarks were run on different nodes of a cluster, or something else was running in the background that slowed it down during one set?

ijpulidos (Contributor) commented Aug 4, 2022

Is it possible you're seeing some other effect? For example, the two sets of benchmarks were run on different nodes of a cluster, or something else was running in the background that slowed it down during one set?

These were all run locally on my laptop using an RTX A4000 external GPU. I didn't use the revision you mention, but whatever was latest on the main branch at the time.

peastman (Member, Author) commented Aug 4, 2022

Try using that revision to be sure no other change is affecting it. It's also good to switch back and forth a few times. Run one version a couple of times, then the other, then switch back to the first and try it again, and so on. That reduces the risk of results being influenced by what else was going on at the time. To get reproducible benchmarks, it's also good to quit other applications and check top and nvidia-smi just before each benchmark to make sure nothing else is using significant CPU or GPU time.

ijpulidos (Contributor) commented Aug 8, 2022

@peastman You were correct. The results are the same for all the benchmarks when running from revision ae34183 and with the changes in this PR.

[plot: benchmark performance, revision ae34183 vs. this PR]

My previous results were not wrong, but they were not measuring the right thing: the main branch appears to have picked up extra performance improvements independent of the changes in this PR.

peastman (Member, Author) commented Aug 9, 2022

Probably from #3720. Thanks!

peastman (Member, Author) commented Aug 9, 2022

Here are some other good benchmarks to try.

  • Measure the speed of LocalEnergyMinimizer. It repeatedly calls setPositions() to evaluate candidate conformations. Be aware that it isn't deterministic, so we should measure it several times to get the average and variation (see the timing sketch after this list).
  • Try loading a series of conformations saved from a simulation and computing the energy of each one. Try evaluating them both sequentially and in random order, since that may affect the speed.
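
A rough timing harness for the first benchmark could look like this. It is only a sketch: it assumes a context and starting positions already exist, and it uses LocalEnergyMinimizer with its default tolerance.

import statistics
import time

from openmm import LocalEnergyMinimizer

def time_minimizations(context, start_positions, repeats=10):
    # Reset to the same starting point each time, since minimization isn't deterministic.
    times = []
    for _ in range(repeats):
        context.setPositions(start_positions)
        start = time.perf_counter()
        LocalEnergyMinimizer.minimize(context)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)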

peastman (Member, Author) commented Aug 9, 2022

Here are some more benchmarks. I started by simulating ApoA1 for 10 ns, saving a conformation to a DCD file every 100 ps. I then loaded in the file and evaluated the energies of all 100 conformations, either in order or in a random order. I tried both for ae34183 and for this PR and repeated each measurement three times. Here is the total time (in seconds) for the 100 energy evaluations with the CUDA platform.
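
A sketch of that kind of measurement, assuming mdtraj is used to read the DCD back in (the exact tooling behind these numbers isn't stated) and that a context and topology file already exist from the ApoA1 setup:

import random
import time

import mdtraj

def time_energy_evaluations(context, dcd_file, topology_file, shuffle=False):
    traj = mdtraj.load_dcd(dcd_file, top=topology_file)
    frames = list(range(traj.n_frames))
    if shuffle:
        random.shuffle(frames)                  # random order instead of sequential
    start = time.perf_counter()
    for i in frames:
        context.setPositions(traj.openmm_positions(i))
        context.getState(getEnergy=True).getPotentialEnergy()
    return time.perf_counter() - start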

Random

ae34183:

0.8552842140197754
0.8469550609588623
0.8429384231567383

This PR:

0.8529636859893799
0.833075761795044
0.8250143527984619

In order

ae34183:

0.9200880527496338
0.9235060214996338
0.922003984451294

This PR:

0.6829240322113037
0.6808676719665527
0.6837430000305176

When evaluating them in random order, this PR doesn't seem to have any significant effect on speed. When evaluating them in order, the version with this PR gets faster, which is what I'd expect. Successive frames have related atom positions, so the atom ordering should be better. But for the existing code, evaluating frames in order actually makes it slightly slower. I can't think of any reason for that. But I switched back and forth a few times and it's reproducible. I also tried OpenCL and the results are very similar. In order is faster than random with this PR, but slightly slower for the existing code. I don't understand that.

As another benchmark, I tried running an energy minimization from the final state in the trajectory. I repeated it ten times for both versions of the code. Here are the times in seconds for the CUDA platform.

ae34183:

17.01730704307556
10.366745948791504
13.943104267120361
14.002901554107666
11.451032638549805
12.502070903778076
13.595231056213379
11.10424518585205
19.358876943588257
11.82084608078003

This PR:

12.159980535507202
12.579627513885498
15.853049516677856
16.603999614715576
11.795461416244507
17.72627019882202
12.364694118499756
19.035810947418213
14.046435356140137
13.135895252227783

The difference between them is not statistically significant.

zhang-ivy (Contributor) commented:

@peastman: Great to see! Any more experiments you'd like to see, or can we merge this?

zhang-ivy (Contributor) commented:

Also, just curious: what is your timeline for releasing OpenMM 8? I know the beta is already on conda-forge, but when do you think 8.0 itself will be released?

peastman (Member, Author) commented:

The original schedule was to have released the beta about a month ago, which would have meant doing a release candidate around now. But we've been stuck for weeks on trying to build conda packages for the plugins. :( Whenever we finally manage to get all of them built, I'll do a new updated beta build of OpenMM, then we can announce it and ask users to start testing it, and then follow with the release candidate about a month later.

Based on the tests above, I'm ok with merging this if everyone else is comfortable with it.

jchodera (Member) commented:

Based on the tests above, I'm ok with merging this if everyone else is comfortable with it.

No objections! This seems like a robust improvement!

peastman merged commit 48664a1 into openmm:master on Aug 12, 2022.
peastman deleted the reorder branch on August 12, 2022.
zhang-ivy (Contributor) commented:

I just tested the OpenMM nightly dev build (with this PR) on a large protein:protein system (RBD:ACE2), and the cycling problem is gone 🥳
[plot: iteration times for the RBD:ACE2 system, no more cycling]

The spikes every ~100 iterations are due to saving to the checkpoint file.

I will close the perses repex cycling issue now. Thanks everyone for your help in solving this!!

peastman (Member, Author) commented:

Yay!
