Improved logic for deciding when to reorder atoms #3721
Conversation
@peastman I'm not able to build this branch/fork of OpenMM. I'm facing the following errors when running the build. I configured with
`cmake -DCMAKE_INSTALL_PREFIX=/home/user/workdir/software/openmm-install -DOPENMM_BUILD_AMOEBA_OPENCL_LIB=OFF -DOPENMM_BUILD_CUDA_TESTS=OFF -DOPENMM_BUILD_DRUDE_OPENCL_LIB=OFF -DOPENMM_BUILD_EXAMPLES=OFF -DOPENMM_BUILD_OPENCL_DOUBLE_PRE=OFF -DOPENMM_BUILD_OPENCL_LIB=OFF -DOPENMM_BUILD_OPENCL_TESTS=OFF -DOPENMM_BUILD_RPMD_OPENCL_LIB=OFF -DOPENMM_BUILD_STATIC_LIB=OFF -DOPENMM_GENERATE_API_DOCS=OFF ../`
and then ran `make -j 7`.
This looks like a promising solution!

@ijpulidos: That's a confusing error, but this page suggests that it may be due to how …
I managed to build it now. There are some things we want to discuss about the building process; I think we are close to getting it "just right", especially when doing it with conda-forge packages and compilers. I'll be raising an issue about this for further discussion.
It wasn't a fresh environment, since I want it to live alongside other packages, such as …
I ran the tests again and the results are as follows.
EDIT: Added previous data for 2 MPI processes WITHOUT the fix.
It seems that using 2 MPI processes is now a bit "worse" compared to the hotfix (triggering atom reordering every time) that I tried last time.
@ijpulidos: in the experiments labeled …
Actually, never mind, it seems consistent with previous results, since that's just the difference from not using the blocking sync flag/option. Last time I didn't try the fix with blocking.
EDIT: Sorry for the noise; just to be explicit, since it gets a bit confusing: these latest tests use …
Then I think it does mean that we are now getting a slightly worse result for 2 MPI processes, as per #3721 (comment). But we do have to take into account that the conda-forge package …
Thanks for the great data, @ijpulidos! @mikemhenry suggested it might be useful to run the OpenMM …
The results for the benchmark comparing … EDIT: These results are comparing …
I don't think this should be a blocker for this issue, but I am interested in improving the performance of the neighbor list by doing some automated tuning. The MD engine I used a lot in graduate school, HOOMD-blue, does some neat stuff with the neighbor list (https://hoomd-blue.readthedocs.io/en/v3.3.0/module-md-nlist.html#md-nlist) to help tune the neighbor list rebuild time. So I think we should find a solution that is on the easy side to implement, but long term I think automated tuning will be the correct approach.
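For illustration only, here is a minimal sketch of the kind of automated tuning being suggested. It is purely hypothetical (not HOOMD-blue's or OpenMM's actual mechanism), and `run_steps` is a stand-in for whatever hook the engine would expose for running a short trial with a given rebuild interval:

```python
import time

def tune_rebuild_interval(run_steps, candidates=(10, 25, 50, 100, 250)):
    """Time a short trial run for each candidate neighbor-list rebuild
    interval and return the fastest one.

    `run_steps(n_steps, rebuild_interval)` is a hypothetical callable
    standing in for whatever hook the MD engine exposes.
    """
    best_interval, best_time = None, float("inf")
    for interval in candidates:
        start = time.time()
        run_steps(500, interval)   # short trial run at this interval
        elapsed = time.time() - start
        if elapsed < best_time:
            best_interval, best_time = interval, elapsed
    return best_interval
```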
Wow! There is improvement in every benchmark! That's great! Thanks for the comprehensive assessment, @ijpulidos!
Sounds like it's worth opening a separate issue to discuss, @mikemhenry!
@jchodera: The x axis is performance (ns/day), so I think it's actually the opposite: this PR actually causes a bit of a slowdown for all systems?
I tried running a couple of benchmarks both with this PR and with revision ae34183, which is the revision on the main branch this branch was created from. I ran each benchmark three times. The speed in ns/day is shown below.

First for apoa1pme:
- This PR: 229.521
- Main branch: 209.075

And for amoebapme:
- This PR: 10.7994
- Main branch: 10.9505

There are fluctuations from one run to another. Neither branch seems to be systematically faster than the other, and I can't see any reason there would be. I added a line to …

Is it possible you're seeing some other effect? For example, the two sets of benchmarks were run on different nodes of a cluster, or something else was running in the background that slowed it down during one set?
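For reference, ns/day figures like the ones above come from timing a fixed number of MD steps. A minimal conversion helper (hypothetical, not part of OpenMM's benchmark script) would look like:

```python
def ns_per_day(n_steps, timestep_ps, elapsed_seconds):
    """Convert a timed run of n_steps MD steps into simulated ns per day."""
    simulated_ns = n_steps * timestep_ps / 1000.0
    return simulated_ns * 86400.0 / elapsed_seconds

# Example: 5000 steps at a 4 fs timestep taking 7.5 s of wall-clock time.
print(ns_per_day(5000, 0.004, 7.5))   # ~230 ns/day
```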
These were all run locally on my laptop using an RTX A4000 external GPU. I didn't use the revision you mention, but whatever was the latest on the main branch at the time.
Try using that revision to be sure no other change is affecting it. It's also good to switch back and forth a few times. Run one version a couple of times, then the other, then switch back to the first and try it again, and so on. That reduces the risk of results being influenced by what else was going on at the time. To get reproducible benchmarks, it's also good to quit other applications and check …
@peastman You were correct. The results are the same for all the benchmarks when running from revision ae34183 and from the changes in this PR. My previous results were not wrong, they were just not measuring the right thing, since they seemed to include extra performance improvements in the …
Probably from #3720. Thanks!
Here are some other good benchmarks to try.
Here are some more benchmarks. I started by simulating ApoA1 for 10 ns, saving a conformation to a DCD file every 100 ps. I then loaded in the file and evaluated the energies of all 100 conformations, either in order or in a random order. I tried both for ae34183 and for this PR and repeated each measurement three times. Here is the total time (in seconds) for the 100 energy evaluations with the CUDA platform.

Random order:
- ae34183: 0.8552842140197754
- This PR: 0.8529636859893799

In order:
- ae34183: 0.9200880527496338
- This PR: 0.6829240322113037

When evaluating them in random order, this PR doesn't seem to have any significant effect on speed. When evaluating them in order, the version with this PR gets faster, which is what I'd expect: successive frames have related atom positions, so the atom ordering should be better. But for the existing code, evaluating frames in order actually makes it slightly slower. I can't think of any reason for that, but I switched back and forth a few times and it's reproducible. I also tried OpenCL and the results are very similar. In order is faster than random with this PR, but slightly slower for the existing code. I don't understand that.

As another benchmark, I tried running an energy minimization from the final state in the trajectory. I repeated it ten times for both versions of the code. Here are the times in seconds for the CUDA platform.

- ae34183: 17.01730704307556
- This PR: 12.159980535507202

The difference between them is not statistically significant.
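A rough sketch of this re-evaluation benchmark, for anyone who wants to reproduce it. The file names and force field here are placeholders (the actual ApoA1 setup is not shown in this thread), and MDTraj is assumed for reading the DCD file:

```python
import random
import time

import mdtraj
from openmm import LangevinMiddleIntegrator, Platform, unit
from openmm.app import PDBFile, ForceField, PME, Simulation

# Placeholder inputs; the actual benchmark used an ApoA1 trajectory
# saved every 100 ps over a 10 ns run (100 frames).
pdb = PDBFile("apoa1.pdb")
forcefield = ForceField("amber14-all.xml", "amber14/tip3p.xml")
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME)
integrator = LangevinMiddleIntegrator(300 * unit.kelvin, 1 / unit.picosecond,
                                      0.004 * unit.picoseconds)
simulation = Simulation(pdb.topology, system, integrator,
                        Platform.getPlatformByName("CUDA"))

traj = mdtraj.load("apoa1.dcd", top="apoa1.pdb")
frames = list(range(traj.n_frames))
random.shuffle(frames)       # comment this out to evaluate frames in order

start = time.time()
for i in frames:
    simulation.context.setPositions(traj.openmm_positions(i))
    simulation.context.getState(getEnergy=True).getPotentialEnergy()
print("Total time (s):", time.time() - start)
```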
@peastman: Great to see! Any more experiments you'd like to see? Or can we merge this?
Also, just curious: what is your timeline for releasing OpenMM 8? I know the beta is already on conda-forge, but when do you think 8.0 itself will be released?
The original schedule was to have released the beta about a month ago, which would have meant doing a release candidate around now. But we've been stuck for weeks on trying to build conda packages for the plugins. :( Whenever we finally manage to get all of them built, I'll do a new updated beta build of OpenMM, then we can announce it and ask users to start testing it, and then follow with the release candidate about a month later. Based on the tests above, I'm ok with merging this if everyone else is comfortable with it.
No objections! This seems like a robust improvement!
Yay!
This is a possible fix for choderalab/perses#613. See the discussion in choderalab/perses#613 (comment) and following.
Normally it reorders atoms every 250 steps. The change here is that if it ever sees that the size of the neighbor list has increased by more than 10% since the last reordering, it immediately forces a new reordering to try to bring it back down. However, it will still never reorder more often than every 25 steps; that's to reduce the risk of catastrophic slowdowns in pathological cases.
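For illustration, a minimal sketch of the decision rule described above (illustrative Python pseudocode, not the actual OpenMM C++ implementation; the names and the function itself are hypothetical):

```python
REORDER_INTERVAL = 250      # normal reordering period, in steps
MIN_REORDER_SPACING = 25    # never reorder more often than this
GROWTH_THRESHOLD = 1.10     # force a reorder if the list grows by more than 10%

def should_reorder(steps_since_reorder, list_size, list_size_at_last_reorder):
    """Illustrative restatement of the reordering rule described above."""
    if steps_since_reorder < MIN_REORDER_SPACING:
        return False    # cap on how often reordering can happen
    if steps_since_reorder >= REORDER_INTERVAL:
        return True     # regular periodic reordering
    if list_size > GROWTH_THRESHOLD * list_size_at_last_reorder:
        return True     # neighbor list grew too much: reorder early
    return False
```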
This code should be considered experimental. Before we decide whether to merge it or try a different approach instead, we need to thoroughly test it to see how it affects performance on all the relevant use cases.
cc @zhang-ivy @ijpulidos @jchodera