Cleanup contexts upon calculation completion, failure #354
Conversation
This adds calls to `ContextCache.empty` at the end of repex execution, or in the case of execution failure, to avoid leaving behind stale contexts. This can be important for long-running processes executing multiple `ProtocolDAG`s in sequence.
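As a rough sketch of the shape of this change (assuming openmmtools-style `ContextCache` objects; `run_repex` is a stand-in, the real edits live inside the repex execution path):

```python
from openmmtools import cache

def run_with_cleanup(run_repex, platform):
    # Caches created for this execution; `run_repex` stands in for the actual
    # replica-exchange run and is not the real protocol code.
    energy_context_cache = cache.ContextCache(platform=platform)
    sampler_context_cache = cache.ContextCache(platform=platform)
    try:
        return run_repex(energy_context_cache, sampler_context_cache)
    finally:
        # Empty the caches on success *and* on failure so no stale Contexts
        # (and the GPU memory they hold) are left behind.
        energy_context_cache.empty()
        sampler_context_cache.empty()
```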
Worked with @jchodera earlier today on this; he suggested this change due to the behavior we were seeing. I'm seeing behavior on my local test box that suggests contexts are not currently being cleaned up; I have yet to test whether this PR addresses it.
Can confirm that we observe GPU memory exhaustion when running many RBFE calculations.
Also end up getting many of these:
Running a test now on my local box to see if the changes in this PR give stable memory use over time. Will report back on observations.
This may not yet be solving the issue, but can't conclude yet; will let the test box run overnight and see what state it's in in the morning.
Codecov Report
Patch coverage:

```diff
@@            Coverage Diff             @@
##              main     #354      +/-  ##
==========================================
- Coverage    90.66%   90.59%    -0.07%
==========================================
  Files           92       92
  Lines         5718     5741       +23
==========================================
+ Hits          5184     5201       +17
- Misses         534      540        +6
```

... and 1 file with indirect coverage changes. ☔ View full report in Codecov by Sentry.
Can you move the try/except down to only encapsulate the problematic section?
This is rather odd; I've run multi-day-long DAGs before and never saw that behaviour.
```python
energy_context_cache.empty()
sampler_context_cache.empty()
```
This feels like it should (eventually) get upstreamed so that these context objects are used via a context-manager pattern.
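A sketch of what that could look like (this wrapper is illustrative only; it is not an existing openmmtools API):

```python
from contextlib import contextmanager
from openmmtools import cache

@contextmanager
def managed_context_cache(**kwargs):
    # Build the cache, hand it out, and always empty it on the way out.
    ctx_cache = cache.ContextCache(**kwargs)
    try:
        yield ctx_cache
    finally:
        ctx_cache.empty()

# usage:
# with managed_context_cache() as sampler_context_cache:
#     ...  # run the sampler with this cache
```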
I think as a short-term solution, it might be good to at least make sure that the context is cleanly removed in `MultiStateSampler.__del__`.
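Roughly what that could look like (attribute names here are assumptions; the actual proposal lives upstream in choderalab/openmmtools#690):

```python
from openmmtools import cache

class ContextCleanupMixin:
    """Illustrative mixin: empty any context caches when the sampler is collected."""

    def __del__(self):
        for name in ("energy_context_cache", "sampler_context_cache"):
            ctx_cache = getattr(self, name, None)
            if isinstance(ctx_cache, cache.ContextCache):
                ctx_cache.empty()
```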
Indeed, upstreaming here: choderalab/openmmtools#690
Also cc choderalab/openmmtools#688
@IAlibay re: your multiple long DAGs, were these in different Python invocations? I think this might be surfacing because dotson has a single long-running process issuing these (which shouldn't be bad), whereas if you were quickrunning in succession you'd have been tearing down the memory footprint between each
No, recently it was the one big Python loop, so it should all be the same process. I'd say the most I tried is 12 tyk2 complex calculations in a loop, so if it's a large-scale thing it might explain why we didn't see it.
Running another testing round on the local box with the latest changes from the power hour discussion. Will report back behavior.
These changes still appear insufficient to avoid growing system and GPU memory use. As we saw from discussion, we observe the problem after ~40 successful runs.
Addressing your comments now @IAlibay.
Running another round of local tests with latest changes, including what we propose to upstream in choderalab/openmmtools#690.
@dotsdl - latest changes seem to be preventing memory leaks on my test loop, if you can test it on your end that'd be great.
```python
del cache.global_context_cache._lru._data[context]

del sampler_context_cache, energy_context_cache
if not dry:
```
This won't delete the integrator on dry runs; one would hope that alchemiscale-style services wouldn't be running in dry mode anyway?
Yeah, I don't see a scenario where we would use `dry` on alchemiscale, so I think we're good here.
Seems more of a user-host thing for debugging.
Now equally guilty of shaking the code until something happens.
@IAlibay haha sometimes it's a good strategy. 😁
😅 You mean to say there are other strategies? In any case, I'll happily merge this & cut a release once we know it's stable.
I think that was it @IAlibay! Great work! I am no longer seeing rising GPU memory usage, but still observe rising system memory usage. Do you observe the same? If you don't also observe rising system memory, this may be a symptom of the
Awesome, that's one thing dealt with. Yeah, the host memory leak tracks with what I'm seeing too :/ Off the top of my head it's possibly the HTF or the generator objects.
@dotsdl - so with these latest changes I do see host memory creep up, but it plateaus at ~9 GB of total system memory for three concurrent DAG execution loops on my workstation (with a gc.collect() call at the end of each loop). It's hard to know if this is an actual memory leak or just garbage collection being quite lazy when there's lots of RAM left (I still have another 23 GB of RAM free..). For now I'm going to leave it as-is since it seems stable, and honestly if we manage to get 12 iterations of a complex calculation in 24h that would be rather amazing performance. Given it's so late in the day, I'm going to hold off on merging this whilst I quickly do a couple of actual simulations overnight. Just want to check we didn't drastically break stuff somewhere here. If everything looks fine I'll merge / release tomorrow in the US morning.
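For reference, a sketch of the kind of back-to-back DAG execution loop described here; `execute_one_dag` is a placeholder for however a single ProtocolDAG gets executed, not a specific gufe/openfe API:

```python
import gc

def run_dags_back_to_back(protocol_dags, execute_one_dag):
    """Execute many ProtocolDAGs in a single long-lived process."""
    results = []
    for dag in protocol_dags:
        results.append(execute_one_dag(dag))
        # Force a collection pass between DAGs so objects that just went out of
        # scope (contexts, reporters, samplers) are reclaimed promptly.
        gc.collect()
    return results
```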
```diff
@@ -34,7 +34,7 @@ def __init__(self, *args, hybrid_factory=None, **kwargs):
         self._hybrid_factory = hybrid_factory
         super(HybridCompatibilityMixin, self).__init__(*args, **kwargs)

-    def setup(self, reporter, platform, lambda_protocol,
+    def setup(self, reporter, lambda_protocol,
```
Creating a context here was absolutely unnecessary, so it's been removed.
```python
# we won't take any steps, so use a simple integrator
integrator = openmm.VerletIntegrator(1.0)
platform = openmm.Platform.getPlatformByName('CPU')
dummy_cache = cache.DummyContextCache(platform=platform)
```
Using a dummy context cache for the convenience of being able to pass a thermodynamic state object, rather than duplicating effort and writing the 3-4 extra lines here ourselves.
```python
    )
    sampler_state.update_from_context(context)
finally:
    del context, integrator, dummy_cache
```
Being a bit overly cautious but it'll do I think.
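A self-contained sketch of the pattern these two hunks follow, assuming openmmtools' `DummyContextCache` and `SamplerState` APIs; the function name and structure are illustrative, not the PR's exact code:

```python
import openmm
from openmmtools import cache

def refresh_sampler_state(thermodynamic_state, sampler_state):
    context = integrator = dummy_cache = None
    try:
        # We won't take any steps, so a simple integrator is enough.
        integrator = openmm.VerletIntegrator(1.0)
        platform = openmm.Platform.getPlatformByName("CPU")
        # DummyContextCache lets us pass the thermodynamic state directly
        # instead of building the Context by hand.
        dummy_cache = cache.DummyContextCache(platform=platform)
        context, integrator = dummy_cache.get_context(thermodynamic_state, integrator)
        sampler_state.update_from_context(context)
    finally:
        # Drop every reference so the Context can actually be freed.
        del context, integrator, dummy_cache
```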
```python
minimize(compound_thermostate_copy, sampler_state,
         max_iterations=minimization_steps)
sampler_state_list.append(copy.deepcopy(sampler_state))

del compound_thermostate, sampler_state
```
I think that should be fine.
All good @IAlibay! Will await your conclusions and
Since we're now using the CPU platform for minimization, I need to set
Running another burn test on my host here; will report back observations as well.
Minimization is extremely slow on the CPU. I highly recommend doing this on the GPU. Some considerations for minimization of replicas:
Yeah, I do observe an idle GPU on my test host for some time while minimizations are happening, which may be a bit of a bottleneck. However, for production-length runs this may not end up mattering much, since the minimization may still be very short compared to the production runtime. The tests I'm running now are very short, so the minimization period is quite noticeable. If it's possible to take @jchodera's suggestions above and run the minimizations on the GPU, that may still be preferable though. The good news is that so far running 6 workers with a mix of complex and solvent transformation
@jchodera - For context here, this is only a 100-step pre-minimization to get out of potentially really bad high-energy states. The reason we're doing this here is that we were seeing cases of silent overflows when minimizing on GPUs. Admittedly, I think this might have all been tied into the FIRE minimizer issues we recently reported; however, our take was that being a bit extra cautious for now might only cost us an extra couple of minutes? Unless you think this might be unreasonably slow for larger systems (TYK2 seems to be fine, but it's not exactly the largest system)?
I note that the "proper" minimization (i.e. https://github.com/OpenFreeEnergy/openfe/blob/main/openfe/protocols/openmm_rfe/equil_rfe_methods.py#L652) should be assigned the GPU platform @dotsdl (or at least whatever platform was chosen by default in settings).
That makes sense, thanks! For this PR (just so we can quickly deploy on alchemiscale), I'm going to say let's take the extra couple of seconds' performance hit, and then we can follow up with a fix that does exactly as @jchodera suggests? If anything, the performance loss of building / creating contexts is minuscule given we're using a wildly inefficient integrator (see #332).
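For illustration, a minimal sketch of the ~100-step CPU pre-minimization under discussion, written against plain OpenMM; the function name and defaults are assumptions, not the protocol's actual code:

```python
import openmm
from openmm import unit

def preminimize_on_cpu(system, positions, max_iterations=100):
    """Short CPU pre-minimization to escape very high-energy starting states."""
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    platform = openmm.Platform.getPlatformByName("CPU")
    context = openmm.Context(system, integrator, platform)
    context.setPositions(positions)
    # Tolerance is left at the OpenMM default; only the iteration count is capped.
    openmm.LocalEnergyMinimizer.minimize(context, maxIterations=max_iterations)
    minimized = context.getState(getPositions=True).getPositions(asNumpy=True)
    del context, integrator  # release the Context explicitly
    return minimized
```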
Observing a slow climb in host memory usage, but I still think this looks tolerable if it scales with task count, not task length, given how artificially short the test tasks are. I ran 240 tasks, 120 each of complex and solvent.
This is after running for nearly 4 hours, with an almost unnoticeable climb in memory usage from that observed over 2 hours ago. I believe this is consistent with what you observed as well @IAlibay.
Yes, I think we're clear to proceed with this as-is. Thank you for all the solid sleuthing on this @IAlibay! I think in practice we'll get plenty of throughput out of the implementation as it stands, and we can work on greater efficiency when we're not under the current time pressure.
Aha! I didn't know OpenMM's LocalEnergyMinimizer was this smart: https://github.com/openmm/openmm/blob/master/openmmapi/src/LocalEnergyMinimizer.cpp#L54 Seems like it'll always do a CPU fallback if there's a force-clipping issue. In that case, let's keep the old behaviour here and just stick to whatever the default platform is for this initial minimization. Once choderalab/openmmtools#672 is merged, we can probably get rid of this pre-minimization completely.
```python
del reporter

# clean up the analyzer
if not dry:
```
This needs a bit of a cleanup, but I'll do it in the upcoming refactor.
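Illustrative only: one way the reporter/analyzer teardown could be tidied up in that refactor. The explicit close() call is an assumption on my part (MultiStateReporter does expose close()); the diff itself only drops references.

```python
def cleanup_reporter_and_analyzer(reporter, analyzer, dry):
    # Flush and close the reporter's underlying storage before letting it go.
    reporter.close()
    if not dry:
        # Drop this scope's reference to the analyzer so it doesn't keep the
        # reporter (and any cached contexts) alive longer than needed.
        del analyzer
```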
Quick test update - we're getting reproducible results for TYK2's ejm_31_ejm_46 edge - within 0.1 kcal/mol. Everything seems fine.
Alright, thanks for all the work here @dotsdl - let's move towards a release 🎉
@IAlibay: There are definitely deficiencies with the FIRE minimizer, and we're hoping to be able to address those at some point in the near future. But PME systems will get very slow on the CPU, even for 100 steps. Can you elaborate on the kinds of "silent overflows" you ran into? This was with
This is a reasonable solution! The main concern here was ensuring there were no device resource leaks, not performance.
I got muddled a bit here and had completely forgotten for this PR - it turns out that LocalEnergyMinimizer will do a default fallback to CPU if the forces overflow, so I ended up sticking with the old behaviour and just passing whatever the user-defined platform is.
@jchodera: If I remember correctly, the issue here was that the minimizer would essentially attempt to minimize but fail to do so. However, rather than throwing an exception or directly giving a NaN energy, it would just take the system to an unreasonable energy state; from that point any further attempts to minimize would lead to a system that yields NaNs. I didn't manage to dig into this too much at the time, but avoiding the default...
I suspect the issue with the FIRE minimizer is that, similar to LocalEnergyMinimizer on non-CPU platforms, the forces overflow but not the energy accumulator, leading to this weird behaviour. (I wish I had taken much better notes when I encountered it; I'll try to reproduce this again once things are a bit less hectic.)