Improve robustness against OpenMMException (e.g. CUDA_ERROR_ILLEGAL_ADDRESS) #928

jchodera · 2022-01-30T03:46:26Z

We should implement some level of protection against these failures. One option is to wrap the whole simulation in a try...except block. That might be the most conservative, safest option, since we don't know how many CUDA contexts need to be invalidated if this occurs.

mikemhenry · 2022-03-01T16:45:10Z

Retry logic will go here:
https://github.com/choderalab/perses/blob/main/perses/app/setup_relative_calculation.py#L758

mikemhenry · 2022-03-10T22:56:55Z

When using a code entry point other than the YAML one (for example, PointMutationExecutor), the user will need to add the restart logic themselves (for now, until we fix the API). I'll update the documentation, but it will be something like:

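from openmm import OpenMMException

# hss, n_cycles, and _logger are assumed to be defined by the surrounding script
retry_attempt = 0
MAX_ATTEMPTS = 5
while retry_attempt < MAX_ATTEMPTS:
    try:
        hss.extend(n_cycles)
        break  # success: stop retrying
    except OpenMMException as err:
        _logger.error(f"OpenMMException! {err}")
        retry_attempt += 1
        _logger.error(f"retry attempt {retry_attempt}/{MAX_ATTEMPTS}")
else:
    # Reached only if the loop never hit `break`, i.e. every attempt failed
    _logger.error(f"Failed to retry simulation in {MAX_ATTEMPTS} attempts")
    _logger.error("Will try one last time and not catch the exception")
    hss.extend(n_cycles)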
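
If several entry points need the same behavior, the loop could be factored into a small helper. The following is only a sketch, not the code that went into perses: `run_with_retries` and its parameters are hypothetical names, and unlike the snippet above it re-raises on the last of `max_attempts` tries instead of making one extra uncaught attempt.

```python
import logging
import time

_logger = logging.getLogger(__name__)


def run_with_retries(func, max_attempts=5, retry_on=(Exception,), delay_seconds=0.0):
    """Call func() until it succeeds or max_attempts calls have failed.

    Hypothetical helper for illustration. Returns whatever func() returns.
    The final failure is re-raised so the caller sees the original traceback.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as err:
            _logger.error("attempt %d/%d failed: %s", attempt, max_attempts, err)
            if attempt == max_attempts:
                raise  # out of attempts: let the exception propagate
            time.sleep(delay_seconds)  # optional pause before the next try
```

Assuming `hss` and `n_cycles` are set up as above, a call might look like `run_with_retries(lambda: hss.extend(n_cycles), max_attempts=5, retry_on=(OpenMMException,))`. Whether the CUDA context is still usable after such a failure is not settled in this thread, so a retry may also need to rebuild the simulation objects.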

Some changes to the minimizer may also help with these errors:
choderalab/openmmtools#557

jchodera added enhancement ✨ effort: low labels Jan 30, 2022

jchodera added this to the 0.10.0 Enhanced CLI with more stable input/output formats milestone Jan 30, 2022

mikemhenry self-assigned this Mar 1, 2022

mikemhenry mentioned this issue Mar 10, 2022

Feat/robust retry logic #960

Closed

ijpulidos modified the milestones: 0.10.0 Enhanced CLI with more stable input/output formats, 0.11.0: New top-level API Mar 29, 2022
