
Feature restart #5

Merged: 28 commits into main from feature_restart on Nov 14, 2023
Conversation

@mb2055 (Contributor) commented Nov 9, 2023

Adds support for restarting from checkpoints created by previous SOMD2 simulations. This is achieved with a new restart config option: if True, SOMD2 will search the output directory for pre-existing configs and checkpoint files and use the checkpoint files as the starting point for the current run.
Features/requirements to note:

  • runtime should be set to the total desired runtime for each window. If, for example, runtime is set to 5 ns and the pre-existing checkpoint file already contains 2 ns of simulation, the restarted run will simulate a further 3 ns (bringing the total to 5 ns). See the sketch after this list.
  • In order for a simulation to restart from a checkpoint, the configuration must match that of the previous run (some settings are allowed to change; see this commit for the list). If no config.yaml is present in the output directory, the configuration is checked against that of the lambda=0 checkpoint file.
  • restart is False by default, in which case any files already present in the output directory may be deleted. The list of files is shown as a logger warning and the user must press Enter to confirm deletion. This confirmation can be disabled by adding --supress-overwrite-warning on the command line (or setting supress_overwrite_warning = True when using the Python API).
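A rough sketch of how this might look via the Python API follows. The Config and Runner signatures, the Runner.run() call and the input file name are assumptions based on the module layout in this PR, not the definitive interface.

# Sketch only: Config/Runner signatures and Runner.run() are assumed.
import sire as sr

from somd2.config import Config
from somd2.runner import Runner

# Hypothetical perturbable input system.
system = sr.stream.load("some_system.s3")

# First run: 2 ns per window, checkpoints written to "output".
config = Config(runtime="2ns", output_directory="output")
Runner(system, config).run()

# Restart: runtime is the *total* desired time per window, so each window
# only simulates the remaining 3 ns (5 ns total minus the 2 ns already in
# the checkpoint files).
config = Config(
    runtime="5ns",
    restart=True,
    output_directory="output",
    supress_overwrite_warning=True,  # skip the Enter-to-confirm prompt
)
Runner(system, config).run()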

mb2055 added 24 commits November 3, 2023 15:44
Adds a restart flag that checks the output directory for pre-existing sire systems; if present, they are loaded and used as the starting point for the runner. In this case runtime represents the total time for the system. Also adds a supress_overwrite_warning flag that suppresses the need for the user to press Enter before files are overwritten when restart=False.
Fixes a bug in which the charge scale factor was referenced before being set during the setup of a charge-scaled morph. It is now set to a temporary value, which is overwritten once the charge scale factor is set.
…t cannot be changed between runs, throw the relevant error. Options that can be changed are the following: 'runtime', 'restart', 'temperature', 'minimise', 'max_threads', 'equilibration_time', 'equilibration_timestep', 'energy_frequency', 'save_trajectory', 'frame_frequency', 'save_velocities', 'checkpoint_frequency', 'platform', 'max_gpus', 'run_parallel', 'save_trajectories', 'write_config', 'log_level', 'log_file', 'supress_overwrite_warning'
…itself at the correct level.

Also adds a test for the writing of the logfile
Added extra parameter to config.as_dict to make it equivalent to sire outputs.

Restarts are checked against the previous config; if this config doesn't exist, the runner checks the lambda=0 checkpoint file for the previous config.

yaml safe load throws a different exception on Windows, which was causing the test to fail. Should now be fixed.
…o Path in as_dict, removing lambda symbols on Windows, changed filenames in test_logfile_creation
Config encoded into checkpoint files must match either the config.yaml file or the config from the lambda=0 window
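As a rough illustration of this check (not the actual _compare_configs implementation), the comparison might look like the following. The set of changeable options shown here is an illustrative subset of the list in the commit message above, and the paths and variable names are hypothetical.

import yaml

# Subset of the options that are allowed to differ between runs; the full
# list is given in the commit message above.
_CHANGEABLE = {"runtime", "restart", "minimise", "max_threads",
               "log_level", "log_file", "supress_overwrite_warning"}

def changed_options(new, old):
    """Return option names that differ and are not allowed to change."""
    keys = (set(new) | set(old)) - _CHANGEABLE
    return sorted(k for k in keys if new.get(k) != old.get(k))

# Compare against the previous config.yaml if present, otherwise against the
# config stored in the lambda=0 checkpoint file.
with open("output/config.yaml") as f:
    previous = yaml.safe_load(f)

# config is the Config object for the current run (see as_dict above).
mismatched = changed_options(config.as_dict(), previous)
if mismatched:
    raise ValueError(f"These options cannot change between runs: {mismatched}")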
@lohedges lohedges self-requested a review November 9, 2023 12:28
@lohedges (Contributor) left a comment

Thanks, @mb2055, this looks great. There are just a few minor comments and things to address.

Other than those, my only thought is whether we want to make the system command-line option an optional argument rather than a required positional argument. That way you could simply omit the system when running a restart and we'd just check the output directory. In this case you would run somd2 as:

# First run.
somd2 --system some_system.s3

# Second run. (Will check and use output in the output directory.)
somd2 --restart --runtime 10ns

This avoids the confusion of the user passing in a different system to the one in the checkpoint files in the output directory. If they pass a system as well, then we could just warn that they are performing a restart and that it will be ignored.

What do you think?

Inline review comments (all resolved):
  • src/somd2/config/_config.py (2)
  • src/somd2/runner/_dynamics.py (1)
  • src/somd2/runner/_runner.py (3)
  • tests/runner/test_config.py (2)
  • tests/runner/test_restart.py (1)
@mb2055 (Contributor, Author) commented Nov 14, 2023

I originally considered making system an optional argument. I decided against it because, when using --restart, any lambda values for which a checkpoint file isn't present will start from scratch using the base system.
With that being said, there's no reason the behaviour can't be changed to, say, start from the original system for missing lambda values if the --system argument is given, and only simulate the lambda values that are present if it isn't.

@lohedges (Contributor) commented

That's a good point. For now I think it's easiest to keep things as they are. I imagine the use case might just be re-running the same command multiple times in a loop, e.g. to re-run until all windows complete, in which case the command line will be the same. We should probably document this behaviour somewhere, though. It also means that we probably need to do some basic system comparison checks, i.e. same number of molecules, residues, atoms, etc.

Changes/removal of comments in both config and test files.

Removal of a redundant variable in config.

lam_sym is now _lam_sym and is only declared in runner.py, being imported by dynamics.py.

Temperature can no longer be changed between restarts; an appropriate test has been added to ensure this.

extra_args is no longer explicitly checked in the _compare_configs function, and it is now ignored by the _to_dict function of _config.py.
…ed to raise a warning here as the user is warned in the runner before any files are removed. [ci skip]
@lohedges (Contributor) commented

Thanks, the updates look good.

Is it possible to add a simple check that the checkpoint systems are the same as the one passed from the command line? You could first validate that the UID of each checkpoint matches, then make sure that it has the same number of molecules, residues, and atoms as the one from the command line. If every lambda window has a checkpoint file, then you might want to log a warning that the system passed via the command line will be ignored. (In that case there is probably no need to validate that it's the same as each of the checkpoints.)

…er of molecules, number of residues and number of atoms, returning True only if all match.

This function is used to check equivalence between checkpoint files and the original system
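For reference, a minimal sketch of this equivalence check (the helper name is hypothetical; the legacy system._system.uid() accessor is discussed below):

def _systems_are_equivalent(system, other):
    """Return True if two sire systems appear to be the same system, based
    on the UID and the molecule, residue and atom counts."""
    return (
        system._system.uid() == other._system.uid()
        and system.num_molecules() == other.num_molecules()
        and system.num_residues() == other.num_residues()
        and system.num_atoms() == other.num_atoms()
    )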
@mb2055 (Contributor, Author) commented Nov 14, 2023

The latest commit adds a check that checkpoint files match the original system.
A couple of minor points:

  • It assumes the new sire API is being used and calls system._system.uid().
  • There is no warning that the original system will be ignored, because of where the equivalence is checked. The check is performed within each child process (when the checkpoint is loaded), so there is no easy way to verify whether every lambda value has been loaded from a checkpoint. There are a couple of ways we could overcome this if the warning is really needed: we could instantiate a shared list that stores a boolean for each system (this would only print a warning after all systems have been loaded, which would be delayed because they are only loaded as they are about to be simulated), or we could pre-load all of the systems into temporary variables, check their equivalence with the base system and then delete them (sketched below). I don't know how long it takes to load sire systems from binary files, but for large systems this could add some overhead.
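For illustration, the pre-loading option might look roughly like this, reusing the equivalence helper sketched earlier; the function name and the way checkpoint paths are gathered are hypothetical.

import sire as sr

def all_checkpoints_match(base_system, checkpoint_paths):
    """Pre-load every checkpoint, compare it to the base system, then discard it."""
    for path in checkpoint_paths:
        checkpoint = sr.stream.load(str(path))
        if not _systems_are_equivalent(base_system, checkpoint):
            return False
        # Discard immediately; holding all systems in memory could be costly
        # for large systems.
        del checkpoint
    return True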

@lohedges (Contributor) commented

Good point. I think it's okay to just perform the check when the checkpoint is loaded, rather than pre-loading everything. A mismatch would essentially never happen in practice, and the check just ensures that we don't run any windows where the system doesn't match.

@lohedges (Contributor) left a comment

Thanks @mb2055. Looks good!

@lohedges lohedges merged commit a963e1c into main Nov 14, 2023
6 checks passed
@lohedges lohedges deleted the feature_restart branch November 14, 2023 16:02