
Feature restart #5

Merged: 28 commits into main from feature_restart on Nov 14, 2023
Conversation

@mb2055 (Contributor) commented Nov 9, 2023

Adds support for restarting from checkpoints created by previous SOMD2 simulations. This is achieved with a new restart config option: if True, SOMD2 will search the output directory for pre-existing configs and checkpoint files and use the checkpoint files as the starting point for the current run.
Features/requirements to note:

  • runtime should be set to the total desired runtime for each window. If, for example, runtime is set to 5 ns and the pre-existing checkpoint file already contains 2 ns of simulation, the restarted run will simulate a further 3 ns (bringing the total to 5 ns). See the sketch after this list.
  • In order for a simulation to restart from a checkpoint, the configuration must match that of the previous run (some settings are allowed to change; see this commit for the list). If no config.yaml is present in the output directory, the configuration is checked against that of the lambda=0 checkpoint file.
  • restart is False by default, in which case any files already present in the output directory may be deleted. The list of files is shown as a logger warning and the user must press Enter to confirm deletion. This confirmation can be disabled by adding --supress-overwrite-warning on the command line (or setting supress_overwrite_warning = True when using the Python API).
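A rough sketch of how this might look via the Python API follows. The Config and Runner signatures, the Runner.run() call and the input file name are assumptions based on the module layout in this PR, not the definitive interface.

# Sketch only: Config/Runner signatures and Runner.run() are assumed.
import sire as sr

from somd2.config import Config
from somd2.runner import Runner

# Hypothetical perturbable input system.
system = sr.stream.load("some_system.s3")

# First run: 2 ns per window, checkpoints written to "output".
config = Config(runtime="2ns", output_directory="output")
Runner(system, config).run()

# Restart: runtime is the *total* desired time per window, so each window
# only simulates the remaining 3 ns (5 ns total minus the 2 ns already in
# the checkpoint files).
config = Config(
    runtime="5ns",
    restart=True,
    output_directory="output",
    supress_overwrite_warning=True,  # skip the Enter-to-confirm prompt
)
Runner(system, config).run()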

mb2055 added 24 commits November 3, 2023 15:44
Adds a restart flag that checks the output directory for pre-existing sire systems; if present, they are loaded and used as the starting point for the runner. In this case runtime represents the total time for the system. Also adds a supress_overwrite_warning flag that suppresses the need for the user to press Enter before files are overwritten when restart=False.
Fixes a bug in which the charge scale factor was referenced before being set during the setup of a charge-scaled morph. It is now set to a temporary value, which is overwritten once the charge scale factor is set.
…t cannot be changed between runs, throw the relevant error. Options that can be changed are the following: 'runtime', 'restart', 'temperature', 'minimise', 'max_threads', 'equilibration_time', 'equilibration_timestep', 'energy_frequency', 'save_trajectory', 'frame_frequency', 'save_velocities', 'checkpoint_frequency', 'platform', 'max_gpus', 'run_parallel', 'save_trajectories', 'write_config', 'log_level', 'log_file', 'supress_overwrite_warning'
…itself at the correct level.

Also adds a test for the writing of the logfile
Added extra parameter to config.as_dict to make it equivalent to sire outputs.

Restarts are checked against the previous config; if this config doesn't exist, the runner checks the lambda=0 checkpoint file for the previous config.

yaml safe load throws a different exception on Windows, which was causing the test to fail. Should now be fixed.
…o Path in as_dict, removing lambda symbols on Windows, changed filenames in test_logfile_creation
Config encoded into checkpoint files must match either the config.yaml file or the config from the lambda=0 window
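As a rough illustration of this check (not the actual _compare_configs implementation), the comparison might look like the following. The set of changeable options shown here is an illustrative subset of the list in the commit message above, and the paths and variable names are hypothetical.

import yaml

# Subset of the options that are allowed to differ between runs; the full
# list is given in the commit message above.
_CHANGEABLE = {"runtime", "restart", "minimise", "max_threads",
               "log_level", "log_file", "supress_overwrite_warning"}

def changed_options(new, old):
    """Return option names that differ and are not allowed to change."""
    keys = (set(new) | set(old)) - _CHANGEABLE
    return sorted(k for k in keys if new.get(k) != old.get(k))

# Compare against the previous config.yaml if present, otherwise against the
# config stored in the lambda=0 checkpoint file.
with open("output/config.yaml") as f:
    previous = yaml.safe_load(f)

# config is the Config object for the current run (see as_dict above).
mismatched = changed_options(config.as_dict(), previous)
if mismatched:
    raise ValueError(f"These options cannot change between runs: {mismatched}")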
@lohedges lohedges self-requested a review November 9, 2023 12:28
@lohedges (Contributor) left a comment

Thanks, @mb2055, this looks great. There are just a few minor comments and things to address.

Other than those, my only thought is whether we want to make the system command-line option an optional argument rather than a required positional argument. That way you could simply omit the system when running a restart and we'd just check the output directory. In this case you would run somd2 as:

# First run.
somd2 --system some_system.s3

# Second run. (Will check and use output in the output directory.)
somd2 --restart --runtime 10ns

This avoids the confusion of the user passing in a different system to the one in the checkpoint files in the output directory. If they pass a system as well, then we could just warn that they are performing a restart and that it will be ignored.

What do you think?

Inline review comments (all resolved):
  • src/somd2/config/_config.py (2)
  • src/somd2/runner/_dynamics.py (1)
  • src/somd2/runner/_runner.py (3)
  • tests/runner/test_config.py (2)
  • tests/runner/test_restart.py (1)
@mb2055 (Contributor, Author) commented Nov 14, 2023

I originally considered making system an optional argument. I decided against it because, when using --restart, any lambda values for which a checkpoint file isn't present will start from scratch using the base system.
With that being said, there's no reason the behaviour can't be changed to, say, start from the original system for missing lambda values if the --system argument is given, and only simulate the lambda values that are present if it isn't.

@lohedges (Contributor) commented

That's a good point. For now I think it's easiest to keep things as they are. I imagine the use case might just be re-running the same command multiple times in a loop, e.g. to re-run until all windows complete, in which case the command line will be the same. We should probably document this behaviour somewhere, though. It also means that we probably need to do some basic system comparison checks, i.e. same number of molecules, residues, atoms, etc.

Changes/removal of comments in both config and test files.

Removal of a redundant variable in config.

lam_sym is now _lam_sym and is only declared in runner.py, being imported by dynamics.py.

Temperature can no longer be changed between restarts; an appropriate test has been added to ensure this.

extra_args is no longer explicitly checked in the _compare_configs function, and it is now ignored by the _to_dict function of _config.py.
…ed to raise a warning here as the user is warned in the runner before any files are removed. [ci skip]
@lohedges (Contributor) commented

Thanks, the updates look good.

Is it possible to add a simple check that the checkpoint systems are the same as the one passed from the command line? You could first validate that the UID of each checkpoint matches, then make sure that it has the same number of molecules, residues, and atoms as the one from the command line. If every lambda window has a checkpoint file, then you might want to log a warning that the system passed via the command line will be ignored. (In that case there is probably no need to validate that it's the same as each of the checkpoints.)

…er of molecules, number of residues and number of atoms, returning True only if all match.

This function is used to check equivalence between checkpoint files and the original system
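For reference, a minimal sketch of this equivalence check (the helper name is hypothetical; the legacy system._system.uid() accessor is discussed below):

def _systems_are_equivalent(system, other):
    """Return True if two sire systems appear to be the same system, based
    on the UID and the molecule, residue and atom counts."""
    return (
        system._system.uid() == other._system.uid()
        and system.num_molecules() == other.num_molecules()
        and system.num_residues() == other.num_residues()
        and system.num_atoms() == other.num_atoms()
    )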
@mb2055 (Contributor, Author) commented Nov 14, 2023

The latest commit adds a check that checkpoint files match the original system.
A couple of minor points:

  • It assumes the new sire API is being used and calls system._system.uid().
  • There is no warning that the original system will be ignored, because of where the equivalence is checked. The check is performed within each child process (when the checkpoint is loaded), so there is no easy way to verify whether every lambda value has been loaded from a checkpoint. There are a couple of ways we could overcome this if the warning is really needed: we could instantiate a shared list that stores a boolean for each system (this would only print a warning after all systems have been loaded, which would be delayed because they are only loaded as they are about to be simulated), or we could pre-load all of the systems into temporary variables, check their equivalence with the base system and then delete them (sketched below). I don't know how long it takes to load sire systems from binary files, but for large systems this could add some overhead.
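For illustration, the pre-loading option might look roughly like this, reusing the equivalence helper sketched earlier; the function name and the way checkpoint paths are gathered are hypothetical.

import sire as sr

def all_checkpoints_match(base_system, checkpoint_paths):
    """Pre-load every checkpoint, compare it to the base system, then discard it."""
    for path in checkpoint_paths:
        checkpoint = sr.stream.load(str(path))
        if not _systems_are_equivalent(base_system, checkpoint):
            return False
        # Discard immediately; holding all systems in memory could be costly
        # for large systems.
        del checkpoint
    return True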

@lohedges (Contributor) commented

Good point. I think it's okay to just perform the check when the checkpoint is loaded, rather than pre-loading everything. A mismatch would essentially never happen in practice, and the check just ensures that we don't run any windows where the system doesn't match.

@lohedges (Contributor) left a comment

Thanks @mb2055. Looks good!

@lohedges lohedges merged commit a963e1c into main Nov 14, 2023
6 checks passed
@lohedges lohedges deleted the feature_restart branch November 14, 2023 16:02