
FEP tutorial comments #1

Closed
jmichel80 opened this issue Apr 16, 2021 · 18 comments
@jmichel80
Collaborator

Hi @JenkeScheen ,

A few comments/questions after going through your setup notebook:

  • The notebook assumes the user has access to parameterised ligands (in vacuum) whose coordinates match those of the parameterised protein input. We need separate BSS nodes to generate this type of input starting from 'sane' PDB/mol2 files.

  • The inputs haven't been energy minimised/equilibrated before starting the production runs. This may cause SOMD to crash. It would be better to equilibrate the solvated ligand0:complex, and then use a snapshot after equilibration to generate the merged molecule.

  • The slurm submission script will run each lambda value serially. It would be better to run the lambda windows in parallel to decrease time to answer.

@JenkeScheen
Collaborator

Hi @jmichel80 ,
Thanks for the input, I'll get on this after the weekend.

  • Yes, I thought that this is what we agreed on. IIRC the idea was that in the introduction part of the tutorial participants would have learned to do this, so parameterising ligands etc. is out of scope for the FEP part?
  • Agreed, although I'm not sure how to do this at a high level in BSS? @lohedges
  • Yes, thanks for the input. I'm planning to adjust this part of the tutorial once the BSS code that can load a SOMD directory is in place, although I think the standard BSS approach is to run lambdas sequentially? Personally I usually just create a slurm job in every lambda folder, where the job consists of just running somd-freenrg on the input files in that dir. Although that circumvents the issue you refer to, I would think that approach is not suitable for this tutorial (seeing as we want to use .run()/.analyse())?

@jmichel80
Collaborator Author

Hi @JenkeScheen ,

  • Yes, the intro part will cover how parameterisation works, but the necessary code should still be present in the FEP use case so that the FEP tutorial can be used standalone to process another dataset (if desired). We want these tutorials to also be useful to users who "just want to get to the binding energies". Refer to the outline I posted previously in [TUTORIAL] FEP with perturbation network BioSimSpace#193. I'm not saying the code has to be structured exactly as shown there, but it should be possible for a user to start with the input files specified in red and end up with the binding energies highlighted in red as output, abstracting intermediate steps if desired.
  • You could use a BSS energy minimisation/equilibration protocol. But that means solvating first.
  • Will let @lohedges comment on what options are available without too much work. I imagine we can get a process to only run a subset of windows and use job arrays. There is a bit more work to do to make sure all the relevant output is present before running the .analyse() step.

@lohedges
Member

Hi both,

Here are a few comments:

The notebook assumes the user has access to parameterised ligands (in vacuum) that have coordinates matching those of the parameterise protein input. We need separate BSS nodes to generate this type of input starting from 'sane' PDB/mol2 files.

Yes, we'll have standalone nodes for parameterisation and solvation. For the purposes of the tutorial I am assuming that you'll just have the pre-prepared systems, so you would just talk about those stages and what the input/output would be, rather than actually running them. I don't think we will be able to use nodes for minimisation or equilibration though, since those would lose the concept of a system containing a merged molecule when writing the output to file. (We could just use the coordinates from the minimised/equilibrated systems and run system._updateCoordinates to copy them into the FEP system, but this would need us to reconstruct the system before doing so.)

The inputs haven't been energy minimised/equilibrated before starting the production runs. This may cause SOMD to crash. It would be better to equilibrate the solvated ligand0:complex, and then use a snapshot after equilibration to generate the merged molecule.

You can run minimisation / equilibration on a system containing a merged molecule. It simply creates the actual system at lambda = 0, then copies the coordinates back into the merged system when you call process.getSystem(). As mentioned above, this would work if you ran minimisation and equilibration in a single script/node prior to running the FEP simulation, but wouldn't work in standalone scripts/nodes since you would lose the merged system. (Currently we can only write to file, rather than being able to reconstruct it.)

The slurm submission script will run each lambda value serially. It would be better to run the lambda windows in parallel to decrease time to answer.

If we want to run lambda windows in parallel then we are going to need to do this outside of BioSimSpace, i.e. just use it for the setup of the FEP simulation, as is done by prepareFEP. We could then provide example batch scripts (tested on UCB systems) to show how to run each window in parallel using a job array, or similar.
I also think it would be easier to provide a script that runs analyse_freenrg for each ligand pair. This would avoid the need to go back into BioSimSpace for this single stage afterwards.

The ProcessRunner that is used by FreeEnergy.run can run jobs in parallel, but this has only been tested with multi-CPU jobs, and not on an HPC system. I think it could work if there was a multi-GPU system configured using CUDA_VISIBLE_DEVICES, but I think UCB have single-GPU nodes with exclusive access.

I'm still a little confused as to whether we will be assuming that the user will be setting up the FEP job on the system, or whether they are doing this locally and copying directories across. If the latter, then I don't think we're going to be able to reconstruct the FreeEnergy object, but it would be fine if we did the submission script with parallelised somd-freenrg jobs, which is the way most people actually run these things in practice.

@lohedges
Member

Just reading this bit again...

It would be better to equilibrate the solvated ligand0:complex, and then use a snapshot after equilibration to generate the merged molecule.

Yes, that would work, i.e. equilibrating the complex prior to performing the merge. (This is what happens in your diagram, I think.) I imagine that this might confuse some people, i.e. solvating, equilibrating, extracting the molecules, merging, then solvating again. However, it's probably the easiest way if we want to decouple the stages.

@jmichel80
Collaborator Author

jmichel80 commented Apr 18, 2021

Hi,
I have given more thought to the best way to implement the FEP pipeline. From a user standpoint, once it has been decided how to process a dataset there shouldn't be a need for further intervention until the results are available for processing. The complexity is that executing the pipeline efficiently requires using different types of resources at different stages.

I have sketched the execution model here

[image: suggested_workflow-white]

We can solve this in SLURM using a mixture of jobs, job arrays, job dependencies and a bit of bash scripting.

We can build in some degree of resilience by having each script do some basic error checking (is the expected input present?). Ideally the pipeline is structured such that it could be executed again without repeating steps that completed successfully.
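The per-stage guard described above might look something like the following sketch (the file names are hypothetical, not the pipeline's actual inputs/outputs):

```python
from pathlib import Path

def stage_needs_run(required_inputs, expected_outputs):
    """Return True if this stage still needs to run: all of its inputs
    must be present, and at least one expected output is missing.
    Raising on missing inputs lets SLURM job dependencies fail fast."""
    missing = [str(p) for p in required_inputs if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"Missing stage inputs: {missing}")
    return not all(Path(p).exists() for p in expected_outputs)
```

Each stage script would call this at the top and exit early (successfully) if its outputs already exist, making a re-submitted pipeline skip completed steps.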

To keep it compatible with the BSS philosophy it should be possible to execute the pipeline with different FEP engines (SOMD or GROMACS as it stands). I think this is possible as the user will specify the chosen execution engine in the planFEP notebook.

I have uploaded a skeleton SLURM bash script that implements the execution model on plato. I have verified that each step is executed with correct dependencies using dummy python scripts.
See https://github.com/michellab/BioSimSpaceTutorials/tree/fep/03_fep/execution_model

Next steps:

  • @lohedges, can you advise whether the overall approach is sensible?
  • If we proceed down that route, which API changes are needed to
    • allow execution with a chosen FEP engine
    • allow running a single lambda window on a GPU at a time
    • allow analysis of a completed FEP run

If the above can be implemented, the next step would be for @JenkeScheen to implement the required calls in the four different BSS execution scripts. The setup notebook would also have to be modified to generate as output the files 'ligands.dat', 'network.dat' , 'protocol.dat' and write the slurm bash scripts.
If we can successfully deploy this approach on plato on a real dataset, the next step would be to write an equivalent LSF script that could work on UCB's cluster.

@lohedges
Member

Hmm, there are a lot of things here that aren't currently supported and would need to be implemented / tested if we want to do this in a clean way.

  • Your BSSligprep.py script has equilibration stages with restraints on all non-solvent atoms. We currently only support restraints on backbone atoms (this was all that was ever asked for). I agree that it would be good to have other options, e.g. all heavy atoms, all non-water atoms, any atom indices the user likes, etc, but this would need to be implemented and tested. It's not too hard, but the different engines support this in very different ways and some rely on atom naming for the restraints, which makes things hard to generalise.

  • In BSSrunFEP.py, are we just running the pre-setup legs using the MD engine directly, i.e. writing command-line calls to somd-freenrg or gmx mdrun? If so, does this need to be a Python script? If not, we won't be able to do this with BioSimSpace since we can't reconstruct the free energy object without significant work, i.e. we'd need to reload the original ligands and protein (with their original names), recreate the mapping (using whatever options the user chose), align (again with whatever options the user chose), merge, solvate, set up the entire object, etc.

  • Similar to the above, BSSanalyseFEP.py seems to be using BioSimSpace to perform the analysis. Unless I can make the analysis calls static (they currently depend on the state of the FreeEnergy object) we can't go back inside BioSimSpace to do this having run the legs independently. It would be easier to add a single gmx mbar or analyse_freenrg command to the run scripts above that analyses the data for each leg following the run. We could then collate this information ready for processing with freenrg_workflows in a separate script.

@lohedges
Member

To support multiple restraint types it would make far more sense to add a restraint option that can take matching keywords, e.g. restraint = option, where option = None, backbone, heavy, all, etc. However, this would break the current API unless we also kept the (now redundant) restraint_backbone option and ignored it when anything else is set.
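One possible shape for that option-resolution logic, with the old boolean flag kept for backwards compatibility as suggested (the function name, allowed values, and deprecation behaviour here are illustrative, not the actual BSS API):

```python
import warnings

# Hypothetical set of keyword restraint options.
ALLOWED_RESTRAINTS = {None, "backbone", "heavy", "all"}

def resolve_restraint(restraint=None, restraint_backbone=False):
    """Map the deprecated boolean flag onto the new keyword option.
    The new keyword wins if both are supplied; the old flag triggers
    a deprecation warning so existing scripts keep working."""
    if restraint_backbone and restraint is None:
        warnings.warn(
            "'restraint_backbone' is deprecated; use restraint='backbone'",
            DeprecationWarning,
        )
        restraint = "backbone"
    if restraint not in ALLOWED_RESTRAINTS:
        raise ValueError(f"Unsupported restraint option: {restraint!r}")
    return restraint
```

This keeps old calls valid while steering users toward the keyword form, and documenting the deprecation covers the API-break concern.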

@jmichel80
Collaborator Author

  • BSSrunFEP.py : Since we have previously written the input files for a perturbation it could be a direct call to an FEP engine. It would need logic to map $SLURM_ARRAY_TASK_ID to a lambda value for a free/bound leg. Parsing 'protocol.dat' would tell us which engine is to be used. We would still need to write down a simulation config file (see below).

  • BSSanalyseFEP.py: Could also do the same.

  • BSSprepFEP.py: Could write down the run and analysis commands based on the chosen engine. Effectively ending use of BSS in the pipeline at this stage.

  • That leaves us with supporting various types of restraints in BSSligprep.py. Experience suggests that a multi-staged equilibration where restraints are gradually removed is important for robust processing of inputs (see how FESetup does it). One option is to rely on a single MD engine for minimisation/equilibration (OpenMM or SOMD as they are bundled with BSS), and then switch to the chosen FEP engine at the BSSprepFEP.py stage. We could generalise later.
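The $SLURM_ARRAY_TASK_ID mapping described in the first bullet could be sketched as follows (the lambda schedule and leg names are placeholders, not the tutorial's actual values):

```python
import os

def task_to_window(task_id, lam_values, legs=("free", "bound")):
    """Map a flat SLURM array index onto a (leg, lambda) pair.
    Indices 0..n-1 cover the first leg, n..2n-1 the second, etc.,
    so a single job array spans both legs of the perturbation."""
    n = len(lam_values)
    if not 0 <= task_id < n * len(legs):
        raise ValueError(f"Task id {task_id} out of range")
    return legs[task_id // n], lam_values[task_id % n]

# In the job-array script the index would come from the environment:
#   task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
# and the engine to invoke would be read from 'protocol.dat'.
```

Submitting with sbatch --array=0-9 (for five windows per leg) would then give each array task exactly one window to run.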

Let me know what you think.

@lohedges
Member

Yes, the first three bullets were effectively what I was thinking: adding some logic to go from the SLURM indices to the specific legs.

I agree that restraints are important and it would be good to do this properly. I would like a more robust approach similar to what I suggested in my previous post. I already have the logic to translate backbone restraints for all engines that support them, and I could start working on doing the same for other restraints too. (If we agree on what we want to add.) I don't think this would be too bad, although I think some refactoring of the code would be needed.

I would also much prefer a more flexible restraint keyword as described above. This would probably mean deprecating the old option, which would be fine if we document this change and we could always wrap a call to restraint = backbone if the user passes restrain_backbone = True.

Let me know if you want me to start playing around with additional restraints and I'll start that ASAP.

@jmichel80
Collaborator Author

Ok, let’s focus on the restraints. I will flesh out the run and analysis scripts tonight with SOMD-specific commands. I will need some help with the GROMACS commands; presumably they can be extracted from what BSS would have written. We also need a good way to locate this BSS dependency. SOMD is easier since we can assume it may be found in $BSSHOME. You can follow the equilibration protocol of FESetup for the sequence of restraints if you need more specific details. I can post links to relevant parts of the code tonight if you need pointers.

@lohedges
Member

No problem. BioSimSpace will autodetect the path to GROMACS, so it can write it out for us when the user specifies the engine. We can get the path using free_nrg.runner.processes[0].exe() and the specific run arguments using free_nrg.runner.processes[0].getArgs().
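Given those two pieces of information, the setup stage could write the standalone run command out to a script. A minimal sketch, assuming exe() returns a path string and getArgs() an ordered mapping of flags to values (those return shapes are assumptions, not checked against the BSS API):

```python
def write_run_script(exe, args, path):
    """Assemble a shell command from an executable path and an ordered
    {flag: value} mapping (a value of True means a bare flag), and
    write it as a one-line bash script for the cluster to run."""
    parts = [exe]
    for flag, value in args.items():
        parts.append(flag if value is True else f"{flag} {value}")
    with open(path, "w") as fh:
        fh.write("#!/bin/bash\n" + " ".join(parts) + "\n")
```

The SLURM submission script would then just execute the generated file for each window, with no BioSimSpace dependency at run time.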

@lohedges
Member

If possible, it would be good to agree on what type of restraints we want to support. In your example, I see things like the following:

NVT equilibration with restraints on backbone atoms and ligand

I'm not sure that something this flexible would be easy to support across all engines, especially if just using keyword options for the restraint. Ideally we could use the search functionality to generate the list of atom indices involved in the restraint and pass those as an option instead. (Still allowing less flexible keywords too. So the restraint option could be keywords, or a list of atom indices.)

Just to point out how fiddly all this is. Since we preserve atom naming by default (one of our core principles) it makes writing general restraint files a real pain. For example, if you take an AMBER format system and use gmx genrstr to apply the restraints (as we do to make things easier), then you end up with errors such as the following:

'gmx grompp' reported the following warnings:
WARNING 1 [file gromacs.top, line 256]:
  1890 non-matching atom names
  atom names from /home/lester/Code/BioSimSpace/demo/tmp/gromacs.top will be
  used
  atom names from /home/lester/Code/BioSimSpace/demo/tmp/gromacs.gro will
  be ignored

It doesn't complain when not adding restraints, or when passing a regular GROMACS system and using restraints. This makes me think that we'll have to handle all the atom-index finding/masking, etc. manually for all packages and do everything explicitly by index. This would mean that we would also need to come up with robust ways of using the search functionality to detect all types of atoms that we would want to restrain.

@lohedges
Member

I've created an issue thread for the discussion of equilibration restraints here.

@jmichel80
Collaborator Author

The ligand restraint is useful in case the initial pose has steric clashes and we don't want an early equilibration to push the ligand away from its starting binding mode. Basically we aim to relax the protein and solvent a bit around the starting pose, in the hope that the ligand will remain in place once the restraints are released.

If we cannot easily implement restraints that select ligand atoms we could skip this for now.

From your post for #203 it looks like it could be some work to support flexible restraints for all supported MD engines.
I suggest we prioritise support for OpenMM and Gromacs initially to allow delivery of the FEP use case, and fill in support for other engines later.

@jmichel80
Collaborator Author

One downside of using a shell script for BSSanalyse.sh that directly calls gmx mbar or analyse_freenrg is that the pipeline is not modular at this stage. We might want to run an FEP with SOMD or GROMACS and then use the same analysis tool to process the outputs consistently to estimate a free energy. I will proceed with that route for now, but we should make a note of this issue.

@lohedges
Member

The script could detect the format of the output in the directories and select the appropriate analysis tool. I guess this would provide the flexibility that we want in the short-term, i.e. being able to run different bits with different engines and combine results.
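The detection step could be as simple as checking for engine-specific output files in each lambda directory. A sketch along those lines (the file names used here, simfile.dat for SOMD and *.xvg for GROMACS, are assumptions for illustration, not a definitive list):

```python
from pathlib import Path

def detect_engine(lambda_dir):
    """Guess which engine produced a lambda window from the files it
    left behind, so the script can pick the matching analysis tool
    (e.g. analyse_freenrg vs gmx mbar)."""
    d = Path(lambda_dir)
    if (d / "simfile.dat").exists():      # assumed SOMD output name
        return "SOMD"
    if any(d.glob("*.xvg")):              # assumed GROMACS energy output
        return "GROMACS"
    raise RuntimeError(f"Cannot identify engine output in {lambda_dir}")
```

The analysis script would loop over the lambda directories, dispatch on the detected engine, and still combine the per-leg results in one place.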

(Apologies if I've misunderstood the problem.)

@lohedges
Member

I've now added some functionality so that you can do static analysis of an existing simulation using BioSimSpace.FreeEnergy.analyse(work_dir). It will figure everything out from the information in the work_dir, i.e. the type of simulation, the engine that was used, etc. This means that you could run the simulations outside of BioSimSpace, e.g. on a cluster, then go back into BioSimSpace to perform the analysis afterwards.

I've also added a BioSimSpace.FreeEnergy.getData() method, which can be used to zip up just the files that are needed for analysis, while preserving the folder structure of the work_dir. This could be used when someone does use BioSimSpace.FreeEnergy.run() but wants to copy back a subset of the output files for analysis, i.e. excluding all the config files, trajectories, etc.
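Conceptually, the getData() behaviour can be pictured with a stdlib sketch that collects only the analysis files while keeping the work_dir layout intact (this is a rough illustration of the idea, not BioSimSpace's implementation, and the glob pattern is a placeholder):

```python
import zipfile
from pathlib import Path

def collect_analysis_files(work_dir, archive, patterns=("**/simfile.dat",)):
    """Zip the files matching the given glob patterns, storing them
    with paths relative to work_dir so that the lambda-window folder
    structure is preserved inside the archive."""
    work_dir = Path(work_dir)
    with zipfile.ZipFile(archive, "w") as zf:
        for pattern in patterns:
            for path in sorted(work_dir.glob(pattern)):
                zf.write(path, path.relative_to(work_dir))
```

Unpacking the archive on the analysis machine then recreates exactly the directory tree the static analyse(work_dir) call expects, without trajectories or config files.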

@lohedges
Member

I've added another update so that getData is static, meaning you can use it on an existing working directory without needing to instantiate another FreeEnergy object.

(N.B. The Sire macOS build from this morning timed out so all BioSimSpace macOS builds will fail until the re-run of the Sire build completes.)
