Add notebooks for cookbook #84

Merged
merged 26 commits into from
Oct 10, 2023

Conversation

Yoshanuikabundi
Contributor

As I fill out the cookbook in OpenFreeEnergy/openfe#538 I'll add new notebooks to this PR.

@github-actions

github-actions bot commented Sep 4, 2023

Binder 👈 Launch a binder notebook on branch OpenFreeEnergy/ExampleNotebooks/cookbook

@Yoshanuikabundi Yoshanuikabundi marked this pull request as ready for review September 18, 2023 13:11
@Yoshanuikabundi
Contributor Author

@dwhswenson This one is ready for review!

Member

@dwhswenson dwhswenson left a comment

Rather than putting extra copies of these in the assets/ subdir here, how about making them fetchables? Possibly one fetch command for all the cookbooks, or possibly a different one for each cookbook.

Add more fetchables to the CLI by adding another entry in openfecli/fetchables.py. Then you can bring them into the current directory with

openfe fetch my-fetchable-name

Make that the first step of the notebook. That also makes it so users can download a single notebook to run it, instead of pulling down the whole repo.

(That's just an initial big picture review; comments on notebooks coming later.)

Member

@dwhswenson dwhswenson left a comment

Partial review covering the first couple notebooks. Since my day is ending and @Yoshanuikabundi's is beginning, I figured I'd pass the baton.

  • load_molecules.ipynb: There's room to mention some of the other options of the SolventComponent (neutralize; SMILES), but I don't think we do/can use anything but the defaults there. I don't think it is realistic that we will any time soon, either. Maybe worth feedback from @richardjgowers or @IAlibay on whether they think those should be mentioned? But my intuition is that it would muddy the waters.
  • generate_ligand_network.ipynb: Maybe title should be "Generate a Ligand Network using a Network Planner"? (That is what we call those things.) See also a couple inline comments.

Still to review:

  • hand_write_ligand_network.ipynb
  • load_ligand_network.ipynb
  • ligandnetwork_vis.ipynb
  • choose_protocol.ipynb
  • create_alchemical_network.ipynb

cookbook/generate_ligand_network.ipynb — 2 outdated inline comments, resolved
"id": "10232f7b-7aac-4773-8c39-15f672a654b5",
"metadata": {},
"source": [
"Finally, combine the ligands, mapper and scorer with a planning function to generate the network. These planning functions usually generate a large number of candidate networks according to some criteria, score them by summing the scores of their constituent mappings, and return the network with the highest score. They differ by the criteria used to construct the candidate networks."
Member

Hmmm.. I don't have a specific suggestion here, but I think the description isn't quite accurate. Essentially, what's going on here is that more edges means more computational cost, but also better estimates (until the edges selected are really bad). So there's a trade-off of accuracy for computational time. Different network planners use different metrics of how to balance that trade-off.

MST is a strict "minimal compute" solution. It also isn't a very good solution, from what we've been seeing.

Radial is usually based on some external knowledge, and isn't even based on scores (doesn't require them).

We're getting into adding network planners with more complex heuristics now.

Contributor Author

Ok good to know! I think I can see how I can rewrite this to be more helpful/accurate. I think when I wrote it I was trying to be vague to accommodate future planners - you can totally think of a radial network as being the network that optimises the criteria "every ligand is connected to the central ligand" 😅

But I would like to understand this more fully! So here's some stream of consciousness of how I understand this, and then I'll ask some focused questions in case of tl;dr


My intuition is that an MST and a radial network both have n-1 edges for n nodes - there seems to be an inductive argument that one node has no edges and adding a node must add an edge. Is that incorrect?
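As a quick sanity check of that edge-count intuition, in plain Python (illustrative only, no OpenFE):

```python
# Any spanning tree over n nodes has exactly n - 1 edges: the first node
# contributes no edge, and each node added after that contributes one.
n = 6
star_edges = [(0, i) for i in range(1, n)]        # radial: hub ligand 0
chain_edges = [(i, i + 1) for i in range(n - 1)]  # one possible MST shape
assert len(star_edges) == n - 1 == len(chain_edges)
```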

So I'm guessing you mean more edges between two nodes as well as more edges in the network makes things cheaper but less accurate? I'm guessing that an MST allows some edges to have a simpler/better mapping (and so the simulations converge faster?), but a radial network guarantees that any two nodes are only two edges apart. I can see that if all edges have similar precision then the more edges separating two nodes, the less precise the free energy estimate between them, as the error from each edge would compound. This seems to suggest for these examples with very simple transformations a radial graph might be better than an MST?

It seems from documenting the ProtocolSettings that the compute time is fixed by default, so faster convergence would mean better precision rather than faster compute? I can see that a mapping that destroys fewer atoms would have a small effect on the number of interactions in the simulation's hybrid states, but this seems negligible for performance. I can also see that extreme transformations just aren't going to work at all and will need some intermediate nodes, and there's a spectrum between "everything works" and "doesn't work at all" where convergence is slow and precision suffers.

I can see that having more edges in the network would improve precision at the cost of performance. Adding an edge is essentially adding a statistical replica, and adding an edge between two distant nodes could reduce a lot of error with relatively little computation. How common are redundant edges in practice?

I'm guessing your choice probably also depends on which edges you actually care about. If you want to compare a bunch of derivatives to a known binder, a radial network might be perfect. If you're, idk, studying the evolution of a natural product, maybe you want a more linear network. Is it ever worth adding a redundant edge to a connected graph if I don't care about the result of that edge?

Are there other ways that the mapping and network layout affects computational expense?

Is there another sense in which the network layout and mappings affect accuracy, or is accuracy more or less synonymous with precision in this context? Can having the "wrong" edges introduce bias?

So I guess it seems like you usually want something close to a radial network, with all the ligands only a few steps from each other. The exception is when there are, say, two different classes of ligand, and a radial network will converge slowly or not at all. But using an MST for a bunch of simple benzene derivatives is misleading coz actually all the transformations have high scores, and an MST artificially spreads them out. So you might want something like a hub and spoke heuristic, where you group ligands into disconnected radial networks and then try to connect them with like a minimal spanning path.


Stream of consciousness over. Answers to those questions would be very useful but I know that's a lot to write. I'd appreciate any mistakes in the above being corrected. Here are some more focused questions:

  1. Does a radial network ever have a different number of edges than an MST?
  2. Is the performance and accuracy of a simulation very sensitive to the atom mapping, or is the mapping score more a proxy for "is this transformation going to work at all"?
  3. Do you ever, in practice, want redundant edges (ie, a cyclic network), or are the more complex heuristics just about where to put your n-1 edges?
  4. Are there ways that the network layout and atom mapping affect the accuracy of the result directly, or is it just about convergence/precision? Obvs an extremely imprecise result is going to be inaccurate, I just want to confirm that in this context we're talking about how things converge rather than what the underlying model says.

Member

I'll tag in @IAlibay to correct me on anything I'm wrong about here.

I'll start with the questions at the bottom:

Does a radial network ever have a different number of edges than an MST?

Both are the minimum number of edges to span the nodes, so no. At least, not as we define "radial network." There are other tools that start with the kind of radial network we have and then add in more edges (and still call it a "radial network"/"star map").

Is the performance and accuracy of a simulation very sensitive to the atom mapping, or is the mapping score more a proxy for "is this transformation going to work at all"?

A bit of both? Convergence can definitely depend on the mapping. But also, mapping score itself may not be perfectly correlated with convergence. The scores are just heuristics that people have come up with. A really bad score will probably be a really bad edge, but two scores that are close to each other? May be hard to guess which will actually converge more quickly.

Do you ever, in practice, want redundant edges (ie, a cyclic network), or are the more complex heuristics just about where to put your n-1 edges?

Yes, in fact most algorithms include cycles. The main reason: once you have a cycle, you know that the sum of the delta-Gs around that cycle should be 0. This can help you identify bad legs, and can be used to get a better maximum likelihood estimate.
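A minimal sketch of that cycle-closure check (plain Python; the ligand names and delta-G values are invented):

```python
# Around any closed cycle of relative free energy legs, the delta-Gs
# should sum to ~0; the deviation ("cycle closure error") flags bad legs.
legs = {                      # (from, to): delta-delta-G in kcal/mol
    ("A", "B"): -1.2,
    ("B", "C"): 0.5,
    ("C", "A"): 0.7,
}
cycle_error = abs(sum(legs.values()))
# cycle_error is ~0 here, so no leg in this cycle looks suspect
assert cycle_error < 0.1
```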

Are there ways that the network layout and atom mapping affect the accuracy of the result directly, or is it just about convergence/precision? Obvs an extremely imprecise result is going to be inaccurate, I just want to confirm that in this context we're talking about how things converge rather than what the underlying model says

Here I definitely want @IAlibay to check me, but I believe it is the case that, in the limit of infinite sampling, any edge should converge to whatever the force field predicts. But yeah, the question is how close to "infinite" your sampling needs to be before you can trust the results. There are pathological cases (e.g., your bound pose for your ligand is completely incorrect) where you might be better off with plain MD.


Answering a couple other points:

This seems to suggest for these examples with very simple transformations a radial graph might be better than an MST?

You argued well for the radial; the counter-argument for MST is that even radial can force you to use edges that only allow bad mappings.

It seems from documenting the ProtocolSettings that the compute time is fixed by default, so faster convergence would mean better precision rather than faster compute?

When we talk about faster convergence, it's "faster" with respect to MD time, not wall time. We'll worry about ns/day inside the protocol; the question is how much MD you need to do. Two ways this relates to wall time: (1) we have an early-abort, so if your precision is good enough, you can stop early; (2) sometimes you need to continue a simulation for better stats (we haven't made that easy yet, but it is on the way).

Is it ever worth adding a redundant edge to a connected graph if I don't care about the result of that edge?
Can having the "wrong" edges introduce bias?

For both of these, keep in mind that, in practice, we're calculating relative free energies and then using maximum likelihood to estimate the absolute free energies. So adding another good edge, that you don't care about, will improve your MLE result. And adding a really bad edge, especially if you have few cycles, can definitely bias you away from correctness.
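A sketch of that MLE step under a Gaussian-noise assumption (plain NumPy, invented numbers; OpenFE's actual estimator may differ):

```python
import numpy as np

# Each edge (i, j, ddg) measures g[j] - g[i]; least squares over all
# edges is the maximum-likelihood estimate for Gaussian edge errors.
edges = [(0, 1, -1.0), (1, 2, 0.5), (2, 3, -0.3), (3, 0, 0.8), (0, 2, -0.5)]
A = np.zeros((len(edges) + 1, 4))
b = np.zeros(len(edges) + 1)
for row, (i, j, ddg) in enumerate(edges):
    A[row, i], A[row, j] = -1.0, 1.0
    b[row] = ddg
A[-1, 0] = 1.0  # pin ligand 0 at zero to fix the arbitrary offset
g, *_ = np.linalg.lstsq(A, b, rcond=None)
# the cyclic (redundant) edges make g robust to noise in any single leg
```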

"id": "f9ad85a0-a3b2-4e95-9784-0a88ba790683",
"metadata": {},
"source": [
"OpenFE provides utilities for ingesting networks from ordinary Python datastructures like `list[tuple[str, str]]` and `list[tuple[int, int]]`. Each string or integer respectively names or indexes a ligand, and tuples represent edges in the network. This allows specific networks to be written by hand, and may be helpful for writing loading functions from other file formats."
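For illustration, the two data structures the cell describes look like this (plain Python, hypothetical ligand names, no OpenFE imports):

```python
# The same hand-written network as name pairs and as index pairs;
# each tuple is one edge of the ligand network.
ligand_names = ["benzene", "toluene", "phenol"]
edges_by_name = [("benzene", "toluene"), ("benzene", "phenol")]
index_of = {name: i for i, name in enumerate(ligand_names)}
edges_by_index = [(index_of[a], index_of[b]) for a, b in edges_by_name]
```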
Member

It might be nice to put more emphasis that this is useful if you have a more complex network and you know the edges you want.

"id": "31872d6f-028f-4974-9724-2f4654a7b914",
"metadata": {},
"source": [
"As only the topology of the network is encoded, OpenFE must produce atom mappings automatically; for more information, see [Choose an Atom Mapper]:\n",
Member

Suggested change
"As only the topology of the network is encoded, OpenFE must produce atom mappings automatically; for more information, see [Choose an Atom Mapper]:\n",
"As only the topology of the network is encoded, OpenFE needs an atom mapper to produce the atom mappings; for more information, see [Choose an Atom Mapper]:\n",

Contributor Author

I've put this in my own words:

As we will only specify the topology of the network, OpenFE must generate atom mappings for us. For this, it needs an atom mapper; for more information, see Choose an Atom Mapper:

cookbook/hand_write_ligand_network.ipynb — 2 outdated inline comments, resolved
cookbook/load_ligand_network.ipynb — 3 outdated inline comments, resolved
Comment on lines 81 to 82
"| Method | [openmmtools] [multistate] methods |\n",
"| Force Fields | [openmmforcefields] |\n",
Member

I see what you're going for with these rows, but I don't think these are the categories we want to deal with -- it gets a little too deep in the details. We'll probably come up with other categories we need here at some point.

Comment on lines 336 to 337
" temperature=298.15 * unit.kelvin, # Maintain constant temperature with Langevin dynamics\n",
" pressure=1 * unit.bar, # Maintain constant pressure with a Monte Carlo barostat\n",
Member

That it uses Langevin dynamics and a Monte Carlo barostat is particular to the protocol, not to the settings object.


Member

@dwhswenson dwhswenson left a comment

Got through them all! These look pretty good, and definitely fill in a lot of gaps.

In general, I'm pushing for "create" over "load" to describe what we're doing with networks. We use the word "load" frequently enough when deserializing an object (loading from disk) that I don't want there to be any confusion. Even with the Orion NES/FEP+ things, we're creating a network based on partial information, not loading exactly what they did.

  • hand_write_ligand_network.ipynb: This is good, but not what I expected from "hand write" -- there's still room for a cookbook entry that manually creates LigandAtomMappings and puts them into a LigandNetwork.
  • load_ligand_network.ipynb: might change the filename here -- I expected this to be loading from graphml (although that's a pretty boring one-liner)
  • ligandnetwork_vis.ipynb: Should also include the 2D visualization. Maybe change name "Visualizing Nodes" => "Visualizing Ligand Overlap"? I never got the widget to work; might be good for someone else to give it a try.
  • choose_protocol.ipynb: At the end, add mention that after being associated with a Protocol, settings should not be changed. I'm a little concerned about keeping this notebook up to date as we're adding new settings related to analysis outputs; advice for that would be most welcome.
  • create_alchemical_network.ipynb: On the "Manually" section, maybe add a comment that the names are chosen to be identical to the other one? Also, any reason the equality check at the end recreates instead of uses alchemical_network_auto?

Additionally, a number of small suggestions/comments inline.

@Yoshanuikabundi
Contributor Author

We discussed this synchronously at our last meeting, but just for posterity:

I'm a little concerned about keeping this notebook up to date as we're adding new settings related to analysis outputs; advice for that would be most welcome.

That's why the notebook asserts that the manually specified settings object is identical to the automatic one; changes to the defaults will be caught by CI (once the notebooks are actually run in CI). You pointed out that this will not catch the case when new settings are added to the object, and we agreed that incomplete documentation is better than incorrect documentation.
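The guard pattern described here, sketched with illustrative names (plain Python, not the real OpenFE settings API):

```python
# Build the settings two ways and assert they match; if upstream defaults
# change, CI running the notebook fails and flags the stale documentation.
def default_settings():
    # stand-in for the library's default-settings factory
    return {"temperature_K": 298.15, "pressure_bar": 1.0}

# the manually spelled-out values the notebook documents
manual = {"temperature_K": 298.15, "pressure_bar": 1.0}
assert manual == default_settings(), "defaults changed; update the notebook"
```

As noted above, this catches changed defaults but not newly added settings fields.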

Also, any reason the equality check at the end recreates instead of uses alchemical_network_auto?

I wanted to protect users who changed the first section where this is created.

@Yoshanuikabundi
Contributor Author

I never got the widget to work; might be good for someone else to give it a try.

Hmm... For me, the widgets all work when running Jupyter Lab from an environment based on openfe's environment.yml with openfe installed from source (and gufe installed from main per that environment), but the interact widget fails in an environment from ExampleNotebooks' .binder/environment.yml with Error displaying widget.

I can reproduce the issue in the from-source build by adding an ipywidgets<8 pin. Changing the ipywidgets pin in the binder environment to ipywidgets>=8 causes the environment to fail to build:

    The following packages are incompatible
    ├─ ipywidgets >=8  is requested and can be installed;
    └─ openfe >=0.10.1  is not installable because it requires
       ├─ gufe >=0.9.3,<0.10 , which requires
       │  └─ openff-toolkit >=0.13,<0.14 , which requires
       │     └─ ipywidgets 7.* , which conflicts with any installable versions previously reported;
       └─ openff-toolkit [0.13 |>=0.13,<0.14 ], which cannot be installed (as previously explained).

And if I update the gufe and openfe pins in the binder environment:

error    libmamba Could not solve for environment specs
    The following packages are incompatible
    ├─ gufe >=0.9.4  is installable and it requires
    │  └─ openff-toolkit >=0.13,<0.14 , which requires
    │     └─ ipywidgets 7.* , which can be installed;
    └─ ipywidgets >=8.0  is not installable because it conflicts with any installable versions previously reported.

So I think the problem is that some interaction between interact and py3Dmol relies on ipywidgets 8. My use of the function seems to be allowed in 7.x, so I think this might be a py3Dmol issue.

OpenFF Toolkit v0.13 pins ipywidgets <8, but since v0.14 this pin has been removed. So when Gufe supports OFF Toolkit 0.14+, the widget should suddenly work.

Since the widget doesn't work in Sphinx anyway (because its interactivity relies on the notebook's Python runtime), I've removed it from the notebook for the time being. I think it does make it much easier to visualize different edges though, so it would be good to be able to restore it in the future.

@richardjgowers does Gufe still need to pin openff-toolkit<0.14? Also I noticed openfe 0.14.0 is released on GitHub but not conda forge - mind if I ask what's going on there?

Member

@dwhswenson dwhswenson left a comment

These look good to me; we'll merge them and make any further clean-up a problem for later. For example, see #87.

One thing I'd like to check factually with @IAlibay (possibly @RiesBen or @hannahbaumann might know): The section on loading from Orion or FEP+ suggests that Orion outputs these files with the .dat extension and FEP+ with the .edge extension. I want to double-check that these are the default extensions actually used; I could imagine a game of telephone where someone gave us example files with an arbitrary extension, and now we're claiming that is the official one.

@dwhswenson dwhswenson merged commit e797390 into main Oct 10, 2023
@dwhswenson dwhswenson deleted the cookbook branch October 10, 2023 21:24