Add notebooks for cookbook #84

Merged
merged 26 commits into from
Oct 10, 2023

Conversation

Yoshanuikabundi
Contributor

As I fill out the cookbook in OpenFreeEnergy/openfe#538 I'll add new notebooks to this PR.

@github-actions

github-actions bot commented Sep 4, 2023

Binder 👈 Launch a binder notebook on branch OpenFreeEnergy/ExampleNotebooks/cookbook

@Yoshanuikabundi Yoshanuikabundi marked this pull request as ready for review September 18, 2023 13:11
@Yoshanuikabundi
Contributor Author

@dwhswenson This one is ready for review!

Member

@dwhswenson dwhswenson left a comment

Rather than putting extra copies of these in the assets/ subdir here, how about making them fetchables? Possibly one fetch command for all the cookbooks, or possibly a different one for each cookbook.

Add more fetchables to the CLI by adding another entry in openfecli/fetchables.py. Then you can bring them into the current directory with

openfe fetch my-fetchable-name

Make that the first step of the notebook. That also makes it so users can download a single notebook to run it, instead of pulling down the whole repo.

(That's just an initial big picture review; comments on notebooks coming later.)

Member

@dwhswenson dwhswenson left a comment

Partial review covering the first couple notebooks. Since my day is ending and @Yoshanuikabundi's is beginning, I figured I'd pass the baton.

  • load_molecules.ipynb: There's room to mention some of the other options of the SolventComponent (neutralize; SMILES), but I don't think we do/can use anything but the defaults there. I don't think it is realistic that we will any time soon, either. Maybe worth feedback from @richardjgowers or @IAlibay on whether they think those should be mentioned? But my intuition is that it would muddy the waters.
  • generate_ligand_network.ipynb: Maybe title should be "Generate a Ligand Network using a Network Planner"? (That is what we call those things.) See also a couple inline comments.

Still to review:

  • hand_write_ligand_network.ipynb
  • load_ligand_network.ipynb
  • ligandnetwork_vis.ipynb
  • choose_protocol.ipynb
  • create_alchemical_network.ipynb

cookbook/generate_ligand_network.ipynb — 2 outdated inline comments, resolved
"id": "10232f7b-7aac-4773-8c39-15f672a654b5",
"metadata": {},
"source": [
"Finally, combine the ligands, mapper and scorer with a planning function to generate the network. These planning functions usually generate a large number of candidate networks according to some criteria, score them by summing the scores of their constituent mappings, and return the network with the highest score. They differ by the criteria used to construct the candidate networks."
Member

Hmmm.. I don't have a specific suggestion here, but I think the description isn't quite accurate. Essentially, what's going on here is that more edges means more computational cost, but also better estimates (until the edges selected are really bad). So there's a trade-off of accuracy for computational time. Different network planners use different metrics of how to balance that trade-off.

MST is a strict "minimal compute" solution. It also isn't a very good solution, from what we've been seeing.

Radial is usually based on some external knowledge, and isn't even based on scores (doesn't require them).

We're getting into adding network planners with more complex heuristics now.

Contributor Author

Ok good to know! I think I can see how I can rewrite this to be more helpful/accurate. I think when I wrote it I was trying to be vague to accommodate future planners - you can totally think of a radial network as being the network that optimises the criteria "every ligand is connected to the central ligand" 😅

But I would like to understand this more fully! So here's some stream of consciousness of how I understand this, and then I'll ask some focused questions in case of tl;dr


My intuition is that an MST and a radial network both have n-1 edges for n nodes - there seems to be an inductive argument that one node has no edges and adding a node must add an edge. Is that incorrect?
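As a quick sanity check of that edge-count intuition, in plain Python (illustrative only, no OpenFE):

```python
# Any spanning tree over n nodes has exactly n - 1 edges: the first node
# contributes no edge, and each node added after that contributes one.
n = 6
star_edges = [(0, i) for i in range(1, n)]        # radial: hub ligand 0
chain_edges = [(i, i + 1) for i in range(n - 1)]  # one possible MST shape
assert len(star_edges) == n - 1 == len(chain_edges)
```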

So I'm guessing you mean more edges between two nodes as well as more edges in the network makes things cheaper but less accurate? I'm guessing that an MST allows some edges to have a simpler/better mapping (and so the simulations converge faster?), but a radial network guarantees that any two nodes are only two edges apart. I can see that if all edges have similar precision then the more edges separating two nodes, the less precise the free energy estimate between them, as the error from each edge would compound. This seems to suggest for these examples with very simple transformations a radial graph might be better than an MST?

It seems from documenting the ProtocolSettings that the compute time is fixed by default, so faster convergence would mean better precision rather than faster compute? I can see that a mapping that destroys fewer atoms would have a small effect on the number of interactions in the simulation's hybrid states, but this seems negligible for performance. I can also see that extreme transformations just aren't going to work at all and will need some intermediate nodes, and there's a spectrum between "everything works" and "doesn't work at all" where convergence is slow and precision suffers.

I can see that having more edges in the network would improve precision at the cost of performance. Adding an edge is essentially adding a statistical replica, and adding an edge between two distant nodes could reduce a lot of error with relatively little computation. How common are redundant edges in practice?

I'm guessing your choice probably also depends on which edges you actually care about. If you want to compare a bunch of derivatives to a known binder, a radial network might be perfect. If you're, idk, studying the evolution of a natural product, maybe you want a more linear network. Is it ever worth adding a redundant edge to a connected graph if I don't care about the result of that edge?

Are there other ways that the mapping and network layout affects computational expense?

Is there another sense in which the network layout and mappings affect accuracy, or is accuracy more or less synonymous with precision in this context? Can having the "wrong" edges introduce bias?

So I guess it seems like you usually want something close to a radial network, with all the ligands only a few steps from each other. The exception is when there are, say, two different classes of ligand, and a radial network will converge slowly or not at all. But using an MST for a bunch of simple benzene derivatives is misleading coz actually all the transformations have high scores, and an MST artificially spreads them out. So you might want something like a hub and spoke heuristic, where you group ligands into disconnected radial networks and then try to connect them with like a minimal spanning path.


Stream of consciousness over. Answers to those questions would be very useful but I know that's a lot to write. I'd appreciate any mistakes in the above being corrected. Here are some more focused questions:

  1. Does a radial network ever have a different number of edges than an MST?
  2. Is the performance and accuracy of a simulation very sensitive to the atom mapping, or is the mapping score more a proxy for "is this transformation going to work at all"?
  3. Do you ever, in practice, want redundant edges (ie, a cyclic network), or are the more complex heuristics just about where to put your n-1 edges?
  4. Are there ways that the network layout and atom mapping affect the accuracy of the result directly, or is it just about convergence/precision? Obvs an extremely imprecise result is going to be inaccurate, I just want to confirm that in this context we're talking about how things converge rather than what the underlying model says.

Member

I'll tag in @IAlibay to correct me on anything I'm wrong about here.

I'll start with the questions at the bottom:

Does a radial network ever have a different number of edges than an MST?

Both are the minimum number of edges to span the nodes, so no. At least, not as we define "radial network." There are other tools that start with the kind of radial network we have and then add in more edges (and still call it a "radial network"/"star map").

Is the performance and accuracy of a simulation very sensitive to the atom mapping, or is the mapping score more a proxy for "is this transformation going to work at all"?

A bit of both? Convergence can definitely depend on the mapping. But also, mapping score itself may not be perfectly correlated with convergence. The scores are just heuristics that people have come up with. A really bad score will probably be a really bad edge, but two scores that are close to each other? May be hard to guess which will actually converge more quickly.

Do you ever, in practice, want redundant edges (ie, a cyclic network), or are the more complex heuristics just about where to put your n-1 edges?

Yes, in fact most algorithms include cycles. The main reason: once you have a cycle, you know that the sum of the delta-Gs around that cycle should be 0. This can help you identify bad legs, and can be used to get a better maximum likelihood estimate.
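A minimal sketch of that cycle-closure check (plain Python; the ligand names and delta-G values are invented):

```python
# Around any closed cycle of relative free energy legs, the delta-Gs
# should sum to ~0; the deviation ("cycle closure error") flags bad legs.
legs = {                      # (from, to): delta-delta-G in kcal/mol
    ("A", "B"): -1.2,
    ("B", "C"): 0.5,
    ("C", "A"): 0.7,
}
cycle_error = abs(sum(legs.values()))
# cycle_error is ~0 here, so no leg in this cycle looks suspect
assert cycle_error < 0.1
```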

Are there ways that the network layout and atom mapping affect the accuracy of the result directly, or is it just about convergence/precision? Obvs an extremely imprecise result is going to be inaccurate, I just want to confirm that in this context we're talking about how things converge rather than what the underlying model says

Here I definitely want @IAlibay to check me, but I believe it is the case that, in the limit of infinite sampling, any edge should converge to whatever the force field predicts. But yeah, the question is how close to "infinite" your sampling needs to be before you can trust the results. There are pathological cases (e.g., your bound pose for your ligand is completely incorrect) where you might be better off with plain MD.


Answering a couple other points:

This seems to suggest for these examples with very simple transformations a radial graph might be better than an MST?

You argued well for the radial; the counter-argument for MST is that even radial can force you to use edges that only allow bad mappings.

It seems from documenting the ProtocolSettings that the compute time is fixed by default, so faster convergence would mean better precision rather than faster compute?

When we talk about faster convergence, it's "faster" with respect to MD time, not wall time. We'll worry about ns/day inside the protocol; the question is how much MD you need to do. Two ways this relates to wall time: (1) we have an early-abort, so if your precision is good enough, you can stop early; (2) sometimes you need to continue a simulation for better stats (we haven't made that easy yet, but it is on the way).

Is it ever worth adding a redundant edge to a connected graph if I don't care about the result of that edge?
Can having the "wrong" edges introduce bias?

For both of these, keep in mind that, in practice, we're calculating relative free energies and then using maximum likelihood to estimate the absolute free energies. So adding another good edge, that you don't care about, will improve your MLE result. And adding a really bad edge, especially if you have few cycles, can definitely bias you away from correctness.
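A sketch of that MLE step under a Gaussian-noise assumption (plain NumPy, invented numbers; OpenFE's actual estimator may differ):

```python
import numpy as np

# Each edge (i, j, ddg) measures g[j] - g[i]; least squares over all
# edges is the maximum-likelihood estimate for Gaussian edge errors.
edges = [(0, 1, -1.0), (1, 2, 0.5), (2, 3, -0.3), (3, 0, 0.8), (0, 2, -0.5)]
A = np.zeros((len(edges) + 1, 4))
b = np.zeros(len(edges) + 1)
for row, (i, j, ddg) in enumerate(edges):
    A[row, i], A[row, j] = -1.0, 1.0
    b[row] = ddg
A[-1, 0] = 1.0  # pin ligand 0 at zero to fix the arbitrary offset
g, *_ = np.linalg.lstsq(A, b, rcond=None)
# the cyclic (redundant) edges make g robust to noise in any single leg
```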

"id": "f9ad85a0-a3b2-4e95-9784-0a88ba790683",
"metadata": {},
"source": [
"OpenFE provides utilities for ingesting networks from ordinary Python datastructures like `list[tuple[str, str]]` and `list[tuple[int, int]]`. Each string or integer respectively names or indexes a ligand, and tuples represent edges in the network. This allows specific networks to be written by hand, and may be helpful for writing loading functions from other file formats."
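For illustration, the two data structures the cell describes look like this (plain Python, hypothetical ligand names, no OpenFE imports):

```python
# The same hand-written network as name pairs and as index pairs;
# each tuple is one edge of the ligand network.
ligand_names = ["benzene", "toluene", "phenol"]
edges_by_name = [("benzene", "toluene"), ("benzene", "phenol")]
index_of = {name: i for i, name in enumerate(ligand_names)}
edges_by_index = [(index_of[a], index_of[b]) for a, b in edges_by_name]
```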
Member

It might be nice to put more emphasis that this is useful if you have a more complex network and you know the edges you want.

"id": "31872d6f-028f-4974-9724-2f4654a7b914",
"metadata": {},
"source": [
"As only the topology of the network is encoded, OpenFE must produce atom mappings automatically; for more information, see [Choose an Atom Mapper]:\n",
Member

Suggested change
"As only the topology of the network is encoded, OpenFE must produce atom mappings automatically; for more information, see [Choose an Atom Mapper]:\n",
"As only the topology of the network is encoded, OpenFE needs an atom mapper to produce the atom mappings; for more information, see [Choose an Atom Mapper]:\n",

Contributor Author

I've put this in my own words:

As we will only specify the topology of the network, OpenFE must generate atom mappings for us. For this, it needs an atom mapper; for more information, see Choose an Atom Mapper:

cookbook/hand_write_ligand_network.ipynb — 2 outdated inline comments, resolved
cookbook/load_ligand_network.ipynb — 3 outdated inline comments, resolved
Comment on lines 81 to 82
"| Method | [openmmtools] [multistate] methods |\n",
"| Force Fields | [openmmforcefields] |\n",
Member

I see what you're going for with these rows, but I don't think these are the categories we want to deal with -- it gets a little too deep in the details. We'll probably come up with other categories we need here at some point.

Comment on lines 336 to 337
" temperature=298.15 * unit.kelvin, # Maintain constant temperature with Langevin dynamics\n",
" pressure=1 * unit.bar, # Maintain constant pressure with a Monte Carlo barostat\n",
Member

That it uses Langevin dynamics and a Monte Carlo barostat is particular to the protocol, not to the settings object.


Member

@dwhswenson dwhswenson left a comment

Got through them all! These look pretty good, and definitely fill in a lot of gaps.

In general, I'm pushing for "create" over "load" to describe what we're doing with networks. We use the word "load" frequently enough when deserializing an object (loading from disk) that I don't want there to be any confusion. Even with the Orion NES/FEP+ things, we're creating a network based on partial information, not loading exactly what they did.

  • hand_write_ligand_network.ipynb: This is good, but not what I expected from "hand write" -- there's still room for a cookbook entry that manually creates LigandAtomMappings and puts them into a LigandNetwork.
  • load_ligand_network.ipynb: might change the filename here -- I expected this to be loading from graphml (although that's a pretty boring one-liner)
  • ligandnetwork_vis.ipynb: Should also include the 2D visualization. Maybe change name "Visualizing Nodes" => "Visualizing Ligand Overlap"? I never got the widget to work; might be good for someone else to give it a try.
  • choose_protocol.ipynb: At the end, add mention that after being associated with a Protocol, settings should not be changed. I'm a little concerned about keeping this notebook up to date as we're adding new settings related to analysis outputs; advice for that would be most welcome.
  • create_alchemical_network.ipynb: On the "Manually" section, maybe add a comment that the names are chosen to be identical to the other one? Also, any reason the equality check at the end recreates instead of uses alchemical_network_auto?

Additionally, a number of small suggestions/comments inline.

@Yoshanuikabundi
Contributor Author

We discussed this synchronously at our last meeting, but just for posterity:

I'm a little concerned about keeping this notebook up to date as we're adding new settings related to analysis outputs; advice for that would be most welcome.

That's why the notebook asserts that the manually specified settings object is identical to the automatic one; changes to the defaults will be caught by CI (once the notebooks are actually run in CI). You pointed out that this will not catch the case when new settings are added to the object, and we agreed that incomplete documentation is better than incorrect documentation.
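The guard pattern described here, sketched with illustrative names (plain Python, not the real OpenFE settings API):

```python
# Build the settings two ways and assert they match; if upstream defaults
# change, CI running the notebook fails and flags the stale documentation.
def default_settings():
    # stand-in for the library's default-settings factory
    return {"temperature_K": 298.15, "pressure_bar": 1.0}

# the manually spelled-out values the notebook documents
manual = {"temperature_K": 298.15, "pressure_bar": 1.0}
assert manual == default_settings(), "defaults changed; update the notebook"
```

As noted above, this catches changed defaults but not newly added settings fields.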

Also, any reason the equality check at the end recreates instead of uses alchemical_network_auto?

I wanted to protect users who changed the first section where this is created.

@Yoshanuikabundi
Contributor Author

I never got the widget to work; might be good for someone else to give it a try.

Hmm... For me, the widgets all work when running Jupyter Lab from an environment based on openfe's environment.yml with openfe installed from source (and gufe installed from main per that environment), but the interact widget fails in an environment from ExampleNotebooks' .binder/environment.yml with Error displaying widget.

I can reproduce the issue in the from-source build by adding an ipywidgets<8 pin. Changing the ipywidgets pin in the binder environment to ipywidgets>=8 causes the environment to fail to build:

    The following packages are incompatible
    ├─ ipywidgets >=8  is requested and can be installed;
    └─ openfe >=0.10.1  is not installable because it requires
       ├─ gufe >=0.9.3,<0.10 , which requires
       │  └─ openff-toolkit >=0.13,<0.14 , which requires
       │     └─ ipywidgets 7.* , which conflicts with any installable versions previously reported;
       └─ openff-toolkit [0.13 |>=0.13,<0.14 ], which cannot be installed (as previously explained).

And if I update the gufe and openfe pins in the binder environment:

error    libmamba Could not solve for environment specs
    The following packages are incompatible
    ├─ gufe >=0.9.4  is installable and it requires
    │  └─ openff-toolkit >=0.13,<0.14 , which requires
    │     └─ ipywidgets 7.* , which can be installed;
    └─ ipywidgets >=8.0  is not installable because it conflicts with any installable versions previously reported.

So I think the problem is that some interaction between interact and py3Dmol relies on ipywidgets 8. My use of the function seems to be allowed in 7.x, so I think this might be a py3Dmol issue.

OpenFF Toolkit v0.13 pins ipywidgets <8, but since v0.14 this pin has been removed. So when Gufe supports OFF Toolkit 0.14+, the widget should suddenly work.

Since the widget doesn't work in Sphinx anyway (because its interactivity relies on the notebook's Python runtime), I've removed it from the notebook for the time being. I think it does make it much easier to visualize different edges though, so it would be good to be able to restore it in the future.

@richardjgowers does Gufe still need to pin openff-toolkit<0.14? Also I noticed openfe 0.14.0 is released on GitHub but not conda forge - mind if I ask what's going on there?

Member

@dwhswenson dwhswenson left a comment

These look good to me; we'll merge them and make any further clean-up a problem for later. For example, see #87.

One thing I'd like to check factually with @IAlibay (possibly @RiesBen or @hannahbaumann might know): The section on loading from Orion or FEP+ suggests that Orion outputs these files with the .dat extension and FEP+ with the .edge extension. I want to double-check that these are the default extensions actually used; I could imagine a game of telephone where someone gave us example files with an arbitrary extension, and now we're claiming that is the official one.

@dwhswenson dwhswenson merged commit e797390 into main Oct 10, 2023
@dwhswenson dwhswenson deleted the cookbook branch October 10, 2023 21:24