[WIP] adding function that attempts to do some sort of clustering #21

Open
wants to merge 3 commits into
base: master
Contributor

@hannahbrucemacdonald hannahbrucemacdonald commented Aug 21, 2020

This PR adds the function cluster_snapshots, which loads N snapshots, clusters them based on the old-ligand position, and then finds the index of the snapshot closest to the mean of the largest cluster.

This still needs work:

  • pick frames at random based on what is on disk (the clone/gen ranges are currently hard-coded, which isn't safe if it tries to open a clone/gen that hasn't run, or if the number of clones/gens changes in future iterations)
  • it currently only runs on the old ligand; in future we may want to do this for the new ligand too
  • optimise the clustering parameters. I chose 0.5 as it looked OK for one example; something else might be better
  • add the ligand RMSD* and protein RMSD to the oemol, or have them as an entry in the ligands.sdf file that is printed out
  • integrate into the main analysis pipeline. For now, let's use this alongside extract_snapshot, saving to disk with a different filename, before we replace it

* this is the ligand RMSD to itself, between the random snapshots, and not the RMSD of the scaffold-core to the crystallographic positions (which would also be interesting)
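For readers following along, here is a minimal, standard-library-only sketch of the idea described above: group frames by ligand RMSD, take the largest group, and return the index of the frame closest to that group's mean coordinates. It is not the code in this PR. The clustering method (single-linkage with a distance cutoff), the function names, and the interpretation of 0.5 as an RMSD cutoff are all assumptions made for illustration, and no structural alignment is performed.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# One ligand conformation: a list of (x, y, z) atom positions.
Coords = List[Tuple[float, float, float]]


def rmsd(a: Coords, b: Coords) -> float:
    """RMSD between two equally sized coordinate sets (no superposition)."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def cluster_by_cutoff(frames: Sequence[Coords], cutoff: float) -> List[List[int]]:
    """Single-linkage clustering: frames linked by RMSD <= cutoff share a cluster.

    Returns clusters as lists of frame indices, largest cluster first.
    """
    parent = list(range(len(frames)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            if rmsd(frames[i], frames[j]) <= cutoff:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = defaultdict(list)
    for i in range(len(frames)):
        groups[find(i)].append(i)
    return sorted(groups.values(), key=len, reverse=True)


def representative_index(frames: Sequence[Coords], cutoff: float = 0.5) -> int:
    """Index of the frame closest to the mean of the largest cluster."""
    largest = cluster_by_cutoff(frames, cutoff)[0]
    n_atoms = len(frames[largest[0]])
    # Per-atom mean position over the frames in the largest cluster.
    mean: Coords = [
        tuple(sum(frames[i][a][k] for i in largest) / len(largest) for k in range(3))
        for a in range(n_atoms)
    ]
    return min(largest, key=lambda i: rmsd(frames[i], mean))


# Toy input: four single-atom "ligands"; the last one is an outlier.
toy_frames = [
    [(0.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0)],
    [(0.2, 0.0, 0.0)],
    [(5.0, 0.0, 0.0)],
]
best = representative_index(toy_frames, cutoff=0.5)
```

With real trajectories the coordinates would come from the loaded snapshots, and the cutoff would need tuning as noted in the checklist.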

Contributor

@mcwitt mcwitt left a comment

Looks really good so far! Seems like the main things to finish are 1) add a call to cluster_snapshots from analyze_run, and 2) get a list of existing clones and gens for random selection (see my comment). EDIT: sorry, missed your checklist above!

clone = random.randint(0, 99)
gen = random.randint(0, 2)
if i == 0:
# TODO safeguard against trying to load output that doesn't exist --- choose n_snapshots randomly from those that exist on disk
Contributor

Is this to be called from analyze_run (as we do for save_representative_snapshots here)? If so, we'll have access to the Works extracted from the globals.csv files.

Assuming that the existence of a globals.csv implies that a trajectory was also written, you should be able to get the existing trajectories by passing a works: List[Work] argument down to this function and using something like

clones = [work.path.clone for work in works]
gens = [work.path.gen for work in works]
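As a rough illustration of the suggestion above, the sketch below collects the (clone, gen) pairs that actually exist, keeping each clone paired with its gen rather than holding two separate lists. The Work and ResultPath classes here are hypothetical stand-ins that only mirror the work.path.clone and work.path.gen attributes used in the snippet; the real classes in the repository may look different.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ResultPath:
    """Hypothetical stand-in: only the fields used below."""
    clone: int
    gen: int


@dataclass(frozen=True)
class Work:
    """Hypothetical stand-in for the Work objects built from globals.csv."""
    path: ResultPath


def existing_pairs(works: Sequence[Work]) -> List[Tuple[int, int]]:
    """Unique (clone, gen) pairs present in `works`, in sorted order."""
    return sorted({(w.path.clone, w.path.gen) for w in works})


# Toy input with one duplicate entry.
works = [
    Work(ResultPath(clone=3, gen=0)),
    Work(ResultPath(clone=0, gen=1)),
    Work(ResultPath(clone=0, gen=0)),
    Work(ResultPath(clone=3, gen=0)),
]
pairs = existing_pairs(works)
```

This relies on the same assumption stated above: a globals.csv on disk implies the matching trajectory was written too.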

Contributor Author

Yes! It should (eventually) take the place of save_snapshots, as it's just a slightly more comprehensive way to do the same thing. Do you think works is a better kwarg, or a list of clones and gens?

Contributor Author

I think it may be better to take the clones/gens. If we want to choose at random, we can pass random values into the function, but we may also want to cluster ALL of GEN0, so maybe it makes more sense to keep the clone/gen selection logic outside the function?
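To make the proposal concrete, here is a small sketch of what keeping the selection logic outside the function could look like: the caller builds the list of (clone, gen) pairs, either all of one gen or a random subset, and would then hand that list to the clustering function. The helper names all_of_gen and random_subset are hypothetical and do not exist in the repository.

```python
import random
from typing import List, Optional, Sequence, Tuple

CloneGen = Tuple[int, int]  # (clone, gen)


def all_of_gen(pairs: Sequence[CloneGen], gen: int) -> List[CloneGen]:
    """Every available pair belonging to the given gen, e.g. all of GEN0."""
    return [(c, g) for c, g in pairs if g == gen]


def random_subset(
    pairs: Sequence[CloneGen], n: int, seed: Optional[int] = None
) -> List[CloneGen]:
    """Up to `n` pairs drawn without replacement from those available."""
    rng = random.Random(seed)
    return rng.sample(list(pairs), min(n, len(pairs)))


# Pairs known to exist on disk (toy values).
on_disk = [(0, 0), (1, 0), (0, 1), (2, 0)]

gen0 = all_of_gen(on_disk, gen=0)             # cluster ALL of GEN0
picked = random_subset(on_disk, n=2, seed=0)  # or a random selection
```

Either list could then be passed to the clustering function, which no longer needs to know how the frames were chosen.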

Contributor

Passing just a list of clones/gens sounds reasonable to me (especially if we're not actually going to use the works here). The list_results function in lib.py gets a listing of clones and gens given a project path and run.

Actually, it might make sense to change things around a little bit to call list_path just once in analyze_run, and pass the result to both extract_works and to this function. (We can always clean this up later, too)

covid_moonshot/analysis/structures.py (outdated, resolved)
Base automatically changed from compile-output to master August 22, 2020 03:48
2 participants