Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T010 review #44

Merged
merged 7 commits into from
Oct 5, 2020
Merged

T010 review #44

merged 7 commits into from
Oct 5, 2020

Conversation

dominiquesydow
Copy link
Collaborator

@dominiquesydow dominiquesydow commented Sep 21, 2020

Details

  • Talktorial ID: 010
  • Title: Binding site similarity and off-target prediction
  • Original authors: Angelika Szengel, Marvis Sydow, Richard Gowers, Jaime Rodríguez-Guerra, Dominique Sydow
  • Reviewer(s): Dominique Sydow, Mareike Leja
  • Date of review: 2020-09-21
  • ReviewNB link

Content review

  • Potential labels or categories (e.g. machine learning, small molecules, online APIs): binding site comparison, binding site similarity, structural alignment, off-target prediction, EGFR, kinase, imatinib, PDB
  • One line summary: Similar binding sites are likely to bind the same ligand, causing unwanted side affects of a drug. Binding site similarity can be used to predict potential off-targets or alterate applications of a drug (drug repositioning).
  • The table of contents reflects the talktorial story-line; order of #, ##, ### headers is correct
  • URLs are linked with meaningful words, instead of pasting the URL directly or linking words like here.
  • I have spell-checked the notebook
  • Images have enough resolution to be rendered with quality, without being too heavy.
  • All figures have a description
  • Markdown cell content is still in-line with code cell output (whenever results are discussed)
  • I have checked that cell outputs are not incredibly long (this applies also to DataFrames)
  • Formatting looks correctly on the Sphinx render (bold, italics, figure placing)

Code review

  • Time it took to execute (approx.): About 4 minutes (calculating RMSD between 8 proteins takes 3 min (calc_rmsd_matrix))
  • Variable and function names follow snake case rules (e.g. a_variable_name vs aVariableName)
  • Spacing follows PEP8 (run Black on the code cells if needed)
  • Code line are under 99 characters each (run black -l 99) > will be checked for all notebooks at the end.
  • Comments are useful and well placed
  • There are no unpythonic idioms like for i in range(len(list)) (see slides)
  • All 3rd party dependencies are listed at the top of the notebook
  • I have marked all code cell with output referenced in markdown cells with the label # TODO: CI
  • I have identified potential candidates for a code refactor / useful functions
  • All import ... lines are at the top (practice part) cell, ordered by standard library / 3rd party packages / our own (teachopencadd.*)
  • I have update the relative paths to absolute paths.
    Please add the following lines to the top of the notebook and use those paths
    HERE = Path(_dh[-1])
    DATA = HERE / "data"
  • List here unfamiliar libraries you find in the imports and their intended use: None.

Questions

  • This comes from opencadd.structure.superposition > MDAnalysis - shall we silence this or is this indeed something that needs to be updated? > Add warnings.simplefilter("ignore")
    image
  • This warning comes when clustering the binding site RMDS matrix - seaborn specific. > Add warnings.simplefilter("ignore")
    image

@dominiquesydow dominiquesydow changed the title T006 review T010 review Sep 21, 2020
@dominiquesydow dominiquesydow self-assigned this Sep 22, 2020
@dominiquesydow
Copy link
Collaborator Author

dominiquesydow commented Sep 28, 2020

Closes #20.

@dominiquesydow
Copy link
Collaborator Author

Hi @Mika-Le, this talktorial is ready for your review. Thank you!

Copy link
Collaborator Author

You can use ReviewNB: https://app.reviewnb.com//pull/44/

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 30, 2020

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:50Z
----------------------------------------------------------------

The Wikipedia links could be put behind their respective topic (RMSD, structural superposition) as it is already mentioned that those are wikipedia articles


dominiquesydow commented on 2020-10-01T11:11:51Z
----------------------------------------------------------------

Absolutely, I missed that, thanks!

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 30, 2020

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:51Z
----------------------------------------------------------------

The variables pdb_ids1 and pdb_ids2 could be named more descriptive eg pdb_ids_sti and pdb_ids_imatinib or something along those lines.

This might be a bit overkill though as they are only used once in the following cell


dominiquesydow commented on 2020-10-01T11:14:17Z
----------------------------------------------------------------

Great idea, I will rename the variables as suggested.

dominiquesydow commented on 2020-10-01T14:37:15Z
----------------------------------------------------------------

I just remembered that we wanted to avoid EGFR-specific variable names because the notebooks can be run also with other targets (and ligands).

Thus, I will use pdb_ids_by_ligand_name and pdb_ids_by_ligand_code.

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 30, 2020

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:53Z
----------------------------------------------------------------

Is it possible to update this to 2020 and include IDs added in 2019 to make the notebook more current when being re-released


dominiquesydow commented on 2020-10-01T11:43:52Z
----------------------------------------------------------------

Good point. No structures were added after 2019, so the results do not change.

Looking at it, I decided to remove the filtering step by year and added a frozen set of PDB IDs instead because it is more in-line with what is done in talktorial 008 and 001.

Added at the top of the practical part:

FROZEN_PDB_IDS = ["4CSV", "1XBB", "3HEC", "3FW1", "4R7I", "1T46", "6JOL", "2PL0"]

Then in step 4, we print all pdb_ids that we would find with our pipeline.

Afterwards, we set the variable to the frozen set of PDB IDs:

__Note__: The next step has technical reasons only. In order to maintain the notebooks automatically, we define here a certain set of PDB IDs to work with. Thus, even if new PDB structures matching our filtering criteria are added to the PDB, they will not be included in the downstream analysis and therefore the output will stay the same. Remote this step if you want to work with the full set of PDB structures.

pdb_ids = FROZEN_PDB_IDS

print("Final set of PDB IDs:")

print(*pdb_ids)

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 30, 2020

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:54Z
----------------------------------------------------------------

This only applies if I understood the function correctly and the dict 'values' can be worked like a matrix:

Instead of deliberately throwing and catching errors we could keep indices and only calculate the upper matrix where i < j and write 0 when i == j without comparing the actual structures. Something along the lines of this (not tested!):

for i, (A, name_i) in enumerate(zip(structures, names):
  for j, (B, name_j) in enumerate(zip(structures, names): 
    if (i == j):
      values[name_i][name_j] = 0.0
      continue
    if (i < j):
      rmsd = calc_rmsd(A,B)
      values[name_i][name_j] = rmsd
      values[name_j][name_i] = rmsd
      continue
....


jaimergp commented on 2020-10-01T13:27:33Z
----------------------------------------------------------------

This is an awesome suggestion! Thanks!

One more thing:

values = {name: {} for name in names} can be replaced with values = collections.defaultdict(dict) , collections being part of the standard library.

jaimergp commented on 2020-10-01T13:29:34Z
----------------------------------------------------------------

An alternative implementation, since the matrix is symmetrical:

values[name_i][name_j] = values[name_j][name_i] = calc_rmsd(A, B)

dominiquesydow commented on 2020-10-02T14:00:53Z
----------------------------------------------------------------

@Mika-Le, I thought I commented on this - excellent. So much cleaner! I have already updated the code accordingly. Will add @jaimergp's suggestions later, too. Thanks!

Copy link
Collaborator Author

Absolutely, I missed that, thanks!


View entire conversation on ReviewNB

Copy link
Collaborator Author

Great idea, I will rename the variables as suggested.


View entire conversation on ReviewNB

Copy link
Collaborator Author

Good point. No structures were added after 2019, so the results do not change.

Looking at it, I decided to remove the filtering step by year and added a frozen set of PDB IDs instead because it is more in-line with what is done in talktorial 008 and 001.

Added at the top of the practical part:

FROZEN_PDB_IDS = ["4CSV", "1XBB", "3HEC", "3FW1", "4R7I", "1T46", "6JOL", "2PL0"]

Then in step 4, we print all pdb_ids that we would find with our pipeline.

Afterwards, we set the variable to the frozen set of PDB IDs:

__Note__: The next step has technical reasons only. In order to maintain the notebooks automatically, we define here a certain set of PDB IDs to work with. Thus, even if new PDB structures matching our filtering criteria are added to the PDB, they will not be included in the downstream analysis and therefore the output will stay the same. Remote this step if you want to work with the full set of PDB structures.

pdb_ids = FROZEN_PDB_IDS

print("Final set of PDB IDs:")

print(*pdb_ids)


View entire conversation on ReviewNB

Copy link
Contributor

jaimergp commented Oct 1, 2020

This is an awesome suggestion! Thanks!

One more thing:

values = {name: {} for name in names} can be replaced with values = collections.defaultdict(dict) , collections being part of the standard library.


View entire conversation on ReviewNB

Copy link
Contributor

jaimergp commented Oct 1, 2020

An alternative implementation, since the matrix is symmetrical:

values[name_i][name_j] = values[name_j][name_i] = calc_rmsd(A, B)


View entire conversation on ReviewNB

Copy link
Collaborator Author

I just remembered that we wanted to avoid EGFR-specific variable names because the notebooks can be run also with other targets (and ligands).

Thus, I will use pdb_ids_by_ligand_name and pdb_ids_by_ligand_code.


View entire conversation on ReviewNB

@dominiquesydow
Copy link
Collaborator Author

Hi @jaimergp, is it possible that the structural alignment is not deterministic?

master RMSD matrix:
image
https://github.com/volkamerlab/TeachOpenCADD/blob/reviews/teachopencadd/talktorials/010_binding_site_comparison/talktorial.ipynb

ds-010-review RMSD matrix:
image
https://github.com/volkamerlab/TeachOpenCADD/blob/b33118cabe5bf81addc636b536440d26eb543053/teachopencadd/talktorials/010_binding_site_comparison/talktorial.ipynb
Some values match, others don't.

If it is deterministic, I will need to go on an error hunt.

@jaimergp
Copy link
Contributor

jaimergp commented Oct 2, 2020

We might want to seed the random module here just in case:

def seed_everything(seed=1234):
    import random
    import os
    import numpy as np
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)

@dominiquesydow
Copy link
Collaborator Author

Thanks!

Where do I call that function? At the beginning of my Practical section?

We decided on 22 as our default seed - or did we end with 1234 (in case we care)?

Copy link
Collaborator Author

@Mika-Le, I thought I commented on this - excellent. So much cleaner! I have already updated the code accordingly. Will add @jaimergp's suggestions later, too. Thanks!


View entire conversation on ReviewNB

@jaimergp
Copy link
Contributor

jaimergp commented Oct 2, 2020

We decided on 22 as our default seed - or did we end with 1234 (in case we care)?

It doesn't matter as long as it's the same in every notebook run (it could be different across notebooks). That said, let's use 22. 1234 comes from the library I "borrowed" that code from 😬

@jaimergp
Copy link
Contributor

jaimergp commented Oct 2, 2020

In issue #20 variable names say pbd, not pdb. I haven't checked if this has been addressed here, but just in case!

@dominiquesydow
Copy link
Collaborator Author

@jaimergp, I added the random module - the matrix stayed the same to before.
The notebook is ready to go.

@jaimergp
Copy link
Contributor

jaimergp commented Oct 5, 2020

Nice. I'll add a note to move that function to the library and reuse it in other notebooks too.

@jaimergp jaimergp merged commit 9c86946 into reviews Oct 5, 2020
@jaimergp jaimergp deleted the ds-010-review branch October 5, 2020 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants