T010 review #44

dominiquesydow · 2020-09-21T09:48:46Z

Details

Talktorial ID: 010
Title: Binding site similarity and off-target prediction
Original authors: Angelika Szengel, Marvis Sydow, Richard Gowers, Jaime Rodríguez-Guerra, Dominique Sydow
Reviewer(s): Dominique Sydow, Mareike Leja
Date of review: 2020-09-21
ReviewNB link

Content review

Potential labels or categories (e.g. machine learning, small molecules, online APIs): binding site comparison, binding site similarity, structural alignment, off-target prediction, EGFR, kinase, imatinib, PDB
One line summary: Similar binding sites are likely to bind the same ligand, causing unwanted side affects of a drug. Binding site similarity can be used to predict potential off-targets or alterate applications of a drug (drug repositioning).
The table of contents reflects the talktorial story-line; order of #, ##, ### headers is correct
URLs are linked with meaningful words, instead of pasting the URL directly or linking words like here.
I have spell-checked the notebook
Images have enough resolution to be rendered with quality, without being too heavy.
All figures have a description
Markdown cell content is still in-line with code cell output (whenever results are discussed)
I have checked that cell outputs are not incredibly long (this applies also to DataFrames)
Formatting looks correctly on the Sphinx render (bold, italics, figure placing)

Code review

Questions

This comes from opencadd.structure.superposition > MDAnalysis - shall we silence this or is this indeed something that needs to be updated? > Add warnings.simplefilter("ignore")
This warning comes when clustering the binding site RMDS matrix - seaborn specific. > Add warnings.simplefilter("ignore")

Use cprofile to profile the RMSD calculation > MDAnalysis.core.selection.StringSelection is slow (fnmatch used to select atoms); I added an issue on opencadd.structure.superposition, maybe we can optimize our code to be a bit faster here: Slow performance of atom selection with MDAnalysis opencadd#47

dominiquesydow · 2020-09-28T12:41:46Z

Closes #20.

dominiquesydow · 2020-09-28T12:51:55Z

Hi @Mika-Le, this talktorial is ready for your review. Thank you!

dominiquesydow · 2020-09-30T11:15:17Z

You can use ReviewNB: https://app.reviewnb.com//pull/44/

review-notebook-app · 2020-09-30T12:10:50Z

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:50Z
----------------------------------------------------------------

The Wikipedia links could be put behind their respective topic (RMSD, structural superposition) as it is already mentioned that those are wikipedia articles

dominiquesydow commented on 2020-10-01T11:11:51Z
----------------------------------------------------------------

Absolutely, I missed that, thanks!

review-notebook-app · 2020-09-30T12:10:51Z

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:51Z
----------------------------------------------------------------

The variables pdb_ids1 and pdb_ids2 could be named more descriptive eg pdb_ids_sti and pdb_ids_imatinib or something along those lines.

This might be a bit overkill though as they are only used once in the following cell

dominiquesydow commented on 2020-10-01T11:14:17Z
----------------------------------------------------------------

Great idea, I will rename the variables as suggested.

dominiquesydow commented on 2020-10-01T14:37:15Z
----------------------------------------------------------------

I just remembered that we wanted to avoid EGFR-specific variable names because the notebooks can be run also with other targets (and ligands).

Thus, I will use pdb_ids_by_ligand_name and pdb_ids_by_ligand_code.

review-notebook-app · 2020-09-30T12:10:53Z

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:53Z
----------------------------------------------------------------

Is it possible to update this to 2020 and include IDs added in 2019 to make the notebook more current when being re-released

dominiquesydow commented on 2020-10-01T11:43:52Z
----------------------------------------------------------------

Good point. No structures were added after 2019, so the results do not change.

Looking at it, I decided to remove the filtering step by year and added a frozen set of PDB IDs instead because it is more in-line with what is done in talktorial 008 and 001.

Added at the top of the practical part:

FROZEN_PDB_IDS = ["4CSV", "1XBB", "3HEC", "3FW1", "4R7I", "1T46", "6JOL", "2PL0"]

Then in step 4, we print all pdb_ids that we would find with our pipeline.

Afterwards, we set the variable to the frozen set of PDB IDs:

__Note__: The next step has technical reasons only. In order to maintain the notebooks automatically, we define here a certain set of PDB IDs to work with. Thus, even if new PDB structures matching our filtering criteria are added to the PDB, they will not be included in the downstream analysis and therefore the output will stay the same. Remote this step if you want to work with the full set of PDB structures.

pdb_ids = FROZEN_PDB_IDS

print("Final set of PDB IDs:")

print(*pdb_ids)

review-notebook-app · 2020-09-30T12:10:54Z

View / edit / reply to this conversation on ReviewNB

Mika-Le commented on 2020-09-30T12:10:54Z
----------------------------------------------------------------

This only applies if I understood the function correctly and the dict 'values' can be worked like a matrix:

Instead of deliberately throwing and catching errors we could keep indices and only calculate the upper matrix where i < j and write 0 when i == j without comparing the actual structures. Something along the lines of this (not tested!):

for i, (A, name_i) in enumerate(zip(structures, names):
  for j, (B, name_j) in enumerate(zip(structures, names): 
    if (i == j):
      values[name_i][name_j] = 0.0
      continue
    if (i < j):
      rmsd = calc_rmsd(A,B)
      values[name_i][name_j] = rmsd
      values[name_j][name_i] = rmsd
      continue
....

jaimergp commented on 2020-10-01T13:27:33Z
----------------------------------------------------------------

This is an awesome suggestion! Thanks!

One more thing:

values = {name: {} for name in names} can be replaced with values = collections.defaultdict(dict) , collections being part of the standard library.

jaimergp commented on 2020-10-01T13:29:34Z
----------------------------------------------------------------

An alternative implementation, since the matrix is symmetrical:

values[name_i][name_j] = values[name_j][name_i] = calc_rmsd(A, B)

dominiquesydow commented on 2020-10-02T14:00:53Z
----------------------------------------------------------------

@Mika-Le, I thought I commented on this - excellent. So much cleaner! I have already updated the code accordingly. Will add @jaimergp's suggestions later, too. Thanks!

dominiquesydow · 2020-10-01T11:11:52Z

Absolutely, I missed that, thanks!

View entire conversation on ReviewNB

dominiquesydow · 2020-10-01T11:14:17Z

Great idea, I will rename the variables as suggested.

View entire conversation on ReviewNB

dominiquesydow · 2020-10-01T11:43:53Z

Good point. No structures were added after 2019, so the results do not change.

Looking at it, I decided to remove the filtering step by year and added a frozen set of PDB IDs instead because it is more in-line with what is done in talktorial 008 and 001.

Added at the top of the practical part:

FROZEN_PDB_IDS = ["4CSV", "1XBB", "3HEC", "3FW1", "4R7I", "1T46", "6JOL", "2PL0"]

Then in step 4, we print all pdb_ids that we would find with our pipeline.

Afterwards, we set the variable to the frozen set of PDB IDs:

__Note__: The next step has technical reasons only. In order to maintain the notebooks automatically, we define here a certain set of PDB IDs to work with. Thus, even if new PDB structures matching our filtering criteria are added to the PDB, they will not be included in the downstream analysis and therefore the output will stay the same. Remote this step if you want to work with the full set of PDB structures.

pdb_ids = FROZEN_PDB_IDS

print("Final set of PDB IDs:")

print(*pdb_ids)

View entire conversation on ReviewNB

jaimergp · 2020-10-01T13:27:34Z

This is an awesome suggestion! Thanks!

One more thing:

values = {name: {} for name in names} can be replaced with values = collections.defaultdict(dict) , collections being part of the standard library.

View entire conversation on ReviewNB

jaimergp · 2020-10-01T13:29:34Z

An alternative implementation, since the matrix is symmetrical:

values[name_i][name_j] = values[name_j][name_i] = calc_rmsd(A, B)

View entire conversation on ReviewNB

dominiquesydow · 2020-10-01T14:37:16Z

I just remembered that we wanted to avoid EGFR-specific variable names because the notebooks can be run also with other targets (and ligands).

Thus, I will use pdb_ids_by_ligand_name and pdb_ids_by_ligand_code.

View entire conversation on ReviewNB

dominiquesydow · 2020-10-01T15:02:58Z

Hi @jaimergp, is it possible that the structural alignment is not deterministic?

master RMSD matrix:

https://github.com/volkamerlab/TeachOpenCADD/blob/reviews/teachopencadd/talktorials/010_binding_site_comparison/talktorial.ipynb

ds-010-review RMSD matrix:

https://github.com/volkamerlab/TeachOpenCADD/blob/b33118cabe5bf81addc636b536440d26eb543053/teachopencadd/talktorials/010_binding_site_comparison/talktorial.ipynb
Some values match, others don't.

If it is deterministic, I will need to go on an error hunt.

jaimergp · 2020-10-02T10:07:29Z

We might want to seed the random module here just in case:

def seed_everything(seed=1234):
    import random
    import os
    import numpy as np
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)

dominiquesydow · 2020-10-02T13:41:27Z

Thanks!

Where do I call that function? At the beginning of my Practical section?

We decided on 22 as our default seed - or did we end with 1234 (in case we care)?

dominiquesydow · 2020-10-02T14:00:54Z

@Mika-Le, I thought I commented on this - excellent. So much cleaner! I have already updated the code accordingly. Will add @jaimergp's suggestions later, too. Thanks!

View entire conversation on ReviewNB

jaimergp · 2020-10-02T14:51:58Z

We decided on 22 as our default seed - or did we end with 1234 (in case we care)?

It doesn't matter as long as it's the same in every notebook run (it could be different across notebooks). That said, let's use 22. 1234 comes from the library I "borrowed" that code from 😬

jaimergp · 2020-10-02T16:41:24Z

In issue #20 variable names say pbd, not pdb. I haven't checked if this has been addressed here, but just in case!

dominiquesydow · 2020-10-05T10:00:04Z

@jaimergp, I added the random module - the matrix stayed the same to before.
The notebook is ready to go.

jaimergp · 2020-10-05T10:08:17Z

Nice. I'll add a note to move that function to the library and reuse it in other notebooks too.

Start branch

91a804b

dominiquesydow changed the title ~~T006 review~~ T010 review Sep 21, 2020

Refactor (pythonic code) and remove biopandas

c5501d0

dominiquesydow self-assigned this Sep 22, 2020

dominiquesydow added 2 commits September 22, 2020 17:22

Delete output plot files (not needed)

38dcdf6

Check text/links/refs

9223073

dominiquesydow mentioned this pull request Sep 28, 2020

Slow performance of atom selection with MDAnalysis volkamerlab/opencadd#47

Open

Silence warnings

b0166d6

dominiquesydow requested a review from Mika-Le September 28, 2020 12:51

Apply Mika-Le's suggestions

b33118c

Add random module

0254d56

jaimergp merged commit 9c86946 into reviews Oct 5, 2020

jaimergp deleted the ds-010-review branch October 5, 2020 10:08

jaimergp mentioned this pull request Oct 22, 2020

Talktorial 10 - Input Cell #10 #20

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

T010 review #44

T010 review #44

dominiquesydow commented Sep 21, 2020 •

edited

Loading

dominiquesydow commented Sep 28, 2020 •

edited

Loading

dominiquesydow commented Sep 28, 2020

dominiquesydow commented Sep 30, 2020

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading

dominiquesydow commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

jaimergp commented Oct 1, 2020

jaimergp commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

jaimergp commented Oct 2, 2020

dominiquesydow commented Oct 2, 2020

dominiquesydow commented Oct 2, 2020

jaimergp commented Oct 2, 2020

jaimergp commented Oct 2, 2020

dominiquesydow commented Oct 5, 2020

jaimergp commented Oct 5, 2020

T010 review #44

T010 review #44

Conversation

dominiquesydow commented Sep 21, 2020 • edited Loading

Details

Content review

Code review

Questions

dominiquesydow commented Sep 28, 2020 • edited Loading

dominiquesydow commented Sep 28, 2020

dominiquesydow commented Sep 30, 2020

review-notebook-app bot commented Sep 30, 2020 • edited Loading

review-notebook-app bot commented Sep 30, 2020 • edited Loading

review-notebook-app bot commented Sep 30, 2020 • edited Loading

review-notebook-app bot commented Sep 30, 2020 • edited Loading

dominiquesydow commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

jaimergp commented Oct 1, 2020

jaimergp commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

dominiquesydow commented Oct 1, 2020

jaimergp commented Oct 2, 2020

dominiquesydow commented Oct 2, 2020

dominiquesydow commented Oct 2, 2020

jaimergp commented Oct 2, 2020

jaimergp commented Oct 2, 2020

dominiquesydow commented Oct 5, 2020

jaimergp commented Oct 5, 2020

dominiquesydow commented Sep 21, 2020 •

edited

Loading

dominiquesydow commented Sep 28, 2020 •

edited

Loading

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading

review-notebook-app bot commented Sep 30, 2020 •

edited

Loading