-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
T010 review #44
T010 review #44
Conversation
Closes #20. |
Hi @Mika-Le, this talktorial is ready for your review. Thank you! |
You can use ReviewNB: https://app.reviewnb.com//pull/44/
|
View / edit / reply to this conversation on ReviewNB Mika-Le commented on 2020-09-30T12:10:50Z The Wikipedia links could be put behind their respective topic (RMSD, structural superposition) as it is already mentioned that those are wikipedia articles dominiquesydow commented on 2020-10-01T11:11:51Z Absolutely, I missed that, thanks! |
View / edit / reply to this conversation on ReviewNB Mika-Le commented on 2020-09-30T12:10:51Z The variables pdb_ids1 and pdb_ids2 could be named more descriptive eg pdb_ids_sti and pdb_ids_imatinib or something along those lines. This might be a bit overkill though as they are only used once in the following cell dominiquesydow commented on 2020-10-01T11:14:17Z Great idea, I will rename the variables as suggested.
dominiquesydow commented on 2020-10-01T14:37:15Z I just remembered that we wanted to avoid EGFR-specific variable names because the notebooks can be run also with other targets (and ligands).
Thus, I will use pdb_ids_by_ligand_name and pdb_ids_by_ligand_code. |
View / edit / reply to this conversation on ReviewNB Mika-Le commented on 2020-09-30T12:10:53Z Is it possible to update this to 2020 and include IDs added in 2019 to make the notebook more current when being re-released
dominiquesydow commented on 2020-10-01T11:43:52Z Good point. No structures were added after 2019, so the results do not change.
Looking at it, I decided to remove the filtering step by year and added a frozen set of PDB IDs instead because it is more in-line with what is done in talktorial 008 and 001.
Added at the top of the practical part:
Then in step 4, we print all
Afterwards, we set the variable to the frozen set of PDB IDs:
__Note__: The next step has technical reasons only. In order to maintain the notebooks automatically, we define here a certain set of PDB IDs to work with. Thus, even if new PDB structures matching our filtering criteria are added to the PDB, they will not be included in the downstream analysis and therefore the output will stay the same. Remote this step if you want to work with the full set of PDB structures.
|
View / edit / reply to this conversation on ReviewNB Mika-Le commented on 2020-09-30T12:10:54Z This only applies if I understood the function correctly and the dict 'values' can be worked like a matrix:
Instead of deliberately throwing and catching errors we could keep indices and only calculate the upper matrix where i < j and write 0 when i == j without comparing the actual structures. Something along the lines of this (not tested!):
for i, (A, name_i) in enumerate(zip(structures, names): for j, (B, name_j) in enumerate(zip(structures, names): if (i == j): values[name_i][name_j] = 0.0 continue if (i < j): rmsd = calc_rmsd(A,B) values[name_i][name_j] = rmsd values[name_j][name_i] = rmsd continue ....
jaimergp commented on 2020-10-01T13:27:33Z This is an awesome suggestion! Thanks!
One more thing:
jaimergp commented on 2020-10-01T13:29:34Z An alternative implementation, since the matrix is symmetrical:
dominiquesydow commented on 2020-10-02T14:00:53Z @Mika-Le, I thought I commented on this - excellent. So much cleaner! I have already updated the code accordingly. Will add @jaimergp's suggestions later, too. Thanks!
|
Absolutely, I missed that, thanks! View entire conversation on ReviewNB |
Great idea, I will rename the variables as suggested.
View entire conversation on ReviewNB |
Good point. No structures were added after 2019, so the results do not change.
Looking at it, I decided to remove the filtering step by year and added a frozen set of PDB IDs instead because it is more in-line with what is done in talktorial 008 and 001.
Added at the top of the practical part:
Then in step 4, we print all
Afterwards, we set the variable to the frozen set of PDB IDs:
__Note__: The next step has technical reasons only. In order to maintain the notebooks automatically, we define here a certain set of PDB IDs to work with. Thus, even if new PDB structures matching our filtering criteria are added to the PDB, they will not be included in the downstream analysis and therefore the output will stay the same. Remote this step if you want to work with the full set of PDB structures.
View entire conversation on ReviewNB |
This is an awesome suggestion! Thanks!
One more thing:
View entire conversation on ReviewNB |
An alternative implementation, since the matrix is symmetrical:
View entire conversation on ReviewNB |
I just remembered that we wanted to avoid EGFR-specific variable names because the notebooks can be run also with other targets (and ligands).
Thus, I will use pdb_ids_by_ligand_name and pdb_ids_by_ligand_code. View entire conversation on ReviewNB |
Hi @jaimergp, is it possible that the structural alignment is not deterministic?
If it is deterministic, I will need to go on an error hunt. |
We might want to seed the random module here just in case: def seed_everything(seed=1234):
import random
import os
import numpy as np
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
np.random.seed(seed) |
Thanks! Where do I call that function? At the beginning of my Practical section? We decided on 22 as our default seed - or did we end with 1234 (in case we care)? |
@Mika-Le, I thought I commented on this - excellent. So much cleaner! I have already updated the code accordingly. Will add @jaimergp's suggestions later, too. Thanks!
View entire conversation on ReviewNB |
It doesn't matter as long as it's the same in every notebook run (it could be different across notebooks). That said, let's use 22. |
In issue #20 variable names say |
@jaimergp, I added the random module - the matrix stayed the same to before. |
Nice. I'll add a note to move that function to the library and reuse it in other notebooks too. |
Details
Content review
here
.DataFrames
)Code review
calc_rmsd_matrix
))a_variable_name
vsaVariableName
)black -l 99
) > will be checked for all notebooks at the end.for i in range(len(list))
(see slides)# TODO: CI
import ...
lines are at the top (practice part) cell, ordered by standard library / 3rd party packages / our own (teachopencadd.*
)Please add the following lines to the top of the notebook and use those paths
Questions
opencadd.structure.superposition
>MDAnalysis
- shall we silence this or is this indeed something that needs to be updated? > Addwarnings.simplefilter("ignore")
seaborn
specific. > Addwarnings.simplefilter("ignore")
cprofile
to profile the RMSD calculation >MDAnalysis.core.selection.StringSelection
is slow (fnmatch
used to select atoms); I added an issue onopencadd.structure.superposition
, maybe we can optimize our code to be a bit faster here: Slow performance of atom selection with MDAnalysis opencadd#47