T03 ma proteinfolding #5

Open
wants to merge 28 commits into
base: main
b86cca2
Initial commit
Introvertuoso Jun 5, 2023
a42f2c8
Add practical section
Introvertuoso Jun 5, 2023
cd292aa
Update data README.md
Introvertuoso Jun 5, 2023
88988a2
Add contents of practical (sections)
Introvertuoso Jun 5, 2023
23c5023
Change input sequence
Jun 6, 2023
4e4f871
Update intro
Introvertuoso Jun 6, 2023
5dbbfe7
Add output
Jun 6, 2023
3ea034c
Merge remote-tracking branch 'origin/T03-MA_proteinfolding' into T03-…
Introvertuoso Jun 6, 2023
6367e1f
Fix writing
Introvertuoso Jun 6, 2023
07cc9e1
Update notebook
Introvertuoso Jun 6, 2023
d4341d2
Update notebook
Introvertuoso Jun 7, 2023
6ad268c
Update talktorial
Introvertuoso Jun 14, 2023
b271ab9
Add images
Introvertuoso Jun 14, 2023
afa245d
Update README.md
Introvertuoso Jun 14, 2023
a611dcc
Update talktorial
Introvertuoso Jun 14, 2023
9cf66c6
Update OmegaFold
Introvertuoso Jun 15, 2023
4772bee
Update OmegaFold
Introvertuoso Jun 15, 2023
4618d75
Check grammar
Introvertuoso Jun 16, 2023
a57e2aa
Add new protein
Introvertuoso Jun 16, 2023
99e8188
Add new protein
Introvertuoso Jun 22, 2023
4576eea
Update notebook
Introvertuoso Jun 26, 2023
b003170
Little cleanup
Introvertuoso Jun 27, 2023
46138ca
Update the notebook (practical part)
Introvertuoso Jun 30, 2023
c3eb1b5
Update the notebook (practical part)
Introvertuoso Jun 30, 2023
ac81bc7
Update the notebook (practical part)
Introvertuoso Jun 30, 2023
ba65cfd
Update the notebook
Introvertuoso Jul 1, 2023
5d70e66
Add requirements.txt
Introvertuoso Jul 1, 2023
0500197
Update README.md and T03_proteinfolding.ipynb
Introvertuoso Jul 3, 2023
Add practical section
Introvertuoso committed Jun 5, 2023
commit a42f2c840c0c45a5896df9883abd2f61bc661ac4
46 changes: 42 additions & 4 deletions notebooks/T03_proteinfolding/T03_proteinfolding.ipynb
@@ -163,7 +163,7 @@
"\n",
"OmegaFold is built atop a deep transformer-based protein language model (PLM) called OmegaPLM. It is trained on a large collection of unaligned and unlabeled protein sequences to learn single- and pairwise-residue embeddings, powerful features that model the distribution of sequences. Through these embeddings, OmegaPLM captures structural and functional information encoded in the amino-acid sequences. The embeddings are then fed into Geoformer, a new geometry-inspired transformer neural network, to further distill the structural and physical pairwise relationships between amino acids. Lastly, a structural module predicts the 3D coordinates of all heavy atoms.\n",
"\n",
"<img src=\"../images/figure1.png\" width=\"1800\">\n",
"<img src=\"./images/figure1.png\" width=\"1800\">\n",
"\n",
"**Fig. 1**: Overview of OmegaFold and results. (A) Model architecture. (B) Evaluations. (C) Runtime analysis. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)\n",
"\n",
@@ -179,15 +179,15 @@
"\n",
"OmegaFold's ability to predict protein structures was assessed on two benchmarks: a CASP set with 29 of the most challenging proteins from the free-modelling category in two recent CASP experiments, and a CAMEO dataset with 146 of the most recent single-chain proteins (appearing in the first 6 months of the 2022 CAMEO evaluation), spanning a wide range of prediction difficulty levels. For comparison, predictions were also computed with other state-of-the-art (SOTA) methods run in their default mode with MSAs as input. Remarkably, the structures predicted by OmegaFold, with a single sequence as input, were as accurate as those of the advanced MSA-based methods (**Fig. 1B**). On the CAMEO dataset, OmegaFold structures had a mean local-distance difference test (LDDT) score of 0.82, comparable to the accuracy of other SOTA methods predicting from MSAs. (LDDT is a commonly used metric for structure evaluation.) On the more challenging CASP dataset, OmegaFold structures were also quite accurate, with an average TM-score (a common metric for assessing the topological similarity of protein structures) of 0.79, slightly lower than that of other SOTA methods. The relative performance of OmegaFold was also tested against the single-sequence versions of AlphaFold2 and RoseTTAFold on these two datasets. When only a single sequence was given as input, their predicted structures were significantly less accurate than those of OmegaFold (**Fig. 1B**), indicating that the performance of the MSA-based methods drops when evolutionary information is not given.\n",
"\n",
"<img src=\"../images/figure2.png\" width=\"1800\">\n",
"<img src=\"./images/figure2.png\" width=\"1800\">\n",
"\n",
"**Fig. 2**: OmegaPLM and geometric smoothing. (A) OmegaPLM pretraining routine. (B) Geoformer at work. (C) Geoformer evaluation. (D) Visualization of contact maps. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)\n",
"\n",
"##### Orphan protein and antibody predictions\n",
"\n",
"OmegaFold was also assessed on challenging antibody and orphan protein structures from the PDB, for which other methods perform poorly. Antibody complementarity-determining regions (CDRs) are the most diverse and variable parts of the molecules. Because antibodies evolve quickly, MSAs on CDRs are extremely noisy, especially in the CDR3 loops on the heavy chain, despite these loops being highly enriched in amino acid composition. As a result, methods like AlphaFold2 are unreliable there and yield very low predicted LDDT (pLDDT) scores (**Fig. 3A**). Unlike antibodies, orphan proteins, by definition, lack sequence and structure homology information, and thus are also difficult to predict with MSA-based methods (**Fig. 3B**). On both antibody loops and orphan proteins, OmegaFold achieves much higher prediction accuracy than AlphaFold2, likely due to the advantages of its single-sequence prediction method.\n",
"\n",
"<img src=\"../images/figure3.png\" width=\"1800\">\n",
"<img src=\"./images/figure3.png\" width=\"1800\">\n",
"\n",
"**Fig. 3**: OmegaPLM performance analysis. (A) OmegaPLM predicting antibodies. (B) Orphan protein performance. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)\n",
"\n",
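The benchmark paragraph in this hunk reports an average TM-score without showing how the metric is computed. A minimal sketch follows, assuming the standard TM-score definition (a length-normalized sum of 1 / (1 + (d_i / d0)^2) over aligned residue pairs, with d0 = 1.24 * (L - 15)^(1/3) - 1.8). The function names are hypothetical and not part of the notebook, the 0.5 Å floor on d0 for very short chains is an assumption, and the sketch scores one fixed superposition only, whereas the full metric takes the maximum over superpositions.

```python
def tm_score_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) used by TM-score."""
    if target_length <= 15:
        # Assumption: use a small floor where the cube-root formula breaks down.
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed_superposition(distances, target_length: int) -> float:
    """Score one given superposition.

    `distances` holds the distance (in angstroms) between each aligned
    residue pair after superposition; unaligned residues contribute nothing
    because the sum is always divided by the full target length.
    """
    d0 = tm_score_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

A perfect match (all distances zero, every residue aligned) scores 1.0, and a pair that deviates by exactly d0 contributes half of what a perfectly placed pair does.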
@@ -231,11 +231,49 @@
},
{
"cell_type": "markdown",
"source": [],
"source": [
"In this talktorial we will use the OmegaFold model. To do so, we will:\n",
"- Set up OmegaFold locally by following the instructions in the [OmegaFold repository](https://github.com/HeliXonProtein/OmegaFold).\n",
"- Use the `FASTA` file provided in the `data` folder as input and predict the 3D structure of the sequence it contains.\n",
"- Collect the results: OmegaFold writes one `PDB` file per sequence in the input file and saves it in the output directory we provide, in our case the `output` folder.\n",
"\n",
"**Note**: The prediction confidence values are stored in the B-factor column of the `PDB` files."
],
"metadata": {
"collapsed": false
}
},
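The note in the markdown cell above says the prediction confidence is stored in the B-factor column of the output `PDB` file. A minimal sketch of reading it back per residue follows, assuming the standard fixed-column PDB layout for `ATOM` records (atom name in columns 13-16, residue number in 23-26, B-factor in 61-66). The function names are hypothetical and not part of the notebook, and the sketch makes no assumption about the scale of the values.

```python
from typing import List, Tuple


def read_ca_confidences(pdb_text: str) -> List[Tuple[int, float]]:
    """Return (residue number, B-factor field) for every C-alpha ATOM record."""
    confidences = []
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format, so slice by position instead of
        # splitting on whitespace (adjacent fields may touch each other).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            confidences.append((int(line[22:26]), float(line[60:66])))
    return confidences


def mean_confidence(confidences: List[Tuple[int, float]]) -> float:
    """Average the per-residue values; raises ValueError on empty input."""
    if not confidences:
        raise ValueError("no C-alpha ATOM records found")
    return sum(value for _, value in confidences) / len(confidences)
```

To use it, read a predicted file from the `output` folder into a string (for example with `pathlib.Path(...).read_text()`, where the file name depends on the FASTA header) and pass the text to `read_ca_confidences`.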
{
"cell_type": "code",
"execution_count": 1,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:root:Downloading weights from https://helixon.s3.amazonaws.com/release1.pt to /Users/introvertuoso/.cache/omegafold_ckpt/model.pt\r\n",
"100%|██████████████████████████████████████| 2.96G/2.96G [04:53<00:00, 10.8MB/s]\r\n",
"INFO:root:Constructing OmegaFold\r\n",
"INFO:root:Reading ./data/A0A5E8GAP1_ECOLX.fasta\r\n",
"INFO:root:Predicting 1th chain in ./data/A0A5E8GAP1_ECOLX.fasta\r\n",
"INFO:root:59 residues in this chain.\r\n",
"INFO:root:Failed to generate ./output/A0A5E8GAP1: EV=1 SV=1.pdb due to The operator 'aten::index.Tensor' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.\r\n",
"INFO:root:Skipping...\r\n",
"INFO:root:Done!\r\n"
]
}
],
"source": [
"!omegafold ./data/A0A5E8GAP1_ECOLX.fasta ./output"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2023-06-05T22:05:46.742646Z",
"start_time": "2023-06-05T22:00:43.506451Z"
}
}
},
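The run recorded in the cell above did not produce a `PDB` file: the log reports that `aten::index.Tensor` is not implemented for the MPS device and suggests setting `PYTORCH_ENABLE_MPS_FALLBACK=1`. A minimal sketch of rerunning the same command with that variable set follows. The helper names are hypothetical, the command line is the one from the cell, and the source does not confirm that the fallback is sufficient for OmegaFold to finish.

```python
import os
import subprocess


def omegafold_command(fasta_path: str, output_dir: str) -> list:
    """Build the same command line the notebook cell runs."""
    return ["omegafold", fasta_path, output_dir]


def env_with_mps_fallback(base_env=None) -> dict:
    """Copy the environment and enable PyTorch's CPU fallback for MPS ops."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable name taken verbatim from the error message in the log.
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return env


def run_omegafold(fasta_path: str, output_dir: str) -> None:
    """Run OmegaFold with the fallback enabled; raises if the command fails."""
    subprocess.run(
        omegafold_command(fasta_path, output_dir),
        env=env_with_mps_fallback(),
        check=True,
    )
```

Calling `run_omegafold("./data/A0A5E8GAP1_ECOLX.fasta", "./output")` repeats the cell's prediction. Inside the notebook, running `%env PYTORCH_ENABLE_MPS_FALLBACK=1` in a cell before the `!omegafold` cell should have the same effect; as the log warns, operators that fall back to the CPU run slower than native MPS.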
{
"cell_type": "markdown",
"source": [