Add practical section

volkamerlab · Introvertuoso · Jun 5, 2023
commit a42f2c840c0c45a5896df9883abd2f61bc661ac4
diff --git a/notebooks/T03_proteinfolding/T03_proteinfolding.ipynb b/notebooks/T03_proteinfolding/T03_proteinfolding.ipynb
@@ -163,7 +163,7 @@
     "\n",
     "OmegaFold is built atop a deep transformer-based protein language model (PLM) called OmegaPLM. It is trained on a large collection of unaligned and unlabeled protein sequences to learn single- and pairwise-residue embeddings as powerful features that model the distribution of sequences; through these embeddings, OmegaPLM captures structural and functional information encoded in the amino-acid sequences. The embeddings are then fed into Geoformer, a new geometry-inspired transformer neural network, to further distill the structural and physical pairwise relationships between amino acids. Lastly, a structural module predicts the 3D coordinates of all heavy atoms.\n",
     "\n",
-    "<img src=\"../images/figure1.png\"  width=\"1800\">\n",
+    "<img src=\"./images/figure1.png\"  width=\"1800\">\n",
     "\n",
     "**Fig. 1**: Overview of OmegaFold and results. (A) Model architecture. (B) Evaluations. (C) Runtime analysis. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)\n",
     "\n",
@@ -179,15 +179,15 @@
     "\n",
     "OmegaFold's ability to predict protein structures was assessed on two benchmarks: a CASP set with 29 of the most challenging proteins from the free-modelling category of two recent CASP experiments, and a CAMEO dataset with 146 of the most recent single-chain proteins (appearing in the first 6 months of the 2022 CAMEO evaluation), spanning a wide range of prediction difficulty levels. For comparison, predictions were also computed with other state-of-the-art (SOTA) methods run in their default mode with MSAs as input. Remarkably, the structures predicted by OmegaFold from a single sequence were as accurate as those of the advanced MSA-based methods (**Fig. 1B**). On the CAMEO dataset, OmegaFold structures had a mean local-distance difference test (LDDT) score of 0.82, comparable to the accuracy of other SOTA methods predicting from MSAs. (LDDT is a commonly used metric for structure evaluation.) On the more challenging CASP dataset, OmegaFold structures were also quite accurate, with an average TM-score (a common metric for assessing the topological similarity of protein structures) of 0.79, slightly lower than that of other SOTA methods. The relative performance of OmegaFold was also tested against the single-sequence versions of AlphaFold2 and RoseTTAFold on these two datasets. When only a single sequence was given as input, their predicted structures were considerably less accurate than those of OmegaFold (**Fig. 1B**), indicating that the performance of MSA-based methods drops when evolutionary information is not available.\n",
     "\n",
-    "<img src=\"../images/figure2.png\"  width=\"1800\">\n",
+    "<img src=\"./images/figure2.png\"  width=\"1800\">\n",
     "\n",
     "**Fig. 2**: OmegaPLM and geometric smoothing. (A) OmegaPLM pretraining routine. (B) Geoformer at work. (C) Geoformer evaluation. (D) Visualization of contact maps. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)\n",
     "\n",
     "##### Orphan protein and antibody predictions\n",
     "\n",
     "OmegaFold was also assessed on challenging antibody and orphan protein structures from the PDB, for which other methods perform poorly. Antibody complementarity-determining regions (CDRs) are the most diverse and variable parts of the molecules. Because antibodies evolve quickly, MSAs on CDRs are extremely noisy, especially in the CDR3 loops of the heavy chain, even though these loops are highly enriched in amino acid composition. As a result, methods like AlphaFold2 are unreliable there and yield very low predicted LDDT (pLDDT) scores (**Fig. 3A**). Unlike antibodies, orphan proteins, by definition, lack sequence and structure homology information, and are thus also difficult to predict with MSA-based methods (**Fig. 3B**). On both antibody loops and orphan proteins, OmegaFold achieves much higher prediction accuracy than AlphaFold2, likely due to the advantages of its single-sequence prediction method.\n",
     "\n",
-    "<img src=\"../images/figure3.png\"  width=\"1800\">\n",
+    "<img src=\"./images/figure3.png\"  width=\"1800\">\n",
     "\n",
     "**Fig. 3**: OmegaPLM performance analysis. (A) OmegaPLM predicting antibodies. (B) Orphan protein performance. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)\n",
     "\n",
@@ -231,11 +231,49 @@
   },
   {
    "cell_type": "markdown",
-   "source": [],
+   "source": [
+    "In this talktorial we will use the OmegaFold model. The workflow is:\n",
+    "- Set up OmegaFold locally by following the instructions [here](https://github.com/HeliXonProtein/OmegaFold).\n",
+    "- Pick the input: a `FASTA` file is available in the `data` folder. We will predict the 3D structure of the sequence given in that file.\n",
+    "- Run the prediction. This produces one `PDB` file for each sequence in the input file, saved in the output directory we provide (in our case, the `output` folder).\n",
+    "\n",
+    "**Note**: The prediction confidence values are stored in the B-factor column of the `PDB` files."
+   ],
    "metadata": {
     "collapsed": false
    }
   },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "INFO:root:Downloading weights from https://helixon.s3.amazonaws.com/release1.pt to /Users/introvertuoso/.cache/omegafold_ckpt/model.pt\r\n",
+      "100%|██████████████████████████████████████| 2.96G/2.96G [04:53<00:00, 10.8MB/s]\r\n",
+      "INFO:root:Constructing OmegaFold\r\n",
+      "INFO:root:Reading ./data/A0A5E8GAP1_ECOLX.fasta\r\n",
+      "INFO:root:Predicting 1th chain in ./data/A0A5E8GAP1_ECOLX.fasta\r\n",
+      "INFO:root:59 residues in this chain.\r\n",
+      "INFO:root:Failed to generate ./output/A0A5E8GAP1: EV=1 SV=1.pdb due to The operator 'aten::index.Tensor' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.\r\n",
+      "INFO:root:Skipping...\r\n",
+      "INFO:root:Done!\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!omegafold ./data/A0A5E8GAP1_ECOLX.fasta ./output"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "end_time": "2023-06-05T22:05:46.742646Z",
+     "start_time": "2023-06-05T22:00:43.506451Z"
+    }
+   }
+  },
   {
    "cell_type": "markdown",
    "source": [