scope: add more scope details to tutorial
aditya0by0 committed Feb 21, 2025
1 parent eba0417 commit dad6f76
Showing 1 changed file with 103 additions and 13 deletions.
116 changes: 103 additions & 13 deletions tutorials/data_exploration_scope.ipynb
@@ -17,6 +17,96 @@
"---\n"
]
},
{
"cell_type": "markdown",
"id": "f6c25706-251c-438c-9915-e8002647eb94",
"metadata": {},
"source": [
"### Understanding [SCOPe](https://scop.berkeley.edu/) and [PDB](https://www.rcsb.org/) \n",
"\n",
"\n",
"1. **Protein domains form chains.** \n",
"2. **Chains form complexes** (protein complexes or structures). \n",
"3. These **complexes are the entries in PDB**, represented by unique identifiers like `\"1A3N\"`. \n",
"\n",
"---\n",
"\n",
"#### **Protein Domain** \n",
"A **protein domain** is a **structural and functional unit** of a protein. \n",
"\n",
"\n",
"##### Key Characteristics:\n",
"- **Domains are part of a protein chain.** \n",
"- A domain can span: \n",
" 1. **The entire chain** (single-domain protein): \n",
" - In this case, the protein domain is equivalent to the chain itself. \n",
" - Example: \n",
" - All chains of the **PDB structure \"1A3N\"** are single-domain proteins. \n",
" - Each chain has a SCOPe domain identifier. \n",
" - For example, Chain **A**: \n",
" - Domain identifier: `d1a3na_` \n",
" - Breakdown of the identifier: \n",
" - `d`: Denotes domain. \n",
" - `1a3n`: Refers to the PDB protein structure identifier. \n",
" - `a`: Specifies the chain within the structure. (`_` for None and `.` for multiple chains)\n",
" - `_`: Indicates the domain spans the entire chain (single-domain protein). \n",
" - Example: [PDB Structure 1A3N - Chain A](https://www.rcsb.org/sequence/1A3N#A)\n",
" 2. **A specific portion of the chain** (multi-domain protein): \n",
" - Here, a single chain contains multiple domains. \n",
" - Example: Chain **A** of the **PDB structure \"1PKN\"** contains three domains: `d1pkna1`, `d1pkna2`, `d1pkna3`. \n",
" - Example: [PDB Structure 1PKN - Chain A](https://www.rcsb.org/annotations/1PKN). \n",
"\n",
"---\n",
"\n",
"#### **Protein Chain** \n",
"A **protein chain** refers to the entire **polypeptide chain** observed in a protein's 3D structure (as described in PDB files). \n",
"\n",
"##### Key Points:\n",
"- A chain can consist of **one or multiple domains**:\n",
" - **Single-domain chain**: The chain and domain are identical. \n",
" - Example: Myoglobin. \n",
" - **Multi-domain chain**: Contains several domains, each with distinct structural and functional roles. \n",
"- Chains assemble to form **protein complexes** or **structures**. \n",
"\n",
"\n",
"---\n",
"\n",
"#### **Key Observations About SCOPe** \n",
"- The **fundamental classification unit** in SCOPe is the **protein domain**, not the entire protein. \n",
"- _**The taxonomy in SCOPe is not for the entire protein (i.e., the full-length amino acid sequence as encoded by a gene) but for protein domains, which are smaller, structurally and functionally distinct regions of the protein.**_\n",
"\n",
"\n",
"--- \n",
"\n",
"**SCOPe 2.08 Data Analysis:**\n",
"\n",
"The current SCOPe version (2.08) includes the following statistics based on analysis for relevant data:\n",
"\n",
"- **Classes**: 12\n",
"- **Folds**: 1485\n",
"- **Superfamilies**: 2368\n",
"- **Families**: 5431\n",
"- **Proteins**: 13,514\n",
"- **Species**: 30,294\n",
"- **Domains**: 344,851\n",
"\n",
"For more detailed statistics, please refer to the official SCOPe website:\n",
"\n",
"- [SCOPe 2.08 Statistics](https://scop.berkeley.edu/statistics/ver=2.08)\n",
"- [SCOPe 2.08 Release](https://scop.berkeley.edu/ver=2.08)\n",
"\n",
"---\n",
"\n",
"## SCOPe Labeling \n",
"\n",
"- Use SCOPe labels for protein domains.\n",
"- Map them back to their **protein-chain** sequences (protein sequence label = sum of all domain labels).\n",
"- Train on protein sequences.\n",
"- This pretraining task would be comparable to GO-based training.\n",
"\n",
"--- "
]
},
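As a rough illustration of the identifier breakdown and the labeling idea described in the markdown cell above, the following Python sketch parses SCOPe domain identifiers such as `d1a3na_` and merges per-domain labels into per-chain label sets. It is not code from this repository: `parse_sid`, `chain_label_sets`, the regular expression, and the placeholder labels (`label_a` and so on) are all assumptions made for illustration, and the pattern only covers the seven-character form described in the tutorial.

```python
import re
from collections import defaultdict
from typing import Dict, NamedTuple, Set, Tuple

# Assumed pattern for the 7-character identifier described in the tutorial:
# "d" + 4-character PDB ID + chain character + domain character.
SID_PATTERN = re.compile(
    r"^d(?P<pdb>[0-9a-z]{4})(?P<chain>[0-9A-Za-z_.])(?P<domain>[0-9A-Za-z_])$"
)


class DomainId(NamedTuple):
    pdb_id: str  # PDB structure identifier, upper-cased (e.g. "1A3N")
    chain: str   # chain character; "_" = none, "." = multiple chains
    domain: str  # "_" = domain spans the whole chain, otherwise a domain tag


def parse_sid(sid: str) -> DomainId:
    """Split a domain identifier into PDB ID, chain, and domain tag."""
    match = SID_PATTERN.match(sid)
    if match is None:
        raise ValueError(f"not a recognised domain identifier: {sid!r}")
    return DomainId(match["pdb"].upper(), match["chain"], match["domain"])


def chain_label_sets(
    domain_labels: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Union the labels of all domains that belong to the same chain."""
    merged: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for sid, labels in domain_labels.items():
        parsed = parse_sid(sid)
        merged[(parsed.pdb_id, parsed.chain)].update(labels)
    return dict(merged)


# Placeholder labels only; these are NOT real SCOPe classifications.
example_domain_labels = {
    "d1a3na_": {"label_a"},
    "d1pkna1": {"label_b"},
    "d1pkna2": {"label_c"},
    "d1pkna3": {"label_b", "label_d"},
}

chains = chain_label_sets(example_domain_labels)
for key in sorted(chains):
    print(key, sorted(chains[key]))
```

Identifiers whose chain character is `.` span several chains; the sketch keeps them under that key rather than guessing which chains are involved, so resolving them would need the region information from the SCOPe classification files.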
{
"cell_type": "code",
"execution_count": 1,
@@ -171,25 +261,25 @@
"text": [
"Checking for processed data in data\\SCOPe\\version_2.08\\SCOPe2000\\processed\n",
"Missing processed data file (`data.pkl` file)\n",
"Missing PDB raw data, Downloading PDB sequence data....\n",
"Downloading to temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
"Downloaded to C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
"Unzipping the file....\n",
"Missing PDB raw data, Downloading PDB sequence data....\n",
"Downloading to temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
"Downloaded to C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
"Unzipping the file....\n",
"Unpacked and saved to data\\SCOPe\\pdb_sequences.txt\n",
"Removed temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmpsif7r129\n",
"Missing Scope: cla.txt raw data, Downloading...\n"
]
},
{
"name": "stderr",
"Missing Scope: cla.txt raw data, Downloading...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"G:\\anaconda3\\envs\\env_chebai\\lib\\site-packages\\urllib3\\connectionpool.py:1099: InsecureRequestWarning: Unverified HTTPS request is being made to host 'scop.berkeley.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#tls-warnings\n",
"warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing Scope: hie.txt raw data, Downloading...\n",