
Add SCOPe dataset to our pipeline #67

Open
sfluegel05 opened this issue Dec 12, 2024 · 11 comments · May be fixed by #64

@sfluegel05
Collaborator

Our goal is to reproduce the ontology pretraining on a protein-related task. For this, we have already implemented a GO dataset (see #36). The next step is to add a corresponding ontology pretraining task for proteins. This would give us the following alignment:

| stage | chemistry | proteins |
|---|---|---|
| unsupervised pretraining | mask pretraining (ELECTRA) | mask pretraining (ESM2, optional) |
| ontology pretraining | ChEBI | SCOPe |
| finetuning task | Toxicity, Solubility, ... | GO (MF, BP, CC branches) |

SCOPe is a good fit since it is mostly structure-based (unlike GO, which has more complex functional classes). It also has a manageable size (~140,000 entries, similar to ChEBI).

Goal

Add a SCOPe dataset to our pipeline. The data should be processed so that it can be used in the same way as, e.g., the GO data (just with different labels).

Links

  • SCOPe: https://scop.berkeley.edu/

@aditya0by0 aditya0by0 self-assigned this Dec 12, 2024
@aditya0by0 aditya0by0 linked a pull request Jan 5, 2025 that will close this issue
@sfluegel05
Collaborator Author

I have talked to some colleagues and they agreed that the following plan makes sense:

  • Take the SCOPe labels for protein domains
  • Trace them back to their protein sequences (the label of a protein sequence becomes the sum of all domain labels)
  • Train on protein sequences

If this is technically feasible, this gives us a pretraining task that is comparable to the GO task.
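
Sketched in code, the aggregation step could look something like this (a minimal sketch with illustrative data structures and labels, not our actual implementation; parsing SCOPe itself is out of scope here):

from collections import defaultdict

def aggregate_chain_labels(
    domain_labels: dict[str, set[str]],   # SCOPe labels per domain
    domain_to_chain: dict[str, str],      # which chain each domain belongs to
) -> dict[str, set[str]]:
    """The label set of a chain is the union of its domains' labels."""
    chain_labels: dict[str, set[str]] = defaultdict(set)
    for domain, labels in domain_labels.items():
        chain_labels[domain_to_chain[domain]] |= labels
    return dict(chain_labels)

# Illustrative labels (not the real SCOPe assignments): two domains of the
# same chain contribute all their labels to one training example.
merged = aggregate_chain_labels(
    {"d1pkna1": {"class_c", "fold_c.1"}, "d1pkna2": {"class_b", "fold_b.58"}},
    {"d1pkna1": "1pkn_A", "d1pkna2": "1pkn_A"},
)
assert merged["1pkn_A"] == {"class_c", "fold_c.1", "class_b", "fold_b.58"}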

@aditya0by0
Collaborator

aditya0by0 commented Jan 15, 2025

Understanding SCOPe and PDB: Detailed Notes

After further exploration of SCOPe and PDB, here’s my current understanding:

  1. Protein domains form chains.
  2. Chains form complexes (protein complexes or structures).
  3. These complexes are the entries in PDB, represented by unique identifiers like "1A3N".

Protein Domain

A protein domain is a structural and functional unit of a protein.

Key Characteristics:
  • Domains are part of a protein chain.
  • A domain can span:
    1. The entire chain (single-domain protein):
      • In this case, the protein domain is equivalent to the chain itself.
      • Example:
        • All chains of the PDB structure "1A3N" are single-domain proteins.
        • Each chain has a SCOPe domain identifier.
        • For example, Chain A:
          • Domain identifier: d1a3na_
          • Breakdown of the identifier (a parser sketch follows after this list):
            • d: Denotes domain.
            • 1a3n: Refers to the PDB protein structure identifier.
            • a: Specifies the chain within the structure.
            • _: Indicates the domain spans the entire chain (single-domain protein).
          • Example: PDB Structure 1A3N - Chain A
    2. A specific portion of the chain (multi-domain protein):
      • Here, a single chain contains multiple domains.
      • Example: Chain A of the PDB structure "1PKN" contains three domains: d1pkna1, d1pkna2, d1pkna3.
      • Example: PDB Structure 1PKN - Chain A.
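
Based on this breakdown, a sid parser could look like the following sketch (covering only the common 7-character form described above):

from typing import NamedTuple

class ScopeSid(NamedTuple):
    pdb_id: str   # 4-character PDB entry identifier, e.g. "1a3n"
    chain: str    # chain letter within the structure, e.g. "a"
    region: str   # "_" = domain spans the whole chain, digit = n-th domain

def parse_sid(sid: str) -> ScopeSid:
    """Parse a 7-character SCOPe domain identifier such as 'd1a3na_'."""
    if len(sid) != 7 or not sid.startswith("d"):
        raise ValueError(f"unexpected sid format: {sid!r}")
    return ScopeSid(pdb_id=sid[1:5], chain=sid[5], region=sid[6])

assert parse_sid("d1a3na_") == ScopeSid("1a3n", "a", "_")  # single-domain chain
assert parse_sid("d1pkna2") == ScopeSid("1pkn", "a", "2")  # 2nd domain of chain A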

Protein Chain

A protein chain refers to the entire polypeptide chain observed in a protein's 3D structure (as described in PDB files).

Key Points:
  • A chain can consist of one or multiple domains:
    • Single-domain chain: The chain and domain are identical.
      • Example: Myoglobin.
    • Multi-domain chain: Contains several domains, each with distinct structural and functional roles.
  • Chains assemble to form protein complexes or structures.

Key Observations About SCOPe

  • The fundamental classification unit in SCOPe is the protein domain, not the entire protein.
  • The taxonomy in SCOPe is not for the entire protein (i.e., the full-length amino acid sequence as encoded by a gene) but for protein domains, which are smaller, structurally and functionally distinct regions of the protein.
  • This implies that our focus should shift from "classifying proteins into SCOPe taxonomy" to "classifying protein domains into SCOPe taxonomy."

Please let me know if this interpretation is accurate or if I’ve misunderstood any aspect.

@aditya0by0
Collaborator

> I have talked to some colleagues and they agreed that the following plan makes sense:
>
>   • Take the SCOPe labels for protein domains
>   • Trace them back to their protein sequences (the label of a protein sequence becomes the sum of all domain labels)
>   • Train on protein sequences
>
> If this is technically feasible, this gives us a pretraining task that is comparable to the GO task.

Thank you for sharing the plan! I have a clarification question. Since each protein sequence can contain multiple domains, if the label of a protein sequence is the sum of all domain labels, wouldn't that imply that the sequence could simultaneously carry multiple labels at the same hierarchy level, e.g. several classes (alpha, beta, alpha+beta) or several folds?

This could result in most labels being marked as True. Additionally, this seems to conflict with the SCOPe taxonomy, where each protein domain (classification unit) should only belong to one family, superfamily, fold, and class.

@sfluegel05
Collaborator Author

Thanks for the overview. That's very helpful!

I like the annotations page in PDB (e.g. for 1A3N), because it shows SCOPe and GO labels on the same page. It looks like GO annotations are attached not to the protein complex but to the individual protein chains (for example, for 1A3N, chains A and B have different GO labels). So we don't need the complex, but the chain as our training data.

Regarding your question: Yes, having different labels for the same protein chain would be a consequence of putting several domains together. However, I don't think that will be a problem. While SCOPe might not handle it that way, our models can deal with multiple inheritance. Drawing the analogy to chemistry: zorbamycin is both a peptide and a carbohydrate because one part of the molecule has a peptide structure, and another part has a carbohydrate structure. Of course, peptides work a bit differently, but I think we can expect our model to learn several domains for the same chain if it contains multiple domains.

@aditya0by0
Collaborator

aditya0by0 commented Jan 18, 2025

SCOPe 2.08 Data Analysis Update:

The current SCOPe version (2.08) includes the following statistics for the relevant data:

  • Classes: 12
  • Folds: 1,485
  • Superfamilies: 2,368
  • Families: 5,431
  • Proteins: 13,514
  • Species: 30,294
  • Domains: 344,851

For more detailed statistics, please refer to the official SCOPe website: https://scop.berkeley.edu/


Note on Data Processing:

One-hot encoding all of these hierarchical levels will significantly increase the label dimensionality, which could lead to challenges in terms of memory and computational efficiency.

ChEBI and GOUniProt Dataset Overview:

  • ChEBIOver50 (chebi_version=231): ~1511 label columns
  • GOUniProtOver250 (go_branch="BP"): ~898 label columns

Given the large number of labels, do you think we should focus on Folds or Superfamilies for classification instead of trying to classify each level, to tackle this challenge?
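
For reference, a rough way to check how many labels each level would add is to count unique sccs prefixes (class.fold.superfamily.family) in SCOPe's parseable classification file. This is a sketch; the file name and column layout are assumed from the SCOPe "parseable files" release and should be verified:

def count_labels_per_level(cla_path: str) -> dict[str, int]:
    """Count unique SCOPe labels at each hierarchy level from a cla file."""
    levels = {"class": set(), "fold": set(), "superfamily": set(), "family": set()}
    with open(cla_path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            sccs = line.split()[3]          # assumed 4th column, e.g. "a.1.1.2"
            parts = sccs.split(".")
            levels["class"].add(parts[0])
            levels["fold"].add(".".join(parts[:2]))
            levels["superfamily"].add(".".join(parts[:3]))
            levels["family"].add(".".join(parts[:4]))
    return {level: len(s) for level, s in levels.items()}

# e.g. count_labels_per_level("dir.cla.scope.2.08-stable.txt")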

@sfluegel05
Collaborator Author

A todo for later we should keep in mind:

  • generate ESM2 embeddings for arbitrary proteins (e.g. for the SCOPe dataset) - this should be doable by taking the pretrained model from ESM, but requires some storage and compute capacity

@aditya0by0
Collaborator

aditya0by0 commented Jan 22, 2025

> A todo for later we should keep in mind:
>
>   • generate ESM2 embeddings for arbitrary proteins (e.g. for the SCOPe dataset) - this should be doable by taking the pretrained model from ESM, but requires some storage and compute capacity

I’ve already implemented this functionality in the ESM2EmbeddingReader class. You can check it out in the linked PR (#64).

This class allows initialization with the name of a pretrained model, downloads the respective model, and generates ESM2 embeddings from a specified layer of the model based on the provided parameters. For more details, please check the constructor.

For testing or trying out this reader, we can use lighter models such as esm2_t6_8M_UR50D (6 layers, valid layers: 1–6), which is a tiny 8M parameter model (~28 MB).

class ESM2EmbeddingReader(DataReader):
    """
    A data reader to process protein sequences using the ESM2 model for embeddings.

    References:
        https://github.com/bio-ontology-research-group/deepgo2/blob/main/deepgo/extract_esm.py

    Note:
        For layer availability by model, please check:
            https://github.com/facebookresearch/esm?tab=readme-ov-file#pre-trained-models-

        To test this reader, try lighter models:
            esm2_t6_8M_UR50D: 6 layers (valid layers: 1–6), ~28 MB - a tiny 8M-parameter model.
            esm2_t12_35M_UR50D: 12 layers (valid layers: 1–12), ~128 MB - a slightly larger 35M-parameter model.
        These smaller models are good for testing and debugging purposes.
    """

@aditya0by0
Collaborator

Some chain sequences contain amino-acid codes outside the standard 20, such as "U" (selenocysteine). In the context of SCOPe, we need to decide how to handle these cases. Here are two potential approaches:

  1. Ignore the sequence: Discard sequences containing invalid amino acids (as implemented in DeepGO 1).
  2. Replace invalid tokens with "X": Substitute invalid amino acids with "X" (as implemented in DeepGO 2). For more details on this approach, refer to the discussion in ChEB-AI Pull Request #64 and in Protein function prediction with GO - Part 3 #64 (comment).

@sfluegel05, Please advise on which approach to follow.

@sfluegel05
Collaborator Author

I think the DeepGO2 dataset is our finetuning target here. This means that all pretraining tasks (such as SCOPe ontology pretraining) should use the same encoding. So I would recommend replacing invalid tokens with "X".
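
A minimal sketch of that preprocessing (assuming the alphabet is the 20 standard amino-acid letters; anything else maps to "X"):

# Map any character outside the 20 standard amino-acid codes to "X",
# following the DeepGO2-style approach agreed on above.
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def sanitize_sequence(seq: str) -> str:
    return "".join(aa if aa in STANDARD_AAS else "X" for aa in seq.upper())

assert sanitize_sequence("MKTUVW") == "MKTXVW"  # "U" (selenocysteine) -> "X"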

@aditya0by0 aditya0by0 linked a pull request Feb 7, 2025 that will close this issue
@sfluegel05
Collaborator Author

@aditya0by0 Some "convenience" todos for the SCOPe dataset:

  • Create a notebook for SCOPe with the information you have gathered and used for implementing the dataset (some of that information is already in the comments above)
  • Create a mapping between label names in the dataset and their names in SCOPe (e.g. fold_48725, A.1, haemoglobin-like protein). This will come in handy, for instance, during evaluation when we want to find out (either manually or automatically) which classes performed well and which performed poorly. (A sketch follows below.)
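
A possible starting point for that mapping, reading SCOPe's description file (the file name and tab-separated layout of sunid, level, sccs, sid, description are assumptions from the SCOPe parseable-files release; the example output is illustrative):

def build_label_names(des_path: str) -> dict[str, tuple[str, str]]:
    """Map dataset label names (sunid-based) to their SCOPe sccs and description."""
    level_names = {"cl": "class", "cf": "fold", "sf": "superfamily", "fa": "family"}
    mapping = {}
    with open(des_path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            sunid, level, sccs, _sid, desc = line.split("\t", 4)
            if level in level_names:
                mapping[f"{level_names[level]}_{sunid}"] = (sccs, desc.strip())
    return mapping

# e.g. build_label_names("dir.des.scope.2.08-stable.txt")["fold_46457"]
# might return something like ("a.1", "Globin-like")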

@sfluegel05
Collaborator Author

I have some preliminary results for SCOPe2000. We reach a macro-F1 of 0.7 and a micro-F1 of 0.9, which is quite good (although we have no baseline to compare to, and this is only for ~30 classes).

