
Add SCOPe dataset to our pipeline #67

Open
sfluegel05 opened this issue Dec 12, 2024 · 11 comments · May be fixed by #64

@sfluegel05
Collaborator

Our goal is to reproduce the ontology pretraining on a protein-related task. For this, we have already implemented a GO dataset (see #36). The next step is to add a corresponding ontology pretraining task for proteins. This would give us the following alignment:

| stage | chemistry | proteins |
|---|---|---|
| unsupervised pretraining | mask pretraining (ELECTRA) | mask pretraining (ESM2, optional) |
| ontology pretraining | ChEBI | SCOPe |
| finetuning task | Toxicity, Solubility, ... | GO (MF, BP, CC branches) |

SCOPe is a good fit since it is mostly structure-based (unlike GO, which has more complex functional classes). It also has a manageable size (~140,000 entries, similar to ChEBI).

Goal

Add a SCOPe dataset to our pipeline. The data should be processed so that it can be used in the same way as, e.g., the GO data (just with different labels).

Links

  • SCOPe: https://scop.berkeley.edu/

@aditya0by0 aditya0by0 self-assigned this Dec 12, 2024
@aditya0by0 aditya0by0 linked a pull request Jan 5, 2025 that will close this issue
@sfluegel05
Collaborator Author

I have talked to some colleagues and they agreed that the following plan makes sense:

  • Take the SCOPe labels for protein domains
  • Trace them back to their protein sequences (the label of a protein sequence becomes the sum of all domain labels)
  • Train on protein sequences

If this is technically feasible, this gives us a pretraining task that is comparable to the GO task.
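
Sketched in code, the aggregation step could look something like this (a minimal sketch with illustrative data structures and labels, not our actual implementation; parsing SCOPe itself is out of scope here):

from collections import defaultdict

def aggregate_chain_labels(
    domain_labels: dict[str, set[str]],   # SCOPe labels per domain
    domain_to_chain: dict[str, str],      # which chain each domain belongs to
) -> dict[str, set[str]]:
    """The label set of a chain is the union of its domains' labels."""
    chain_labels: dict[str, set[str]] = defaultdict(set)
    for domain, labels in domain_labels.items():
        chain_labels[domain_to_chain[domain]] |= labels
    return dict(chain_labels)

# Illustrative labels (not the real SCOPe assignments): two domains of the
# same chain contribute all their labels to one training example.
merged = aggregate_chain_labels(
    {"d1pkna1": {"class_c", "fold_c.1"}, "d1pkna2": {"class_b", "fold_b.58"}},
    {"d1pkna1": "1pkn_A", "d1pkna2": "1pkn_A"},
)
assert merged["1pkn_A"] == {"class_c", "fold_c.1", "class_b", "fold_b.58"}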

@aditya0by0
Collaborator

aditya0by0 commented Jan 15, 2025

Understanding SCOPe and PDB: Detailed Notes

After further exploration of SCOPe and PDB, here’s my current understanding:

  1. Protein domains form chains.
  2. Chains form complexes (protein complexes or structures).
  3. These complexes are the entries in PDB, represented by unique identifiers like "1A3N".

Protein Domain

A protein domain is a structural and functional unit of a protein.

Key Characteristics:
  • Domains are part of a protein chain.
  • A domain can span:
    1. The entire chain (single-domain protein):
      • In this case, the protein domain is equivalent to the chain itself.
      • Example:
        • All chains of the PDB structure "1A3N" are single-domain proteins.
        • Each chain has a SCOPe domain identifier.
        • For example, Chain A:
          • Domain identifier: d1a3na_
          • Breakdown of the identifier (a parser sketch follows after this list):
            • d: Denotes domain.
            • 1a3n: Refers to the PDB protein structure identifier.
            • a: Specifies the chain within the structure.
            • _: Indicates the domain spans the entire chain (single-domain protein).
          • Example: PDB Structure 1A3N - Chain A
    2. A specific portion of the chain (multi-domain protein):
      • Here, a single chain contains multiple domains.
      • Example: Chain A of the PDB structure "1PKN" contains three domains: d1pkna1, d1pkna2, d1pkna3.
      • Example: PDB Structure 1PKN - Chain A.
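
Based on this breakdown, a sid parser could look like the following sketch (covering only the common 7-character form described above):

from typing import NamedTuple

class ScopeSid(NamedTuple):
    pdb_id: str   # 4-character PDB entry identifier, e.g. "1a3n"
    chain: str    # chain letter within the structure, e.g. "a"
    region: str   # "_" = domain spans the whole chain, digit = n-th domain

def parse_sid(sid: str) -> ScopeSid:
    """Parse a 7-character SCOPe domain identifier such as 'd1a3na_'."""
    if len(sid) != 7 or not sid.startswith("d"):
        raise ValueError(f"unexpected sid format: {sid!r}")
    return ScopeSid(pdb_id=sid[1:5], chain=sid[5], region=sid[6])

assert parse_sid("d1a3na_") == ScopeSid("1a3n", "a", "_")  # single-domain chain
assert parse_sid("d1pkna2") == ScopeSid("1pkn", "a", "2")  # 2nd domain of chain A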

Protein Chain

A protein chain refers to the entire polypeptide chain observed in a protein's 3D structure (as described in PDB files).

Key Points:
  • A chain can consist of one or multiple domains:
    • Single-domain chain: The chain and domain are identical.
      • Example: Myoglobin.
    • Multi-domain chain: Contains several domains, each with distinct structural and functional roles.
  • Chains assemble to form protein complexes or structures.

Key Observations About SCOPe

  • The fundamental classification unit in SCOPe is the protein domain, not the entire protein.
  • The taxonomy in SCOPe is not for the entire protein (i.e., the full-length amino acid sequence as encoded by a gene) but for protein domains, which are smaller, structurally and functionally distinct regions of the protein.
  • This implies that our focus should shift from "classifying proteins into SCOPe taxonomy" to "classifying protein domains into SCOPe taxonomy."

Please let me know if this interpretation is accurate or if I’ve misunderstood any aspect.

@aditya0by0
Collaborator

> I have talked to some colleagues and they agreed that the following plan makes sense:
>
>   • Take the SCOPe labels for protein domains
>   • Trace them back to their protein sequences (the label of a protein sequence becomes the sum of all domain labels)
>   • Train on protein sequences
>
> If this is technically feasible, this gives us a pretraining task that is comparable to the GO task.

Thank you for sharing the plan! I have a clarification question. Since each protein sequence can contain multiple domains, if the label of a protein sequence is the sum of all domain labels, wouldn't that imply that the sequence could simultaneously carry multiple labels at the same hierarchy level, e.g. several classes (alpha, beta, alpha+beta) or several folds?

This could result in most labels being marked as True. Additionally, this seems to conflict with the SCOPe taxonomy, where each protein domain (classification unit) should only belong to one family, superfamily, fold, and class.

@sfluegel05
Collaborator Author

Thanks for the overview. That's very helpful!

I like the annotations page in PDB (e.g. for 1A3N), because it shows SCOPe and GO labels on the same page. It looks like GO annotations are attached not to the protein complex but to the individual protein chains (for example, for 1A3N, chains A and B have different GO labels). So we don't need the complex, but the chain as our training data.

Regarding your question: Yes, having different labels for the same protein chain would be a consequence of putting several domains together. However, I don't think that will be a problem. While SCOPe might not handle it that way, our models can deal with multiple inheritance. Drawing the analogy to chemistry: zorbamycin is both a peptide and a carbohydrate because one part of the molecule has a peptide structure, and another part has a carbohydrate structure. Of course, peptides work a bit differently, but I think we can expect our model to learn several domains for the same chain if it contains multiple domains.

@aditya0by0
Collaborator

aditya0by0 commented Jan 18, 2025

SCOPe 2.08 Data Analysis Update:

The current SCOPe version (2.08) includes the following statistics for the relevant data:

  • Classes: 12
  • Folds: 1,485
  • Superfamilies: 2,368
  • Families: 5,431
  • Proteins: 13,514
  • Species: 30,294
  • Domains: 344,851

For more detailed statistics, please refer to the official SCOPe website: https://scop.berkeley.edu/


Note on Data Processing:

One-hot encoding all of these hierarchical levels will significantly increase the label dimensionality, which could lead to challenges in terms of memory and computational efficiency.

ChEBI and GOUniProt Dataset Overview:

  • ChEBIOver50 (chebi_version=231): ~1511 label columns
  • GOUniProtOver250 (go_branch="BP"): ~898 label columns

Given the large number of labels, do you think we should focus on Folds or Superfamilies for classification instead of trying to classify each level, to tackle this challenge?
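
For reference, a rough way to check how many labels each level would add is to count unique sccs prefixes (class.fold.superfamily.family) in SCOPe's parseable classification file. This is a sketch; the file name and column layout are assumed from the SCOPe "parseable files" release and should be verified:

def count_labels_per_level(cla_path: str) -> dict[str, int]:
    """Count unique SCOPe labels at each hierarchy level from a cla file."""
    levels = {"class": set(), "fold": set(), "superfamily": set(), "family": set()}
    with open(cla_path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            sccs = line.split()[3]          # assumed 4th column, e.g. "a.1.1.2"
            parts = sccs.split(".")
            levels["class"].add(parts[0])
            levels["fold"].add(".".join(parts[:2]))
            levels["superfamily"].add(".".join(parts[:3]))
            levels["family"].add(".".join(parts[:4]))
    return {level: len(s) for level, s in levels.items()}

# e.g. count_labels_per_level("dir.cla.scope.2.08-stable.txt")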

@sfluegel05
Collaborator Author

A todo for later we should keep in mind:

  • generate ESM2 embeddings for arbitrary proteins (e.g. for the SCOPe dataset) - this should be doable by taking the pretrained model from ESM, but requires some storage and compute capacity

@aditya0by0
Collaborator

aditya0by0 commented Jan 22, 2025

> A todo for later we should keep in mind:
>
>   • generate ESM2 embeddings for arbitrary proteins (e.g. for the SCOPe dataset) - this should be doable by taking the pretrained model from ESM, but requires some storage and compute capacity

I’ve already implemented this functionality in the ESM2EmbeddingReader class. You can check it out in the linked PR (#64).

This class allows initialization with the name of a pretrained model, downloads the respective model, and generates ESM2 embeddings from a specified layer of the model based on the provided parameters. For more details, please check the constructor.

For testing or trying out this reader, we can use lighter models such as esm2_t6_8M_UR50D (6 layers, valid layers: 1–6), which is a tiny 8M parameter model (~28 MB).

class ESM2EmbeddingReader(DataReader):
    """
    A data reader to process protein sequences using the ESM2 model for embeddings.

    References:
        https://github.com/bio-ontology-research-group/deepgo2/blob/main/deepgo/extract_esm.py

    Note:
        For layer availability by model, please check:
            https://github.com/facebookresearch/esm?tab=readme-ov-file#pre-trained-models-

        To test this reader, try lighter models:
            esm2_t6_8M_UR50D: 6 layers (valid layers: 1–6), ~28 MB - a tiny 8M-parameter model.
            esm2_t12_35M_UR50D: 12 layers (valid layers: 1–12), ~128 MB - a slightly larger 35M-parameter model.
        These smaller models are good for testing and debugging purposes.
    """

@aditya0by0
Collaborator

Some chain sequences contain amino-acid codes outside the standard 20, such as "U" (selenocysteine). In the context of SCOPe, we need to decide how to handle these cases. Here are two potential approaches:

  1. Ignore the sequence: Discard sequences containing invalid amino acids (as implemented in DeepGO 1).
  2. Replace invalid tokens with "X": Substitute invalid amino acids with "X" (as implemented in DeepGO 2). For more details on this approach, refer to the discussion in ChEB-AI Pull Request #64 and in Protein function prediction with GO - Part 3 #64 (comment).

@sfluegel05, Please advise on which approach to follow.

@sfluegel05
Collaborator Author

I think the DeepGO2 dataset is our finetuning target here. This means that all pretraining tasks (such as SCOPe ontology pretraining) should use the same encoding. So I would recommend replacing invalid tokens with "X".
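
A minimal sketch of that preprocessing (assuming the alphabet is the 20 standard amino-acid letters; anything else maps to "X"):

# Map any character outside the 20 standard amino-acid codes to "X",
# following the DeepGO2-style approach agreed on above.
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def sanitize_sequence(seq: str) -> str:
    return "".join(aa if aa in STANDARD_AAS else "X" for aa in seq.upper())

assert sanitize_sequence("MKTUVW") == "MKTXVW"  # "U" (selenocysteine) -> "X"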

@aditya0by0 aditya0by0 linked a pull request Feb 7, 2025 that will close this issue
@sfluegel05
Collaborator Author

@aditya0by0 Some "convenience" todos for the SCOPe dataset:

  • Create a notebook for SCOPe with the information you have gathered and used for implementing the dataset (some of that information is already in the comments above)
  • Create a mapping between label names in the dataset and their names in SCOPe (e.g. fold_48725, A.1, haemoglobin-like protein). This will come in handy, for instance, during evaluation when we want to find out (either manually or automatically) which classes performed well and which performed poorly. (A sketch follows below.)
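
A possible starting point for that mapping, reading SCOPe's description file (the file name and tab-separated layout of sunid, level, sccs, sid, description are assumptions from the SCOPe parseable-files release; the example output is illustrative):

def build_label_names(des_path: str) -> dict[str, tuple[str, str]]:
    """Map dataset label names (sunid-based) to their SCOPe sccs and description."""
    level_names = {"cl": "class", "cf": "fold", "sf": "superfamily", "fa": "family"}
    mapping = {}
    with open(des_path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            sunid, level, sccs, _sid, desc = line.split("\t", 4)
            if level in level_names:
                mapping[f"{level_names[level]}_{sunid}"] = (sccs, desc.strip())
    return mapping

# e.g. build_label_names("dir.des.scope.2.08-stable.txt")["fold_46457"]
# might return something like ("a.1", "Globin-like")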

@sfluegel05
Collaborator Author

I have some preliminary results for SCOPe2000. We reach a macro-F1 of 0.7 and a micro-F1 of 0.9, which is quite good (although we have no baseline to compare to, and this is only for ~30 classes).

