Add SCOPe dataset to our pipeline #67
I have talked to some colleagues and they agreed that the following plan makes sense:
If this is technically feasible, this would give us a pretraining task comparable to the GO task.
Understanding SCOPe and PDB: Detailed Notes

After further exploration of SCOPe and PDB, here is my current understanding:

Protein Domain
A protein domain is a structural and functional unit of a protein. Key characteristics:

Protein Chain
A protein chain refers to the entire polypeptide chain observed in a protein's 3D structure (as described in PDB files). Key points:

Key Observations About SCOPe

Please let me know if this interpretation is accurate or if I have misunderstood any aspect.
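For concreteness, here is a minimal sketch of how a line of the SCOPe classification file (dir.cla) could be parsed into the domain/chain/hierarchy fields described above. The tab-separated column layout is my reading of the public 2.08 release and should be verified against the actual download:

```python
# Minimal sketch: parsing a SCOPe dir.cla file (column layout as in the
# public 2.08 release; verify against the actual download).
# Each line describes one domain: sid, PDB id, chain/residue region, sccs
# (class.fold.superfamily.family), sunid, and the hierarchy sunids.

def parse_cla_line(line: str) -> dict:
    sid, pdb_id, region, sccs, sunid, hierarchy = line.rstrip("\n").split("\t")
    cls, fold, superfamily, family = sccs.split(".")  # "cls" since "class" is reserved
    return {
        "sid": sid,              # domain identifier, e.g. "d1dlwa_"
        "pdb_id": pdb_id,        # PDB entry, e.g. "1dlw"
        "region": region,        # chain (and residue range), e.g. "A:" or "A:1-120";
                                 # can list several segments for multi-chain domains
        "class": cls,            # e.g. "a" (all-alpha)
        "fold": f"{cls}.{fold}",
        "superfamily": f"{cls}.{fold}.{superfamily}",
        "family": sccs,          # full sccs string, e.g. "a.1.1.1"
        "sunid": int(sunid),
        "hierarchy": hierarchy,  # "cl=...,cf=...,sf=...,fa=...,dm=...,sp=...,px=..."
    }

def parse_cla_file(path: str):
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            yield parse_cla_line(line)
```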
Thank you for sharing the plan! I have a clarification question. Since each protein sequence can contain multiple domains, if the label of a protein sequence is the sum of all domain labels, wouldn't that imply that the sequence could simultaneously belong to multiple labels at the same hierarchy level, e.g. several classes (alpha, beta, alpha+beta), or several folds, superfamilies, and so on? This could result in most labels being marked as True. Additionally, this seems to conflict with the SCOPe taxonomy, where each protein domain (the classification unit) belongs to exactly one family, superfamily, fold, and class.
Thanks for the overview, that's very helpful! I like the annotations page in PDB (e.g. for 1A3N), because it shows SCOPe and GO labels on the same page. It looks like the level GO is working with is not the protein complex but the protein chain (for example, for 1A3N, chains A and B have different GO labels). So we don't need the complex, but the chain, as our training data.

Regarding your question: yes, having several labels for the same protein chain would be a consequence of putting multiple domains together. However, I don't think that will be a problem. While SCOPe might not handle it that way, our models can deal with multiple inheritance. Drawing an analogy to chemistry: zorbamycin is both a peptide and a carbohydrate, because one part of the molecule has a peptide structure and another part has a carbohydrate structure. Of course, peptides work a bit differently, but I think we can expect our model to learn several domains for the same chain if it contains multiple domains.
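To make the multiple-inheritance idea concrete, the label vector of a chain could simply be the union (logical OR) of the one-hot vectors of all its domains. A minimal sketch, assuming we already have (chain_id, label) pairs at one chosen hierarchy level (the helper names here are illustrative, not part of our codebase):

```python
from collections import defaultdict
import numpy as np

def chain_multihot(domain_labels, label_index):
    """Union of per-domain labels -> one multi-hot target vector per chain.

    domain_labels: iterable of (chain_id, label) pairs, one per domain.
    label_index:   dict mapping each label to a column index.
    """
    per_chain = defaultdict(set)
    for chain_id, label in domain_labels:
        per_chain[chain_id].add(label)

    targets = {}
    for chain_id, labels in per_chain.items():
        vec = np.zeros(len(label_index), dtype=bool)
        for label in labels:
            vec[label_index[label]] = True
        targets[chain_id] = vec
    return targets

# A chain with an all-alpha (a) and an alpha+beta (d) domain gets both classes set:
idx = {"a": 0, "b": 1, "c": 2, "d": 3}
t = chain_multihot([("1a3n_A", "a"), ("1a3n_A", "d")], idx)
assert t["1a3n_A"].tolist() == [True, False, False, True]
```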
SCOPe 2.08 Data Analysis Update: the current SCOPe version (2.08) includes the following statistics, based on an analysis of the relevant data:
For more detailed statistics, please refer to the official SCOPe website.

Note on data processing: performing one-hot encoding on these hierarchical levels will significantly increase the data size, which could lead to challenges in terms of memory and computational efficiency.

ChEBI and GOUniProt Dataset Overview:
Given the large number of labels, do you think we should focus on folds or superfamilies for classification, instead of trying to classify every level, to tackle this challenge?
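On the memory side of that question: restricting the targets to a single level and storing them sparsely should keep things manageable. A rough sketch (the label counts below are placeholders, not the actual 2.08 statistics):

```python
import numpy as np
from scipy.sparse import lil_matrix

n_chains = 140_000  # order of magnitude from this issue, not exact
n_folds = 1_500     # placeholder; substitute the real 2.08 fold count

# Dense boolean target matrix costs n_chains * n_folds bytes:
dense_mb = n_chains * n_folds / 1e6
print(f"dense: ~{dense_mb:.0f} MB")  # ~210 MB at these placeholder sizes

# A sparse matrix only stores the (few) positive labels per chain:
labels = lil_matrix((n_chains, n_folds), dtype=bool)
labels[0, 42] = True     # e.g. chain 0 belongs to fold 42
labels = labels.tocsr()  # efficient row access when building training batches
```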
A todo we should keep in mind for later:
I've already implemented this functionality in the `ESM2EmbeddingReader` class. This class allows initialization with the name of a pretrained model, downloads the respective model, and generates ESM2 embeddings from a specified layer of the model based on the provided parameters. For more details, please check the constructor. For testing or trying out this reader, we can use lighter models such as `esm2_t6_8M_UR50D` or `esm2_t12_35M_UR50D`:

```python
class ESM2EmbeddingReader(DataReader):
    """
    A data reader that processes protein sequences into embeddings using the ESM2 model.

    References:
        https://github.com/bio-ontology-research-group/deepgo2/blob/main/deepgo/extract_esm.py

    Note:
        For layer availability by model, please check the link below:
        https://github.com/facebookresearch/esm?tab=readme-ov-file#pre-trained-models-

        To test this reader, try the lighter models:
            esm2_t6_8M_UR50D:   6 layers (valid layers: 1-6),   ~28 MB - a tiny 8M-parameter model.
            esm2_t12_35M_UR50D: 12 layers (valid layers: 1-12), ~128 MB - a slightly larger 35M-parameter model.

        These smaller models are good for testing and debugging purposes.
    """
```
Some chain sequences contain non-standard amino acid codes, such as "U" (selenocysteine). In the context of SCOPe, we need to decide how to handle these cases. Here are two potential approaches:

@sfluegel05, please advise on which approach to follow.
I think the DeepGO2 dataset is our finetuning target here. This means that all pretraining tasks (such as SCOPe ontology pretraining) should use the same encoding. So I would recommend replacing invalid tokens with "X".
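A minimal sketch of that replacement, assuming everything outside the 20 standard residues should map to "X":

```python
# Replace non-standard residues (e.g. U = selenocysteine, and codes like O, B, Z)
# with "X", assuming the downstream DeepGO2-style encoding only covers the
# 20 standard amino acids plus "X".
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def sanitize_sequence(seq: str) -> str:
    return "".join(aa if aa in STANDARD_AAS else "X" for aa in seq.upper())

assert sanitize_sequence("MKVU") == "MKVX"
```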
@aditya0by0 Some "convenience" todos for the SCOPe dataset:
Our goal is to reproduce the ontology pretraining on a protein-related task. For this, we have already implemented a GO dataset (see #36). The next step would be to add a corresponding pretraining task. This would give us the following alignment:
SCOPe is a good fit since it is mostly structure-based (unlike GO, which has more complex functional classes). It also has a manageable size (~140,000 entries, similar to ChEBI).
Goal
Add a SCOPe dataset to our pipeline. The data should be processed so that it can be used in the same way as, e.g., the GO data (just with different labels).
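As an illustration of what "the same way as the GO data" could look like for a single processed sample (the field names are assumptions, not the actual schema):

```python
# Hypothetical processed sample, mirroring a typical multi-label setup;
# field names are assumptions for illustration.
sample = {
    "id": "1a3n_A",                 # PDB id + chain
    "sequence": "MVLSPADKTNVKAAW",  # (sanitized) amino acid sequence
    "labels": [0, 1, 0, 1],         # multi-hot vector over the chosen SCOPe level
}
```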
Links