Mapping from Sequences to Domains/Scaffolds #5

Open
brucejwittmann opened this issue Jul 8, 2023 · 2 comments

Comments

@brucejwittmann

I'm working with Dataset 3 and am trying to compare K50 values for proteins that share a fold. However, I can't see an easy way to identify which proteins belong to which domain (or fold) from the columns in "K50_Dataset3.csv". Do you happen to have a table available that maps the name of each entry in K50_Dataset3.csv to its associated domain? In all, I'm hoping to be able to distinguish (1) which natural proteins share a wild type and (2) which designed proteins share the same base scaffold.

@JinyuanSun

Hi, have you found a way to solve this?

@brucejwittmann (Author)

Somewhat, though it's definitely not perfect. I approached it in two separate ways:

  1. I assumed that all entries with the same PDB name prior to the mutation information were from the same group. This works well for about 90% of the entries, and clustering with mmseqs confirmed that my initial assumption about the PDB name was good.
  2. I used mmseqs easy-cluster (see here) to cluster at 90% sequence identity and 80% coverage. I assumed that any sequences sharing a cluster were from the same domain. This is obviously imperfect, but it at least groups similar proteins.
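
For anyone wanting to try the same thing, the two approaches above might be sketched roughly as follows. This is only an illustration, not the exact code used here. It assumes entry names look like `<base>.pdb_<mutation>` (the `.pdb` marker and the example names are hypothetical, so check them against the name column in your copy of the CSV), and that sequences have already been written to a FASTA file. The `--min-seq-id` and `-c` flags are the MMseqs2 options for minimum sequence identity and coverage; the output prefix `clusterRes` and the `tmp` directory are arbitrary choices.

```python
import io
from collections import defaultdict


def base_name(entry_name, marker=".pdb"):
    """Return the text before `marker`, assumed to precede the mutation info.

    Names that do not contain the marker are returned unchanged.
    """
    head, sep, _ = entry_name.partition(marker)
    return head if sep else entry_name


def group_by_base_name(names):
    """Approach 1: group entry names that share the same base name."""
    groups = defaultdict(list)
    for name in names:
        groups[base_name(name)].append(name)
    return dict(groups)


def easy_cluster_command(fasta, prefix="clusterRes", tmp_dir="tmp"):
    """Approach 2: build the mmseqs call for 90% identity, 80% coverage.

    Pass the returned list to subprocess.run(..., check=True) to execute it.
    """
    return ["mmseqs", "easy-cluster", fasta, prefix, tmp_dir,
            "--min-seq-id", "0.9", "-c", "0.8"]


def read_clusters(tsv_handle):
    """Parse the two-column <prefix>_cluster.tsv (representative, member)."""
    clusters = defaultdict(list)
    for line in tsv_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            clusters[fields[0]].append(fields[1])
    return dict(clusters)


# Hypothetical names, only to show the shape of the result.
example_names = ["1ABC.pdb_wt", "1ABC.pdb_A12G", "scaffold7.pdb_L3P"]
example_groups = group_by_base_name(example_names)

# In-memory stand-in for a cluster TSV file written by mmseqs.
example_tsv = io.StringIO("repA\trepA\nrepA\tmemB\nrepC\trepC\n")
example_clusters = read_clusters(example_tsv)
```

Comparing the two groupings (for instance, checking that every name-based group falls within a single cluster) is one way to spot the entries where the name assumption breaks down.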
