Mapping from Sequences to Domains/Scaffolds #5

brucejwittmann · 2023-07-08T00:19:31Z

I'm working with Dataset 3 and am trying to compare K50 values for proteins that share a fold. However, I can't see an easy way to identify which proteins belong to which domain (or fold) from the columns in "K50_Dataset3.csv". Do you happen to have a table available that maps the name of each entry in K50_Dataset3.csv to its associated domain? In all, I'm hoping to be able to distinguish (1) which natural proteins share a wild type and (2) which designed proteins share the same base scaffold.

JinyuanSun · 2023-07-30T09:00:43Z

Hi, have you found a way to solve this?

brucejwittmann · 2023-08-16T05:42:04Z

Somewhat, though it's definitely not perfect. I approached it two separate ways:

1. I assumed that all entries sharing the same PDB name (the part of the entry name before the mutation information) belong to the same group. This works well for about 90% of the entries, and clustering with mmseqs confirmed that my initial assumption about the PDB name was good.
2. I used mmseqs easy-cluster (see here) to cluster at 90% sequence identity and 80% coverage. I assumed that anything sharing a cluster came from the same domain. This is obviously imperfect, but it at least groups similar proteins.
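
For anyone wanting to try the same thing, here is a rough sketch of both approaches in Python. It is not the exact code used above, and it rests on several assumptions: that entry names look like `<base>.pdb_<mutation>`, that the clustering was run with something like `mmseqs easy-cluster seqs.fasta clusterRes tmp --min-seq-id 0.9 -c 0.8`, and that the resulting cluster file is a two-column TSV of representative and member. The function names and the example entry names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def base_name(entry_name: str) -> str:
    """Return the part of an entry name before the mutation information.

    Assumption: names look like '<base>.pdb_<mutation>'. Names without
    '.pdb' are returned unchanged so they can be inspected by hand.
    """
    marker = ".pdb"
    idx = entry_name.find(marker)
    return entry_name if idx == -1 else entry_name[: idx + len(marker)]


def group_by_base(names):
    """Approach 1: group entry names that share the same base name."""
    groups = defaultdict(list)
    for name in names:
        groups[base_name(name)].append(name)
    return dict(groups)


def read_mmseqs_clusters(tsv_text: str):
    """Approach 2: map each member to its cluster representative.

    Assumption: tsv_text holds the two-column cluster TSV
    (representative <TAB> member) written by mmseqs easy-cluster.
    """
    member_to_rep = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2:
            member_to_rep[row[1]] = row[0]
    return member_to_rep


def split_groups(groups, member_to_rep):
    """Return base names whose members fall into more than one cluster."""
    split = {}
    for base, members in groups.items():
        reps = {member_to_rep[m] for m in members if m in member_to_rep}
        if len(reps) > 1:
            split[base] = sorted(reps)
    return split


if __name__ == "__main__":
    # Hypothetical entry names, for illustration only.
    names = ["1ABC.pdb_wt", "1ABC.pdb_A12G", "2XYZ.pdb_wt"]
    groups = group_by_base(names)
    clusters = read_mmseqs_clusters(
        "1ABC.pdb_wt\t1ABC.pdb_wt\n"
        "1ABC.pdb_wt\t1ABC.pdb_A12G\n"
        "2XYZ.pdb_wt\t2XYZ.pdb_wt\n"
    )
    print(groups)
    print(split_groups(groups, clusters))
```

`split_groups` is one way to cross-check the two approaches: any base name it returns has members spread over several sequence clusters, which is where the name-based assumption probably needs a manual look.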
