all-vs-all kmer sharing and mash distances for a set of proteins #185

Open
stubrown opened this issue Nov 26, 2024 · 0 comments

@stubrown

Hello Mash authors. I am using Mash for protein clustering. We have a large database of 8 million proteins, and the current clustering method relies on several stages of all-vs-all BLAST comparisons.

I have implemented a Mash distance to quantify the dispersion of proteins within a cluster. Right now I choose a centroid for the cluster with a complex process that relies on BLAST E-values computed elsewhere in the workflow, and then run Mash with this central protein against a FASTA file of all the other proteins. This generates a 'distance to the center' metric for each protein that is a valid measure of cluster dispersion.
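To make the 'distance to the center' idea concrete, here is a minimal Python sketch. It is not Mash's implementation: it computes the exact Jaccard index on full k-mer sets instead of a MinHash estimate, then applies the Mash distance formula D = -(1/k) ln(2j / (1 + j)). The function names, the default `k=9`, and the clamping to [0, 1] are assumptions made for illustration.

```python
import math


def kmers(seq, k):
    """Return the set of distinct k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(jaccard, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)), clamped to [0, 1] here."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: treat as maximally distant
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, d))


def distances_to_centroid(centroid, members, k=9):
    """Map each member name to its distance from the centroid sequence.

    `members` is a hypothetical dict of {name: sequence}.
    """
    centroid_kmers = kmers(centroid, k)
    out = {}
    for name, seq in members.items():
        member_kmers = kmers(seq, k)
        union = len(centroid_kmers | member_kmers)
        # Exact Jaccard on full k-mer sets (an assumption; Mash estimates it).
        j = len(centroid_kmers & member_kmers) / union if union else 0.0
        out[name] = mash_distance(j, k)
    return out
```

The mean or maximum of the returned values could then serve as a per-cluster dispersion summary.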

Intuitively, it seems to me that it should be possible to build a single hash table for a set of proteins and then extract, in one operation, all of the pairwise k-mer sharing counts, and from these efficiently create a matrix of distances for all comparisons among all proteins in the set. This should scale well to millions of proteins, much better than pairwise all-vs-all BLAST. The matrix could then be used as input for a clustering algorithm.
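One way to picture the single-hash idea is an inverted index from each k-mer to the proteins that contain it; walking the index once yields every non-zero pairwise sharing count. The Python below is a sketch under assumptions, not something Mash provides: it uses full k-mer sets rather than MinHash sketches, the function names are hypothetical, and pairs that share no k-mer are simply omitted (implicitly distance 1). A sparse result matters at this scale, since a dense matrix over 8 million proteins would hold roughly 3.2 × 10^13 pairs.

```python
import math
from collections import defaultdict
from itertools import combinations


def shared_kmer_counts(proteins, k=9):
    """Count shared k-mers for every pair of proteins that shares at least one.

    `proteins` is a hypothetical dict of {name: sequence}.
    """
    index = defaultdict(set)  # k-mer -> names of proteins containing it
    sizes = {}                # name -> number of distinct k-mers
    for name, seq in proteins.items():
        ks = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        sizes[name] = len(ks)
        for kmer in ks:
            index[kmer].add(name)

    shared = defaultdict(int)
    for names in index.values():
        # Each k-mer contributes one shared count to every pair in its bucket.
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return shared, sizes


def sparse_distances(proteins, k=9):
    """Return {(a, b): (shared_count, distance)} for pairs sharing a k-mer."""
    shared, sizes = shared_kmer_counts(proteins, k)
    result = {}
    for (a, b), n in shared.items():
        j = n / (sizes[a] + sizes[b] - n)            # exact Jaccard index
        d = -math.log(2.0 * j / (1.0 + j)) / k       # Mash distance formula
        result[(a, b)] = (n, min(1.0, max(0.0, d)))  # clamp to [0, 1]
    return result
```

The cost is driven by bucket sizes: a k-mer present in m proteins generates m(m-1)/2 pair updates, so very common k-mers would likely need to be capped or skipped before this approach could be tried on millions of sequences.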

Your thoughts on this would be very helpful.
[email protected]
