all-vs-all kmer sharing and mash distances for a set of proteins #185

stubrown · 2024-11-26T19:54:23Z

Hello Mash authors. I am working with Mash for protein clustering. We have a large database of 8 million proteins, and the current clustering method relies on several stages of all-vs-all BLAST comparisons.

I have implemented a Mash distance to quantify the dispersion of proteins within a cluster. Right now I choose a centroid for the cluster through a complex process that relies on BLAST e-values computed elsewhere in the workflow, and then run Mash with this central protein against a FASTA file of all other proteins. This generates a 'distance to the center' value for each protein, which serves as a measure of cluster dispersion.
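To make the 'distance to the center' idea concrete, here is a minimal Python sketch. It is not Mash itself: it computes the Jaccard index on exact kmer sets rather than estimating it from MinHash sketches, and then applies the Mash distance formula D = -(1/k) ln(2j / (1 + j)). The function names, the toy sequences, and k=3 are assumptions chosen for illustration only.

```python
import math
from statistics import mean


def kmers(seq, k):
    """Return the set of distinct kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_distance(a, b, k):
    """Mash-style distance from the exact Jaccard index of two kmer sets.

    Assumption: exact sets stand in for the MinHash sketches Mash uses.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = len(ka | kb)
    if union == 0:
        return 1.0
    j = len(ka & kb) / union
    if j == 0:
        return 1.0  # no shared kmers: report the maximum distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def distances_to_centroid(centroid, members, k=3):
    """Distance from the centroid sequence to every member of the cluster."""
    return {name: mash_distance(centroid, seq, k) for name, seq in members.items()}


# Hypothetical toy cluster; real input would come from the FASTA file.
centroid = "MKTAYIAKQR"
members = {"p1": "MKTAYIAKQR", "p2": "MKTAYIAKQL", "p3": "GGGGGGGGGG"}
dists = distances_to_centroid(centroid, members)
dispersion = mean(dists.values())  # one possible summary of cluster dispersion
```

Here the mean of the per-protein distances is used as the dispersion summary; a median or maximum would be an equally reasonable choice.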

Intuitively, it seems to me that it should be possible to build a single hash for a set of proteins and then, in one operation, extract all of the pairwise k-mer sharing counts and efficiently create a matrix of distances for all comparisons among all proteins in the set. This would scale very well to millions of proteins, much better than pairwise all-vs-all BLAST. The matrix could then be used as input for a clustering algorithm.
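One way to picture the single-hash idea is an inverted index from each k-mer to the proteins that contain it. The sketch below is a rough illustration under that assumption, not a description of how Mash works internally. It builds the index once and accumulates shared k-mer counts only for pairs that actually share something. All names, the toy sequences, and k=3 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(proteins, k):
    """Build one k-mer -> protein index, then count shared k-mers per pair.

    Returns (shared, sizes): shared maps a sorted (name_a, name_b) pair to
    the number of distinct k-mers both contain; sizes maps each name to the
    size of its k-mer set.
    """
    index = defaultdict(set)
    sizes = {}
    for name, seq in proteins.items():
        ks = kmers(seq, k)
        sizes[name] = len(ks)
        for km in ks:
            index[km].add(name)

    shared = defaultdict(int)
    for names in index.values():
        # Every pair of proteins listed under this k-mer shares it.
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return dict(shared), sizes


def jaccard(shared, sizes, a, b):
    """Jaccard index of two proteins from the shared counts and set sizes."""
    a, b = sorted((a, b))
    s = shared.get((a, b), 0)
    union = sizes[a] + sizes[b] - s
    return s / union if union else 0.0


# Hypothetical toy input.
proteins = {"p1": "MKTAYIAKQR", "p2": "MKTAYIAKQL", "p3": "GGGGGGGGGG"}
shared, sizes = shared_kmer_counts(proteins, k=3)
j12 = jaccard(shared, sizes, "p1", "p2")
```

Pairs with no k-mer in common never appear in `shared`, so the result is naturally sparse. The cost is driven by how many proteins fall under each k-mer: a k-mer present in n proteins contributes n(n-1)/2 pair updates, so very common k-mers can dominate the run time. A Jaccard value j can be converted to a Mash distance with D = -(1/k) ln(2j / (1 + j)).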

Your thoughts on this would be very helpful.
