Hello Mash authors. I am working with Mash for protein clustering. We have a large database of 8 million proteins, and the current clustering method relies on several stages of all-vs-all BLAST comparisons.
I have implemented a Mash distance to quantify the dispersion of proteins within a cluster. Right now I choose a centroid for the cluster through a complex process that relies on BLAST e-values computed elsewhere in the workflow, and then run Mash on this central protein against a FASTA file of all the other proteins. This generates a 'distance to the center' metric for each protein, which is a valid measure of cluster dispersion.
Intuitively, it seems to me that it should be possible to build a single hash for a set of proteins, extract all of the pairwise k-mer sharing counts in one operation, and efficiently create a matrix of distances for all comparisons among all proteins in the set. I would expect this to scale to millions of proteins much better than pairwise all-vs-all BLAST. The matrix could then be used as input for a clustering algorithm.
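To make the idea concrete, here is a minimal sketch in plain Python of what I have in mind. It is not Mash's own code: the function names, the BLAKE2 hash, and the defaults `k=5` and `s=100` are my assumptions for illustration. Each protein is reduced to the `s` smallest k-mer hashes, the Jaccard index is estimated from each pair of sketches, and that estimate is converted to a distance with the published Mash formula D = -(1/k) ln(2j / (1 + j)).

```python
import hashlib
import math
from itertools import combinations


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of every k-mer in seq."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def sketch(seq, k=5, s=100):
    """Bottom-s MinHash sketch: the s smallest k-mer hashes, sorted."""
    return sorted(kmer_hashes(seq, k))[:s]


def jaccard_estimate(a, b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:s]  # s smallest hashes of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j to a Mash-style distance in [0, 1]."""
    if j <= 0.0:
        return 1.0
    if j >= 1.0:
        return 0.0
    return min(1.0, -math.log(2.0 * j / (1.0 + j)) / k)


def distance_matrix(seqs, k=5, s=100):
    """seqs maps name -> protein sequence. Returns (names, square matrix)."""
    names = list(seqs)
    sketches = [sketch(seqs[n], k, s) for n in names]  # one sketch per protein
    n = len(names)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):  # each unordered pair once
        d = mash_distance(jaccard_estimate(sketches[i], sketches[j], s), k)
        dist[i][j] = dist[j][i] = d
    return names, dist
```

Sketching is done once per protein, but this version still visits every pair to fill the matrix, so for 8 million proteins the roughly 3.2 × 10^13 pairs would need some form of pre-filtering or a sparse matrix; the code only illustrates the per-pair computation.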
Your thoughts on this would be very helpful.