I've put together a workflow utilising DIAMOND for taxonomy assignment. While DIAMOND is naturally far faster than using (e.g.) NCBI blastx for the same purpose, the DIAMOND taxonomy step is still the slowest in the workflow. Accordingly, I'm looking for ways to speed up this step.
Combining queries
In the DIAMOND publication, it's stated that "DIAMOND is optimized for searches using large query and reference databases". Does this mean that increasing the size of a single DIAMOND search will be more efficient/faster than running several smaller DIAMOND searches? I'm asking this because in my workflow, the short reads for each sample are searched separately against a clustered version of the NR database (clustered at 90% similarity) using diamond blastx. One option I've considered is adding a unique identifier to the read headers to identify each sample, then concatenating the separate reads files per sample into a single large reads file for a single DIAMOND search, followed by deconvolution into results per sample. In your opinion, would such an approach lead to noticeable speed improvements?
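The concatenate-and-deconvolute idea above can be sketched in shell, assuming FASTA reads and DIAMOND's default tabular output (query ID in column 1); all sample and file names below are placeholders:

```shell
# Toy inputs standing in for per-sample read files (placeholders).
printf '>read1\nACGTACGT\n' > sampleA.fasta
printf '>read2\nTTTTGGGG\n' > sampleB.fasta

# 1) Prefix every header with its sample name, then concatenate.
for s in sampleA sampleB; do
  awk -v s="$s" '/^>/ {sub(/^>/, ">" s "|")} 1' "$s.fasta"
done > combined.fasta

# 2) A single large DIAMOND search would run here, e.g.:
#    diamond blastx -d nr_clustered.dmnd -q combined.fasta -o combined_hits.tsv
# Simulated tabular output so step 3 is runnable in this sketch:
printf 'sampleA|read1\tsubj1\t99.0\nsampleB|read2\tsubj2\t97.5\n' > combined_hits.tsv

# 3) Deconvolute: route each row to its sample's file via the header prefix.
awk -F'\t' '{split($1, a, "|"); out = a[1] "_hits.tsv"; print > out}' combined_hits.tsv
```

For FASTQ input the same prefixing would be applied to the `@` header lines instead, and a different separator may be preferable if your read IDs already contain `|`.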
Sensitivity speeds
Additionally, since I am only interested in close hits in terms of %id, I use the --fast flag:
--fast
Enable the fast sensitivity mode, which runs faster than default and is designed for finding hits of >90% identity. Option supported since v2.0.10
When looking at the available CLI options for diamond blastx, I can see that another possibility is:
--faster enable faster mode
Is there a description somewhere of the difference between --fast and --faster in terms of speed vs sensitivity, and/or the same for the other presets? I'm thinking in terms of "fast == %ID >= 90%; faster == %ID >= 95%; mid-sensitive == ..." (etc.)
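For concreteness, the presets are mutually exclusive flags on an otherwise identical command line (database and file names here are placeholders):

```shell
diamond blastx -d nr_clustered.dmnd -q reads.fasta -o hits.tsv --fast
diamond blastx -d nr_clustered.dmnd -q reads.fasta -o hits.tsv --faster
diamond blastx -d nr_clustered.dmnd -q reads.fasta -o hits.tsv --mid-sensitive
```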
Block size vs index chunks
Would I be better off (speedier) keeping the number of index chunks at 1 and then increasing the block size as far as I can within memory limits, or would there be benefits to increasing the index chunks and block size? For example, when desiring a RAM usage of ~90GB, I set -c 1 -b 4.5, but I could also do (e.g.) -c 2 -b 8.
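The two configurations described above would look like this (database and file names are placeholders); only the `-c`/`-b` combination differs between the runs:

```shell
# One index chunk, smaller block size:
diamond blastx -d nr_clustered.dmnd -q reads.fasta -o hits.tsv -c 1 -b 4.5
# Two index chunks, larger block size:
diamond blastx -d nr_clustered.dmnd -q reads.fasta -o hits.tsv -c 2 -b 8
```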
Excluding/including taxa with outfmt 102
When using diamond to find the LCA assignment for input queries (outfmt 102), would excluding an irrelevant part of the subject database speed things up? For example, would excluding eukaryotes using --taxon-exclude 2759 speed things up when I'm only really interested in viruses/bacteria? Or is the speed the same, just with excluded taxa not being reported?
Thanks
charlesfoster changed the title from "Optimising the speed of DIAMOND based on the size of input queries" to "Optimising the speed of DIAMOND based on the size of input queries (and other)" on Aug 27, 2024.
In your opinion, would such an approach lead to noticeable speed improvements?
That depends on the size of your query files; I suggest testing it.
Is there a description somewhere of the difference between --fast and --faster in terms of speed vs sensitivity, and/or the same for the other presets? I'm thinking in terms of "fast == %ID >= 90%; faster == %ID >= 95%; mid-sensitive == ..." (etc.)
See Extended Data Figure 2 of the DIAMOND v2 paper. A comparable benchmark was not done for the more recent modes; you have to test for yourself what will provide sufficient sensitivity for your application.
Would I be better off (speedier) keeping the number of index chunks at 1 and then increasing the block size as far as I can within memory limits, or would there be benefits to increasing the index chunks and block size? For example, when desiring a RAM usage of ~90GB, I set -c 1 -b 4.5, but I could also do (e.g.) -c 2 -b 8.
I would recommend always trying -c 1 first.
When using diamond to find the LCA assignment for input queries (outfmt 102), would excluding an irrelevant part of the subject database speed things up? For example, would excluding eukaryotes using --taxon-exclude 2759 speed things up when I'm only really interested in viruses/bacteria? Or is the speed the same, just with excluded taxa not being reported?
Using this should provide a big speedup as the sequences will not be loaded and searched against.
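A sketch of such an LCA run with eukaryotes excluded (database and file names are placeholders):

```shell
# --outfmt 102 emits one LCA assignment per query; --taxon-exclude 2759
# drops eukaryotic subjects before the search rather than filtering the report.
diamond blastx -d nr_clustered.dmnd -q reads.fasta -o lca.tsv \
  --outfmt 102 --taxon-exclude 2759
```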