Optimising the speed of DIAMOND based on the size of input queries (and other) #829

charlesfoster opened this issue Aug 27, 2024 · 1 comment

charlesfoster commented Aug 27, 2024

Hello,

I've put together a workflow utilising DIAMOND for taxonomy assignment. While DIAMOND is naturally far faster than using (e.g.) NCBI blastx for the same purpose, the DIAMOND taxonomy step is still the slowest in the workflow. Accordingly, I'm looking for ways to speed up this step.

Combining queries

In the DIAMOND publication, it's stated that "DIAMOND is optimized for searches using large query and reference databases". Does this mean that increasing the size of a single DIAMOND search will be more efficient/faster than running several smaller DIAMOND searches? I'm asking this because in my workflow, the short reads for each sample are searched separately against a clustered version of the NR database (clustered at 90% similarity) using diamond blastx. One option I've considered is adding a unique identifier to the read headers to identify each sample, then concatenating the separate reads files per sample into a single large reads file for a single DIAMOND search, followed by deconvolution into results per sample. In your opinion, would such an approach lead to noticeable speed improvements?
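
The combine-then-deconvolve idea can be sketched as below. This is a hypothetical illustration, not part of the workflow described above: the file names, the `sample|` header separator, and the mock hits file are all placeholders, and the real DIAMOND search is shown commented out.

```shell
# Toy per-sample inputs standing in for real read files.
printf '>read1\nATGC\n' > sampleA.fasta
printf '>read1\nGGCC\n' > sampleB.fasta

# 1. Prefix each read header with its sample name, then concatenate.
for s in sampleA sampleB; do
  sed "s/^>/>${s}|/" "${s}.fasta"
done > combined.fasta

# 2. One large search instead of many small ones (not executed here):
# diamond blastx -d nr_clustered.dmnd -q combined.fasta -o combined.tsv --fast

# Mock tabular output (query ID in column 1) standing in for combined.tsv.
printf 'sampleA|read1\tsubj1\t97.0\nsampleB|read1\tsubj2\t95.5\n' > combined.tsv

# 3. Deconvolve results per sample on the "sample|" prefix, restoring
#    the original read IDs.
awk -F'\t' 'BEGIN { OFS = "\t" }
  { n = index($1, "|"); s = substr($1, 1, n - 1)
    $1 = substr($1, n + 1)
    print > (s "_hits.tsv") }' combined.tsv
```

The same split step works for any tabular `--outfmt` as long as the query ID is the first column.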

Sensitivity speeds

Additionally, since I am only interested in close hits in terms of %id, I use the --fast flag:

--fast

Enable the fast sensitivity mode, which runs faster than default and is designed for finding hits of >90% identity. Option supported since v2.0.10

When looking at the available CLI options for diamond blastx, I can see that another possibility is:

--faster                 enable faster mode

Is there a description somewhere of the difference between --fast and --faster in terms of speed vs sensitivity, and/or the same for the other presets? I'm thinking in terms of "fast == %ID >= 90%; faster == %ID >= 95%; mid-sensitive == ..." (etc.)
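
In the absence of a published speed-vs-sensitivity table for every preset, one option is to benchmark the presets directly on a subset of one's own reads. A minimal dry-run sketch (database and query names are placeholders; drop the leading `echo`, and wrap with `time`, to actually run each preset):

```shell
# Print the command line for each sensitivity preset; remove "echo" to run.
for preset in --faster --fast --mid-sensitive --sensitive --very-sensitive; do
  echo diamond blastx -d nr_clustered.dmnd -q reads.fasta \
       -o "hits_${preset#--}.tsv" "$preset"
done
```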

block size vs index chunks

Would I be better off (speedier) keeping the number of index chunks at 1 and then increasing the block size as far as I can within memory limits, or would there be benefits to increasing the index chunks and block size? For example, when desiring a RAM usage of ~90GB, I set -c 1 -b 4.5, but I could also do (e.g.) -c 2 -b 8.
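
For picking `-b` from a RAM budget, the figures above imply roughly 20 GB of RAM per unit of `-b` at `-c 1` (4.5 × 20 ≈ 90 GB). That ratio is an assumption taken from this thread, not an official formula, so any value derived this way should be verified with a test run:

```shell
# Back-of-envelope -b from a RAM budget, assuming ~20 GB RAM per unit
# of -b at -c 1 (ratio inferred from this thread, not from the manual).
ram_gb=90
b=$(awk -v r="$ram_gb" 'BEGIN { printf "%.1f", r / 20 }')
echo "diamond blastx ... -c 1 -b ${b}"
```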

Excluding/including taxa with outfmt 102

When using diamond to find the LCA assignment for input queries (outfmt 102), would excluding an irrelevant part of the subject database speed things up? For example, would excluding eukaryotes using --taxon-exclude 2759 speed things up when I'm only really interested in viruses/bacteria? Or is the speed the same, just with excluded taxa not being reported?
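
For reference, a sketch of such a taxon-restricted LCA run (database and file names are placeholders; 2759 is the NCBI taxid for Eukaryota; the `echo` makes this a dry run):

```shell
# LCA assignment (--outfmt 102) with eukaryotes excluded from the search.
echo diamond blastx -d nr_clustered.dmnd -q reads.fasta \
     --outfmt 102 --taxon-exclude 2759 -o lca.tsv
```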

Thanks

bbuchfink (Owner) commented:

> In your opinion, would such an approach lead to noticeable speed improvements?

That depends on the size of your query files; I suggest testing it.

> Is there a description somewhere of the difference between --fast and --faster in terms of speed vs sensitivity, and/or the same for the other presets? I'm thinking in terms of "fast == %ID >= 90%; faster == %ID >= 95%; mid-sensitive == ..." (etc.)

See Extended Data Figure 2 of the DIAMOND v2 paper. That comparison was not done for the more recent modes; you have to test for yourself what will provide sufficient sensitivity for your application.

> Would I be better off (speedier) keeping the number of index chunks at 1 and then increasing the block size as far as I can within memory limits, or would there be benefits to increasing the index chunks and block size? For example, when desiring a RAM usage of ~90GB, I set -c 1 -b 4.5, but I could also do (e.g.) -c 2 -b 8.

I would recommend always using -c1 first.

> When using diamond to find the LCA assignment for input queries (outfmt 102), would excluding an irrelevant part of the subject database speed things up? For example, would excluding eukaryotes using --taxon-exclude 2759 speed things up when I'm only really interested in viruses/bacteria? Or is the speed the same, just with excluded taxa not being reported?

Using this should provide a big speedup, as the excluded sequences will not be loaded or searched against.
