Optimising the speed of DIAMOND based on the size of input queries (and other) #829

charlesfoster opened this issue Aug 27, 2024 · 1 comment

charlesfoster commented Aug 27, 2024

Hello,

I've put together a workflow utilising DIAMOND for taxonomy assignment. While DIAMOND is naturally far faster than using (e.g.) NCBI blastx for the same purpose, the DIAMOND taxonomy step is still the slowest in the workflow. Accordingly, I'm looking for ways to speed up this step.

Combining queries

In the DIAMOND publication, it's stated that "DIAMOND is optimized for searches using large query and reference databases". Does this mean that increasing the size of a single DIAMOND search will be more efficient/faster than running several smaller DIAMOND searches? I'm asking this because in my workflow, the short reads for each sample are searched separately against a clustered version of the NR database (clustered at 90% similarity) using diamond blastx. One option I've considered is adding a unique identifier to the read headers to identify each sample, then concatenating the separate reads files per sample into a single large reads file for a single DIAMOND search, followed by deconvolution into results per sample. In your opinion, would such an approach lead to noticeable speed improvements?
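
The combine-then-deconvolve idea can be sketched as below. This is a hypothetical illustration, not part of the workflow described above: the file names, the `sample|` header separator, and the mock hits file are all placeholders, and the real DIAMOND search is shown commented out.

```shell
# Toy per-sample inputs standing in for real read files.
printf '>read1\nATGC\n' > sampleA.fasta
printf '>read1\nGGCC\n' > sampleB.fasta

# 1. Prefix each read header with its sample name, then concatenate.
for s in sampleA sampleB; do
  sed "s/^>/>${s}|/" "${s}.fasta"
done > combined.fasta

# 2. One large search instead of many small ones (not executed here):
# diamond blastx -d nr_clustered.dmnd -q combined.fasta -o combined.tsv --fast

# Mock tabular output (query ID in column 1) standing in for combined.tsv.
printf 'sampleA|read1\tsubj1\t97.0\nsampleB|read1\tsubj2\t95.5\n' > combined.tsv

# 3. Deconvolve results per sample on the "sample|" prefix, restoring
#    the original read IDs.
awk -F'\t' 'BEGIN { OFS = "\t" }
  { n = index($1, "|"); s = substr($1, 1, n - 1)
    $1 = substr($1, n + 1)
    print > (s "_hits.tsv") }' combined.tsv
```

The same split step works for any tabular `--outfmt` as long as the query ID is the first column.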

Sensitivity speeds

Additionally, since I am only interested in close hits in terms of %id, I use the --fast flag:

--fast

Enable the fast sensitivity mode, which runs faster than default and is designed for finding hits of >90% identity. Option supported since v2.0.10

When looking at the available CLI options for diamond blastx, I can see that another possibility is:

--faster                 enable faster mode

Is there a description somewhere of the difference between --fast and --faster in terms of speed vs sensitivity, and/or the same for the other presets? I'm thinking in terms of "fast == %ID >= 90%; faster == %ID >= 95%; mid-sensitive == ..." (etc.)
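
In the absence of a published speed-vs-sensitivity table for every preset, one option is to benchmark the presets directly on a subset of one's own reads. A minimal dry-run sketch (database and query names are placeholders; drop the leading `echo`, and wrap with `time`, to actually run each preset):

```shell
# Print the command line for each sensitivity preset; remove "echo" to run.
for preset in --faster --fast --mid-sensitive --sensitive --very-sensitive; do
  echo diamond blastx -d nr_clustered.dmnd -q reads.fasta \
       -o "hits_${preset#--}.tsv" "$preset"
done
```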

block size vs index chunks

Would I be better off (speedier) keeping the number of index chunks at 1 and then increasing the block size as far as I can within memory limits, or would there be benefits to increasing the index chunks and block size? For example, when desiring a RAM usage of ~90GB, I set -c 1 -b 4.5, but I could also do (e.g.) -c 2 -b 8.
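
For picking `-b` from a RAM budget, the figures above imply roughly 20 GB of RAM per unit of `-b` at `-c 1` (4.5 × 20 ≈ 90 GB). That ratio is an assumption taken from this thread, not an official formula, so any value derived this way should be verified with a test run:

```shell
# Back-of-envelope -b from a RAM budget, assuming ~20 GB RAM per unit
# of -b at -c 1 (ratio inferred from this thread, not from the manual).
ram_gb=90
b=$(awk -v r="$ram_gb" 'BEGIN { printf "%.1f", r / 20 }')
echo "diamond blastx ... -c 1 -b ${b}"
```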

Excluding/including taxa with outfmt 102

When using diamond to find the LCA assignment for input queries (outfmt 102), would excluding an irrelevant part of the subject database speed things up? For example, would excluding eukaryotes using --taxon-exclude 2759 speed things up when I'm only really interested in viruses/bacteria? Or is the speed the same, just with excluded taxa not being reported?
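
For reference, a sketch of such a taxon-restricted LCA run (database and file names are placeholders; 2759 is the NCBI taxid for Eukaryota; the `echo` makes this a dry run):

```shell
# LCA assignment (--outfmt 102) with eukaryotes excluded from the search.
echo diamond blastx -d nr_clustered.dmnd -q reads.fasta \
     --outfmt 102 --taxon-exclude 2759 -o lca.tsv
```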

Thanks

bbuchfink (Owner) commented:

> In your opinion, would such an approach lead to noticeable speed improvements?

That depends on the size of your query files; I suggest testing it.

> Is there a description somewhere of the difference between --fast and --faster in terms of speed vs sensitivity, and/or the same for the other presets? I'm thinking in terms of "fast == %ID >= 90%; faster == %ID >= 95%; mid-sensitive == ..." (etc.)

See Extended Data Figure 2 of the DIAMOND v2 paper. That comparison was not done for the more recent modes; you have to test for yourself what will provide sufficient sensitivity for your application.

> Would I be better off (speedier) keeping the number of index chunks at 1 and then increasing the block size as far as I can within memory limits, or would there be benefits to increasing the index chunks and block size? For example, when desiring a RAM usage of ~90GB, I set -c 1 -b 4.5, but I could also do (e.g.) -c 2 -b 8.

I would recommend always using -c1 first.

> When using diamond to find the LCA assignment for input queries (outfmt 102), would excluding an irrelevant part of the subject database speed things up? For example, would excluding eukaryotes using --taxon-exclude 2759 speed things up when I'm only really interested in viruses/bacteria? Or is the speed the same, just with excluded taxa not being reported?

Using this should provide a big speedup, as the excluded sequences will not be loaded or searched against.
