Questions to kmtricks vs HowDeSBT #23
Comments
Hi Svenja,
Yes. We use a modified version of HowDeSBT in which the hash function was changed to match the one kmtricks uses to construct the Bloom filters. This does not impact the query time.
This question also applies to HowDeSBT alone. The right size depends on the acceptable false positive rate and on your available disk space. You can use, for instance, this calculator (with a single hash function), https://hur.st/bloomfilter/?k=1, to see the relation between the number of elements in the filter, the size of the filter, and the false positive rate.
Good to know: in both cases (kmtricks+HowDeSBT or kmindex), it is possible to use the findere trick to decrease false positives.
I hope this helps.
Best,
Hi Svenja,
Some additional info:
To quickly estimate the number of elements in each filter (= number of distinct k-mers), you can use ntCard on each sample (https://github.com/bcgsc/ntCard). Then you can compute the right size according to the maximum number of distinct k-mers.
You can read more about findere here: https://github.com/lrobidou/findere
With a description of your dataset, I will be better able to suggest the right pipeline. You can send me any useful information at teo[dot]lemane[at]proton[dot]me.
Téo
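The sizing advice above (a single hash function, size chosen from the largest distinct k-mer count) can be sketched in a few lines of Python. This is only an illustration using the standard approximation for a one-hash Bloom filter, p ≈ 1 - exp(-n/m); the sample names, the counts, and the 5% target are made-up placeholders, and the function names are not part of kmtricks or HowDeSBT.

```python
import math


def false_positive_rate(n_items: int, n_bits: int) -> float:
    """Approximate FPR of a Bloom filter with ONE hash function."""
    return 1.0 - math.exp(-n_items / n_bits)


def bits_for_target_fpr(n_items: int, target_fpr: float) -> int:
    """Smallest filter size (in bits) whose one-hash FPR is <= target_fpr."""
    if not 0.0 < target_fpr < 1.0:
        raise ValueError("target_fpr must be strictly between 0 and 1")
    # Solve 1 - exp(-n/m) = p for m, then round up.
    return math.ceil(-n_items / math.log1p(-target_fpr))


# Hypothetical distinct k-mer counts per sample (e.g. estimates from ntCard).
distinct_kmers = {"sample_a": 3_000_000, "sample_b": 5_000_000}

# Size the filters for the largest sample so every sample meets the target.
n_max = max(distinct_kmers.values())
bits = bits_for_target_fpr(n_max, target_fpr=0.05)
print(f"bits per filter: {bits}")
print(f"resulting FPR for the largest sample: {false_positive_rate(n_max, bits):.4f}")
```

Smaller samples get a lower false positive rate with the same size, which is why sizing for the maximum is the conservative choice.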
Hi Pierre, hi Téo,
thanks a lot for your quick replies! This answers all of my questions. We were on the right track, then, but wanted to make sure the comparison is fair (without errors on our side when using the tools). We are working on a similar data structure that supports AMQs and want to compare ourselves to you. I will try running
Best,
EDIT: We plan to test the tools on RefSeq (all complete genomes) and part of the 40k RNA-Seq files from the most recent Mantis paper.
I already stumbled over the first issue: In the example one should build the index after
But in version v1.2.1 the subcommand The example is probably outdated.
I didn't set some options when building, I updated the binary. Now it's:
Exactly.
The use of kmtricks to generate indexes was originally intended for collections (hundreds or thousands) of large sequencing samples (like Tara metagenomes).
Thank you for the heads-up!
from when I contacted our IT when trying to test some tools.
Hi there, so the kmtricks pipeline seems to be having trouble. Data is ~100GB (uncompressed), 25'000 files, RefSeq genomes. Build command:
kmtricks info log:
kmtricks backtrace:
Any ideas? The file limit should be fine. We have 1 TB of RAM at our disposal and the maximum resident set size was only ~55 GB (see info), so that's not the problem. Side question:
Hi, I think Téo will confirm, but my guess is that the issue comes from the file limit. About the difference between
Hi, The error seems to be related to the number of open files. However, the first step (superk) should work with your configuration. Recently I got some feedback from users who tried kmtricks to index genomes (tens of thousands of samples), leading to the identification of some issues in such a case:
I definitely have to fix that. Unfortunately, I don't know when I can do it. In the meantime, I see two workarounds:
Sorry for the inconvenience.
Teo
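Since the error discussed above points to the number of open files, it can help to check the limits that actually apply to the process launching the pipeline. The following is a minimal, Unix-only Python sketch using the standard `resource` module; it only reports the limits and makes no assumption about how many files kmtricks opens per thread or partition.

```python
import resource
from typing import Tuple


def open_file_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = open_file_limits()
print(f"soft limit: {describe(soft)}")
print(f"hard limit: {describe(hard)}")
```

The soft limit is the one enforced on the running process; it can usually be raised up to the hard limit (for example with `ulimit -n` in the launching shell) without administrator rights.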
Hi there, thanks for the response. Scaling down the number of threads from 32 to 16 worked for now. What does
mean? (Query length is 250, k-mer size 32) Is the query not searchable at all?
That's strange behavior.
Sorry, my fault I think. I gave kmtricks a FASTQ file instead of FASTA (I noticed that only every 4th query did not have problems). Running it with FASTA again, but it's taking quite some time. Rerunning now with:
Input is a 2.8 GB FASTA file with 10M queries of length 250. Can you estimate the expected runtime?
EDIT: It has been an hour now with the above command; htop shows only a single thread being used and no output has been written yet. RAM usage is constant at ~50G. Index size (full kmtrick_index directory) on disk is ~300G.
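The FASTQ/FASTA mix-up described above can be avoided by converting the query file first. Below is a minimal Python sketch that assumes plain, unwrapped FASTQ (exactly four lines per record); the function name is illustrative, and dedicated tools handle more variants of the format.

```python
from typing import Iterable, Iterator, List


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from 4-line FASTQ records (header, sequence, '+', qualities)."""
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line and not record:
            continue  # tolerate blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, separator, _qualities = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record starting at {header!r}")
            yield ">" + header[1:]  # FASTA headers start with '>' instead of '@'
            yield sequence          # quality line is dropped
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


example = ["@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
           "@read2\n", "TTGACCAA\n", "+\n", "IIIIIIII\n"]
fasta_lines = list(fastq_to_fasta(example))
print("\n".join(fasta_lines))
```

For a real file, the same generator can be fed an open file handle and its output written line by line, so the whole input never has to be held in memory.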
Hi there,
I would like to use kmtricks to use HowDeSBT, as this example suggests that there is a convenient wrapper using the newest index build.
Is the search of kmtricks and of HowDeSBT equivalent? Meaning that if I use kmtricks, are the search timings and results the same as if I used the original HowDeSBT index/query?
Another question: how do I determine the Bloom filter size? In the example, kmtricks pipeline needs this as a command-line argument, but I don't know how to choose an appropriate size for my data set.
Thanks in advance,
Svenja