
Advice on memory usage? #2090

Open
bl24 opened this issue Mar 4, 2025 · 3 comments

Comments


bl24 commented Mar 4, 2025

Hi, I'm working on a single-end amplicon dataset with three libraries. Here are the results of the filtering step:

```
                            reads.in  reads.out
mimhigh_S4_R2_001.fastq.gz   7081626    6963273
mimlib_S5_R2_001.fastq.gz   10966462   10890100
mimlow_S5_R2_001.fastq.gz    6390801    6305400
```
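(For reference, output in this shape is what `dada2::filterAndTrim()` returns; a minimal single-end sketch follows, in which the file names come from the table above but the `truncLen`/`maxEE` values are placeholder assumptions, not the settings actually used here:)

```r
library(dada2)

# Single-end inputs named as in the table above; "filtered/" is a
# hypothetical output directory.
fns   <- c("mimhigh_S4_R2_001.fastq.gz",
           "mimlib_S5_R2_001.fastq.gz",
           "mimlow_S5_R2_001.fastq.gz")
filts <- file.path("filtered", fns)

# Placeholder filtering parameters -- truncLen and maxEE are assumptions.
out <- filterAndTrim(fns, filts,
                     truncLen = 200, maxEE = 2, truncQ = 2,
                     multithread = TRUE)
out  # matrix with reads.in / reads.out columns, as above
```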

For the samples labeled mimhigh and mimlow, I was able to run the error-rate learning and dada steps fine; they only took a few hours using my normal HPC settings (28 CPUs / 4GB per CPU). I've been trying for days now to process mimlib with different memory settings, with no luck. At one point I ran an HPC job with 6 CPUs / 32GB per CPU for 24 hours, and that wasn't enough to finish the analysis.

I know mimlib has a lot more data, but is it normal for it to require so much more computing power?


benjjneb (Owner) commented Mar 5, 2025

I'm not sure exactly what your "CPUs" number means, but DADA2 is not internally parallelized across compute nodes. It will use all the threads available to it (if multithread=TRUE) within a single compute node, but it cannot and does not use multiple compute nodes or physically discrete CPUs.

Perhaps this addresses your question? It also relates to the memory issue: what you want is a single compute node with more memory, perhaps 64GB. Adding more nodes with 4GB apiece won't help.
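In practice that means requesting one node and pointing DADA2's `multithread` argument at the cores on that node; a minimal sketch (the file path is a placeholder, not from this thread):

```r
library(dada2)

filt <- "filtered/mimlib_S5_R2_001.fastq.gz"  # placeholder path

# multithread = TRUE uses every core visible on the current node;
# passing an integer instead caps the thread count.
err <- learnErrors(filt, multithread = TRUE)
dd  <- dada(filt, err = err, multithread = TRUE)
```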


bl24 (Author) commented Mar 5, 2025

I'm submitting the analysis as a SLURM job, so by "CPUs" I'm referring to ntasks, which I think corresponds to the number of cores my job runs on. I've only been submitting to a single compute node, because I saw you mention in a previous issue that DADA2 doesn't compute across nodes.
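(One way to keep the SLURM request and DADA2 in agreement is sketched below. It assumes the job asks for one task with --cpus-per-task, since ntasks counts tasks rather than cores; the allocation numbers and file path are illustrative only:)

```r
# Sketch assuming an allocation along the lines of:
#   sbatch --nodes=1 --ntasks=1 --cpus-per-task=28 --mem=64G
# SLURM exports SLURM_CPUS_PER_TASK inside the job.
library(dada2)

n_threads <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

err <- learnErrors("filtered/mimlib_S5_R2_001.fastq.gz",  # placeholder path
                   multithread = n_threads)
```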


benjjneb (Owner) commented Mar 6, 2025

Try a 64GB allowance then? The memory requirements of the core DADA2 algorithm scale roughly with the square of the number of unique sequences in the data. So your 10M-read library versus the 6M-read ones, if it's the same kind of sample, would be expected to require about (10/6)^2 ≈ 3x more memory.
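(Plugging the filtered read counts from the table above into that quadratic rule of thumb, a rough back-of-the-envelope calculation that assumes unique-sequence counts scale in proportion to read counts:)

```r
# Filtered read counts (reads.out) from the table above.
reads_mimlib <- 10890100
reads_mimlow <-  6305400

# If memory scales with (unique sequences)^2, and unique sequences
# grow roughly with reads, the expected memory ratio is:
(reads_mimlib / reads_mimlow)^2  # ~3x the memory of the smaller library
```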
