Feature request: is it possible to divide the work into stages with appropriate CLI #26
Comments
Hi Steve, … Best regards,
Thanks Robert! So please help me understand the situation here a bit better. Best,
Indeed, for this use case it would be better to first load the overlap file and drop everything from the read file that is not needed, thus decreasing memory consumption. I cannot remember why we implemented it the other way around.
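A minimal sketch of the alternative loading order described above, assuming PAF-format overlaps and FASTA reads; the function names are illustrative and not racon's internals. The idea: collect the sequence names that appear in the overlap file, then keep only those records from the read file.

```python
# Sketch (not racon's actual implementation): load the overlap file first,
# then keep only the reads that participate in at least one overlap.

def overlap_read_names(paf_lines):
    """Collect query and target names (PAF columns 1 and 6) from overlap records."""
    names = set()
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 6:
            names.add(cols[0])   # query sequence name
            names.add(cols[5])   # target sequence name
    return names

def filter_fasta(fasta_lines, keep):
    """Yield only the FASTA records whose header name is in `keep`."""
    emit = False
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].split()[0]
            emit = name in keep
        if emit:
            yield line
```

Streaming the read file this way means sequences never referenced by an overlap are never held in memory.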
I totally understand there could be delicate reasons for not doing so.
As I watch my job progress, another optimization that could be implemented when a GPU is available is to start loading the next batch of sequences while the GPU is doing the polishing (not the alignment), to save some time: loading is usually single-threaded and IO-bound, hence has much higher latency, and most CPU threads sit idle while the GPU is working hard.
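The prefetching idea above can be sketched with a single background loader thread: while batch i is being polished, batch i+1 is parsed concurrently. `load_batch` and `polish` are placeholders for racon's real loading and GPU polishing steps, not its actual API.

```python
# Sketch of overlapping IO-bound loading with compute-bound polishing.
from concurrent.futures import ThreadPoolExecutor

def load_batch(i):
    # Placeholder for single-threaded, IO-bound sequence parsing.
    return [f"seq{i}_{j}" for j in range(3)]

def polish(batch):
    # Placeholder for the GPU polishing step.
    return [s.upper() for s in batch]

def pipeline(num_batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_batch, 0)              # prefetch first batch
        for i in range(num_batches):
            batch = future.result()                      # wait for current batch
            if i + 1 < num_batches:
                future = pool.submit(load_batch, i + 1)  # prefetch next batch
            results.extend(polish(batch))                # compute runs meanwhile
    return results
```

With this shape the loader's latency is hidden behind the polishing of the previous batch, except for the very first load.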
The complete sequence file is loaded at the beginning of the run, and usually should not take that much time. We will explore other options to see if we can reduce memory consumption on bigger genomes. |
So this is what I observe, by running the following on a primate genome:

```
./racon_wrapper \
    -u \
    -t 32 \
    -c 4 \
    --cudaaligner-batches 50 \
    --split 18000000 \
    ${READS} ${OVP} ${DRAFT}
```
Each time the device/GPU is (re-)initialized, the memory used by racon drops to almost zero, and it seems to me that all reads are reloaded again, which takes considerable time. Am I running things in a bad manner?
Unfortunately, it was designed that way. Any reason why you use 18Mbp as the split size?
Ah, I see. I was just playing with the parameters, as I wasn't quite sure exactly what the `--split` parameter does. So to make sure I understand: at the beginning of each batch (the batch size being determined by the `--split` value?), all reads are loaded anew?
Indeed. I usually have set the `--split` …
Thanks Robert!
You will still get a downgrade in speed, because in each batch the reads are parsed anew :/
That I understand, and now that you are aware of it, I expect improvements to come, maybe not soon, but sometime.
Hopefully soon :D |
This is related to #24 .
My hands are tied in the following sense when polishing assemblies of large genomes with deep-coverage data:
Hence I am wondering whether it is possible for racon to expose CLI parameters that permit jobs to be run in stages.
This way, users could configure VMs of different specifications for different stages and resume work.
I know this might be a big request, but it would make our lives easier.
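Purely as an illustration of the requested staged workflow (racon exposes no such interface today; stage names and the checkpoint layout here are hypothetical), each stage could persist its result so a later run, possibly on a differently sized VM, resumes from the last completed stage:

```python
# Hypothetical resumable pipeline: skip stages already recorded in a checkpoint.
import json
import os

STAGES = ["load", "align", "polish"]  # illustrative stage names

def run_stage(name, state):
    # Placeholder for the real work of each stage.
    state[name] = "done"
    return state

def run_pipeline(checkpoint="checkpoint.json"):
    state = {}
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            state = json.load(f)          # resume from an earlier run
    for name in STAGES:
        if state.get(name) == "done":
            continue                      # already finished previously
        state = run_stage(name, state)
        with open(checkpoint, "w") as f:  # persist after every stage
            json.dump(state, f)
    return state
```

A second invocation with the same checkpoint file skips the finished stages, which is the resume behavior the request describes.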
Thanks!
Steve