
Feature request: is it possible to divide the work into stages with appropriate CLI #26

Open
SHuang-Broad opened this issue Mar 7, 2020 · 15 comments
Labels: enhancement (New feature or request)

Comments

@SHuang-Broad
Contributor

This is related to #24 .

My hands are tied in the following sense when polishing assemblies of large genomes with deep-coverage data:

  1. I want to make use of the GPU acceleration.
  2. Using a GPU limits the memory I can allocate to my VM (a cloud-vendor restriction).
  3. racon tends to load all sequences into memory for preprocessing, which can demand a lot of memory (depending on genome size and coverage).

Hence I am wondering whether racon could expose CLI parameters that permit jobs to be run in stages.
That way, users could configure VMs with different specifications for different stages and resume work between them.
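For concreteness, I imagine something along these lines. The --stage and --checkpoint flags below are purely hypothetical and do not exist in racon today; they are only meant to illustrate running each stage on a VM sized for that stage:

    # Hypothetical staged driver; none of these flags exist in racon today.
    import subprocess

    reads, overlaps, draft = "reads.fastq", "overlaps.paf", "draft.fasta"  # placeholder paths

    # Stage 1: CPU-only preprocessing on a high-memory VM, writing its state to disk.
    subprocess.run(["racon_wrapper", "--stage", "preprocess",
                    reads, overlaps, draft, "--checkpoint", "state/"], check=True)

    # Stage 2: GPU polishing on a GPU VM with less memory, resuming from the checkpoint.
    subprocess.run(["racon_wrapper", "--stage", "polish",
                    "--checkpoint", "state/", "--output", "polished.fasta"], check=True)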

I know this might be a big request, but it would make our lives easier.

Thanks!

Steve

@rvaser
Collaborator

rvaser commented Mar 8, 2020

Hi Steve,
this will be a bit of a hassle to implement, because the read file is kept in memory during the whole run. The windows that are created only contain pointers to the sequences, so we do not copy the data unnecessarily. I guess we could store windows containing the actual sequences to disk, and then use a different subroutine to do the multiple sequence alignment. I will have to think about the best way to do this, and I cannot guarantee when this request will be implemented.
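Roughly, the idea would look something like the sketch below; this is a much-simplified illustration with made-up function names, not racon's actual data structures or on-disk format:

    import pickle

    def dump_windows(windows, path):
        # Stage 1: write self-contained window records (with copies of the needed
        # subsequences) to disk, so the full read file can be freed afterwards.
        with open(path, "wb") as out:
            for window in windows:
                pickle.dump(window, out)

    def consensus_stage(path, run_msa):
        # Stage 2: stream the window records back and run the multiple sequence
        # alignment / consensus step on each of them.
        with open(path, "rb") as inp:
            while True:
                try:
                    window = pickle.load(inp)
                except EOFError:
                    return
                yield run_msa(window)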

Best regards,
Robert

rvaser added the enhancement (New feature or request) label on Mar 8, 2020
@SHuang-Broad
Contributor Author

Thanks Robert!

So please help me understand the situation here a bit better.
What I observe is that for each batch/window (the number of batches being determined by --split), racon (the Python wrapper) loads all data into memory (apparently single-threaded) and then processes the reads in that batch. Is that right?
Now, since there is already an overlap file to begin with, would it help to use that overlap file so that loading all data into memory becomes unnecessary, and only the reads that "map" to the current window are loaded?

Best,
Steve

@rvaser
Collaborator

rvaser commented Mar 8, 2020

The --split option splits the assembly into batches and then polishes each of them by invoking Racon sequentially.

Indeed, for this use case it would be better to first load the overlap file and drop everything from the read file that is not needed, which would decrease memory consumption. I cannot remember why we implemented it the other way around.
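In rough Python, the order of operations I mean would be something like the following; it assumes PAF overlaps and FASTA reads, and the function names are just for illustration, not anything in racon:

    def reads_needed(paf_path, batch_contigs):
        # Scan the overlap file first: collect names of reads whose overlaps
        # hit a contig in the current batch.
        needed = set()
        with open(paf_path) as paf:
            for line in paf:
                cols = line.split("\t")
                query_name, target_name = cols[0], cols[5]
                if target_name in batch_contigs:
                    needed.add(query_name)
        return needed

    def load_read_subset(fasta_path, wanted):
        # Stream the read file and keep only the wanted records in memory.
        reads, name = {}, None
        with open(fasta_path) as fa:
            for line in fa:
                if line.startswith(">"):
                    name = line[1:].split()[0]
                    if name in wanted:
                        reads[name] = []
                    else:
                        name = None
                elif name:
                    reads[name].append(line.strip())
        return {n: "".join(parts) for n, parts in reads.items()}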

@SHuang-Broad
Contributor Author

I totally understand there could be delicate reasons for not doing so.

@SHuang-Broad
Contributor Author

As I watch my job progress, another optimization that could be implemented when a GPU is available is to start loading the next batch of sequences while the GPU is doing the polishing (not the alignment). The loading is usually single-threaded and IO-bound, hence has much higher latency, and most CPU threads sit idle while the GPU is working hard, so overlapping the two would save some time.
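What I have in mind is the usual double-buffering / prefetching pattern, sketched below; load_batch and polish_on_gpu are placeholders for racon's read parsing and GPU polishing steps, not real racon functions:

    from concurrent.futures import ThreadPoolExecutor

    def load_batch(batch):
        # Placeholder for the single-threaded, IO-bound read parsing step.
        ...

    def polish_on_gpu(data):
        # Placeholder for the GPU alignment + consensus step.
        ...

    def polish_all(batches):
        # Parse the next batch on a background thread while the GPU polishes the current one.
        if not batches:
            return
        with ThreadPoolExecutor(max_workers=1) as loader:
            pending = loader.submit(load_batch, batches[0])
            for i in range(len(batches)):
                data = pending.result()
                if i + 1 < len(batches):
                    pending = loader.submit(load_batch, batches[i + 1])  # prefetch next batch
                polish_on_gpu(data)  # CPU threads are mostly idle during this call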

@rvaser
Collaborator

rvaser commented Mar 9, 2020

The complete sequence file is loaded at the beginning of the run, and usually this should not take that much time. We will explore other options to see whether we can reduce memory consumption on bigger genomes.

@SHuang-Broad
Contributor Author

So this is what I observe when running

./racon_wrapper \
    -u \
    -t 32 \
    -c 4 \
    --cudaaligner-batches 50 \
    --split 18000000 \
    ${READS} ${OVP} ${DRAFT}

on a primate genome:

Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.382995 s
[racon::Polisher::initialize] loaded sequences 2165.672248 s
[racon::Polisher::initialize] loaded overlaps 46.699042 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.624735 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 29.238104 s
[racon::Polisher::initialize] aligning overlaps [====================] 80.019571 s
[racon::Polisher::initialize] transformed data into windows 4.801252 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 10.350098 s
[racon::CUDAPolisher::polish] generating consensus [====================] 63.771369 s
[racon::CUDAPolisher::polish] polished windows on GPU 73.660493 s
[racon::CUDAPolisher::polish] generated consensus 0.279268 s
[racon::Polisher::] total = 2628.970957 s
Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.859031 s
[racon::Polisher::initialize] loaded sequences 1996.871102 s
[racon::Polisher::initialize] loaded overlaps 45.387511 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.517121 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 26.356452 s
[racon::Polisher::initialize] aligning overlaps [====================] 78.293230 s
[racon::Polisher::initialize] transformed data into windows 4.440666 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 9.928194 s
[racon::CUDAPolisher::polish] generating consensus [====================] 59.604708 s
[racon::CUDAPolisher::polish] polished windows on GPU 69.994798 s
[racon::CUDAPolisher::polish] generated consensus 0.183795 s
[racon::Polisher::] total = 2462.638074 s
# the racon blocks continue

And each time the devices/GPUs are (re-)initialized, the memory used by racon drops to almost zero, and it seems to me that all reads are reloaded, which takes considerable time.

Am I running things in a bad manner?

@SHuang-Broad
Contributor Author

And I've attached the monitoring over the last 12 hours below (the timestamp in the top right is noise).
IO shows periodic read peaks, which suggests the reads are being reloaded.

[Screenshots: VM monitoring graphs over the last 12 hours, captured 2020-03-09]

@rvaser
Collaborator

rvaser commented Mar 9, 2020

Unfortunately, it was designed that way. Is there any reason why you use 18 Mbp as the split size?

@SHuang-Broad
Contributor Author

Ah, I see.

I was just playing with the parameters, as I wasn't quite sure what --split means exactly. What I observed for my data is that the loading uses ~141 GB of memory; then, between the GPU alignment and polishing steps, racon uses the desired number of threads while memory peaks at about 148 GB.

So, to make sure I understand: at the beginning of each batch (the draft is split into batches of roughly --split bases, so the number of batches is the total draft size divided by --split), all reads are loaded, and then the batch is processed on the GPU/CPU. The batch size (i.e. the --split value) directly affects the memory "overhead" on top of holding all reads in memory.
Is that right?

@rvaser
Collaborator

rvaser commented Mar 9, 2020

Indeed. I usually set the --split parameter to a bit more than the length of the longest contig.
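For example, one could pick the value along these lines (the draft file name below is just a placeholder):

    def longest_contig(draft_fasta):
        # Return the length of the longest sequence in the draft assembly.
        longest, current = 0, 0
        with open(draft_fasta) as fa:
            for line in fa:
                if line.startswith(">"):
                    longest, current = max(longest, current), 0
                else:
                    current += len(line.strip())
        return max(longest, current)

    split = int(longest_contig("draft.fasta") * 1.05)  # a bit larger than the longest contig
    print(f"--split {split}")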

@SHuang-Broad
Contributor Author

Thanks Robert!
That tip is super helpful.

@rvaser
Collaborator

rvaser commented Mar 9, 2020

You will still take a hit in speed, because the reads are parsed anew for each batch :/

@SHuang-Broad
Contributor Author

That I understand. And now that you are aware of it, I expect improvements will come, maybe not soon but sometime.

@rvaser
Collaborator

rvaser commented Mar 9, 2020

Hopefully soon :D
