
Feature request: is it possible to divide the work into stages with appropriate CLI #26

Open
SHuang-Broad opened this issue Mar 7, 2020 · 15 comments
Labels: enhancement (New feature or request)

Comments

@SHuang-Broad
Contributor

This is related to #24 .

My hands are tied in the following sense when polishing assemblies of large genomes with deep-coverage data:

  1. I want to make use of the GPU acceleration.
  2. Using a GPU limits the memory I can allocate to my VM (a cloud-vendor restriction).
  3. racon tends to load all sequences into memory for preprocessing, which can demand a lot of memory (depending on genome size and coverage).

Hence I am wondering whether racon could expose CLI parameters that permit jobs to be run in stages.
That way, users could configure VMs with different specifications for different stages and resume work between them.
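For concreteness, I imagine something along these lines. The --stage and --checkpoint flags below are purely hypothetical and do not exist in racon today; they are only meant to illustrate running each stage on a VM sized for that stage:

    # Hypothetical staged driver; none of these flags exist in racon today.
    import subprocess

    reads, overlaps, draft = "reads.fastq", "overlaps.paf", "draft.fasta"  # placeholder paths

    # Stage 1: CPU-only preprocessing on a high-memory VM, writing its state to disk.
    subprocess.run(["racon_wrapper", "--stage", "preprocess",
                    reads, overlaps, draft, "--checkpoint", "state/"], check=True)

    # Stage 2: GPU polishing on a GPU VM with less memory, resuming from the checkpoint.
    subprocess.run(["racon_wrapper", "--stage", "polish",
                    "--checkpoint", "state/", "--output", "polished.fasta"], check=True)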

I know this might be a big request, but it would make our lives easier.

Thanks!

Steve

@rvaser
Collaborator

rvaser commented Mar 8, 2020

Hi Steve,
this will be a bit of a hassle to implement, because the read file is kept in memory during the whole run. The windows that are created only contain pointers to the sequences, so we do not copy the data unnecessarily. I guess we could store windows containing the actual sequences to disk, and then use a different subroutine to do the multiple sequence alignment. I will have to think about the best way to do this, and I cannot guarantee when this request will be implemented.
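Roughly, the idea would look something like the sketch below; this is a much-simplified illustration with made-up function names, not racon's actual data structures or on-disk format:

    import pickle

    def dump_windows(windows, path):
        # Stage 1: write self-contained window records (with copies of the needed
        # subsequences) to disk, so the full read file can be freed afterwards.
        with open(path, "wb") as out:
            for window in windows:
                pickle.dump(window, out)

    def consensus_stage(path, run_msa):
        # Stage 2: stream the window records back and run the multiple sequence
        # alignment / consensus step on each of them.
        with open(path, "rb") as inp:
            while True:
                try:
                    window = pickle.load(inp)
                except EOFError:
                    return
                yield run_msa(window)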

Best regards,
Robert

rvaser added the enhancement (New feature or request) label on Mar 8, 2020
@SHuang-Broad
Contributor Author

Thanks Robert!

So please help me understand the situation here a bit better.
What I observe is that for each batch/window (the number of batches being determined by --split), racon (the Python wrapper) loads all data into memory (apparently single-threaded) and then processes the reads in that batch. Is that right?
Now, since there is already an overlap file to begin with, would it help to use that overlap file so that loading all data into memory becomes unnecessary, and only the reads that "map" to the current window are loaded?

Best,
Steve

@rvaser
Collaborator

rvaser commented Mar 8, 2020

The --split option splits the assembly into batches and then polishes each of them by invoking Racon sequentially.

Indeed, for this use case it would be better to first load the overlap file and drop everything from the read file that is not needed, which would decrease memory consumption. I cannot remember why we implemented it the other way around.
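In rough Python, the order of operations I mean would be something like the following; it assumes PAF overlaps and FASTA reads, and the function names are just for illustration, not anything in racon:

    def reads_needed(paf_path, batch_contigs):
        # Scan the overlap file first: collect names of reads whose overlaps
        # hit a contig in the current batch.
        needed = set()
        with open(paf_path) as paf:
            for line in paf:
                cols = line.split("\t")
                query_name, target_name = cols[0], cols[5]
                if target_name in batch_contigs:
                    needed.add(query_name)
        return needed

    def load_read_subset(fasta_path, wanted):
        # Stream the read file and keep only the wanted records in memory.
        reads, name = {}, None
        with open(fasta_path) as fa:
            for line in fa:
                if line.startswith(">"):
                    name = line[1:].split()[0]
                    if name in wanted:
                        reads[name] = []
                    else:
                        name = None
                elif name:
                    reads[name].append(line.strip())
        return {n: "".join(parts) for n, parts in reads.items()}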

@SHuang-Broad
Contributor Author

I totally understand there could be delicate reasons for not doing so.

@SHuang-Broad
Contributor Author

As I watch my job progress, another optimization that could be implemented when a GPU is available is to start loading the next batch of sequences while the GPU is doing the polishing (not the alignment). The loading is usually single-threaded and IO-bound, hence has much higher latency, and most CPU threads sit idle while the GPU is working hard, so overlapping the two would save some time.
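What I have in mind is the usual double-buffering / prefetching pattern, sketched below; load_batch and polish_on_gpu are placeholders for racon's read parsing and GPU polishing steps, not real racon functions:

    from concurrent.futures import ThreadPoolExecutor

    def load_batch(batch):
        # Placeholder for the single-threaded, IO-bound read parsing step.
        ...

    def polish_on_gpu(data):
        # Placeholder for the GPU alignment + consensus step.
        ...

    def polish_all(batches):
        # Parse the next batch on a background thread while the GPU polishes the current one.
        if not batches:
            return
        with ThreadPoolExecutor(max_workers=1) as loader:
            pending = loader.submit(load_batch, batches[0])
            for i in range(len(batches)):
                data = pending.result()
                if i + 1 < len(batches):
                    pending = loader.submit(load_batch, batches[i + 1])  # prefetch next batch
                polish_on_gpu(data)  # CPU threads are mostly idle during this call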

@rvaser
Collaborator

rvaser commented Mar 9, 2020

The complete sequence file is loaded at the beginning of the run, and usually this should not take that much time. We will explore other options to see whether we can reduce memory consumption on bigger genomes.

@SHuang-Broad
Contributor Author

So this is what I observe when running

./racon_wrapper \
    -u \
    -t 32 \
    -c 4 \
    --cudaaligner-batches 50 \
    --split 18000000 \
    ${READS} ${OVP} ${DRAFT}

on a primate genome:

Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.382995 s
[racon::Polisher::initialize] loaded sequences 2165.672248 s
[racon::Polisher::initialize] loaded overlaps 46.699042 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.624735 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 29.238104 s
[racon::Polisher::initialize] aligning overlaps [====================] 80.019571 s
[racon::Polisher::initialize] transformed data into windows 4.801252 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 10.350098 s
[racon::CUDAPolisher::polish] generating consensus [====================] 63.771369 s
[racon::CUDAPolisher::polish] polished windows on GPU 73.660493 s
[racon::CUDAPolisher::polish] generated consensus 0.279268 s
[racon::Polisher::] total = 2628.970957 s
Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.859031 s
[racon::Polisher::initialize] loaded sequences 1996.871102 s
[racon::Polisher::initialize] loaded overlaps 45.387511 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.517121 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 26.356452 s
[racon::Polisher::initialize] aligning overlaps [====================] 78.293230 s
[racon::Polisher::initialize] transformed data into windows 4.440666 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 9.928194 s
[racon::CUDAPolisher::polish] generating consensus [====================] 59.604708 s
[racon::CUDAPolisher::polish] polished windows on GPU 69.994798 s
[racon::CUDAPolisher::polish] generated consensus 0.183795 s
[racon::Polisher::] total = 2462.638074 s
# the racon blocks continue

And each time the devices/GPUs are (re-)initialized, the memory used by racon drops to almost zero, and it seems to me that all reads are reloaded, which takes considerable time.

Am I running things in a bad manner?

@SHuang-Broad
Contributor Author

And I've attached the monitoring over the last 12 hours below (the timestamp in the top right is noise).
IO shows periodic read peaks, which suggests the reads are being reloaded.

[Screenshots: VM monitoring graphs over the last 12 hours, captured 2020-03-09]

@rvaser
Collaborator

rvaser commented Mar 9, 2020

Unfortunately, it was designed that way. Is there any reason why you use 18 Mbp as the split size?

@SHuang-Broad
Contributor Author

Ah, I see.

I was just playing with the parameters, as I wasn't quite sure what --split means exactly. What I observed for my data is that the loading uses ~141 GB of memory; then, between the GPU alignment and polishing steps, racon uses the desired number of threads while memory peaks at about 148 GB.

So, to make sure I understand: at the beginning of each batch (the draft is split into batches of roughly --split bases, so the number of batches is the total draft size divided by --split), all reads are loaded, and then the batch is processed on the GPU/CPU. The batch size (i.e. the --split value) directly affects the memory "overhead" on top of holding all reads in memory.
Is that right?

@rvaser
Collaborator

rvaser commented Mar 9, 2020

Indeed. I usually set the --split parameter to a bit more than the length of the longest contig.
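For example, one could pick the value along these lines (the draft file name below is just a placeholder):

    def longest_contig(draft_fasta):
        # Return the length of the longest sequence in the draft assembly.
        longest, current = 0, 0
        with open(draft_fasta) as fa:
            for line in fa:
                if line.startswith(">"):
                    longest, current = max(longest, current), 0
                else:
                    current += len(line.strip())
        return max(longest, current)

    split = int(longest_contig("draft.fasta") * 1.05)  # a bit larger than the longest contig
    print(f"--split {split}")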

@SHuang-Broad
Contributor Author

Thanks Robert!
That tip is super helpful.

@rvaser
Collaborator

rvaser commented Mar 9, 2020

You will still take a hit in speed, because the reads are parsed anew for each batch :/

@SHuang-Broad
Contributor Author

That I understand. And now that you are aware of it, I expect improvements will come, maybe not soon but sometime.

@rvaser
Collaborator

rvaser commented Mar 9, 2020

Hopefully soon :D
