cleaned up GPU consensus calling #661
base: master
Conversation
…minimal changes to the original source
Hi, additionally, on a multi-threaded PC you can use GNU parallel to run several processes (regions) at a time, each with a small number of nanopolish threads (-t option). Cheers
At the moment the first GPU is used by default. It is possible to add a command line option so that the user can specify which GPU to execute on - I plan to do this in the near future. I did not benchmark the CPU version in that multi-process setup so far. However, the multi-threaded runs I benchmarked had around 70-90% CPU utilisation, so I guess the multi-process approach will be slightly faster overall. In fact, on modern GPUs it is possible to launch multiple contexts as well, but these approaches are yet to be evaluated. @vellamike could give more insight into these questions.
Hi, I ran some tests myself. Considering that on a GTX 1070 I can launch 7 processes at a time, I was able to make a variant call in 22 minutes compared to 44 on CPU. I think that with the dataset I used, the GPU approach is at least 50% faster.
What was the average coverage of the dataset? Is it a publicly available dataset? If so, I can give it a try on a V100 as well. And how did you launch the multiple processes - through nanopolish's makerange script? If possible, share the commands.
@lfaino You can also choose the GPU by setting the CUDA_VISIBLE_DEVICES environment variable. If you are able to launch multiple processes from the command line, you should set CUDA_VISIBLE_DEVICES differently for each; in your case you have 3 GPUs, so you should set it to a different value for each process.
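As a hedged sketch of that idea, one process per region could be prepared with a round-robin GPU assignment via CUDA_VISIBLE_DEVICES. The nanopolish command line here is illustrative only, not the exact invocation from this PR:

```python
import os

def assign_gpus(regions, num_gpus):
    """Pair each region with a GPU id (round-robin) and build the
    environment each child process would inherit."""
    jobs = []
    for i, region in enumerate(regions):
        gpu = i % num_gpus
        # Each process sees only "its" GPU via CUDA_VISIBLE_DEVICES.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        # Illustrative command only; real flags depend on your nanopolish build.
        cmd = ["nanopolish", "variants", "--consensus", "-w", region, "-t", "4"]
        jobs.append((cmd, env))
    return jobs

jobs = assign_gpus(["chr:0-50000", "chr:50000-100000", "chr:100000-150000"],
                   num_gpus=3)
for cmd, env in jobs:
    print(env["CUDA_VISIBLE_DEVICES"], cmd[4])
```

Each `(cmd, env)` pair could then be handed to `subprocess.Popen(cmd, env=env)` so the three regions run concurrently, one per GPU.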
The GTX 1070 is a relatively old consumer card, so a card like a GV100 should be faster. Multi-GPU support is something we can definitely add in the future.
@hasindu2008 Here is the GPU setup: I made an error in the CPU command because I used 80 threads in total (-P 20 and -t 4), but I have a system with 72 threads in total. I work on a system with a GTX 1070 and 8 GB of RAM. About the dataset: it is not available (but I can share it as long as you keep it to yourself), and it is about 50X data of a bacterial genome about 6 Mb in size. Cheers
@vellamike, just to be clear, is it possible in the future to have something like python -m torch.distributed.launch --nproc_per_node=4 train_flipflop.py, like in the taiyaki script?
Hi Luigi, yes this should be possible, could you create a github issue
specifically for this?
…On Tue, Oct 8, 2019 at 8:34 AM Luigi Faino ***@***.***> wrote:
@vellamike,
my idea was a bit different, but I can try to make a workaround.
I would like to use parallel with makerange.py and send work to one GPU or
another based on which process has finished. In simpler words,
track which GPU finished a job and use it again.
just to be clear, is it possible in the future to have something like
python -m torch.distributed.launch --nproc_per_node=4 train_flipflop.py ...
like in the taiyaki script (https://github.com/nanoporetech/taiyaki)?
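The dispatch-to-whichever-GPU-is-free idea described above can be sketched without nanopolish at all: keep the free GPU ids in a queue and have each worker check one out for the duration of its job. The `run_job` callable here is a stand-in for launching a nanopolish process on a region with the given GPU; everything else is an illustrative scheduler, not code from this PR:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def run_on_gpu_pool(regions, gpu_ids, run_job):
    """Run one job per region, reusing a GPU as soon as its previous job ends."""
    free = queue.Queue()
    for g in gpu_ids:
        free.put(g)

    def worker(region):
        gpu = free.get()      # block until some GPU is free
        try:
            return run_job(region, gpu)
        finally:
            free.put(gpu)     # hand the GPU back to the pool

    # At most one in-flight job per GPU; map preserves input order.
    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return list(pool.map(worker, regions))

# Stand-in job: in practice this would spawn nanopolish with
# CUDA_VISIBLE_DEVICES set to `gpu` and wait for it to finish.
results = run_on_gpu_pool(
    ["region%d" % i for i in range(6)],
    gpu_ids=[0, 1],
    run_job=lambda region, gpu: (region, gpu))
print(results)
```

Because a GPU id is returned to the queue only when its job completes, a fast GPU naturally picks up more regions than a slow one.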
@jts The calls from the CPU (left) and GPU (right) for multi-model (-q dam,dcm) match closely. The calls from the single-model run also match closely.
Experiment details
Reference:
Reads:
Commands:
Then the region Chromosome:200000-202000 was extracted.
The VCF files are here:
@jts On my laptop (12-core Intel i7, 16 GB RAM and an NVIDIA 1050 GPU), multi-model (-q cpg). Scripts used, logs generated and VCF outputs:
This pull request restructures and cleans up pull request #468 (GPU-accelerated nanopolish consensus) by @vellamike so that:
In summary:
New files: src/cuda_kernels/GpuAligner.h and src/cuda_kernels/gpu_call_variants.inl.
Importantly, these files take effect (are compiled) only if make is invoked as make cuda=1; otherwise they are completely ignored.
Changes to existing files:
Makefile - includes cuda.mk if called with make cuda=1
README.md - brief guide on compiling and running for the GPUs
src/nanopolish_call_variants.cpp - a gpu option is added; the default behaviour is still to follow the CPU code path.
The GPU code path (generate_candidate_single_base_edits_gpu) is compiled only if make cuda=1 is specified, and thus has no impact on the existing code.
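A minimal sketch of how such a conditional hook in the Makefile might look. The flag and guard names here are assumptions for illustration, not the PR's actual code:

```makefile
# Only pull in the CUDA rules and enable the GPU code path when the
# user runs `make cuda=1`; a plain `make` never sees these files.
ifeq ($(cuda),1)
    include cuda.mk              # hypothetical: NVCC rules for src/cuda_kernels/
    CXXFLAGS += -DHAVE_CUDA      # hypothetical guard around the *_gpu code path
endif
```

With this pattern, the GPU sources and the generate_candidate_single_base_edits_gpu path are invisible to a default build, which is what keeps the change zero-impact for existing users.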
I ran some benchmarks based on the small chr20 dataset (average coverage of around 30-40X) on three different systems: a server, a laptop, and a Jetson dev board. In all cases, a speedup of ~5X was observed for the GPU implementation compared to its CPU counterpart (run with the maximum number of threads available on the CPU). Importantly, the output from the GPU and CPU runs is very similar, except for a handful of differences probably due to floating-point handling. Further, the implementation was robust: it ran in all these cases without an issue.
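A CPU/GPU concordance check of this kind can be sketched as a comparison of the two call sets by site, ignoring fields that are sensitive to floating-point rounding (such as the QUAL score). This is an illustrative helper with made-up records, not a script from the PR:

```python
def vcf_sites(vcf_text):
    """Extract (CHROM, POS, REF, ALT) for each record, skipping header lines.

    QUAL, FILTER and INFO are deliberately ignored, since small
    floating-point differences between CPU and GPU may change them.
    """
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites

# Hypothetical records: same call, slightly different likelihood score.
cpu = "##fileformat=VCFv4.2\nchr20\t5000100\t.\tA\tG\t50.0\tPASS\t.\n"
gpu = "##fileformat=VCFv4.2\nchr20\t5000100\t.\tA\tG\t49.8\tPASS\t.\n"

only_cpu = vcf_sites(cpu) - vcf_sites(gpu)
only_gpu = vcf_sites(gpu) - vcf_sites(cpu)
print(len(only_cpu), len(only_gpu))
```

Sites appearing in only one of the two sets are the "handful of differences" worth inspecting by hand.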
The average speed ups for three separate 50kb regions (chr20:5000k-5050k, chr20:5050k-5100k, chr20:5100k-5150k) are as below:
The averages in the graph are based on three executions and the raw time values are as below:
I also checked a 1M region (chr20:5M-6M), and the speedup on the server with a Tesla V100 was even better, at ~7X. The raw values are as below:
Given that nanopolish consensus is known to be a very time-consuming process for larger genomes, I think this will benefit nanopolish users.
The compiled binary (with GPU support) is attached if you wish to test on a system with an NVIDIA GPU without the trouble of installing the CUDA toolkit.
nanopolish-gpu-bin.tar.gz
The test script, raw outputs, and logs are also attached; simply extracting and running simple_bench.sh should work.
test.tar.gz