
Basecalling speed: twice the time for basecalling with Dorado v0.8.1 to v0.8.3 #1159

Open
vaillan6 opened this issue Dec 3, 2024 · 7 comments
Labels
performance: Issues related to basecalling performance

Comments

@vaillan6

vaillan6 commented Dec 3, 2024

Issue Report

I would like to report that with the newest versions of Dorado (v0.8.1, v0.8.2, and v0.8.3), basecalling now takes twice the time it took with previous versions.

| Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Run Time (hours) |
|---|---|---|---|---|
| 0.7.2 | 2.59E+06 | 441,667 | 16,974 | 6.52 |
| 0.8.0 | 2.56E+06 | 441,661 | 16,979 | 6.60 |
| 0.8.1 | 1.14E+06 | 441,661 | 16,979 | 14.80 |
| 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 |
| 0.8.3 | 1.14E+06 | 441,661 | 16,979 | 14.78 |

Run environment:

  • Dorado version: 0.7.2, 0.8.0, 0.8.1, 0.8.2, 0.8.3

  • Dorado command:
    example:
    dorado-#.#.#-linux-x64/bin/dorado basecaller --recursive --min-qscore 10 --trim all --kit-name SQK-NBD114-24 dorado-#.#.#-linux-x64/bin/[email protected] ${FILE} > ${FILE}_Dorado_v#.#.#_alltrim_q10.bam 2> ${FILE}_Dorado_v#.#.#_alltrim_q10.stdout

  • Operating system:
    Linux 5.14.0-427.40.1.el9_4.x86_64 x86_64
    NAME="Rocky Linux
    VERSION="9.4 (Blue Onyx)

  • Hardware (CPUs, Memory, GPUs):
    GPU

| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:02:00.0 Off |                  Off |
| 30%   34C    P8             23W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
    SQK-NBD114.24, Native Barcoding Kit 24 V14, ~458,550 reads, FLO-MIN114, 4.39 gigabases, 17.87 kb N50, 48GB directory size
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

example for v0.8.3 run
[2024-11-21 09:28:02.296] [info] Running: "basecaller" "--recursive" "--min-qscore" "10" "--trim" "all" "--kit-name" "SQK-NBD114-24" "/home/brieanne/dorado-0.8.3-linux-x64/bin/[email protected]" "/data/run/brieanne/testing/PLU_AA_01_ONT"
[2024-11-21 09:28:02.458] [info] > Creating basecall pipeline
[2024-11-21 09:28:03.664] [info] cuda:0 using chunk size 12288, batch size 576
[2024-11-21 09:28:06.223] [info] cuda:0 using chunk size 6144, batch size 576
[2024-11-22 00:15:48.634] [info] > Simplex reads basecalled: 441661
[2024-11-22 00:15:48.634] [info] > Simplex reads filtered: 16979
[2024-11-22 00:15:48.634] [info] > Basecalled @ Samples/s: 1.142858e+06
[2024-11-22 00:15:48.634] [info] > 443075 reads demuxed @ classifications/s: 8.318957e+00
[2024-11-22 00:15:48.703] [info] > Finished

@HalfPhoton
Collaborator

Hi @vaillan6,
Thanks for raising this issue with us.

Do you have a record of the batch size used for each run? It would be good to rule out changes from the auto batch size detection becoming more conservative to prevent OOM issues.
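(For reference, Dorado prints the selected batch size when the pipeline starts, so if stderr was captured to a file as in the command above it can be pulled out afterwards. A minimal sketch, with the log filename as a placeholder:)

```bash
# Dorado logs one "cuda:0 using chunk size ..., batch size ..." line per chunk size at startup.
# Replace run.stdout with the actual captured stderr/log file.
grep "batch size" run.stdout
```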

Best regards,
Rich

@HalfPhoton added the performance label on Dec 4, 2024
@HalfPhoton
Collaborator

@vaillan6,
There seems to be an inconsistency in the results. The Run Time and Samples / s for 0.8.1 aren't self-consistent. Is this a typo?

| Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Run Time (hours) |
|---|---|---|---|---|
| 0.8.1 | 2.56E+06 | 441,661 | 16,979 | 14.80 |
| 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 |
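As a rough cross-check (a sketch using only the figures quoted above): total samples processed ≈ samples/s × runtime in seconds, and that product should be roughly constant across versions for the same dataset.

```bash
# 0.8.2 row: 1.14e6 samples/s over 14.82 h -> ~6.1e10 samples
echo '1.14 * 10^6 * 14.82 * 3600' | bc -l
# 0.8.1 row as quoted: 2.56e6 samples/s over 14.80 h -> ~1.4e11 samples, i.e. more
# than double, so either the rate or the runtime in that row looks off.
echo '2.56 * 10^6 * 14.80 * 3600' | bc -l
```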

@johannesgeibel

johannesgeibel commented Dec 4, 2024

We observe the same on our V100 GPUs. GPU usage dropped to 70-80% when switching Dorado from version 0.7.3 to 0.8.3. We therefore ran a test on two small pod5 files to investigate in more detail. The pod5 files were located on the node's local SSDs to rule out network performance issues.
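For reference, GPU utilisation and memory over a run can be logged with a looping nvidia-smi query (standard query flags; the output filename is just an example):

```bash
# Sample GPU utilisation and memory every 5 seconds in the background during the run
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 > gpu_util.csv &
```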

| dorado version | runtime simplex [min] | runtime duplex [min] | samples/s (simplex) | bases per second (duplex) | chunk size (simplex) | batch size (simplex) | chunk size (duplex) | batch size (duplex) |
|---|---|---|---|---|---|---|---|---|
| 0.7.3 | 46.43 | 47.48 | 2.06e+06 | 1.64e+05 | 12288/6144 | 224/224 | 12288/10000 | 256/1664 |
| 0.8.3 | 187.23 | 187.10 | 5.11e+05 | 4.19e+04 | 12288/6144 | 288/768 | 12288/10000 | 288/1664 |
| 0.8.3 limited | 46.42 | 52.40 | 2.06e+06 | 1.49e+05 | 12288/6144 | 224/224 | 12288/10000 | 353/1664 |

As there were differences in the selected batch size, I also fixed the 0.8.3 batch size to the values of 0.7.3 ("0.8.3 limited" above). Now only ~50% of the GPU memory is used (16/32 GB, vs ~31/32 GB with auto batch size), while GPU usage is back at 100% and the speed appears to be the same as 0.7.3. Interestingly, as I can only set a global limit via the -b parameter, my limited 0.8.3 duplex run ends up with a larger batch size than the unlimited one, yet finishes only slightly slower than the simplex run, while the "unlimited" duplex (with the smaller batch size) again only runs at ~75% GPU usage and needed more than 3 hours.
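For reference, the "0.8.3 limited" simplex run corresponds roughly to the following invocation (a sketch; paths and model match the loop below, -b caps the batch size at the value 0.7.3 auto-selected, and the log filename is illustrative):

```bash
# 0.8.3 simplex with the batch size pinned to 224
/bin/time -v .../dorado-0.8.3-linux-x64/bin/dorado basecaller -r -b 224 \
    ../config/models/[email protected] ${TMP_LOCAL}/pod5 \
    > tmp.u.bam 2> tmp_simplex_0.8.3_limited.log
```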

Steps to reproduce the issue:

run different versions of dorado on a Tesla V100

Run environment:

  • Dorado version: 0.7.3 / 0.8.3 --> currently also testing 0.8.0 to 0.8.2
  • Dorado command:
cp -R .../pod5 ${TMP_LOCAL}/
for version in  0.7.3 0.8.3 0.8.0 0.8.1 0.8.2
do
    /bin/time -v .../dorado-${version}-linux-x64/bin/dorado basecaller -r ../config/models/[email protected] ${TMP_LOCAL}/pod5 > tmp.u.bam 2> tmp_simplex_${version}.log
    /bin/time -v .../dorado-${version}-linux-x64/bin/dorado duplex -r ../config/models/[email protected] ${TMP_LOCAL}/pod5 > tmp.u.bam 2> tmp_duplex_${version}.log
done

  • Operating system: CentOS Linux release 7.9.2009 (Core) - Linux version 3.10.0-1160.11.1.el7.x86_64
  • Hardware (CPUs, Memory, GPUs):
    Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz; 376G; Nvidia V100 (driver 530.30.02, cuda 12.1)
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): local ssd
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): two pod5 files, ~2 gigabytes in size, but the effect also appears on larger data sets
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

log for simplex 0.7.3
[2024-12-04 10:25:44.666] [info] Running: "basecaller" "-r" "../config/models/[email protected]" "/local/tmp.860726/pod5"
[2024-12-04 10:25:44.833] [info] > Creating basecall pipeline
[2024-12-04 10:25:59.684] [info] cuda:0 using chunk size 12288, batch size 224
[2024-12-04 10:26:00.325] [info] cuda:0 using chunk size 6144, batch size 224
[2024-12-04 11:12:26.592] [info] > Simplex reads basecalled: 77568
[2024-12-04 11:12:26.593] [info] > Basecalled @ Samples/s: 2.057966e+06
[2024-12-04 11:12:26.618] [info] > Finished

@HalfPhoton
Collaborator

Thanks for the additional data points, @johannesgeibel; this does appear to be an issue with the auto batch size process, as you show.

We'll investigate further.

Best regards,

@vaillan6
Author

vaillan6 commented Dec 4, 2024

> Hi @vaillan6, Thanks for raising this issue with us.
>
> Do you have a record of the batch size used for each run? It would be good to rule out changes from the auto batch size detection becoming more conservative to prevent OOM issues.
>
> Best regards, Rich

@HalfPhoton
Please find the updated table below with batch size added:

| Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Time (hours) | Chunk Size | Batch Size |
|---|---|---|---|---|---|---|
| 0.7.2 | 2.59E+06 | 441,667 | 16,974 | 6.52 | 12288 / 6144 | 512 |
| 0.8.0 | 2.56E+06 | 441,661 | 16,979 | 6.60 | 12288 / 6144 | 480 |
| 0.8.1 | 1.14E+06 | 441,661 | 16,979 | 14.80 | 12288 / 6144 | 576 |
| 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 | 12288 / 6144 | 576 |
| 0.8.3 | 1.14E+06 | 441,661 | 16,979 | 14.78 | 12288 / 6144 | 576 |

@vaillan6
Author

vaillan6 commented Dec 4, 2024

> @vaillan6, There seems to be an inconsistency in the results. The Run Time and Samples/s for 0.8.1 aren't self-consistent. Is this a typo?
>
> | Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Run Time (hours) |
> |---|---|---|---|---|
> | 0.8.1 | 2.56E+06 | 441,661 | 16,979 | 14.80 |
> | 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 |

@HalfPhoton Yes, the table has been updated. My apologies.

@vaillan6
Author

@HalfPhoton Thank you, the most recent update, v0.9.0, has resolved this issue.
@johannesgeibel, in case you have not seen the update yet.
