
Basecalling speed: twice the time for basecalling with Dorado v0.8.1 to v0.8.3 #1159

Open
vaillan6 opened this issue Dec 3, 2024 · 7 comments
Labels
performance: Issues related to basecalling performance

Comments

@vaillan6

vaillan6 commented Dec 3, 2024

Issue Report

I would like to report that with the newest versions of Dorado (v0.8.1, v0.8.2, and v0.8.3), basecalling now takes twice the time it took with previous versions.

| Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Run Time (hours) |
|---|---|---|---|---|
| 0.7.2 | 2.59E+06 | 441,667 | 16,974 | 6.52 |
| 0.8.0 | 2.56E+06 | 441,661 | 16,979 | 6.60 |
| 0.8.1 | 1.14E+06 | 441,661 | 16,979 | 14.80 |
| 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 |
| 0.8.3 | 1.14E+06 | 441,661 | 16,979 | 14.78 |

Run environment:

  • Dorado version: 0.7.2, 0.8.0, 0.8.1, 0.8.2, 0.8.3

  • Dorado command:
    example:
    dorado-#.#.#-linux-x64/bin/dorado basecaller --recursive --min-qscore 10 --trim all --kit-name SQK-NBD114-24 dorado-#.#.#-linux-x64/bin/[email protected] ${FILE} > ${FILE}_Dorado_v#.#.#_alltrim_q10.bam 2> ${FILE}_Dorado_v#.#.#_alltrim_q10.stdout

  • Operating system:
    Linux 5.14.0-427.40.1.el9_4.x86_64 x86_64
    NAME="Rocky Linux
    VERSION="9.4 (Blue Onyx)

  • Hardware (CPUs, Memory, GPUs):
    GPU

| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:02:00.0 Off |                  Off |
| 30%   34C    P8             23W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
    SQK-NBD114.24, Native Barcoding Kit 24 V14, ~458,550 reads, FLO-MIN114, 4.39 gigabases, 17.87 kb N50, 48GB directory size
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

example for v0.8.3 run
[2024-11-21 09:28:02.296] [info] Running: "basecaller" "--recursive" "--min-qscore" "10" "--trim" "all" "--kit-name" "SQK-NBD114-24" "/home/brieanne/dorado-0.8.3-linux-x64/bin/[email protected]" "/data/run/brieanne/testing/PLU_AA_01_ONT"
[2024-11-21 09:28:02.458] [info] > Creating basecall pipeline
[2024-11-21 09:28:03.664] [info] cuda:0 using chunk size 12288, batch size 576
[2024-11-21 09:28:06.223] [info] cuda:0 using chunk size 6144, batch size 576
[2024-11-22 00:15:48.634] [info] > Simplex reads basecalled: 441661
[2024-11-22 00:15:48.634] [info] > Simplex reads filtered: 16979
[2024-11-22 00:15:48.634] [info] > Basecalled @ Samples/s: 1.142858e+06
[2024-11-22 00:15:48.634] [info] > 443075 reads demuxed @ classifications/s: 8.318957e+00
[2024-11-22 00:15:48.703] [info] > Finished

@HalfPhoton
Collaborator

Hi @vaillan6,
Thanks for raising this issue with us.

Do you have a record of the batch size used for each run? It would be good to rule out changes from the auto batch size detection becoming more conservative to prevent OOM issues.
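(For reference, Dorado prints the selected batch size when the pipeline starts, so if stderr was captured to a file as in the command above it can be pulled out afterwards. A minimal sketch, with the log filename as a placeholder:)

```bash
# Dorado logs one "cuda:0 using chunk size ..., batch size ..." line per chunk size at startup.
# Replace run.stdout with the actual captured stderr/log file.
grep "batch size" run.stdout
```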

Best regards,
Rich

@HalfPhoton added the performance label on Dec 4, 2024
@HalfPhoton
Collaborator

@vaillan6,
There seems to be an inconsistency in the results. The Run Time and Samples / s for 0.8.1 aren't self-consistent. Is this a typo?

| Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Run Time (hours) |
|---|---|---|---|---|
| 0.8.1 | 2.56E+06 | 441,661 | 16,979 | 14.80 |
| 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 |
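As a rough cross-check (a sketch using only the figures quoted above): total samples processed ≈ samples/s × runtime in seconds, and that product should be roughly constant across versions for the same dataset.

```bash
# 0.8.2 row: 1.14e6 samples/s over 14.82 h -> ~6.1e10 samples
echo '1.14 * 10^6 * 14.82 * 3600' | bc -l
# 0.8.1 row as quoted: 2.56e6 samples/s over 14.80 h -> ~1.4e11 samples, i.e. more
# than double, so either the rate or the runtime in that row looks off.
echo '2.56 * 10^6 * 14.80 * 3600' | bc -l
```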

@johannesgeibel

johannesgeibel commented Dec 4, 2024

We observe the same on our V100 GPUs. GPU usage dropped to 70-80% when switching Dorado from version 0.7.3 to 0.8.3. We therefore ran a test on two small pod5 files to investigate in more detail. The pod5 files were located on the node's local SSDs to rule out network performance issues.
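For reference, GPU utilisation and memory over a run can be logged with a looping nvidia-smi query (standard query flags; the output filename is just an example):

```bash
# Sample GPU utilisation and memory every 5 seconds in the background during the run
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 > gpu_util.csv &
```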

| dorado version | runtime simplex [min] | runtime duplex [min] | samples/s (simplex) | bases per second (duplex) | chunk size (simplex) | batch size (simplex) | chunk size (duplex) | batch size (duplex) |
|---|---|---|---|---|---|---|---|---|
| 0.7.3 | 46.43 | 47.48 | 2.06e+06 | 1.64e+05 | 12288/6144 | 224/224 | 12288/10000 | 256/1664 |
| 0.8.3 | 187.23 | 187.10 | 5.11e+05 | 4.19e+04 | 12288/6144 | 288/768 | 12288/10000 | 288/1664 |
| 0.8.3 limited | 46.42 | 52.40 | 2.06e+06 | 1.49e+05 | 12288/6144 | 224/224 | 12288/10000 | 353/1664 |

As there were differences in the selected batch size, I also fixed the 0.8.3 batch size to the values of 0.7.3 ("0.8.3 limited" above). Now only ~50% of the GPU memory is used (16/32 GB, vs ~31/32 GB with auto batch size), while GPU usage is back at 100% and the speed appears to be the same as 0.7.3. Interestingly, as I can only set a global limit via the -b parameter, my limited 0.8.3 duplex run ends up with a larger batch size than the unlimited one, yet finishes only slightly slower than the simplex run, while the "unlimited" duplex (with the smaller batch size) again only runs at ~75% GPU usage and needed more than 3 hours.
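For reference, the "0.8.3 limited" simplex run corresponds roughly to the following invocation (a sketch; paths and model match the loop below, -b caps the batch size at the value 0.7.3 auto-selected, and the log filename is illustrative):

```bash
# 0.8.3 simplex with the batch size pinned to 224
/bin/time -v .../dorado-0.8.3-linux-x64/bin/dorado basecaller -r -b 224 \
    ../config/models/[email protected] ${TMP_LOCAL}/pod5 \
    > tmp.u.bam 2> tmp_simplex_0.8.3_limited.log
```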

Steps to reproduce the issue:

run different versions of dorado on a Tesla V100

Run environment:

  • Dorado version: 0.7.3 / 0.8.3 --> currently also testing 0.8.0 to 0.8.2
  • Dorado command:
cp -R .../pod5 ${TMP_LOCAL}/
for version in  0.7.3 0.8.3 0.8.0 0.8.1 0.8.2
do
    /bin/time -v .../dorado-${version}-linux-x64/bin/dorado basecaller -r ../config/models/[email protected] ${TMP_LOCAL}/pod5 > tmp.u.bam 2> tmp_simplex_${version}.log
    /bin/time -v .../dorado-${version}-linux-x64/bin/dorado duplex -r ../config/models/[email protected] ${TMP_LOCAL}/pod5 > tmp.u.bam 2> tmp_duplex_${version}.log
done

  • Operating system: CentOS Linux release 7.9.2009 (Core) - Linux version 3.10.0-1160.11.1.el7.x86_64
  • Hardware (CPUs, Memory, GPUs):
    Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz; 376G; Nvidia V100 (driver 530.30.02, cuda 12.1)
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): local ssd
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): two pod5 files, ~2 gigabytes in size, but the effect also appears on larger data sets
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

log for simplex 0.7.3
[2024-12-04 10:25:44.666] [info] Running: "basecaller" "-r" "../config/models/[email protected]" "/local/tmp.860726/pod5"
[2024-12-04 10:25:44.833] [info] > Creating basecall pipeline
[2024-12-04 10:25:59.684] [info] cuda:0 using chunk size 12288, batch size 224
[2024-12-04 10:26:00.325] [info] cuda:0 using chunk size 6144, batch size 224
[2024-12-04 11:12:26.592] [info] > Simplex reads basecalled: 77568
[2024-12-04 11:12:26.593] [info] > Basecalled @ Samples/s: 2.057966e+06
[2024-12-04 11:12:26.618] [info] > Finished

@HalfPhoton
Collaborator

Thanks for the additional data points, @johannesgeibel; this does appear to be an issue with the auto batch size process, as you show.

We'll investigate further.

Best regards,

@vaillan6
Author

vaillan6 commented Dec 4, 2024

> Hi @vaillan6, Thanks for raising this issue with us.
>
> Do you have a record of the batch size used for each run? It would be good to rule out changes from the auto batch size detection becoming more conservative to prevent OOM issues.
>
> Best regards, Rich

@HalfPhoton
Please find the updated table below with batch size added:

| Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Time (hours) | Chunk Size | Batch Size |
|---|---|---|---|---|---|---|
| 0.7.2 | 2.59E+06 | 441,667 | 16,974 | 6.52 | 12288 / 6144 | 512 |
| 0.8.0 | 2.56E+06 | 441,661 | 16,979 | 6.60 | 12288 / 6144 | 480 |
| 0.8.1 | 1.14E+06 | 441,661 | 16,979 | 14.80 | 12288 / 6144 | 576 |
| 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 | 12288 / 6144 | 576 |
| 0.8.3 | 1.14E+06 | 441,661 | 16,979 | 14.78 | 12288 / 6144 | 576 |

@vaillan6
Author

vaillan6 commented Dec 4, 2024

> @vaillan6, There seems to be an inconsistency in the results. The Run Time and Samples/s for 0.8.1 aren't self-consistent. Is this a typo?
>
> | Dorado Version | Basecalled @ Samples/s | Simplex reads basecalled | Simplex reads filtered | Run Time (hours) |
> |---|---|---|---|---|
> | 0.8.1 | 2.56E+06 | 441,661 | 16,979 | 14.80 |
> | 0.8.2 | 1.14E+06 | 441,661 | 16,979 | 14.82 |

@HalfPhoton Yes, the table has been updated. My apologies.

@vaillan6
Author

@HalfPhoton Thank you, the most recent update, v0.9.0, has resolved this issue.
@johannesgeibel, in case you have not seen the update yet.
