Dorado duplex having different pairing rates on the same dataset #372

hasindu2008 · 2023-09-14T06:27:26Z

For the same dataset, I get slightly different output from Dorado duplex depending on whether the input is multiple POD5 files or a single merged POD5 file. I would expect the output to be deterministic irrespective of the number of files for the same dataset.

Merging using:

 pod5 merge duplex_test_multiple_pod5/ -o duplex_test_single_pod5/a.pod5

Single POD5:

/install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_single_pod5/ -x cuda:0 > duplex_2.bam
[2023-09-14 15:50:00.340] [info] > No duplex pairs file provided, pairing will be performed automatically

[2023-09-14 15:50:08.632] [info] > Starting Stereo Duplex pipeline
[2023-09-14 15:50:08.647] [info] > Reading read channel info
[2023-09-14 15:50:08.780] [info] > Processed read channel info

[2023-09-14 15:57:08.942] [info] > Simplex reads basecalled: 178104
[2023-09-14 15:57:08.942] [info] > Simplex reads filtered: 5
[2023-09-14 15:57:08.942] [info] > Duplex reads basecalled: 36186
[2023-09-14 15:57:08.942] [info] > Duplex reads filtered: 8288
[2023-09-14 15:57:08.942] [info] > Duplex rate: 46.608578%
[2023-09-14 15:57:08.942] [info] > Basecalled @ Bases/s: 1.741040e+06

Multiple POD5:

/install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_multiple_pod5/ -x cuda:0 > duplex_1.bam
[2023-09-14 15:58:00.201] [info] > No duplex pairs file provided, pairing will be performed automatically
[2023-09-14 15:58:08.543] [info] > Starting Stereo Duplex pipeline
[2023-09-14 15:58:08.558] [info] > Reading read channel info
[2023-09-14 15:58:08.701] [info] > Processed read channel info
[2023-09-14 16:05:16.131] [info] > Simplex reads basecalled: 178104
[2023-09-14 16:05:16.131] [info] > Simplex reads filtered: 5
[2023-09-14 16:05:16.131] [info] > Duplex reads basecalled: 34728
[2023-09-14 16:05:16.131] [info] > Duplex reads filtered: 8152
[2023-09-14 16:05:16.131] [info] > Duplex rate: 43.672375%
[2023-09-14 16:05:16.131] [info] > Basecalled @ Bases/s: 1.691432e+06
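To make the comparison between the two runs easier to read, here is a minimal sketch that parses the summary lines from the logs above and prints the per-metric difference. The helper name `parse_summary` and the regular expression are illustrative assumptions based only on the log format shown in this issue; they are not part of Dorado.

```python
import re

# Assumed log format, taken from the output pasted in this issue, e.g.
# "[2023-09-14 15:57:08.942] [info] > Duplex rate: 46.608578%"
SUMMARY_LINE = re.compile(r"\[info\] > ([A-Za-z @/]+): ([0-9.eE+]+)%?\s*$")


def parse_summary(log_text):
    """Return {metric name: float} for every summary line found in log_text."""
    stats = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.search(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


# Summary lines copied from the single-POD5 run above.
SINGLE_LOG = """\
[2023-09-14 15:57:08.942] [info] > Simplex reads basecalled: 178104
[2023-09-14 15:57:08.942] [info] > Simplex reads filtered: 5
[2023-09-14 15:57:08.942] [info] > Duplex reads basecalled: 36186
[2023-09-14 15:57:08.942] [info] > Duplex reads filtered: 8288
[2023-09-14 15:57:08.942] [info] > Duplex rate: 46.608578%
"""

# Summary lines copied from the multiple-POD5 run above.
MULTIPLE_LOG = """\
[2023-09-14 16:05:16.131] [info] > Simplex reads basecalled: 178104
[2023-09-14 16:05:16.131] [info] > Simplex reads filtered: 5
[2023-09-14 16:05:16.131] [info] > Duplex reads basecalled: 34728
[2023-09-14 16:05:16.131] [info] > Duplex reads filtered: 8152
[2023-09-14 16:05:16.131] [info] > Duplex rate: 43.672375%
"""

single = parse_summary(SINGLE_LOG)
multiple = parse_summary(MULTIPLE_LOG)

# Print each metric side by side with the signed difference (single - multiple).
for key in single:
    diff = single[key] - multiple[key]
    print(f"{key}: single={single[key]:g} multiple={multiple[key]:g} diff={diff:+g}")
```

The simplex counts are identical in both runs; only the duplex counts and the duplex rate differ.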


vellamike · 2023-09-14T09:35:20Z

Thanks for reporting this @hasindu2008 - we are looking into it.

vellamike · 2023-09-14T10:31:49Z

Hi @hasindu2008, I'd like some clarification. In the first section, where you run /install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_single_pod5/ -x cuda:0 > duplex_2.bam, do you have a single POD5 file? And in the second section, where you run /install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_multiple_pod5/ -x cuda:0 > duplex_1.bam, do you have many POD5 files?

hasindu2008 · 2023-09-14T10:33:56Z

Yes, that is correct.
The many POD5 files under duplex_test_multiple_pod5/ were merged with the pod5 merge command into a single file called a.pod5 inside duplex_test_single_pod5/.

vellamike · 2023-09-20T10:47:44Z

Hi @hasindu2008, could you run both of these a few times and report the yields? Duplex is not fully deterministic because we make a determinism-performance tradeoff. I'd like to understand whether the higher yield you see with a single POD5 file is consistent or within the noise.

Also, could you give me some information about what kind of data this is (read lengths, amplicons, etc.)?

PedalheadPHX · 2023-09-24T01:37:53Z

Seeing this now makes me wonder whether grouping by channel will have a different effect. I'd hope it would mirror a single POD5 file. I'll try to post if we see any variability.

hasindu2008 · 2023-09-25T06:22:48Z

@vellamike

Yeah, I get different rates on different runs:

[2023-09-25 14:43:18.434] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:43:18.434] [info] > Duplex reads basecalled: 2201
[2023-09-25 14:43:18.434] [info] > Duplex reads filtered: 506
[2023-09-25 14:43:18.434] [info] > Duplex rate: 51.30181%
--
[2023-09-25 14:44:17.218] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:44:17.218] [info] > Duplex reads basecalled: 2202
[2023-09-25 14:44:17.218] [info] > Duplex reads filtered: 509
[2023-09-25 14:44:17.218] [info] > Duplex rate: 51.295177%
--
[2023-09-25 14:45:15.401] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:45:15.401] [info] > Duplex reads basecalled: 2187
[2023-09-25 14:45:15.401] [info] > Duplex reads filtered: 506
[2023-09-25 14:45:15.401] [info] > Duplex rate: 50.79104%
--
[2023-09-25 14:46:14.254] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:46:14.254] [info] > Duplex reads basecalled: 2193
[2023-09-25 14:46:14.254] [info] > Duplex reads filtered: 506
[2023-09-25 14:46:14.254] [info] > Duplex rate: 51.1253%
--
[2023-09-25 14:47:12.884] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:47:12.884] [info] > Duplex reads basecalled: 2198
[2023-09-25 14:47:12.884] [info] > Duplex reads filtered: 509
[2023-09-25 14:47:12.884] [info] > Duplex rate: 51.209366%
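As a rough way to quantify the run-to-run noise, the sketch below summarises the five runs above using Python's standard `statistics` module. The helper name `spread` is a hypothetical one chosen for illustration. Note that these five runs have 9572 simplex reads, while the runs in the opening post have 178104, so the spread here is only an indication and is not directly comparable to the roughly 2.9 percentage point gap reported there.

```python
import statistics

# Duplex rates (%) copied from the five repeated runs above.
rates = [51.30181, 51.295177, 50.79104, 51.1253, 51.209366]

# "Duplex reads basecalled" from the same five runs.
duplex_reads = [2201, 2202, 2187, 2193, 2198]


def spread(values):
    """Return the mean, sample standard deviation, and max-min range of values."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "range": max(values) - min(values),
    }


for name, values in (("duplex rate (%)", rates), ("duplex reads", duplex_reads)):
    s = spread(values)
    print(f"{name}: mean={s['mean']:.3f} stdev={s['stdev']:.3f} range={s['range']:.3f}")
```

Repeating the same summary for several single-POD5 and several multiple-POD5 runs of the same dataset would show whether the gap between the two layouts exceeds this run-to-run range.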

vellamike · 2023-09-25T10:36:55Z

Hi @hasindu2008, we are working on making duplex basecalling fully deterministic, but I'm still surprised to see a 3% difference between the single and multiple POD5s. Do you find this difference to be systematic?

hasindu2008 · 2023-09-25T10:49:49Z

I have not yet found the time to do a thorough check of whether it is systematic. From the limited tests, things seem a bit stochastic.

vellamike self-assigned this Sep 14, 2023

vellamike added the bug label Sep 14, 2023
