Dorado duplex having different pairing rates on the same dataset #372

Open
hasindu2008 opened this issue Sep 14, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@hasindu2008

For the same dataset, Dorado duplex gives slightly different output depending on whether the input is multiple POD5 files or a single merged POD5 file. I would expect the output to be deterministic, irrespective of how many files the dataset is split into.

Merging using:

 pod5 merge duplex_test_multiple_pod5/ -o duplex_test_single_pod5/a.pod5

Single POD5:

/install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_single_pod5/ -x cuda:0 > duplex_2.bam
[2023-09-14 15:50:00.340] [info] > No duplex pairs file provided, pairing will be performed automatically

[2023-09-14 15:50:08.632] [info] > Starting Stereo Duplex pipeline
[2023-09-14 15:50:08.647] [info] > Reading read channel info
[2023-09-14 15:50:08.780] [info] > Processed read channel info

[2023-09-14 15:57:08.942] [info] > Simplex reads basecalled: 178104
[2023-09-14 15:57:08.942] [info] > Simplex reads filtered: 5
[2023-09-14 15:57:08.942] [info] > Duplex reads basecalled: 36186
[2023-09-14 15:57:08.942] [info] > Duplex reads filtered: 8288
[2023-09-14 15:57:08.942] [info] > Duplex rate: 46.608578%
[2023-09-14 15:57:08.942] [info] > Basecalled @ Bases/s: 1.741040e+06

Multiple POD5 files:

/install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_multiple_pod5/ -x cuda:0 > duplex_1.bam
[2023-09-14 15:58:00.201] [info] > No duplex pairs file provided, pairing will be performed automatically
[2023-09-14 15:58:08.543] [info] > Starting Stereo Duplex pipeline
[2023-09-14 15:58:08.558] [info] > Reading read channel info
[2023-09-14 15:58:08.701] [info] > Processed read channel info
[2023-09-14 16:05:16.131] [info] > Simplex reads basecalled: 178104
[2023-09-14 16:05:16.131] [info] > Simplex reads filtered: 5
[2023-09-14 16:05:16.131] [info] > Duplex reads basecalled: 34728
[2023-09-14 16:05:16.131] [info] > Duplex reads filtered: 8152
[2023-09-14 16:05:16.131] [info] > Duplex rate: 43.672375%
[2023-09-14 16:05:16.131] [info] > Basecalled @ Bases/s: 1.691432e+06
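The two summaries above can be compared programmatically. The following is a minimal sketch, assuming the summary lines keep the `[info] > label: value` shape shown in the logs; `parse_summary` and the variable names are hypothetical helpers, not part of Dorado.

```python
import re

# Matches summary lines such as
# "[2023-09-14 15:57:08.942] [info] > Duplex reads basecalled: 36186".
# The label/value layout is assumed from the logs quoted in this issue.
SUMMARY_RE = re.compile(r"\[info\] > ([A-Za-z@/ ]+): ([0-9][0-9.eE+\-]*)%?\s*$")


def parse_summary(log_text):
    """Return {label: float} for every 'label: number' summary line."""
    stats = {}
    for line in log_text.splitlines():
        match = SUMMARY_RE.search(line)
        if match:
            stats[match.group(1).strip()] = float(match.group(2))
    return stats


# Summary lines copied from the two runs reported above.
SINGLE_LOG = """\
[2023-09-14 15:57:08.942] [info] > Simplex reads basecalled: 178104
[2023-09-14 15:57:08.942] [info] > Duplex reads basecalled: 36186
[2023-09-14 15:57:08.942] [info] > Duplex reads filtered: 8288
[2023-09-14 15:57:08.942] [info] > Duplex rate: 46.608578%
"""
MULTI_LOG = """\
[2023-09-14 16:05:16.131] [info] > Simplex reads basecalled: 178104
[2023-09-14 16:05:16.131] [info] > Duplex reads basecalled: 34728
[2023-09-14 16:05:16.131] [info] > Duplex reads filtered: 8152
[2023-09-14 16:05:16.131] [info] > Duplex rate: 43.672375%
"""

single = parse_summary(SINGLE_LOG)
multi = parse_summary(MULTI_LOG)

# Difference in duplex read count and in duplex rate (percentage points).
diff_reads = single["Duplex reads basecalled"] - multi["Duplex reads basecalled"]
diff_rate = single["Duplex rate"] - multi["Duplex rate"]

print(f"duplex reads: single - multiple = {diff_reads:.0f}")
print(f"duplex rate:  single - multiple = {diff_rate:.3f} percentage points")
```

The simplex counts are identical in both runs, so the gap comes entirely from the duplex pairing and basecalling stage.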
vellamike self-assigned this Sep 14, 2023
vellamike added the bug (Something isn't working) label Sep 14, 2023
@vellamike
Collaborator

Thanks for reporting this @hasindu2008 - we are looking into it.

@vellamike
Collaborator

vellamike commented Sep 14, 2023

Hi @hasindu2008, I'd like some clarification. In the first section, where you run /install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_single_pod5/ -x cuda:0 > duplex_2.bam, do you have a single POD5 file? And in the second section, where you run /install/dorado-0.3.4/bin/dorado duplex /install/dorado-0.3.4/models/[email protected] duplex_test_multiple_pod5/ -x cuda:0 > duplex_1.bam, do you have many POD5 files?

@hasindu2008
Author

Yes, that is correct.
The many POD5 files under duplex_test_multiple_pod5/ were merged with the pod5 merge command into a single file, a.pod5, inside duplex_test_single_pod5/.

@vellamike
Collaborator

Hi @hasindu2008, could you run both of these a few times and report the yields? Duplex is not fully deterministic because we make a determinism-performance tradeoff. I'd like to understand whether the higher yield you see with one POD5 is consistent or within the noise.

Also, could you give me some information about what kind of data this is (read lengths, amplicons, etc.)?
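Repeating both runs can be scripted. The sketch below is one possible way to do it, assuming the command line used earlier in this issue (`dorado duplex <model> <pod5_dir> -x cuda:0`) and assuming Dorado writes its log to stderr while the BAM goes to stdout, as the redirection in the original commands suggests. `build_cmd` and `run_repeats` are hypothetical helper names, and the model path is left as a placeholder.

```python
import subprocess
from pathlib import Path


def build_cmd(dorado, model, pod5_dir, device="cuda:0"):
    """Build the duplex command line as used in this issue."""
    return [dorado, "duplex", model, pod5_dir, "-x", device]


def run_repeats(dorado, model, pod5_dir, out_prefix, repeats=3):
    """Run the same duplex command several times and collect the logs.

    Each run writes its BAM to '<out_prefix>_<i>.bam'. The log text is
    captured from stderr (an assumption, see the lead-in) and returned
    as a list with one string per run.
    """
    logs = []
    for i in range(repeats):
        bam_path = Path(f"{out_prefix}_{i}.bam")
        with bam_path.open("wb") as bam:
            proc = subprocess.run(
                build_cmd(dorado, model, pod5_dir),
                stdout=bam,
                stderr=subprocess.PIPE,
                text=True,
                check=True,
            )
        logs.append(proc.stderr)
    return logs


# Example usage (requires a Dorado install and a GPU; MODEL is a placeholder):
# single_logs = run_repeats("dorado", "MODEL", "duplex_test_single_pod5/", "single")
# multi_logs = run_repeats("dorado", "MODEL", "duplex_test_multiple_pod5/", "multi")
```

Keeping the runs for both inputs interleaved on the same machine and GPU would make the yields easier to compare.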

@PedalheadPHX

Seeing this now makes me wonder whether grouping by channel will have a different effect. I'd hope it would mirror a single POD5 file. I'll try to post if we see any variability.

@hasindu2008
Author

@vellamike

Yes, the rates differ from run to run:

[2023-09-25 14:43:18.434] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:43:18.434] [info] > Duplex reads basecalled: 2201
[2023-09-25 14:43:18.434] [info] > Duplex reads filtered: 506
[2023-09-25 14:43:18.434] [info] > Duplex rate: 51.30181%
--
[2023-09-25 14:44:17.218] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:44:17.218] [info] > Duplex reads basecalled: 2202
[2023-09-25 14:44:17.218] [info] > Duplex reads filtered: 509
[2023-09-25 14:44:17.218] [info] > Duplex rate: 51.295177%
--
[2023-09-25 14:45:15.401] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:45:15.401] [info] > Duplex reads basecalled: 2187
[2023-09-25 14:45:15.401] [info] > Duplex reads filtered: 506
[2023-09-25 14:45:15.401] [info] > Duplex rate: 50.79104%
--
[2023-09-25 14:46:14.254] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:46:14.254] [info] > Duplex reads basecalled: 2193
[2023-09-25 14:46:14.254] [info] > Duplex reads filtered: 506
[2023-09-25 14:46:14.254] [info] > Duplex rate: 51.1253%
--
[2023-09-25 14:47:12.884] [info] > Simplex reads basecalled: 9572
[2023-09-25 14:47:12.884] [info] > Duplex reads basecalled: 2198
[2023-09-25 14:47:12.884] [info] > Duplex reads filtered: 509
[2023-09-25 14:47:12.884] [info] > Duplex rate: 51.209366%
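To put the run-to-run variation next to the single-versus-multiple gap, the counts can be summarised as below. This is a rough sketch using only numbers quoted in this thread. The five repeats come from a smaller run (9572 simplex reads) than the original comparison (178104 simplex reads), so only relative figures are compared, and that comparison is an assumption of this sketch rather than a rigorous test.

```python
from statistics import mean, stdev

# "Duplex reads basecalled" from the five repeated runs above.
duplex_reads = [2201, 2202, 2187, 2193, 2198]

avg = mean(duplex_reads)
spread = max(duplex_reads) - min(duplex_reads)
rel_spread_pct = 100 * spread / avg

# "Duplex reads basecalled" from the original single vs multiple POD5 runs.
single_file, multiple_files = 36186, 34728
gap = single_file - multiple_files
rel_gap_pct = 100 * gap / mean([single_file, multiple_files])

print(f"repeat runs: mean={avg:.1f}, sample stdev={stdev(duplex_reads):.2f}, "
      f"max-min={spread} ({rel_spread_pct:.2f}% of mean)")
print(f"single vs multiple: gap={gap} ({rel_gap_pct:.2f}% of mean)")
```

On these figures the repeat-run spread is well under 1% of the mean, while the single-versus-multiple gap is around 4%, which is why repeating both inputs several times would help show whether that gap is systematic.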

@vellamike
Collaborator

Hi @hasindu2008, we are working on making duplex basecalling fully deterministic, but I'm still surprised to see a 3% difference between the single and multiple POD5s. Do you find this difference to be systematic?

@hasindu2008
Author

I have not found the time to do a thorough comparison to see whether it is systematic. From the limited tests, it feels as if things are a bit stochastic.

3 participants