Demultiplexing Issues with Dorado v0.7.3 vs MinKNOW #1179

Open
prekijpatel opened this issue Dec 17, 2024 · 4 comments

@prekijpatel

prekijpatel commented Dec 17, 2024

Issue Report

Please describe the issue:

I recently encountered significant differences in basecalling and demultiplexing results between Dorado v0.7.3 and MinKNOW.

When using Dorado for basecalling, most reads were categorized as "unclassified." For example, the size of the FASTQ for barcode 16 was 125 MB.

However, when I basecalled the same dataset using MinKNOW 24.11, the size of the FASTQ for barcode 16 was much larger (959 MB). After downstream assembly, I found that the MinKNOW output for this barcode was heavily contaminated with reads that appeared to belong to other barcodes.

I am trying to determine whether:

  1. Dorado is overly strict during demultiplexing, or
  2. MinKNOW is too lenient, leading to contamination and misclassified reads.

Additionally, I am unsure whether the contamination originates from mis-demultiplexing or inherent issues with the sample.
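One way to separate the two explanations is to take the reads MinKNOW assigned to barcode 16 and look up where Dorado placed the same read IDs. The sketch below is only an illustration: `read_ids` and `crosstab` are hypothetical helper names, and it assumes uncompressed, 4-line-per-record FASTQ files and identical read IDs in both outputs (which may not hold for reads that were split during basecalling).

```shell
#!/usr/bin/env bash
# Sketch: where do the reads MinKNOW assigned to one barcode end up in Dorado's demux output?
# Assumes uncompressed FASTQ with exactly 4 lines per record.
set -eu

# Print the sorted, unique read IDs of a FASTQ file
# (first field of each header line, without the leading "@").
read_ids() {
  awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1" | sort -u
}

# Usage: crosstab MINKNOW_FASTQ DORADO_FASTQ...
# For each Dorado FASTQ, print its name and how many of the MinKNOW read IDs it contains.
crosstab() {
  local minknow_fastq="$1"; shift
  local ids f n
  ids=$(mktemp)
  read_ids "$minknow_fastq" > "$ids"
  for f in "$@"; do
    # comm -12 keeps only the lines common to both sorted inputs
    n=$(read_ids "$f" | comm -12 "$ids" - | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$(basename "$f")" "$n"
  done
  rm -f "$ids"
}
```

Called as `crosstab minknow_barcode16.fastq dorado_out/*.fastq` (both paths are placeholders), it prints one line per Dorado output file. If most of the MinKNOW barcode 16 reads land in Dorado's unclassified file, that points at demultiplexing strictness; if they land in other barcodes' files, the two tools disagree about the barcode itself.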

Steps to reproduce the issue:

  1. Basecall using Dorado v0.7.3 with the following command:
    dorado basecaller [email protected] ../../combined_pod5/ -v -x cuda:all -b 0 -c 33000 > basecalled_13092024.bam
  2. Basecall using MinKNOW's standard settings for SQK-RBK114-24 with the sup model.
  3. Demultiplex the reads from the BAM file:
    dorado demux -o ./ --kit-name SQK-RBK114-24 -t 16 --emit-fastq ../basecalled_13092024.bam
  4. Compare the size of the FASTQ files with those generated by MinKNOW for the same dataset.
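File size is only a rough proxy for the comparison in the last step, since header length and compression also affect it. A minimal sketch that reports read and base counts instead is shown here; `fastq_stats` is a hypothetical helper name, and it assumes standard 4-line FASTQ records.

```shell
#!/usr/bin/env bash
# Sketch: report read and base counts per FASTQ file instead of comparing file sizes.
# Assumes exactly 4 lines per record; files ending in .gz are decompressed on the fly.
set -eu

# Usage: fastq_stats FILE...
# Prints one tab-separated line per file: name, number of reads, number of bases.
fastq_stats() {
  local f
  for f in "$@"; do
    printf '%s\t' "$f"
    case "$f" in
      *.gz) gzip -dc "$f" ;;
      *)    cat "$f" ;;
    esac | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                END { printf "%d\t%d\n", reads, bases }'
  done
}
```

Running it on the barcode 16 file from each tool, for example `fastq_stats dorado_barcode16.fastq minknow_barcode16.fastq` (placeholder names), gives numbers that can be compared directly.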

Run environment:

  • Dorado version: v0.7.3
  • Dorado command:
    dorado basecaller [email protected] ../../combined_pod5/ -v -x cuda:all -b 0 -c 33000 > basecalled_13092024.bam
  • Operating system: Ubuntu 24
  • Hardware (CPUs, Memory, GPUs): i9 (12th gen), Nvidia RTX 3060, 64 GB RAM
  • Source data type: pod5
  • Source data location: On device drive
  • Details about data:
    • Flow cell: R10.4.1
    • Kit: SQK-RBK114-24

Thank you for your help and insights!

@malton-ont
Collaborator

Hi @prekijpatel,

I would recommend running the dorado basecaller command with --no-trim and then allowing dorado demux to perform barcode trimming to clear up the barcodes/adapters/primers. It is possible that adapter trimming from the basecaller command is interfering with the barcode detection in the demux stage.
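As a rough sketch of that suggestion, the original commands could be re-run with `--no-trim` added to the basecalling step and the demux step left to do the trimming. The output names (`basecalled_notrim.bam`, `demux_notrim`) and the `MODEL` placeholder are assumptions for illustration; set `MODEL` to the model from the original command. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch: basecall without trimming, then let `dorado demux` handle barcode trimming.
# All values below are placeholders; adjust them to your own run.
set -eu

MODEL="${MODEL:-MODEL_NAME_HERE}"          # placeholder: model used in the original command
POD5_DIR="${POD5_DIR:-../../combined_pod5/}"
BAM="${BAM:-basecalled_notrim.bam}"        # hypothetical output name
OUT_DIR="${OUT_DIR:-demux_notrim}"         # hypothetical output directory
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print the commands, 0 = run them

# Same basecalling options as the original command, plus --no-trim.
basecall_no_trim() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "dorado basecaller $MODEL $POD5_DIR --no-trim -v -x cuda:all -b 0 -c 33000 > $BAM"
  else
    dorado basecaller "$MODEL" "$POD5_DIR" --no-trim -v -x cuda:all -b 0 -c 33000 > "$BAM"
  fi
}

# Same demux options as the original command; trimming happens at this stage.
demux_with_trim() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "dorado demux -o $OUT_DIR --kit-name SQK-RBK114-24 -t 16 --emit-fastq $BAM"
  else
    dorado demux -o "$OUT_DIR" --kit-name SQK-RBK114-24 -t 16 --emit-fastq "$BAM"
  fi
}

basecall_no_trim
demux_with_trim
```

Writing to a separate BAM and output directory keeps the earlier results intact, so the per-barcode output from both runs can be compared afterwards.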

@prekijpatel
Author

Oo, alright! I shall try that.

Also, does that mean my MinKNOW data is correctly demultiplexed, and the mixture of samples I see is contamination inherent to the sequenced samples?

@malton-ont
Collaborator

Without more information I'm not sure it's possible to tell, but if dorado gives similar results after that change then I'd say it points in that direction.

@prekijpatel
Author

Sure, I will try the Dorado basecaller with --no-trim and will keep you posted.
Thanks a lot for the help!
