Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dorado Demultiplexing Significantly Increased Unclassified Reads Compared to Guppy #1178

Open
ohickl opened this issue Dec 17, 2024 · 6 comments

Comments

@ohickl
Copy link

ohickl commented Dec 17, 2024

Issue Report

Please describe the issue:

Significant increase in unclassified reads when using Dorado (version 0.8.2) for demultiplexing compared to using Guppy (version 6.5.7).

Steps to reproduce the issue:

  1. Basecall raw POD5 data using Dorado:
${DORADO_PATH}/dorado basecaller \
    sup \
    ${BASECALL_WORK_DIR}/raw \
    --kit-name ${KIT} \
    --recursive \
    --device 'cuda:all' > ${BASECALL_WORK_DIR}/simplex.bam 2> ${BC_LOG}
  1. Demultiplex the resulting BAM file using Dorado:
${DORADO_PATH}/dorado demux \
    --output-dir ${DEMUX_WORK_DIR}/demux \
    --no-classify \
    ${DEMUX_WORK_DIR}/simplex.bam &> ${DM_LOG}

Run environment:

Dorado version: 0.8.2
Dorado command:
    Basecalling: dorado basecaller sup ... (see full command above)
    Demultiplexing: dorado demux ... (see full command above)
Operating system: Red Hat Enterprise Linux 8.3
Hardware (CPUs, Memory, GPUs): 2 Xeon Gold 6132, 768G RAM, 4x Tesla V100 SXM2 32G
Source data type: pod5
Source data location: RAM Disk
Details about data:
    Flow cell: R10.4.1
    Kit: SQK-NBD114-96
    Read lengths: Q1: ~800, Q2: ~1.45K, Q3: ~1.5K
    Number of reads: ~11M
    Total dataset size: 156G
Dataset to reproduce, if applicable: -

Logs

Here a section of the barcoding_summary.txt file showing examples of unclassified reads with high barcode scores (front or rear >=90):
barcode_summary_sub.tsv.gz

Here the basecalling and (part of) the demux logs:
bc:

[2024-12-11 16:13:47.231] [info] Running: "basecaller" "sup" "/dev/shm/long_read_prep_3752484/basecalling_1733915028684359659/raw" "--kit-name" "SQK-NBD114-96" "-r" "-x" "cuda:all"
[2024-12-11 16:13:47.425] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2024-12-11 16:13:47.428] [info]  - downloading [email protected] with httplib
[2024-12-11 16:13:47.538] [error] Failed to download [email protected]: SSL server verification failed
[2024-12-11 16:13:47.538] [info]  - downloading [email protected] with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  9  178M    9 16.1M    0     0  39.6M      0  0:00:04 --:--:--  0:00:04 39.5M
 62  178M   62  111M    0     0  79.5M      0  0:00:02  0:00:01  0:00:01 79.4M
 62  178M   62  111M    0     0  28.8M      0  0:00:06  0:00:03  0:00:03 28.8M
100  178M  100  178M    0     0  41.1M      0  0:00:04  0:00:04 --:--:-- 41.1M
[2024-12-11 16:13:54.419] [info] > Creating basecall pipeline
[2024-12-11 16:13:56.890] [info] Calculating optimized batch size for GPU "Tesla V100-SXM2-16GB" and model /dev/shm/long_read_prep_3752484/.temp_dorado_model-2c25cba979a2b5a7/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-11 16:13:56.891] [info] Calculating optimized batch size for GPU "Tesla V100-SXM2-16GB" and model /dev/shm/long_read_prep_3752484/.temp_dorado_model-2c25cba979a2b5a7/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-11 16:13:56.967] [info] Calculating optimized batch size for GPU "Tesla V100-SXM2-16GB" and model /dev/shm/long_read_prep_3752484/.temp_dorado_model-2c25cba979a2b5a7/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-11 16:13:57.020] [info] Calculating optimized batch size for GPU "Tesla V100-SXM2-16GB" and model /dev/shm/long_read_prep_3752484/.temp_dorado_model-2c25cba979a2b5a7/[email protected]. Full benchmarking will run for this device, which may take some time.
[2024-12-11 16:14:01.391] [info] cuda:0 using chunk size 12288, batch size 160
[2024-12-11 16:14:01.391] [info] cuda:2 using chunk size 12288, batch size 160
[2024-12-11 16:14:01.392] [info] cuda:3 using chunk size 12288, batch size 160
[2024-12-11 16:14:01.392] [info] cuda:1 using chunk size 12288, batch size 160
[2024-12-11 16:14:02.167] [info] cuda:0 using chunk size 6144, batch size 352
[2024-12-11 16:14:02.177] [info] cuda:2 using chunk size 6144, batch size 352
[2024-12-11 16:14:02.178] [info] cuda:1 using chunk size 6144, batch size 288
[2024-12-11 16:14:02.179] [info] cuda:3 using chunk size 6144, batch size 288
[2024-12-11 23:30:17.939] [info] > Simplex reads basecalled: 10687069
[2024-12-11 23:30:17.958] [info] > Simplex reads filtered: 649
[2024-12-11 23:30:17.960] [info] > Basecalled @ Samples/s: 7.442969e+06
[2024-12-11 23:30:17.960] [info] > 10834188 reads demuxed @ classifications/s: 4.139955e+02
[2024-12-11 23:30:20.534] [info] > Finished

demux:

> Output records written: 10
...
> Output records written: 10834188

Additionally:

I would greatly appreciate guidance on the following:

Why Dorado classifies these reads as unclassified despite their high barcode scores?
Are there any parameters or settings within Dorado's demultiplexing module that can be adjusted to potentially relax the classification criteria and reduce the number of unclassified reads? Specifically, are there thresholds or parameters related to secondary barcode matching or other score considerations that could be explored?
Is this a known issue or expected behavior with Dorado 0.8.2?

Best

Oskar

@malton-ont
Copy link
Collaborator

malton-ont commented Dec 17, 2024

Hi @ohickl,

A full description of the classification algorithm can be found in the docs here. This also contains information on creating a custom barcode arrangement in which you can override various scoring parameters.

My first suggestion would be to turn on the verbose log for a small subset of the unclassified reads - I suspect you'll see messages like:

  • Two ends confidently predict different BCs or
  • Found midstrand barcode flanks

Either of these two conditions will cause dorado to mark reads as unclassified regardless of its final barcode scores.

These were options available in guppy, but are always on in dorado. Guppy controlled these with the parameters --detect_mid_strand_barcodes and --allow_inferior_barcodes (both false by default).

@ohickl
Copy link
Author

ohickl commented Dec 17, 2024

Hi, thanks for the swift reply!
I ran guppy without either of those args like this:

"${GUPPY_PATH}/guppy_barcoder" \
    -i "${BARCODE_WORK_DIR}/filt" \
    -s "${BARCODE_WORK_DIR}/demux" \
    --barcode_kits "${KIT}" \
    --enable_trim_barcodes TRUE \
    --trim_adapters \
    --compress_fastq \
    --recursive \
    --worker_threads ${THREADS} \
    --device 'cuda:all'

I assumed you meant to rerun dorado demux on the unclassified reads. Using --verbose, i get no additional output, still only > Output records written:....
i ran it like this:

"${DORADO_PATH}/dorado" demux \
    --output-dir "demux" \
    --kit-name "SQK-NBD114-96" \
    --verbose \
    --threads 32 \
    --emit-summary \
    unclassified.bam

and interestingly got 105985 classified reads. Running again on this runs unclassified reads yielded no further classifications.
How can it be that running demux with classification will classify reads that dorado basecaller did not classify, supplying the same kit?

@malton-ont
Copy link
Collaborator

malton-ont commented Dec 17, 2024

Apologies, I should been more specific - I meant the "very verbose" logging, using the -vv flag. (Be aware this will produce lots of information!). It looks like you have midstrand barcode detection turned off in guppy - you could try turning that on and see if you see a comparable drop in classifications.

Interesting. One of the required checks for dorado barcoding is that the barcode begins/ends within 75 bases of the start/end of the read. If there are additional bases that are pushing the barcode beyond this then the read would be unclassified. It would then be trimmed by the adapter trimming - possibly this then brought the barcode back within the required proximity.

@ohickl
Copy link
Author

ohickl commented Dec 17, 2024

hmm, not sure if Im doing something wrong but -vv or -vvv etc does not change the stdout at all.
I am running:

> "${DORADO_PATH}/dorado" --version
[2024-12-17 15:41:48.380] [info] Running: "--version"
0.8.2+6b413c9

I also checked the arrangement file options, is it possible to just specify parts of it, e.g.:

[arrangement]
## Scoring options
[scoring]
max_barcode_penalty = 11
barcode_end_proximity = 100 # from 75
min_barcode_penalty_dist = 3
min_separation_only_dist = 6
flank_left_pad = 5
flank_right_pad = 10
front_barcode_window = 200 # from 175
rear_barcode_window = 200 # from 175
midstrand_flank_score = 0.95

@malton-ont
Copy link
Collaborator

@ohickl,

Log output from dorado goes to stderr (since we put the actual output onto stdout) - are you redirecting that anywhere?

dorado demux \
    -vv \
     --output-dir "demux" \
    --kit-name "SQK-NBD114-96" \
    --verbose \
    --threads 32 \
    --emit-summary \
    unclassified.bam

Should output lines like:

[2024-12-17 15:04:14.585] [trace] Barcode for 5950760d-f5a4-4c49-8fa9-3fce090d6728 is SQK-NBD114-96_barcode04
[2024-12-17 15:04:14.585] [trace] top window v1 15
[2024-12-17 15:04:14.585] [trace] bottom window v1 0
[2024-12-17 15:04:14.585] [trace] 
C-C-GA-GATCA-CAGT-CGCGAACACTCTTTC
* | || || |* || | ||*||*|  |||* |
AGCAGAAGA-CGGCA-TACGAGATC--TCTA-C

Yes it's possible to only provide a subset of scoring options, but this must be done as part of a full custom arrangement, specifying the masks and sequences:

[arrangement]
name = "SQK-NBD114-96-override"
kit = "SQK-NBD114-96-override"

# mask values from dorado/utils/barcode_kits.cpp
mask1_front = "ATTGCTAAGGTTAA"
mask1_rear = "CAGCACCT"
mask2_front = "ATTGCTAAGGTTAA"
mask2_rear = "CAGCACC"

barcode1_pattern = "NB%02i"
barcode2_pattern = "NB%02i"
first_index = 1
last_index = 96

## Scoring options
[scoring]
barcode_end_proximity = 100 # from 75
front_barcode_window = 200 # from 175
rear_barcode_window = 200 # from 175

@ohickl
Copy link
Author

ohickl commented Dec 17, 2024

Ok. thanks! Ill also try the new 0.9.0 to see if it natively results in more identifications.
I usually redirect both stderr and stdout to a file, in this case it seemed to result in the extra traces to be lost. never had that before. using script -q -c "dorado demux -vv --output-dir demux --kit-name SQK-NBD114-96 --threads 32 --emit-summary unclassified.bam" log.log > /dev/null works though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants