Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[question] duplex and barcode demultiplexing? #522

Open
JWDebler opened this issue Dec 13, 2023 · 14 comments
Open

[question] duplex and barcode demultiplexing? #522

JWDebler opened this issue Dec 13, 2023 · 14 comments
Labels
duplex Issues related to duplex basecalling enhancement New feature or request

Comments

@JWDebler
Copy link

Hi,
I am wondering if there is currenly a way to do barcode demultiplexing and duplex calling using only dorado, and if so, what would be the best way to do it?

@vellamike
Copy link
Collaborator

Hi @JWDebler , duplex barcode demux is currently not implemented in Dorado. The best way to currently do it would be to demux the simplex reads and then use the duplex read IDs (which are a concatenation of simplex read IDs) to demux.

@vellamike vellamike added the enhancement New feature or request label Dec 13, 2023
@tijyojwad
Copy link
Collaborator

we have this on our roadmap and we plan to implement this for a future release

@vdruelle
Copy link

vdruelle commented Apr 5, 2024

Are there updates on the timeline to release barcode demultiplexing for duplex ? It seems to be a heavily requested feature for a while now...
Since the duplex reads are already in the data it's a shame to not be able to use them :/

@JWDebler
Copy link
Author

JWDebler commented Apr 5, 2024

You can extract them, it's just a bit annoying as it basically means basecalling twice. At least that's what we are currently doing. Basecall in simplex mode, classify into barcodes, extract barcode reads into their own pod5 files and then duplex call those individually:

location=/path/to/raw/data/containing/folder

dorado basecaller sup -r $location --kit-name SQK-NBD114-24 > all.bam && \
dorado demux --output-dir simplex --no-classify all.bam && \
rm simplex/*unclassified* && \
for file in simplex/*.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view -h $file | cut -f 1 > simplex/$id.readids.txt; done && \
for file in simplex/*readids.txt; do id=$(echo "$file" | grep -oP 'barcode\d+'); grep -v "@" $file > simplex/$id.clean.txt; done && \
for file in simplex/*clean.txt; do id=$(echo "$file" | grep -oP 'barcode\d+'); pod5 filter -r $location -i $file -t 10 --missing-ok --duplicate-ok --output $id.pod5; done && \
for file in *.pod5; do id=$(echo "$file" | grep -oP 'barcode\d+'); dorado duplex sup $file > $id.duplex.untrimmed.bam; done && \
for file in *untrimmed.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); dorado trim $file > $id.duplex.bam && \
rm *untrimmed.bam && \
for file in *duplex.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view  -@ 8 -O fastq -d dx:1 $file | pigz -9 > $id.duplex.fastq.gz; done && \
for file in *duplex.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view  -@ 8 -O fastq -d dx:0 $file | pigz -9 > $id.simplex.fastq.gz; done 

@vdruelle
Copy link

vdruelle commented Apr 5, 2024

I see, thanks a lot for the details of your implementation !

@mbhall88
Copy link

mbhall88 commented Apr 5, 2024

I created this script to demux a duplex basecalled BAM. Takes a summary file after demultiplexing simplex reads and your duplex called BAM. There’s some options for dealing with situations where the two reads that make a duplex read don’t have the same barcode in simplex mode or if one is unclassified. Also supports fastq output

@vdruelle
Copy link

vdruelle commented Apr 5, 2024

Thanks for your answer and the link to your code ! Great work on the manuscript as well, it's super nice to see the variant calling metrics there (also shows that duplex is maybe not essential ^^).

@JWDebler
Copy link
Author

JWDebler commented Apr 5, 2024

interesting @mbhall88. I'm a bit confused. How did you get barcodes in the duplex bam file? I thought dorado couldn't do duplex calling and barcode assignment at the same time yet, which is why we're doing it the long way around.

@mbhall88
Copy link

mbhall88 commented Apr 6, 2024

There’s no barcodes in the duplex bam. But that script uses the assignment from the sequencing summary file from the simplex basecalling to assign barcodes for the duplex reads. I also found cases when testing that there were sometime simplex reads that are part of a duplex read that have different barcode assignments (normally it was just that one was unclassified and the other was classified) so I added options for how you want to deal with this

@JWDebler
Copy link
Author

JWDebler commented Apr 6, 2024

Ahhh, ok, that makes sense. Do you do simplex live calling during the run then? We always use post run basecalling, and I don't think that produces the summary file.

@mbhall88
Copy link

mbhall88 commented Apr 6, 2024

No I generally basecall after the fact also. But you can create a summary file with https://github.com/nanoporetech/dorado?tab=readme-ov-file#sequencing-summary

the demux command also has an —emit-summary option which will place a summary file in the output directory (see https://github.com/nanoporetech/dorado?tab=readme-ov-file#classifying-existing-datasets)

@vdruelle
Copy link

vdruelle commented Apr 8, 2024

Thanks for the details, I'll give it a try ! Looking at your script it does not seem super complicated to demultiplex duplex reads. I'm a bit puzzled why there is no native way of doing it with Dorado...

@tijyojwad tijyojwad mentioned this issue May 6, 2024
@tijyojwad tijyojwad added the duplex Issues related to duplex basecalling label May 23, 2024
@hagen-wende
Copy link

This is my current implementation, maybe useful for somebody ...
Assuming you have seqkit and samtools installed in the respective conda environments you can run this bash script after simplex basecalling from the nanopore output directory (the one that contains the fastq_pass and pod5 folder)

dorado_duplex_bact.sh

This uses the bacterial research model for the basecalling, just change this to "sup" or "hac" in line 53 if you want to use the standard models

@luckybillion
Copy link

You can extract them, it's just a bit annoying as it basically means basecalling twice. At least that's what we are currently doing. Basecall in simplex mode, classify into barcodes, extract barcode reads into their own pod5 files and then duplex call those individually:

location=/path/to/raw/data/containing/folder

dorado basecaller sup -r $location --kit-name SQK-NBD114-24 > all.bam && \

Just to clarify, when running doradobase caller would --no-trim need to be added for the subsequent demultiplex step to work?
my understanding is that otherwise the barcodes are removed and dorado demux won't work?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplex Issues related to duplex basecalling enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants