Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

are duplex foldback chimeras handled by Dorado? #443

Open
wilsonte-umich opened this issue Oct 27, 2023 · 2 comments
Open

are duplex foldback chimeras handled by Dorado? #443

wilsonte-umich opened this issue Oct 27, 2023 · 2 comments

Comments

@wilsonte-umich
Copy link

doradao duplex identifies when two different reads have sequenced the same strand of a source DNA molecule.

Simplex read splitting was added to dorado in v0.4.0.

However, I am uncertain if the reads that are split are then checked for duplex state? Some duplex sequence reads are output as a single chimeric read that falsely appears as a foldback inversion, not as two independent reads. Does dorado find these (after splitting) and apply stereo calling to them in duplex mode?

@wilsonte-umich
Copy link
Author

wilsonte-umich commented Oct 28, 2023

By examining output from dorado duplex v0.4.1 with the help of the pi:Z: parent read tag, I found some answers to my question. I'll summarize in two posts for clarity since one issue/question remains.

First, the following demonstrate that yes, dorado will perform duplex basecalling on split reads. I provide this as proof for those who might land here.

code

echo "duplex status counts for 10K called reads"
samtools view batch_0.unaligned.bam | 
head -n 10000 | 
grep "pi:Z" | 
perl -ne '$_ =~ m/(dx:i:\S+)/ and print $1,"\n"' | 
sort | 
uniq -c
echo

echo "an example duplex split read"
samtools view batch_0.unaligned.bam | 
grep -E "6bb31c00-5955-4ef8-8700-0a48a0a4a880|0ac981b8-408c-489e-b28d-10ac1383625b|c6795acf-765a-45e1-b477-b491bd265523" | 
perl -ne '
    @f = split("\t"); 
    print $f[0],"\t"; 
    $_ =~ m/(pi:Z:\S+)/ and print "$1\t"; 
    $_ =~ m/(dx:i:\S+)/ and print "$1\n"; 
'

echo "splitting status of duplex read sources"
samtools view batch_0.unaligned.bam | 
head -n 10000 | 
grep "dx:i:-1" | 
perl -ne '$_ =~ m/pi:Z:\S+/ ? print "pi:Z:TAG_PRESENT\n" : print "pi:Z:TAG_ABSENT\n"' |
sort | 
uniq -c

output

duplex status counts for 10K called reads
    113 dx:i:0
    174 dx:i:-1

an example duplex split read
0ac981b8-408c-489e-b28d-10ac1383625b;c6795acf-765a-45e1-b477-b491bd265523       dx:i:1
0ac981b8-408c-489e-b28d-10ac1383625b    pi:Z:6bb31c00-5955-4ef8-8700-0a48a0a4a880       dx:i:-1
c6795acf-765a-45e1-b477-b491bd265523    pi:Z:6bb31c00-5955-4ef8-8700-0a48a0a4a880       dx:i:-1

splitting status of duplex read sources
    322 pi:Z:TAG_ABSENT
    174 pi:Z:TAG_PRESENT

So, split reads show a mix of simplex and duplex. In this small sampling of one run, there are more duplex than simplex, since many chimeric reads are false foldback duplexes. Duplex reads arise from both split and unsplit read pairs, with plenty of both types.

The example shows that parent read 0a48a0a4a880 was split to 10ac1383625b and b491bd265523, which were properly duplexed.

@wilsonte-umich
Copy link
Author

The above was the good news. Here's the bad news.

My downstream code looks for alignment pairs from a single read where minimap2 reports them as aligning to the same genomic location on opposite strands, the expected alignment pattern when a chimeric read comprises the two strands of the same duplex molecule. Read splitting prevents some of those, but not all as exemplified here:

code

echo "all basecalled reads for a missed foldback chimera"
samtools view batch_0.unaligned.bam | 
grep "34d1a7e9-cae6-43a4-88ad-219535e61a37" | 
perl -ne '
    @f = split("\t"); 
    print $f[0],"\t"; 
    $_ =~ m/(pi:Z:\S+)/ ? print "$1\t" : print "pi:Z:TAG_ABSENT\t"; 
    $_ =~ m/(dx:i:\S+)/ and print "$1\n"; 
'
echo

echo "all minimap2 alignments for that chimeric read, PAF format"
zcat ../*.paf.gz | 
grep "34d1a7e9-cae6-43a4-88ad-219535e61a37" | 
cut -f 1-12

output

all basecalled reads for a missed foldback chimera
34d1a7e9-cae6-43a4-88ad-219535e61a37    pi:Z:TAG_ABSENT dx:i:0

all minimap2 alignments for that chimeric read, PAF format
34d1a7e9-cae6-43a4-88ad-219535e61a37    573     38      425     +       chr9       124595110       59089221        59089607        386   387      60
34d1a7e9-cae6-43a4-88ad-219535e61a37    573     417     572     -       chr9       124595110       59089222        59089374        150   156      60

Thus, read 219535e61a37 failed splitting, but is a foldback chimera. It was presumably not found by the splitting algorithm since the second pass on the second strand yielded only a partial sequence of the molecule.

I don't have a number, but such reads are frequent in the data I'm analyzing, it is not hard to find examples.

My code is finding them after reference alignment, but in my perfect world, Dorado would find these chimeric reads that cannot be split based on adapter searching alone and subject them to stereo basecalling. This could be done by looking for self-complementarity of reads.

Any chance that Dorado could be made to do that? It could increase the duplex yield.

Perhaps it is undesirable due to reduced confidence in the nature of a chimera in the absence of adapter confirmation, or computational reasons. If so, it will be useful to have it definitely stated that Dorado will never catch them so people realize that downstream code must.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant