are duplex foldback chimeras handled by Dorado? #443

wilsonte-umich · 2023-10-27T23:00:06Z

doradao duplex identifies when two different reads have sequenced the same strand of a source DNA molecule.

Simplex read splitting was added to dorado in v0.4.0.

However, I am uncertain if the reads that are split are then checked for duplex state? Some duplex sequence reads are output as a single chimeric read that falsely appears as a foldback inversion, not as two independent reads. Does dorado find these (after splitting) and apply stereo calling to them in duplex mode?

wilsonte-umich · 2023-10-28T13:08:46Z

By examining output from dorado duplex v0.4.1 with the help of the pi:Z: parent read tag, I found some answers to my question. I'll summarize in two posts for clarity since one issue/question remains.

First, the following demonstrate that yes, dorado will perform duplex basecalling on split reads. I provide this as proof for those who might land here.

code

echo "duplex status counts for 10K called reads"
samtools view batch_0.unaligned.bam | 
head -n 10000 | 
grep "pi:Z" | 
perl -ne '$_ =~ m/(dx:i:\S+)/ and print $1,"\n"' | 
sort | 
uniq -c
echo

echo "an example duplex split read"
samtools view batch_0.unaligned.bam | 
grep -E "6bb31c00-5955-4ef8-8700-0a48a0a4a880|0ac981b8-408c-489e-b28d-10ac1383625b|c6795acf-765a-45e1-b477-b491bd265523" | 
perl -ne '
    @f = split("\t"); 
    print $f[0],"\t"; 
    $_ =~ m/(pi:Z:\S+)/ and print "$1\t"; 
    $_ =~ m/(dx:i:\S+)/ and print "$1\n"; 
'

echo "splitting status of duplex read sources"
samtools view batch_0.unaligned.bam | 
head -n 10000 | 
grep "dx:i:-1" | 
perl -ne '$_ =~ m/pi:Z:\S+/ ? print "pi:Z:TAG_PRESENT\n" : print "pi:Z:TAG_ABSENT\n"' |
sort | 
uniq -c

output

duplex status counts for 10K called reads
    113 dx:i:0
    174 dx:i:-1

an example duplex split read
0ac981b8-408c-489e-b28d-10ac1383625b;c6795acf-765a-45e1-b477-b491bd265523       dx:i:1
0ac981b8-408c-489e-b28d-10ac1383625b    pi:Z:6bb31c00-5955-4ef8-8700-0a48a0a4a880       dx:i:-1
c6795acf-765a-45e1-b477-b491bd265523    pi:Z:6bb31c00-5955-4ef8-8700-0a48a0a4a880       dx:i:-1

splitting status of duplex read sources
    322 pi:Z:TAG_ABSENT
    174 pi:Z:TAG_PRESENT

So, split reads show a mix of simplex and duplex. In this small sampling of one run, there are more duplex than simplex, since many chimeric reads are false foldback duplexes. Duplex reads arise from both split and unsplit read pairs, with plenty of both types.

The example shows that parent read 0a48a0a4a880 was split to 10ac1383625b and b491bd265523, which were properly duplexed.

wilsonte-umich · 2023-10-28T13:39:12Z

The above was the good news. Here's the bad news.

My downstream code looks for alignment pairs from a single read where minimap2 reports them as aligning to the same genomic location on opposite strands, the expected alignment pattern when a chimeric read comprises the two strands of the same duplex molecule. Read splitting prevents some of those, but not all as exemplified here:

code

echo "all basecalled reads for a missed foldback chimera"
samtools view batch_0.unaligned.bam | 
grep "34d1a7e9-cae6-43a4-88ad-219535e61a37" | 
perl -ne '
    @f = split("\t"); 
    print $f[0],"\t"; 
    $_ =~ m/(pi:Z:\S+)/ ? print "$1\t" : print "pi:Z:TAG_ABSENT\t"; 
    $_ =~ m/(dx:i:\S+)/ and print "$1\n"; 
'
echo

echo "all minimap2 alignments for that chimeric read, PAF format"
zcat ../*.paf.gz | 
grep "34d1a7e9-cae6-43a4-88ad-219535e61a37" | 
cut -f 1-12

output

all basecalled reads for a missed foldback chimera
34d1a7e9-cae6-43a4-88ad-219535e61a37    pi:Z:TAG_ABSENT dx:i:0

all minimap2 alignments for that chimeric read, PAF format
34d1a7e9-cae6-43a4-88ad-219535e61a37    573     38      425     +       chr9       124595110       59089221        59089607        386   387      60
34d1a7e9-cae6-43a4-88ad-219535e61a37    573     417     572     -       chr9       124595110       59089222        59089374        150   156      60

Thus, read 219535e61a37 failed splitting, but is a foldback chimera. It was presumably not found by the splitting algorithm since the second pass on the second strand yielded only a partial sequence of the molecule.

I don't have a number, but such reads are frequent in the data I'm analyzing, it is not hard to find examples.

My code is finding them after reference alignment, but in my perfect world, Dorado would find these chimeric reads that cannot be split based on adapter searching alone and subject them to stereo basecalling. This could be done by looking for self-complementarity of reads.

Any chance that Dorado could be made to do that? It could increase the duplex yield.

Perhaps it is undesirable due to reduced confidence in the nature of a chimera in the absence of adapter confirmation, or computational reasons. If so, it will be useful to have it definitely stated that Dorado will never catch them so people realize that downstream code must.

wilsonte-umich mentioned this issue Feb 12, 2024

Foldback chimeras #590

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

are duplex foldback chimeras handled by Dorado? #443

are duplex foldback chimeras handled by Dorado? #443

wilsonte-umich commented Oct 27, 2023

wilsonte-umich commented Oct 28, 2023 •

edited

Loading

wilsonte-umich commented Oct 28, 2023

are duplex foldback chimeras handled by Dorado? #443

are duplex foldback chimeras handled by Dorado? #443

Comments

wilsonte-umich commented Oct 27, 2023

wilsonte-umich commented Oct 28, 2023 • edited Loading

wilsonte-umich commented Oct 28, 2023

wilsonte-umich commented Oct 28, 2023 •

edited

Loading