Alignment of duplex reads is incorrect or truncated in some cases, giving truncated or poor quality reads? #441

Open
Delayed-Gitification opened this issue Oct 27, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@Delayed-Gitification

Delayed-Gitification commented Oct 27, 2023

Hi,

I was hoping someone could give some insight into this problem I am having.

For my amplicon duplex reads (length ~1.2 kb), some basecall excellently in duplex with the latest Dorado version, giving a median Q score above 30.

However, for a large majority of them, I find that the duplex read is truncated: only part of the region covered by both reads is basecalled in duplex.

Furthermore, in some cases (not particularly rare, >10% of my duplex reads) the alignment is completely wrong and results in a duplex read that is misaligned and entirely incorrect (median Q scores of <10).

I attach a screenshot of an extreme example (though these are not at all uncommon in my dataset!). Here, the top read is the forward read (median Q score of above 20). The bottom read is the reverse read (less good, but still ok, median Q score of around 16). The read in the middle is the duplex read that is derived from these.

You can see that the forward and reverse reads cover the same region and have a very similar sequence (as expected!). However, the duplex read is misaligned and its sequence is entirely wrong. As expected, its Q scores are very low (median <8).
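For anyone wanting to reproduce the median Q scores quoted above, here is a minimal sketch of how such a value could be computed from a FASTQ quality string. It assumes standard Phred+33 encoding; the function name `median_q` is hypothetical and not part of Dorado.

```python
from statistics import median


def median_q(quality: str, offset: int = 33) -> float:
    """Return the median Phred score of a FASTQ quality string.

    Assumes Phred+33 encoding (the usual FASTQ convention); pass a
    different offset if your file uses another encoding.
    """
    if not quality:
        raise ValueError("quality string is empty")
    # Each character encodes one per-base Phred score as ASCII code minus offset.
    return median(ord(char) - offset for char in quality)


if __name__ == "__main__":
    # 'I' is ASCII 73, so every base here has a Phred score of 40.
    print(median_q("IIIII"))
```

Running this per read on the simplex and duplex records makes it easy to flag duplex reads whose median falls far below that of their parent reads.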

So two questions:

  1. Is it expected that the duplex read is often truncated compared to the simplex reads that give rise to it?
  2. Any idea why in some cases the duplex alignment and basecalling seems to be failing so dramatically?

Thanks again to the developers for all their work on this.

(note - I've used the forward read as the reference here and aligned the duplex read and the reverse read to it)

Screenshot 2023-10-27 at 16 20 42
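The comparison described in the note above (forward read as reference, with the reverse and duplex reads compared against it) can be roughly approximated without an aligner. The sketch below is only an illustration: it uses Python's `difflib` as a crude similarity measure rather than a true alignment identity, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def similarity(reference: str, query: str) -> float:
    """Return a 0..1 similarity ratio between two sequences.

    This is difflib's matching-block ratio, not an alignment identity,
    so treat it as a rough screen only.
    """
    return SequenceMatcher(None, reference, query, autojunk=False).ratio()


if __name__ == "__main__":
    forward = "ACGTTGCAAGGCTA"
    reverse = reverse_complement(forward)  # idealised reverse-strand read
    # The reverse read must be reverse-complemented before comparison.
    print(similarity(forward, reverse_complement(reverse)))
```

A duplex read that scores much lower against the forward read than the reverse-complemented reverse read does would be a candidate for the failure described here.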

@Delayed-Gitification Delayed-Gitification changed the title Alignment of duplex reads is incorrect or truncated in some cases, given truncated or poor quality reads? Alignment of duplex reads is incorrect or truncated in some cases, giving truncated or poor quality reads? Oct 27, 2023
@Delayed-Gitification
Author

example.zip
I attach the pod5 and BAM files (default parameters, super accuracy 4 kHz [email protected]) for the example above.

@tijyojwad
Collaborator

Hi @Delayed-Gitification - thanks for sharing the data!

My hunch is that our heuristics are picking up false-positive pairs because of the amplicons. In general, duplex doesn't work super well with amplicons yet. Our next release (expected in a day or two) will include some updates that should help with this. I'll ping this thread once it is out, and it would be great to get your feedback.

@Delayed-Gitification
Author

Oh that's great news, looking forward to it.

In this case they are definitely true positives (the experiment was designed to ensure this), so hopefully the updates you are releasing will fix this!

@tijyojwad
Collaborator

Hi @Delayed-Gitification - we just released the updated version of dorado (v0.4.2) - https://github.com/nanoporetech/dorado#installation . Please let me know if you see some improvements. The main change is to restrict pairing to reads that are adjacent when ordered by sequencing time.
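To make the idea of pairing only time-adjacent reads concrete, here is a minimal sketch. It is not Dorado's implementation: the `Read` fields, the grouping by channel, and the `max_gap` threshold are all assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Read:
    read_id: str
    channel: int   # assumption: candidates are compared within one channel
    start: float   # start time in seconds
    end: float     # end time in seconds


def adjacent_pairs(reads: List[Read], max_gap: float = 5.0) -> List[Tuple[str, str]]:
    """Return candidate pairs of reads that are consecutive in time.

    Only neighbours in the time-ordered list for a channel are considered,
    and max_gap (a hypothetical threshold) bounds the delay between them.
    """
    by_channel: Dict[int, List[Read]] = {}
    for read in reads:
        by_channel.setdefault(read.channel, []).append(read)

    pairs: List[Tuple[str, str]] = []
    for channel in sorted(by_channel):
        ordered = sorted(by_channel[channel], key=lambda r: r.start)
        # zip(ordered, ordered[1:]) yields each read with its immediate successor.
        for first, second in zip(ordered, ordered[1:]):
            gap = second.start - first.end
            if 0 <= gap <= max_gap:
                pairs.append((first.read_id, second.read_id))
    return pairs


if __name__ == "__main__":
    reads = [
        Read("a", 1, 0.0, 10.0),
        Read("b", 1, 10.5, 20.0),
        Read("c", 1, 100.0, 110.0),
        Read("d", 2, 10.2, 15.0),
    ]
    print(adjacent_pairs(reads))
```

Under these assumptions, a read can only be paired with its immediate successor, which rules out pairing two unrelated amplicon reads that merely share sequence.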

@Delayed-Gitification
Author

Unfortunately, I don't see an improvement here.

@tijyojwad
Collaborator

Got it, thank you for testing! We'll have a look at your sample dataset.

@tijyojwad tijyojwad added the enhancement New feature or request label Oct 30, 2023
@Delayed-Gitification
Author

Delayed-Gitification commented Oct 30, 2023

Thanks! Just to note that the two reads in the .pod5 file were split by some custom software, so some of the metadata values have been imputed. The actual signal is unchanged, apart from being split into two, and simplex basecalling works very well for both pod5 entries.

(The read is derived from an amplicon with a hairpin adapter at one end, meaning both strands are read in a single read. My code then splits this into two reads (upstream and downstream of the hairpin adapter) in a new pod5 file.)
