Amplicon basecalling #536

Open
rmcminds opened this issue Dec 18, 2023 · 0 comments

rmcminds commented Dec 18, 2023

I know there have been a few discussions already about amplicons, but I was thinking a bit more broadly about this subject and had a few more questions/observations:

In other dorado issues covering amplicons, it was noted that the duplex algorithm isn't to be trusted with these datasets, in part because it is hard to tell whether a second read is the negative strand of the first read or just another molecule with a similar sequence. In most amplicon workflows I know of, we wind up pooling many reads into redundant units like ASVs, and do a lot of post-basecalling denoising or clustering by comparing the sequences of many molecules. So it doesn't really matter whether reads come from the exact same molecule in the sequencing library - what matters is whether they came from identical biological molecules before PCR. After PCR, there are obviously many copies, so the 'duplex' concept can be extended to 'polyplex'. Perhaps in a perfect world, we wouldn't have separate basecalling and denoising steps - we would simply be able to do basecalling in a way that pools information across reads to increase accuracy and retain count information.

There are many Nanopore users submitting issues to the DADA2 (ASV denoising) repository and getting the simple response that they shouldn't use DADA2 for ONT reads, because DADA2 doesn't do well with longer and more error-prone reads. The main issue seems to be an assumption that a significant portion of the reads is completely error-free, because the algorithm assumes that the 'true' sequence for an ASV partition is the one seen most frequently. Even with extremely high accuracy, this could be an issue as reads get longer, because the probability of an entire read being 100% accurate drops with length, so extra-long reads are always likely to have at least a couple of errors. Thus the original sequence might not exist at all among the raw reads, and the number of exact copies of random derivatives can't be used to determine what the original sequence was.
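To make the length argument concrete, here is a minimal sketch (the function name and the example accuracy/length values are my own illustrative choices, not figures from any ONT spec) of how the chance of a fully error-free read decays with read length under a simple independent-errors assumption:

```python
# Illustrative only: probability that a read of length L contains zero
# errors, assuming each base is called correctly with probability p,
# independently of the others.
def p_error_free(per_base_accuracy: float, read_length: int) -> float:
    return per_base_accuracy ** read_length

# Even at 99.9% per-base accuracy, a 1,500 bp amplicon read is fully
# error-free only ~22% of the time, and a 4,000 bp read only ~2%.
print(round(p_error_free(0.999, 1500), 3))  # ~0.223
print(round(p_error_free(0.999, 4000), 3))  # ~0.018
```

So for long amplicons the modal raw read is expected to carry at least one error, which is exactly the regime where DADA2's most-frequent-sequence assumption breaks down.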

I figured DADA2's assumptions could be relaxed a little, and a consensus-based algorithm could be used to denoise long reads instead. I thought at first this could be a tweak to DADA2, but it would likely be more accurate if applied to the basecall probability distributions in dorado, rather than the simplified basespace from fastqs downstream. If a library is known to be amplicons, this knowledge could help with even 'simplex' basecalling, because all reads should be alignable/homologous, and that information could help with things like read splitting and barcode identification. A consensus-based DADA2-like algorithm might look something like:

  1. assume all reads come from a single 'true' sequence, estimated through consensus of all reads
  2. if there is a high probability that this is false, split the reads into two draft partitions.
    a. choose two draft centers as references, such as the actual reads most similar to and most distant from the previous consensus
    b. assign reads to the partitions by identifying which reference they are most similar to
  3. update the inferred 'true' sequence for each partition via consensus of all reads assigned to that partition
  4. repeat until no more partitions are needed to explain the variation
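The steps above can be sketched in miniature. This is only a toy under strong simplifying assumptions - reads are represented as equal-length, pre-aligned strings with Hamming distance standing in for a real alignment, a fixed distance threshold stands in for a proper error-model likelihood, and all function names (`consensus`, `denoise`, etc.) are hypothetical, not part of dorado or DADA2:

```python
from collections import Counter

def consensus(reads):
    # Column-wise majority vote. Toy assumption: reads are equal-length
    # and already aligned; real amplicon reads would need an MSA first,
    # or better, dorado's per-base probability distributions.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def denoise(reads, max_dist=1):
    # Step 1: assume a single 'true' sequence explains every read.
    partitions = [list(reads)]
    changed = True
    while changed:
        changed = False
        next_parts = []
        for part in partitions:
            center = consensus(part)
            # Step 2: residual variation too large to be sequencing
            # error (here: a crude distance threshold) -> try to split.
            if len(part) > 1 and max(hamming(r, center) for r in part) > max_dist:
                # 2a: draft centers = reads most and least similar
                #     to the previous consensus.
                near = min(part, key=lambda r: hamming(r, center))
                far = max(part, key=lambda r: hamming(r, center))
                # 2b: assign each read to the nearer draft center.
                a = [r for r in part if hamming(r, near) <= hamming(r, far)]
                b = [r for r in part if hamming(r, near) > hamming(r, far)]
                if a and b:
                    changed = True
                    next_parts += [a, b]
                    continue
            next_parts.append(part)
        # Steps 3-4: consensuses are re-estimated at the top of the loop;
        # repeat until no partition needs further splitting.
        partitions = next_parts
    # Report each inferred 'true' sequence with its read count.
    return [(consensus(p), len(p)) for p in partitions]
```

For example, six toy reads drawn from two distinct templates, each family carrying one erroneous copy, come back as two partitions whose consensuses match the templates even though no single partition is error-free.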

Choosing how to split reads into partitions could be done in different ways. Perhaps reads could be split by finding the single alignment column that has the highest likelihood of alternate versions, assigning reads based on their most likely identity at that single location, and iterating between updating the consensuses and allowing the reads to swap partitions.
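One way to sketch that column-based split (again a toy on hard base calls rather than dorado's probability distributions, with hypothetical function names): pick the column whose second-most-common base has the deepest support, since strong support for a minor allele is hard to explain by independent sequencing errors, then partition reads by their base at that column.

```python
from collections import Counter

def best_split_column(reads):
    # Toy heuristic: the most informative column is the one where the
    # second-most-common base is best supported. A real implementation
    # would compare likelihoods under an error model, using per-base
    # probabilities rather than hard calls.
    best_col, best_support = None, 0
    for i, col in enumerate(zip(*reads)):
        counts = Counter(col).most_common()
        if len(counts) > 1 and counts[1][1] > best_support:
            best_col, best_support = i, counts[1][1]
    return best_col

def split_by_column(reads, col):
    # Assign each read a draft partition by its base at the chosen
    # column; updating consensuses and letting reads swap partitions
    # would then iterate from here.
    groups = {}
    for r in reads:
        groups.setdefault(r[col], []).append(r)
    return list(groups.values())
```

On six toy reads from two templates, the first divergent column is chosen and the reads fall cleanly into two groups of three, which would then seed the iterative consensus updates described above.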

I assume there are people working on these things already (I see that duplex pairing for amplicons is considered a 'research problem' in other discussions, e.g. #268 (comment)), and I am new to the Nanopore community so may just not be aware. But if there is any interest in this topic, I'd love to talk with people about it. I'd also be happy to try to contribute code and test data myself - but I'm not very handy with C and not yet very familiar with the existing code.
