Consensus duplex read generation by combining dx:i:-1 and dx:i:1 reads #491

bruzecruise opened this issue Nov 29, 2023 · 9 comments

@bruzecruise

bruzecruise commented Nov 29, 2023

As I understand it, with duplex basecalling the output BAM file contains reads with the following tags:

dx:i:1 for duplex reads.
dx:i:0 for simplex reads that have no duplex offspring.
dx:i:-1 for simplex reads that have duplex offspring.
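
A minimal sketch of how these three classes could be separated, assuming the BAM has first been converted to SAM text (for example with `samtools view -h`) and that the tag appears as an optional field of the form `dx:i:<value>`. The function names `dx_value` and `group_by_dx` are hypothetical and not part of Dorado.

```python
from collections import defaultdict


def dx_value(sam_line):
    """Return the integer value of the dx tag, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional SAM fields start at column 12 (index 11).
    for field in fields[11:]:
        if field.startswith("dx:i:"):
            return int(field[len("dx:i:"):])
    return None


def group_by_dx(sam_lines):
    """Map each dx value to the list of read names carrying it."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        read_name = line.split("\t", 1)[0]
        groups[dx_value(line)].append(read_name)
    return dict(groups)
```

Keeping only the `1` and `0` groups drops the simplex parents; keeping all three counts the same molecule more than once.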

Would it be possible to add a fourth option here: a "consensus" read that combines the dx:i:1 read with its dx:i:-1 parent, essentially extending the duplex read with the simplex ends? Currently we must either exclude the dx:i:-1 reads, which may shorten the read lengths we could achieve, or include them, which falsely inflates our coverage.

#327

Also, any current workarounds would be appreciated.

Cheers,
Dan

@vellamike
Collaborator

Hi @bruzecruise ,

If you wanted to do this, my recommendation would be to write a custom tool yourself. You can get most of the way there by ignoring the complement read (the second read in the duplex read, the one following the semicolon) and using only the template, since the template is normally longer, more accurate, and has the same direction as the duplex read. You would then need to align the duplex read and the template to each other and find a point where you can "stitch" the two together.
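
As a rough sketch of the first step, the template can be looked up from the duplex read's name. This assumes duplex read IDs have the form `<template_id>;<complement_id>`, as the mention of the semicolon above suggests; `split_duplex_id`, `template_for`, and the `sequences` dictionary are hypothetical names used only for illustration.

```python
def split_duplex_id(read_id):
    """Split a duplex read ID into (template_id, complement_id).

    Assumes the ID is "<template_id>;<complement_id>".
    """
    template_id, sep, complement_id = read_id.partition(";")
    if not sep or not template_id or not complement_id:
        raise ValueError(f"not a duplex read ID: {read_id!r}")
    return template_id, complement_id


def template_for(duplex_id, sequences):
    """Return the simplex template sequence for a duplex read, or None.

    `sequences` maps read IDs to sequences; the complement is ignored.
    """
    template_id, _complement_id = split_duplex_id(duplex_id)
    return sequences.get(template_id)
```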

We have considered adding this to Dorado, but came to the view that it would have limited utility. We may reconsider this in the future if there is a lot of demand for it.

vellamike self-assigned this on Dec 4, 2023
@bruzecruise
Author

bruzecruise commented Dec 6, 2023

Hi @vellamike, thanks for the suggestion! If I get around to writing a script, I'll be sure to share it here.

vellamike closed this as not planned on Dec 6, 2023
@minefield47

minefield47 commented Jan 17, 2024

@bruzecruise
Have you looked into this any further? I am planning a de novo assembly of a non-model organism, and the ability to concatenate the duplex read with the template read would help prevent chimera formation during assembly.

Thank you for any information you can provide!

@vellamike
Collaborator

vellamike commented Jan 18, 2024

I'm still a bit skeptical that combining the duplex read with the non-duplex template component is a good idea. Are you seeing significant length differences between the simplex template and the corresponding duplex reads?

@minefield47

Hi @vellamike
Could you elaborate on why you are skeptical?

We have only a single sequencing run so far. I will look into the data from that run tomorrow and update. The main concern raised in my group is the case where the template strand is 100kb but the complement read is 1kb, so the resulting duplex read is 1kb. Since this read is derived from a much longer template strand, we would want to use the duplex read plus the remaining 99kb of the template during assembly to prevent chimera formation or gaps. From my understanding of the dx tags and the conversation here, simply filtering by tag would either remove data or create redundant data that an assembler would have to resolve, which raises the concern that duplexing may change the bases at a given position.

Thank you,

@vellamike
Collaborator

@minefield47 The example you give (100kb->1kb) is quite bad, but I'm a little skeptical that this is common enough to warrant a separate fix. I suspect it's quite rare.

Have you considered writing a small tool to do this step? What you need to do is locally align the duplex read to the template using something like edlib, then stitch on the missing simplex component.
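
A minimal sketch of that align-and-stitch step, under stated assumptions. To stay dependency-free it uses a plain dynamic-programming infix alignment as a stand-in for edlib, and it assumes the duplex read and the template are already in the same orientation. `infix_align` and `extend_duplex` are hypothetical names. The quadratic cost makes this suitable only for illustration, not for real read lengths.

```python
def infix_align(query, target):
    """Find the substring of target closest to query by edit distance.

    Returns (start, end, distance), where target[start:end] is the matched
    span. Gaps before and after the span in target are free.
    """
    # prev_d[j]: best distance for the previous query row ending at target j.
    # prev_s[j]: start position in target of that best alignment.
    prev_d = [0] * (len(target) + 1)
    prev_s = list(range(len(target) + 1))
    for i, q in enumerate(query, start=1):
        cur_d = [i]
        cur_s = [0]
        for j, t in enumerate(target, start=1):
            candidates = [
                (prev_d[j - 1] + (q != t), prev_s[j - 1]),  # match or mismatch
                (prev_d[j] + 1, prev_s[j]),                 # extra base in query
                (cur_d[j - 1] + 1, cur_s[j - 1]),           # extra base in target
            ]
            d, s = min(candidates)
            cur_d.append(d)
            cur_s.append(s)
        prev_d, prev_s = cur_d, cur_s
    end = min(range(len(target) + 1), key=lambda j: prev_d[j])
    return prev_s[end], end, prev_d[end]


def extend_duplex(duplex, template):
    """Replace the aligned span of the template with the duplex sequence."""
    start, end, distance = infix_align(duplex, template)
    return template[:start] + duplex + template[end:], distance
```

The returned distance can be used to reject pairs where the duplex read does not align well enough to the template to be stitched safely.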

vellamike reopened this on Jan 18, 2024
@bruzecruise
Author

Hi @minefield47
Sadly, I haven't had any time to try writing a consensus script.

BUT I have manually looked through 10 or so alignments of duplex reads and their simplex parents, and so far duplex reads seem to be largely the same size as their simplex parents. The worst case I could find was one where the simplex read could extend an 11,000 bp duplex read by another 400 bp. I'll update you if anything changes.
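
For anyone wanting to repeat this check across a whole run rather than by eye, here is a rough sketch. It assumes read lengths have already been collected into a dictionary keyed by read ID, and that duplex IDs have the form `<template_id>;<complement_id>`. `template_overhang` is a hypothetical helper name.

```python
def template_overhang(lengths):
    """Map each duplex read ID to (template length - duplex length).

    `lengths` maps read IDs to sequence lengths. Duplex reads are recognised
    by a semicolon in the ID, with the template ID before the semicolon.
    """
    overhang = {}
    for read_id, duplex_len in lengths.items():
        if ";" not in read_id:
            continue  # simplex read
        template_id = read_id.split(";", 1)[0]
        if template_id in lengths:
            overhang[read_id] = lengths[template_id] - duplex_len
    return overhang
```

Sorting the result by value surfaces the pairs where stitching would recover the most sequence.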

@minefield47

minefield47 commented Jan 26, 2024

@vellamike Apologies for the delay, it has been a busy few days. I started doing some simple statistical analysis of the duplexing and found a series of odd cases in which a read is used as the template for one duplex read, whose length is close to the original read length, while also being used as the complement for a different duplex read. For instance, this 10kb read is used as the complement for a template of ~700bp. Is this expected behavior?
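
A small sketch of how such cases could be listed, assuming duplex read IDs have the form `<template_id>;<complement_id>`. `dual_role_reads` is a hypothetical name.

```python
def dual_role_reads(duplex_ids):
    """Return simplex read IDs used as a template in one duplex pair
    and as a complement in another."""
    templates, complements = set(), set()
    for duplex_id in duplex_ids:
        template_id, complement_id = duplex_id.split(";", 1)
        templates.add(template_id)
        complements.add(complement_id)
    return templates & complements
```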

[image]

Thank you!

@minefield47

minefield47 commented Jan 27, 2024

@bruzecruise
No worries at all, thank you for sharing. Another developer shared the code that determines pairs (https://github.com/nanoporetech/dorado/blob/master/dorado/read_pipeline/PairingNode.cpp#L67), and based on a cursory look through my first library (we just finished our second and plan to continue for the next couple of months until we can generate a de novo genome), I get results similar to what you described when comparing template to the simplexes. The worst case by far is the example I showed above, in which a 10kb read was used as the complement for a read that was only ~700bp.
