-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplex read id in multiple duplex pairs #389
Comments
Hi @davidebolo1993 , thanks for flagging this! Most likely this is happening because our pairing heuristic is mis-identifying one of the pairs. The pairing algorithm takes into account multiple factors (read lengths of template and complement, their overlap, the position of the overlap, etc.). We're constantly improving those heuristics (we'll be releasing a newer version soon), but some false positive pairs might get through. In most cases these false positive pairs eventually gets discarded due to low q-score. I'm wondering if that's also the case for you. What mean q-score do you see for these 2 duplex reads that share a common simplex ancestor? The |
Hi @tijyojwad, Here you have it:
There is definitely one in higher quality. Do you have any threshold suggested for filtering these results for the time being ? Maybe when a simplex has higher quality than one of the duplex (if multiple ones) ? Just asking, because I don't think a fixed threshold will work. Thanks, Davide |
Oh this is interesting - I would have expected one of them to be quite low (< 10), so setting a Are there quite a few of these? I would expect these to be quite rare, so for now a heuristic to just keep the pair with the higher qscore should be sufficient. Once we release the new version (within a week) it'd be great if you can rerun and see if the problem persists. |
We will have a look at this with the most recent version and give a feedback! Thanks, Davide |
Any update on this @davidebolo1993 ? |
Closing this as there has been no updates in a month - please re-open if you're issue is still ongoing |
@HalfPhoton @tijyojwad I can give an update as I am currently having the same issue! Here are my current "duplicate" reads from my two libraries in which a read is being used as both a template and a complement. I additionally added on the parent simplex read length and quality. Let me know if you need anything else! Previously, I had a false positive rate (about ~10 for the second image/library). However, after seeing #617, I swapped from utilizing the Turing GPUs available to me, I have since re-basecalled my libraries using the [email protected] & [email protected] with Dorado V0.5.1+a7fb3e3 on A100 GPUs. Thank you, |
Hi @minefield47 thank you for reporting the issue. The underlying cause is likely the same one highlighted in #389 (comment) . I think a reasonable heuristic is to pick the one with higher duplex Q score as the "correct" pair and discard the other one. We've considered applying some heuristics to pick the "right" one in case of duplicates, but due to its rare occurrence we haven't prioritized it yet.
Is this 10 out of the whole dataset? What percentage is that?
We're investigating the highlighted discrepancy, but in general I would expect very minor differences in the output. Are you seeing significant changes after moving to A100? What were you running on before? |
Additionally, I am currently running the dorado duplexing and then running the adapter/primer trimming function from these reads to create a trimmed bam file. Something I just noticed is that the file size of my individual channels do not seem to be changing. Does the duplex function automatically do adapter/primer trimming or is this not expected behavior? |
Hi @minefield47 - this is super useful info, thank you so much!
Another option might be to just remove the lower quality duplex read that has a duplicated parent and keep the simplex reads as they are. this way you reduce the chance of using an incorrect duplex pair but still get the support from reads covering that region.
I think it's somewhat species and pipeline dependent. With duplex reads we're already correcting some of the errors so there will be fewer errors to polish later. |
Also, @minefield47 would you be able to share what the "random GPU" in your run was? And just to confirm, both runs used the exact same pod5/model the only difference was the GPU used right? |
It wasn't a single GPU model. I used an array of 512 jobs with IMB's LSF where each pod5 channel was assigned to either an RTX2080, GTX1080, A10, A30, or A100 depending upon what was available on the cluster at that moment. So channel 1 might have been called on an A100, but channel 2 might have been called on a RTX2080, etc. After sequencing completed and everything was called, I removed my stderr/stdout so I no longer know what channels were called on which GPU.
Yes. Everything else was completely identical, I have the model downloaded to the cluster and basecall using it to prevent waiting for the model to download on every channel. |
I am going to look into the two duplex reads/their parents and see where the differences are. An idea my PI just had was to keep these "false positives" in during assembly and see where they get mapped to. They might be from areas that are repetitive or the leading/trailing end difference between the two simplex reads might be high A/T regions that broke apart at different locations during library prep and we are seeing true differences. We will find out!
Sounds great, we have the compute power and are dealing with a genome that is expected to have a lot of repetitive elements. As such, we are going to run a bunch of different versions and get some comparisons. I asked this a bit ago above but wanted to put it here so it does not get lost:
Thank you for the quick replies! You and the rest of the team have been amazing help these past few weeks getting everything up and running. |
Are you working only with the duplex reads? In that I can the adapters will be automatically trimmed off as a consequence of the overlaps created during pairing. But if you have simplex reads in your dataset as well then I'd expect some trimming after running |
When I asked this, I was going off memory of the previous libraries called on the ensemble of GPUs, which I have since removed off the cluster, so cannot examine it anymore. Anyways, converted the untrimmed to fastq and just started randomly searching for trimmed reads in the untrimmed reads and the random 10 simplex only reads all got trimmed. Sorry for the confusion! |
No problem!
the default output mode is BAMs, so both should be binary outputs. |
Hi @tijyojwad, I finally found time to get back to you on this. Our facility re-basecalled the experiment using dorado v.0.5.2 this time. The .ubam from dorado duplex has 10288340 entries. Among these, we have:
In the perfect scenario the number of simplex having a duplex offspring would be double the number of duplex (so, I expected this to be 1093682). The reason dx:i:-1 tagged reads are less than expected by just doubling the dx:i:0 reads is because 31825 simplex with duplex offspring appears in more than a duplex pair. In particular:
There are a couple of things that look weird to me here. fa789c2c-50d6-478e-aa11-d214495148f7 0 19
fa789c2c-50d6-478e-aa11-d214495148f7 0 19
fa789c2c-50d6-478e-aa11-d214495148f7 0 19
fa789c2c-50d6-478e-aa11-d214495148f7 -1 19
fa789c2c-50d6-478e-aa11-d214495148f7;cd0ab1a3-86cd-4663-a552-d71a5c8eb5ce 1 22
fa789c2c-50d6-478e-aa11-d214495148f7;cd0ab1a3-86cd-4663-a552-d71a5c8eb5ce 1 22
fa789c2c-50d6-478e-aa11-d214495148f7;cd0ab1a3-86cd-4663-a552-d71a5c8eb5ce 1 22
fa789c2c-50d6-478e-aa11-d214495148f7;cd0ab1a3-86cd-4663-a552-d71a5c8eb5ce 1 22 These are read_id dx_tag and qs_tag for one of the read appearing in 4 duplex pairs - the duplex pair is always the same but appears 4 time in the alignment (is there a reason for this? Do you have a sense if this somehow affects downstream applications?). Also, the same read here is tagged as simplex with no duplex offspring, simplex with duplex offspring and appears as duplex as well. I thought that 0 and -1 tags were mutually exclusive but it doesn’t seem the case from this example.
Does this make sense? Can you elaborate a bit on this? Thanks for your time, Davide |
@davidebolo1993 This is a super useful analysis, and some of it is certainly not expected. A simplex read can end up in multiple duplex reads, but it should be a really low number (e.g. in your case is ~0.3%). And in fact we put in some changes to limit each simplex from occurring in at most 2 duplex reads (i.e. it either pairs with a read immediately preceding it or succeeding it). So a bug has crept in here which needs to be looked at.
Indeed should be mutually exclusive. We will take a look at this. |
Hey @tijyojwad, any updates on this ? Thanks, Davide |
Hi @davidebolo1993 - unfortunately not yet, but I'm starting to look into this now. Sorry for the delay! |
This will take a bit more time @davidebolo1993 , but hopefully we'll have some fixes in place for the next major release. |
Hi guys,
we are investigating our first duplex run. I've read useful discussions in #316 and #327 but couldn't find an obvious explanation to what we see.
From the docs and issues we see that a simplex read (let's call this
r
) that are also part of a duplex pair (let's call itd
) is taggeddx:i:-1
and the corresponding duplex pair (d
, indeed) is in the formr,t
(r
andt
are the read names) and is tagged asdx:i:1
.An example is this read here (
0ee988dd-2227-47f7-ab19-99acfc66d686
), with the corresponding tags.So far so good, and indeed most of the simplex reads having a duplex pair follow this scheme.
There are, however, simplex reads (
dx:i:-1
) that have multiple duplex pairs, so that the readr
appears in a first duplexr,t
and in a second duplexq,r
.An example is this read here(
d69f94b2-51d2-4c61-8c3b-7104c6cccc2a
):What is happening here ? Are the other 2 ids basically referring to the same template read
d69f94b2-51d2-4c61-8c3b-7104c6cccc2a
but are partial duplex of 2 different part of it ? Something like this (#327 (comment)) but at different ends? I'm just guessing as I couldn't find anything related to this - sorry if I missed it.Thanks,
Davide
The text was updated successfully, but these errors were encountered: