Is it possible to use isONclust to help determine whether two reads came from the same isoform? #17
Hi Kristoffer,

I understand that each cluster generated by isONclust represents all reads that came from the same gene.

However, I was looking for a de novo tool/method that could determine whether two overlapping reads (that share a set of matching minimizers) came from the same isoform, for ONT reads. So I was wondering whether there is any way I could use isONclust to help determine this? Or is it possible to use isONclust in an alternative or modified way to cluster reads that came from the same isoform?

Since ONT reads have high error rates, the unmatched gaps/regions (with no matching minimizers) between two matched/overlapping reads may be caused either by sequencing errors or by alternative splicing, so it seems challenging to determine whether two reads came from the same isoform for ONT reads. I would appreciate your advice.

Thank you very much!
Hi @lauraht, You are right that it can be challenging to distinguish between structural differences and errors accumulated in regions using only the minimizers. There is no tailored way to do this in isONclust. However, if runtime is not an issue for you (e.g., relatively small datasets), you can probably do what you want by setting the mapped and aligned thresholds strictly, or even so strictly that the base-level alignment is always invoked. At the base-level alignment, distinguishing between structural differences and read errors tends to be much easier. To go further, there is also an experimental parameter you can try in this run setting. Best,
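As a minimal sketch of such a run, assuming the thresholds being discussed are isONclust's --mapped_threshold and --aligned_threshold options and that denser minimizer sampling is controlled with --k and --w (the values below are illustrative, not the ones from the original comment):

```bash
# Illustrative only: force the base-level alignment step for essentially every
# candidate read by using a strict mapped threshold, and sample minimizers
# densely with small k and w.
isONclust --ont \
          --fastq reads.fastq \
          --outfolder isonclust_out \
          --k 8 --w 10 \
          --mapped_threshold 1.0 \
          --aligned_threshold 0.8
```

The idea (as discussed later in the thread) is that a mapped threshold of 1.0 effectively means the minimizer mapping alone never suffices, so the alignment step is invoked for every candidate.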
Hi Kristoffer, Thank you so much for your advice! I really appreciate it. I will give it a try then. Thank you very much again!
Hi Kristoffer, I tried a dataset with 450 ONT reads. This dataset contains a query read (let's call it read-1) and its matched reads (449 reads). These matched reads were obtained by running minimap2 read-to-read overlaps. I then ran isONclust on this dataset with several different threshold settings; the first setting was (0.95, 0.95) and the last was (1.0, 0.8).

My goal is to identify the matched reads that came from the same isoform as read-1. I looked further into the reads that are in the same cluster as read-1, and found that in both the first setting (0.95, 0.95) and the last setting (1.0, 0.8), the reads that share a cluster with read-1 have medium or low numbers of matching minimizers with read-1 (the numbers of matching minimizers were obtained from minimap2). However, the reads that have the highest numbers of matching minimizers with read-1 are not in the same cluster as read-1.

I was wondering why this is the case? I thought that matched reads sharing more minimizers with read-1 should be more likely to end up in the same cluster as read-1, since they are more likely to come from the same isoform. So I don't quite understand why the result is somewhat the opposite. I would appreciate your advice.

I am attaching the fastq file of these 450 reads. The first read in the file is read-1. read1_matched_reads.fastq.gz

Thank you very much for your help!
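For context, the all-vs-all overlap step described here is typically a minimap2 run like the following (file names are hypothetical):

```bash
# All-vs-all ONT read overlaps with minimap2; output in PAF format.
minimap2 -x ava-ont all_reads.fastq all_reads.fastq > overlaps.paf
```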
I think I know what is going on. First off, isONclust does not handle reverse complements; therefore, it will put all the forward-oriented reads in one cluster and the reverse-oriented reads in another cluster. I should implement an option to merge them. This is part of why you should get at least 2 clusters with isONclust, and I did get 2 clusters with your data.
This gives 2 large clusters (cluster 0 and 1) and some singletons. I then ran a second command to produce a fastq file for each separate cluster.
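As a sketch of this kind of run (a default ONT-mode clustering followed by isONclust's write_fastq subcommand, with placeholder paths, not necessarily the exact commands used above):

```bash
# Cluster the reads with default ONT settings.
isONclust --ont --fastq read1_matched_reads.fastq --outfolder isonclust_out

# Write one fastq per cluster (here: every cluster with at least 1 read) from
# the resulting final_clusters.tsv.
isONclust write_fastq --N 1 \
                      --clusters isonclust_out/final_clusters.tsv \
                      --fastq read1_matched_reads.fastq \
                      --outfolder isonclust_out/fastq_clusters
```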
For the parameters I suggested, you probably get very fragmented results because your reads contain the poly-A tails and adapters/barcodes after the poly-A sequences. The poly-A tails are of very different lengths (see your fast[q/a] files). Another way to see this is by taking an example read from each formed cluster and aligning them to human using blat. So, I would suggest running pychopper to remove the poly-A and the barcodes. This should result in higher similarity. You can also run isONcorrect on the two/four largest clusters after pychopper to get even higher identity. Also, your dataset is small enough to inspect by eye: if you convert the fastq files of isONclust clusters 0 and 1 from the above command, you can multi-align the reads in either cluster with seaview. This helps intuition.
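A rough sketch of the pychopper step suggested above; the exact interface varies between pychopper versions, so treat the invocation and file names as assumptions and check pychopper --help:

```bash
# Identify, orient, and trim full-length cDNA reads (removing the primers/barcodes
# and, for this use case, the poly-A tails); file names are placeholders.
pychopper read1_matched_reads.fastq read1_matched_reads.full_length.fastq
```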
As for the alignment coverage: consider a scenario where read1 (the representative) covers most of readX but does not cover the start of readX. In that scenario readX will have an alignment coverage of, say, 0.85, because read1 does not cover the start of readX. If we flip the scenario (with the roles of the two reads swapped), the coverage of read1 to readX may be 0.95.
The coverage gets worse if there are differences at the ends of the reads (poly-A tail lengths / barcodes), that is, end sequences coming from poly-A tails, adapters, etc. that differ between the two reads.
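To make the asymmetry concrete, here is one way to compute both coverages from minimap2 PAF overlap lines (columns 2-4 are query length/start/end, columns 7-9 are target length/start/end); this is an illustration only, not how isONclust itself computes coverage:

```bash
# For each overlap, print the fraction of the query covered and the fraction of
# the target covered; the two can differ a lot when one read extends past the
# other's ends (poly-A tails, barcodes, etc.).
awk 'BEGIN { OFS = "\t" }
     { qcov = ($4 - $3) / $2
       tcov = ($9 - $8) / $7
       print $1, $6, qcov, tcov }' overlaps.paf
```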
Also, about the number of shared minimizers: it is difficult to state anything about this. You can try a denser sampling of minimizers (as proposed in the previous message). Your long poly-A (poly-T) tails may get you into trouble here, though, as all or none of them may be selected as minimizers (depending on the implementation in minimap2 and isONclust, e.g., masking of frequent k-mers, etc.).
Hi Kristoffer, Thank you so much for your detailed explanation and advice! I will run pychopper to remove the poly-A and the barcodes then. I saw that you used particular k and w values; if I use the stricter thresholds, do the k and w settings also matter? According to what you explained about the alignment coverage w.r.t. the cluster representative, it seems that the clustering result may be impacted by which reads are used as the cluster representative. In this case, should I just use a looser coverage threshold? Thank you very much!
Hi @lauraht, The k and w are used to find initial candidate clusters, so yes, they matter. An alignment won't be tested between a read and a representative that share no (or very few) minimizers. I'm not sure whether it has any effect in your data, but since your dataset is so small, you can afford a very low k and dense sampling with a low w. And yes, a looser threshold should help with that. Best,
Thank you so much Kristoffer! I just wanted to confirm one thing with you about how that threshold is defined in isONclust, i.e., whether it is defined in the way I described above.

It seems that pychopper identifies, orients, and trims full-length cDNA reads. I need to keep and trim all reads, including both full-length and non-full-length reads, and I also have a dataset of direct RNA reads. So I was wondering whether you know of a good alternative tool that does trimming on all Nanopore reads? I also looked into porechop, which seems to trim adapters for Nanopore reads without being limited to full-length reads. But it appears that porechop only trims adapters (it does not mention poly-A tails), and porechop is no longer maintained, so it might be a bit out of date.

Thank you so much for all the help!
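For reference, a basic porechop run looks like the following (adapter trimming only; as noted above it does not handle poly-A tails, and the file names are placeholders):

```bash
# Trim adapters from all Nanopore reads, full-length or not.
porechop -i all_reads.fastq -o all_reads.trimmed.fastq --threads 8
```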
Yes, you are right, it is defined differently in isONclust, in the way you are describing it. Unfortunately, I don't know of such a tool. Someone should write one. Perhaps reach out to the author of porechop to see if it supports your scenario?
Hi Kristoffer, Currently this small dataset (450 reads) takes ~10 seconds to run through isONclust using 85 cores in parallel, which is fast. However, I have 800,000 datasets of this kind, of roughly similar sizes (some have more reads, some fewer), so the estimated runtime for finishing all of them would be ~92.6 days, which does not seem feasible with our current computing resources. If I could reduce the runtime to 1-2 seconds per dataset, it would become feasible.

In fact, I do not need to cluster all the reads in each dataset; I only need to find out which reads are in the same cluster as read-1 (the read of interest). In this case, if I treat read-1 as the representative of a cluster and only determine which reads belong to this single cluster using the alignment threshold, I guess the runtime would be reduced, right? I was wondering whether there is a way to use isONclust for this reduced usage, or whether it would be possible to use some part of isONclust to do this partial clustering?

Another question: given that a dataset with 450 reads took ~10 seconds to fully cluster with the alignment always invoked (using 85 cores), what would be the estimated runtime for a dataset of 800,000 reads with the same settings (using 85 cores)?

Thank you so much for all your help!
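For reference, the ~92.6-day figure is just the per-dataset time scaled up to 800,000 datasets:

```bash
# 800,000 datasets x ~10 s each, converted to days (86,400 s per day).
echo "800000 * 10 / 86400" | bc -l    # ~92.6
```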
Hi @lauraht, I suggest you run each instance with one core and instead start multiple isONclust jobs in parallel (85 jobs if you have 85 cores). There is a lot of overhead in starting 85 processes (cores) with Python, and I wouldn't advise parallelizing any dataset under 100k reads. In fact, I ran a quick test on your dataset with a single core. If your dataset above is an average dataset, and you can successfully parallelize the calls to isONclust (using e.g. a shell script or similar), the estimated total runtime becomes much more manageable. Best,
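As a sketch of the shell-level parallelization being suggested (GNU parallel is one option; the per-dataset file layout and the --t flag for the core count are assumptions to check against isONclust's help):

```bash
# Run one single-core isONclust job per dataset, 85 jobs at a time.
# Assumes one fastq per dataset under datasets/, one output folder per dataset.
ls datasets/*.fastq | parallel -j 85 \
    'isONclust --ont --t 1 --fastq {} --outfolder isonclust_out/{/.}'
```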
Thank you so much Kristoffer! I tried your suggestion. I also need to subset the fastq file for the reads matched with the read of interest, out of the original 800,000-read fastq file, to form the small dataset that isONclust runs on. This subsetting also takes a couple of seconds, and it is performed sequentially 800,000 times while reading/processing the minimap2 output for the 800,000 reads (one small dataset per read).

So I was considering an alternative: instead of running isONclust on each of the 800,000 small datasets, I could run isONclust on all 800,000 reads together. I think running isONclust on a small dataset makes it easier to find the reads from the same isoform, since the reads in the small dataset were already identified by minimap2 as matches to the read of interest; from the isONclust output I could then further identify the reads that came from the same isoform as the read of interest. Now, if I run isONclust on all 800,000 reads together with the same alignment threshold, would the resulting clusters still be useful for this purpose? I have started running isONclust on all 800,000 reads together to see how it goes.

I would appreciate your advice, and thank you so much for all the help!
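A sketch of the per-read subsetting step described here, assuming the minimap2 overlaps are in PAF format and using seqtk for the extraction (tool choice and file names are illustrative):

```bash
# Collect the IDs of reads overlapping read-1 (PAF: column 1 = query name,
# column 6 = target name), plus read-1 itself.
awk -v r="read-1" '$1 == r { print $6 } $6 == r { print $1 }' overlaps.paf \
    | sort -u > read1_matched_ids.txt
echo "read-1" >> read1_matched_ids.txt

# Extract those reads from the full fastq to form the small per-read dataset.
seqtk subseq all_reads.fastq read1_matched_ids.txt > read1_matched_reads.fastq
```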
Because of the relatively stringent criteria that you apply (the strict alignment thresholds), running on all 800,000 reads together will likely give very fragmented clusters. Perhaps you could run with default parameters on all 800,000 reads instead.
Hi Kristoffer, Thank you so much for your detailed explanation and suggestion! One thing I noticed is that the 85 cores were only used at the beginning; later on just 3-4 cores were used, and currently it is running on only 2 cores. Another thing: I was wondering whether it would be possible to add an option supporting this use case; should I open a separate issue for that? Thank you very much for your help!
This is expected: the number of cores in use decreases with each clustering pass (by a factor of 2). Perhaps the optimal number of cores for your data is either 16 or 8 (or at least an even power of 2). Ok, I will consider this. Yes, please open a separate issue; just copy-pasting the text is sufficient. Thanks!