The prediction results showed a low recall #50
Supplementary question:
Hi:
In my opinion, because the sequence composition changes, the alignment information changes as well, so it is hard to successfully predict the known positive sites of the input.
Loss in the mapped squiggle for RNA reads is also compounded by the poly(A) tail being absent from the Guppy-called FASTQ.
I guess the loss curve is from the second round of training? It looks very similar to mine. I see greater data loss during the first round, and it produces a curve very similar to the one from the walkthrough_modbase exercise.
So you have encountered a similar situation, my friend? :)
Yep, this is from the second round.
Sorry for the slow response! I hope I can help you to diagnose the problem.
Ok, good. If this was the first round of training I would say something looks wrong, but for refining a model this looks fine. The plots showing the mapped reads also look ok. Let's assume the training is ok for now and check the basecalling.
The example you gave has an accuracy of about 77%, which is... not great. One possibility is that the default chunk size used by basecall.py is causing problems. Could you try changing it?
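For reference, a minimal sketch of how one might measure that accuracy across the whole training set, assuming "accuracy" here means alignment identity and that the minimap2 alignments of the basecalls are available (the file name and the NM-based identity estimate below are my own assumptions, not from the thread):

```python
# Rough per-read basecall identity from an alignment file (sketch only).
# The NM tag holds the edit distance, so 1 - NM / aligned_length gives an
# approximate identity for each primary alignment.
import pysam

def read_identities(aln_path):
    idents = []
    with pysam.AlignmentFile(aln_path) as aln:
        for rec in aln:
            if rec.is_unmapped or rec.is_secondary or rec.is_supplementary:
                continue
            nm = rec.get_tag("NM")  # edit distance to the reference
            idents.append(1.0 - nm / max(rec.query_alignment_length, 1))
    return idents

if __name__ == "__main__":
    ids = read_identities("basecalls.bam")  # hypothetical path
    print("mean identity: %.1f%%" % (100 * sum(ids) / max(len(ids), 1)))
```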
Thanks a lot!!! I will try to alter the parameter and evaluate the results. Keep in touch.
Update:
Uh, it seems to be getting worse...
Would it be possible to produce a ROC or precision-recall curve for this threshold? While in theory these values correspond to log probabilities, in practice they often require calibration. See some discussion on this topic in the megalodon docs here.
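As an illustration of that suggestion (not code from the thread): a minimal sketch of a precision-recall curve over candidate sites, assuming the per-site modified-base scores (e.g. extracted from basecalls.hdf5) and the 0/1 ground-truth labels have already been collected into two arrays; the file names are placeholders.

```python
# Sketch: precision-recall curve for choosing a modified-base score threshold.
# `scores` = per-site modified-base score, `labels` = 1 for known modified
# sites and 0 otherwise (both arrays are assumed to be prepared beforehand).
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

scores = np.load("site_scores.npy")   # placeholder file of per-site scores
labels = np.load("site_labels.npy")   # placeholder file of 0/1 truth labels

precision, recall, thresholds = precision_recall_curve(labels, scores)
print("PR-AUC: %.3f" % auc(recall, precision))

# Pick the threshold that maximises F1 instead of trusting the raw
# log-probability cutoff, since the scores may not be well calibrated.
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-9)
best = int(np.argmax(f1))
print("best threshold %.3f -> precision %.3f, recall %.3f"
      % (thresholds[best], precision[best], recall[best]))
```

Plotting recall against precision from this output should show whether a different cutoff recovers more of the known sites.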
@marcus1487 Thank you for your advice! I'll read the megalodon docs later.
Update:
A number of improvements to the training scripts have been implemented in the latest release. In this latest release I have found that a ...
Hi:
As I said in the previous issues, I'm trying to use taiyaki to train a modified-base RNA model, but I found that the prediction results showed a low recall. I'll go through the process I used to get the results.
1. Follow the instructions to train a modified-base model (about 140k reads covering 2000 transcriptome modification sites, approximately 1.5 modified bases per read).
2. Use the model to basecall the training set itself again:
   ```
   basecall.py --device 0 --modified_base_output basecalls.hdf5 ${trainning_set_reads} training2/model_final.checkpoint > basecalls.fa
   ```
3. Map basecalls.fa to the transcriptome with minimap2 to get an alignment file:
   ```
   minimap2 -t 8 -ax splice -uf -k14 ${transcriptpme} ${workspace}basecalls_reversed.fa > ${workspace}basecalls.sam
   ```
   Then apply a threshold to obtain the modified-base coordinates on the reads (by handling the basecalls.hdf5 file).
4. Convert the coordinates of the modified bases from read coordinates to transcriptome coordinates (process the CIGAR in the alignment file using r.get_reference_positions of pysam); a sketch of this step is given after the list.
5. Calculate how many of the 2000 transcriptome modification sites are covered by the called modified bases.

I found that only about 6% of the positions were recalled. I don't know what the problem is; maybe my test method is not suitable, or there is a problem with the training.
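To make steps 4-5 concrete, here is a minimal sketch of the coordinate conversion and recall calculation (my own illustration, not the code actually used): it uses pysam's get_aligned_pairs, which resolves the CIGAR into (read position, reference position) pairs, in place of get_reference_positions. The read_mod_positions mapping (read name to modified-base read coordinates from basecalls.hdf5) and the known_sites set are assumed to be prepared separately.

```python
# Sketch of steps 4-5: lift per-read modified-base coordinates onto the
# transcriptome and compute recall against the known modification sites.
import pysam

def reference_mod_sites(aln_path, read_mod_positions):
    """read_mod_positions: dict mapping read name -> set of read coordinates."""
    called = set()
    with pysam.AlignmentFile(aln_path) as aln:
        for rec in aln:
            if rec.is_unmapped or rec.is_secondary or rec.is_supplementary:
                continue
            mod_pos = read_mod_positions.get(rec.query_name)
            if not mod_pos:
                continue
            # (read_pos, ref_pos) pairs for aligned bases; indels are skipped.
            for qpos, rpos in rec.get_aligned_pairs(matches_only=True):
                if qpos in mod_pos:
                    called.add((rec.reference_name, rpos))
    return called

def site_recall(called_sites, known_sites):
    """Fraction of known (transcript, position) sites hit by at least one call."""
    return len(called_sites & known_sites) / len(known_sites)
```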
Unfortunately, this downstream analysis software (megalodon, linked below) does not support RNA data. So, does downstream analysis of taiyaki output currently require users to do it themselves?
https://github.com/nanoporetech/megalodon