Error when trying to variant call using an R10 model #1059

hasindu2008 · 2023-02-17T04:02:39Z

Hi Jared,

I have trained a basic k-mer model for R10.4.1 using your R10 branch in nanopolish, using the recent R10.4.1 base models provided by ONT. I am trying to evaluate some variant calls using that, but I am getting the following error

nanopolish variants -r reads.fastq -b reads.bam -g ~/scratch/hg38noAlt/hg38noAlt.fa -p2 --models-fofn=a.txt
Error: unknown model:

The contents of a.txt looks like:

../r10-models/r10.4.1_400bps.nucleotide.9mer.model

The head of the ../r10-models/r10.4.1_400bps.nucleotide.9mer.model is like:

#model_name     r9.4_450bps.nucleotide.6mer.template.model
#kit    r9.4_450bps
#k      9
#strand template
#alphabet       nucleotide
AAAAAAAAA       54.20490641     3.0     1.0     1.0
AAAAAAAAC       58.59079727     3.0     1.0     1.0
AAAAAAAAG       55.952068       3.0     1.0     1.0
AAAAAAAAT       58.43352009     3.0     1.0     1.0
AAAAAAACA       63.65995928     3.0     1.0     1.0

I initially had the following but changed the name to R9.4 to see if the name was the issue.

#model_name     r9.10.1_400bps.nucleotide.9mer.template.model
#kit    r10.4.1_40bps

Is there a way to test a new R10 9-mer model in nanopolish?

I manage to do an event alignment using f5c eventalign using this model and most reads successfully aligned. As I am not sure of the best way to check the accuracy of signal alignments, I thought perhaps using the nanopolish variant caller could be a solution. Your thoughts are welcome.

The text was updated successfully, but these errors were encountered:

jts · 2023-02-17T13:51:29Z

It would appear that nanopolish failed to parse the header for some reason. What you pasted looks fine, could you provide the model to me so I can look into this more closely?

hasindu2008 · 2023-02-17T23:27:27Z

Hi Jared,
Thanks for the quick response. The file is available here https://raw.githubusercontent.com/hasindu2008/f5c/r10/test/r10-models/r10.4.1_400bps.nucleotide.9mer.model

hasindu2008 · 2023-02-18T01:57:02Z

Also, when training for a large human dataset using the r10 branch, the nanopolish train seems to crash after some time:

/usr/bin/time -v nanopolish  train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam  -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../pcr/
[train] initialized r10_450bps for alphabet nucleotide for 9-mers
[train] round 0
nanopolish: src/nanopolish_train.cpp:515: void add_aligned_events_for_read(const ReadDB&, const faidx_t*, std::vector<StateSummary>&, const string&, const string&, size_t, std::unordered_map<unsigned int, int>&, const std::vector<BedMethylRecord>&, const bam_hdr_t*, const bam1_t*, size_t, int, int): Assertion `training_kit == sr.get_model_kit_name(strand_idx)' failed.

Any idea how this can be fixed? Now I am trying to extract chromosome 1 and try running only on it.

hasindu2008 · 2023-02-18T10:34:24Z

The, training for chr1 did not crash, but unfortunately none of the k-mers have been trained.

[train] initialized r10_450bps for alphabet nucleotide for 9-mers
[train] round 0
[bam process] processed 566251 bam records in 590.00s (960 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s
[train] round 1
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 1.00s
[train] round 2
[bam process] processed 566251 bam records in 587.00s (965 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 1.00s
[train] round 3
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s
[train] round 4
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s

Could be some parameters? Incase you want to reproduce, all the necessary files are here as a tarball. To reproduce:

tar xf nanopolish-train.tar 
cd nanopolish-train/PGXX22563_pcr/
../slow5-nanopolish-train/nanopolish  train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam  -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../pcr/

This uses the SLOW5 integrated nanopolish r10 branch so we can do the training fast when signal files are residing on NFS and @hiruna72 has separately sent a PR.
Note: The training worked for R9.4.1 data when existing nanopolish R9.4 were used as base models (though the accuracy got worse because the existing pore-model for R9.4 was anyway good)

jts · 2023-02-22T16:11:10Z

So, going back to the original issue when I run with your model I get this error:

Error, unrecognized model kit r10.4.1_400bps

This is because the metadata in nanopolish hasn't been set up for R10.4.1:

https://github.com/jts/nanopolish/blob/r10/src/pore_model/nanopolish_model_names.cpp#L17

If I change the model kit to be r10_450bps the model loads correctly but all the reads are discarded, likely because the signal calibration fails. This is probably why no training happens on chr1. I will try to look into why calibration fails.

hasindu2008 · 2023-02-23T00:46:52Z

Hi Jared,

After renaming as you said, the pore model seems to load. I added some debug prints and the alignment step seem to fail.

However, when I run f5c event align with the same model, most reads align and pass calibration (see attached if they look reasonable). Is the adaptive banded event alignment algorithm in the r10 branch different to the one that used to be in Nanopolish main branch (that I used 3 years ago when writing f5c)? I briefly looked at the r10 code and they look considerably different.

summary.tsv.gz
events.tsv.gz

jts · 2023-02-23T13:30:04Z

The r10 branch training method re-uses the adaptive banded alignment code, which I made more generic to support this use. I thought it was functionally equivalent but I will look into it.

jts · 2023-02-23T13:54:57Z

I've tried your f5c r10 branch on some local R10.4.1 data I generated but no reads aligned:

[meth_main] total entries: 4238, qc fail: 0, could not calibrate: 0, no alignment: 4238, bad fast5: 0

Could you provide me with the small set of reads that you used to generate eventalign.tsv and summary.tsv?

hasindu2008 · 2023-02-23T22:06:22Z

Hi Jared,

You can download a little dataset from here. Commandline should like like:

./f5c eventalign -b test/hg2_lsk114_reads_1000/PGXX22394_reads_1000_6.4.2_sup.bam -r test/hg2_lsk114_reads_1000/PGXX22394_reads_1000_6.4.2_sup.fastq -g hg38noAlt.fa --slow5 test/hg2_lsk114_reads_1000/PGXX22394_reads_1000.blow5 --kmer-model test/r10-models/r10.4.1_400bps.nucleotide.9mer.model

The output is like:

[meth_main] total entries: 1076, qc fail: 2, could not calibrate: 0, no alignment: 156, bad fast5: 0

hasindu2008 · 2023-02-23T22:07:14Z

@jts Could you also provide me with a little portion of your dataset? Would like to investigate what the difference is?

jts · 2023-02-24T13:36:01Z

Ah, f5c does work for me - I was providing the model incorrectly before.

jts · 2023-02-24T15:16:46Z

I needed to make two changes in 5b6ac67 to make the reads load:

the R10 pore detection was failing due to the new flowcell type
the event detection parameters that I used for R10.0 do not seem to work for R10.4.1, so I have gone back to the r9 ed parameters for now (which I assume f5c is using)

jts · 2023-02-24T15:18:11Z

Note that the r10 branch is now far out of date with master. Rebasing it seems messy so I will likely re-implement R10.4.1 support from scratch

hasindu2008 · 2023-02-24T23:06:06Z

Cool. Now it seems to actually do training some k-mers when I tested with chr1. Now let me run with the whole genome. For a PCR'ed dataset, does the following command look right to you?
nanopolish train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../kmers/

hasindu2008 · 2023-03-03T22:21:58Z

Hi @jts

After 5 training rounds, roc plots for snps look like this.

Do you think the accuracy has saturated and would it be beneficial to go for more rounds?
Or any idea how we could improve the accuracy further - for instance parameters and thresholds?

hasindu2008 mentioned this issue Feb 27, 2023

Training an R10.4.1 methylation model #1064

Closed

hasindu2008 closed this as completed Mar 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error when trying to variant call using an R10 model #1059

Error when trying to variant call using an R10 model #1059

hasindu2008 commented Feb 17, 2023

jts commented Feb 17, 2023

hasindu2008 commented Feb 17, 2023

hasindu2008 commented Feb 18, 2023

hasindu2008 commented Feb 18, 2023

jts commented Feb 22, 2023

hasindu2008 commented Feb 23, 2023

jts commented Feb 23, 2023

jts commented Feb 23, 2023

hasindu2008 commented Feb 23, 2023

hasindu2008 commented Feb 23, 2023

jts commented Feb 24, 2023

jts commented Feb 24, 2023

jts commented Feb 24, 2023

hasindu2008 commented Feb 24, 2023

hasindu2008 commented Mar 3, 2023

Error when trying to variant call using an R10 model #1059

Error when trying to variant call using an R10 model #1059

Comments

hasindu2008 commented Feb 17, 2023

jts commented Feb 17, 2023

hasindu2008 commented Feb 17, 2023

hasindu2008 commented Feb 18, 2023

hasindu2008 commented Feb 18, 2023

jts commented Feb 22, 2023

hasindu2008 commented Feb 23, 2023

jts commented Feb 23, 2023

jts commented Feb 23, 2023

hasindu2008 commented Feb 23, 2023

hasindu2008 commented Feb 23, 2023

jts commented Feb 24, 2023

jts commented Feb 24, 2023

jts commented Feb 24, 2023

hasindu2008 commented Feb 24, 2023

hasindu2008 commented Mar 3, 2023