CRISPR prediction assertion error #356

Open
augustkx opened this issue Dec 13, 2024 · 4 comments

@augustkx
augustkx commented Dec 13, 2024

Hi,

I am running Bakta on some metagenomic samples. I installed Bakta via conda (conda install -c conda-forge -c bioconda bakta); the Bakta version is 1.9.4.

I keep running into the following error. I would appreciate it very much if you could take a look.

AssertionError
Traceback (most recent call last):
  File ".../miniconda3/envs/env_Bactopia/bin/bakta", line 10, in <module>
    sys.exit(main())
  File ".../miniconda3/envs/env_Bactopia/lib/python3.8/site-packages/bakta/main.py", line 210, in main
    genome['features'][bc.FEATURE_CRISPR] = crispr.predict_crispr(genome, contigs_path)
  File ".../miniconda3/envs/env_Bactopia/lib/python3.8/site-packages/bakta/features/crispr.py", line 107, in predict_crispr
    assert spacer_seq == spacer_genome_seq  # assure PILER-CR provided sequence equals sequence extracted from genome

Here is the last row of the debug log:

1:42:45.561 - DEBUG - CRISPR - spacer: array-id=12, start=6103, stop=6112, genome-seq=TACGGCAAAT, spacer-seq=TTACGGCAAA
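
A side note on the log line above: the two sequences are not unrelated, they overlap with a one-base offset. The sketch below is only an illustration of how to check that; `find_shift` is a hypothetical helper, not part of Bakta or PILER-CR, and it says nothing about why the offset arises.

```python
def find_shift(reported, extracted, max_shift=3):
    """Return k if `extracted` looks like `reported` slid by k bases.

    k > 0: the extracted window starts k bases downstream of the reported one.
    k < 0: the extracted window starts k bases upstream.
    0: identical. None: lengths differ or no small shift explains the mismatch.
    """
    if len(reported) != len(extracted):
        return None
    if reported == extracted:
        return 0
    n = len(reported)
    # Never shift by the full length, otherwise two empty strings would "match".
    for k in range(1, min(max_shift, n - 1) + 1):
        if reported[k:] == extracted[:n - k]:
            return k
        if extracted[k:] == reported[:n - k]:
            return -k
    return None


# Values copied from the debug line: spacer-seq (PILER-CR) and genome-seq (extracted).
shift = find_shift("TTACGGCAAA", "TACGGCAAAT")
print(shift)
```

For the pair in the log this returns 1, i.e. the sequence extracted from the genome matches the PILER-CR sequence moved one base downstream.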

Here is the crispr.txt:

Array 12

contig_1397
Pos Repeat %id Spacer Left flank Repeat Spacer
========== ====== ====== ====== ========== ================================================ ======
912 48 100.0 29 GAGCTTAGCA ....................................-........... TCGTAGATGAAACTATTGTAAATAGCGGT
989 48 97.9 29 AAATAGCGGT ...............................C....-........... TTCGATGCAAAGACAGAGTACGCTAATCT
1066 48 100.0 30 ACGCTAATCT ....................................-........... AATCGACACTAACGTTTGAGTTTGGAGTGG
1144 48 100.0 30 TTTGGAGTGG ....................................-........... TATAATCTTGATATATTATATAAGTCCTTA
1222 48 100.0 30 TAAGTCCTTA ....................................-........... TTTACACCAGCGTTAAGCAGAGAATCAAGC
1300 48 100.0 30 AGAATCAAGC ....................................-........... CGGTCGCAAATATGCTTCTTTGCTCTATGA
1378 48 100.0 30 TGCTCTATGA ....................................-........... TGAATTAGGGTATATTGCACACATTCCAAT
1456 48 100.0 30 ACATTCCAAT ....................................-........... AATTTTGACAAATCAACACCTTTAGATACG
1534 48 100.0 29 TTTAGATACG ....................................-........... TCGTACATTATATCGATAGTACCTTTGTT
1611 48 100.0 29 TACCTTTGTT ....................................-........... TAAAGTTATCATCAACAAAGCCCTTATCA
1688 48 100.0 30 GCCCTTATCA ....................................-........... ACGCACGTCATCACTTAGTGATTGACTTTG
1766 48 100.0 30 ATTGACTTTG ....................................-........... CTTCAATACGATATTTTTTGACTTGATTAA
1844 48 100.0 29 ACTTGATTAA ....................................-........... TGTAATATAAGGCTAGTATTTCACCTGCT
1921 48 100.0 29 TTCACCTGCT ....................................-........... ACGTGAAATCATTGATACACTTTATATCA
1998 48 100.0 29 CTTTATATCA ....................................-........... GTTATCATCAACCACATCATCCGGCTGTG
2075 48 100.0 30 TCCGGCTGTG ....................................-........... TCATTTTCTTTAATTTCTTGTATATCCTCG
2153 48 100.0 30 TATATCCTCG ....................................-........... AAAATTATCGTCAACAAATCCCTTGTCATA
2231 48 100.0 29 CCTTGTCATA ....................................-........... TGATTGCATGGCACTCTTATTGAGAGATG
2308 48 100.0 30 TTGAGAGATG ....................................-........... GACATACTTTGCGATGTTAATATTTCGTTA
2386 48 100.0 30 TATTTCGTTA ....................................-........... TTAATTTGGTTCTCTCGCTCCCTGTGGTAT
2464 48 100.0 30 CCTGTGGTAT ....................................-........... TAGAGAACTTAGAAACGTATAGTGAACTTA
2542 48 100.0 30 AGTGAACTTA ....................................-........... TGCGCCCAAACTCTCACCTCGTGCAATTCT
2620 48 100.0 29 GTGCAATTCT ....................................-........... CAGGTTTTGGAATGTTTCCAAACATAGGC
2697 48 100.0 30 AAACATAGGC ....................................-........... AAACGCACGCGCATTAGATATACGCGCGCG
2775 48 100.0 30 TACGCGCGCG ....................................-........... TTTCAGGGATTTTATTAAAAATTGTTCCTG
2853 48 100.0 30 ATTGTTCCTG ....................................-........... CATTTTTGTATGGTTAAACTTCCGCTTCCA
2931 48 100.0 30 TCCGCTTCCA ....................................-........... TGATACCGCAGGATTTCTCTATCAAACCAT
3009 48 100.0 30 ATCAAACCAT ....................................-........... AGTGGAGTGTTGCGATACACTGGTGAGGTG
3087 48 100.0 30 TGGTGAGGTG ....................................-........... GTCTGAACATTGCATCAATGGTATCTTGGC
3165 48 100.0 30 GTATCTTGGC ....................................-........... CCAAGTGCGTCAATTTCGGGCGTTGGCACT
3243 48 100.0 30 CGTTGGCACT ....................................-........... ATCATCGCCCTGGCAGTAACCAGAATTAGG
3321 48 100.0 30 CAGAATTAGG ....................................-........... TACACATGATAATACCATGTTCTTTTACAT
3399 48 100.0 29 TCTTTTACAT ....................................-........... GCTACAATTAAGCCGTCCTCTGGCTGAGG
3476 48 100.0 30 CTGGCTGAGG ....................................-........... CCCTATACCCCCTATATCTCATTATCGCTA
3554 48 100.0 29 ATTATCGCTA ....................................-........... CGCACGTCATCACTAAGGGATTGGCTTTG
3631 48 100.0 30 ATTGGCTTTG ....................................-........... CTGAGATACGTTTTTGCTCTCTACGTTTCT
3709 48 100.0 30 CTACGTTTCT ....................................-........... CTCTTGGACATCAATATTATTATAGTCCAA
3787 48 100.0 29 TATAGTCCAA ....................................-........... ACCATTTAACACTAACATCGTGAGATAGG
3864 48 100.0 30 GTGAGATAGG ....................................-........... ATCAACGCTACCATCTTTAGCAGGTACATA
3942 48 100.0 30 CAGGTACATA ....................................-........... GCGCACGCGCAAAGAAGTCTACGCGCGCAC
4020 48 100.0 30 ACGCGCGCAC ....................................-........... CCAGACCATTATGATCAGTTGGAAGAAGCT
4098 48 100.0 30 GGAAGAAGCT ....................................-........... GCCTTCCTGCCTCCGATTTGAAAGGAATTG
4176 48 100.0 29 AAAGGAATTG ....................................-........... TTATCCTGAATCTGTGTAATATAGTATGT
4253 48 100.0 30 TATAGTATGT ....................................-........... TCTTCATTTGGGAAGCCGTCTTCCTTTGCA
4331 48 100.0 30 TTCCTTTGCA ....................................-........... AGTGATGCAGGTAACCCCGGTTATCGATGT
4409 48 100.0 29 TTATCGATGT ....................................-........... TGATAATTTATCCTTAACCTCAGCTGATT
4486 48 100.0 30 TCAGCTGATT ....................................-........... GCTCACCAGTATTAGGGTTTATCCAGTAGG
4564 48 100.0 30 ATCCAGTAGG ....................................-........... AAGCGAAAGCAGAAAGTGTAAGAGATACCT
4642 48 100.0 29 AGAGATACCT ....................................-........... ACATACTTTGAGAAGTTAATATTTGTTTA
4719 48 100.0 30 TATTTGTTTA ....................................-........... GTACACCACCGGTTAAGCTGCCAGAATTCA
4797 48 100.0 29 CCAGAATTCA ....................................-........... GAGGAAAATACTTGTCATACCACGCCTTA
4874 48 100.0 30 CCACGCCTTA ....................................-........... TAATGTTATAACTGTCGTCTTGATTTTTCA
4952 48 100.0 30 TGATTTTTCA ....................................-........... ATATTTAATTTATTAAATTTGAATAACACC
5030 48 100.0 30 GAATAACACC ....................................-........... GCAAATTCTTTACCGCCGTAGCTGCCATTT
5108 48 97.9 30 GCTGCCATTT ....................................T........... ATAGAGGATGCAGCGTTGCCGAAAGCACCT
5186 48 100.0 30 GAAAGCACCT ....................................-........... TGTTGGCTGTATTCTTATAGATAGCCATTC
5264 48 100.0 30 ATAGCCATTC ....................................-........... TGAGAGTAGCGCTTTTTGTTAATACCGACT
5342 48 100.0 30 AATACCGACT ....................................-........... TCTGTTAAAATAGCTTAAATGTGGGTACTT
5420 48 100.0 30 GTGGGTACTT ....................................-........... AGAATAGCTGCAATACCTCGTCTATATTCT
5498 48 100.0 30 TCTATATTCT ....................................-........... CACTGCAATGGATACAATGGTAAGGGAAAA
5576 48 100.0 491 TAAGGGAAAA ....................................-........... CTAATAATTTGTATGTATCCGAATATTATGGTTGTGATTTGCTTGAAAATTCGTACCTTTGTGGTATCAGCAACAACAAGGAATGAAACGTGGGGATGCGAGAAGTGGTTGTGATTTGCTTGAAAATTCGTACCTTTGTGGTATCAGCAACAACGTTCAACGTTCTCTATGTTGGTAGAGATGGGTTGTGATTTGCTTGAAAATTCGTACCTTTGTGGTATCAGCAACAACCGCGCACGCGCAAAGAAGTCTACGCGCGCAGTTGTGATTTGCTTTGAAAATTCGTACCTTTGTGGTATCAGCAACAACTACGTCCTCCAATGATTGAGTTTTTATTTAGTTGTGATTTGCTTGAAAATTCGTACCTTTGTGGTATCAGCAACAACATGCCTGCTTTGCGCCCCTTTGATGTTTCAGTTGTGATTTGCTTGAAAATTCGTACCTTTGTGGTATCAGCAACAACGACGAAACAGGCAATTCTACCCCCACGA
6115 48 100.0 ACCCCCACGA ....................................-........... TTACGGCAAA
========== ====== ====== ====== ========== ================================================
62 48 37 GTTGTGATTTGCTTGAAAATTCGTACCTTTGTGGTA-TCAGCAACAAC

augustkx added the bug label Dec 13, 2024
@oschwengers
Owner

Hi @augustkx, thanks for reporting this. PILER-CR is a constant source of novel edge cases. To take a deeper look into this, could you provide me with the sequence of the contig that encodes this array?
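
For readers who need to pull a single contig out of a multi-FASTA assembly for a report like this, here is a minimal sketch in plain Python. `read_fasta` and `extract_contig` are hypothetical helper names, and the sketch assumes the contig ID shown in crispr.txt matches the first token of a header in the FASTA file you search, which may not hold if contigs were renamed on import.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_contig(handle, contig_id):
    """Return one FASTA record as text, or None if the ID is not found."""
    for header, seq in read_fasta(handle):
        # Compare against the first whitespace-delimited token of the header.
        if header.split(maxsplit=1)[:1] == [contig_id]:
            return f">{header}\n{seq}\n"
    return None


# Toy input standing in for a real assembly file.
demo = io.StringIO(">contig_1 len=4\nACGT\n>contig_1397\nAC\nGT\n")
print(extract_contig(demo, "contig_1397"))
```

With a real file, pass `open("assembly.fasta")` (a placeholder name) instead of the `StringIO` object and write the returned text to a new file.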

@augustkx
Author

Hi, thank you for taking care of it! Here it is.
contig_1397.txt

@oschwengers
Owner

Hi, I tested v1.10.3 with the sequence you've provided. "Unfortunately", I cannot reproduce this error:

$ bakta --db ../databases/bakta/latest --debug --skip-trna --skip-tmrna --skip-rrna --skip-ncrna --skip-ncrna-region --skip-cds --skip-sorf --skip-ori --skip-gap --skip-plot --output test contig_1397.txt

Bakta v1.10.3
Options and arguments:
	input: .../bakta-gh-356/contig_1397.txt
	db: .../db-v5.1, version 5.1, full
	translation table: 11
	output: .../bakta-gh-356/test
	tmp directory: /tmp/tmp4dqhhfzh
	prefix: contig_1397
	threads: 8
	debug: True
	skip tRNA: True
	skip tmRNA: True
	skip rRNA: True
	skip ncRNA: True
	skip ncRNA region: True
	skip CDS: True
	skip sORF: True
	skip gap: True
	skip oriC/V/T: True
	skip plot: True

Parse genome sequences...
	imported: 1
	filtered & revised: 1
	contigs: 1

Start annotation...
skip tRNA prediction...
skip tmRNA prediction...
skip rRNA prediction...
skip ncRNA prediction...
skip ncRNA region prediction...
predict CRISPR arrays...
	found: 1
skip CDS prediction...
skip sORF prediction...
skip gap annotation...
skip oriC/T annotation...
apply feature overlap filters...
select features and create locus tags...
	selected: 1
improve annotations...
	revised gene symbols: 0

Genome statistics:
	Genome size: 6,130 bp
	Contigs/replicons: 1
	GC: 38.7 %
	N50: 6,130
	N90: 6,130
	N ratio: 0.0 %
	coding density: 83.6 %

annotation summary:
	tRNAs: 0
	tmRNAs: 0
	rRNAs: 0
	ncRNAs: 0
	ncRNA regions: 0
	CRISPR arrays: 1
	CDSs: 0
		hypotheticals: 0
		pseudogenes: 0
	sORFs: 0
	gaps: 0
	oriCs/oriVs: 0
	oriTs: 0

Export annotation results to: /home/oliver/tmp/bakta-gh-356/test
	human readable TSV...
	GFF3...
	INSDC GenBank & EMBL...
	genome sequences...
	feature nucleotide sequences...
	translated CDS sequences...
	feature inferences...
	skip generation of circular genome plot...
	machine readable JSON...
	Genome and annotation summary...

Could you please try v1.10.3 on this and test if the error still occurs?

oschwengers self-assigned this Dec 20, 2024
@augustkx
Author

augustkx commented Dec 21, 2024

Hi,

Thanks very much for testing!
I also ran Bakta on this specific contig alone and got no error. However, when I ran all the contigs together, the error still occurred.

So I need to check one thing:
Below are the last lines of the log file. Do they mean the error happened at this contig and location? And is the contig listed under Array 12 in crispr.txt the problematic one?

00:36:59.695 - DEBUG - CRISPR - repeat: array-id=12, start=6056, stop=6103
00:36:59.695 - DEBUG - CRISPR - spacer: array-id=12, start=6103, stop=6112, genome-seq=TACGGCAAAT, spacer-seq=TTACGGCAAA
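
To make the comparison in such a log line explicit, the following sketch pulls the fields out of a "CRISPR - spacer" debug line. The regular expression is inferred only from the two lines quoted above, so treat the format as an assumption; `parse_spacer_line` is a hypothetical helper, not a Bakta function.

```python
import re

# Pattern inferred from the quoted debug output; other Bakta versions may differ.
SPACER_LINE = re.compile(
    r"CRISPR - spacer: array-id=(?P<array_id>\d+), start=(?P<start>\d+), "
    r"stop=(?P<stop>\d+), genome-seq=(?P<genome_seq>[A-Za-z]+), "
    r"spacer-seq=(?P<spacer_seq>[A-Za-z]+)"
)


def parse_spacer_line(line):
    """Return the spacer fields as a dict, or None for non-spacer lines."""
    m = SPACER_LINE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("array_id", "start", "stop"):
        rec[key] = int(rec[key])
    # The two values the assertion in the traceback appears to compare.
    rec["match"] = rec["genome_seq"] == rec["spacer_seq"]
    return rec


line = ("00:36:59.695 - DEBUG - CRISPR - spacer: array-id=12, start=6103, "
        "stop=6112, genome-seq=TACGGCAAAT, spacer-seq=TTACGGCAAA")
rec = parse_spacer_line(line)
print(rec["array_id"], rec["start"], rec["stop"], rec["match"])
```

Running this over every spacer line of a debug log and keeping the records with `match` set to False would list the spacers whose two sequences disagree, assuming the log format above.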


Here is the crispr.txt file. PILER-CR itself seems to have finished running despite the assertion error (assert spacer_seq == spacer_genome_seq), doesn't it?

crispr.txt
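
One way to check which contig an array ID in crispr.txt belongs to is to read the line that follows each "Array N" header. The sketch below assumes the layout pasted earlier in this thread (an "Array N" line, then the contig name, optionally prefixed with ">"); `array_contigs` is a hypothetical helper.

```python
import re

ARRAY_HEADER = re.compile(r"^Array\s+(\d+)\s*$")


def array_contigs(lines):
    """Map each array ID to the contig name on the next non-empty line."""
    mapping = {}
    pending = None
    for raw in lines:
        line = raw.strip()
        m = ARRAY_HEADER.match(line)
        if m:
            pending = int(m.group(1))
            continue
        if pending is not None and line:
            tokens = line.lstrip(">").split()
            if tokens:
                mapping[pending] = tokens[0]
            pending = None
    return mapping


report = [
    "Array 12",
    "",
    "contig_1397",
    "Pos Repeat %id Spacer Left flank Repeat Spacer",
]
print(array_contigs(report))
```

With a real report, pass the open file object, for example `array_contigs(open("crispr.txt"))`, and look up the array ID from the debug log.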
