Encountering an issue when running AlleleCall #182

Open
kamivain opened this issue Aug 14, 2023 · 9 comments
@kamivain

Hello, I encountered an issue when running AlleleCall on my genomes. It fails with "AttributeError: 'NoneType' object has no attribute 'seq'". What could be causing this? Thank you!

$ chewBBACA.py AlleleCall -i bu_genome -g bu_schema/schema_seed/ --gl bu_result_wgMLST/cgMLST/cgMLSTschema99.txt -o bu_result251_cgMLST --cpu 2

chewBBACA version: 3.2.0
Authors: Rafael Mamede, Pedro Cerqueira, Mickael Silva, João Carriço, Mário Ramirez
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: [email protected]

==========================
chewBBACA - AlleleCall

Started at: 2023-08-13T22:39:06

Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: bu_schema/schema_seed/bu_train.trn
CPU cores: 2
BLAST path: /usr/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 251
Number of loci: 971

== CDS prediction ==

Predicting CDS for 251 inputs...
[====================] 100%

== CDS extraction ==

Extracting predicted CDS for 251 inputs...
[====================] 100%
Extracted a total of 1694809 CDS from 251 inputs.

== CDS deduplication ==

Identifying distinct CDS...identified 603928 distinct CDS.

== CDS exact matches ==

Searching for DNA exact matches...found 194185 exact matches (matching 38271 distinct alleles).
Unclassified CDS: 565657

== CDS translation ==

Translating 565657 CDS...
[====================] 100%
Identified 3633 CDS that could not be translated.
Information about untranslatable and small sequences stored in bu_result251_cgMLST/temp/invalid_cds.txt
Unclassified CDS: 562024

== Protein deduplication ==

Identifying distinct proteins...identified 296723 distinct proteins.

== Protein exact matches ==

Searching for Protein exact matches...found 5906 exact matches (22513 distinct CDS, 30655 total CDS).
Unclassified proteins: 290823

== Clustering ==

Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 81137 distinct minimizers for 971 loci.
Clustering proteins...
[====================] 100%
Clustered 290823 proteins into 984 clusters.
Clusters to BLAST: 984
[====================] 100%
Classifying clustered proteins...
[====================] 100%
Classified 11856 distinct proteins.
Unclassified proteins: 278967

== Representative determination ==

Iteration 1

Loci: 971
BLASTing loci representatives against unclassified proteins...done.
Traceback (most recent call last):
  File "/home/yao/.local/bin/chewBBACA.py", line 8, in <module>
    sys.exit(main())
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/chewBBACA.py", line 1545, in main
    functions_info[process][1]()
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/chewBBACA.py", line 528, in allele_call
    AlleleCall.main(genome_list, loci_list, args.schema_directory,
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2718, in main
    results = allele_calling(input_files, schema_directory, temp_directory,
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2510, in allele_calling
    locus_results = expand_matches(match_info, prot_index, dna_index,
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 1389, in expand_matches
    target_protein = str(pfasta_index.get(target_id).seq)
AttributeError: 'NoneType' object has no attribute 'seq'

@ramirma
Member

ramirma commented Aug 14, 2023

Dear @kamivain,

Thank you for your interest in chewBBACA. Please have a look at issue #176. I note that you are using Python 3.10. Although this should not be a problem, we advise using Python 3.9, which may also give clearer error reporting. The other potential problem is BLAST: if you are using a version newer than 2.9, please downgrade, because we know there are incompatibilities. If downgrading BLAST does not solve the problem, the cause may be the file names or contig names. Please look into the previous issues reported on this.

Best Regards,

Mario

@kamivain
Author

kamivain commented Aug 14, 2023 via email

@rfm-targa rfm-targa added the Status: In Progress Has been assigned and is being worked on. label Sep 6, 2023
@Fla1487

Fla1487 commented Dec 15, 2023

I have a similar problem. If I run the command on a subset of the genomes, it works, but when I run it on the remaining genomes the error appears again.

@rfm-targa
Contributor

Greetings @Fla1487,

Thank you for your interest in chewBBACA. Based on what you report, it might be related to issues in one or several input files (badly formatted files, special characters in the filename or sequence headers, etc). Updating to the latest version may also help, as it solves several issues in older versions. If you cannot find the cause of the issue, please share what's printed to the stdout, as it might include enough information to determine the type of issue.

Kind regards,

Rafael

@rfm-targa rfm-targa self-assigned this Dec 18, 2023
@artmisk13

artmisk13 commented Jul 3, 2024

Edit:
I have also tried setting Python to version 3.9 and BLAST to version 2.9, but that produced a new error (see file: allele_call_pubold12_err_new.txt).

Hi chewBBACA developer,

I encountered the exact same error, but only for a subset of my genomes. Initially, I tried to run AlleleCall on 2500 genomes, which failed with this error. I then ran AlleleCall separately on 4 batches of 600-700 genomes each; some batches succeeded, but others failed (N = 937 genomes). This is what I have done so far:

  • Kept all genomes (from both successful and failed AlleleCall runs) in the same directory
  • Checked that the fasta files exist and have the correct extension
  • Checked the fasta file names and the sequence headers of each fasta; none contain special characters
  • Updated chewBBACA to the latest version (through conda)

Below is the command I ran:
chewBBACA.py AlleleCall -i ./fasta/ -g ./cgmlst_scheme -o ./output_pub-old1-1_2-3 --output-unclassified --output-missing --output-novel --cpu 8

The error file and output file of this run are attached. Could you please look into this and suggest what I should do? Thank you very much!

Best,
Krisna
allele_call_pubold12_err.txt
allele_call_pubold12_out.txt

@rfm-targa
Contributor

Hello @artmisk13,

Thank you for reporting this issue. We know of more users who have encountered this bug under similar circumstances. Based on what users report, it should be related to a single input file or a set of input files. We have never run into the issue ourselves, nor managed to reproduce the error, even when users shared data, which is why we could not look into this properly. This error is strange because chewBBACA cannot get a sequence that should be in the FASTA file. Could you share a minimal test case that leads to the same error? For example, we can use the schema, a subset of the schema loci, and a genome to find and fix the issue. Any data you share with us is handled privately; we will only use it for bug fixing (you can upload a Zip with the data to WeTransfer and send the link to [email protected]).
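The failure mode in the traceback can be illustrated with a minimal sketch: a dictionary-style lookup with `.get()` returns `None` when an identifier is absent, and accessing `.seq` on that result raises the `AttributeError`. The plain `dict`, the `Record` type, and the `find_missing_ids` helper below are hypothetical stand-ins for the indexed protein FASTA, not chewBBACA's own code.

```python
from collections import namedtuple

# Hypothetical stand-in for a FASTA record; only the `seq` attribute matters here.
Record = namedtuple("Record", ["id", "seq"])


def find_missing_ids(index, target_ids):
    """Return the target IDs that have no entry in a dict-like index."""
    return [tid for tid in target_ids if index.get(tid) is None]


# Toy index: one looked-up ID is absent from the index, mimicking an
# identifier that was altered somewhere between the FASTA and the search.
index = {"genomeA-protein1": Record("genomeA-protein1", "MKT")}
targets = ["genomeA-protein1", "123-protein1"]

missing = find_missing_ids(index, targets)
print(missing)

try:
    str(index.get("123-protein1").seq)
except AttributeError as error:
    message = str(error)
    print(message)
```

Listing the missing identifiers before accessing `.seq` would at least show which sequence IDs, and therefore which input files, are involved.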

Also, part of the problem might be related to the environment configuration. If you are using a conda environment to run chewie, can you run conda list -n <ENV_NAME> --export > package-list.txt to get the list of packages in the environment? If you share that file with us, we can create an environment with the same packages as yours, which might help us identify the issue if it is related to any specific package.

Lastly, the BLAST error "BLAST Database creation error: Multi-letters chain PDB id is not supported in v4 BLAST DB" should be related to input files whose unique identifier is short (4 or fewer characters) and numeric only (e.g. 123.fna has 123 as its unique identifier). It should work if you rename the files so that the identifier has more than 4 characters or includes a letter.
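A rough way to list input files that fit this description is sketched below. It assumes the unique identifier is the file name up to the first dot (as in the 123.fna example); `unique_id` and `flag_numeric_ids` are illustrative helpers, not part of chewBBACA.

```python
from pathlib import Path


def unique_id(filename):
    """Assumed rule: the identifier is the file name up to the first dot."""
    return Path(filename).name.split(".")[0]


def flag_numeric_ids(filenames, max_len=4):
    """Return (filename, identifier) pairs whose identifier is digits only
    and at most `max_len` characters long. Pass max_len=None to flag every
    digits-only identifier regardless of length."""
    flagged = []
    for name in filenames:
        uid = unique_id(name)
        if uid.isdigit() and (max_len is None or len(uid) <= max_len):
            flagged.append((name, uid))
    return flagged


# Hypothetical file names used only for illustration.
example = ["123.fna", "5108.fasta", "Hinf_5108.fasta", "20231.fasta"]
flagged = flag_numeric_ids(example)
print(flagged)
```

Passing `max_len=None` is the more cautious option, since it reports every digits-only identifier rather than only the short ones.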

Let us know if you can share some data and if changing the file names fixes the BLAST error.

Best regards,

Rafael

@artmisk13

artmisk13 commented Jul 8, 2024

Hi Rafael,

Thanks for your thorough explanation and suggestions, they are really helpful!

  1. Change file names:
    As you suggested, I renamed the fasta files whose names had only 3 characters, but the same error still occurred. When I renamed all fasta files so that every name contained letters, it finally worked! Out of curiosity, I then reran this with my initial chewie AlleleCall setup (ver 3.2, BLAST 2.14, python 3.9.16), and:
  • It worked when adding letters to the fasta file names with a "_" separator, i.e. 5108.fasta to Hinf_5108.fasta
  • It produced a different error (error_tmp.txt) when adding letters to the fasta file names with a "-" separator, i.e. 5108.fasta to Hinf-5108.fasta

So I'm guessing there is a problem somewhere in 1) reading the fasta files when the name has only numerical characters and 2) creating the "missing_classes.fasta" file when there is a '-' separator in the input fasta name (a problem in splitting the string variable?). The 2nd problem has probably been addressed in the newer chewie version. I hope this new information helps you further in debugging the AlleleCall module.
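The renaming that worked (adding a letter prefix with a "_" separator) can be sketched as a small script. This is only an illustration: `plan_renames` and `apply_renames` are hypothetical helpers, the rule "the part of the name before the first dot is digits only" is an assumption, and the demo runs on throwaway files in a temporary directory.

```python
import tempfile
from pathlib import Path


def plan_renames(directory, prefix, separator="_"):
    """Map each file whose name starts with a digits-only identifier
    to a prefixed name. Nothing is renamed here (dry run)."""
    plan = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.split(".")[0].isdigit():
            plan[path] = path.with_name(f"{prefix}{separator}{path.name}")
    return plan


def apply_renames(plan):
    """Rename the files, refusing to overwrite an existing target."""
    for source, target in plan.items():
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        source.rename(target)


# Demo on temporary files so that no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("5108.fasta", "Hinf_0001.fasta"):
        (Path(tmp) / name).write_text(">contig1\nATGC\n")
    plan = plan_renames(tmp, "Hinf")
    apply_renames(plan)
    renamed = sorted(p.name for p in Path(tmp).iterdir())

print(renamed)
```

Building the plan first and applying it separately makes it possible to review the mapping before any file is changed.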

  2. Share data for debugging:
    I'm happy to share the scheme and the genomes to help you debug the problem. The scheme is publicly available from PubMLST: H. influenzae cgMLST. The "unpublished" status is there only because the manuscript is still under peer review*. The genomes are also publicly available (curated complete genomes from NCBI); these are the isolate IDs, and you can download the contigs from PubMLST.

*Once the manuscript is accepted for publication, I am happy to upload the scheme to Chewie-NS so more people can use it!

Best,
Krisna

@rfm-targa
Contributor

rfm-targa commented Jul 9, 2024

Hello @artmisk13,

Thank you for sharing the details and data about the errors. It will help us a lot. We will probably change how IDs are processed internally to solve this kind of issue for good.
Uploading the schema to Chewie-NS would be great. Just let us know when you'd like to do it, and we'll add you as a contributor or upload it if you prefer us to handle that.
I will let you know when we have changed things.

Best regards,

Rafael

@rfm-targa
Contributor

Hello @artmisk13,

We released chewBBACA v3.3.9. This version includes changes to check if BLAST interprets input unique IDs as PDB chain IDs or if it modifies the IDs at all. We use makeblastdb to create BLAST databases (DBs) to search for matches based on lists of identifiers. To use the list of identifiers, we need to include -parse_seqids when creating the DBs, and that leads to the issue where BLAST modifies some of the sequence IDs in the FASTA used to make the DB. This is a problem when we cannot match the IDs recovered from the DB to those in the original FASTA file. To avoid this issue, chewBBACA will warn users about any input files whose unique IDs lead to the issue. To continue, users will have to rename the files. This is safer than accepting the files and changing/checking everything internally to ensure it works.
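The kind of check described here, comparing the IDs in the original FASTA with the IDs recovered from the DB, can be sketched in a few lines. This is not the v3.3.9 implementation: `read_fasta_ids` and `find_altered_ids` are hypothetical helpers, the list of DB IDs is assumed to have been exported separately, and the altered ID in the example is a made-up placeholder rather than actual BLAST output.

```python
def read_fasta_ids(lines):
    """Extract the first whitespace-delimited token of each FASTA header."""
    ids = []
    for line in lines:
        if line.startswith(">") and len(line.strip()) > 1:
            ids.append(line[1:].split()[0])
    return ids


def find_altered_ids(fasta_ids, db_ids):
    """Return (IDs in the FASTA but not in the DB, IDs in the DB but not in the FASTA)."""
    original, recovered = set(fasta_ids), set(db_ids)
    return sorted(original - recovered), sorted(recovered - original)


fasta_lines = [
    ">Hinf_5108-protein1 some description",
    "MKT",
    ">123-protein2",
    "MSL",
]
# Placeholder list standing in for IDs exported from the BLAST DB.
db_ids = ["Hinf_5108-protein1", "ALTERED_ID-protein2"]

lost, unexpected = find_altered_ids(read_fasta_ids(fasta_lines), db_ids)
print(lost)
print(unexpected)
```

Any ID reported in `lost` points to an input file that would need to be renamed before running AlleleCall.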
Let us know if the latest version identifies the input files in your dataset that caused the issue.

Kind regards,

Rafael
