FAA files do not always reflect the real order of the genes #64

Open
2 of 3 tasks
najwataib opened this issue Sep 26, 2023 · 2 comments

Comments

@najwataib

Is your feature request related to a problem? Please describe.
MSF relies on the order of the protein sequences in the FAA files to identify systems. However, the protein files downloaded from NCBI are sometimes not ordered according to their positions on the genome. This results in MSF missing some systems.

Describe the solution you'd like
One solution would be to use GFF files to retrieve the order of the proteins on the genome, and then either reorder the FAA files or use this information directly when processing the hits found by the HMM searches.
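
The reordering option could look roughly like the sketch below. It is not part of MacSyFinder; the function names are hypothetical, and it assumes a GFF3 file whose CDS rows carry a `protein_id=` attribute that matches the first word of each FASTA header. A protein that appears at several positions is placed at its first one, and proteins that are absent from the GFF are appended at the end.

```python
import re
from typing import Dict, List, Tuple


def read_fasta(text: str) -> Dict[str, str]:
    """Map each sequence ID (first word of the header) to its full record."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = [line]
        elif current is not None and line.strip():
            records[current].append(line)
    return {seq_id: "\n".join(lines) for seq_id, lines in records.items()}


def protein_order_from_gff(gff_text: str) -> List[str]:
    """Return protein IDs sorted by contig (order of appearance), then start."""
    contig_rank: Dict[str, int] = {}
    positions: Dict[str, Tuple[int, int]] = {}
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        match = re.search(r"(?:^|;)protein_id=([^;]+)", cols[8])
        if match is None:
            continue  # CDS row without a protein_id (assumed: no protein record)
        rank = contig_rank.setdefault(cols[0], len(contig_rank))
        key = (rank, int(cols[3]))
        pid = match.group(1)
        # Keep the lowest position for multi-part or repeated CDS entries.
        if pid not in positions or key < positions[pid]:
            positions[pid] = key
    return sorted(positions, key=lambda pid: positions[pid])


def reorder_fasta(fasta_text: str, gff_text: str) -> str:
    """Return the FASTA records sorted by their genomic position in the GFF."""
    records = read_fasta(fasta_text)
    ordered = [pid for pid in protein_order_from_gff(gff_text) if pid in records]
    seen = set(ordered)
    leftovers = [pid for pid in records if pid not in seen]
    return "\n".join(records[pid] for pid in ordered + leftovers) + "\n"
```

Sorting by contig first keeps the proteins of each replicon together, which matters if the order is used to detect neighbouring genes.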

Describe alternatives you've considered
Currently, I systematically reorder all the FAA files I download from NCBI.

Please complete the following information:

OS:

  • Linux
  • Windows
  • Mac

MacSyFinder Version:
macsyfinder2.1

@jpjarnoux

Hi!
I ran into the same problem and worked around it by using the translated_cds file from NCBI instead.
Be careful: that file also contains pseudogenes.
If you want to remove them, you can use this line:

awk '/^>/ {if (skip) skip=0; if (/pseudo=true/) skip=1} !skip' input_genome > genome_without_pseudo
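
For anyone who would rather do this filtering inside a Python pipeline, the same logic can be sketched as below. The function name is hypothetical, and it relies on the same assumption as the awk line: the headers of pseudogene entries contain the text `pseudo=true`.

```python
from typing import Iterable, Iterator


def drop_pseudogenes(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, skipping records whose header contains pseudo=true."""
    skip = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, on its header line.
            skip = "pseudo=true" in line
        if not skip:
            yield line


# Example usage (file names taken from the awk command above):
# with open("input_genome") as src, open("genome_without_pseudo", "w") as dst:
#     dst.writelines(drop_pseudogenes(src))
```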

@saphia
Contributor

saphia commented Sep 19, 2024

Hi Najwa and Jérôme,

This is actually a feature we may add in a future version of MacSyFinder: reordering the FASTA files. It would mean (optionally) requiring more input files than just the FASTA files (e.g. GFF files).
Stay tuned!

Best,

Sophie
