A Python wrapper that provides a convenient way to generate the K identifiers required for KEGG Mapper from assembled genome sequences.

sang-15/Mixing-Metabolomes

Overview

This is the repository created for the COMP483 project by group 2:
A Python wrapper that provides a convenient way to generate the K identifiers required for KEGG Mapper from assembled genome sequences.

Prerequisites

Included within wrapper

This wrapper uses SilentGene's prokka2kegg_batch to automatically convert the genes annotated by Prokka (.gbk files) into the required K ids.
The Python script and database required for prokka2kegg are included in the wrapper, so the user does not need to download or install anything for this part.

Running wrapper

-i: (required) Specify the FASTA source and taxonomy information for each sample in JSON format, as {"source": "taxonomy"} pairs
-e: (optional) Email address used for Biopython
-o: (optional) Name of the output folder for the run. By default, output is written to a folder named 'formatkresults'.
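
The flags above could be parsed roughly as follows. This is only a minimal sketch, not the code in formatk.py: the function name parse_args and the attribute names samples, email and outdir are hypothetical, and the real script may handle its arguments differently.

```python
import argparse
import json


def parse_args(argv=None):
    """Hypothetical parser mirroring the -i, -e and -o flags described above."""
    parser = argparse.ArgumentParser(prog="formatk.py")
    # -i takes a JSON object mapping each source (accession or FASTA path)
    # to its taxonomy; json.loads turns the string into a dict.
    parser.add_argument("-i", dest="samples", required=True, type=json.loads,
                        help='JSON such as {"GCA_002861225.1": "Escherichia coli"}')
    # -e is optional; no default email is assumed here.
    parser.add_argument("-e", dest="email", default=None,
                        help="email address used for Biopython")
    # -o is optional and falls back to the documented default folder name.
    parser.add_argument("-o", dest="outdir", default="formatkresults",
                        help="name of the output folder")
    return parser.parse_args(argv)


args = parse_args(["-i", '{"GCA_002861225.1": "Escherichia coli"}'])
print(args.samples, args.email, args.outdir)
```

Wrapping the JSON in single quotes on the command line, as in the examples below, keeps the shell from interpreting the double quotes inside it.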

GenBank assembly accession

The following commands run the wrapper on one or more GenBank assembly accessions:

python3 formatk.py -i '{"GCA_002861225.1": "Escherichia coli"}' 
python3 formatk.py \
-i '{"GCA_002861225.1": "Escherichia coli", "GCA_002861815.1": "Lactobacillus crispatus"}'

User specified FASTA

The following commands run the wrapper on user-specified FASTA files:

  • Running 1 sample: Escherichia_coli.FASTA
python3 formatk.py -i '{"DIR/Escherichia_coli.FASTA": "Escherichia coli"}' 
  • Running multiple samples: Escherichia_coli.FASTA and Lactobacillus_crispatus.FASTA
python3 formatk.py \
-i '{"DIR/Escherichia_coli.FASTA": "Escherichia coli", "DIR/Lactobacillus_crispatus.FASTA": "Lactobacillus crispatus"}' 

Mixing GenBank assembly accessions and user-specified FASTA

The following command runs the wrapper on a mix of GenBank assembly accessions and user-specified FASTA files:

  • Running multiple samples: GCA_002861225.1 and Lactobacillus_crispatus.FASTA
python3 formatk.py \
-i '{"GCA_002861225.1": "Escherichia coli", "DIR/Lactobacillus_crispatus.FASTA": "Lactobacillus crispatus"}' 
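
Because accessions and FASTA paths can share one JSON object, each key has to be told apart somehow. The sketch below shows one way this might be done. It is an illustration only: the function classify_source and the regular expression are assumptions (GenBank assembly accessions are taken to look like GCA_ followed by nine digits and a version number), and formatk.py may use a different rule.

```python
import re

# Assumed shape of a GenBank assembly accession, e.g. GCA_002861225.1
ACCESSION_RE = re.compile(r"^GCA_\d{9}\.\d+$")


def classify_source(key):
    """Return 'accession' for a GenBank assembly accession, else 'fasta'."""
    return "accession" if ACCESSION_RE.match(key) else "fasta"


samples = {
    "GCA_002861225.1": "Escherichia coli",
    "DIR/Lactobacillus_crispatus.FASTA": "Lactobacillus crispatus",
}
for source, taxonomy in samples.items():
    print(source, "->", classify_source(source), "/", taxonomy)
```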

With optional flags -o and -e

The following command runs the wrapper on GenBank assembly accession GCA_002861225.1, writes the results to a folder named 'TESTRUN' under the '$HOME/' directory, and uses your own email address (replace useremail):

python3 formatk.py -i '{"GCA_002861225.1": "Escherichia coli"}' -e useremail -o TESTRUN 
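
When there are many samples, typing the JSON by hand is error-prone. A small helper script can build the same command line from a dictionary. This is a sketch that only prints the command; the sample dictionary and the TESTRUN folder name are taken from the examples above, and nothing here is part of the wrapper itself.

```python
import json
import shlex

# Sources (accessions or FASTA paths) mapped to taxonomy, as in the examples above.
samples = {
    "GCA_002861225.1": "Escherichia coli",
    "DIR/Lactobacillus_crispatus.FASTA": "Lactobacillus crispatus",
}

# json.dumps produces the double-quoted JSON that -i expects.
cmd = ["python3", "formatk.py", "-i", json.dumps(samples), "-o", "TESTRUN"]

# shlex.join (Python 3.8+) adds the shell quoting, so the JSON survives copy and paste.
print(shlex.join(cmd))
```

The list in cmd can also be passed directly to subprocess.run, which avoids shell quoting altogether.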

Output

At the end of the process, each sample is renamed from its taxonomy information and its input file name or GenBank assembly accession, as 'genus_species_filename' or 'genus_species_GenBank_ID'.
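
The naming scheme can be illustrated with a short sketch. The function sample_label is hypothetical, and two details are assumptions rather than documented behavior: that the FASTA extension is dropped from the file name, and that spaces in the taxonomy become underscores.

```python
from pathlib import Path

# File extensions treated as FASTA in this sketch (an assumption).
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def sample_label(source, taxonomy):
    """Build 'genus_species_filename' or 'genus_species_GenBank_ID'."""
    path = Path(source)
    # Drop the directory, and the extension only if it is a FASTA one, so that
    # the version suffix of an accession such as GCA_002861225.1 is kept.
    name = path.stem if path.suffix.lower() in FASTA_SUFFIXES else path.name
    return "_".join(taxonomy.split() + [name])


print(sample_label("GCA_002861225.1", "Escherichia coli"))
print(sample_label("DIR/Lactobacillus_crispatus.FASTA", "Lactobacillus crispatus"))
```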

The wrapper generates a 'formatkresults' folder under the '$HOME/' directory (or a user-specified folder if the '-o' argument is supplied). The folder contains the following:

  • formatk_out.txt
    Final gene list required by KEGG Mapper
  • formatk_out_order.txt
    The order in which the samples appear in formatk_out.txt and in KEGG Mapper
  • Others
    • Assembled genome data downloaded from NCBI (if applicable)
    • Prokka folder
      Prokka-generated results for each entry
    • GBK folder
      Prokka-generated gbk output files for each entry
    • 2kegg
      SilentGene-generated results for each entry

Test data

  • Pathway mapping of two organisms (global/overview maps)
    Organism 1 #00cc33
    Organism 2 #ff3366
    Organisms 1 and 2 #3366ff

  • Pathway mapping of multiple organisms (regular maps)
    Organism 1 #bfffbf
    Organism 2 #ffbbcc
    Organism 3 #bbccff
    Organism 4 #cfffcf
    Organism 5 #ffcfef
    Organism 6 #cfefff
    Organism 7 #dfefcf
    Organism 8 #ffefcc
    Organism 9 #dfccff
    Organism 10 #dfdfcc
