This is the repository created for COMP483 Project by group 2:
A Python wrapper that provides a convenient way to generate the K identifiers required by KEGG Mapper from assembled genome sequences.
- SRA-Toolkit: for downloading FASTA from NCBI
- Prokka: for rapid prokaryotic genome annotation
This wrapper uses SilentGene's prokka2kegg_batch to automatically convert the genes annotated by Prokka (.gbk files) to the required K IDs.
The Python script and database required by prokka2kegg are included in the wrapper, so users do not need to download or install anything for this step.
-i: (required) Specify the FASTA source and taxonomy information for each sample in JSON format
-e: (optional) Email used for Biopython
-o: (optional) Name of the output folder for the run. By default, output is written to a folder named 'formatkresults'.
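The three flags above could be parsed along the following lines. This is a minimal sketch, not the actual code in formatk.py: the use of `argparse`, the function name `build_parser`, and the destination names `samples`, `email`, and `output` are all assumptions made for illustration.

```python
import argparse
import json


def build_parser():
    """Hypothetical parser mirroring the -i, -e and -o flags described above."""
    parser = argparse.ArgumentParser(prog="formatk.py")
    # -i takes a JSON object mapping each FASTA source to its taxonomy.
    # json.loads raises ValueError on bad input, which argparse reports as a usage error.
    parser.add_argument("-i", dest="samples", required=True, type=json.loads,
                        help="JSON object: FASTA source -> taxonomy")
    # -e is optional; None means no email was supplied.
    parser.add_argument("-e", dest="email", default=None,
                        help="email used for Biopython")
    # -o is optional; the default folder name comes from the README.
    parser.add_argument("-o", dest="output", default="formatkresults",
                        help="name of the output folder")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.samples, args.email, args.output)
```

Parsing the JSON at the argument level means a malformed `-i` value is rejected before any download or annotation step starts.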
The following commands run the wrapper with GenBank assembly accessions.
- Running 1 sample: GCA_002861225.1
python3 formatk.py -i '{"GCA_002861225.1": "Escherichia coli"}'
- Running multiple samples: GCA_002861225.1 and GCA_002861815.1
python3 formatk.py \
-i '{"GCA_002861225.1": "Escherichia coli", "GCA_002861815.1": "Lactobacillus crispatus"}'
The following commands run the wrapper with user-specified FASTA files.
- Running 1 sample: Escherichia_coli.FASTA
python3 formatk.py -i '{"DIR/Escherichia_coli.FASTA": "Escherichia coli"}'
- Running multiple samples: Escherichia_coli.FASTA and Lactobacillus_crispatus.FASTA
python3 formatk.py \
-i '{"DIR/Escherichia_coli.FASTA": "Escherichia coli", "DIR/Lactobacillus_crispatus.FASTA": "Lactobacillus crispatus"}'
The following command runs the wrapper with a mix of GenBank assembly accessions and user-specified FASTA files.
- Running multiple samples: GCA_002861225.1 and Lactobacillus_crispatus.FASTA
python3 formatk.py \
-i '{"GCA_002861225.1": "Escherichia coli", "DIR/Lactobacillus_crispatus.FASTA": "Lactobacillus crispatus"}'
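All of the input styles above go through the same `-i` JSON object, so the wrapper has to decide which keys are accessions to download and which are local files. How formatk.py actually does this is not documented here. One plausible approach, sketched below under that assumption, is to match each key against the GenBank assembly accession pattern (`GCA_`, nine digits, a version number) and treat everything else as a FASTA path. The names `ACCESSION_RE` and `split_samples` are hypothetical.

```python
import json
import re

# Assumed pattern for GenBank assembly accessions such as GCA_002861225.1.
ACCESSION_RE = re.compile(r"^GCA_\d{9}\.\d+$")


def split_samples(json_text):
    """Split the -i JSON object into (accessions, fasta_paths) dictionaries."""
    samples = json.loads(json_text)
    accessions, fasta_paths = {}, {}
    for source, taxonomy in samples.items():
        if ACCESSION_RE.match(source):
            accessions[source] = taxonomy   # would be downloaded from NCBI
        else:
            fasta_paths[source] = taxonomy  # treated as a local FASTA file
    return accessions, fasta_paths


if __name__ == "__main__":
    mixed = ('{"GCA_002861225.1": "Escherichia coli", '
             '"DIR/Lactobacillus_crispatus.FASTA": "Lactobacillus crispatus"}')
    print(split_samples(mixed))
```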
The following command runs the wrapper with a GenBank assembly accession (GCA_002861225.1) and your own email, and writes the results to a folder named 'TESTRUN' under the '$HOME/' directory.
python3 formatk.py -i '{"GCA_002861225.1": "Escherichia coli"}' -e useremail -o TESTRUN
At the end of the process, each sample is renamed based on its taxonomy information and its input file name or GenBank assembly accession, as 'genus_species_filename' or 'genus_species_GenBank_ID'.
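The naming scheme can be illustrated with a short sketch. The details are assumptions rather than the wrapper's confirmed behavior: here the taxonomy string keeps its original capitalization, spaces become underscores, and a FASTA path contributes its base name without the extension. The helper name `sample_label` is hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern for GenBank assembly accessions such as GCA_002861225.1.
ACCESSION_RE = re.compile(r"^GCA_\d{9}\.\d+$")


def sample_label(source, taxonomy):
    """Build 'genus_species_GenBank_ID' or 'genus_species_filename' for one sample."""
    genus_species = "_".join(taxonomy.split())
    if ACCESSION_RE.match(source):
        # Keep the accession whole, including its version suffix (".1").
        suffix = source
    else:
        # Assumption: drop the directory and the file extension.
        suffix = Path(source).stem
    return f"{genus_species}_{suffix}"


if __name__ == "__main__":
    print(sample_label("GCA_002861225.1", "Escherichia coli"))
    print(sample_label("DIR/Lactobacillus_crispatus.FASTA", "Lactobacillus crispatus"))
```

The accession check comes first because a naive extension strip would otherwise remove the version number from an accession.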
The wrapper generates a 'formatkresults' folder under the '$HOME/' directory (or a user-specified folder if the '-o' argument is supplied). The folder contains the following:
- formatk_out.txt: final gene list required by KEGG Mapper
- formatk_out_order.txt: the order of each sample in formatk_out.txt and KEGG
- Others
  - Downloaded assembled genome data from NCBI (if applicable)
  - Prokka folder: Prokka-generated results for each entry
  - GBK folder: Prokka-generated gbk files for each entry
  - 2kegg: SilentGene-generated results for each entry
- E. coli: GCA_002861225.1
- L. crispatus: GCA_002861815.1
- P. mirabilis: GCA_012030515.1
- Pathway mapping of two organisms (global/overview maps)
  - Organism 1: #00cc33
  - Organism 2: #ff3366
  - Organisms 1 and 2: #3366ff
- Pathway mapping of multiple organisms (regular maps)
  - Organism 1: #bfffbf
  - Organism 2: #ffbbcc
  - Organism 3: #bbccff
  - Organism 4: #cfffcf
  - Organism 5: #ffcfef
  - Organism 6: #cfefff
  - Organism 7: #dfefcf
  - Organism 8: #ffefcc
  - Organism 9: #dfccff
  - Organism 10: #dfdfcc
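When reading a colored pathway map, the color assignments listed above can be kept as a small lookup table. The sketch below simply encodes those values. The names are hypothetical, and it assumes that an organism's position corresponds to the sample order recorded in formatk_out_order.txt.

```python
# Colors for two-organism mapping on global/overview maps.
TWO_ORGANISM_COLORS = {
    "Organism 1": "#00cc33",
    "Organism 2": "#ff3366",
    "Organisms 1 and 2": "#3366ff",
}

# Colors for multi-organism mapping on regular maps, in organism order (1 to 10).
MULTI_ORGANISM_COLORS = [
    "#bfffbf", "#ffbbcc", "#bbccff", "#cfffcf", "#ffcfef",
    "#cfefff", "#dfefcf", "#ffefcc", "#dfccff", "#dfdfcc",
]


def regular_map_color(position):
    """Return the regular-map color for the organism at 1-based `position`."""
    if not 1 <= position <= len(MULTI_ORGANISM_COLORS):
        raise ValueError(f"position must be between 1 and {len(MULTI_ORGANISM_COLORS)}")
    return MULTI_ORGANISM_COLORS[position - 1]


if __name__ == "__main__":
    print(regular_map_color(1))
```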