This repository contains the ETL/scripts used to obtain data for the F. culmorum KnetMiner knowledge graph.
N.B. Some data is created or obtained externally, as mentioned in the manuscript; this includes eggNog, BLAST, and OMA data. These datasets are also processed by the f_culmorum_etl.py script present in this repository.
Python 3.7+ is a requirement for both scripts, as is R for rpy2.
Additional Python dependencies include the following:
For the biomart script, simply provide the directory you wish to have your data exported to. Run the following command (extend the directory argument as needed):
python fusarium_biomart.py /home/
For the f_culmorum_etl.py script, which is the main ETL script, a few boolean flags let you choose which data you obtain.
Please change the input file names to match any updated data, either directly in the script or by writing a config file for it. Some of the boolean flags are grouped because certain data types depend on other data being obtained or produced first, e.g. UniProt data is required for mapping to eggNog data.
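If you prefer a config file over editing the script, the idea could be sketched as follows. This is only an illustration: the script does not necessarily read a config file, and the function `load_input_names`, the dictionary keys, and the JSON layout are all hypothetical. Only the default file names are taken from this README.

```python
import json
from pathlib import Path

# Default input file names as given in this README.
# The dictionary keys are hypothetical labels chosen for this sketch.
DEFAULT_INPUTS = {
    "blast_f_culmorum": "results_f_culmorum.out",
    "blast_ascomycota": "all_uniprot_f_culmorum.out",
    "eggnog": "egg_nog_fusarium_filtered.tsv",
    "mutant_db": "fusarium_mutant_db.tsv",
}


def load_input_names(config_path=None):
    """Return input file names, applying overrides from an optional JSON file."""
    names = dict(DEFAULT_INPUTS)
    if config_path is not None:
        overrides = json.loads(Path(config_path).read_text())
        # Reject keys that do not correspond to a known input.
        unknown = set(overrides) - set(names)
        if unknown:
            raise KeyError(f"Unknown config keys: {sorted(unknown)}")
        names.update(overrides)
    return names
```

A config file containing `{"eggnog": "new_eggnog.tsv"}` would then replace only the eggNog input name and leave the other defaults unchanged.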
- You must add a base directory with the -b or --bdir flag, which is where all your files will be written to, or stored in the case of externally derived BLAST/eggNog/OMA outputs. ^^
- If you wish to download Ensembl-specific data, use the -e or ---ensembl flag. NOTE: this requires your BLAST data to be present in the BLAST folder within your base directory, as this data is further mapped. You will need to name your data according to what is present in the script: the F. culmorum BLAST output should be named results_f_culmorum.out, and the Ascomycota BLAST output should be named all_uniprot_f_culmorum.out. The resulting outputs are f_culmorum_phi_mapping.txt and f_fulculmorum_ascomycota_mapping.txt, containing mappings between the BLAST data and Ensembl data.
- If you wish to specifically transform any eggNog data (as described in the manuscript), use the -egg or --eggnog flag. You will need the eggNog output, named egg_nog_fusarium_filtered.tsv, which should be present in the eggNog folder. You will also need the base FASTA file, which in this case is fculmorumUK99vs_proteins.fa. The output is written to the BLAST folder and contains mappings between eggNog and BLAST proteins with gene names. Additionally, the UniProt data used will be downloaded into the uniprot directory.
- For mapping-only data, use the -m or --mapping flag. This obtains data which maps identifiers, be they gene or protein identifiers, to external identifiers. You will need to have performed the eggNog data transformation first.
- For additional gene name data, use the -n or --names flag. This requires the fusarium_mutant_db.tsv curated file from RRes, which must be present in the misc folder within the base directory. The output will be called fg_gene_names.txt in the misc folder.
- For mapping BLAST data (F. culmorum to PhiBase FASTA) to corresponding proteins in the core PhiBase database, so that phenotypes and diseases can be identified, use the -p or --phi flag. Note that the input files must be present in the phibase folder within the base directory, with the BLAST data named phibase_blast_raw.out and the mapping between PhiBase and F. culmorum named f_culmorum_phi_mapping.txt. The resulting files will be in the phibase folder, and include fusarium-phi-gene-mapping.txt and phibase-blast-filtered.txt.
- To obtain STRING data, set the -str or --string flag to true.
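The input requirements listed in the options above can be checked before a run. The following is a minimal sketch, not part of the ETL script: the function `missing_inputs` and the option labels used as dictionary keys are hypothetical, while the folder and file names are those given in the list. The FASTA file needed for the eggNog option is left out because its folder is not stated.

```python
from pathlib import Path

# Input files each option expects, relative to the base directory,
# as described in the option list. The keys are labels for this sketch only.
REQUIRED_INPUTS = {
    "ensembl": [
        "BLAST/results_f_culmorum.out",
        "BLAST/all_uniprot_f_culmorum.out",
    ],
    "eggnog": ["eggNog/egg_nog_fusarium_filtered.tsv"],
    "names": ["misc/fusarium_mutant_db.tsv"],
    "phi": [
        "phibase/phibase_blast_raw.out",
        "phibase/f_culmorum_phi_mapping.txt",
    ],
}


def missing_inputs(base_dir, options):
    """Return the relative paths of required input files that are absent."""
    base = Path(base_dir)
    missing = []
    for option in options:
        # Options without listed inputs (e.g. "string") require nothing here.
        for rel_path in REQUIRED_INPUTS.get(option, []):
            if not (base / rel_path).is_file():
                missing.append(rel_path)
    return missing
```

For example, `missing_inputs("/home/", ["ensembl", "phi"])` returns the files that still need to be placed in the BLAST and phibase folders before using those two options.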
^^ Note that the folder structure is created by the ETL script, but you may wish to create it before use so that dependent files can be placed in their respective folders (as outlined in the options above).
This includes:
'uniprot', 'BLAST', 'cyc', 'InterPro', 'eggNog', 'mapping', 'ensembl', 'agdb', 'string', 'OMA', 'biomart'
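To create these folders ahead of a run, something like the following could be used. It is a minimal sketch and not the ETL script's own code; the function name `create_folders` is hypothetical. It creates only the folders named in the list above, so the misc and phibase folders mentioned in the options would need to be added separately if they are not created by the script.

```python
from pathlib import Path

# Folder names exactly as listed in this README.
FOLDERS = [
    "uniprot", "BLAST", "cyc", "InterPro", "eggNog", "mapping",
    "ensembl", "agdb", "string", "OMA", "biomart",
]


def create_folders(base_dir):
    """Create each listed folder under base_dir and return the created paths."""
    base = Path(base_dir)
    created = []
    for name in FOLDERS:
        folder = base / name
        # exist_ok=True makes the call safe to repeat on an existing layout.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_folders("/home/")` with the same directory you later pass to -b or --bdir gives you a place to put the externally derived BLAST, eggNog, and OMA outputs.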