This is the Signature Analysis Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.
This pipeline performs network-based signature analysis on the columns of a given spreadsheet, where the spreadsheet's columns correspond to sample labels and its rows correspond to gene labels. The analysis correlates gene expression data (optionally network enriched) against known gene signature data.
There are four similarity "signature" methods that one can choose from:
- similarity (traditional method)
- net_similarity (with network enrichment)
- cc_similarity (with bootstraps)
- cc_net_similarity (with bootstraps and network enrichment)
and two correlation measures:
- spearman
- cosine
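To make the two correlation measures concrete, here is a minimal sketch in plain Python of how samples could be scored against signatures. It is not the pipeline's own code. The function names (`average_ranks`, `pearson`, `spearman`, `cosine`, `similarity_matrix`) and the toy data are assumptions made for illustration, and constant vectors (zero variance or zero length) are not handled.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def similarity_matrix(samples, signatures, measure):
    """Score every sample column against every signature column."""
    return {
        sample: {name: measure(values, sig) for name, sig in signatures.items()}
        for sample, values in samples.items()
    }


# Hypothetical toy data: each vector holds one value per gene (row).
samples = {"sample_1": [1.0, 2.0, 3.0, 4.0], "sample_2": [4.0, 3.0, 2.0, 1.0]}
signatures = {"signature_1": [10.0, 20.0, 40.0, 80.0]}
scores = similarity_matrix(samples, signatures, spearman)
```

Spearman depends only on the ordering of the gene values, while cosine depends on their magnitudes, so the two measures can rank signatures differently for the same sample.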
git clone https://github.com/KnowEnG-Research/Gene_Signature_Pipeline.git
apt-get install -y python3-pip
apt-get install -y libfreetype6-dev libxft-dev
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install pyyaml
pip3 install knpackage
pip3 install scipy==0.18.0
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install matplotlib==1.4.2
pip3 install scikit-learn==0.17.1
cd Gene_Signature_Pipeline
cd test
make env_setup
Command | Option |
---|---|
make run_spearman | spearman similarity |
make run_net_spearman | spearman similarity with network enrichment |
make run_cc_spearman | spearman similarity with bootstraps |
make run_cc_net_spearman | spearman similarity with bootstraps & network enrichment |
Follow steps 1-3 above then do the following:
mkdir run_directory
cd run_directory
mkdir results_directory
Look for example run_parameters files in Gene_Signature_Pipeline/data/run_files, such as zTEMPLATE_cc_net_spearman.yml.
Set processing_method to serial or parallel, depending on your machine.
processing_method: serial
Set the data file targets to the files you want to run, and adjust the parameters as appropriate for your data.
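Before launching a run, it can help to sanity-check the edited run parameters. The sketch below is an illustration rather than part of the pipeline. It assumes the run file has already been loaded into a Python dictionary (for example with pyyaml), and the helper `check_run_parameters`, the choice of required keys, and the rule that network methods need a network file are all assumptions based on the key names documented in this README.

```python
# Allowed values taken from the keys described in this README.
METHODS = ("similarity", "net_similarity", "cc_similarity", "cc_net_similarity")
PROCESSING_METHODS = ("serial", "parallel", "distribute")

# Assumption: these keys are needed by every run.
REQUIRED_KEYS = (
    "method",
    "similarity_measure",
    "spreadsheet_name_full_path",
    "signature_name_full_path",
    "results_directory",
    "processing_method",
)


def check_run_parameters(run_parameters):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in run_parameters:
            problems.append("missing key: " + key)

    method = run_parameters.get("method")
    if method is not None and method not in METHODS:
        problems.append("unknown method: " + str(method))

    processing = run_parameters.get("processing_method")
    if processing is not None and processing not in PROCESSING_METHODS:
        problems.append("unknown processing_method: " + str(processing))

    # Assumption: the network-enriched methods need a gene-gene network file.
    if method in ("net_similarity", "cc_net_similarity"):
        if "gg_network_name_full_path" not in run_parameters:
            problems.append("missing key: gg_network_name_full_path")
    return problems


# Hypothetical parameters; the paths are placeholders, not real files.
example_parameters = {
    "method": "cc_net_similarity",
    "similarity_measure": "spearman",
    "gg_network_name_full_path": "../data/networks/example.edge",
    "spreadsheet_name_full_path": "../data/spreadsheets/example.tsv",
    "signature_name_full_path": "../data/spreadsheets/example_signature.tsv",
    "results_directory": "./results_directory",
    "processing_method": "serial",
}
problems = check_run_parameters(example_parameters)
```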
- Update the PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
- Run
python3 ../src/gene_signature.py -run_directory ./ -run_file zTEMPLATE_cc_net_spearman.yml
Key | Value | Comments |
---|---|---|
method | similarity, cc_similarity, net_similarity or cc_net_similarity | Choose similarity method |
similarity_measure | spearman, cos | Choose correlation measure |
gg_network_name_full_path | directory+gg_network_name | Path and file name of the 4 col network file |
spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user supplied gene sets |
signature_name_full_path | directory+signature_data_name | Path and file name of user supplied signature data |
results_directory | directory | Directory to save the output files |
tmp_directory | directory | Directory to save the intermediate files |
rwr_max_iterations | 100 | Maximum number of iterations without convergence in random walk with restart |
rwr_convergence_tolerence | 1.0e-8 | Frobenius norm tolerance of spreadsheet vector in random walk |
rwr_restart_probability | 0.7 | alpha in V_(n+1) = alpha * N * Vn + (1-alpha) * Vo |
rows_sampling_fraction | 0.8 | Select 80% of spreadsheet rows |
number_of_bootstraps | 4 | Number of random samplings |
processing_method | serial or parallel or distribute | Choose processing method |
gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv
signature_data_name =
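The three rwr_ keys in the table above control an iteration of the form V_(n+1) = alpha * N * Vn + (1-alpha) * Vo. A minimal sketch of that update in plain Python, for a single vector, might look like the following. It is not the pipeline's implementation. The function names, the column normalization of the network, and the three-gene toy network are assumptions for illustration. For a single vector the Frobenius norm reduces to the Euclidean norm.

```python
import math


def column_normalize(adjacency):
    """Scale each column of a square matrix to sum to 1 (all-zero columns stay zero)."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    return [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(network, restart_vector, alpha=0.7,
                             max_iterations=100, tolerance=1.0e-8):
    """Iterate V_(n+1) = alpha * N * Vn + (1 - alpha) * Vo until the change is small.

    Returns the final vector and the number of iterations performed.
    """
    current = list(restart_vector)
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        updated = [
            alpha * sum(w * v for w, v in zip(row, current)) + (1 - alpha) * v0
            for row, v0 in zip(network, restart_vector)
        ]
        # Euclidean norm of the change between successive vectors.
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(updated, current)))
        current = updated
        if change < tolerance:
            break
    return current, iterations


# Hypothetical three-gene path network: gene 0 - gene 1 - gene 2.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
network = column_normalize(adjacency)
smoothed, steps = random_walk_with_restart(network, [1.0, 0.0, 0.0])
```

With a column-normalized network the total weight of the vector is preserved at every step, and the restart term keeps the result anchored to the starting vector Vo.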
- Each of the four methods saves a samples-by-signatures similarity ("correlation") matrix to a file named similarity_matrix_{method}{measure}{timestamp}_viz.tsv.
sample | signature 1 | ... | signature m |
---|---|---|---|
sample 1 | float | ... | float |
... | ... | ... | ... |
sample n | float | ... | float |
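As a sketch of how this output could be consumed, the snippet below reads a samples-by-signatures matrix and picks the highest-scoring signature for each sample. It assumes a tab-separated file whose first row holds the signature names (after a leading row-label cell) and whose first column holds the sample names. The function names and the in-memory example are hypothetical.

```python
import csv
import io


def read_similarity_matrix(handle):
    """Parse a tab-separated samples-by-signatures matrix into nested dicts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    signature_names = header[1:]  # the first header cell labels the row names
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {
            name: float(value) for name, value in zip(signature_names, row[1:])
        }
    return matrix


def best_signature(matrix):
    """Map each sample to the signature with the highest similarity."""
    return {sample: max(scores, key=scores.get) for sample, scores in matrix.items()}


# Hypothetical file content; a real run would open the *_viz.tsv file instead.
example_tsv = (
    "\tsignature_1\tsignature_2\n"
    "sample_1\t0.9\t0.1\n"
    "sample_2\t-0.2\t0.4\n"
)
matrix = read_similarity_matrix(io.StringIO(example_tsv))
best = best_signature(matrix)
```

To read an actual results file, pass an open file handle, for example `open(path, newline="")`, in place of the `io.StringIO` object.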