This is the repository for cnnGWAS, which is a method that captures complex biological patterns shared by GWAS risk variants. By utilizing convolutional neural networks (CNN), the method prioritizes functional variants among all candidate variants in GWAS loci.
- We recommend using miniconda as a package manager when executing cnnGWAS. Miniconda can be installed from the following link: https://docs.conda.io/en/latest/miniconda.html#
- All software tests were done under the following environment setting, and therefore we strongly recommend to set the miniconda environment via following commands:
conda create --name cnnGWAS python=2.7
conda activate cnnGWAS
conda install theano=1.0.3 numpy=1.15.4 matplotlib=2.2.3 pandas=0.24.2 bedtools=2.28.0 bedops=2.4.36 mkl=2019.4
library(ggplot2)
library(gplots)
library(grid)
library(gridExtra)
library(plyr)
library(reshape2)
awk (GNU Awk 3.1.7)
wget (GNU Wget 1.12 built on linux-gnu)
impG-summary (http://bogdan.bioinformatics.ucla.edu/software/impg/)
- When environment setting is complete, clone or download the repository. Through the following steps, feature DB will be downloaded and training will take place. Input text file (GWAS summary statistics) is required.
- Required data can be downloaded by executing the following commands:
cd DB
bash Download_DB.sh
cd ..
- Input summary statistics file should be located under directory "00_RawDatas/". Example input file "AUTsummary.hg19.txt" is provided.
- Input file must be a tab delimited text file with 6 columns.
- Each of the columns should have a header row named as follows:
snpid chr bp a1 a2 zscore
rs3131972 chr1 752721 A G 0.618384984132729
rs3131969 chr1 754182 A G 1.20540840742781
- All coordinates are based on hg19 (GRCh37).
- Raw GWAS summary statistics files can include different information; in most cases, z-score can be calculated from the given information.
- We recommend simple file prefix, preferably disease name (e.g. SLE), always followed by ".txt" extension.
- Underbar "_" must not be included inside a file prefix (e.g. PSY_AUT_dis.txt; not allowed), since it will cause string processing error.
- Example file names: "SLE.txt", "MDD2019.txt"
Following command executes the imputation process. Multiple summary statistics files can be given as an input (e.g. bash RUN.sh FILE1.txt FILE2.txt FILE3.txt).
cd 01_RunImpG
bash RUN.sh AUTsummary.hg19.txt
cd ..
In this stage, the input files (*.pkl.gz) for training are produced.
cd 02_PreProc
bash RUN.sh AUTsummary.hg19.txt
cd ..
In this stage, the training itself takes place. As an example, a sample input file is provided (02_PreProc/Data/AUTsummary.hg19_AE_30SNP.pkl.gz).
cd 03_RunCNN
bash RUN.sh AUTsummary.hg19.txt
cd ..
The performance of the model is saved as a pdf file. Also, the scores of candidate SNPs are calculated and given as a text file. SNPs with causal score > 0.5 are classified as positive, and are considered as candidate causal variants.
cd 04_Performance
bash RUN.sh AUTsummary.hg19.txt
cd ..
- Each folder contains "NOTE" or "RUN_NOTE.txt". Please read the contents before executing the software.