Huan ZHONG (hzhong5 at uab dot edu)
We have developed MethylXcan, a tool that predicts gene expression patterns from DNA methylation data and can help provide insight into the mechanisms behind these associations.
R 3.2.1 is recommended. The R packages "glmnet" and "methods" are also required; they will be installed automatically, so no further installation is needed. You only need to format the input files as described below and run one Perl script on them.
One tab-delimited annotation file listing, for each gene expression probe, the probe ID, gene name, official name, and chromosomal location. Here is one gene entry as an example.
ILMN_2038774 EEF1A1 NM_001402.5 chr6:74284964-74285013
**Note** This file should preferably have no header line.
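A quick format check can catch malformed annotation lines before a long run. The sketch below is not part of MethylXcan; the function names are hypothetical, and it assumes each line has exactly four tab-separated fields with the location written as "chrN:start-end", as in the example entry above.

```python
import re

# Matches a location such as "chr6:74284964-74285013".
_LOCATION = re.compile(r"^(chr[^:]+):(\d+)-(\d+)$")


def parse_location(text):
    """Split 'chrN:start-end' into (chromosome, start, end)."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return chrom, start, end


def check_annotation_line(line):
    """Validate one line: probe, gene name, official name, location.

    Assumption: four tab-separated fields, as in the README example.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    probe, gene, official_name, location = fields
    return probe, gene, official_name, parse_location(location)
```

Calling `check_annotation_line` on every line of the probe annotation file and stopping at the first `ValueError` is usually enough to find a stray header or a space-delimited line.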
One tab-delimited gene expression profiling dataset, containing the gene probes and their expression values (normalized, if the data come from microarrays) for the different samples.
Hybridization REF TWPID6598 TWPID3283 TWPID5553...
ILMN_1343291 16.043236443862 15.9458304153505 15.9085900238073...
One tab-delimited DNA methylation dataset, containing the CpG probes and their methylation values for the different samples. Beta values are required, and they should be normalized if the data come from microarrays.
Hybridization REF TWPID5259 TWPID8404 TWPID2116...
cg00240178 0.36676 0.38544 0.30756...
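Since beta values are required, every methylation entry should be a number between 0 and 1. The following sketch (a hypothetical helper, not shipped with MethylXcan) scans a methylation dataset and lists the entries that are not. It assumes the file is tab-delimited with one header line whose first cell labels the probe column, as in the example above; non-numeric entries such as "NA" are reported too.

```python
import csv


def find_invalid_betas(path):
    """Return (probe, sample, raw_value) for entries that are not in [0, 1]."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)      # first cell is the probe column label
        samples = header[1:]       # remaining cells are sample IDs
        for row in reader:
            if not row:
                continue           # skip blank lines
            probe, values = row[0], row[1:]
            for sample, raw in zip(samples, values):
                try:
                    value = float(raw)
                except ValueError:
                    problems.append((probe, sample, raw))
                    continue
                # NaN fails this comparison and is therefore reported as well.
                if not 0.0 <= value <= 1.0:
                    problems.append((probe, sample, raw))
    return problems
```

An empty list means every entry parsed as a number in the beta range. Values outside it often indicate that M-values or raw intensities were supplied instead.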
One tab-delimited CpG probe annotation file.
IlmnID CHR MAPINFO Strand UCSC_RefGene_Name UCSC_RefGene_Group
cg00240178 6 74232108 R EEF1A1 TSS1500
One tab-delimited gene annotation file.
chr strand txStart txEnd name
chr6 - 74225472 74230755 EEF1A1
perl run_gene_list.pl ex_probe_list.txt ex_dataset.txt me_dataset.txt methylation_annotation.txt data/gene_annotation.demo.txt
The final results are written to "MethylXcan.txt", which contains the following columns.
CpG: name of CpG probes.
n.site: number of CpG sites per gene.
gene: gene name.
beta.single: regression coefficient from the single regression of gene expression on the methylation of each of its CpG sites separately.
beta.multiple: regression coefficients from the multiple regression of gene expression on the methylation of all of its CpG sites simultaneously.
beta.glmnet: coefficient from the lasso regression of gene expression on the methylation ratios of its corresponding CpGs.
R2.single.max: the largest coefficient of determination from the single regressions of one gene.
R2.single.var: the variance of the R2 values from the single regressions of one gene.
R2.single.cv.max: the largest coefficient of determination from cross-validation of the single regressions.
R2.single.cv.max.var: variance of the coefficients of determination from cross-validation of the single regressions.
R2.multiple: coefficient of determination from multiple regressions.
R2.multiple.adjust: adjusted coefficient of determination from multiple regressions.
R2.multiple.cv: coefficient of determination from cross-validation of multiple regressions.
R2.multiple.cv.var: variance of the coefficient of determination from cross-validation of multiple regressions.
R2.glmnet: coefficient of determination from lasso regressions.
R2.glmnet.cv: coefficient of determination from cross-validation of lasso regressions.
R2.glmnet.cv.var: variance of coefficient of determination from cross-validation of lasso regressions.
p.single: p-value from single regression.
p.multiple: p-value for each CpG in the multiple regression.
p.multiple.overall: the overall p-value from multiple regressions.
genevar: variance of gene expression profiling between different samples.
dist: the distance between each CpG and the TSS of its corresponding gene.
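To make the "dist" column concrete, here is a minimal sketch of one way such a distance can be computed from the two annotation files. It is an illustration under assumptions, not MethylXcan's own code: it takes the TSS to be txStart on the "+" strand and txEnd on the "-" strand, and it returns an absolute distance. Whether MethylXcan reports a signed or an absolute value is not stated here.

```python
def tss_position(strand, tx_start, tx_end):
    """Return the TSS coordinate: txStart on '+', txEnd on '-' (assumed convention)."""
    if strand == "+":
        return tx_start
    if strand == "-":
        return tx_end
    raise ValueError(f"unknown strand: {strand!r}")


def cpg_to_tss_distance(cpg_pos, strand, tx_start, tx_end):
    """Absolute distance in bases between a CpG (MAPINFO) and the gene's TSS."""
    return abs(cpg_pos - tss_position(strand, tx_start, tx_end))


# Example entries from the annotation files above:
# cg00240178 at chr6:74232108, EEF1A1 on the '-' strand, 74225472-74230755.
print(cpg_to_tss_distance(74232108, "-", 74225472, 74230755))  # prints 1353
```

For the example entries, the result of 1353 bases is consistent with the probe's TSS1500 annotation group.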
The program can take a long time to run: hours for gigabyte-sized datasets. In the demo, it takes about 10 seconds to process 4 probes. When running on a cluster, it is therefore recommended to split the probe file (ex_probe_list.txt) into several files and submit the jobs to different nodes.
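Splitting the probe file can be scripted. The sketch below is a hypothetical helper, not part of MethylXcan. It writes the lines of a probe list into a chosen number of roughly equal chunk files, and it relies on the probe file having no header line, so every line can go to any chunk. The chunk file names are made up for illustration.

```python
import os


def split_probe_list(path, n_chunks, out_dir):
    """Write the probe list into n_chunks files of near-equal size.

    Returns the chunk paths in order. Blank lines are dropped.
    """
    with open(path) as handle:
        lines = [line for line in handle if line.strip()]
    os.makedirs(out_dir, exist_ok=True)
    # Never create more chunks than there are probes (but at least one).
    n_chunks = max(1, min(n_chunks, len(lines)))
    base, extra = divmod(len(lines), n_chunks)
    paths = []
    start = 0
    for index in range(n_chunks):
        size = base + (1 if index < extra else 0)
        chunk_path = os.path.join(out_dir, f"probe_chunk_{index:03d}.txt")
        with open(chunk_path, "w") as out:
            out.writelines(lines[start:start + size])
        paths.append(chunk_path)
        start += size
    return paths
```

Each chunk file can then be passed to run_gene_list.pl in place of ex_probe_list.txt. Because the output is always named "MethylXcan.txt", it is probably safest to run each job in its own working directory so that results do not overwrite each other; this is an assumption worth checking on a small test first.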
Download the demo folder, change into it, and run:
perl script/run_gene_list.pl \
data/ex_probe_list.demo.txt \
data/ex_dataset.demo.txt \
data/me_dataset.demo.txt \
data/methylation_annotation.demo.txt \
data/gene_annotation.demo.txt
The results are written to "MethylXcan.txt".
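Once the run finishes, the results can be inspected with a few lines of Python. This is only a sketch: it assumes "MethylXcan.txt" is tab-delimited with a header line that uses the column names listed above, which should be confirmed against your own output. The helper names and the 0.1 threshold are hypothetical.

```python
import csv


def load_results(path):
    """Read the results table into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def to_float(text):
    """Convert a cell to float; return None for missing or non-numeric cells."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def genes_above(rows, column="R2.glmnet.cv", threshold=0.1):
    """Return sorted gene names with at least one row where column >= threshold."""
    selected = set()
    for row in rows:
        value = to_float(row.get(column))
        if value is not None and value >= threshold:
            selected.add(row["gene"])
    return sorted(selected)
```

For example, `genes_above(load_results("MethylXcan.txt"))` would list the genes whose cross-validated lasso R2 reaches the chosen threshold. Cells such as "NA" are skipped.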