The Constrained Non-Exonic Predictor (CNEP) provides a score for each base of the genome of evidence that the base is in a constrained non-exonic element from information within large-scale collections of epigenomic and transcription factor binding data. The method was applied to the human genome using over sixty thousand features defined from such data.
The Conservation Signature Score by CNEP (CSS-CNEP) is a complementary score for each base of the genome of its expected CNEP score based on the ConsHMM conservation state annotation and overlapping constrained element annotations of the base.
Additionally there is a utility GetOverlapCoord to obtain a ranked list of features overlapping a given coordinate where the features are ranked by their expected score statistics.
CNEP predictions for the human genome (hg19) is available in BigWig format (.bw) here or in Wig format (.wig) by chromosome here. Input files used to generate the predictions are available here. LiftOver of CNEP predictions in hg38 is available here. Bases that were not mapped to uniquely by LiftOver or mapped to a chromosome other than chr1-22 or chrX were filtered.
CSS-CNEP predictions for the human genome (hg19) is available in BigWig format (.bw) here or in Wig format (.wig) by chromosome here. LiftOver of CSS-CNEP predictions in hg38 is available here. Bases that were not mapped to uniquely by LiftOver or mapped to a chromosome other than chr1-22 or chrX were filtered.
Running CNEP involves seven steps. The first step is preparing the input files. The second set is generating a set of sampled positions for training. The third is creating training files for calling Liblinear. The fourth step is calling Liblinear on each training file. The fifth step is generating predictions for portions of each chromosome. The sixth step is combining predictions from different portions of the same chromosome and label set. The seventh step is to average predictions based on different label sets.
The input files and directories are specified in Constants.java. If they need to be changed then edit Constants.java and type
javac -classpath . *.java
By default these are the inputs:
INPUTBEDDIR - directory containing bed files for the input features or subdirectories containing them; one bed file for input feature
featurelist_allinputbeds.txt -- file containing the list of bed files within INPUTBEDDIR one per line
chrorderlist.txt -- list of chromosomes to be included one per line
hg19.chrom.sizes -- tab delimited text file with chromosome information, two columns with first chromosome and the second chromosome size
LABELSCOORDSBEDDIR -- directory containing .bed files for constrained element sets; one .bed file for each constrained elements
labellist.txt -- text file listing constrained elements one per line without .gz extension included if files are gzipped
exons_gencode_v19.bed.gz - bed file with exon coordinates
traindirfeatures.txt - this file is only relevant if splitting training data across multiple directories (see option B below). It a tab delimited text file with each line corresponding to a feature subset. The first column contains the directory TRAINDIR_SUBSET for the feature subset from Step 3 option B. The second column contains the featurelist_subset.txt file name for the feature subset from Step 3 option B. The third column contains the corresponding INPUTBEDDIR_SUBSET sub-directory of INPUTBEDDIR where the files are from Step 3 option B.
These are by default directories with intermediate outputs:
SAMPLEDIR -- Directory where sample files are written
TRAINDIR -- Directory where training files are writen
MODELSDIR -- Directory where trained models are written
PREDICTIONSPORTIONSDIR -- directory where predictions are written for portions of chromosomes
FULLCHROMBYLABELDIR -- directory where full chromosome predictions are written for single label sets
The directory with the final CNEP predictions in Wig format is CNEPDIR
Additionally the defaults for the number of samples per classifier is 1,000,000, the number of classifers being ensembled is 10, and the number of portions for predictions is 10.
This step generates a set of sampled positions for training each of the classifiers. For each chromosome ten sets of a million positions are generated where positions from that chromosome are excluding from sampling. This should be executed by calling:
java -classpath . MakeSampling
This generates training files for each of the constrained element sets based on the sampled positions in samplingfile for training Liblinear.
There are two options for creating training files. Option A creates one set of training files for all the features. Option B creates training files based on subsets of features that are merged in step 4 Option B. Option A is simpler, but Option B better scales to large number of features as they can be split into subsets of features and processed in parallel.
Option A
java -mx14000M -classpath . MakeTrainFiles samplingfile
This should be called for each sampling file. There are 10 sampling files for each chromosome.
Option B
java -mx14000M -classpath . MakeTrainFiles samplingfile INPUTBEDDIR_SUBSET featurelist_subset.txt TRAINDIR_SUBSET
This should be called for each sampling file and feature subset. INPUTBEDDIR_SUBSET is a directory within INPUTBEDDIR where the bed files are for the subset of features. featurelist_subset.txt is a file which lists the file names of features, one per line, that are part of the subset and also in INPUTBEDDIR_SUBSET. TRAINDIR_SUBSET is the name of the directory where training data for the subset of features should be written. Each feature subset should have a different output directory since files of the same name are written to each directory.
First obtain Liblinear, which can be downloaded from here https://www.csie.ntu.edu.tw/~cjlin/liblinear/. Also gunzip must be available. Next create the directory MODELSDIR.
There are also two options for this step. If Option A was used in Step 3, then option A should be used in this step. If Option B was used in Step 3, then option B should be used in this step.
Option A
For each training file in TRAINDIR_SUBSET directories created in step 3 option A, execute these set of commands:
gunzip TRAINDIR/trainfile
LIBLINEAR/liblinear-2.1/train -s 6 -B 1 -c 1 TRAINDIR/trainfile MODELSDIR/trainfile.model
gzip TRAINDIR/trainfile
gzip MODELSDIR/trainfile.model
where trainfile is the name of the training file.
Option B
Execute the command
java -classpath . MergeFeatureFiles
Also, for each training file name in the TRAINDIR_SUBSET created in step 3 option B, trainfile, execute these set of commands:
java -classpath . MergeTrainFiles trainfile TRAINDIR_MERGED
gunzip TRAINDIR_MERGED/trainfile
LIBLINEAR/liblinear-2.1/train -s 6 -B 1 -c 1 TRAINDIR_MERGED/trainfile MODELSDIR/trainfile.model
rm TRAINDIR_MERGED/trainfile
gzip MODELSDIR/trainfile.model
where TRAINDIR_MERGED is the name of the directory where the merged training data feature files should be written and should be different than the train directory for any of the subsets
MODELSDIR is the name of the directory where the models should be written
This step takes the trained classifers and makes predictions for each label set in ten different subsets of each chromosome. Predictions are made on portions of each chromosome to allow execution to be done using less memory and time. This can be done with the command:
java -mx8000M -classpath . MakeCNEPPredict labelfile chrN portion
where portion is value between 0 to 9, when there are 10 portions per chromosome labelfile should be the name of the label file without the .gz extension if present chrN is a chromosome to predict on which should be present in labellist.txt
This should be called for each combination of the ten portions, label set, and chromosome.
This step combines the predictions from different portions to make single .wig files for each chromosome. This can be done with the command:
java -mx8000M -classpath . CombineFiles chrN
This should be called for each chromosome.
This step averages predictions based on training with labels from different constrained element sets. This can be done with the command:
java -mx8000M -classpath . MakeCNEPAverage chrN
This should be called for each chromosome.
Before running CSS-CNEP, CNEP should be run following the procedures above.
The input files and directories are specified in Constants.java. If they need to be changed then edit Constants.java and type
javac -classpath . *.java
The input files should generally be the same as provided when running CNEP. There is one additional input file with the ConsHMM segmentation file which defaults to GW_segmentation.bed.gz
There is one intermediate directory CSSCNEP_AVERAGES where the average CNEP value for different conservation signatures are written. There is a separate file based on excluding each chromosome but including all others.
The directory with the final CNEP predictions in wig format is CNEPDIR
There is one parameter for the number of states in the ConsHMM model which defaults to 100.
This step computes the average CNEP value for each conservation signature, which is a combination of conservation state and constrained element overlaps. For each target chromosome, this is computed based on all input chromosomes except the target chromosome.
java -classpath . ComputeAveragesCSS_CNEP chrN
This should be called for each chromosome.
This step generate the CSS-CNEP scores.
This can be done with the command:
java -classpath . MakeCSS_CNEPPredict chrN
This should be called for each chromosome.
This step generates an ordered list of features based on their expected CNEP score statistics.
This step only needs to only be run one time, not for each coordinate. Additionally if using the same set of features as in the CNEP manuscript it does not need to be run and instead the file features_conservation.txt can be used.
Optionally, if the file name for the feature conservation needs to be changed modify EXPECTCNEPFILE = "features_conservation.txt" in Constants.java and then type
javac -classpath . *.java
To run MakeFeatureConservationList type the command
java -classpath . MakeFeatureConservationList
This step prints to the terminal, or a file if the terminal output is redirected, the set of features overlapping a given coordinate. The coordinate is specified by a chromosome including the chr prefix and an integer for a coordinate position. The features are scanned and displayed from those features that have the highest expected CNEP score statistic, meaning the feature is associated with a high level of non-exonic constraint, to the ones with the lowest CNEP score staistics.
java -classpath . GetOverlapCoord chrom position
The output includes the rank of the feature among those found intersecting the coordinate, the identity of the feature, its expected CNEP score statistic, and the features rank among all features, where rank is based on the expected CNEP score statistic.
Grujic O, Phung TN, Kwon SB, Arneson A, Lee Y, Lohmueller KE, Ernst J. Identification and characterization of constrained non-exonic bases lacking predictive epigenomic and transcription factor binding annotations. Nat Commun 11, 6168 (2020); doi: https://www.nature.com/articles/s41467-020-19962-9