
Computing Enrichments For External Annotations

asperlea edited this page Feb 13, 2018 · 2 revisions

After learning a ConsHMM annotation for a genome of interest, it can be useful to compute the enrichment of the states for some external annotation in order to better understand the biological significance of the states. Using the included ChromHMM software, these enrichments can be computed either for each state as a whole or positionally, relative to a set of anchor points. Both types of enrichment were computed in the original ConsHMM paper.

The steps below use the hg19 100 state segmentation based on the Multiz 100-way alignment, which can be downloaded here. Depending on the size of the external annotations, these enrichments can take several hours to compute.

Enrichment of each state for external annotations

The coords folder provides an example set of external annotations, which is used in the example below. Any external annotations must be provided in .bed format and may be gzipped to save space. To compute the enrichments of the 100 states in the hg19 segmentation for the example annotations in coords/hg19/, run

java -jar ChromHMM/ChromHMM.jar OverlapEnrichment -lowmem -b 1 GW_segmentation.bed.gz coords/hg19/ hg19_multiz100way_enrichments

The flags -lowmem and -b 1 are necessary because the ConsHMM state annotation has single nucleotide resolution. The output of this command will be a file named hg19_multiz100way_enrichments.txt where each row is a state and each column contains the enrichments of the states for one of the external annotations in the coords/hg19/ directory.
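As a rough illustration of how the resulting table might be inspected, the Python sketch below finds the most enriched state for each annotation column. It assumes the output is tab-delimited, with a header row naming the columns and the state label in the first column of every other row, and that a trailing summary row labelled "Base" should be ignored; check these assumptions against your own file. The function name `top_state_per_annotation` and the numbers in `example` are made up for illustration and are not real ConsHMM output.

```python
import csv
import io


def top_state_per_annotation(table_text, skip_rows=("Base",)):
    """Return {column name: (state label, value)} for the largest value in each column.

    Assumes a tab-delimited table whose first row is a header and whose
    first column holds the state label. Rows whose label appears in
    skip_rows (an assumed summary row) are ignored.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    annotations = header[1:]
    best = {}
    for row in reader:
        if not row or row[0] in skip_rows:
            continue
        state = row[0]
        for name, cell in zip(annotations, row[1:]):
            value = float(cell)
            if name not in best or value > best[name][1]:
                best[name] = (state, value)
    return best


# Made-up numbers, for illustration only.
example = (
    "state\tCpG_islands\texons\n"
    "1\t0.5\t2.0\n"
    "2\t3.5\t1.0\n"
    "Base\t1.0\t1.0\n"
)
result = top_state_per_annotation(example)
print(result)
```

To apply it to the real output, read the file first, for example with `open("hg19_multiz100way_enrichments.txt").read()`, and pass the text to the function.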

Positional enrichment of states relative to an external set of anchor points

The anchorFiles folder provides an example set of anchor points, which is used in the example below. Any external anchor points must be provided as a file with one anchor point per line, specified by chromosome and coordinate, with an optional strand field. Gzipping is accepted. To compute the enrichments of the hg19 100 states within 200 bases of exon starts at single nucleotide resolution, run

java -mx25000M -jar ChromHMM/ChromHMM.jar NeighborhoodEnrichment -lowmem -b 1 -s 1 -l 200 -r 200 GW_segmentation.bed.gz anchorFiles/GENCODE_exons_start.txt.gz hg19_multiz100way_positional_enrichments

The flags -lowmem and -b 1 are necessary because the ConsHMM state annotation has single nucleotide resolution. The output of this command will be a file named hg19_multiz100way_positional_enrichments.txt, where each row is a state and each column contains the enrichment of the states at a position relative to the anchor point, in this case ranging from -200 to +200.
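To run the same analysis on a custom set of anchor points, an anchor file in the format described above has to be prepared first. The Python sketch below shows one possible way to write it. It assumes tab-delimited fields in the order chromosome, coordinate, strand, and it does not convert coordinates, so they must already follow the convention ChromHMM expects. The helper names `format_anchor_lines` and `write_anchor_file` and the example coordinates are hypothetical.

```python
import gzip


def format_anchor_lines(anchors):
    """Turn (chromosome, coordinate, strand-or-None) tuples into anchor file lines."""
    lines = []
    for chrom, coord, strand in anchors:
        fields = [chrom, str(int(coord))]
        if strand is not None:
            if strand not in ("+", "-"):
                raise ValueError(f"unexpected strand: {strand!r}")
            fields.append(strand)
        # Assumption: fields are separated by tabs.
        lines.append("\t".join(fields))
    return lines


def write_anchor_file(path, anchors):
    """Write the anchors to a gzipped text file, one anchor point per line."""
    with gzip.open(path, "wt") as handle:
        for line in format_anchor_lines(anchors):
            handle.write(line + "\n")


# Hypothetical anchor points, for illustration only.
example_anchors = [("chr1", 11869, "+"), ("chr1", 14362, "-")]
example_lines = format_anchor_lines(example_anchors)
print("\n".join(example_lines))
```

Calling `write_anchor_file("my_anchors.txt.gz", example_anchors)` would produce a gzipped file that can be passed to NeighborhoodEnrichment in place of anchorFiles/GENCODE_exons_start.txt.gz.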