Skip to content

Latest commit

 

History

History
115 lines (95 loc) · 16 KB

tutorial.md

File metadata and controls

115 lines (95 loc) · 16 KB

The following step-by-step tutorial shows you how to run CSREP to either (1) calculate the summary chromatin state map for one or multiple group(s) of samples, or (2) calculate the differential chromatin state map between two groups of samples. Right now, we implement CSREP such that users only need to modify paramters in a config file config/config.yaml and use snakemake to run the pipeline.

We provided some test data for ease of following this tutorial (in folder ./testdata/raw_data). These are data of chromatin state segmentation in chromosome 22 for 5 samples of the ESC group and 7 samples of the Brain group from Roadmap Epigenetic Project. We chose to include only chromosome 22 so that the tutorial can be run in reasonable time for demonstration purposes. We used the segmentation data from the 18-state model, which was trained on 6 histone modification marks: H3K4me3, H3K27ac, H3K4me3, H3K36me3, H3K27me3, H3K9me3

File structure for reference in running CSREP

You can briefly read through sections of step 0-2 (outlined below) first before going back to reading this section in details. Also, this can serve as an appendix in your first exploration of CSREP.

The following text shows you the file structure in which CSREP organizes input and output files. We write them in the terms of the variable names in the config file config/config.yml. You can look at this file structure for reference, which will be helpful when you fill in parameters for CSREP in the config file. We also provide the cheatsheet cheatsheet_csrep_manual.pdf, which also layouts the file structures of CSREP and can be helpful when you run CSREP for your own application.

testdata/raw_data corresponding to raw_user_input_dir: the folder where all input segmentation data files of all samples are stored. The samples may be from one or from potentially overlapping groups of samples. Inside this folder: |__<sample_id>: each of these folders correspond to a sample. In the example testdata/raw_data, there are 12 of these folders. Inside each folder: |__|__<sampleID><input_filename_suffix>: This file shows one sample's input segmentation data. In the example provided with the tutorial, each of these files are in the format <sampleID>_chr22_core_K27ac_segments.bed.gz, so input_filename_suffix variable in config/config.yml will be _chr22_core_K27ac_segments.bed.gz. Users will prepare data in this folder. The format of this file requires, at the minimum, the first 4 columns to be of the form:

<chrom>	<start_bp>	<end_bp>	<state>

And example of the file format is as follows:

chr10   0       94400   E18
chr10   94400   112000  E13
chr10   112000  117200  E18

Note about state format: The states are usually of the form E<state_index> (where <state_index> is one-based), which is the default format if the segmentation is generated by ChromHMM. State annotations in this file can also be of the form <one_character><state_index>. Examples for state 1: E1, U1, S1, etc. In other cases, your states may be the full state names or the state mnenomic (ex: TSSA, Quies, EnhA, etc.) --> if this is the case, your state_annot_fn (variable in config/config.yml) should have at least three columns: state, itemRgb and mnenomic to specify the meaning of states and their one-based indices. CSREP will try to convert the state names/mnemonic in the input files into a state_number (one-based). Therefore, if your input data contain non-numeric state formats that are inconsistent with state names specified in state_annot_fn, it may corrupt the entire CSREP pipeline. We provide a sample state_annot_fn in testdata/state_annot.txt, with more details provided below.

testdata/state_annot.txt corresponding to state_annot_fn. This file specifies different chracteristics of the states. Users will prepare data in this file. The format of this file requires, at the minimum, columns state, itemRgb and mnenomic. Column state should show the states' indices (one-based), column mnenomic should show the state names (Many repositories such as Epimap provides segmentation data (bed files) with state mnenomics such as TSSA, Quies, EnhA, etc.). Column itemRgb shows the state colors, in r,g,b format. States' color will be useful when CSREP output summary chromatin state maps that will then be displayed on UCSC genome browser. We provide an example state annotation file testdata/state_annot.txt for your reference.

testdata/csrep_output corresponding to <all_cg_out_dir>: The output folder where the output data (representative/differenntial chromatin state maps) for all groups of samples are stored. All data inside this folder will be produced as part of the CSREP pipeline, except for sample.list files, which users will have to prepare.

testdata/roadmap_18state_chromlength.bed corresponding to chrom_length_fn: bed file showing the length of each chromosome. This file can be produced in Step 0 by users using the scripts utils/get_chrom_length_from_segmentation.sh(details below).

testdata/sample_genome.bed.gz corresponding to sample_genome_fn: where the data of genomic regions included in for training data are stored.

Next, we outline the output file structure of CSREP in testdata/csrep_output (corresponding to variable all_cg_out_dir in config). Except for the files listing sample IDs in each group (files with names sample.list below), all the content inside this folder is created by CSREP. test_data/csrep_output |__ Brain: the folder where we can get summary chromatin state map data for samples in Brain group. The structure of this folder is similar to that of ESC folder as shown below. If the config file specifies through variable cell_group_list a list of cell groups, then CSREP will calculate the summary chromatin state maps for all these groups, each group's output will be stored in its own folder similarly to the folder ESC below.
|__ ESC: the folder where we can get summary chromatin state map data for samples in ESC group.
|__|__ sample.list: file storing all the sampleIDs in the ESC group. Each sampleID is on a separate line (E003, E008, E014, etc.). This is a file where the users have to create and place them within this folder in order for Snakemake to work. This step allows the seamless configuration of jobs by Snakemake.
|__|__ CSREP: where output from CSREP's summary chromatin state maps are stored
|__|__|__ representative_data
|__|__|__|__ average_predictions: folder containing the representative chromatin state maps for the group. This is the main output that you are interested in for summarizing a group of samples' chromatin state maps. Each file in this folder has the format chr<chrom>_<region_index, 0-based>_avg_pred.txt.gz, corresponding to a region spanning at most 50,000 chromatin state bins. In other words, there are <= 50,001 lines in each file (one header line), with each line corresponding to one chromatin state bin (typically 200bp by default ChromHMM settings. Therefore, a file of 50,000 genomic bins will typically span 10Mb). File chr22_0_avg_pred.txt.gz shows results in the first 10Mbp window of chromosome 22. File chr22_5_avg_pred.txt.gz shows results in the last 1304400 bp in the chromosome, since chr22 has length 51304400 bp. In each file, the first line shows column names (corresonding to different chromatin states); following lines show the probabilities of state assignment at each bin. One file in this folder will be name summary_state_track.bed.gz, showing the state that have the max assignment probabilities at each genomic position. This file can be read into UCSC genome browser, given the user-provided state_annot_fn showing the state names and colors (example file provided in testdata/state_annot.txt).
|__|__|__|__ pred_<sampleID>: example: pred_E003, a folder containing the data of chromatin state map prediction of the sample (E003) by training a multivariate logistic regression model using input from other samples in the group (see the paper for more details). These folders are redundant and may not need to be looked at for the final output of representative chromaitn state map, therefore, they can be deleted manually by users. We do not incorporate deleting these folders in the snakemake pipeline, because it helps snakemake not rerun certain steps of the pipeline in cases where the computing cluster halts certain jobs.
|__|__ baseline: where output from the base_count method for summary chromatin state maps are stored, if users specify through variable train_mode_list in config/config.yml
|__|__|__ representative_data
|__|__|__|__ average_predictions: Formatted similarly as the counterpart in the CSREP folder (outline above). The results show the output summary chromatin state maps by calculating the frequencies of each state across samples at each position (refer to the Supplementary Methods of the manuscript for details).
|__ ESC_minus_Brain: Where the data of differential chromatin state maps between the two groups ESC and Brain are stored.
|__|__ CSREP: data of differential chromatin state maps calculated using CSREP. Within this folders, each file shows the output for a region spanning <= 50,000 genomic bins. All file names and file format as similar to that of the folder containing representative chromatin state maps for one group fo samples.
|__|__ baseline: data of differential chroamtin state maps calculated using base_count.
testdata/training_data corresponding to training_data_folder: the folder where we will store chromatin state segmentation data used for training, which corresponds to 10% of the genome, for each sample. Data in this folder is produced by CSREP.\

Step 0: Preparing some input

Getting chromatin state maps for input samples

For each input sample, users will first need to obtain the chromatin state map, saved in a file name with name format <sampleID><input_filename_suffix> and stored in raw_user_input_dir/<sampleID>. For example, in ./testdata/raw_data (corresponding to raw_user_input_dir variable in config), chromatin state map file for sample E003 is E003/E003_chr22_core_K27ac_segments.bed.gz. We already prepared the chromatin state maps for input samples in this step inside ./testdata/raw_data.

Getting data of chromosome length

One input to the CSREP framework is a bed file that show the length of each chromosome. An example file is is as follows:

chr1    0       249250600
chr2    0       243199200
chr3    0       198022400
chr4    0       191154200
chr5    0       180915200
chr6    0       171115000
chr7    0       159138600
chr8    0       146364000
chr9    0       141213400
chr10   0       135534600
chr11   0       135006400
chr12   0       133851800
chr13   0       115169800
chr14   0       107349400
chr15   0       102531200
chr16   0       903546000
chr17   0       811952000
chr18   0       780772000
chr19   0       591288000
chr20   0       630254000
chr21   0       481298000
chr22   0       513044000
chrX    0       155270400
chrY    0       593734000

Note: We want the length of the chromosome is a multiple of 200, which is the resolution of the chromatin state segmentation. We recommend using our code utils/get_chrom_length_from_segmentation.sh to produce this file. The command-line format for running this file is:

./utils/get_chrom_length_from_segmentation.sh <input segmentation file> <output file> <chrom_list_fn>

For our example, you can run:

./utils/get_chrom_length_from_segmentation.sh ./testdata/raw_data/E072/E072_chr22_core_K27ac_segments.bed.gz ./testdata/roadmap_18state_chromlength.bed ./utils/chrom.human

This will produce an output file where the length of chromosomes that are not chr22 are all left blank. The reason is because the provided input data, for tutorial purposes, only contains segmentation data in chr22. All the downstream tasks that take in this file as input will not be affected.

Prepare the list of samples for each group

Inside ./testdata/raw_data, there are 12 segmentation files in the form <sampleID><input_filename_suffix>, for example:

E003_chr22_core_K27ac_segments.bed.gz
E008_chr22_core_K27ac_segments.bed.gz
E014_chr22_core_K27ac_segments.bed.gz

In such cases, the samples' ID are E003, E008, E014, <input_filename_suffix> will then be _chr22_core_K27ac_segments.bed.gz. For each group of samples, we need to create a file sample.list inside <all_cg_out_dir>/<group> folder to list all the <sampleID> associated with this group. In our example, they are sample.list files in testdata/csrep_output/ESC or testdata/csrep_output/Brain.

Step 1: Adjust parameters in config file

Users will modify the config/config.yml file to adjust parameters to run CSREP. All details about what each parameter means are contained in the Modify parameters to CSREP section within the README.md file. The section File structure for reference in running CSREP and cheatsheet cheatsheet_csrep_manual.pdf also help visualize the file structures. We recommend just running snakemake using parameters that we currently sets in the config file for the example from testdata folder first. Then, the tutorial will be much easier to understand.

Step 2: Running CSREP using Snakemake

  • Please make sure that all the installation requirements are met. We recommend create a new conda environment for CSREP (refer to Installing CSREP section from README.md, and also refer to our tutorial report if needed.)
  • If you run this tutorial using your personal computer, you can run snakemake on the command line from the current working directory, which contains file Snakefile.
  • If you have the computing capacity of computer clusters/cloud computing, you can run snakemake -j 100 --cluster "qsub -V -l h_rt=4:00:00,h_data=2G -o <path/to/log/folder> -e <path/to/log/folder>" on the command line. Snakemake will submit jobs in parallel whenever possible, and manage your jobs. The joblog and terminal output files from each job will be saved in the folder <path/to/log/folder> that you specify based on your local file paths.
  • Snakemake should handle all the job-sitting work for you, including submitting jobs in parallel where possible, waiting for jobs to finish and then run the next jobs that aggregate output data from previously finished jobs. Sometimes, mostly due to file system latency, snakemake will encounter errors and stop running. In this case, you can wait for all the jobs to finish (depending on whether ther are jobs already submitted in your computing cluster when snakemake crashes, the wait time can be a few minutes to until when all the submitted jobs finish). Then, you can try rerunning snakemake (first snakemake -n to see all the jobs that are about to run/be submiited, then snakemake -j 100 --cluster "qsub -V -l h_rt=4:00:00,h_data=2G -o <path/to/log/folder> -e <path/to/log/folder>" to actually submit remaining jobs). Snakemake will know to pick up where it left off. In cases where you think there may be an error that's not snakemake's failure but rather in our scripts, you can use the code provided in folder process_logs, with detailed documentation to see the error message of failed jobs. Please reach out to us (jason dot ernst at ucla dot edu and havu73 at ucla dot edu) about your problems.