Reproducing the results in Graph Peak Caller Paper
The results presented in the Graph Peak Caller Paper are produced through Jenkins, and are re-run every night. The latest experiment report is available here.
If you wish to re-run these experiments yourself, or want quick access to the scripts used, you can pull our Docker repository. All the experiments on Arabidopsis thaliana should be possible to run on a normal laptop as long as there is enough disk space to store the raw fastq reads and alignments (which will quickly add up to a few hundred gigabytes). However, it is easy to run only a subset of the transcription factors.
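Because the disk requirement is the main constraint, it can help to check the free space before starting. The following is a minimal sketch, not part of the repository: the helper name free_gb and the 300 GB threshold are assumptions (the text above only says "a few hundred gigabytes"). Run it on the filesystem where the data will be written.

```shell
#!/bin/sh
# Hypothetical helper (not part of graph_peak_caller): print the free space,
# in whole gigabytes, on the filesystem that holds the given directory.
free_gb() {
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Assumed threshold: "a few hundred gigabytes" is taken to mean 300 GB here.
required_gb=300
target_dir=${1:-.}

available_gb=$(free_gb "$target_dir")
if [ "$available_gb" -lt "$required_gb" ]; then
    echo "Only ${available_gb} GB free in ${target_dir}; consider running a subset of the transcription factors."
else
    echo "${available_gb} GB free in ${target_dir}; this is probably enough for a full run."
fi
```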
To get the specific version of the Docker repository used to produce the results in the manuscripts, run:
docker pull ivargr/graph_peak_caller@sha256:e3f80fb93dada1d7164897eaaab6ab37e9e94477dc59bda682587edd5874d00e
... or to get the latest version of the Docker repository run:
docker pull ivargr/graph_peak_caller
If you don't have Docker installed, check out these instructions. On Ubuntu, sudo apt install docker.io should be enough to install Docker.
Next, start a container from the image, e.g. by running:
docker run -it ivargr/graph_peak_caller
You are now inside the "virtual machine" that contains all the software and data.
Start by navigating to the graph peak caller directory:
cd /home/graph_peak_caller/benchmarking/
git pull origin master # To make sure you have the newest code; alternatively, check out the version you want to use
Choose one of the two following guides, depending on whether you want to run a single transcription factor (TF), or everything.
If you want to use the Docker repository to call peaks on a specific TF, or you want to reproduce the results for a given TF, the following shows you how. This example calls peaks for the ERF115 TF:
arabidopsis_graph_dir=/home/data/arabidopsis/
./simple_chip_seq_pipeline.sh SRR931836 1 ARABIDOPSIS_ERF115 1,2,3,4,5 135000000 $arabidopsis_graph_dir/reference1-5.fa $arabidopsis_graph_dir/wg.xg $arabidopsis_graph_dir/wg.gcsa $arabidopsis_graph_dir/
The above command runs our simple ChIP-seq pipeline on the NCBI data with accession number SRR931836 (you can of course change this to something else). 1 is the replicate number (set it to 1 if there is only one replicate), 135000000 is the genome size, and 1,2,3,4,5 are the chromosomes to run on.
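To make the positional arguments easier to adapt to another TF, the same call can be assembled from named variables. This is only a sketch: the variable names are illustrative labels, and the meaning of the third argument (assumed here to be the experiment name, since the same string is passed to the analysis and report scripts) is not spelled out in the text. The snippet prints the command rather than running it.

```shell
#!/bin/sh
# Illustrative variable names only; the pipeline script itself takes
# positional arguments in exactly this order.
accession=SRR931836           # NCBI accession number of the reads
replicate=1                   # replicate number (1 if there is only one)
experiment=ARABIDOPSIS_ERF115 # assumption: name used to label this experiment
chromosomes=1,2,3,4,5         # chromosomes to run on
genome_size=135000000         # genome size
arabidopsis_graph_dir=/home/data/arabidopsis/

# The four trailing arguments are the paths used in the example above.
cmd="./simple_chip_seq_pipeline.sh $accession $replicate $experiment $chromosomes $genome_size $arabidopsis_graph_dir/reference1-5.fa $arabidopsis_graph_dir/wg.xg $arabidopsis_graph_dir/wg.gcsa $arabidopsis_graph_dir/"

# Dry run: print the assembled command so it can be checked before use.
echo "$cmd"
```

Once the printed command looks right, it can be pasted into the shell from the benchmarking directory.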
Running the pipeline will take a few hours. Most of the time is spent downloading the data and running vg map. When it is done, you can run the analysis scripts from the same directory to get motif enrichment plots and a comparison with MACS2:
arabidopsis_graph_dir=/home/data/arabidopsis/
./analyse_peak_calling_results.sh SRR931836 1 ARABIDOPSIS_ERF115 1,2,3,4,5 https://hyperbrowser.uio.no/graph-peak-caller/static/graph_peak_caller_data/motifs/ARABIDOPSIS_ERF115.meme $arabidopsis_graph_dir/ $arabidopsis_graph_dir/reference1-5.fa 135000000 &
This takes a few minutes. When done, there will be a txt file and a figure in the /home/graph_peak_caller/benchmarking/figures_tables directory.
You can compile a nice HTML report (formatted similarly to what is shown in the manuscript) like this:
wget -O figures_tables/bootstrap.min.css https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css
python3 generate_jenkins_report.py ARABIDOPSIS_ERF115
To run everything, simply run the jenkins.sh script (the same script that is run on our Jenkins server every night):
cd /home/graph_peak_caller/benchmarking/
git pull origin master # To make sure you have the newest code; alternatively, check out the version you want to use
./jenkins.sh /home/data/arabidopsis/
Note that you can edit the jenkins.sh file as you want. For instance, you can comment out lines so that you run only one or a few transcription factors. Running everything on a laptop might take a day or two. It will be a lot faster on a cluster or server with multiple cores, as vg map can use all available cores.
Note that you can also run the Drosophila melanogaster and human experiments by downloading their data and passing the additional data paths as arguments to jenkins.sh, like this: jenkins.sh /data/arabidopsis/ /path/to/drosophila/ /path/to/human/
The HTML report containing the figures and table shown in the paper will be created in the /home/graph_peak_caller/benchmarking/figures_tables/ directory. All data files, peaks etc. will be in the /home/graph_peak_caller/benchmarking/data/ directory.
--
NB: When running the jenkins.sh script, a lot of data files will be written to /home/graph_peak_caller/benchmarking/data/. If the process has to be stopped or is interrupted for some reason, you should in most cases be able to just run the command again, and the pipeline script will continue where it left off. However, if this seems to cause errors, you can always delete the entire contents of the data directory and start over.
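If a clean restart is needed, the data directory can be emptied while keeping the directory itself in place. The helper below is a hypothetical sketch (clear_data_dir is not a script from the repository), and it relies on the -mindepth and -delete options of GNU or BSD find. The actual call is left commented out because it permanently deletes all downloaded reads and alignments.

```shell
#!/bin/sh
# Hypothetical helper (not part of graph_peak_caller): delete everything
# inside a directory, including hidden files, but keep the directory itself.
clear_data_dir() {
    dir=$1
    # Refuse to run on an empty argument or on something that is not a directory.
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "Not a directory: '$dir'" >&2
        return 1
    fi
    # -mindepth 1 excludes the directory itself; -delete removes files first
    # and then the subdirectories that contained them.
    find "$dir" -mindepth 1 -delete
}

# Uncomment to wipe the pipeline data and start over from scratch:
# clear_data_dir /home/graph_peak_caller/benchmarking/data/
```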