The first step of most bioinformatic analyses is to assess the quality of the data you have received. In this example, we are working with real DNA sequencing data from a research project studying E. coli. We will use a common software package, FastQC, to assess the quality of the data.
Before we start, let's make sure we are in our tutorial-fastqc directory by printing our working directory:
cd ~/tutorial-fastqc
pwd
We should see /home/<username>/tutorial-fastqc.
First, we need to download the sequencing data that we want to analyze for our research project. For this tutorial, we will be downloading data used in the Data Carpentry workshop. This data includes both the genome of Escherichia coli (E. coli) and paired-end sequencing reads from a study carried out by Blount et al. and published in PNAS. Additional information about how the data was modified in preparation for this analysis can be found on the Data Carpentry workshop website.
We have a script called download_data.sh that will download our bioinformatic data. Let's go ahead and run it:
./download_data.sh
Our sequencing data files, all ending in .fastq, can now be found in a folder called data/.
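We can confirm the download by listing that folder:
ls data/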
Now that we have our data, we need to install the software we want to use to analyze it.
There are different ways to install and use software, including installing from source, using pre-compiled binaries, and using containers. In the biological sciences, many software packages are already available as pre-built containers. We can fetch one of these containers and have HTCondor set it up for our job, which means we do not have to install the FastQC software or its dependencies ourselves.
We will use a Docker container built by the State Public Health Bioinformatics community (staphb) and convert it to an Apptainer container by creating an Apptainer definition file:
ls software/
cat software/fastqc.def
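We won't reproduce the tutorial's exact definition file here, but an Apptainer definition file that pulls the staphb FastQC image from Docker Hub can be as short as the sketch below (the :latest tag is an assumption; the actual file may pin a specific version):
# Pull the pre-built FastQC image published by staphb on Docker Hub
Bootstrap: docker
From: staphb/fastqc:latest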
We would then run a command to build the Apptainer container (which we won't run here, but it is listed for future reference):
$ apptainer build fastqc.sif software/fastqc.def
Instead, we will download a ready-to-go Apptainer .sif file:
./download_software.sh
ls software/
We need to create an executable to pass to our HTCondor jobs, so that HTCondor knows what to run on our behalf.
Let's take a look at our executable, fastqc.sh:
cat fastqc.sh
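The exact contents may differ, but a minimal wrapper script for this job could look like the following sketch, which runs FastQC on whatever file name HTCondor passes in as the first argument:
#!/bin/bash
# Run FastQC on the input file passed in as the first argument.
# Inside the staphb container, the fastqc program is already on the PATH.
fastqc "$1"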
Now we create our HTCondor submit file, which tells HTCondor what to run and how many resources to make available to our job:
cat fastqc.submit
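We won't reproduce the exact file here, but a single-job submit file generally looks like the sketch below, where SAMPLE_NAME is a placeholder for one of the actual file prefixes in data/:
# HTCondor Submit File (sketch): fastqc.submit
executable = fastqc.sh
arguments = SAMPLE_NAME.trim.sub.fastq
universe = container
container_image = software/fastqc.sif
transfer_input_files = data/SAMPLE_NAME.trim.sub.fastq
should_transfer_files = YES
transfer_output_files = SAMPLE_NAME.trim.sub_fastqc.html
transfer_output_remaps = "SAMPLE_NAME.trim.sub_fastqc.html = results/SAMPLE_NAME.trim.sub_fastqc.html"
log = logs/fastqc.log
output = logs/fastqc.out
error = logs/fastqc.err
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue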
We are ready to submit our first job!
condor_submit fastqc.submit
We can check on the status of our job in HTCondor's queue using:
condor_q
By using transfer_output_remaps in our submit file, we told HTCondor to store our FastQC output files in the results directory. Let's take a look at our scientific results:
ls results/
It's always good practice to look at our standard error, standard output, and HTCondor log files to catch unexpected output:
ls logs/
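For example, we can print each of the files named by the log, output, and error lines in the submit file:
cat logs/fastqc.out
cat logs/fastqc.err
cat logs/fastqc.log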
To queue a job to analyze each of our sequencing data files, we will take advantage of HTCondor's queue statement. First, let's create a list of the files we want to analyze:
ls data/ | cut -f1 -d "." > list_of_samples.txt
Let's take a look at the contents of this file:
cat list_of_samples.txt
Edit the Submit File to Queue a Job to Analyze Each Biological Sample
HTCondor has different queue syntaxes to help researchers automatically queue many jobs. We will use queue <variable> from <list.txt> to queue a job for each of our samples in list_of_samples.txt.
Once we define <variable>, we can also use it elsewhere in the submit file.
Let's replace each occurrence of the sample identifier with the $(sample) variable, and then iterate through our list of samples as given in list_of_samples.txt.
cat many-fastqc.submit
# HTCondor Submit File: many-fastqc.submit
# Provide our executable and arguments
executable = fastqc.sh
arguments = $(sample).trim.sub.fastq
# Provide the container for our software
universe = container
container_image = software/fastqc.sif
# List files that need to be transferred to the job
transfer_input_files = data/$(sample).trim.sub.fastq
should_transfer_files = YES
# Tell HTCondor to transfer output to our /results directory
transfer_output_files = $(sample).trim.sub_fastqc.html
transfer_output_remaps = "$(sample).trim.sub_fastqc.html = results/$(sample).trim.sub_fastqc.html"
# Track job information
log = logs/fastqc.log
output = logs/fastqc.out
error = logs/fastqc.err
# Resource Requests
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
# Tell HTCondor to run one job for each sample in our list:
queue sample from list_of_samples.txt
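For each line in list_of_samples.txt, HTCondor sets $(sample) to that line and queues one job. For example, if a line held the hypothetical prefix SAMPLE_1, that job would effectively run with:
arguments = SAMPLE_1.trim.sub.fastq
transfer_input_files = data/SAMPLE_1.trim.sub.fastq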
We can then submit many jobs using this single submit file!
condor_submit many-fastqc.submit
Notice that by using a single submit file, we now have multiple jobs in the queue.
We can check on the status of our multiple jobs in HTCondor's queue by using:
condor_q
Once the jobs have finished, we can check our results in the results/ directory:
ls results/
Congratulations on finishing the first step of a sequencing analysis pipeline!