Throughout this exercise, note down the provenance information (metadata) that you think is relevant for each step. At the end of the session we will have a discussion of what should be included in the provenance metadata for the outputs of the workflow that are published/shared with others.
To log in to your account, please open a terminal and run
ssh <username>@ceta.ualg.pt
Once you are logged in, you may check your present working directory and what is in there:
pwd
/home/<username>
ls
Use the Git version control system to copy/clone the `hackathon2022` repository to your home directory:
git clone https://github.com/emo-bon/hackathon2022.git
Once you have downloaded the repo, you may see what's there by moving into it and listing the files and folders.
cd hackathon2022
ls
Additional_exercise.md Commands_Cheatsheet.md Docker hack_wf.cwl hack_wf.yml handson_instructions.md input_files README.md slurm-example.sh tools
Navigate to the `input_files` folder, where you will find a small metagenomic sequencing sample.
There are two files: one with the forward reads and another with the reverse reads of the sample. This indicates that the data come from a paired-end Illumina sequencing experiment.
Under the `hackathon2022/tools` folder, you will find two well-known pieces of bioinformatics software:
- `fastp`: for fast, all-in-one preprocessing of `.fastq`-formatted sequence read files (a typical invocation is sketched below)
- `SeqPrep`: to merge overlapping paired-end Illumina reads into a single longer read
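For orientation, this is roughly how `fastp` is invoked directly on a pair of compressed read files. It is only a sketch with placeholder filenames; in this hackathon the exact command line is generated for you from the CWL and YAML files.

```bash
# Quality-trim/filter paired reads: -i/-I are the forward/reverse inputs,
# -o/-O the corresponding outputs; reports are written in JSON and HTML
fastp -i reads_1.fastq.gz -I reads_2.fastq.gz \
      -o trimmed_1.fastq.gz -O trimmed_2.fastq.gz \
      --json fastp.json --html fastp.html
```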
For each of these pieces of software you will find a folder, within which is a tool description written in the Common Workflow Language (a file with a `.cwl` extension) and a corresponding YAML (`.yml`) configuration file. Take a look at the contents of these files using the `more` command:
more <filename>
The two input data files under the `hackathon2022/input_files` folder are the ones described in the YAML configuration file.
Take a look at the contents of these files using the `gunzip` and `more` commands (use the spacebar to display more of the file when necessary and `q` to quit):
gunzip <filename>
more <filename>
When you have finished looking at the files, you need to re-compress them using the `gzip` command:
gzip <filename>
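As an alternative that avoids decompressing and re-compressing, you can stream the gzipped file directly. The command below (with `<filename>` as a placeholder) shows the first read record, which in FASTQ format spans four lines:

```bash
# Header, sequence, '+' separator and quality string of the first read
zcat <filename>.gz | head -4
```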
Under the `Docker` folder there is a `Dockerfile` that will be discussed later (maybe!).
Finally, in the top-level directory (called `hackathon2022`) you will find the file `hack_wf.cwl` and its corresponding `.yml` file: these describe the two-step workflow that we will run in this hackathon, which invokes the `fastp` and `SeqPrep` software. Take a look at these files and note how data flow into and out of first `fastp` and then `SeqPrep`.
Last but not least, at the top level of the repo you will also find the `slurm-example.sh` script that you will use as a template to submit jobs to the cluster.
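A SLURM submission script of this kind typically contains a few `#SBATCH` directives followed by the commands to run. Here is a minimal sketch; the actual `slurm-example.sh` in the repo may differ, so always start from that template:

```bash
#!/bin/bash
#SBATCH --job-name=myjob        # short job name (keep it under 9 characters)
#SBATCH --output=myjob-%j.out   # file that captures the job's standard output
#SBATCH --ntasks=1              # run a single task

# The command(s) to execute on the compute node go below the header
echo "hello from the cluster"
```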
At the same level you will also find `Commands_Cheatsheet.md`, with some basic commands that you can refer to whenever you need help, and the `Additional_exercise.md` file, with an extra exercise focusing on Docker.
First, we will run a single tool. Navigate to the `tools/fastp` directory.
View the `.cwl` script using the `nano` editor and note the `CommandLineTool` value of the `class` field at the very top of the file. (Note that the `^` character in `nano` signifies the CONTROL key.)
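For example, from inside the `tools/fastp` directory:

```bash
nano fastp.cwl
```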
In the `hints` section, check the `DockerRequirement` entry: the `dockerPull` field specifies the software image to fetch from DockerHub. Can you find the `microbiomeinformatics/pipeline-v5.fastp:0.20.0` Docker image on DockerHub? (If not, here it is.)
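If you have Docker available on your own machine, you can also fetch that image manually. This is optional: `cwltool` pulls the image itself when it encounters the `DockerRequirement` hint, provided Docker is enabled where the job runs.

```bash
docker pull microbiomeinformatics/pipeline-v5.fastp:0.20.0
```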
Now, let's run the `fastp` tool. Copy the `slurm-example.sh` file to the `tools/fastp` folder, giving it a name specific to the job it will run:
# current working directory is the top level of the repo, i.e. /home/<username>/hackathon2022
cp slurm-example.sh tools/fastp/slurm-fastp.sh
cd tools/fastp/
Using the `nano` editor, add the following command to the `slurm-fastp.sh` file and give the job a name of fewer than 9 characters:
cwltool fastp.cwl fastp.yml
Next submit the job to the SLURM queue:
sbatch slurm-fastp.sh
(Note: you can monitor the progress of your job using the `squeue` command, but the job will only take about 10 seconds.)
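For example, to list only your own jobs rather than the whole queue:

```bash
squeue -u <username>
```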
Once the job is finished, let's see the outcome!
ls
fastp.html fastp.json wgs-paired-SRR1620013_1.fastq.fastp.fastq wgs-paired-SRR1620013_2.fastq.fastp.fastq ..
Open a new terminal and download `fastp.html` to your local machine using the secure-copy command `scp` (or any other file-transfer software you have installed), then open it in a web browser.
scp <username>@ceta.ualg.pt:~/hackathon2022/tools/fastp/fastp.html .
The results are also recorded in JSON format in `fastp.json`. JSON is a machine-readable format that can be parsed and manipulated by programming languages. The following command extracts the command line that was used to execute `fastp`:
python3 -c "import json; print(json.load(open('fastp.json'))['command'])"
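In the same way you can pull out other parts of the report, for example the before/after filtering summary; this assumes the report contains a `summary` section, as fastp reports normally do:

```bash
python3 -c "import json; print(json.load(open('fastp.json'))['summary'])"
```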
Make sure you understand how the command was built from the `.cwl` and `.yml` files.
Move to the top-level directory (`hackathon2022`).
Let us review the `hack_wf.cwl` script there. The class of this `.cwl` workflow is `Workflow`, not the `CommandLineTool` class that was used when we were running only a single piece of software (e.g. in `fastp.cwl`). An extra section is present in this CWL file: in the `steps` section we invoke the two pieces of software that are in the `tools` folder.
- Can you see how the output of the `fastp` tool is provided as an input to `SeqPrep`?
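If you want to double-check the wiring between the two steps, `cwltool` can validate the workflow and print its dependency graph in Graphviz `dot` format (rendering the graph to an image requires Graphviz, which may not be installed on the cluster):

```bash
# Check that the workflow, its tools and the links between them are well-formed
cwltool --validate hack_wf.cwl

# Print the workflow graph in Graphviz dot format
cwltool --print-dot hack_wf.cwl > hack_wf.dot
```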
Let us now run the workflow. Write another SLURM queue submission script with the following command and submit it to the queue:
cwltool --outdir 2step-wf hack_wf.cwl hack_wf.yml
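Following the same pattern as for the single tool, the submission script might look roughly like this (a sketch only; the job name is just a suggestion):

```bash
#!/bin/bash
#SBATCH --job-name=hackwf       # fewer than 9 characters

# Run the full two-step workflow and collect the outputs in 2step-wf/
cwltool --outdir 2step-wf hack_wf.cwl hack_wf.yml
```

Submit it with `sbatch` as before.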
The job will take about 20 seconds; once it has completed, check the output directory!
Multiple parameters can be set even in this short, 2-step workflow.
During the hands-on we will discuss some of them, and each of you will edit a parameter. Then we will run the workflow again and see what happens!