Table of contents
- Get situated
- Example 1
- Example 2
- Example 3
- #TODOs, notes, etc.
- Write and run initial tests: Use lists with
job submissions
- Examine and edit the current job-submission script
- Write a code chunk for a script for the job submission
- Get situated
- Create an appropriate list to be used with modified script
- Write out the list-ready run script (
test) using a HEREDOC
- Write out the submission script (
test) using a HEREDOC
- Do a test run of the script and list
- Write and run tests with multiple individual lists
- Write a code chunk to generate lists of arguments
- Write a chunk to split the complete list into individual lists
- Write out the list-ready run script (
test) using a HEREDOC
- Write out the submission script (
test) using a HEREDOC
- Run an
test using the individual lists
Working through the SLURM job-array tutorial here.
This information will be used to redesign how I am submitting Trinity parameterization jobs to SLURM.
- I need to limit the number of jobs submitted at one time; this is because memory is shared by all concurrent jobs, resulting in many jobs not having enough memory and thus terminating with errors
- #TODO Go back and study up on how memory is handled with FHCC's setup of SLURM
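One option for capping how many array tasks run at once is the % separator that sbatch accepts in --array: for example, --array=1-15%4 creates 15 tasks but lets at most 4 run at the same time. Below is a minimal sketch of building such a value; the helper name array_spec, the script name, and the numbers are hypothetical, and I have not yet checked how this interacts with the memory settings on the FHCC cluster.

```shell
#!/bin/bash

# array_spec: build a value for sbatch's --array option (hypothetical helper)
# :param 1: number of tasks in the array <int >= 1>
# :param 2: maximum number of tasks to run at the same time <int >= 1>
array_spec() {
    echo "1-${1}%${2}"
}

# e.g., 15 tasks, with at most 4 running concurrently
spec="$(array_spec 15 4)"

# Print the submission command instead of running it
echo "sbatch --array=${spec} job_array.sh"
```

Options given to sbatch on the command line take precedence over #SBATCH lines in the script, so the throttle can be changed per submission without editing the script.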
Code: Get situated
# grabnode # 1 core, default settings
mwd() {
    transcriptome \
        && cd "./results/2023-0111" \
        || echo "cd'ing failed; check on this"
}
# /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111
Notes: Insights from troubleshooting and debugging
*Note that grabnode is commented out above. This is because "batch jobs submitted from interactive sessions fail," giving an error like this:
Error message
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x200000.
srun: error: Task launch for StepId=7935525.0 failed on node gizmoj7: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
The reason? "This seems to be due to SLURM_CPU_BIND_* env vars being set in the interactive job, which then (undesirably) propagate to the batch job and cause problems if the job's taskset conflicts with the inherited SLURM_CPU_BIND_*."
"Unsetting those env vars at the top of the job submission script seems to prevent the issue from occurring, but isn't something we want to recommend to users. Also, we're concerned that propagation of other env vars from the interactive job to the batch might cause other issues."
Quotes are from—and more details are available at—this link.
Additional details here.
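For reference, the workaround mentioned in the quote (unsetting the SLURM_CPU_BIND_* variables at the top of a submission script) could look roughly like the sketch below. The function name unset_cpu_bind_vars is hypothetical, and the quoted source explicitly does not recommend this as a general fix, so treat it as a debugging aid only.

```shell
#!/bin/bash

# unset_cpu_bind_vars: unset every variable whose name starts with
# SLURM_CPU_BIND (e.g., SLURM_CPU_BIND, SLURM_CPU_BIND_TYPE)
unset_cpu_bind_vars() {
    local var
    # "${!prefix@}" expands to the names of all variables with that prefix
    for var in "${!SLURM_CPU_BIND@}"; do
        unset "${var}"
    done
}

unset_cpu_bind_vars
```

Submitting from a regular login shell rather than from a grabnode session avoids the problem without touching the environment.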
Code: Get situated
dir="tutorial_job-arrays" # echo "${dir}"
if [[ -d "${dir}" ]]; then
    mv -f "${dir}" "bak.${dir}"
fi
ex_1="${dir}/example_1" # echo "${ex_1}"
mkdir -p "${ex_1}"
Code: Make commandlist
if [[ -f "${ex_1}/commandlist" ]]; then
    rm "${ex_1}/commandlist"
fi
touch "${ex_1}/commandlist"

contents="
sleep 5
sleep 4
sleep 8
sleep 2
sleep 6
"
echo "${contents}" >> "${ex_1}/commandlist"
sed -i '1d' "${ex_1}/commandlist"
# head "${ex_1}/commandlist"
# vi "${ex_1}/commandlist" # :q
Code: Make command_array.sh
if [[ -f "${ex_1}/command_array.sh" ]]; then
    rm "${ex_1}/command_array.sh"
fi
touch "${ex_1}/command_array.sh"

contents="
#!/bin/sh
#SBATCH --job-name=\"command-array\"
#SBATCH --output=\"${ex_1}/command_array-%A_%a.out\"
#SBATCH --array=1-5

command=\"\$(awk \"NR == \${SLURM_ARRAY_TASK_ID}\" \"${ex_1}/commandlist\")\"
srun \${command}
"
echo "${contents}" >> "${ex_1}/command_array.sh"
sed -i '1d' "${ex_1}/command_array.sh"
cat -n "${ex_1}/command_array.sh"
# cd "${ex_1}" && rm -- *.out && cd -
# command="$(awk "NR == ${SLURM_ARRAY_TASK_ID}" "tutorial_job-arrays/example_1/commandlist")"
# echo "${command}"
# srun ${command}
*Note above that the variable passed to srun must not be quoted; otherwise, srun receives the whole string (e.g., "sleep 5") as a single word and tries to execute a file with that name.
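The effect of quoting can be checked locally without srun. The sketch below counts how many words a command receives in each case; count_words is a hypothetical helper, and the array-based alternative at the end is my own suggestion, not something from the tutorial.

```shell
#!/bin/bash

# count_words: print the number of arguments received
count_words() {
    echo "${#}"
}

command="sleep 5"

# Unquoted: the shell splits the string into two words, "sleep" and "5"
unquoted="$(count_words ${command})"

# Quoted: the whole string arrives as one word, "sleep 5"
quoted="$(count_words "${command}")"

echo "unquoted: ${unquoted} word(s); quoted: ${quoted} word(s)"

# Alternative: read the words into an array, then expand the array quoted
read -r -a command_array <<< "${command}"
echo "array: $(count_words "${command_array[@]}") word(s)"
```

The array form keeps the words separate while still protecting each one from further globbing, which matters once commands contain wildcard characters.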
Printed to terminal: Make command_array.sh
❯ cat -n "${ex_1}/command_array.sh"
1 #!/bin/sh
2 #SBATCH --job-name="command-array"
3 #SBATCH --output="tutorial_job-arrays/example_1/command_array-%A_%a.out"
4 #SBATCH --array=1-5
6 command="$(awk "NR == ${SLURM_ARRAY_TASK_ID}" "tutorial_job-arrays/example_1/commandlist")"
7 srun ${command}
Notes: Understanding the contents of command_array.sh
- Line 4 tells SLURM to create an array of 5 items, numbered 1 through 5.
  - This should be changed to match the number of jobs you need to run.
  - In our case, we want this range to match the number of commands in our "commandlist" file.
- Line 6 utilizes one of SLURM's built-in variables, called SLURM_ARRAY_TASK_ID.
  - This accesses the specific task ID of the current task in the job array (e.g., 1 for the first task) and can be used like any bash variable.
  - In this example, "awk" is being used to get the contents of a particular line in the "commandlist" file.
  - For the first task, the "command" variable will be "sleep 5".
- Line 3 uses a shorthand method of accessing the job array ID and array task ID and embedding them into the name of the output file.
  - The "%A" represents the SLURM_ARRAY_JOB_ID variable (e.g., 1212985) and the "%a" represents the SLURM_ARRAY_TASK_ID variable (e.g., 1).
  - This would generate an output file similar to "command_array-1212985_1.out" for the first element of the array.
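The line lookup in command_array.sh can be tried outside of a job by standing in a value for SLURM_ARRAY_TASK_ID. This is a minimal sketch using a throwaway copy of the command list; the helper name get_task_line and the fallback value of 1 are assumptions for illustration.

```shell
#!/bin/bash

# Make a throwaway commandlist like the one in Example 1
list="$(mktemp)"
printf 'sleep 5\nsleep 4\nsleep 8\nsleep 2\nsleep 6\n' > "${list}"

# get_task_line: print the line of a file that matches a task ID
# :param 1: task ID, i.e., the line number to print <int >= 1>
# :param 2: file to read, including path <chr>
get_task_line() {
    awk -v id="${1}" 'NR == id' "${2}"
}

# Outside of a job, SLURM_ARRAY_TASK_ID is unset, so fall back to 1
task_id="${SLURM_ARRAY_TASK_ID:-1}"
command="$(get_task_line "${task_id}" "${list}")"
echo "Task ${task_id} would run: ${command}"

# Derive the --array range from the number of lines in the list
n_tasks="$(wc -l < "${list}")"
echo "#SBATCH --array=1-$(( n_tasks ))"
```

Deriving the range from the line count keeps the array size and the command list from drifting apart when commands are added or removed.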
Code: Run command_array.sh
sbatch "${ex_1}/command_array.sh"
skal # alias skal="squeue -u kalavatt"
., "${ex_1}"
head -100 "${ex_1}/command_array-"*".out"
Printed to terminal: Run command_array.sh
❯ skal
7935880_1 campus-ne command- kalavatt R 0:01 1 gizmok30 1
7935880_2 campus-ne command- kalavatt R 0:01 1 gizmok33 1
7935880_3 campus-ne command- kalavatt R 0:01 1 gizmok13 1
7935880_4 campus-ne command- kalavatt R 0:01 1 gizmok13 1
7935880_5 campus-ne command- kalavatt R 0:01 1 gizmok13 1
❯ ., "${ex_1}"
total 280K
drwxrws--- 2 kalavatt 288 Jan 14 15:51 ./
drwxrws--- 3 kalavatt 27 Jan 14 13:41 ../
-rw-rw---- 1 kalavatt 0 Jan 14 15:51 command_array-7935880_1.out
-rw-rw---- 1 kalavatt 0 Jan 14 15:51 command_array-7935880_2.out
-rw-rw---- 1 kalavatt 0 Jan 14 15:51 command_array-7935880_3.out
-rw-rw---- 1 kalavatt 0 Jan 14 15:51 command_array-7935880_4.out
-rw-rw---- 1 kalavatt 0 Jan 14 15:51 command_array-7935880_5.out
-rw-rw---- 1 kalavatt 248 Jan 14 15:50 command_array.sh
-rw-rw---- 1 kalavatt 41 Jan 14 15:48 commandlist
❯ head -100 ${ex_1}/command_array-*.out
==> tutorial_job-arrays/example_1/command_array-7935880_1.out <==
==> tutorial_job-arrays/example_1/command_array-7935880_2.out <==
==> tutorial_job-arrays/example_1/command_array-7935880_3.out <==
==> tutorial_job-arrays/example_1/command_array-7935880_4.out <==
==> tutorial_job-arrays/example_1/command_array-7935880_5.out <==
Intro: Example 2
It is also possible to create jobs that send different parameters to the same set of data. In this next example, we are going to use a simple program written in R that calculates the area of a triangle given two sides and an angle as input.
In this scenario, we will assume that two of the sides of the triangle are known, and we want to calculate how the area changes when the angle between those two sides changes. Let's consider each integer angle from 1 to 15 degrees. Below is the R program for calculating the area of the triangle and the bash script that calls the program:
Code: Example 2
dir="tutorial_job-arrays" # echo "${dir}"
ex_2="${dir}/example_2" # echo "${ex_2}"
mkdir -p "${ex_2}"
# area_of_triangle.R -----------------
if [[ -f "${ex_2}/area_of_triangle.R" ]]; then
    rm "${ex_2}/area_of_triangle.R"
fi
touch "${ex_2}/area_of_triangle.R"

contents="
#!/bin/Rscript

# Take in three integers: two sides of a triangle and the angle between them;
#+ calculate the area of a triangle given the two sides and their shared angle
#+ as input

args <- commandArgs(TRUE)

side_a <- strtoi(args[1], base=10L)
side_b <- strtoi(args[2], base=10L)
angle <- strtoi(args[3], base=10L)

area = (1/2)*side_a*side_b*sin(angle*pi/180)
sprintf(\"The area of a triangle with sides %i and %i with angle %i degrees is %f\", side_a, side_b, angle, area)
"
echo "${contents}" >> "${ex_2}/area_of_triangle.R"
sed -i '1d' "${ex_2}/area_of_triangle.R"
cat -n "${ex_2}/area_of_triangle.R"
# cd "${ex_2}" && rm -- *.out && cd -
# job_array_triangle.sh --------------
if [[ -f "${ex_2}/job_array_triangle.sh" ]]; then
    rm "${ex_2}/job_array_triangle.sh"
fi
touch "${ex_2}/job_array_triangle.sh"

contents="
#!/bin/bash

#SBATCH --job-name=\"area of triangles\"
#SBATCH --output=\"${ex_2}/area_of_triangle_5_8_%a.out\"
#SBATCH --array=1-15

module load R

# calculate the area of a triangle with 2 sides given, and a
# variable angle in degrees between them (Side-Angle-Side)
srun Rscript ${ex_2}/area_of_triangle.R 5 8 \${SLURM_ARRAY_TASK_ID}
"
echo "${contents}" >> "${ex_2}/job_array_triangle.sh"
sed -i '1d' "${ex_2}/job_array_triangle.sh"
cat -n "${ex_2}/job_array_triangle.sh"
# cd "${ex_2}" && rm -- *.out && cd -
# Submit job_array_triangle.sh -------
sbatch "${ex_2}/job_array_triangle.sh"
# Check on outfiles ------------------
., "${ex_2}"
cat "${ex_2}/area_of_triangle_5_8_"*".out"
Printed to terminal: Example 2
❯ cat -n "${ex_2}/area_of_triangle.R"
1 #!/bin/Rscript
3 # Take in 3 integers: 2 sides of a triangle and the angle between them; with
4 #+ the two sides and angle between them known, calculate the area of the
5 #+ triangle
7 args <- commandArgs(TRUE)
9 side_a <- strtoi(args[1], base=10L)
10 side_b <- strtoi(args[2], base=10L)
11 angle <- strtoi(args[3], base=10L)
13 area = (1/2)*side_a*side_b*sin(angle*pi/180)
14 sprintf("The area of a triangle with sides %i and %i with angle %i degrees is %f", side_a, side_b, angle, area)
❯ cat -n "${ex_2}/job_array_triangle.sh"
1 #!/bin/bash
3 #SBATCH --job-name="area of triangles"
4 #SBATCH --output="tutorial_job-arrays/example_2/area_of_triangle_5_8_%a.out"
5 #SBATCH --array=1-15
7 module load R
9 # calculate the area of a triangle with 2 sides given, and a
10 # variable angle in degrees between them (Side-Angle-Side)
11 srun Rscript area_of_triangle.R 5 8 ${SLURM_ARRAY_TASK_ID}
❯ skal
7935901_1 campus-ne area of kalavatt R 0:03 1 gizmok42 1
7935901_2 campus-ne area of kalavatt R 0:03 1 gizmok42 1
7935901_3 campus-ne area of kalavatt R 0:03 1 gizmok10 1
7935901_4 campus-ne area of kalavatt R 0:03 1 gizmok10 1
7935901_5 campus-ne area of kalavatt R 0:03 1 gizmok10 1
7935901_6 campus-ne area of kalavatt R 0:03 1 gizmok22 1
7935901_7 campus-ne area of kalavatt R 0:03 1 gizmok22 1
7935901_8 campus-ne area of kalavatt R 0:03 1 gizmok22 1
7935901_9 campus-ne area of kalavatt R 0:03 1 gizmok94 1
7935901_10 campus-ne area of kalavatt R 0:03 1 gizmok94 1
7935901_11 campus-ne area of kalavatt R 0:03 1 gizmok94 1
7935901_12 campus-ne area of kalavatt R 0:03 1 gizmok94 1
7935901_13 campus-ne area of kalavatt R 0:03 1 gizmok36 1
7935901_14 campus-ne area of kalavatt R 0:03 1 gizmok36 1
7935901_15 campus-ne area of kalavatt R 0:03 1 gizmok35 1
❯ ., "${ex_2}"
total 592K
drwxrws--- 2 kalavatt 741 Jan 14 16:19 ./
drwxrws--- 4 kalavatt 54 Jan 14 16:03 ../
-rw-rw---- 1 kalavatt 82 Jan 14 16:19 area_of_triangle_5_8_10.out
-rw-rw---- 1 kalavatt 82 Jan 14 16:19 area_of_triangle_5_8_11.out
-rw-rw---- 1 kalavatt 82 Jan 14 16:19 area_of_triangle_5_8_12.out
-rw-rw---- 1 kalavatt 82 Jan 14 16:19 area_of_triangle_5_8_13.out
-rw-rw---- 1 kalavatt 82 Jan 14 16:19 area_of_triangle_5_8_14.out
-rw-rw---- 1 kalavatt 82 Jan 14 16:19 area_of_triangle_5_8_15.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_1.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_2.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_3.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_4.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_5.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_6.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_7.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_8.out
-rw-rw---- 1 kalavatt 81 Jan 14 16:19 area_of_triangle_5_8_9.out
-rw-rw---- 1 kalavatt 473 Jan 14 16:13 area_of_triangle.R
-rw-rw---- 1 kalavatt 376 Jan 14 16:19 job_array_triangle.sh
❯ cat "${ex_2}/area_of_triangle_5_8_"*".out"
[1] "The area of a triangle with sides 5 and 8 with angle 10 degrees is 3.472964"
[1] "The area of a triangle with sides 5 and 8 with angle 11 degrees is 3.816180"
[1] "The area of a triangle with sides 5 and 8 with angle 12 degrees is 4.158234"
[1] "The area of a triangle with sides 5 and 8 with angle 13 degrees is 4.499021"
[1] "The area of a triangle with sides 5 and 8 with angle 14 degrees is 4.838438"
[1] "The area of a triangle with sides 5 and 8 with angle 15 degrees is 5.176381"
[1] "The area of a triangle with sides 5 and 8 with angle 1 degrees is 0.349048"
[1] "The area of a triangle with sides 5 and 8 with angle 2 degrees is 0.697990"
[1] "The area of a triangle with sides 5 and 8 with angle 3 degrees is 1.046719"
[1] "The area of a triangle with sides 5 and 8 with angle 4 degrees is 1.395129"
[1] "The area of a triangle with sides 5 and 8 with angle 5 degrees is 1.743115"
[1] "The area of a triangle with sides 5 and 8 with angle 6 degrees is 2.090569"
[1] "The area of a triangle with sides 5 and 8 with angle 7 degrees is 2.437387"
[1] "The area of a triangle with sides 5 and 8 with angle 8 degrees is 2.783462"
[1] "The area of a triangle with sides 5 and 8 with angle 9 degrees is 3.128689"
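The areas printed above can be spot-checked without R or SLURM. The sketch below applies the same side-angle-side formula, (1/2) * a * b * sin(angle), in awk; the function name triangle_area is hypothetical, and the loop stands in for the array task IDs.

```shell
#!/bin/bash

# triangle_area: area of a triangle from two sides and the included angle
# :param 1: side a <number>
# :param 2: side b <number>
# :param 3: angle between the sides, in degrees <number>
triangle_area() {
    awk -v a="${1}" -v b="${2}" -v deg="${3}" 'BEGIN {
        pi = atan2(0, -1)
        printf "%.6f\n", (1 / 2) * a * b * sin(deg * pi / 180)
    }'
}

# Loop over a few angles in place of SLURM_ARRAY_TASK_ID
for angle in 1 2 15; do
    echo "angle ${angle}: $(triangle_area 5 8 "${angle}")"
done
```

For angles 1, 2, and 15 this gives 0.349048, 0.697990, and 5.176381, matching the corresponding .out files.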
Breakdown: Example 2
- The srun command in the "job_array_triangle.sh" script is passing the same first two arguments (5 and 8) to each task of the array
- However, it is changing the third argument to be whatever the current task ID is
- So the first task calls "srun Rscript area_of_triangle.R 5 8 1" because our first array task starts at 1
Intro: Example 3
There may be times that you would like to send many different files as input to a program. Instead of having to do this one at a time, you can set up a job array to do this automatically. In this next example, we will be using a simple shell script called "analysis.sh" that takes an input file and an output directory as parameters.
Code: Get situated
dir="tutorial_job-arrays" # echo "${dir}"
ex_3="${dir}/example_3" # echo "${ex_3}"
mkdir -p "${ex_3}"
Code: Make analysis.sh
if [[ -f "${ex_3}/analysis.sh" ]]; then
    rm "${ex_3}/analysis.sh"
fi
touch "${ex_3}/analysis.sh"

contents="
#!/bin/bash

# analysis.sh
#+
#+ Take two arguments, the first one being a file to be analyzed and the second
#+ a directory to output the analysis, then sleep for a random amount of time
#+ before running md5sum on the infile; the checksum is output to user-defined
#+ directory
#+
#+ \${1} (infile) and \${2} (outdirectory) are the first and second arguments to this
#+ script

# Strip away directory paths, resulting in filenames alone
BASE=\"\$(basename \"\${1}\")\"

# Generate a random number between 1 and 5
RAND=\"\$(( \${RANDOM} % 5+1 ))\"

# Begin the analysis
echo \"Beginning the analysis of \${BASE} at:\"
date

# The sleep program will sit idle, doing nothing
echo \"Sleeping for \${RAND} seconds …\"
sleep \"\${RAND}\"

# Now, actually do something: Calculate the checksum for the infile
CHKSUM=\"\$(md5sum \${1})\"
echo \"\${CHKSUM}\" > \"\${2}/\${BASE}_sum\"

echo \"Analysis of \${BASE} has been completed at:\"
date
"
echo "${contents}" >> "${ex_3}/analysis.sh"
sed -i '1d' "${ex_3}/analysis.sh"
chmod 777 "${ex_3}/analysis.sh"
cat -n "${ex_3}/analysis.sh"
Printed to terminal: Make analysis.sh
❯ cat -n "${ex_3}/analysis.sh"
1 #!/bin/bash
3 # analysis.sh
4 #+
5 #+ Take two arguments, the first one being a file to be analyzed and the second
6 #+ a directory to output the analysis, then sleep for a random amount of time
7 #+ before running md5sum on the infile; the checksum is output to user-defined
8 #+ directory
9 #+
10 #+ ${1} (infile) and ${2} (outdirectory) are the first and second arguments to this
11 #+ script
13 # Strip away directory paths, resulting in filenames alone
14 BASE="$(basename "${1}")"
16 # Generate a random number between 1 and 5
17 RAND="$(( ${RANDOM} % 5+1 ))"
19 # Begin the analysis
20 echo "Beginning the analysis of ${BASE} at:"
21 date
23 # The sleep program will sit idle, doing nothing
24 echo "Sleeping for ${RAND} seconds …"
25 sleep "${RAND}"
27 # Now, actually do something: Calculate the checksum for the infile
28 CHKSUM="$(md5sum ${1})"
29 echo "${CHKSUM}" > "${2}/${BASE}_sum"
31 echo "Analysis of ${BASE} has been completed at:"
32 date
Intro: Make filelist.txt
Let's say we have 5 different files that we would like our program to analyze. We will store the paths to these input files in another file called "filelist.txt".
Code: Make filelist.txt
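if [[ -f "${ex_3}/filelist.txt" ]]; then
    rm "${ex_3}/filelist.txt"
fi
touch "${ex_3}/filelist.txt"
contents="
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_1.txt
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_2.txt
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_3.txt
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_4.txt
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_5.txt
"
echo "${contents}" >> "${ex_3}/filelist.txt"
sed -i '1d' "${ex_3}/filelist.txt"
cat -n "${ex_3}/filelist.txt"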
Printed to terminal: Make filelist.txt
❯ cat -n "${ex_3}/filelist.txt"
1 /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_1.txt
2 /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_2.txt
3 /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_3.txt
4 /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_4.txt
5 /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_5.txt
Code: Make the files in filelist.txt
x="$(wc -l < "${ex_3}/filelist.txt")" # echo "${x}"
# Subtract 1 to skip the blank line at the end of filelist.txt
y=$(( x - 1 )) # echo "${y}"
for (( i=1; i<=y; i++ )); do
    command="$(awk "NR == ${i}" "${ex_3}/filelist.txt")"
    echo "${command}"
    touch "${command}"
done
., "${ex_3}"
Printed to terminal: Make the files in filelist.txt
❯ for (( i=1; i<=y; i++ )); do
> command="$(awk "NR == ${i}" "${ex_3}/filelist.txt")"
> echo "${command}"
> touch "${command}"
> done
❯ ., "${ex_3}"
total 264K
drwxrws--- 2 kalavatt 199 Jan 15 10:18 ./
drwxrws--- 5 kalavatt 81 Jan 14 16:25 ../
-rw-rw---- 1 kalavatt 672 Jan 15 09:45 analysis.sh
-rw-rw---- 1 kalavatt 0 Jan 15 10:18 file_1.txt
-rw-rw---- 1 kalavatt 0 Jan 15 10:18 file_2.txt
-rw-rw---- 1 kalavatt 0 Jan 15 10:18 file_3.txt
-rw-rw---- 1 kalavatt 0 Jan 15 10:18 file_4.txt
-rw-rw---- 1 kalavatt 0 Jan 15 10:18 file_5.txt
-rw-rw---- 1 kalavatt 641 Jan 15 10:17 filelist.txt
Notes: Make job_array.sh
...create a script called "job_array.sh" that uses the command line tool "sed" with the SLURM_ARRAY_TASK_ID variable to get a specific line of that file:
*Note: Selecting a line by number with sed -n did not work in my tests with sed (GNU sed) 4.4, the version installed on the FHCC system, so we avoid, e.g.,
sed -n "${SLURM_ARRAY_TASK_ID}"p "${ex_3}/filelist.txt"
Instead, we do, e.g.,
awk "NR == ${SLURM_ARRAY_TASK_ID}" "${ex_3}/filelist.txt"
(Line-number addressing with the p command is standard sed, so the cause was likely something else, such as the escaping inside the contents string or SLURM_ARRAY_TASK_ID being unset outside of a job. There are many other ways to do this, including alternative ways to do it with sed. Anyway, this is why you see the use of "awk" and not "sed" in Example 1, something I spent (too much) time troubleshooting yesterday (2023-0114).)
Code: Make job_array.sh
if [[ -f "${ex_3}/job_array.sh" ]]; then
    rm "${ex_3}/job_array.sh"
fi
touch "${ex_3}/job_array.sh"

contents="
#!/bin/bash
#SBATCH --job-name=\"array_test\"
#SBATCH --output=\"${ex_3}/analysis_%a.out\"
#SBATCH --time=20:00
#SBATCH --cpus-per-task=1
#SBATCH --array=1-5

name=\"\$(
    awk \"NR == \${SLURM_ARRAY_TASK_ID}\" \"${ex_3}/filelist.txt\"
)\"

srun \"./${ex_3}/analysis.sh\" \"\${name}\" \"${ex_3}\"
"
echo "${contents}" >> "${ex_3}/job_array.sh"
sed -i '1d' "${ex_3}/job_array.sh"
cat -n "${ex_3}/job_array.sh"
# Scraps
# name="$(
# awk "NR == ${SLURM_ARRAY_TASK_ID}" "tutorial_job-arrays/example_3/filelist.txt"
# )"
# echo "${name}"
# # /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2023-0111/tutorial_job-arrays/example_3/file_1.txt
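As a follow-up to the note on sed versus awk, here is a quick local comparison of the two ways of pulling one line out of a list. It is a sketch under assumptions: the list is a throwaway file, and task_id stands in for SLURM_ARRAY_TASK_ID. Line addressing with p is part of standard sed, so on a typical GNU sed both commands should print the same line.

```shell
#!/bin/bash

# Make a throwaway list of three entries
list="$(mktemp)"
printf 'file_1.txt\nfile_2.txt\nfile_3.txt\n' > "${list}"

# Stand-in for SLURM_ARRAY_TASK_ID
task_id=2

# sed: suppress default printing (-n), then print only line ${task_id}
by_sed="$(sed -n "${task_id}p" "${list}")"

# awk: print the record whose number matches the ID
by_awk="$(awk -v id="${task_id}" 'NR == id' "${list}")"

echo "sed: ${by_sed}"
echo "awk: ${by_awk}"
```

If the sed form prints every line instead of one, a likely culprit is an empty variable, since "p" on its own applies to all lines.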
Printed to terminal: Make job_array.sh
❯ cat -n "${ex_3}/job_array.sh"
1 #!/bin/bash
2 #SBATCH --job-name="array_test"
3 #SBATCH --output="tutorial_job-arrays/example_3/analysis_%a.out"
4 #SBATCH --time=20:00
5 #SBATCH --cpus-per-task=1
6 #SBATCH --array=1-5
8 name="$(
9 awk "NR == ${SLURM_ARRAY_TASK_ID}" "tutorial_job-arrays/example_3/filelist.txt"
10 )"
12 srun "./tutorial_job-arrays/example_3/analysis.sh" "${name}" "tutorial_job-arrays/example_3"
Code: Run job_array.sh
# Submit job_array.sh ----------------
sbatch "${ex_3}/job_array.sh"
# Check on outfiles ------------------
., "${ex_3}"
cat "${ex_3}/analysis_"*".out"
# cd "${ex_3}" && rm -- *.out && cd -
Printed to terminal: Run job_array.sh
❯ skal
7954737_1 campus-ne array_te kalavatt R 0:00 1 gizmok27 1
7954737_2 campus-ne array_te kalavatt R 0:00 1 gizmok35 1
7954737_3 campus-ne array_te kalavatt R 0:00 1 gizmok42 1
7954737_4 campus-ne array_te kalavatt R 0:00 1 gizmok12 1
7954737_5 campus-ne array_te kalavatt R 0:00 1 gizmok12 1
❯ skal
7954737_2 campus-ne array_te kalavatt R 0:03 1 gizmok35 1
7954737_3 campus-ne array_te kalavatt R 0:03 1 gizmok42 1
7954737_4 campus-ne array_te kalavatt R 0:03 1 gizmok12 1
7954737_5 campus-ne array_te kalavatt R 0:03 1 gizmok12 1
❯ skal
7954737_2 campus-ne array_te kalavatt R 0:05 1 gizmok35 1
7954737_5 campus-ne array_te kalavatt R 0:05 1 gizmok12 1
❯ skal
❯ ., "${ex_3}"
❯ cat "${ex_3}/analysis_"*".out"
Beginning the analysis of file_1.txt at:
Tue Jan 24 13:21:16 PST 2023
Sleeping for 3 seconds …
Analysis of file_1.txt has been completed at:
Tue Jan 24 13:21:19 PST 2023
Beginning the analysis of file_2.txt at:
Tue Jan 24 13:21:16 PST 2023
Sleeping for 3 seconds …
Analysis of file_2.txt has been completed at:
Tue Jan 24 13:21:19 PST 2023
Beginning the analysis of file_3.txt at:
Tue Jan 24 13:21:16 PST 2023
Sleeping for 5 seconds …
Analysis of file_3.txt has been completed at:
Tue Jan 24 13:21:21 PST 2023
Beginning the analysis of file_4.txt at:
Tue Jan 24 13:21:16 PST 2023
Sleeping for 2 seconds …
Analysis of file_4.txt has been completed at:
Tue Jan 24 13:21:18 PST 2023
Beginning the analysis of file_5.txt at:
Tue Jan 24 13:21:19 PST 2023
Sleeping for 1 seconds …
Analysis of file_5.txt has been completed at:
Tue Jan 24 13:21:20 PST 2023
Error message: Run job_array.sh
With the use of #SBATCH --partition=core, we get the following error:
sbatch: error: invalid partition specified: core
sbatch: error: Batch job submission failed: Invalid partition name specified
#TODOs, notes, etc.
Reading through the text for this example, I think I see what's going on here, and how I can use this example specifically to run GNU parallel such that it takes a header-ed, delimited file of entries that are parameter values for a command under the umbrella of GNU parallel. Done. (Originally slated for #TOMORROW, then for the coming days.)
I'm going to have to write and run a few tests first to get all the pieces working together—but I think the effort and time spent will pay off...
- Draft those during downtime for
ing work (see below—e.g., aligning RNA-seq datasets) - Copy in pertinent code and notes from
- The code around here, I think...
- Going to need to generate header-ed single-line lists of parameters for
, so can draw on...
Also, I really need to get started with the troubleshooting for Alison—get started with that #TOMORROW
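As a starting point for the header-ed lists mentioned above, here is a minimal sketch that writes a space-delimited list with a header row and then splits it into individual single-row lists that each repeat the header. The column names min_kmer_cov and min_glue come from the Trinity options used in this notebook; the function names, the parameter values, and the output naming scheme are hypothetical.

```shell
#!/bin/bash

# write_argument_list: write a space-delimited, header-ed list with one row
# per combination of two parameters (values here are hypothetical)
# :param 1: outfile, including path <chr>
write_argument_list() {
    local outfile="${1}"
    local k g

    echo "min_kmer_cov min_glue" > "${outfile}"
    for k in 1 2 4; do
        for g in 1 2; do
            echo "${k} ${g}" >> "${outfile}"
        done
    done
}

# split_argument_list: write one header-ed, single-row list per data row
# :param 1: complete list, including path <chr>
# :param 2: prefix for the individual lists, including path <chr>
split_argument_list() {
    awk -v prefix="${2}" '
        NR == 1 { header = $0; next }
        {
            out = prefix "." (NR - 1) ".txt"
            print header > out
            print $0 > out
            close(out)
        }
    ' "${1}"
}

list="$(mktemp)"
write_argument_list "${list}"
split_argument_list "${list}" "${list}"

cat "${list}"          # complete list: header plus six rows
cat "${list}.3.txt"    # third individual list: header plus the row "2 1"
```

Numbering the individual lists from 1 means that, in a job array, task N can pick up the list ending in .N.txt directly from SLURM_ARRAY_TASK_ID.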
Code: Survey the current script
#SBATCH --nodes=1
#SBATCH --cpus-per-task=${threads}
#SBATCH --error=./sh_err_out/err_out/${script_name%.sh}.%A-%a.err.txt
#SBATCH --output=./sh_err_out/err_out/${script_name%.sh}.%A-%a.out.txt
# ${script_name}
# KA
# $(date '+%Y-%m%d')
# ------------------------------------
print_message_exit() {
    # Print a message and exit
    # :param 1: message to be printed <chr>
    echo "${1}"
    exit 1
}

check_directory_exists() {
    # Check that a directory exists; exit if it does not
    # :param 1: directory, including path <chr>
    [[ -d "${1}" ]] ||
        {
            echo -e "Exiting: Directory ${1} does not exist.\n"
            exit 1
        }
}

check_file_exists() {
    # Check that a file exists; exit if it does not
    # :param 1: file, including path <chr>
    [[ -f "${1}" ]] ||
        {
            echo -e "Exiting: File ${1} does not exist.\n"
            exit 1
        }
}

check_value_integer() {
    # Check that a value is an integer; exit if not
    # :param 1: value to be checked for positive "integer" data type
    # :param 2: string specifying what argument is being tested <chr>
    [[ ! "${1}" =~ ^[0-9]+$ ]] &&
        {
            echo -e "Exiting: Argument for ${2} must be a positive integer.\n"
            exit 1
        }
}

calculate_run_time() {
    # Calculate run time for chunk of code
    # :param 1: start time in <'date +%s' format>
    # :param 2: end time in <'date +%s' format>
    # :param 3: message to be displayed when printing the run time <chr>
    run_time="$(echo "${2}" - "${1}" | bc -l)"
    echo ""
    echo "${3}"
    printf 'Run time: %dh:%dm:%ds\n' \
        $(( run_time/3600 )) $(( run_time%3600/60 )) $(( run_time%60 ))
    echo ""
}
# ------------------------------------
-c {catalog} directory containing .fastq.gz files, including path; to
be mounted to the Trinity container at '/data' <chr>
-l {left_1} first of two .fastq.gz files for 'left' reads <chr>
-b {left_2} second of two .fastq.gz files for 'left' reads <chr>
-r {right_1} first of two .fastq.gz files for 'right' reads <chr>
-d {right_2} second of two .fastq.gz files for 'right' reads <chr>
-o {out} path for Trinity outfiles; prefix for filenames derived
from the following four arguments <chr>
-k {min_kmer_cov} minimum count for k-mers to be assembled by Inchworm;
e.g., using a setting of 2 means that singleton k-mers
will not be included in initial Inchworm contigs
<int >= 1> [default: 1]
-i {min_iso_ratio} minimum fraction of average k-mer coverage between two
Inchworm contigs; required for gluing <float>
[default: 0.05]
-g {min_glue} minimum number of reads needed to glue two Inchworm
contigs together <int >= 1> [default: 2]
-f {glue_factor} fraction of maximum (Inchworm pair coverage) for read glue
support <float> [default: 0.05]
while getopts "c:l:b:r:d:o:k:i:g:f:" opt; do
    case "${opt}" in
        c) catalog="${OPTARG}" ;;
        l) left_1="${OPTARG}" ;;
        b) left_2="${OPTARG}" ;;
        r) right_1="${OPTARG}" ;;
        d) right_2="${OPTARG}" ;;
        o) out="${OPTARG}" ;;
        k) min_kmer_cov="${OPTARG}" ;;
        i) min_iso_ratio="${OPTARG}" ;;
        g) min_glue="${OPTARG}" ;;
        f) glue_factor="${OPTARG}" ;;
        *) print_message_exit "${help}" ;;
    esac
done
[[ -z "${catalog}" ]] && print_message_exit "${help}"
[[ -z "${left_1}" ]] && print_message_exit "${help}"
[[ -z "${left_2}" ]] && print_message_exit "${help}"
[[ -z "${right_1}" ]] && print_message_exit "${help}"
[[ -z "${right_2}" ]] && print_message_exit "${help}"
[[ -z "${out}" ]] && print_message_exit "${help}"
[[ -z "${min_kmer_cov}" ]] && min_kmer_cov=1
[[ -z "${min_iso_ratio}" ]] && min_iso_ratio=0.05
[[ -z "${min_glue}" ]] && min_glue=2
[[ -z "${glue_factor}" ]] && glue_factor=0.05
# ------------------------------------
check_directory_exists "${catalog}"
# check_file_exists "${left_1}"
# check_file_exists "${left_2}"
# check_file_exists "${right_1}"
# check_file_exists "${right_2}"
check_value_integer "${min_kmer_cov}" "{min_kmer_cov}"
check_value_integer "${min_glue}" "{min_glue}"
#TODO 1/2 In the echo test and submission script, check_file_exists() will lead
#TODO 2/2 to exit b/c not accessing container mount
#TODO Check that directory portion of {out} exists
#TODO check_value_float "${min_iso_ratio}" "{min_iso_ratio}"
#TODO check_value_float "${glue_factor}" "{glue_factor}"
# Echo -------------------------------
time_start="$(date +%s)"
parallel --header : --colsep " " -k -j 1 echo \
'singularity run \
--bind {catalog}:/data \
--bind {scratch}:/loc/scratch \
~/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory {j_mem} \
--CPU {j_cor} \
--SS_lib_type FR \
--seqType fq \
--left {left_1},{left_2} \
--right {right_1},{right_2} \
--jaccard_clip \
--output {out} \
--full_cleanup \
--min_kmer_cov {min_kmer_cov} \
--min_iso_ratio {min_iso_ratio} \
--min_glue {min_glue} \
--glue_factor {glue_factor} \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200' \
::: catalog "${catalog}" \
::: scratch "/fh/scratch/delete30/tsukiyama_t" \
::: j_mem "50G" \
::: j_cor "${SLURM_CPUS_ON_NODE}" \
::: left_1 "${left_1}" \
:::+ left_2 "${left_2}" \
:::+ right_1 "${right_1}" \
:::+ right_2 "${right_2}" \
:::+ out "${out}" \
::: min_kmer_cov "${min_kmer_cov}" \
::: min_iso_ratio "${min_iso_ratio}" \
::: min_glue "${min_glue}" \
::: glue_factor "${glue_factor}"
# Run --------------------------------
parallel --header : --colsep " " -k -j 1 \
'singularity run \
--bind {catalog}:/data \
--bind {scratch}:/loc/scratch \
~/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory {j_mem} \
--CPU {j_cor} \
--SS_lib_type FR \
--seqType fq \
--left {left_1},{left_2} \
--right {right_1},{right_2} \
--jaccard_clip \
--output {out} \
--full_cleanup \
--min_kmer_cov {min_kmer_cov} \
--min_iso_ratio {min_iso_ratio} \
--min_glue {min_glue} \
--glue_factor {glue_factor} \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200' \
::: catalog "${catalog}" \
::: scratch "/fh/scratch/delete30/tsukiyama_t" \
::: j_mem "50G" \
::: j_cor "${SLURM_CPUS_ON_NODE}" \
::: left_1 "${left_1}" \
:::+ left_2 "${left_2}" \
:::+ right_1 "${right_1}" \
:::+ right_2 "${right_2}" \
:::+ out "${out}" \
::: min_kmer_cov "${min_kmer_cov}" \
::: min_iso_ratio "${min_iso_ratio}" \
::: min_glue "${min_glue}" \
::: glue_factor "${glue_factor}"
time_end="$(date +%s)"
Code: Adapt script to take a header-ed list of arguments
# ${script_name_ech}
# KA
# $(date '+%Y-%m%d')
# ------------------------------------
print_message_exit() {
    # Print a message and exit
    # :param 1: message to be printed <chr>
    echo "${1}"
    exit 1
}

check_file_exists() {
    # Check that a file exists; exit if it does not
    # :param 1: file, including path <chr>
    [[ -f "${1}" ]] ||
        {
            echo -e "Exiting: File ${1} does not exist.\n"
            exit 1
        }
}

calculate_run_time() {
    # Calculate run time for chunk of code
    # :param 1: start time in <'date +%s' format>
    # :param 2: end time in <'date +%s' format>
    # :param 3: message to be displayed when printing the run time <chr>
    run_time="$(echo "${2}" - "${1}" | bc -l)"
    echo ""
    echo "${3}"
    printf 'Run time: %dh:%dm:%ds\n' \
        $(( run_time/3600 )) $(( run_time%3600/60 )) $(( run_time%60 ))
    echo ""
}
# ------------------------------------
This script takes in a single file that requires a list of arguments
-a {arguments} space-delimited list of arguments for the below settings and
parameters; list is header-ed with the names of variables for
the arguments (in brackets below)
# -------------------------------------
{catalog} directory containing .fastq.gz files, including path; to
be mounted to the Trinity container at '/data' <chr>
{scratch} scratch directory, including path, to be mounted to the
Trinity container <chr>
{j_mem} max memory to be used by Trinity when limiting can be enabled
(e.g., with jellyfish, sorting, etc.); must be in the form
of a nonnegative integer followed by a single uppercase
letter signifying the unit of storage, e.g., '50G' <chr>
{j_cor} number of threads for Trinity to use <int >= 1>
{left_1} first of two .fastq.gz files for 'left' reads <chr>
{left_2} second of two .fastq.gz files for 'left' reads <chr>
{right_1} first of two .fastq.gz files for 'right' reads <chr>
{right_2} second of two .fastq.gz files for 'right' reads <chr>
{out} path for Trinity outfiles; prefix for filenames derived
from the following four arguments <chr>
{min_kmer_cov} minimum count for k-mers to be assembled by Inchworm;
e.g., using a setting of 2 means that singleton k-mers
will not be included in initial Inchworm contigs
<int >= 1>
{min_iso_ratio} minimum fraction of average k-mer coverage between two
Inchworm contigs; required for gluing <float>
{min_glue} minimum number of reads needed to glue two Inchworm
contigs together <int >= 1>
{glue_factor} fraction of maximum (Inchworm pair coverage) for read
glue support <float>
# -------------------------------------
while getopts "a:" opt; do
    case "${opt}" in
        a) arguments="${OPTARG}" ;;
        *) print_message_exit "${help}" ;;
    esac
done
[[ -z "${arguments}" ]] && print_message_exit "${help}"
# ------------------------------------
check_file_exists "${arguments}"
# Echo -------------------------------
time_start="$(date +%s)"
parallel --header : --colsep " " -k -j 1 echo \
'singularity run \
--bind {catalog}:/data \
--bind {scratch}:/loc/scratch \
~/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory {j_mem} \
--CPU {j_cor} \
--SS_lib_type FR \
--seqType fq \
--left {left_1},{left_2} \
--right {right_1},{right_2} \
--jaccard_clip \
--output {out} \
--full_cleanup \
--min_kmer_cov {min_kmer_cov} \
--min_iso_ratio {min_iso_ratio} \
--min_glue {min_glue} \
--glue_factor {glue_factor} \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200' \
:::: "${arguments}"
time_end="$(date +%s)"
# ${script_name_ech}
# KA
# $(date '+%Y-%m%d')
# ------------------------------------
print_message_exit() {
    # Print a message and exit
    # :param 1: message to be printed <chr>
    echo "\${1}"
    exit 1
}

check_file_exists() {
    # Check that a file exists; exit if it does not
    # :param 1: file, including path <chr>
    [[ -f "\${1}" ]] ||
        {
            echo -e "Exiting: File \${1} does not exist.\n"
            exit 1
        }
}

calculate_run_time() {
    # Calculate run time for chunk of code
    # :param 1: start time in <'date +%s' format>
    # :param 2: end time in <'date +%s' format>
    # :param 3: message to be displayed when printing the run time <chr>
    run_time="\$(echo "\${2}" - "\${1}" | bc -l)"
    echo ""
    echo "\${3}"
    printf 'Run time: %dh:%dm:%ds\n' \
        \$(( run_time/3600 )) \$(( run_time%3600/60 )) \$(( run_time%60 ))
    echo ""
}
# ------------------------------------
This script takes in a single file that requires a list of arguments
-a {arguments} space-delimited list of arguments for the below settings and
parameters; list is header-ed with the names of variables for
the arguments (in brackets below)
# -------------------------------------
{catalog} directory containing .fastq.gz files, including path; to
be mounted to the Trinity container at '/data' <chr>
{scratch} scratch directory, including path, to be mounted to the
Trinity container <chr>
{j_mem} max memory to be used by Trinity when limiting can be enabled
(e.g., with jellyfish, sorting, etc.); must be in the form
of a nonnegative integer followed by a single uppercase
letter signifying the unit of storage, e.g., '50G' <chr>
{j_cor} number of threads for Trinity to use <int >= 1>
{left_1} first of two .fastq.gz files for 'left' reads <chr>
{left_2} second of two .fastq.gz files for 'left' reads <chr>
{right_1} first of two .fastq.gz files for 'right' reads <chr>
{right_2} second of two .fastq.gz files for 'right' reads <chr>
{out} path for Trinity outfiles; prefix for filenames derived
from the following four arguments <chr>
{min_kmer_cov} minimum count for k-mers to be assembled by Inchworm;
e.g., using a setting of 2 means that singleton k-mers
will not be included in initial Inchworm contigs
<int >= 1>
{min_iso_ratio} minimum fraction of average k-mer coverage between two
Inchworm contigs; required for gluing <float>
{min_glue} minimum number of reads needed to glue two Inchworm
contigs together <int >= 1>
{glue_factor} fraction of maximum (Inchworm pair coverage) for read
glue support <float>
# -------------------------------------
while getopts "a:" opt; do
    case "\${opt}" in
        a) arguments="\${OPTARG}" ;;
        *) print_message_exit "\${help}" ;;
    esac
done
[[ -z "\${arguments}" ]] && print_message_exit "\${help}"
# ------------------------------------
check_file_exists "\${arguments}"
# Echo -------------------------------
time_start="\$(date +%s)"
parallel --header : --colsep " " -k -j 1 echo \
'singularity run \
--bind {catalog}:/data \
--bind {scratch}:/loc/scratch \
~/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory {j_mem} \
--CPU {j_cor} \
--SS_lib_type FR \
--seqType fq \
--left {left_1},{left_2} \
--right {right_1},{right_2} \
--jaccard_clip \
--output {out} \
--full_cleanup \
--min_kmer_cov {min_kmer_cov} \
--min_iso_ratio {min_iso_ratio} \
--min_glue {min_glue} \
--glue_factor {glue_factor} \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200' \
:::: "\${arguments}"
time_end="\$(date +%s)"
Code: Write a code chunk for a script for the job submission
#SBATCH --job-name=${script_name_ech}
#SBATCH --nodes=1
#SBATCH --cpus-per-task=${threads}
#SBATCH --error=${err_out}/${script_name_ech%.sh}.%A-%a.err.txt
#SBATCH --output=${err_out}/${script_name_ech%.sh}.%A-%a.out.txt
#SBATCH --array=1-${max_id_job}%${max_id_task}
# ${script_name_run}
# KA
# $(date '+%Y-%m%d')
cat "./${store}/${list%.txt}.${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print $10 }'
cat "./${store}/${list%.txt}.${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print $11 }'
cat "./${store}/${list%.txt}.${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print $12 }'
cat "./${store}/${list%.txt}.${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print $13 }'
ln -f \
${err_out}/${script_name_ech%.sh}.${SLURM_ARRAY_JOB_ID}-${SLURM_ARRAY_TASK_ID}.out.txt \
ln -f \
${err_out}/${script_name_ech%.sh}.${SLURM_ARRAY_JOB_ID}-${SLURM_ARRAY_TASK_ID}.err.txt \
srun \
"${sh_err_out}/${script_name_ech}" \
-a "./${store}/${list%.txt}.${SLURM_ARRAY_TASK_ID}.txt"
rm \
rm \
#SBATCH --job-name=${script_name_ech}
#SBATCH --nodes=1
#SBATCH --cpus-per-task=${threads}
#SBATCH --error=${err_out}/${script_name_ech%.sh}.%A-%a.err.txt
#SBATCH --output=${err_out}/${script_name_ech%.sh}.%A-%a.out.txt
#SBATCH --array=1-${max_id_job}%${max_id_task}
# ${script_name_run}
# KA
# $(date '+%Y-%m%d')
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$10 }'
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$11 }'
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$12 }'
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$13 }'
ln -f \
${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.out.txt \
ln -f \
${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.err.txt \
srun \
"${sh_err_out}/${script_name_ech}" \
-a "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt"
rm \
rm \
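The `--array=1-${max_id_job}%${max_id_task}` directive in the submission script is what throttles the job array: SLURM schedules task IDs 1 through `max_id_job` but lets at most `max_id_task` of them run at the same time. Below is a minimal sketch for composing and checking that value before it is written into the script. The function name `build_array_spec` and the example values 288 and 16 are assumptions for illustration, not settings taken from these notes.

```shell
#!/usr/bin/env bash

build_array_spec() {
    # Compose the value for '#SBATCH --array'
    # :param 1: highest task ID, i.e., the number of individual lists <int >= 1>
    # :param 2: max number of tasks allowed to run at one time <int >= 1>
    local max_id_job="${1}"
    local max_id_task="${2}"

    if ! [[ "${max_id_job}" =~ ^[1-9][0-9]*$ && "${max_id_task}" =~ ^[1-9][0-9]*$ ]]; then
        echo "Exiting: Both arguments must be positive integers." >&2
        return 1
    fi

    echo "1-${max_id_job}%${max_id_task}"
}

#  Hypothetical values: 288 individual lists, 16 tasks running at once
echo "#SBATCH --array=$(build_array_spec 288 16)"
```

Validating the two integers up front seems worthwhile because a malformed `--array` value is otherwise only caught when `sbatch` rejects the submission.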
Code: Get situated
# grabnode # 1 core, default settings
mwd() {
    transcriptome \
        && cd "./results/2023-0111" \
        || echo "cd'ing failed; check on this"
}
Code: Getting file, directory info into a deduplicated associative array
# Create an array of files of interest, including relative paths -------------
unset d_in_base
typeset -a d_in_base=(
# echoTest "${d_in_base[@]}"
# echo "${#d_in_base[@]}"
# Get necessary file/path info into separate arrays ------
unset f_in
unset d_in
typeset -a f_in
typeset -a d_in
for i in "${d_in_base[@]}"; do
    # i="${d_in_base[0]}"
    echo "# Working with files in... --------------------------------"
    echo "#+ ${i}"
    # ., "${i}"
    while IFS=" " read -r -d $'\0'; do
        f_in+=( "$(echo "$(basename "${REPLY%.?.fq.gz}")" | cut -d $'_' -f 2-)" )
        d_in+=( "$(dirname "${REPLY}")" )
    done < <(\
        find "${i}" \
            -type f \
            -name "*_Q_IP_*_1_*.?.fq.gz" \
            -print0 \
    )
    echo ""
done
echoTest "${f_in[@]}"
echoTest "${d_in[@]}"
# Rejoin the path and file info before dedup'ing ---------
unset d_f_rejoin
typeset -a d_f_rejoin
for i in $(seq 0 $(echo "${#f_in[@]}" - 1 | bc)); do
    d_f_rejoin+=( "${d_in[${i}]}/${f_in[${i}]}" )
done
echoTest "${d_f_rejoin[@]}"
# Remove duplicate elements from the "rejoin" array ------
IFS=" " read -r -a d_f_rejoin \
    <<< "$(\
        tr ' ' '\n' \
            <<< "${d_f_rejoin[@]}" \
                | sort -u \
                | tr '\n' ' '\
    )"
echoTest "${d_f_rejoin[@]}"
# "Unjoin" the "rejoin" array ----------------------------
unset f_in
unset d_in
typeset -a f_in
typeset -a d_in
for i in "${d_f_rejoin[@]}"; do
    echo "# Working with... ------------------------------------------"
    echo "#+ ${i}"
    f_in+=( "$(basename "${i%.?.fq.gz}")" )
    d_in+=( "$(dirname "${i}")" )
    echo ""
done
echoTest "${f_in[@]}"
echoTest "${d_in[@]}"
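The dedup step above joins the array elements on spaces and splits them again, so it relies on none of the paths containing whitespace, and `sort -u` reorders the elements. A possible alternative, sketched below, uses an associative array to drop repeats while keeping the order of first appearance. It assumes bash 4 or later and non-empty elements; the function name `dedup_preserving_order` and the demo values are hypothetical.

```shell
#!/usr/bin/env bash

dedup_preserving_order() {
    # Print unique arguments, one per line, in order of first appearance
    # :param @: elements to deduplicate (assumed non-empty) <chr>
    local -A seen=()
    local element

    for element in "$@"; do
        if [[ -z "${seen[${element}]:-}" ]]; then
            seen["${element}"]=1
            printf '%s\n' "${element}"
        fi
    done
}

#  Made-up elements, one of which contains a space
typeset -a demo=( "dir/a" "dir/b" "dir/a" "dir/c c" "dir/b" )

#  Read the unique elements back into an indexed array, one per line
mapfile -t demo_unique < <(dedup_preserving_order "${demo[@]}")
printf '%s\n' "${demo_unique[@]}"
```

Because elements are passed back one per line, this sketch still assumes that no element contains a newline.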
Code: Define variables
store="tutorial_job-arrays/test_list" # cd "${store}"
list="test_list.txt" # echo "${list}"
catalog="$(dirname "$(pwd)")/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd"
d_base="files_Trinity-GF/$(echo "${file_1}" | cut -d "/" -f 1)"
d_mid="$(\
    echo $(basename "${file_1}" ".Aligned.sortedByCoord.out.sc_all.1.fq.gz") \
        | cut -d $'_' -f 2- \
)"
Code: Set up directory for storing results from these tests
if [[ ! -d "${store}" ]]; then mkdir -p "${store}"; fi
Code: Generate the list
if [[ -f "${store}/${list}" ]]; then
    rm "${store}/${list}"
fi
echo "catalog" \
"scratch" \
"j_mem" \
"j_cor" \
"left_1" \
"left_2" \
"right_1" \
"right_2" \
"out" \
"min_kmer_cov" \
"min_iso_ratio" \
"min_glue" \
"glue_factor" \
>> "${store}/${list}"
echo "${catalog}" \
"/fh/scratch/delete30/tsukiyama_t" \
"50G" \
"${left_1}" \
"${left_2}" \
"${right_1}" \
"${right_2}" \
"${out}" \
"${min_kmer_cov}" \
"${min_iso_ratio}" \
"${min_glue}" \
"${glue_factor}" \
>> "${store}/${list}"
# vi "${store}/${list}"
# cat "${store}/${list}"
Code: Write out the list-ready, adapted script (echo test) using a HEREDOC
if [[ -f "${sh_err_out}/${script_name_ech}" ]]; then
    rm "${sh_err_out}/${script_name_ech}"
fi
cat << script > "${sh_err_out}/${script_name_ech}"
# ${script_name_ech}
# KA
# $(date '+%Y-%m%d')
# ------------------------------------
print_message_exit() {
    # Print a message and exit
    # :param 1: message to be printed <chr>
    echo "\${1}"
    exit 1
}

check_file_exists() {
    # Check that a file exists; exit if it does not
    # :param 1: file, including path <chr>
    [[ -f "\${1}" ]] ||
        {
            echo -e "Exiting: File \${1} does not exist.\n"
            exit 1
        }
}

calculate_run_time() {
    # Calculate run time for chunk of code
    # :param 1: start time in <'date +%s' format>
    # :param 2: end time in <'date +%s' format>
    # :param 3: message to be displayed when printing the run time <chr>
    run_time="\$(echo "\${2}" - "\${1}" | bc -l)"
    echo ""
    echo "\${3}"
    printf 'Run time: %dh:%dm:%ds\n' \
        \$(( run_time/3600 )) \$(( run_time%3600/60 )) \$(( run_time%60 ))
    echo ""
}
# ------------------------------------
This script takes in a single file that requires a list of arguments
-a {arguments} space-delimited list of arguments for the below settings and
parameters; list is header-ed with the names of variables for
the arguments (in brackets below)
# -------------------------------------
{catalog} directory containing .fastq.gz files, including path; to
be mounted to the Trinity container at '/data' <chr>
{scratch} scratch directory, including path, to be mounted to the
Trinity container <chr>
{j_mem} max memory to be used by Trinity when limiting can be enabled
(e.g., with jellyfish, sorting, etc.); must be in the form
of a nonnegative integer followed by a single uppercase
letter signifying the unit of storage, e.g., '50G' <chr>
{j_cor} number of threads for Trinity to use <int >= 1>
{left_1} first of two .fastq.gz files for 'left' reads <chr>
{left_2} second of two .fastq.gz files for 'left' reads <chr>
{right_1} first of two .fastq.gz files for 'right' reads <chr>
{right_2} second of two .fastq.gz files for 'right' reads <chr>
{out} path for Trinity outfiles; prefix for filenames derived
from the following four arguments <chr>
{min_kmer_cov} minimum count for k-mers to be assembled by Inchworm;
e.g., using a setting of 2 means that singleton k-mers
will not be included in initial Inchworm contigs
<int >= 1>
{min_iso_ratio} minimum fraction of average k-mer coverage between two
Inchworm contigs; required for gluing <float>
{min_glue} minimum number of reads needed to glue two Inchworm
contigs together <int >= 1>
{glue_factor} fraction of maximum (Inchworm pair coverage) for read
glue support <float>
# -------------------------------------
while getopts "a:" opt; do
    case "\${opt}" in
        a) arguments="\${OPTARG}" ;;
        *) print_message_exit "\${help}" ;;
    esac
done
[[ -z "\${arguments}" ]] && print_message_exit "\${help}"
# ------------------------------------
check_file_exists "\${arguments}"
# Echo -------------------------------
time_start="\$(date +%s)"
parallel --header : --colsep " " -k -j 1 echo \
'singularity run \
--bind {catalog}:/data \
--bind {scratch}:/loc/scratch \
~/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory {j_mem} \
--CPU {j_cor} \
--SS_lib_type FR \
--seqType fq \
--left {left_1},{left_2} \
--right {right_1},{right_2} \
--jaccard_clip \
--output {out} \
--full_cleanup \
--min_kmer_cov {min_kmer_cov} \
--min_iso_ratio {min_iso_ratio} \
--min_glue {min_glue} \
--glue_factor {glue_factor} \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200' \
:::: "\${arguments}"
time_end="\$(date +%s)"
script
chmod +x "${sh_err_out}/${script_name_ech}"
# vi "${sh_err_out}/${script_name_ech}" # :q
# cat "${sh_err_out}/${script_name_ech}"
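The `calculate_run_time` function in the script pipes an integer subtraction through `bc -l`. Since `date +%s` returns whole seconds, the same result can probably be had with shell arithmetic alone, which avoids the dependency on `bc`. A minimal sketch, assuming integer inputs; the function name `format_run_time` is hypothetical and the message parameter of the original function is omitted for brevity.

```shell
#!/usr/bin/env bash

format_run_time() {
    # Print the elapsed time between two timestamps as hours, minutes, seconds
    # :param 1: start time in <'date +%s' format>
    # :param 2: end time in <'date +%s' format>
    local run_time=$(( ${2} - ${1} ))

    printf 'Run time: %dh:%dm:%ds\n' \
        $(( run_time / 3600 )) \
        $(( run_time % 3600 / 60 )) \
        $(( run_time % 60 ))
}

#  3725 seconds is 1 hour, 2 minutes, and 5 seconds
format_run_time 0 3725
```

In the HEREDOC version of the script, each `$` in this function would need to be escaped as `\$`, as is done for the existing functions.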
Code: Write out the submission script (echo test) using a HEREDOC
if [[ -f "${store}/${script_name_run}" ]]; then
    rm "${store}/${script_name_run}"
fi
cat << script > "${store}/${script_name_run}"
#SBATCH --job-name=${script_name_ech}
#SBATCH --nodes=1
#SBATCH --cpus-per-task=${threads}
#SBATCH --error=${err_out}/${script_name_ech%.sh}.%A-%a.err.txt
#SBATCH --output=${err_out}/${script_name_ech%.sh}.%A-%a.out.txt
#SBATCH --array=1-${max_id_job}%${max_id_task}
# ${script_name_run}
# KA
# $(date '+%Y-%m%d')
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$10 }'
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$11 }'
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$12 }'
cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
| awk -v OFS='\t' 'FNR == 2 { print \$13 }'
ln -f \
${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.out.txt \
ln -f \
${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.err.txt \
srun \
"${sh_err_out}/${script_name_ech}" \
-a "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt"
rm \
rm \
# vi "${store}/${script_name_run}" # :q
# cat "${store}/${script_name_run}"
Copy and document the code here in the chunks under 'Write a code chunk for a script for the job submission'; then, move on to adapting this code for the numerous individual lists generated below. After that, things should be good to go for getting Trinity running (handle UMIs first?)
Code: Do a test run of the script and list
# Get a test file that is named like one of the individual files that I will
#+ ultimately use
if [[ ! -f "./${store}/${list/.txt/.1.txt}" ]]; then
    cp \
        "./${store}/${list}" \
        "./${store}/${list/.txt/.1.txt}"
fi
sbatch "./${store}/${script_name_run}"
cat ./${store}/${list%.txt}.1.txt \
| awk -v OFS='\t' 'FNR == 2 { print $10 }'
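The submission script pulls fields 10 through 13 (the four Trinity model parameters) from line 2 of an individual list with four separate `cat | awk` calls. One possible simplification is a single `awk` call whose output is read into four variables at once. The sketch below runs on a throwaway list with made-up values; the variable names `mkc`, `mir`, `mg`, and `gf` are assumptions borrowed from the abbreviations used in the outfile names.

```shell
#!/usr/bin/env bash

#  Throwaway list with the same 13-column header and made-up values
tmp_list="$(mktemp)"
printf '%s\n' \
    "catalog scratch j_mem j_cor left_1 left_2 right_1 right_2 out min_kmer_cov min_iso_ratio min_glue glue_factor" \
    "/a /b 50G 8 l1 l2 r1 r2 o 1 0.01 1 0.005" \
    > "${tmp_list}"

#  Read fields 10-13 of line 2 into four variables in one pass
read -r mkc mir mg gf < <(
    awk 'FNR == 2 { print $10, $11, $12, $13 }' "${tmp_list}"
)

echo "mkc=${mkc} mir=${mir} mg=${mg} gf=${gf}"
```

As with the original calls, this depends on the list being space-delimited with no spaces inside any field.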
Code and printed: Do a test run of the script and list
cd "${store}" || echo "cd'ing failed; check on this..."
cat echo_Trinity-GF_optimization.8530517-1.out.txt
❯ cd "${store}" || echo "cd'ing failed; check on this..."
❯ .,
total 352K
drwxrws--- 2 kalavatt 468 Jan 26 16:50 ./
drwxrws--- 7 kalavatt 197 Jan 26 09:16 ../
-rw-rw---- 1 kalavatt 175 Jan 26 16:49 echo_Trinity-GF_optimization.8530516-1.err.txt
-rw-rw---- 1 kalavatt 0 Jan 26 16:49 echo_Trinity-GF_optimization.8530516-1.out.txt
-rw-rw---- 1 kalavatt 0 Jan 26 16:50 echo_Trinity-GF_optimization.8530517-1.err.txt
-rw-rw---- 1 kalavatt 1.1K Jan 26 16:50 echo_Trinity-GF_optimization.8530517-1.out.txt
-rwxrwx--- 1 kalavatt 4.3K Jan 26 16:37 echo_Trinity-GF_optimization.sh*
-rw-rw---- 1 kalavatt 505 Jan 26 16:47 run_Trinity-GF_optimization.sh
-rw-rw---- 1 kalavatt 4.5K Jan 26 12:28 submit_Trinity-GF_optimization.sh
-rw-rw---- 1 kalavatt 813 Jan 26 16:45 test_list.1.txt
-rw-rw---- 1 kalavatt 813 Jan 26 16:29 test_list.txt
❯ cat echo_Trinity-GF_optimization.8530517-1.out.txt
singularity run --bind /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd:/data --bind /fh/scratch/delete30/tsukiyama_t:/loc/scratch /home/kalavatt/singularity-docker-etc/Trinity.sif Trinity --verbose --max_memory 50G --CPU ${SLURM_CPUS_ON_NODE} --SS_lib_type FR --seqType fq --left /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz --right /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz --jaccard_clip --output files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd --full_cleanup --min_kmer_cov 1 --min_iso_ratio 0.01 --min_glue 1 --glue_factor 0.005 --max_reads_per_graph 2000 --normalize_max_read_cov 200 --group_pairs_distance 700 --min_contig_length 200
❯ cat echo_Trinity-GF_optimization.8530517-1.out.txt
singularity run \
--bind /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd:/data \
--bind /fh/scratch/delete30/tsukiyama_t:/loc/scratch \
/home/kalavatt/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory 50G \
--CPU ${SLURM_CPUS_ON_NODE} \
--SS_lib_type FR \
--seqType fq \
--left /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
--right /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
--jaccard_clip \
--output files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd \
--full_cleanup \
--min_kmer_cov 1 \
--min_iso_ratio 0.01 \
--min_glue 1 \
--glue_factor 0.005 \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200
Looks like all the information is making it into the script
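Beyond eyeballing the echoed command, a quick automated check is to confirm that every body line of a list has as many fields as its header, since `parallel --header :` matches values to `{names}` by column position. A minimal sketch, assuming a space-delimited list; the function name `check_field_counts` and the demo values are hypothetical.

```shell
#!/usr/bin/env bash

check_field_counts() {
    # Check that every line of a list has as many fields as its header
    # :param 1: space-delimited, header-ed list, including path <chr>
    awk '
        FNR == 1 { n = NF; next }
        NF != n  {
            printf "Line %d has %d fields; the header has %d\n", FNR, NF, n
            bad = 1
        }
        END      { exit (bad ? 1 : 0) }
    ' "${1}"
}

#  Throwaway list with made-up values
tmp_list="$(mktemp)"
printf '%s\n' \
    "j_mem j_cor min_glue glue_factor" \
    "50G 8 1 0.005" \
    > "${tmp_list}"

check_field_counts "${tmp_list}" && echo "Field counts match the header"
```

The function returns a nonzero status when any line disagrees with the header, so it could be run just before `sbatch` to stop a submission early.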
Code: Get situated
mwd() {
    transcriptome \
        && cd "./results/2023-0111" \
        || echo "cd'ing failed; check on this"
}
Code: Getting file, directory info into a deduplicated associative array
# Create an array of files of interest, including relative paths -------------
unset d_in_base
typeset -a d_in_base=(
# echoTest "${d_in_base[@]}"
# echo "${#d_in_base[@]}"
# Get necessary file/path info into separate arrays ------
unset f_in
unset d_in
typeset -a f_in
typeset -a d_in
for i in "${d_in_base[@]}"; do
    # i="${d_in_base[0]}"
    echo "# Working with files in... --------------------------------"
    echo "#+ ${i}"
    # ., "${i}"
    while IFS=" " read -r -d $'\0'; do
        f_in+=( "$(echo "$(basename "${REPLY%.?.fq.gz}")" | cut -d $'_' -f 2-)" )
        d_in+=( "$(dirname "${REPLY}")" )
    done < <(\
        find "${i}" \
            -type f \
            -name "*_Q_IP_*_1_*.?.fq.gz" \
            -print0 \
    )
    echo ""
done
echoTest "${f_in[@]}"
echoTest "${d_in[@]}"
# Rejoin the path and file info before dedup'ing ---------
unset d_f_rejoin
typeset -a d_f_rejoin
for i in $(seq 0 $(echo "${#f_in[@]}" - 1 | bc)); do
    d_f_rejoin+=( "${d_in[${i}]}/${f_in[${i}]}" )
done
echoTest "${d_f_rejoin[@]}"
# Remove duplicate elements from the "rejoin" array ------
IFS=" " read -r -a d_f_rejoin \
    <<< "$(\
        tr ' ' '\n' \
            <<< "${d_f_rejoin[@]}" \
                | sort -u \
                | tr '\n' ' '\
    )"
echoTest "${d_f_rejoin[@]}"
# "Unjoin" the "rejoin" array ----------------------------
unset f_in
unset d_in
typeset -a f_in
typeset -a d_in
for i in "${d_f_rejoin[@]}"; do
    echo "# Working with... ------------------------------------------"
    echo "#+ ${i}"
    f_in+=( "$(basename "${i%.?.fq.gz}")" )
    d_in+=( "$(dirname "${i}")" )
    echo ""
done
echoTest "${f_in[@]}"
echoTest "${d_in[@]}"
Code: Define variables
# Basic variables: script name, no. threads, SLURM parameters ----------------
# Location and file for lists of parameters ----------------------------------
store="tutorial_job-arrays/test_list_multi" # cd "${store}"
list="test_list_multi.txt" # head "${list}"
# Variables for directories to be mounted to the Trinity container -----------
catalog="$(dirname "$(pwd)")/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd"
# Variables that define the max memory, no. threads used by Trinity ----------
# Variables and arrays necessary to define the .fq.gz infiles ----------------
d_base="files_Trinity-GF/$(echo "${file_1}" | cut -d "/" -f 1)"
d_mid="$(\
    echo $(basename "${file_1}" ".Aligned.sortedByCoord.out.sc_all.1.fq.gz") \
        | cut -d $'_' -f 2- \
)"
unset left_1
unset left_2
unset right_1
unset right_2
typeset -a left_1
typeset -a left_2
typeset -a right_1
typeset -a right_2
for i in "${f_in[@]}"; do
    left_1+=( "/data/5781_${i}.1.fq.gz" )
    left_2+=( "/data/5782_${i}.1.fq.gz" )
    right_1+=( "/data/5781_${i}.2.fq.gz" )
    right_2+=( "/data/5782_${i}.2.fq.gz" )
done
echoTest "${left_1[@]}"
echoTest "${left_2[@]}"
echoTest "${right_1[@]}"
echoTest "${right_2[@]}"
# Variables necessary to define Trinity outdirectories -----------------------
unset out
typeset -a out
out=( "${d_base}/${d_mid}" )
echoTest "${out[@]}"
# Variables necessary to define Trinity model parameters ---------------------
typeset -a min_kmer_cov=(1 2 4 8 16 32)
typeset -a min_iso_ratio=(0.005 0.01 0.05 0.1)
typeset -a min_glue=(1 2 4)
typeset -a glue_factor=(0.005 0.01 0.05 0.1)
echoTest "${min_kmer_cov[@]}"
echoTest "${min_iso_ratio[@]}"
echoTest "${min_glue[@]}"
echoTest "${glue_factor[@]}"
Code: Set up directory for storing results from these tests
if [[ ! -d "${store}" ]]; then mkdir -p "${store}"; fi
Code: Start the list with a header
# Header
if [[ -f "${store}/test_list_multi.txt" ]]; then
    rm "${store}/test_list_multi.txt"
    # mv "${store}/test_list_multi.txt" "${store}/test_list_multi.1.txt"
fi
echo "catalog \
scratch \
j_mem \
j_cor \
left_1 \
left_2 \
right_1 \
right_2 \
out \
min_kmer_cov \
min_iso_ratio \
min_glue \
glue_factor" \
> "${store}/test_list_multi.txt"
# vi "${store}/test_list_multi.txt" # :q
# cat "${store}/test_list_multi.txt"
Code: Write code for generating lists with permutations of parameters
# Body
parallel --header : --colsep " " -k -j 1 echo \
"{catalog} \
{scratch} \
{j_mem} \
{j_cor} \
{left_1} \
{left_2} \
{right_1} \
{right_2} \
{out}/trinity_mkc-{min_kmer_cov}_mir-{min_iso_ratio}_mg-{min_glue}_gf-{glue_factor} \
{min_kmer_cov} \
{min_iso_ratio} \
{min_glue} \
{glue_factor}" \
::: catalog "${catalog}" \
::: scratch "${scratch}" \
::: j_mem "${j_mem}" \
::: j_cor "${j_cor}" \
:::+ left_1 "${left_1[@]}" \
:::+ left_2 "${left_2[@]}" \
:::+ right_1 "${right_1[@]}" \
:::+ right_2 "${right_2[@]}" \
:::+ out "${out[@]}" \
::: min_kmer_cov "${min_kmer_cov[@]}" \
::: min_iso_ratio "${min_iso_ratio[@]}" \
::: min_glue "${min_glue[@]}" \
::: glue_factor "${glue_factor[@]}" \
>> "${store}/test_list_multi.txt"
# wc -l "${store}/test_list_multi.txt"
# head "${store}/test_list_multi.txt"
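In the `parallel` call above, each `:::` input source is crossed with the others, while each `:::+` source is linked element by element to the source before it. For checking the crossed portion without GNU Parallel, the four parameter arrays can be expanded with nested loops, as sketched below. The function name `print_permutations` is hypothetical, and the loop order (with `glue_factor` varying fastest) is an assumption chosen to match the ordering seen in the printed lines of the list.

```shell
#!/usr/bin/env bash

#  Parameter arrays, copied from the 'Define variables' chunk
typeset -a min_kmer_cov=(1 2 4 8 16 32)
typeset -a min_iso_ratio=(0.005 0.01 0.05 0.1)
typeset -a min_glue=(1 2 4)
typeset -a glue_factor=(0.005 0.01 0.05 0.1)

print_permutations() {
    # Print the outfile stem and the four parameters for each permutation
    local mkc mir mg gf

    for mkc in "${min_kmer_cov[@]}"; do
        for mir in "${min_iso_ratio[@]}"; do
            for mg in "${min_glue[@]}"; do
                for gf in "${glue_factor[@]}"; do
                    echo "trinity_mkc-${mkc}_mir-${mir}_mg-${mg}_gf-${gf}" \
                        "${mkc}" "${mir}" "${mg}" "${gf}"
                done
            done
        done
    done
}

#  Count the permutations, then show the fourth one
print_permutations | wc -l
print_permutations | sed -n '4p'
```

Comparing a few of these lines against the corresponding lines of `"${store}/test_list_multi.txt"` is one way to confirm that the list was generated in the expected order.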
Printed: Examine the text printed to "${store}/test_list_multi.txt"
Prior to adding the file stem and header
Various lines selected from the file and formatted with new lines, etc.
Line 2
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd \
/fh/scratch/delete30/tsukiyama_t \
50G \
8 \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd \
1 \
0.005 \
1 \
Line 65
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd \
/fh/scratch/delete30/tsukiyama_t \
50G \
8 \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd \
2 \
0.01 \
2 \
Line 119
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd \
/fh/scratch/delete30/tsukiyama_t \
50G \
8 \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd \
4 \
0.01 \
4 \
Line 222
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd \
/fh/scratch/delete30/tsukiyama_t \
50G \
8 \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd \
16 \
0.05 \
2 \
Line 278
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd \
/fh/scratch/delete30/tsukiyama_t \
50G \
8 \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd \
32 \
0.1 \
1 \
After adding the file stem but not yet the header
Line selected from the file and formatted with new lines, etc.
Line 4
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd \
/fh/scratch/delete30/tsukiyama_t \
50G \
8 \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd/trinity_mkc-1_mir-0.005_mg-1_gf-0.1 \
1 \
0.005 \
1 \
After adding both the file stem and header
Line selected from the file and formatted with new lines, etc.
Lines 1 and 2
catalog scratch j_mem j_cor left_1 left_2 right_1 right_2 out min_kmer_cov min_iso_ratio min_glue glue_factor
/home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd \
/fh/scratch/delete30/tsukiyama_t \
50G \
8 \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
/data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd/trinity_mkc-1_mir-0.005_mg-1_gf-0.005 \
1 \
0.005 \
1 \
Notes: Examine the text printed to "${store}/test_list_multi.txt"
There should be 288 permutations (6 × 4 × 3 × 4) and thus 288 lines in the body of the list, i.e.,
- 6 values for `min_kmer_cov`
- 4 values for `min_iso_ratio`
- 3 values for `min_glue`
- 4 values for `glue_factor`
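The expected count can also be computed from the array lengths rather than by hand, which keeps the check valid if values are later added to or removed from an array. A minimal sketch, assuming the four parameter arrays as defined in the 'Define variables' chunk; the function name `count_permutations` is hypothetical.

```shell
#!/usr/bin/env bash

#  Parameter arrays, copied from the 'Define variables' chunk
typeset -a min_kmer_cov=(1 2 4 8 16 32)
typeset -a min_iso_ratio=(0.005 0.01 0.05 0.1)
typeset -a min_glue=(1 2 4)
typeset -a glue_factor=(0.005 0.01 0.05 0.1)

count_permutations() {
    # Multiply together the array lengths passed as arguments
    # :param @: array lengths <int >= 1>
    local product=1
    local n

    for n in "$@"; do
        product=$(( product * n ))
    done

    echo "${product}"
}

expected="$(
    count_permutations \
        "${#min_kmer_cov[@]}" \
        "${#min_iso_ratio[@]}" \
        "${#min_glue[@]}" \
        "${#glue_factor[@]}"
)"
echo "Expected permutations: ${expected}"
```

The result can then be compared with the number of body lines in the list, e.g., the output of `sed 1d "${store}/${list}" | wc -l`.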
Code: Write a chunk to split the list into individual files
if [[ -f "${store}/test_list_multi.24.txt" ]]; then
    rm "${store}/"*.{?,??,???}.txt
fi
# ., "${store}"
# vi "${store}/test_list_multi.txt" # :q
# cat "${store}/test_list_multi.txt" # :q
typeset -i i=0
sed 1d "${store}/${list}" | while read -r line; do
    # Increment with each line
    i=$(( i + 1 ))
    # File for job submission
    individual="${list%.txt}.${i}.txt" # echo "${individual}"
    # If present, remove infile with header and single-line body
    [[ ! -e "${store}/${individual}" ]] || rm "${store}/${individual}"
    # echo "${store}/${individual}"
    # Generate infile with header and single-line body
    # echo "$(head -n 1 ${list})" >> "${individual}"
    head -n 1 "${store}/${list}" >> "${store}/${individual}" # cat "${store}/${individual}"
    echo "${line}" >> "${store}/${individual}" # cat "${store}/${individual}"
    # echo "Created file: ${store}/${individual}"
done
# ., "${store}"
# vi "${store}/test_list_multi.24.txt" # :q
# cat "${store}/test_list_multi.24.txt" # :q
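After splitting, each individual list should contain exactly two lines: the header and one line of arguments. The sketch below checks this for every file matching a given stem. It runs on a throwaway directory with made-up files; the function name `check_individual_lists` and the stem `demo` are hypothetical.

```shell
#!/usr/bin/env bash

check_individual_lists() {
    # Check that each individual list has a header line and one body line
    # :param 1: directory containing the individual lists <chr>
    # :param 2: list stem, e.g., 'test_list_multi' <chr>
    local dir="${1}"
    local stem="${2}"
    local f
    local status=0

    for f in "${dir}/${stem}".*.txt; do
        #  An unmatched glob is left as a literal pattern, which does not exist
        [[ -e "${f}" ]] || { echo "No individual lists found."; return 1; }

        if [[ "$(wc -l < "${f}")" -ne 2 ]]; then
            echo "Unexpected line count: ${f}"
            status=1
        fi
    done

    return "${status}"
}

#  Small demonstration with a throwaway directory and made-up values
tmp="$(mktemp -d)"
printf 'min_glue glue_factor\n1 0.005\n' > "${tmp}/demo.1.txt"
printf 'min_glue glue_factor\n2 0.01\n' > "${tmp}/demo.2.txt"

check_individual_lists "${tmp}" "demo" && echo "All individual lists OK"
```

For the lists generated above, the call would presumably be `check_individual_lists "${store}" "${list%.txt}"`.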
Code: Write out the list-ready run script (echo test) using a HEREDOC
if [[ -f "${sh_err_out}/${script_name_ech}" ]]; then
    rm "${sh_err_out}/${script_name_ech}"
fi
cat << script > "${sh_err_out}/${script_name_ech}"
# ${script_name_ech}
# KA
# $(date '+%Y-%m%d')
# ------------------------------------
print_message_exit() {
    # Print a message and exit
    # :param 1: message to be printed <chr>
    echo "\${1}"
    exit 1
}

check_file_exists() {
    # Check that a file exists; exit if it does not
    # :param 1: file, including path <chr>
    [[ -f "\${1}" ]] ||
        {
            echo -e "Exiting: File \${1} does not exist.\n"
            exit 1
        }
}

calculate_run_time() {
    # Calculate run time for chunk of code
    # :param 1: start time in <'date +%s' format>
    # :param 2: end time in <'date +%s' format>
    # :param 3: message to be displayed when printing the run time <chr>
    run_time="\$(echo "\${2}" - "\${1}" | bc -l)"
    echo ""
    echo "\${3}"
    printf 'Run time: %dh:%dm:%ds\n' \
        \$(( run_time/3600 )) \$(( run_time%3600/60 )) \$(( run_time%60 ))
    echo ""
}
# ------------------------------------
This script takes in a single file that requires a list of arguments
-a {arguments} space-delimited list of arguments for the below settings and
parameters; list is header-ed with the names of variables for
the arguments (in brackets below)
# -------------------------------------
{catalog} directory containing .fastq.gz files, including path; to
be mounted to the Trinity container at '/data' <chr>
{scratch} scratch directory, including path, to be mounted to the
Trinity container <chr>
{j_mem} max memory to be used by Trinity when limiting can be enabled
(e.g., with jellyfish, sorting, etc.); must be in the form
of a nonnegative integer followed by a single uppercase
letter signifying the unit of storage, e.g., '50G' <chr>
{j_cor} number of threads for Trinity to use <int >= 1>
{left_1} first of two .fastq.gz files for 'left' reads <chr>
{left_2} second of two .fastq.gz files for 'left' reads <chr>
{right_1} first of two .fastq.gz files for 'right' reads <chr>
{right_2} second of two .fastq.gz files for 'right' reads <chr>
{out} path for Trinity outfiles; prefix for filenames derived
from the following four arguments <chr>
{min_kmer_cov} minimum count for k-mers to be assembled by Inchworm;
e.g., using a setting of 2 means that singleton k-mers
will not be included in initial Inchworm contigs
<int >= 1>
{min_iso_ratio} minimum fraction of average k-mer coverage between two
Inchworm contigs; required for gluing <float>
{min_glue} minimum number of reads needed to glue two Inchworm
contigs together <int >= 1>
{glue_factor} fraction of maximum (Inchworm pair coverage) for read
glue support <float>
# -------------------------------------
while getopts "a:" opt; do
    case "\${opt}" in
        a) arguments="\${OPTARG}" ;;
        *) print_message_exit "\${help}" ;;
    esac
done
[[ -z "\${arguments}" ]] && print_message_exit "\${help}"
# ------------------------------------
check_file_exists "\${arguments}"
# Echo -------------------------------
time_start="\$(date +%s)"
parallel --header : --colsep " " -k -j 1 echo \
'singularity run \
--bind {catalog}:/data \
--bind {scratch}:/loc/scratch \
~/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory {j_mem} \
--CPU {j_cor} \
--SS_lib_type FR \
--seqType fq \
--left {left_1},{left_2} \
--right {right_1},{right_2} \
--jaccard_clip \
--output {out} \
--full_cleanup \
--min_kmer_cov {min_kmer_cov} \
--min_iso_ratio {min_iso_ratio} \
--min_glue {min_glue} \
--glue_factor {glue_factor} \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200' \
:::: "\${arguments}"
time_end="\$(date +%s)"
chmod +x "${sh_err_out}/${script_name_ech}"
# vi "${sh_err_out}/${script_name_ech}" # :q
# cat "${sh_err_out}/${script_name_ech}"
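The echo script relies on GNU parallel's `--header :` option to treat the first row of the list as column names, so that placeholders such as `{min_kmer_cov}` are replaced by the values in each following row. A minimal sketch of that mapping in plain bash, with no call to `parallel`, might look like the following. The function `fill_template`, the four-column list, and the template string are hypothetical and exist only for this illustration.

```shell
#!/bin/bash

# fill_template (hypothetical helper): substitute {name} placeholders in a
# template with values from one row of a space-delimited, header-ed list
# :param 1: list file with a header row <chr>
# :param 2: row to use, counting the header as row 1 <int >= 2>
# :param 3: template containing {name} placeholders <chr>
fill_template() {
    local header row out key i
    read -r -a header < <(sed -n '1p' "${1}")
    read -r -a row < <(sed -n "${2}p" "${1}")
    out="${3}"
    for i in "${!header[@]}"; do
        key="{${header[i]}}"
        out=${out//"${key}"/${row[i]}}
    done
    echo "${out}"
}

# Toy list with a header row and two rows of arguments
list_demo="$(mktemp)"
cat << 'EOF' > "${list_demo}"
min_kmer_cov min_iso_ratio min_glue glue_factor
8 0.005 4 0.005
8 0.1 4 0.005
EOF

# Fill the template with the values in row 3 (the second row of arguments)
fill_template "${list_demo}" 3 \
    "--min_kmer_cov {min_kmer_cov} --min_iso_ratio {min_iso_ratio} --min_glue {min_glue} --glue_factor {glue_factor}"
```

Because the columns are matched by name and not by position, reordering the columns in the list does not change the command, provided the header row is reordered with them.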
Code: Write out the submission script (echo test) using a HEREDOC
if [[ -f "${store}/${script_name_run}" ]]; then
    rm "${store}/${script_name_run}"
fi

cat << script > "${store}/${script_name_run}"
#!/bin/bash
#SBATCH --job-name=${script_name_ech}
#SBATCH --nodes=1
#SBATCH --cpus-per-task=${threads}
#SBATCH --error=${err_out}/${script_name_ech%.sh}.%A-%a.err.txt
#SBATCH --output=${err_out}/${script_name_ech%.sh}.%A-%a.out.txt
#SBATCH --array=1-${max_id_job}%${max_id_task}
# ${script_name_run}
# KA
# $(date '+%Y-%m%d')
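
# NOTE: The variable names 'mkc', 'mir', 'mg', 'gf', and 'name', and the
# targets of 'ln' and 'rm', are assumed; they are inferred from the list
# columns and from the names of the log files printed in the spot check
mkc="\$(cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
    | awk -v OFS='\t' 'FNR == 2 { print \$10 }')"
mir="\$(cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
    | awk -v OFS='\t' 'FNR == 2 { print \$11 }')"
mg="\$(cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
    | awk -v OFS='\t' 'FNR == 2 { print \$12 }')"
gf="\$(cat "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt" \
    | awk -v OFS='\t' 'FNR == 2 { print \$13 }')"
name="trinity_mkc-\${mkc}_mir-\${mir}_mg-\${mg}_gf-\${gf}"

ln -f \
    "${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.out.txt" \
    "${err_out}/\${name}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.out.txt"
ln -f \
    "${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.err.txt" \
    "${err_out}/\${name}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.err.txt"

srun \
    "${sh_err_out}/${script_name_ech}" \
    -a "./${store}/${list%.txt}.\${SLURM_ARRAY_TASK_ID}.txt"

rm \
    "${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.out.txt"
rm \
    "${err_out}/${script_name_ech%.sh}.\${SLURM_ARRAY_JOB_ID}-\${SLURM_ARRAY_TASK_ID}.err.txt"
script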
# vi "${store}/${script_name_run}" # :q
# cat "${store}/${script_name_run}"
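The submission script mixes two kinds of variables inside an unquoted HEREDOC: unescaped ones such as `${store}`, which are filled in when the file is written, and escaped ones such as `\${SLURM_ARRAY_TASK_ID}`, which stay as variables in the file and are filled in only when SLURM runs it. The following is a small sketch of that difference; the names `tmp_dir`, `written_by`, `demo.sh`, and `demo_out` are hypothetical and used only for illustration.

```shell
#!/bin/bash

# All names here are hypothetical and used only for illustration
tmp_dir="$(mktemp -d)"
written_by="expanded-when-written"

cat << script > "${tmp_dir}/demo.sh"
#!/bin/bash
# Unescaped: replaced while the HEREDOC is being written
echo "outer: ${written_by}"
# Escaped: kept as a variable in the file, expanded when the file is run
echo "inner: \${1}"
script

# Run the generated script, passing it one argument
demo_out="$(bash "${tmp_dir}/demo.sh" "expanded-when-run")"
echo "${demo_out}"
```

Running `cat "${tmp_dir}/demo.sh"` afterward shows the first value already written into the file as a literal string, while the second is still the variable `${1}`. Forgetting the backslash on a run-time variable would write an empty string (or a stale value from the current shell) into the generated script.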
Code: Run an sbatch echo test using the individual lists
sbatch "./${store}/${script_name_run}"
sbatch "./${store}/${script_name_run}"
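In the submission script, `--array=1-${max_id_job}%${max_id_task}` creates one task per individual list, and the value after `%` caps how many tasks SLURM runs at the same time. Each task then picks its own list from `SLURM_ARRAY_TASK_ID`. The sketch below shows both pieces without calling `sbatch`. The helpers `array_spec` and `list_for_task`, the directory name `sh_err_out`, and the list name `arguments.txt` are hypothetical; the values 288 and 10 are taken from the `squeue` output in the spot check.

```shell
#!/bin/bash

# array_spec (hypothetical helper): build the value passed to '--array'
# :param 1: ID of the last task in the array <int >= 1>
# :param 2: max number of tasks allowed to run at the same time <int >= 1>
array_spec() { echo "1-${1}%${2}"; }

# list_for_task (hypothetical helper): name the individual list for one task
# :param 1: directory storing the individual lists <chr>
# :param 2: name of the complete list, ending in '.txt' <chr>
# :param 3: array task ID <int >= 1>
list_for_task() { echo "./${1}/${2%.txt}.${3}.txt"; }

# Outside of SLURM, SLURM_ARRAY_TASK_ID is unset, so fall back to task 1 for
# a local dry run (the fallback is an addition for this sketch)
task_id="${SLURM_ARRAY_TASK_ID:-1}"

array_spec 288 10
list_for_task "sh_err_out" "arguments.txt" "${task_id}"
```

Tasks held back by the cap are listed by `squeue` as pending with the reason `JobArrayTaskLimit`, which is what the spot check shows.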
Printed: Run an sbatch echo test using the individual lists
Spot check
❯ skal
8612984_[61-288%10 campus-ne echo_Tri kalavatt PD 0:00 1 (JobArrayTaskLimit) 1
8612902_[141-288%1 campus-ne echo_Tri kalavatt PD 0:00 1 (JobArrayTaskLimit) 1
❯ cat trinity_mkc-8_mir-0.005_mg-4_gf-0.005.8612902-153.out.txt
singularity run \
--bind /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd:/data \
--bind /fh/scratch/delete30/tsukiyama_t:/loc/scratch \
/home/kalavatt/singularity-docker-etc/Trinity.sif Trinity \
--verbose \
--max_memory 50G \
--CPU 1 \
--SS_lib_type FR \
--seqType fq \
--left /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
--right /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
--jaccard_clip \
--output files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd/trinity_mkc-8_mir-0.005_mg-4_gf-0.005 \
--full_cleanup \
--min_kmer_cov 8 \
--min_iso_ratio 0.005 \
--min_glue 4 \
--glue_factor 0.005 \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200
❯ cat trinity_mkc-8_mir-0.1_mg-4_gf-0.005.8612984-189.out.txt
singularity run \
--bind /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd:/data \
--bind /fh/scratch/delete30/tsukiyama_t:/loc/scratch \
/home/kalavatt/singularity-docker-etc/Trinity.sif \
Trinity \
--verbose \
--max_memory 50G \
--CPU 1 \
--SS_lib_type FR \
--seqType fq \
--left /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz \
--right /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz \
--jaccard_clip --output files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd/trinity_mkc-8_mir-0.1_mg-4_gf-0.005 \
--full_cleanup \
--min_kmer_cov 8 \
--min_iso_ratio 0.1 \
--min_glue 4 \
--glue_factor 0.005 \
--max_reads_per_graph 2000 \
--normalize_max_read_cov 200 \
--group_pairs_distance 700 \
--min_contig_length 200
❯ skal
8612984_[61-288%10 campus-ne echo_Tri kalavatt PD 0:00 1 (JobArrayTaskLimit) 1
8612902_[141-288%1 campus-ne echo_Tri kalavatt PD 0:00 1 (JobArrayTaskLimit) 1
❯ cat trinity_mkc-8_mir-0.005_mg-4_gf-0.005.8612902-153.out.txt
singularity run --bind /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd:/data --bind /fh/scratch/delete30/tsukiyama_t:/loc/scratch /home/kalavatt/singularity-docker-etc/Trinity.sif Trinity --verbose --max_memory 50G --CPU 1 --SS_lib_type FR --seqType fq --left /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz --right /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz --jaccard_clip --output files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd/trinity_mkc-8_mir-0.005_mg-4_gf-0.005 --full_cleanup --min_kmer_cov 8 --min_iso_ratio 0.005 --min_glue 4 --glue_factor 0.005 --max_reads_per_graph 2000 --normalize_max_read_cov 200 --group_pairs_distance 700 --min_contig_length 200
❯ cat trinity_mkc-8_mir-0.1_mg-4_gf-0.005.8612984-189.out.txt
singularity run --bind /home/kalavatt/tsukiyamalab/kalavatt/2022_transcriptome-construction/results/2022-1201/files_processed-full/fastq_trim-rcor-cor_split/EndToEnd:/data --bind /fh/scratch/delete30/tsukiyama_t:/loc/scratch /home/kalavatt/singularity-docker-etc/Trinity.sif Trinity --verbose --max_memory 50G --CPU 1 --SS_lib_type FR --seqType fq --left /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.1.fq.gz --right /data/5781_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz,/data/5782_Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd.Aligned.sortedByCoord.out.sc_all.2.fq.gz --jaccard_clip --output files_Trinity-GF/files_processed-full/Q_IP_merged.trim-rcor.multi-hit-mode_1_EndToEnd/trinity_mkc-8_mir-0.1_mg-4_gf-0.005 --full_cleanup --min_kmer_cov 8 --min_iso_ratio 0.1 --min_glue 4 --glue_factor 0.005 --max_reads_per_graph 2000 --normalize_max_read_cov 200 --group_pairs_distance 700 --min_contig_length 200
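The spot check above compares, by eye, the parameter values in each log's filename with the values in the command echoed inside the log. A rough sketch of automating that comparison follows. The helper `check_log` is hypothetical, and the filename pattern it assumes (`trinity_mkc-*_mir-*_mg-*_gf-*.<job>-<task>.out.txt`) is inferred from the two logs printed above; the command string in the demo is abbreviated to the four flags being checked.

```shell
#!/bin/bash

# check_log (hypothetical helper): confirm that the parameter values encoded
# in a log's filename match the values in the command echoed inside the log
# :param 1: log filename, without path <chr>
# :param 2: text of the echoed command <chr>
# :return: 0 if all four values match, 1 if any differ, 2 if the filename
#          does not follow the assumed pattern
check_log() {
    local re='^trinity_mkc-([^_]+)_mir-([^_]+)_mg-([^_]+)_gf-([0-9.]+)\.[0-9]+-[0-9]+\.out\.txt$'
    [[ "${1}" =~ ${re} ]] || return 2

    local mkc="${BASH_REMATCH[1]}" mir="${BASH_REMATCH[2]}"
    local mg="${BASH_REMATCH[3]}" gf="${BASH_REMATCH[4]}"
    local cmd="${2} "  # Trailing space so a flag at the end still matches

    [[ "${cmd}" == *"--min_kmer_cov ${mkc} "* ]] \
        && [[ "${cmd}" == *"--min_iso_ratio ${mir} "* ]] \
        && [[ "${cmd}" == *"--min_glue ${mg} "* ]] \
        && [[ "${cmd}" == *"--glue_factor ${gf} "* ]]
}

log="trinity_mkc-8_mir-0.1_mg-4_gf-0.005.8612984-189.out.txt"
cmd="Trinity --min_kmer_cov 8 --min_iso_ratio 0.1 --min_glue 4 --glue_factor 0.005"

if check_log "${log}" "${cmd}"; then
    echo "match: ${log}"
else
    echo "MISMATCH: ${log}"
fi
```

In practice the second argument would be the contents of the log, e.g., `"$(cat "${log}")"`, and the function could be looped over every `.out.txt` file in the log directory.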