BI622 Demultiplexing
############################## 26 Jul 2022
Part 1 assigned
Working on Talapas, cloned repo: /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/
Bash commands for data exploration:
(base) [bmeluch@talapas-ln1 Demultiplex]$ srun --account=bgmp --partition=bgmp --nodes=1 --ntasks-per-node=1 --time=2:00:00 --cpus-per-task=1 --pty bash
srun: job 21723133 queued and waiting for resources
srun: job 21723133 has been allocated resources
# Storage usage in GB as of Tue Jul 26 13:01:03 2022
Fileset    User      UsedByUser   UsedByAll   Quota   Use%
home       bmeluch   1            -           25      3
bgmp       bmeluch   34           11124       65536   17
packages   bmeluch   0            11802       16384   72
(base) [bmeluch@n278 Demultiplex]$
Read lengths
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R1_001.fastq.gz | head -2 | tail -1 | wc
1 1 102
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R2_001.fastq.gz | head -2 | tail -1 | wc
1 1 9
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R3_001.fastq.gz | head -2 | tail -1 | wc
1 1 9
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R4_001.fastq.gz | head -2 | tail -1 | wc
1 1 102
Subtracting the newline that wc counts, R1/R4 are 101 bp biological reads and R2/R3 are 8 bp index reads.
Phred encoding
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R4_001.fastq.gz | head -4 | tail -1
#AAFAFJJ-----F---7-<FA-F<AFFA-JJJ77<FJFJFJJJJJJJJJJAFJFFAJJJJJJJJFJF7-AFFJJ7F7JFJJFJ7FFF--A<A7<-A-7--
there's a # (ASCII 35), which is inside the Phred+33 range (33-73) but below the Phred+64 range (64-104), so this is Phred+33
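Quick sanity check on that logic (just an illustrative snippet, not part of the assignment scripts):
# '#' can only appear in Phred+33-encoded quality strings
qual_char = "#"
print(ord(qual_char))                 # 35
print(ord(qual_char) - 33)            # quality score 2 under Phred+33
print(33 <= ord(qual_char) <= 73)     # True: valid Phred+33 character
print(64 <= ord(qual_char) <= 104)    # False: not a valid Phred+64 character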
File lengths
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R1_001.fastq.gz | wc -l
1452986940
that took so long frankly I do not think I want to do that for all four files
activating conda environment to run distribution plot script
(base) [bmeluch@n278 Assignment-the-first]$ conda activate bgmp_py310
(bgmp_py310) [bmeluch@n278 Assignment-the-first]$
Script done /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/base_distribution.py
Running for the read1 file using sbatch
out/err files here: /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725496
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725507
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725523
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725526
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725528
it took several tries and small edits to the script - I needed to open the gzipped file with gzip in text mode ('rt') - but now it's running so let's see
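For the record, the file-opening pattern that ended up working (minimal sketch; the real base_distribution.py also handles argparse and the per-position quality sums):
import gzip

# "rt" = read text mode, so iterating yields str lines instead of bytes
with gzip.open("/projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz", "rt") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 3:               # quality line of each 4-line record
            qual = line.strip()
            # ... add ord(char) - 33 to a running per-position total here ...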
AHH I set the output directory wrong. Cancelling job.
Trying again without saving a histogram to the shared directory :P
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725571
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm/dist_hist1_21725571.err
Command being timed: "/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/base_distribution.py -f /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz -o /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/histograms/read1.png"
User time (seconds): 9092.21
System time (seconds): 2.23
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:31:40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 66168
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 62865
Voluntary context switches: 1925
Involuntary context switches: 2983
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Run the other three in one script overnight:
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm/histogram_i1_i2_r2.srun
(base) [bmeluch@talapas-ln1 slurm]$ sbatch histogram_i1_i2_r2.srun
Submitted batch job 21731338
############################## 27 July 2022
It didn't work: the job doesn't know what matplotlib is, so I need to activate my conda environment in my slurm script :(
added "conda activate bgmp_py310" before commands in script
Running again
(base) [bmeluch@talapas-ln1 slurm]$ sbatch histogram_i1_i2_r2.srun
Submitted batch job 21738732
Success! Histograms saved to /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/histograms
############################## 28 Jul 2022
Working on pseudocode for the algorithm
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/pseudocode.txt
############################## 29 Jul 2022
Finished pseudocode
Wrote unit tests
Set up 'N' nucleotide count as a slurm script
Okay, that only took a few minutes to run; I could have done it on the command line.
Commands:
/usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz | \
grep -A 1 "^@" | grep -v "^@" | grep -c "N"
/usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz | \
grep -A 1 "^@" | grep -v "^@" | grep -c "N"
Output:
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm/n_count_21768476.out
Index 1: 3976613
Index 2: 3328051
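A rough Python equivalent of those grep pipelines, just to spell out what's being counted (illustrative only; the grep commands above are what actually produced the numbers above):
import gzip

def count_reads_with_N(path: str) -> int:
    # count records whose sequence line contains at least one N
    with gzip.open(path, "rt") as fq:
        return sum(1 for i, line in enumerate(fq) if i % 4 == 1 and "N" in line)

print(count_reads_with_N("/projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz"))
print(count_reads_with_N("/projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz"))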
Finishing answers, submitting to github.
############################## 02 Aug 2022
Starting to write demux script for real, now that I have peer review feedback
Cloned the git repository to my local computer so I don't have to be connected to Talapas while I work on unit tests, but now Pylance is showing errors in base_distribution.py. Technically that part of the assignment is done, but I still wish it would not!
Matplotlib is definitely installed in my base environment on my computer.
Line 58
mean[position] = qscores[position]/(lcount/4)
(variable) position: int
Argument of type "int" cannot be assigned to parameter "__s" of type "slice" in function "__setitem__"
"int" is incompatible with "slice"PylancereportGeneralTypeIssues
Argument of type "float" cannot be assigned to parameter "__o" of type "Iterable[int]" in function "__setitem__"
"float" is incompatible with protocol "Iterable[int]"
"__iter__" is not presentPylancereportGeneralTypeIssues
WHY, PYLANCE. THIS RAN JUST FINE.
okay I figured it out. The list mean[] was initialized with integer 0, not float 0.0, so Pylance inferred list[int] and flagged assigning a float to an element. Changed the list initialization to fill it with 0.0. Committed the change.
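Roughly what changed (a sketch, not the exact lines from base_distribution.py; the read length and line count are stand-in numbers):
# before: [0] * read_length makes Pylance infer list[int], so assigning a float is flagged
# after: seed the lists with floats so the inferred type is list[float]
read_length = 101               # assumed read length for illustration
qscores = [0.0] * read_length   # running sum of quality scores per position
mean = [0.0] * read_length      # mean quality per position
lcount = 404                    # hypothetical total line count of the FASTQ
for position in range(read_length):
    mean[position] = qscores[position] / (lcount / 4)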
Base environment has python 3.10.4, so when I move it to Talapas to do the actual analysis it will be running the same 3.10 at least.
(base) bmeluch@LAPTOP-M108AT0U:~/bioinfo/Bi622$ which python
/home/bmeluch/miniconda3/bin/python
(base) bmeluch@LAPTOP-M108AT0U:~/bioinfo/Bi622$ python --version
Python 3.10.4
Python interpreter set to base environment, which fixed Pylance not recognizing modules, and which I should have thought of first.
I think I'm ready to start writing now!
############################## 09 Aug 2022
Continuing work on demultiplex.py
- learned about itertools islice to grab 4 lines at a time
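Sketch of the islice pattern (illustrative; demultiplex.py does this for all four files in parallel):
import gzip
from itertools import islice

def fastq_records(path):
    # yield one FASTQ record at a time as [header, sequence, plus, quality]
    with gzip.open(path, "rt") as fq:
        while True:
            record = [line.strip() for line in islice(fq, 4)]
            if not record:   # islice returns nothing once the file is exhausted
                return
            yield record

# hypothetical test file name
for header, seq, plus, qual in fastq_records("test_R1.fastq.gz"):
    pass  # demux logic goes here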
############################## 10 Aug 2022
Script is set up to output counts as data files so I can write a separate plotting script and fiddle around with it more easily
Fiddle with unit tests to test out index N replacement
Transfer to Talapas
Git threw a fit about my gitignores not being merged, but it looks like everything pulled properly.
Running!
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/demux.srun
1st: failed, had AND instead of OR when testing for unknown indices
2nd: cancelled, input the wrong output folder
3rd: corrected output folder this BETTER WORK
Slurm job 21960369 "demux"
############################## 11 Aug 2022
using 'du .' to check output directory sizes, squeue -A bgmp to check runtime
Still running. Almost 11 hours in, it has produced 29 gigs of output (~44 MB per minute), so it will take something like 18 hours.
Other people reported anywhere from 1 to 12 hours...
demux2 job to see if I can speed it up
Tried to change error correction to use str.find() instead of for loop through string.
38 minutes in, 1.4 gigs output (~35M per minute)
So that's even slower! And I'm definitely past the first 1000 or so reads that we know are low quality.
Cancelling job 21961553 "demux2"
Deleting partially completed output files in /output2/ folder
Add an 'if N in string' check before the correction logic; let's see if it's faster
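What that guard looks like in spirit (sketch only; the set of known indexes and the correction logic are stand-ins for whatever demultiplex2.py actually does):
def match_index(index: str, known_indexes: set[str]) -> str | None:
    # return the matching known index, or None if the read can't be assigned
    if "N" not in index:
        # the new guard: most reads have no N, so skip the correction work entirely
        return index if index in known_indexes else None
    # only reached for reads that actually contain an N:
    # accept only if exactly one known index matches at every non-N position
    matches = [known for known in known_indexes
               if all(a == b or a == "N" for a, b in zip(index, known))]
    return matches[0] if len(matches) == 1 else None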
reusing demultiplex2.py, demux2.srun
Submitted batch job 21962897
this is still slower than the original. 2.47 gigs at 64 minutes (~38M per minute)
Cancelling job 21962897
Deleting partially completed output files in /output2/ folder
Let's just take out the error correction entirely...
reusing demultiplex2.py, demux2.srun
Submitted batch job 21963372
WELL, APPARENTLY 18 hours is a reasonable amount of time for writing to gzip files to take. So I'm going to cancel the other attempt and just wait for my first one.
scancel 21963372
DEMUX RUN 1 COMPLETE
Assignment-the-third/slurm/demux_21960369.err
Assignment-the-third/slurm/demux_21960369.out
Talapas email
"Slurm Job_id=21960369 Name=demux Ended, Run time 14:43:00, COMPLETED, ExitCode 0"
3:10 pm
Hooray!
Working on processing output into nice plots.
I want to get the line counts for all my input files and output files and make sure they add up properly
fastq_line_counts.srun
/usr/bin/time -v /usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/*.fastq.gz | wc -l >> \
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output/FASTQ_Line_Counts.txt
/usr/bin/time -v zcat /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output/*.fastq.gz | wc -l >> \
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output/FASTQ_Line_Counts.txt
############################## 11 Aug 2022
That just output two sums: all the lines in the input files and all the lines in the output files. (Also, because I accidentally doubled up /usr/bin/time -v on the first command, it printed two identical timing blocks for the input files.)
/output/FASTQ_Line_Counts.txt
5811947760
2905973880
Command being timed: "zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R4_001.fastq.gz"
User time (seconds): 1568.07
System time (seconds): 66.64
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 27:18.95
Exit status: 0
Command being timed: "/usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R4_001.fastq.gz"
User time (seconds): 1568.07
System time (seconds): 66.64
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 27:18.95
Exit status: 0
Command being timed: < it listed out all the filenames in the output directory and that's way too long>
User time (seconds): 1155.67
System time (seconds): 53.37
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 20:14.45
Exit status: 0
The total in the input files is twice the total in the output files (5,811,947,760 / 2 = 2,905,973,880) because the input includes the index fastqs, which were not written to demuxed files
How do I get it to output to one line per file? In the terminal, zcat *.fastq.gz | wc -l outputs count and filename all nicely ordered together. Why isn't that what wrote to my output text file?
(lots of googling) maybe I need to use echo
Trying again
/slurm/fastq_line_counts.srun
# cd to output folder
cd /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output
# count lines in R1 (read1) and R4 (read2) input files
/usr/bin/time -v echo "$(zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz | wc -l) 1294_S1_L008_R1_001.fastq.gz" >> FASTQ_Line_Counts.txt
/usr/bin/time -v echo "$(zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R4_001.fastq.gz | wc -l) 1294_S1_L008_R4_001.fastq.gz" >> FASTQ_Line_Counts.txt
# count lines in output files
/usr/bin/time -v \
for filename in ./*.fastq.gz; do
echo "$(zcat $filename | wc -l) $filename" >> FASTQ_Line_Counts.txt
done
The first two worked but my for loop didn't
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/fastq_line_counts_21983585.err
/cm/local/apps/slurm/var/spool/job21983585/slurm_script: line 21: syntax error near unexpected token `do'
/cm/local/apps/slurm/var/spool/job21983585/slurm_script: line 21: `for filename in ./*.fastq.gz; do'
Try again to fix the for loop. Pete says it probably doesn't recognize my file path, so I changed the syntax a little bit based on more googling.
# count lines in output files
/usr/bin/time -v \
for filename in "$folder"/*.fastq.gz; do
echo "$(zcat $filename | wc -l) $filename" >> FASTQ_Line_Counts.txt
done
It failed immediately.
/cm/local/apps/slurm/var/spool/job21984456/slurm_script: line 13: folder: command not found
/cm/local/apps/slurm/var/spool/job21984456/slurm_script: line 22: syntax error near unexpected token `do'
/cm/local/apps/slurm/var/spool/job21984456/slurm_script: line 22: `for filename in "$folder"/*.fastq.gz; do'
It works without /usr/bin/time attached! So Pete thinks you probably can't time a for loop directly (which makes sense: /usr/bin/time needs an executable to run, and 'for' is a shell keyword, not a command). He says to put it in a separate bash script and then time the bash script.
fastq_line_counts.srun
/usr/bin/time -v /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/fastq_line_counts.bash
fastq_line_counts.bash
#!/bin/bash
cd /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output
for filename in ./*.fastq.gz; do
echo "$(zcat $filename | wc -l) $filename" >> FASTQ_Line_Counts.txt
done
IT'S RUNNING
Demultiplex/Assignment-the-third/slurm/fastq_line_counts_21984557.err
Command being timed: "/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/fastq_line_counts.bash"
User time (seconds): 1223.80
System time (seconds): 100.00
Percent of CPU this job got: 105%
Elapsed (wall clock) time (h:mm:ss or m:ss): 20:51.38
Exit status: 0
Wrote Demultiplex/Assignment-the-third/demux_calcs.py to do the percentage math and plots. Wrote everything up in Summary.md. Uploading to github.
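The percentage math itself is simple (hypothetical counts and category names for illustration; demux_calcs.py reads the real counts from the data files the demux run wrote out):
# hypothetical per-category read-pair counts
counts = {"dual-matched": 300_000_000, "index-hopped": 500_000, "unknown": 62_000_000}
total = sum(counts.values())
for category, n in counts.items():
    print(f"{category}: {n:,} read pairs ({100 * n / total:.2f}% of total)")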
HOORAY