BI622 Demultiplexing
############################## 26 Jul 2022
Part 1 assigned
Working on Talapas, cloned repo: /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/
Bash commands for data exploration:
(base) [bmeluch@talapas-ln1 Demultiplex]$ srun --account=bgmp --partition=bgmp --nodes=1 --ntasks-per-node=1 --time=2:00:00 --cpus-per-task=1 --pty bash
srun: job 21723133 queued and waiting for resources
srun: job 21723133 has been allocated resources
# Storage usage in GB as of Tue Jul 26 13:01:03 2022
Fileset    User      UsedByUser   UsedByAll   Quota   Use%
home       bmeluch   1            -           25      3
bgmp       bmeluch   34           11124       65536   17
packages   bmeluch   0            11802       16384   72
(base) [bmeluch@n278 Demultiplex]$
Read lengths
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R1_001.fastq.gz | head -2 | tail -1 | wc
1 1 102
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R2_001.fastq.gz | head -2 | tail -1 | wc
1 1 9
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R3_001.fastq.gz | head -2 | tail -1 | wc
1 1 9
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R4_001.fastq.gz | head -2 | tail -1 | wc
1 1 102
Subtracting the newline that wc counts, R1/R4 are 101 bp biological reads and R2/R3 are 8 bp index reads.
Phred encoding
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R4_001.fastq.gz | head -4 | tail -1
#AAFAFJJ-----F---7-<FA-F<AFFA-JJJ77<FJFJFJJJJJJJJJJAFJFFAJJJJJJJJFJF7-AFFJJ7F7JFJJFJ7FFF--A<A7<-A-7--
there's a # (ASCII 35), which is inside the Phred+33 range (33-73) but below the Phred+64 range (64-104), so this is Phred+33
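Quick sanity check on that logic (just an illustrative snippet, not part of the assignment scripts):
# '#' can only appear in Phred+33-encoded quality strings
qual_char = "#"
print(ord(qual_char))                 # 35
print(ord(qual_char) - 33)            # quality score 2 under Phred+33
print(33 <= ord(qual_char) <= 73)     # True: valid Phred+33 character
print(64 <= ord(qual_char) <= 104)    # False: not a valid Phred+64 character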
File lengths
(base) [bmeluch@n278 2017_sequencing]$ zcat 1294_S1_L008_R1_001.fastq.gz | wc -l
1452986940
that took so long frankly I do not think I want to do that for all four files
activating conda environment to run distribution plot script
(base) [bmeluch@n278 Assignment-the-first]$ conda activate bgmp_py310
(bgmp_py310) [bmeluch@n278 Assignment-the-first]$
Script done /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/base_distribution.py
Running for the read1 file using sbatch
out/err files here: /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725496
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725507
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725523
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725526
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725528
it took several tries and small edits to the script - I needed to open the gzipped file with gzip in text mode ('rt') - but now it's running so let's see
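For the record, the file-opening pattern that ended up working (minimal sketch; the real base_distribution.py also handles argparse and the per-position quality sums):
import gzip

# "rt" = read text mode, so iterating yields str lines instead of bytes
with gzip.open("/projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz", "rt") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 3:               # quality line of each 4-line record
            qual = line.strip()
            # ... add ord(char) - 33 to a running per-position total here ...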
AHH I set the output directory wrong. Cancelling job.
Trying again without saving a histogram to the shared directory :P
(bgmp_py310) [bmeluch@n278 slurm]$ sbatch histogram_read1.srun
Submitted batch job 21725571
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm/dist_hist1_21725571.err
Command being timed: "/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/base_distribution.py -f /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz -o /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/histograms/read1.png"
User time (seconds): 9092.21
System time (seconds): 2.23
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:31:40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 66168
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 62865
Voluntary context switches: 1925
Involuntary context switches: 2983
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Run the other three in one script overnight:
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm/histogram_i1_i2_r2.srun
(base) [bmeluch@talapas-ln1 slurm]$ sbatch histogram_i1_i2_r2.srun
Submitted batch job 21731338
############################## 27 July 2022
It didn't work: the job doesn't know what matplotlib is, so I need to activate my conda environment in my slurm script :(
added "conda activate bgmp_py310" before commands in script
Running again
(base) [bmeluch@talapas-ln1 slurm]$ sbatch histogram_i1_i2_r2.srun
Submitted batch job 21738732
Success! Histograms saved to /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/histograms
############################## 28 Jul 2022
Working on pseudocode for the algorithm
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/pseudocode.txt
############################## 29 Jul 2022
Finished pseudocode
Wrote unit tests
Set up 'N' nucleotide count as a slurm script
Okay, that only took a few minutes to run; I could have done it on the command line.
Commands:
/usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz | \
grep -A 1 "^@" | grep -v "^@" | grep -c "N"
/usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz | \
grep -A 1 "^@" | grep -v "^@" | grep -c "N"
Output:
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-first/slurm/n_count_21768476.out
Index 1: 3976613
Index 2: 3328051
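A rough Python equivalent of those grep pipelines, just to spell out what's being counted (illustrative only; the grep commands above are what actually produced the numbers above):
import gzip

def count_reads_with_N(path: str) -> int:
    # count records whose sequence line contains at least one N
    with gzip.open(path, "rt") as fq:
        return sum(1 for i, line in enumerate(fq) if i % 4 == 1 and "N" in line)

print(count_reads_with_N("/projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz"))
print(count_reads_with_N("/projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz"))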
Finishing answers, submitting to github.
############################## 02 Aug 2022
Starting to write demux script for real, now that I have peer review feedback
Cloned the git repository to my local computer so I don't have to be connected to Talapas while I work on unit tests, but now Pylance is showing errors in base_distribution.py. Technically that part of the assignment is done, but I still wish it would not!
Matplotlib is definitely installed in my base environment on my computer.
Line 58
mean[position] = qscores[position]/(lcount/4)
(variable) position: int
Argument of type "int" cannot be assigned to parameter "__s" of type "slice" in function "__setitem__"
"int" is incompatible with "slice"PylancereportGeneralTypeIssues
Argument of type "float" cannot be assigned to parameter "__o" of type "Iterable[int]" in function "__setitem__"
"float" is incompatible with protocol "Iterable[int]"
"__iter__" is not presentPylancereportGeneralTypeIssues
WHY, PYLANCE. THIS RAN JUST FINE.
okay I figured it out. The list mean[] was initialized with integer 0, not float 0.0, so Pylance inferred list[int] and flagged assigning a float to an element. Changed the list initialization to fill it with 0.0. Committed the change.
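Roughly what changed (a sketch, not the exact lines from base_distribution.py; the read length and line count are stand-in numbers):
# before: [0] * read_length makes Pylance infer list[int], so assigning a float is flagged
# after: seed the lists with floats so the inferred type is list[float]
read_length = 101               # assumed read length for illustration
qscores = [0.0] * read_length   # running sum of quality scores per position
mean = [0.0] * read_length      # mean quality per position
lcount = 404                    # hypothetical total line count of the FASTQ
for position in range(read_length):
    mean[position] = qscores[position] / (lcount / 4)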
Base environment has python 3.10.4, so when I move it to Talapas to do the actual analysis it will be running the same 3.10 at least.
(base) bmeluch@LAPTOP-M108AT0U:~/bioinfo/Bi622$ which python
/home/bmeluch/miniconda3/bin/python
(base) bmeluch@LAPTOP-M108AT0U:~/bioinfo/Bi622$ python --version
Python 3.10.4
Python interpreter set to base environment, which fixed Pylance not recognizing modules, and which I should have thought of first.
I think I'm ready to start writing now!
############################## 09 Aug 2022
Continuing work on demultiplex.py
- learned about itertools islice to grab 4 lines at a time
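Sketch of the islice pattern (illustrative; demultiplex.py does this for all four files in parallel):
import gzip
from itertools import islice

def fastq_records(path):
    # yield one FASTQ record at a time as [header, sequence, plus, quality]
    with gzip.open(path, "rt") as fq:
        while True:
            record = [line.strip() for line in islice(fq, 4)]
            if not record:   # islice returns nothing once the file is exhausted
                return
            yield record

# hypothetical test file name
for header, seq, plus, qual in fastq_records("test_R1.fastq.gz"):
    pass  # demux logic goes here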
############################## 10 Aug 2022
Script is set up to output counts as data files so I can write a separate plotting script and fiddle around with it more easily
Fiddle with unit tests to test out index N replacement
Transfer to Talapas
Git threw a fit about my gitignores not being merged, but it looks like everything pulled properly.
Running!
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/demux.srun
1st: failed, had AND instead of OR when testing for unknown indices
2nd: cancelled, input the wrong output folder
3rd: corrected output folder this BETTER WORK
Slurm job 21960369 "demux"
############################## 11 Aug 2022
using 'du .' to check output directory sizes, squeue -A bgmp to check runtime
Still running. Almost 11 hours in, it has produced 29 gigs of output (~44 MB per minute), so it will take something like 18 hours.
Other people reported anywhere from 1 to 12 hours...
demux2 job to see if I can speed it up
Tried to change error correction to use str.find() instead of for loop through string.
38 minutes in, 1.4 gigs output (~35M per minute)
So that's even slower! And I'm definitely past the first 1000 or so reads that we know are low quality.
Cancelling job 21961553 "demux2"
Deleting partially completed output files in /output2/ folder
Add an 'if N in string' check before the correction logic; let's see if it's faster
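What that guard looks like in spirit (sketch only; the set of known indexes and the correction logic are stand-ins for whatever demultiplex2.py actually does):
def match_index(index: str, known_indexes: set[str]) -> str | None:
    # return the matching known index, or None if the read can't be assigned
    if "N" not in index:
        # the new guard: most reads have no N, so skip the correction work entirely
        return index if index in known_indexes else None
    # only reached for reads that actually contain an N:
    # accept only if exactly one known index matches at every non-N position
    matches = [known for known in known_indexes
               if all(a == b or a == "N" for a, b in zip(index, known))]
    return matches[0] if len(matches) == 1 else None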
reusing demultiplex2.py, demux2.srun
Submitted batch job 21962897
this is still slower than the original. 2.47 gigs at 64 minutes (~38M per minute)
Cancelling job 21962897
Deleting partially completed output files in /output2/ folder
Let's just take out the error correction entirely...
reusing demultiplex2.py, demux2.srun
Submitted batch job 21963372
WELL, APPARENTLY 18 hours is a reasonable amount of time for writing to gzip files to take. So I'm going to cancel the other attempt and just wait for my first one.
scancel 21963372
DEMUX RUN 1 COMPLETE
Assignment-the-third/slurm/demux_21960369.err
Assignment-the-third/slurm/demux_21960369.out
Talapas email
"Slurm Job_id=21960369 Name=demux Ended, Run time 14:43:00, COMPLETED, ExitCode 0"
3:10 pm
Hooray!
Working on processing output into nice plots.
I want to get the line counts for all my input files and output files and make sure they add up properly
fastq_line_counts.srun
/usr/bin/time -v /usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/*.fastq.gz | wc -l >> \
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output/FASTQ_Line_Counts.txt
/usr/bin/time -v zcat /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output/*.fastq.gz | wc -l >> \
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output/FASTQ_Line_Counts.txt
############################## 11 Aug 2022
That just output two sums: all the lines in the input files and all the lines in the output files. (Also, because I accidentally doubled up /usr/bin/time -v on the first command, it printed two identical timing blocks for the input files.)
/output/FASTQ_Line_Counts.txt
5811947760
2905973880
Command being timed: "zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R4_001.fastq.gz"
User time (seconds): 1568.07
System time (seconds): 66.64
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 27:18.95
Exit status: 0
Command being timed: "/usr/bin/time -v zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R2_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R3_001.fastq.gz /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R4_001.fastq.gz"
User time (seconds): 1568.07
System time (seconds): 66.64
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 27:18.95
Exit status: 0
Command being timed: < it listed out all the filenames in the output directory and that's way too long>
User time (seconds): 1155.67
System time (seconds): 53.37
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 20:14.45
Exit status: 0
The total in the input files is twice the total in the output files (5,811,947,760 / 2 = 2,905,973,880) because the input includes the index fastqs, which were not written to demuxed files
How do I get it to output to one line per file? In the terminal, zcat *.fastq.gz | wc -l outputs count and filename all nicely ordered together. Why isn't that what wrote to my output text file?
(lots of googling) maybe I need to use echo
Trying again
/slurm/fastq_line_counts.srun
# cd to output folder
cd /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output
# count lines in R1 (read1) and R4 (read2) input files
/usr/bin/time -v echo "$(zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R1_001.fastq.gz | wc -l) 1294_S1_L008_R1_001.fastq.gz" >> FASTQ_Line_Counts.txt
/usr/bin/time -v echo "$(zcat /projects/bgmp/shared/2017_sequencing/1294_S1_L008_R4_001.fastq.gz | wc -l) 1294_S1_L008_R4_001.fastq.gz" >> FASTQ_Line_Counts.txt
# count lines in output files
/usr/bin/time -v \
for filename in ./*.fastq.gz; do
echo "$(zcat $filename | wc -l) $filename" >> FASTQ_Line_Counts.txt
done
The first two worked but my for loop didn't
/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/fastq_line_counts_21983585.err
/cm/local/apps/slurm/var/spool/job21983585/slurm_script: line 21: syntax error near unexpected token `do'
/cm/local/apps/slurm/var/spool/job21983585/slurm_script: line 21: `for filename in ./*.fastq.gz; do'
Try again to fix the for loop. Pete says it probably doesn't recognize my file path, so I changed the syntax a little bit based on more googling.
# count lines in output files
/usr/bin/time -v \
for filename in "$folder"/*.fastq.gz; do
echo "$(zcat $filename | wc -l) $filename" >> FASTQ_Line_Counts.txt
done
It failed immediately.
/cm/local/apps/slurm/var/spool/job21984456/slurm_script: line 13: folder: command not found
/cm/local/apps/slurm/var/spool/job21984456/slurm_script: line 22: syntax error near unexpected token `do'
/cm/local/apps/slurm/var/spool/job21984456/slurm_script: line 22: `for filename in "$folder"/*.fastq.gz; do'
It works without /usr/bin/time attached! So Pete thinks you probably can't time a for loop directly (which makes sense: /usr/bin/time needs an executable to run, and 'for' is a shell keyword, not a command). He says to put it in a separate bash script and then time the bash script.
fastq_line_counts.srun
/usr/bin/time -v /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/fastq_line_counts.bash
fastq_line_counts.bash
#!/bin/bash
cd /projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/output
for filename in ./*.fastq.gz; do
echo "$(zcat $filename | wc -l) $filename" >> FASTQ_Line_Counts.txt
done
IT'S RUNNING
Demultiplex/Assignment-the-third/slurm/fastq_line_counts_21984557.err
Command being timed: "/projects/bgmp/bmeluch/bioinfo/Bi622/Demultiplex/Assignment-the-third/slurm/fastq_line_counts.bash"
User time (seconds): 1223.80
System time (seconds): 100.00
Percent of CPU this job got: 105%
Elapsed (wall clock) time (h:mm:ss or m:ss): 20:51.38
Exit status: 0
Wrote Demultiplex/Assignment-the-third/demux_calcs.py to do the percentage math and plots. Wrote everything up in Summary.md. Uploading to github.
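The percentage math itself is simple (hypothetical counts and category names for illustration; demux_calcs.py reads the real counts from the data files the demux run wrote out):
# hypothetical per-category read-pair counts
counts = {"dual-matched": 300_000_000, "index-hopped": 500_000, "unknown": 62_000_000}
total = sum(counts.values())
for category, n in counts.items():
    print(f"{category}: {n:,} read pairs ({100 * n / total:.2f}% of total)")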
HOORAY