-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathmim2_bioinformatics.qmd
2109 lines (1604 loc) · 75.6 KB
/
mim2_bioinformatics.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "DADA2 bioinformatic pipeline for the Mimulus ITS sequence dataset"
author: "Bolívar Aponte Rolón and Mareli Sánchez Juliá"
date: "Last edited: `r format(Sys.time(), '%B %d, %Y')`"
format:
html:
toc: true
toc-location: left
toc-depth: 2
number-sections: true
number-depth: 1
theme: lumen
highlight-style: github
code-overflow: wrap
code-fold: false
code-copy: true
code-link: false
code-tools: false
code-block-border-left: "#0C3823"
code-block-bg: "#eeeeee"
fig-cap-location: margin
linestretch: 1.25
fontsize: "large"
embed-resources: true
execute:
echo: true
keep-md: true
editor:
markdown:
wrap: 72
---
```{r setup}
knitr::opts_chunk$set(out.width ='70%', fig_align = 'center', echo = TRUE, collapse = TRUE, eval=FALSE)
```
# DADA2 Pipeline
## Installation
See https://benjjneb.github.io/dada2/dada-installation.html for more
details. The main R script can be found
[here](https://www.bioconductor.org/packages/devel/bioc/vignettes/dada2/inst/doc/dada2-intro.R)
Following DADA2 ITS Pipeline tutorial,
\[v1.18\](https://benjjneb.github.io/dada2/ITS_workflow.html. Also Emily
Farrer's
[code](https://github.com/ecfarrer/LAmarshGradient2/blob/master/BioinformaticsITS.R)
on github "LAmarshGradient2 ITS code.
::: callout-info
**To carry out this pipeline in an HPC cluster see section: [Cypress:
How to submit this as a SLURM job.]**
:::
### Packages required
```{r, Pre-requisites}
#| eval: true
#| echo: false
#| tidy: true
#| warning: false
# Activate commands as needed.
# change the ref argument to get other versions
# DADA2 package and associated
# if (!require("BiocManager", quietly = TRUE)){
# install.packages("BiocManager", repo="http://cran.rstudio.com/")
# }
# BiocManager::install(version = "3.17")
# BiocManager::install(c("dada2","ShortRead", "Biostrings"))
#
# install.packages("usethis")
# install.packages("devtools")
# install.packages("Rcpp")
# devtools::install_github("benjjneb/dada2")
# if (!require("BiocManager", quietly = TRUE)){ #Another way of installing the latest version of dada2
# install.packages("BiocManager")}
#
#The following initializes usage of Bioconductor devel version
#BiocManager::install(version='devel', ask = FALSE) #BiocManager 3.17 (dada2 1.28.0) or developer version (dada2 1.29.0)
#BiocManager::install(c("dada2", "ShortRead", "BioStrings"))
# Loading packages
library("usethis")
library("devtools")
library("Rcpp")
library("dada2")
library("ShortRead")
library("Biostrings") #This will install other packages and dependencies.
packageVersion("dada2") #checking if it is the latest version
packageVersion("ShortRead")
packageVersion("Biostrings")
```
## File preparation: name clean-up
The DADA2 pipeline has the capacity to work with unzipped ".fastq"
files. It is good practice to ensure that all your files names have the
same amount of fields and characters in there names. This ensures that
scripts treat your files equally and you don't accidentally merge files
(e.g. *R1.fastq* + *R2.fastq* = *R1.fastq*). Note that DNA extraction
controls and PCR controls are a little bit different than field samples
and treated as such by using a slightly different script. For this I
have prepared the following scripts.
**Samples** No need to unzip files if `cutadapt` and `dada2` can handle
zipped files.
```{bash}
#!/bin/bash
for file in *.fastq.gz
do
echo "Unziping"
gzip -d "$file"
done
#Rename the files
for file in *.fastq
do
echo "Renaming"
newname=$(echo $file | cut -d_ -f1,2,5).fastq
echo "Renaming $file as $newname"
mv $file $newname
done
##Script from HPC workshop 2 3/16/2023
##Updated on 6/7/2023 with troubleshooting with ChatGPT.
#Various iterations offered were not exactly what
# I needed.-BAR
```
**Notes on the name clean-up for Aponte_8756_23110702 run:**
- Repeated samples were concatenated.
- SPLB_L004_1_R1.fastq
- SPLB_L004_2_R1.fastq
- SPLB_L004_1_R2.fastq
- SPLB_L004_2_R2.fastq
**Controls** No need to unzip files if `cutadapt` and `dada2` can handle
zipped files.
```{bash}
#!/bin/bash
for file in *.fastq.gz
do
echo "Unziping"
gzip -d "$file"
done
#Rename the files
for file in *.fastq
do
echo "Renaming"
newname=$(echo $file | cut -d_ -f1,4 | sed 's/-//g').fastq
echo "Renaming $file as $newname"
mv $file $newname
done
##Script from HPC workshop 2 3/16/2023
##Updated on 6/7/2023 with troubleshooting with ChatGPT.
#Various iterations offered were not exactly what
# I needed.-BAR
```
When working from Mac OS **bash** and Linux **console** make the shell
scripts executable
```{bash}
chmod +x FILENAME.sh
```
Put them in your PATH by placing it in my \~/bin directory and adding
the following line to \~/.bashrc:
```{bash}
export PATH=$HOME/bin:$PATH
```
This can also be done through the `miniconda3` command-prompt.
These files live in the same directory as the samples and controls,
respectively. The were executed through the MinGW-64 terminal that
RStudio provides. It emulates a UNIX/POSIX environment on Windows. This
is an alternative in Windows OS or you can establish a virtual machine
with [Virtual Box](https://www.virtualbox.org/) and install a Linux OS
(i.e. Ubuntu) and proceed the same. In Mac OS this is not necessary.
## Repository and file paths
Make various directories beforehand to keep your file output organized.
Starting with `raw_sequences`, where *raw_sequences* is the directory
name of where the main `fastq` or `fasta` files are. This is were the
raw sequence files are. The file names have already been cleaned, see
sections above. Create `filtN` as a sub-directory of `raw_sequences`.
The final directories: `ASV_tables`, `preprocess`, `qc_reports`, and
`taxonomy`. As you work through this tutorial the files will be saved in
their respective directories.
For this project, I will be working with two separate sequencing runs:
*Aponte_8450_23052601* and *Aponte_8756_23110702*. Both contain samples.
They had to be split due to a large number of samples and how laboratory
work was conducted. The document will be structure by the two run
numbers.
```{r, Paths}
#| eval: true
#| echo: false
#| tidy: true
#Aponte_8450_23052601
path_1 <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics/Aponte_8450_23052601"
out_dir <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics"
list.files(path_1)
fnFs_seq1 <- sort(list.files(path_1, pattern = "R1.fastq.gz", full.names = TRUE))
fnRs_seq1 <- sort(list.files(path_1, pattern = "R2.fastq.gz", full.names = TRUE))
#Aponte_8756_23110702
path_2 <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics/Aponte_8756_23110702"
list.files(path_2)
fnFs_seq2 <- sort(list.files(path_2, pattern = "R1.fastq.gz", full.names = TRUE))
fnRs_seq2 <- sort(list.files(path_2, pattern = "R2.fastq.gz", full.names = TRUE))
```
# Aponte_8450_23052601
Software and package versions used as of 03/FEB/2024: - FastQC
(v0.12.1-4) -\> installed through [BioArchLinux
repository](https://github.com/BioArchLinux/Packages) - MultiQC
(v1.19-1) -\> install through the Arch Linux User Repository (AUR). -
Cutadapt (v4.6-1) -\> installed through [BioArchLinux
repository](https://github.com/BioArchLinux/Packages)
### FastQC reports: raw sequences
Inspect your sequence data before jumping in to cut and trim. You want
to get a sense of how the sequencing run went and what you need to do in
downstream processes. [Why do quality
control?](https://www.bioinformatics.babraham.ac.uk/training/Sequence_QC_Course/Sequencing%20Quality%20Control.pdf)
Assuming you have installed FastQC and MultiQC, go to the directory were
your sequence files (.fastaq.gz) are and generate reports. This can be
through the bash command-line in Mac and Linux or the Miniconda3
command-prompt.
- [FastQC how to:](https://www.youtube.com/watch?v=9_l_hWESuCQ) Mac
and Linux OS
```{bash}
#!/bin/bash
mkdir qc_reports #Change to your preferred directory name.
#Concatenate all forward (R1) and reverse (R2) reads
cat in_directory/*_R1.fastq >> out_directory/all_R1_rawreads.fastq
cat in_directory/*_R2.fastq >> in_directory/all_R2_rawreads.fastq
#Execute fastqc on files
fastqc qc_reports/all_R1_rawreads.fastq -o qc_reports/all_R1_rawreads
fastqc qc_reports/all_R2_rawreads.fastq -o qc_reports/all_R2_rawreads
#This can take a while. It is generating .html reports and associated compressed files. Execute within the directory or from.
```
- Repeat as necessary for number of sequencing runs.
Windows (miniconda3) In windows you can use a GUI to select the file you
want to create a report. It is not fully support through the
command-line. Make sure to use the `miniconda3` command-prompt and that
it is installed properly.
```{bash}
fastqc *.fastq.gz
```
How do can we interpret these reports? What kind of sequence data do I
have? Does this look OK? See here: + Galaxy Training!: [Quality
Control](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html) +
[EDAMAME
tutorial](https://github.com/edamame-course/FastQC/blob/master/for_review/2016-06-22_FastQC_tutorial.md)
In the Van Bael lab we produce 16S and ITS amplicon sequence data, for
the most part. Amplicon sequences are considered "low" diversity due
thei enriched proportion of nucleotides (per base sequence). Hence, when
the sequence platform calls nucleotides it tends to be poor in the
intiall base pairs.
### MultiQC: raw sequences
This command will search for all FastQC reports and create summary
report of all of them. You can create a shell script and execute in a
similar way as `fastqc`.
```{bash}
#In the directory that you have the reports, execute:
multiqc .
# To execute in a subdirectory
multiqc directory/
```
### Identifiying primers
Each project and organism type will have its own set of primers and
adapters. Take the time to figure out which ones you used and their
proper bases and lengths. You should received the data from the
sequencing core demultiplexed since the barcodes (indexes **i5** and
**i7**) are submitted with the order. Here is a list of the most
commonly used in the Van Bael Lab.
- [VBL Culture
primers](https://drive.google.com/open?id=0B9v0CdUUCqU5YVhJck1zT1VTZ28&resourcekey=0-1Nyzv3mGzpJLqDvoo-ls9A&usp=drive_fs)
- [VBL NGS
primers](https://drive.google.com/open?id=1bUY7dy_JNlkpvzcXcW1ImQAMSckexeSj&usp=drive_fs)
```{r, ITS1f_adapt_ITS2r_adapt}
#| eval: true
#| echo: false
#| tidy: true
FWD<-"CACTCTTTCCCTACACGACGCTCTTCCGATCTCTTGGTCATTTAGAGGAAGTAA" # 5'- 3' Forward ITS1f_adapt modified with the Illumina TruSeq adaptor
nchar(FWD) #Number of primer nucleotides.
REV<-"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCTGCGTTCTTCATCGATGC" # 5'- 3' Reverse primer ITS2r_adapt modified with the Illumina TruSeq adaptor
nchar(REV)
```
### Verifying the orientation of the primers
```{r, orientation_primers}
#| eval: true
#| echo: false
#| tidy: true
allOrients <- function(primer){# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # The Biostrings works w/ DNAString objects rather than character vectors
orients <- c(Forward = dna,
Complement = Biostrings::complement(dna),
Reverse = Biostrings::reverse(dna),
RevComp = Biostrings::reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
#FWD2 <- FWD.orients[["Complement"]] #Use if you suspect an orientation mix-up.
#REV2 <- REV.orients[["Complement"]]
```
### Filter and Trim Reads for ambiguous bases (N)
We are filtering sequences for presence of ambiguous bases (N) before
cutting primerd off. This pre-filtering step improves maping of short
primer sequences. No other filtering is performed.
```{r, filtering_ambiguous_bases_N}
#| eval: true
#| echo: false
#| tidy: true
fnFs_seq1_filtN <- file.path(path_1, "filtN", basename(fnFs_seq1)) # Put N-filtered forward read files in filtN/ subdirectory
fnRs_seq1_filtN <- file.path(path_1, "filtN", basename(fnRs_seq1)) #Reverse reads
# filterAndTrim(fnFs_seq1, fnFs_seq1_filtN,
# fnRs_seq1, fnRs_seq1_filtN,
# maxN = 0, multithread = TRUE)
```
### Checking for primer hits
Have primers been removed?
```{r, primer_hits}
#| eval: true
#| echo: false
#| tidy: true
set.seed(123)
#Once this has been completed there is no need to run again when working on the script
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs_seq1_filtN[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs_seq1_filtN[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs_seq1_filtN[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs_seq1_filtN[[1]]))
```
We see the reverse complement of primers present in the FWD.ReverseReads
and the Rev.ForwardReads.
### Cutadapt: removal of primers
In previous runs of the pipeline we have encountered a
`Warning: Zero-length sequences detected during dereplication` or the
function `plotQualityProfile` not being able to plot well. These
problems are due to zero-length sequences. Here we use the cut adapt
tool to discard reads less than 20 bp. In later steps, we discard those
less than 50 bp.
```{r, cutadapt}
#| eval: true
#| echo: false
#| tidy: true
#Once this has been completed there is no need to run again when working on the script
cutadapt <- "/usr/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R
```
```{r, path_parameters_cuttting}
#| eval: true
#| echo: false
#| tidy: true
path.cut_1 <- file.path(path_1, "cutadapt") #Remember where this "out" directory path leads to.
print(path.cut_1) #Checking if the path is correct.
if(!dir.exists(path.cut_1)) dir.create(path.cut_1)
fnFs_seq1_cut <- file.path(path.cut_1, basename(fnFs_seq1))
fnRs_seq1_cut <- file.path(path.cut_1, basename(fnRs_seq1))
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
```
```{r, Running_cutadapt}
#| eval: true
#| echo: false
#| tidy: true
# Run Cutadapt
#Once this has been completed there is no need to run again when working on the script.
#
# for(i in seq_along(fnFs_seq1)) {
# system2(cutadapt, args = c(R1.flags, R2.flags,
# "-n", 2, # -n 2 removes FWD and REV from reads
# "-m", 20, # -m 20 removes reads shorter than 20 bp
# "-o", fnFs_seq1_cut[i],
# "-p", fnRs_seq1_cut[i], # output files
# fnFs_seq1_filtN[i],
# fnRs_seq1_filtN[i])) #input files
# }
# String for changing file extension (i.e. .fastq > .fa)
# new_extension <- ".fa"
# for (i in seq_along(fnFs_seq1)) {
# output_file1 <- str_replace(fnFs_seq1_cut[i], "\\.fastq", new_extension)
# output_file2 <- str_replace(fnRs_seq1_cut[i], "\\.fastq", new_extension) This can be added in the code above if desired.
```
### Re-inspecting if all primers were removed.
```{r, Re-inspect_primer_presence}
#| eval: true
#| echo: false
#| tidy: true
#Once this has been completed there is no need to run again when working on the script
#
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs_seq1_cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs_seq1_cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs_seq1_cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs_seq1_cut[[1]]))
# Forward and reverse fastq filenames have the format:
cutFs_seq1 <- sort(list.files(path.cut_1, pattern = "R1.fastq.gz", full.names = TRUE))
cutRs_seq1 <- sort(list.files(path.cut_1, pattern = "R2.fastq.gz", full.names = TRUE))
#allcutF <- sort(list.files(all.cut, pattern = "R1.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format:
get.sample.name <- function(fname) strsplit(basename(fname), "_R")[[1]][1] #String in commas needs to be updated according to naming convention. If you have multiple underscores in the name then select the underscore next to the "R", like above, or any other unique identifier in the character string.
sample.namesF <- unname(sapply(cutFs_seq1, get.sample.name))
sample.namesR <- unname(sapply(cutRs_seq1, get.sample.name))
head(sample.namesF)
head(sample.namesR)
```
**All primers were successfully removed.**
### Inspect the read quality
The `dada2` package provides a way to visualize this with the
`plotQualityProfile()`function. This will plot the quality scores of
reads per sample. You can also create a concatenated file of all the
forward or reverse reads to evaluate them in one plot. Plotting a
concatenated file may take a while to plot or it may fail. Another way
of inspecting the quality of the reads if using
[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
which has a GUI options or a command-line approach. Both ways are good.
The FastQC approach works better with concatenated files and outputs a
lot more information. This step is to inform you of the quality of the
reads and decide how to cut, trim and truncate your samples. [FastQC in
Linux environment](https://www.youtube.com/watch?v=5nth7o_-f0Q)
```{r}
#| eval: true
#| echo: false
plotQualityProfile(cutFs_seq1[6:2])
plotQualityProfile(cutRs_seq1[6:2])
```
The quality of the reads improves after removal of N calls and adapter
sequences. The `plotQualityProfile` function allows only a subset of the
reads to be plotted. This is not as useful when you have a large number
of samples and you want to inspect the quality of the reads. We will
create new FastQC reports and MultiQC reports to inspect the quality of
the reads and establish which parameters to use for filtering, trimming
and truncation.
### FastQC reports: mismatches (Ns) filtered and cut
See previous [FastQC reports: raw sequences] section. The output should
show an improvement in the quality of the reads.
### MultiQC repots: mismatches (Ns) filtered and cut
Se previous [MultiQC: raw sequences] section. The out put should include
all cut and trimmed FastQC report. It will inform the following steps:
trimming and truncating.
<!-- ### FIGARO: tool for deciding what parameters to use for filtering, trimming and truncation -->
<!-- This step can be performed on the raw reads as well. Here we focus on -->
<!-- the cut, filtered and trimmed reads. -->
<!-- ```{python} -->
<!-- # from figaro import figaro -->
<!-- # resultTable, forwardCurve, reverseCurve = figaro.runAnalysis( -->
<!-- # sequenceFolder = path.cut, -->
<!-- # ampliconLength = 500, #Maximum expected size of the amplicon -->
<!-- # forwardPrimerLength= 54, -->
<!-- # reversePrimerLength = 54, -->
<!-- # minimumOverlap = 20, -->
<!-- # fileNamingStandard, -->
<!-- # trimParameterDownsample, -->
<!-- # trimParameterPercentile) -->
<!-- ``` -->
## Filter and trim
Results from MultiQC and FIGARO inform parameter selection.
For Aponte_8450_23052601 we see that the forwards reads have a decent
quality up to 200-230 bp.
<!-- The reverse reads are of lower quality. We will use a `truncLen` of 210 and 200 for forward and reverse reads, respectively. -->
With further thought, we believe that the best parameters for the
truncation of the reads will be with the use of `maxEE`. These serves as
a primary quality filter as opposed to a size selection filter. See in
this [post](https://github.com/benjjneb/dada2/issues/232) and in [Edgar
and Flyvberg 2015](https://doi.org/10.1093/bioinformatics/btv401). We
will use a `maxEE` and `trunQ` of 2 for both forward and reverse reads.
This is a little relaxed but we are working with poor quality reads. We
will also use a `minLen` of 50 bp and maxN of 0.
The previous parameters where a quality of 2 (Phred score 20) in
arguments `trunQ` and `minQ`, which translates to a read length of about
215 bp, which is decent. We are being a little stringent here since the
quality of these libraries has been poor since the PCR stages. The
average ITS amplicon size for this library pool was 385 bp (see Duke
report and MultiQC report in repository). Around 230 bp is where the
average read drops below a Phred score of \~25.
```{r, assign_file_names}
#| eval: true
#| echo: false
#| tidy: true
filtFs_seq1 <- file.path(out_dir, "filtered/filt_8450", basename(cutFs_seq1))
filtRs_seq1 <- file.path(out_dir, "filtered/filt_8450", basename(cutRs_seq1))
names(filtFs_seq1) <- sample.namesF
names(filtRs_seq1) <- sample.namesR
head(filtFs_seq1, 10)
head(filtRs_seq1, 10)
```
```{r, truncating}
#| eval: true
#| echo: false
#| tidy: true
# Truncating Forward and Reverse reads to ~ 230 bp based on trunQ = 2.
# The reverse maxEE is relaxed due to overall low quality of reads
out_seq1 <- filterAndTrim(cutFs_seq1, filtFs_seq1,
cutRs_seq1, filtRs_seq1,
#truncLen=c(210,200), #Truncate reads after truncLen bases. Reads shorter than this are discarded.
truncQ = 2,
#minQ = 2,
maxN = 0,
maxEE = c(2,2),
minLen = 50, #Not necessary when using truncLen
rm.phix = TRUE,
compress = TRUE,
multithread = TRUE) # minLen: Remove reads with length less than minLen. minLen is enforced after trimming and truncation. #enforce min length of 50 bp
#Once it is completed there is no need to run again unless you are changing parameters.
#Saving file
#saveRDS(out_seq1, file.path(out_dir, "preprocess/pre_8450/out_8450.rds"))
#Loading file
out_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/out_8450.rds"))
```
### Dereplication
Dereplication combines all identical reads into one unique sequence with
a corresponding abundance equal to the number of reads with that unique
sequence. It is done because it reduces computation time by eliminating
redundancy -- From DADA2 tutorial, v1.8
```{r, dereplication}
#| eval: true
#| echo: false
#| tidy: true
derepFs_seq1 <- derepFastq(filtFs_seq1, n = 1000, verbose = TRUE) #n prevents it from reading more than 1000 reads at the same time. This controls the peak memory requirement so that large fastq files are supported.
derepRs_seq1 <- derepFastq(filtRs_seq1, n = 1000, verbose = TRUE)
#Save file
#saveRDS(derepFs_seq1, file.path(out_dir, "preprocess/pre_8450/derepFs_8450.rds"))
#saveRDS(derepRs_seq1, file.path(out_dir, "preprocess/pre_8450/derepRs_8450.rds"))
#Load file
derepFs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/derepFs_8450.rds"))
derepRs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/derepRs_8450.rds"))
#name the dereplicated reads by the sample names
names(derepFs_seq1) <- sample.namesF
names(derepRs_seq1) <- sample.namesR
```
### Learn error rates from dereplicated reads
```{r, error_rates}
#| eval: true
#| echo: false
#| tidy: true
set.seed(123)
errF_seq1 <- learnErrors(derepFs_seq1, randomize = TRUE, multithread = TRUE) #multithread is set to FALSE in Windows. Unix OS is =TRUE.
errR_seq1 <- learnErrors(derepRs_seq1, randomize = TRUE, multithread = TRUE)
# Save file
#saveRDS(errF_seq1, file.path(out_dir, "preprocess/pre_8450/errF_8450.rds"))
#saveRDS(errR_seq1, file.path(out_dir, "preprocess/pre_8450/errR_8450.rds"))
# Load file
#errF <- readRDS(file.path(out_dir, "preprocess/pre_8450/errF_8450.rds"))
#errR <- readRDS(file.path(out_dir, "preprocess/pre_8450/errR_8450.rds"))
# Plot errors
plotErrors(errF_seq1, nominalQ = TRUE)
plotErrors(errR_seq1, nominalQ = TRUE)
```
### Sample Inference
```{r, inference1}
#| eval: true
#| echo: false
#| tidy: true
dadaFs_seq1 <- dada(derepFs_seq1, err = errF_seq1, multithread = TRUE)
dadaRs_seq1 <- dada(derepRs_seq1, err = errR_seq1, multithread = TRUE)
# Save file
#saveRDS(dadaFs_seq1, file.path(out_dir, "preprocess/pre_8450/dadaFs_8450.rds"))
#saveRDS(dadaRs_seq1, file.path(out_dir, "preprocess/pre_8450/dadaRs_8450.rds"))
# Load file
#dadaFs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/dadaFs_8450.rds"))
#dadaRs_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/dadaRs_8450.rds"))
```
### Merge paired reads
```{r, merge}
#| eval: true
#| echo: false
#| tidy: true
mergers_seq1 <- mergePairs(dadaFs_seq1, filtFs_seq1,
dadaRs_seq1, filtRs_seq1,
minOverlap = 20,
maxMismatch = 0,
verbose = TRUE)
# Save file
#saveRDS(mergers_seq1, file.path(out_dir, "preprocess/pre_8450/mergers_8450.rds"))
# Load file
mergers_seq1 <- readRDS(file.path(out_dir, "preprocess/pre_8450/mergers_8450.rds"))
# Inspect the merger data.frame from the first sample
head(mergers_seq1[[1]])
```
### Construct ASV Sequence Table
We can now construct an amplicon sequence variant table (ASV) table, a
higher-resolution version of the OTU table produced by traditional
methods.
```{r, ASV}
#| eval: true
#| echo: false
#| tidy: true
seqtab_seq1 <- makeSequenceTable(mergers_seq1)
dim(seqtab_seq1)
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab_seq1)))
#Save file R object, and .csv
#saveRDS(seqtab_seq1, file.path(out_dir, "clean_data/ASV_tables/01-ASV_8450_seqtable_raw.rds")) #Functions to write a single R object to a file, and to restore it.
#write.csv(seqtab_seq1, file.path(out_dir, "clean_data/ASV_tables/01-ASV_8450_seqtable_raw.rds"))
#Open from here in case R crashes
seqtab <- readRDS(file.path(out_dir, "clean_data/ASV_tables/01-ASV_8450_seqtable_raw.rds"))
```
### Remove chimeras
The core `dada` method corrects substitution and indel errors, but
chimeras remain.
```{r, chimeras}
#| eval: true
#| echo: false
#| tidy: true
seqtab_seq1_nochim <- removeBimeraDenovo(seqtab_seq1,
method = "consensus",
multithread = FALSE,
verbose = TRUE) #Multithread = FALSE in Windows
#Frequency of chimeras
sum(seqtab_seq1_nochim)/sum(seqtab_seq1)
#Save file
#saveRDS(seqtab_seq1_nochim, file.path(out_dir, "clean_data/ASV_tables/02-ASV_8450_seqtable_nochim_denoise.rds"))
#write.csv(seqtab_seq1_nochim, file.path(out_dir, "clean_data/ASV_tables/02-ASV_8450_seqtable_nochim_denoise.csv")) # Long file name but it indicates this file has gone through all the steps in the pipeline.
seqtab_seq1_nochim <- readRDS(file.path(out_dir,"ASV_tables/02-ASV_8450_seqtable_nochim_denoise.rds"))
```
Inspect distribution of sequence lengths
```{r, eval=TRUE}
table(nchar(getSequences(seqtab_seq1_nochim)))
```
### Track reads through the pipeline
We now inspect the the number of reads that made it through each step in
the pipeline to verify everything worked as expected.
```{r, pipeline_tracking}
#| eval: true
#| echo: false
#| tidy: true
getN_seq1 <- function(x) sum(getUniques(x))
track_seq1 <- cbind(out_seq1,
sapply(dadaFs_seq1, getN_seq1),
sapply(dadaRs_seq1, getN_seq1),
sapply(mergers_seq1, getN_seq1),
rowSums(seqtab_seq1_nochim))
# If processing a single sample, remove the sapply calls: e.g. replace
# sapply(dadaFs, getN) with getN(dadaFs)
colnames(track_seq1) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track_seq1) <- sample.namesF
head(track_seq1)
# Save file
#saveRDS(track_seq1, file.path(out_dir, "preprocess/pre_8450/track_8450.rds"))
# Load file
#track <- readRDS(file.path(out_dir, "preprocess/pre_8450/track_8450.rds"))
```
#### Tracking summary
Doesn't look good. We lose a majority of the reads in the filter and
trim step. This is a large drop-off in reads. Current parameters using
`trunQ = 2` and `maxEE = 2` are more stringent than
`truncLen = c(230, 180)` used on June 2023 preliminary analyses.
::: callout-warning
<font size="4"> [**Note from DADA2 ITS
tutorial**](https://benjjneb.github.io/dada2/ITS_workflow.html).
**Considerations for your own data:** This is a great place to do a last
sanity check. Outside of filtering (depending on how stringent you want
to be) there should no step in which a majority of reads are lost. If a
majority of reads were removed as chimeric, you may need to revisit the
removal of primers, as the ambiguous nucleotides in un-removed primers
interfere with chimera identification. If a majority of reads failed to
merge, the culprit could also be un-removed primers, but could also be
due to biological length variation in the sequenced ITS region that
sometimes extends beyond the total read length resulting in no overlap.
:::
## Taxonomy assignment
Congratulations! You've made it to a checkpoint in the pipeline. If you
have save the ASV tables, specially after removing chimeras, if not go
and do that. This section can take a big toll on your local machine. It
is best to perform in the Van Bael Lab Mac or HPC Cypress.
Download the latest "full" [UNITE
release](https://unite.ut.ee/repository.php). This will serve as your
reference for assigning taxonomy. Use the appropriate data base to
assing taxonomy to your data or project!
```{r, taxonomy}
#| eval: true
#| echo: false
#| tidy: true
unite.ref <- file.path(out_dir, "clean_data/taxonomy/sh_general_release_dynamic_s_25.07.2023.fasta") # CHANGE ME to location on your machine
taxa_seq1 <- assignTaxonomy(seqtab_seq1_nochim,
unite.ref,
multithread = TRUE,
tryRC = TRUE) #Multithread = FALSE in Windows. TRUE in Mac/Linux.
#Loading from the files saved. In case it crashes, we start from here.
#seqtab.nochim2 <- readLines(file.path(out_dir, "output.txt"))
#seqtab.nochim2 <- read.csv(file.path(out_dir, "clean_data/ASV_tables", "/ASV_nochim_denoise_filt.csv"))
#seqtab.matrix <- as.matrix(seqtab.nochim2) #assignTaxonomy needs a vector matrix
## unqs <- lapply(fn, getUniques)
# seqtab <- makeSequenceTable(unqs)
# dim(seqtab)
```
Inspecting the taxonomic assignments:
```{r, taxa_inspection}
#| eval: true
#| echo: false
#| tidy: true
taxa.print_seq1 <- taxa_seq1 # Removing sequence rownames for display only
rownames(taxa.print_seq1) <- NULL
head(taxa.print_seq1)
# Save file
#write.csv(taxa.print_seq1, file.path(out_dir, "clean_data/taxonomy/01-assigned_tax_8450.csv"))
```
Done!
You have successfully taken the sequences through the DADA2 pipeline.
You can do the same for your samples. This was a small number of samples
and your local machine can handle it. When you obtain your sequence
files from the sequencing core it is about 25 Gb of data. It is best to
work on this through the Cypress HPC cluster. Let's move on to that.
# Aponte_8756_23110702
Remember to change the file paths to your files and directories. As
well, as to where `cutadapt` is installed in your local machine or HPC
cluster. Use the appropriate data base to assigning taxonomy to your
data or project! Notice that `multithread = TRUE` here to take advantage
of HPC's capacity to run jobs in parallel.
Due to the low quality of Aponte_8756_23110702 sequences, we will
attempt to merge sequences first then filter out chimeras. This is a
different approach from the Aponte_8450_23052601 sequences. But it is an
attempt to use these sequences in downstream analyses. We have made
several attempts at relaxing the `maxEE` parameters to 5 and 7 for
forward and reverse reads, respectively, but only two samples actually
pass the filter. We will also use a `maxEE` and `trunQ` of 2 for both
forward and reverse reads. This approach would go like this: Derep \>
Merge \> Filter and trim \> Dereplicate \> Learn error rates \> Sample
inference \> Construct ASV Sequence Table \> Remove chimeras \> Track
reads through the pipeline \> Taxonomy assignment
```{r}
#DADA2 pipeline
#Modified by Bolívar Aponte Rolón for bioinformatic analyses of ITS amplicon sequences
#14/june/2023
# Loading packages
# Activate commands as needed.
# DADA2 package and associated
library("usethis") #Doesn't install on the HPC cluster for some reason.
library("devtools") #Doesn't install on the HPC cluster for some reason.
library("Rcpp")
library("dada2")
library("ShortRead")
library("Biostrings") #This will install other packages and dependencies.
### File Paths
#Aponte_8756_23110702
path_2 <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics/Aponte_8756_23110702"
out_dir <- "/home/baponte/Boxx/Dissertation/Mimulus/Data/CH2/bioinformatics"
list.files(path_2)
fnFs_seq2 <- sort(list.files(path_2, pattern = "R1.fastq.gz", full.names = TRUE))
fnRs_seq2 <- sort(list.files(path_2, pattern = "R2.fastq.gz", full.names = TRUE))
# Identifying primers
#CHANGE primers accordingly
FWD<-"CACTCTTTCCCTACACGACGCTCTTCCGATCTCTTGGTCATTTAGAGGAAGTAA"# Forward ITS1f_adapt from IDT
nchar(FWD) #Number of primer nucleotides.
REV<-"GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCTGCGTTCTTCATCGATGC"# Reverse primer ITS2r_adapt from IDT
nchar(REV)
### Verifying the orientation of the primers
allOrients <- function(primer) {# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # The Biostrings works w/ DNAString objects rather than character vectors
orients <- c(Forward = dna,
Complement = Biostrings::complement(dna),
Reverse = Biostrings::reverse(dna),
RevComp = Biostrings::reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
#FWD2 <- FWD.orients[["Complement"]] #Use if you suspect an orientation mix-up.
#REV2 <- REV.orients[["Complement"]]
```
Contrary to the Aponte_8450 sequences, the Aponte_8756 sequencing run
has a lot of N call in the first 20-30 bp. We will trim out these first
bases and then filter out any reads with N calls. This is important
because `dada` does not support N calls. This will potentially remove
some of the primers present. We will remove them in subsequent steps.
```{r}
### Filter Reads for ambiguous bases (N)
fnFs_seq2_filtN <- file.path(path_2, "filtN", basename(fnFs_seq2)) # Put N-filtered forward read files in filtN/ subdirectory
fnRs_seq2_filtN <- file.path(path_2, "filtN", basename(fnRs_seq2)) #Reverse reads
# filterAndTrim(fnFs_seq2, fnFs_seq2_filtN,
# fnRs_seq2, fnRs_seq2_filtN,
# trimLeft = c(20,20), #Trim 20 bases from the 5' end of each read
# maxN = 0, multithread = TRUE) #multithread = TRUE on Mac OS, FALSE in Windows
```
```{r}
### Checking for primer hits
set.seed(123)
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs_seq2_filtN[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs_seq2_filtN[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs_seq2_filtN[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs_seq2_filtN[[1]]))
### Cutadapt: removal of primers
#Once this has been completed there is no need to run again when working on the script
cutadapt <- "/usr/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R