-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Too many "alignment score too low, or score drop too high" during pair-end merge compared with FLASH #526
Comments
There are other more sophisticated rules in the merging algorithm that will discard read pairs with a high fraction of mismatches, but these rules are not controlled by user-facing variables. So, it is currently not possible for end-users to adjust the |
@billzt using your 12S data sample I tried to show the difference between the two merging approaches, and the benefit-risk of each. To do so, I use taxonomic assignment as our reference metric, with the idea that amplicons with higher similarity percentages with reference sequences are less likely to be false-positives. To summarize, The testing dataset initially contains 6,714 pairs. Of these, 5,050 are merged in exactly the same way by both methods (75.2% overlap). See code below for future reference: cd ./issue_526_20230704/
mkdir -p data results
## versions
flash --version # FLASH v1.2.11
vsearch --version # vsearch v2.22.1
## raw data
(cd ./data/
wget \
ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_1.fastq.gz \
ftp.sra.ebi.ac.uk/vol1/fastq/DRR126/DRR126155/DRR126155_2.fastq.gz
)
## run flash
(cd ./data/
flash \
--output-directory=../results/ \
DRR126155_1.fastq.gz \
DRR126155_2.fastq.gz 2>&1 | \
tee ../results/flash.log
)
(cd ./results/
vsearch \
--fastq_filter "out.extendedFrags.fastq" \
--fastq_ascii 33 \
--fasta_width 0 \
--fastaout "DRR126155_flash.fasta"
rm out.*
)
## run vsearch
(cd ./data/
vsearch \
--fastq_mergepairs DRR126155_1.fastq.gz \
--reverse DRR126155_2.fastq.gz \
--fastq_allowmergestagger \
--quiet \
--fasta_width 0 \
--fastaout ../results/DRR126155_vsearch.fasta 2>&1 | \
tee ../results/vsearch.log
)
## dereplication (intra)
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
(( $(grep -c "^>" "${FASTA_V}") == $(grep -v "^>" "${FASTA_V}" | sort -u | wc -l) )) && \
echo "no duplicates in vsearch results"
(( $(grep -c "^>" "${FASTA_F}") == $(grep -v "^>" "${FASTA_F}" | sort -u | wc -l) )) && \
echo "no duplicates in flash results"
)
## common (n = 5050) most entries are common!
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
comm -12 \
<(grep -v "^>" "${FASTA_F}" | sort) \
<(grep -v "^>" "${FASTA_V}" | sort) | \
grep \
--no-group-separator \
-F -f - \
-B 1 "${FASTA_V}" > "DRR126155_common.fasta"
)
## entries specific to vsearch (n = 8)
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
comm -13 \
<(grep -v "^>" "${FASTA_F}" | sort) \
<(grep -v "^>" "${FASTA_V}" | sort) | \
grep \
--no-group-separator \
-F -f - \
-B 1 "${FASTA_V}" > "DRR126155_vsearch_exclusive.fasta"
)
## entries specific to flash (n = 1281)
(cd ./results/
FASTA_V="DRR126155_vsearch.fasta"
FASTA_F="DRR126155_flash.fasta"
comm -23 \
<(grep -v "^>" "${FASTA_F}" | sort) \
<(grep -v "^>" "${FASTA_V}" | sort) | \
grep \
--no-group-separator \
-F -f - -B 1 "${FASTA_F}" > "DRR126155_flash_exclusive.fasta"
)
## taxonomic assignments with a 12S reference database from:
## https://github.com/zjgold/FishCARD
function taxonomic_assignment() {
local -ir THREADS=4
local -ir IDDEF=2
local -r IDENTITY="0.5"
local -ir MAXREJECTS=32
local -r SUBJECTS="../data/GenBank_.fasta"
local ASSIGNMENTS="${1/\.fasta/.hits}"
vsearch \
--usearch_global "${1}" \
--threads "${THREADS}" \
--dbmask none \
--qmask none \
--rowlen 0 \
--notrunclabels \
--userfields query+id${IDDEF}+target \
--maxaccepts 0 \
--maxrejects "${MAXREJECTS}" \
--top_hits_only \
--output_no_hits \
--db "${SUBJECTS}" \
--id "${IDENTITY}" \
--iddef "${IDDEF}" \
--userout "${ASSIGNMENTS}"
}
(cd ./results/
taxonomic_assignment "DRR126155_common.fasta"
taxonomic_assignment "DRR126155_vsearch_exclusive.fasta"
taxonomic_assignment "DRR126155_flash_exclusive.fasta"
) library(tidyverse)
setwd("issue_526_20230704/results/")
colnames <- c("amplicon", "similarity", "reference")
sources <- c("common", "flash_exclusive", "vsearch_exclusive")
sources %>%
purrr::map_chr(~ str_c("DRR126155_", .x, ".hits")) %>%
purrr::map_dfr(~ read_tsv(.x,
col_names = colnames,
show_col_types = FALSE),
.id = "source") %>%
select(-reference) %>%
distinct() -> d
d %>%
ggplot(aes(x = source, y = similarity)) +
geom_violin(scale = "count") +
coord_cartesian(ylim = c(95, 100))
ks.test(d %>% filter(source == 2) %>% pull(similarity),
d %>% filter(source == 1) %>% pull(similarity),
alternative = "two.sided",
exact = NULL)
ks.test(d %>% filter(source == 2) %>% pull(similarity),
d %>% filter(source == 1) %>% pull(similarity),
alternative = "great",
exact = NULL)
## D^+ = 0.35122, p-value < 2.2e-16
## alternative hypothesis: the CDF of x lies above that of y |
Thank you |
Thanks for explaining and testing, @frederic-mahe ! |
Testing data: https://www.ncbi.nlm.nih.gov/sra/?term=DRR126155 6714 pairs
FLASH (v1.2.7), using all default parameter, i.e.
The result is:
The merging rate is very high.
vsearch (v2.22.1_linux_x86_64), command is
That is, making maximum percentage diff. bases in overlap = 25%, equal with FLASH
The result is:
The percentage of not merging is significantly higher than FLASH, due to many reads regarded as "alignment score too low, or score drop too high"
How can I adjust the parameters to get similar results as FLASH?
The text was updated successfully, but these errors were encountered: