Filtering
The filter settings are applied in the order described here.
First, samples can be renamed, for formats such as sync or (m)pileup that do not store sample names by default. This affects the sample names in the output, such as column header names for the statistics being computed. The names are also used in the sample filters, where certain samples can be excluded if not needed.
Furthermore, we offer an option to group samples by merging their counts. This simply adds up the counts in each group, as if the underlying reads were first combined into one fastq file before mapping, or one combined sam/bam file after mapping. Note that this can lead to high read depth values, which might affect some statistics.
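The merging of counts described above can be sketched as follows. This is only an illustration of the idea, not the tool's actual implementation: the function name `merge_sample_counts`, the `(A, C, G, T)` tuple layout, and the requirement that every sample is assigned to a group are assumptions made for this example.

```python
from collections import defaultdict

def merge_sample_counts(counts, sample_to_group):
    """Sum per-sample base counts into per-group base counts.

    counts:          dict mapping sample name -> (A, C, G, T) counts at one position
    sample_to_group: dict mapping sample name -> group name (hypothetical input;
                     every sample is assumed to be assigned to a group)
    """
    merged = defaultdict(lambda: [0, 0, 0, 0])
    for sample, base_counts in counts.items():
        group = sample_to_group[sample]
        # Add this sample's counts to the running totals of its group.
        for i, count in enumerate(base_counts):
            merged[group][i] += count
    return {group: tuple(totals) for group, totals in merged.items()}

counts = {"S1": (10, 0, 5, 0), "S2": (3, 1, 0, 0), "S3": (0, 0, 7, 2)}
groups = {"S1": "G1", "S2": "G1", "S3": "G2"}
print(merge_sample_counts(counts, groups))
# {'G1': (13, 1, 5, 0), 'G2': (0, 0, 7, 2)}
```

The read depth of group G1 at this position is 19, which is larger than that of either S1 (15) or S2 (4) alone. This is the effect the note on high read depth values refers to.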
The region options allow subsetting the computation and output to certain regions of interest. As of now, we still have to read through the whole input, for implementation reasons (jumping in files is tricky, and not all formats support it). Hence, when testing analyses, it might be beneficial to subset the data to a region of interest once, by creating a sync file that only contains that region, and then using this file for the analyses. See the sync command for details.
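To illustrate what subsetting to a region amounts to, here is a minimal sketch that keeps only the lines of a sync file that fall within one region. It is not how the sync command itself works; the function name `subset_sync_lines` and the use of 1-based inclusive coordinates are assumptions for this example. It relies only on the first two tab-separated columns of the sync format being chromosome and position.

```python
def subset_sync_lines(lines, chrom, start, end):
    """Yield the sync lines on `chrom` whose position lies in [start, end].

    Assumes tab-separated lines with chromosome and position in the first
    two columns; coordinates are treated as 1-based and inclusive.
    """
    for line in lines:
        # Skip empty lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line

lines = [
    "2L\t50\tA\t10:0:0:0:0:0\n",
    "2L\t150\tC\t0:0:8:1:0:0\n",
    "2R\t150\tG\t0:0:0:9:0:0\n",
]
for kept in subset_sync_lines(lines, "2L", 100, 200):
    print(kept, end="")
```

Note that the loop still visits every line of the input, which mirrors the limitation mentioned above: the whole file is read even when only a small region is kept.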
These filters are applied before any statistics computations, in the order in which they are listed. Any position for a sample or the total that fails due to any of the provided filter settings is marked as not passing. Some commands such as fst additionally apply a SNP filter, where the position needs to have non-zero counts in two or more bases (ACGT). This can additionally be refined with the SNP-related filters.
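The basic SNP criterion described above, non-zero counts in two or more bases, can be expressed as a small check. This is a sketch under assumptions: the function name `is_snp` and the `(A, C, G, T)` input layout are chosen for illustration, and whether the counts are per sample or summed across samples depends on the command and is not modeled here.

```python
def is_snp(base_counts):
    """Return True if at least two of the four bases have a non-zero count.

    base_counts: iterable of the A, C, G, T counts at one position
                 (hypothetical layout for this example).
    """
    return sum(1 for count in base_counts if count > 0) >= 2

print(is_snp((12, 0, 0, 0)))  # False: only one base observed
print(is_snp((12, 0, 3, 0)))  # True: two bases observed
```

The SNP-related filters mentioned above would tighten this further; they are not part of this sketch.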