
Filtering

Lucas Czech edited this page Apr 29, 2024 · 10 revisions

The filter settings are applied in the order described here.

Sample names

First, samples can be renamed for formats such as sync or (m)pileup that do not store sample names by default. The names affect the output, for instance the column headers of the statistics being computed. They are also used by the sample filters, which can exclude samples that are not needed.

Furthermore, we offer an option to group samples by merging their counts. This simply adds up the counts in each group, as if the underlying reads had been combined into one fastq file before mapping, or into one sam/bam file after mapping. Note that this can lead to high read depth values, which might affect some statistics.
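To illustrate what merging counts means, here is a minimal Python sketch. It is not the grenedalf implementation; the function `merge_sample_counts`, the sample names, and the group names are hypothetical and only serve to show the summation at a single position.

```python
from collections import defaultdict

def merge_sample_counts(counts_by_sample, group_of):
    """Sum the per-base counts of all samples assigned to the same group."""
    merged = defaultdict(lambda: {"A": 0, "C": 0, "G": 0, "T": 0})
    for sample, counts in counts_by_sample.items():
        group = group_of[sample]
        for base in "ACGT":
            merged[group][base] += counts.get(base, 0)
    return dict(merged)

# Hypothetical base counts of three samples at one position.
position_counts = {
    "S1": {"A": 10, "C": 0, "G": 2, "T": 0},
    "S2": {"A": 7, "C": 1, "G": 0, "T": 0},
    "S3": {"A": 0, "C": 0, "G": 5, "T": 5},
}
# Hypothetical assignment of samples to groups.
grouping = {"S1": "north", "S2": "north", "S3": "south"}

merged = merge_sample_counts(position_counts, grouping)
north_depth = sum(merged["north"].values())
```

In this example, the group "north" ends up with a read depth of 20, which is the sum of the depths of S1 (12) and S2 (8). This is the effect noted above: merged groups have higher read depth than any of their samples.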

Region filters

These options allow subsetting the computation and output to certain regions of interest. For implementation reasons, we currently still have to read through the whole input (jumping in files is tricky, and not all formats support it). Hence, when testing analyses, it might be beneficial to subset the data once, by creating a sync file that only contains the region of interest, and then using that file for the analyses. See the sync command for details.
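The sync command is the intended way to create such a subset. Purely to illustrate what a region filter does conceptually, the following Python sketch keeps only the lines of a sync file that fall within one region. It assumes the usual tab-separated sync layout (chromosome, position, reference base, then one count column per sample), treats lines starting with `#` as headers to be skipped, and uses a hypothetical helper name and made-up data.

```python
def subset_sync_lines(lines, chrom, start, end):
    """Yield the sync lines on `chrom` with start <= position <= end (1-based, inclusive)."""
    for line in lines:
        # Assumption: blank lines and lines starting with '#' carry no position data.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line

# Made-up sync lines with two samples each.
sync_lines = [
    "2L\t100\tA\t10:0:0:0:0:0\t8:2:0:0:0:0\n",
    "2L\t250\tC\t0:0:12:0:0:0\t0:1:9:0:0:0\n",
    "2R\t120\tG\t0:0:0:7:0:0\t0:0:0:9:0:0\n",
]

subset = list(subset_sync_lines(sync_lines, "2L", 50, 200))
```

Note that every input line is still visited, which mirrors the point made above: the region filter restricts what is processed and written, but the whole input is read.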

Numerical filters

These filters are applied before any statistics are computed, in the order in which they are listed. Any position, for a sample or for the total, that fails any of the provided filter settings is marked as not passing. Some commands such as fst additionally apply a SNP filter, which requires the position to have non-zero counts in two or more bases (ACGT). This can be refined further with the SNP-related filters.
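The logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the grenedalf code: the filter names, the minimum read depth of 4, and the helper functions are hypothetical. The SNP check follows the definition given here, namely non-zero counts in two or more of the bases A, C, G, and T.

```python
def read_depth(counts):
    """Total number of counted bases at a position."""
    return sum(counts[base] for base in "ACGT")

def is_snp(counts):
    """True if two or more of the bases A, C, G, T have a non-zero count."""
    return sum(1 for base in "ACGT" if counts[base] > 0) >= 2

def first_failed_filter(counts, filters):
    """Apply the filters in order; return the name of the first one that fails, or None."""
    for name, passes in filters:
        if not passes(counts):
            return name
    return None

# Hypothetical filter chain: a read depth threshold first, then the SNP check.
filters = [
    ("min_read_depth", lambda c: read_depth(c) >= 4),
    ("snp", is_snp),
]

passing = first_failed_filter({"A": 9, "C": 0, "G": 3, "T": 0}, filters)
invariant = first_failed_filter({"A": 12, "C": 0, "G": 0, "T": 0}, filters)
shallow = first_failed_filter({"A": 1, "C": 1, "G": 0, "T": 0}, filters)
```

Because the filters run in order, the third position is marked as failing the read depth filter and never reaches the SNP check, even though it has two bases with non-zero counts.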
