Filtering
The filter settings are applied in the order described here, which is the same order in which they are listed in the command line options.
First, samples can be renamed, for formats such as sync or (m)pileup that do not store sample names by default. This affects the sample names in the output, such as the column header names of the computed statistics. The names are also used in the sample filters, where certain samples can be excluded if not needed.
Furthermore, we offer an option to group samples by merging their counts. This simply adds up the counts in each group, as if the underlying reads had first been combined into one fastq file before mapping, or into one combined sam/bam file after mapping. Note that this can lead to high read depth values, which might affect some statistics.
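The count merging described above can be sketched as follows. This is an illustrative Python sketch, not the tool's actual implementation; the sample names and group assignments are made up:

```python
def merge_sample_groups(counts, groups):
    """Sum the per-base counts (A, C, G, T order) of all samples in each group."""
    merged = {}
    for sample, base_counts in counts.items():
        group = groups[sample]
        if group not in merged:
            merged[group] = [0, 0, 0, 0]
        # Adding counts element-wise is equivalent to pooling the reads.
        merged[group] = [m + c for m, c in zip(merged[group], base_counts)]
    return merged

# Counts at one position for three hypothetical samples.
counts = {"S1": [10, 0, 2, 0], "S2": [8, 1, 0, 0], "S3": [0, 0, 15, 0]}
groups = {"S1": "pop1", "S2": "pop1", "S3": "pop2"}
print(merge_sample_groups(counts, groups))
# {'pop1': [18, 1, 2, 0], 'pop2': [0, 0, 15, 0]}
```

Note how the merged depth of `pop1` (21) is higher than that of either of its samples, which is the read depth effect mentioned above.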
NB: Our implementation of the sync format does support an annotation for sample names in a header row; see Input formats: sync for details. This is preferred over renaming the samples explicitly, as it keeps all relevant data consistent.
These options allow subsetting the computation and output to certain regions of interest. That is, only the genome positions (loci) within the provided regions are used in the downstream analyses; all other positions are completely removed, as if they were not in the input data.
As of now, these region filters still read through the whole input, for implementation reasons (jumping within files is tricky, and not all formats support it). Hence, when running multiple quick iterations of an analysis, it might be beneficial to subset the data to the region of interest once, by creating a sync file that contains only that region, and then using this file for the analyses. See the sync command for details.
For some algorithms and statistics, a mask file can be provided to determine which positions (loci) to use in the computation. For instance, this can be used to provide the filtering results of an external tool, instead of using the internal filters that we provide here. The mask then indicates positions that are "good" to use for the computation.
In detail, if a mask file is provided, the following happens: All positions in the input are tagged as either being masked out or "good". If the input data does not contain all positions (i.e., it contains gaps along the genome with zero read depth), these gaps are filled in with zero-count positions before the mask is applied. This way, even if the data is incomplete, we force the mask to be considered "ground truth", so that the zero-count positions are still taken into account.
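The gap-filling step before masking can be sketched like this. This is a hypothetical Python illustration of the behavior described above, with made-up data structures (1-based positions mapped to A/C/G/T counts, and a set of masked-out positions):

```python
def apply_mask(positions, masked_out, chrom_length):
    """Fill gaps with zero-count positions, then drop masked-out positions.

    positions: dict of 1-based position -> base counts present in the input.
    masked_out: set of positions tagged as masked out.
    """
    result = {}
    for pos in range(1, chrom_length + 1):
        if pos in masked_out:
            continue
        # Positions missing from the input get zero counts, so the mask
        # acts as ground truth even where the data has no reads.
        result[pos] = positions.get(pos, [0, 0, 0, 0])
    return result

# Input only covers position 2; position 3 is masked out.
print(apply_mask({2: [5, 0, 0, 0]}, {3}, chrom_length=4))
# {1: [0, 0, 0, 0], 2: [5, 0, 0, 0], 4: [0, 0, 0, 0]}
```

Positions 1 and 4 survive the mask with zero counts, and so still enter downstream computations such as denominators.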
If additionally any numerical filters are provided (see below), these are applied as well, on all positions that are not already tagged as masked out. This might not be a typical use case, but is offered here as well for completeness.
Lastly, whatever positions remain are then used for the statistics computation. Note that these are typically not meant to be all SNPs, but rather all "good" positions according to the filtering and masking criteria: both the SNP positions, and those positions where we would have been able to call SNPs if there were any, given that we consider the position to be good according to the filters. The actual SNP detection is then applied as described below. For instance, the denominator of the diversity metrics can then properly normalize the window averages by the number of good positions in the window, instead of just relying on the window size, which might underestimate diversity if not all positions are of high quality.
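The normalization point above can be made concrete with a small sketch. This is an illustration of the principle, not the tool's actual denominator computation, and the per-site values are made up:

```python
def window_average(per_site_values, n_good, window_size):
    """Average a per-site statistic over a window.

    Dividing by the number of "good" positions (n_good) instead of the
    full window size avoids underestimating the statistic when some
    positions were filtered out or masked.
    """
    total = sum(per_site_values)
    naive = total / window_size  # treats filtered positions as if they had value 0
    normalized = total / n_good  # averages over the usable positions only
    return naive, normalized

# A 10 bp window where only 8 positions passed filtering, four of which
# carried a non-zero per-site diversity value:
naive, normalized = window_average([0.2, 0.0, 0.1, 0.3], n_good=8, window_size=10)
print(round(naive, 6), round(normalized, 6))
# 0.06 0.075
```

With only 8 usable positions, dividing by the full window size of 10 deflates the estimate.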
The numerical filters provide a way of removing low quality data and spurious positions. Internally, we use a system similar to the VCF specification, where samples and positions are "tagged" when a filter does not pass. We provide two levels of filters: One level that acts on each sample (each pooled population) individually, and sets a tag for the sample at that position, and one level that acts on the whole data for this position, across all samples.
The filters are applied before any statistics computations, in the order in which they are listed. Any position for a sample or the total that fails due to any of the provided filter settings is tagged as not passing, and not considered in the analyses.
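The two-level tagging described above can be sketched as follows. This mimics the VCF-style idea of tagging failures; the filter names, thresholds, and the depth-only criterion are illustrative assumptions, not the tool's actual filters or defaults:

```python
def tag_position(sample_depths, min_depth=5, min_total_depth=10):
    """Tag each sample at a position, then the position as a whole.

    Returns per-sample tags (first filter level) and a total tag
    (second level, across all samples). "PASS" means not filtered.
    """
    sample_tags = {
        name: "PASS" if depth >= min_depth else "below_min_depth"
        for name, depth in sample_depths.items()
    }
    total_depth = sum(sample_depths.values())
    total_tag = "PASS" if total_depth >= min_total_depth else "below_min_total_depth"
    return sample_tags, total_tag

sample_tags, total_tag = tag_position({"S1": 12, "S2": 3})
print(sample_tags, total_tag)
# {'S1': 'PASS', 'S2': 'below_min_depth'} PASS
```

Here sample S2 is tagged out at this position, while the position as a whole still passes the total-level filter, so S1 can still contribute to the analyses.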
Note that this implies that the filtering might yield different results depending on which samples are provided. For instance, a position might be invariant between two samples, but a third sample then contains a variation, and hence changes how that position is considered. Consequently, for some algorithms, results can differ between running the analyses on individual samples and running them on all samples combined in one run. This is an inherent consequence of what is considered a variant or SNP. In particular, this can change the window positions of the Queue window approach.
Some commands such as fst apply a simple SNP filter, which can be seen as an ad-hoc SNP detection mechanism. After all of the above filtering, the remaining positions are considered to be "good", i.e., they passed all filters, have sufficient depth, minimum counts, etc. Then, we simply check which of the base counts of the four nucleotides (ACGT) are non-zero. If two or more counts are non-zero, we consider the position a SNP; if exactly two are non-zero, it is a biallelic SNP. All other positions (where at most one of the four bases has a non-zero count) are considered invariant. These are however still considered "good" (as they passed all other filters) for the computation of, for instance, the denominator of the diversity estimators.
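The SNP classification above amounts to counting non-zero base counts. A minimal sketch of that rule, assuming counts are given in A/C/G/T order:

```python
def classify_position(base_counts):
    """Classify a position by the number of non-zero base counts (A, C, G, T)."""
    non_zero = sum(1 for count in base_counts if count > 0)
    if non_zero == 2:
        return "biallelic SNP"
    if non_zero > 2:
        return "multiallelic SNP"
    # Zero or one non-zero count: the position is invariant, but still
    # counts as "good" for denominators if it passed the other filters.
    return "invariant"

print(classify_position([10, 5, 0, 0]))  # biallelic SNP
print(classify_position([10, 0, 0, 0]))  # invariant
print(classify_position([10, 5, 1, 0]))  # multiallelic SNP
```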