Filtering
The filter settings are applied in the order described here, which is the same order in which they are listed in the command line options.
First, samples can be renamed, for formats such as `sync` or `(m)pileup` that do not store sample names by default. This affects the sample names in the output, such as the column header names for the statistics being computed. The names are also used in the sample filters, where certain samples can be excluded if not needed.
Furthermore, we offer an option to group samples by merging their counts. This simply adds up the counts in each group, as if the underlying reads had first been combined into one `fastq` file before mapping, or one combined `sam`/`bam` file after mapping. Note that this can lead to high read depth values, which might affect some statistics.
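To illustrate the grouping, here is a minimal Python sketch, with hypothetical sample names and counts, of what summing the counts per group amounts to:

```python
# Per-sample base counts at one position, as (A, C, G, T) tuples.
# Names and numbers are made up for illustration.
samples = {
    "pop1_rep1": (3, 0, 12, 0),
    "pop1_rep2": (5, 1, 9, 0),
    "pop2": (0, 0, 20, 2),
}

# Grouping: both pop1 replicates are merged into one sample.
groups = {"pop1": ["pop1_rep1", "pop1_rep2"], "pop2": ["pop2"]}

grouped = {
    group: tuple(sum(samples[name][i] for name in members) for i in range(4))
    for group, members in groups.items()
}
print(grouped)  # {'pop1': (8, 1, 21, 0), 'pop2': (0, 0, 20, 2)}
```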
NB: Our implementation of the `sync` format does support an annotation for sample names in a header row, see Input: sync for details. This is preferred to renaming the samples explicitly, as it keeps all relevant data consistent.
These options allow subsetting the computation and output to certain regions of interest. This means that only the genome positions (loci) within the provided regions are used in the downstream analyses, while all other positions are completely removed, as if they had not been in the input data.
The region filter option `--filter-region-mask-fasta` and its two settings follow the format described below in the Masking section. We provide this way of specifying a region filter for convenience. Note however that this is conceptually different from using such a mask file as an actual mask: When using it as a region filter (via the `--filter-region-mask-fasta` option), we simply remove all positions that are masked. When using it as an actual mask instead (as described below), the mask also "fills in" any missing positions, and tags each position as masked or not. See below for details.
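A small Python sketch of this conceptual difference, using made-up positions and counts (this is an illustration, not the tool's implementation):

```python
# Hypothetical data: position -> (A, C, G, T) counts. The genome region
# spans positions 1..6, but positions 3, 4, 6 are absent from the input
# (zero read depth).
data = {1: (2, 0, 0, 0), 2: (0, 3, 0, 0), 5: (1, 1, 0, 0)}
masked = {2, 4, 6}  # positions masked out by the FASTA-like file

# As a region filter: masked positions are removed from the stream entirely.
region_filtered = {pos: c for pos, c in data.items() if pos not in masked}
# -> {1: (2, 0, 0, 0), 5: (1, 1, 0, 0)}

# As a mask: missing positions are first filled in with zero counts, and
# every position is kept, tagged with whether it is masked out.
mask_applied = {
    pos: (data.get(pos, (0, 0, 0, 0)), pos in masked)
    for pos in range(1, 7)
}
# -> position 3 appears as ((0, 0, 0, 0), False): zero counts, but valid.
```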
As of now, these region filters still have to read through the whole input, for implementation reasons (jumping within files is tricky, and not all formats support it). Hence, when multiple quick iterations of some analyses are needed, it might be beneficial to subset the data to the region of interest once, by creating a `sync` file that only contains that region, and then using this file for the analyses. See the sync command for details.
For some algorithms and statistics, a mask file can be provided to determine which positions (loci) to use in the computation. For instance, this can be used to provide the filtering results of an external tool, instead of using the internal filters that we provide here. The mask indicates which positions are valid to use for the computation, versus those that are marked as "missing" (or "not valid"). This overrides which positions are considered for the statistics computation: For instance, a position with missing data in the input, but that is not masked out, will still be considered valid for the computation.
As mentioned above, this conceptually differs from a region filter: In the region filters above, we completely remove any filtered positions from the data stream. With a mask however, we simply tag them as being masked or not, so that the downstream algorithms can take this information into account. This makes a difference, for instance, for certain window averaging policies: Positions that are explicitly marked as not masked contribute as "valid" positions to the denominator when computing window averages of estimators.
In detail, if a mask file is provided, the following happens: All positions in the input are tagged as either masked out or valid. If the input data does not contain all positions (i.e., it contains gaps along the genome with zero read depth), these gaps are filled in with zero-count positions before the mask is applied. This way, even if the data is incomplete, the mask is treated as "ground truth", so that the zero-count positions are still taken into account.
If additionally any numerical filters are provided (see below), these are applied as well, to all positions that are not already tagged as masked out. Thus, the additional filters can again filter out positions that have previously been tagged as valid by the mask. This might not be a typical use case, but is offered here for completeness.
Lastly, whatever positions remain are used for the statistics computation. Note that these are typically not meant to be just the SNPs, but rather all valid positions according to the filtering and masking criteria. Typically, those are both the SNP positions, and those positions where we would have been able to call SNPs had there been any, given that we consider the position good according to the filters. The actual SNP detection is then applied as described below. For instance, the denominator computation of the diversity metrics can then properly normalize the window averages by the number of good positions in the window, instead of just relying on the window size, which might underestimate diversity if not all positions contain high quality data.
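As a toy numerical illustration of that denominator choice (all values made up):

```python
# Sketch: averaging a per-position estimator (e.g. a diversity value)
# over a window, normalizing by the number of valid positions rather
# than by the window size.
window_size = 10
valid_positions = 6              # positions that passed all filters/mask
snp_values = [0.02, 0.03, 0.01]  # per-SNP estimator values in the window
                                 # (valid invariant positions contribute 0)

total = sum(snp_values)                  # 0.06
naive_avg = total / window_size          # 0.006, underestimates
corrected_avg = total / valid_positions  # 0.01
```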
We provide two formats to specify mask files:
- As a BED file, where all regions listed in the file are masked out, i.e. tagged as not valid. This is similar to the mask option of smcpp.
- As a FASTA-like file as described below, where individual positions can be masked. This is similar to the mask option of vcftools.
The format for the FASTA-like mask file is identical to the one used by vcftools: The mask file contains a sequence of integer digits (between 0 and 9, both inclusive), one for each position on the chromosomes. The digits specify whether the position should be masked. Any position with a digit above the given min value is tagged as masked out. For example:
```
>1
0000011111222...
>2
2222211111000...
```
Here, the first 5 positions of chromosome 1 are marked as valid (not masked), whereas positions 6 onward are tagged as masked. Similarly, the first 10 positions of chromosome 2 are masked out. Using the minimum threshold and invert options allows further flexibility for the masking.
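A minimal Python sketch of how such a file can be interpreted (the parameter names here are illustrative, not the tool's actual option names):

```python
def parse_fasta_mask(lines, min_value=0, invert=False):
    """Parse a vcftools-style FASTA-like mask into per-chromosome lists
    of booleans, where True means the position is masked out."""
    masks, chrom = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            chrom = line[1:].split()[0]
            masks[chrom] = []
        else:
            for digit in line:
                masked = int(digit) > min_value
                masks[chrom].append(masked != invert)  # != acts as XOR
    return masks

masks = parse_fasta_mask([">1", "00000111112222", ">2", "22222111110000"])
# masks["1"][:5] are False (valid), the rest True (masked out);
# masks["2"][:10] are True (masked out), the rest False.
```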
The numerical filters provide a way of removing low quality data and spurious positions. Internally, we use a system similar to the VCF specification, where samples and positions are "tagged" when a filter does not pass. We provide two levels of filters: One level that acts on each sample (each pooled population) individually, and sets a tag for the sample at that position, and one "total" level that acts on the whole data of the position, across all samples. The per-sample filters are applied first, then the total filters. If none of the samples pass their filters, the whole position is marked as "empty" and treated the same as if a total filter had failed.
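Roughly, the two levels can be sketched as follows; the filter names and thresholds are hypothetical examples, not the tool's actual options:

```python
# One position, two pooled samples with base counts; thresholds are
# hypothetical examples of a per-sample and a total filter.
samples = [
    {"A": 0, "C": 1, "G": 0, "T": 0},   # depth 1
    {"A": 5, "C": 7, "G": 0, "T": 0},   # depth 12
]
MIN_SAMPLE_DEPTH = 4
MIN_TOTAL_DEPTH = 10

# Per-sample level: tag each sample that fails its own filters.
sample_passing = [sum(s.values()) >= MIN_SAMPLE_DEPTH for s in samples]
# -> [False, True]

# Total level: acts on the whole position, across all samples.
position_empty = not any(sample_passing)  # all samples failed
total_depth = sum(
    sum(s.values()) for s, ok in zip(samples, sample_passing) if ok
)
position_passing = not position_empty and total_depth >= MIN_TOTAL_DEPTH
# -> True: the position is used, with the failing sample tagged out.
```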
Certain filters that are needed in the statistical estimators change the site frequency spectrum. For instance, the `--filter-sample-min-count` used to compute the diversity metrics is meant to remove sequencing errors and other erroneous singleton sites. For computing Tajima's D, this has to be set to 2, meaning that single counts (a count of 1 for any of the nucleotides `ACGT`) are removed by setting them to 0 in the data. This singleton removal is a consequence of the derivation of the Pool-seq correction for Tajima's D.
Note that when removing minimum counts that way from a sample, the remaining counts of the sample are kept, as the site might contain otherwise valid information. The effect on the site frequency spectrum is usually not a large concern (except in very low read depth conditions), but is a necessity of the estimators. For instance, a sample with counts `A=1, C=2, G=3, T=4` and a singleton minimum count filter will become `A=0, C=2, G=3, T=4`.
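In code, the effect of such a minimum count filter can be sketched as (a toy example, not the actual implementation):

```python
# A minimum count of 2 zeroes out singletons, while the remaining
# counts of the sample are kept intact.
def apply_min_count(counts, min_count=2):
    return {base: (c if c >= min_count else 0) for base, c in counts.items()}

print(apply_min_count({"A": 1, "C": 2, "G": 3, "T": 4}))
# {'A': 0, 'C': 2, 'G': 3, 'T': 4}
```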
The filters are applied before any statistics computations, in the order in which they are listed. Any position, for a sample or the total, that fails any of the provided filter settings is tagged as not passing, and is not considered in the analyses.
Note that this implies that the filtering might yield different results depending on which samples are provided. For instance, a position might be invariant between two samples, while a third sample contains a variant, which changes how that position is considered. Hence, for some algorithms, results can differ depending on whether analyses are run on individual samples or on all of them combined in one run. This is a necessary consequence of what is considered a variant or SNP. In particular, this can change the window positions of the Queue window approach.
Some commands such as fst apply a simple SNP filter, which can be seen as an ad-hoc SNP detection mechanism. After all of the above filtering, all positions that remain are considered valid, i.e., they passed all filters, have sufficient depth, minimum counts, etc. Then, we simply check which of the base counts of the four nucleotides (`ACGT`) are non-zero.
If two or more counts are non-zero, we consider the position a SNP. If exactly two are non-zero, it is a biallelic SNP. All other positions (where at most one of the four bases has a non-zero count) are considered invariant. These are however still considered valid (as they passed all other filters) for the computation of, for instance, the denominator of the diversity estimators. Additional filters can be specified for this step, such as a minimum allele frequency, or sub-setting to only biallelic SNPs (exactly two counts are non-zero).
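The detection boils down to counting non-zero bases. A short Python sketch (the parameters are illustrative, not the tool's exact options):

```python
# Ad-hoc SNP detection on the total base counts (A, C, G, T) of a position.
def classify(counts, min_allele_freq=0.0, biallelic_only=False):
    total = sum(counts)
    # Keep alleles with non-zero counts, optionally above a frequency cutoff.
    alleles = [c for c in counts if c > 0 and c / total >= min_allele_freq]
    if len(alleles) < 2:
        return "invariant"  # still valid, e.g. for diversity denominators
    if len(alleles) == 2:
        return "biallelic SNP"
    return "excluded" if biallelic_only else "SNP"

print(classify((0, 3, 9, 0)))   # biallelic SNP
print(classify((0, 0, 12, 0)))  # invariant
```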
Note that when a mask file is provided, we still need to apply the SNP detection in order to know which positions to use for the statistics computation, and for the window average denominator. Hence, any position that is not masked, but contains at most one non-zero count in the data, is still considered invariant here. Only positions that are not masked, and that do contain multiple alleles (multiple non-zero counts), are considered SNPs. If additionally any numerical filters are provided, these are also applied beforehand.