Skip to content

Windowing

Lucas Czech edited this page Sep 16, 2024 · 16 revisions

Many statistics are typically computed within windows along the genome. We offer several window types, which should be sufficient for any type of analysis.

Unless noted otherwise, all window measurements are measured in positions (i.e., base pairs) along chromosomes. For instance, a window Chr1:123-456 would express the window from positions 123 to 456, both inclusive, and starting to count at 1 for the first base pair, on chromosome Chr1.

Window types

Sliding interval window

This is the classical approach, where each window has a fixed length of genomic coordinates (positions on the genome, in base pairs). The window width and the stride (how far to move with each window) can be configured as needed. This is hence independent of the actual data - missing data, invariant positions, etc, will simply be part of the windows as they come in the data.

When a reference genome fasta, dict, or fai file is given, the chromosome lengths from there will be used, so that intervals at the end of chromosomes, where there might be missing data, are still constructed properly. Without such reference, we can not distinguish missing data from the end of the chromosome, and just stop with the windows whenever the data ends.

Queue window

This is also a sliding window approach, but instead of fixed window lengths along the chromosome, it uses a fixed number of positions in each window. For instance, each window then contains the same number of relevant positions (typically SNPs), independently of how far across these are spread out across the chromosome.

With x marking SNPs, and - marking either invariant or missing positions, and using a size of 2, we would for instance get the following windows:

---x--x---x-xx---x-xx-x--x----x-x-x---x--
[     ][    ][   ][ ][   ][     ][      ]

Each window contains 2 SNPs, as well as the data from the positions in front of each SNP. The last window then also contains the trail of non-SNP data. This way, each input position is used in exactly one window (at least for the default case where the stride is equal to the count).

Note that the selection of which positions are relevant here (the x positions) is done after all numerical filters have been applied. That is, any position that did not pass any of the filters will be considered as not relevant (the - positions). Typical statistics (diversity, FST) will then also add a filter for invariant positions, so that only SNPs remain for these windows. This allows statistics to properly normalize their values per window.

This also means that the window positions along the genome will vary depending on the input. For instance, adding another population sample to the input might introduce new SNP positions that were invariant in the previous samples. In particular when running analyses on multiple samples, it hence makes sense to analyze all of them at once, in order to make sure that the same window boundaries are used for all of them.

Single positions

This simply treats each position of the input as a "window" of size 1. This is hence equivalent to not using windows at all, and instead compute statistics per position. Depending on the filter settings, this might visit every position in the genome, or only those that have data, or only those that also pass all filter settings.

Region windows

This is the most flexible approach, where the user defines the windows via some file, typically containing one window per line (chromosome and start and end position of the window). These regions can be nested or overlapping as needed. With this approach, any a-typical use case that is not covered by the other window approaches can be achieved.

The statistics are then computed for each such defined region, which for instance can be genes, LD-blocks, etc, depending on the analysis needs. This can also be used to define fixed windows of numbers of SNPs, similar to the above queue window, when not all samples are analyzed at the same time. In such cases, adding a new sample might introduce new variant positions, hence changing the boundaries of the queue SNP window. This would make downstream analyses difficult, as (in the extreme case) any pair of samples might need different window boundaries (different intervals) to contain, say, 100 SNPs per window. Using a pre-defined list of windows with this option here can thus help to avoid these issues. To this end, windows can for instance be defined so that they each contain a desired number of SNPs across all samples.

Chromosomes and Genome

These two window types simply treat whole chromosomes, or the whole genome, as large windows. That is, they compute a single value of the statistic either per chromosome, for one value for the whole genome.

Memory requirements

The sliding, queue, and regions window types need to keep data in memory on the order of the window size, to allow for strides smaller than the window size and overlapping regions to work in one pass through the data. On the other hand, the single, chromosomes, and genome window types only keep some small buffer of data in memory for speed, so that we never exceed typically available memory, even when processing whole genomes for many samples.

Window averaging policy

When computing statistics such as diversity metrics or FST in windows, we often want to compute an average across each window. Data might have positions that are missing, have low read depth, fail some other filter, or simply might only consist of the SNP data itself, if some SNP calling was applied before and only those positions are available in the input.

For each of these scenarios, we need a fitting strategies for how to meaningfully compute the per-site average across windows. These are set via --window-average-policy, with the following options:

  • window-length: Simply use the window length as the denominator for the average. This hence does not depend on the data, and might underestimate the metric in regions with low coverage and missing data.
  • available-loci: This uses the count of all positions for which there was data in the input (unless it was completely removed by a region filter, or marked as missing in formats that support this), independent of whether they passed the filters or are SNPs. This can be useful when SNP calling was applied beforehand, so that only those positions are available in the input.
  • valid-loci: This uses the number of valid or "good" loci in the input, i.e., all positions that passed the numerical filters and are thus considered to have sufficient read depth, minimum counts, etc. That is, these positions are of high quality, and include both the SNPs and the invariant positions. In the absence of any particular circumstances, this is the recommended option for input that contains data for all positions (such as sam/bam files). This can also be used in combination with a mask file, in order to specify loci that are to be considered valid, even in the absence of actual data in the input.
  • valid-snps: Only use the number of SNPs as the denominator for the averaging. This will overestimate the average, as it does not take the invariant high quality positions into account. Nonetheless, this can be useful for some types of input data - but it should be kept in mind that the resulting values are an overestimate.
  • sum: This policy does not apply any averaging across the window, and simply yields the sum of the per-site values. This is useful if none of the other available options is fitting, so that the user can compute an average as needed later.
  • provided-loci: Instead of using a denominator for the averaging that is informed by the data, use a provided mask. This is the most flexible option, as it can be used to enforce any window denominator, for instance based on some external quality filtering that has been conducted before. When used, a mask file in BED or FASTA-like format has to be given. For each window, the positions marked in this file are then counted in that window, and this value is used as the denominator for computing the average statistic in the window.

In a sense, the options from top to bottom are decreasing the denominator for the averaging (with the exception of the provided-loci option), and hence tend to go from underestimation to overestimation. Depending on the input data and research question, there should hence be a policy that fits the analysis needs.

The commands that compute window averages, such as diversity and fst require this option to be provided, in order to enforce correct usage. Note that for the numerical policies, there only is a difference if any numerical filters are applied. Without any filters, all variant positions are considered "valid", and so the above policies that take validity into account will yield the same result. The "fixed" window policies (window length, and provided loci list), of course are independent of the numerical filter settings.