Skip to content

Overview

Lucas Czech edited this page May 28, 2024 · 11 revisions

Most of the options are listed in the same order in which they are applied. That way, the list of options can be read top to bottom, to see what is happening to the flow of data at each point. We here walk through the options in that order as well.

Data flow

The general flow of data is as follows:

Data Flow.

That is, we start with the given input files, see Input formats. All input is read per position along the genome, and converted into a common data structure, which we named a (potential) Variant. It stores the chromosome and position, as well as the reference base and potentially the alternative base at the position. Then, for each sample (pooled population) of the input, we store the counts of all four nucleotides (ACGT), as well as counts for "any" (N) and "deletion" (D). This structure is very similar to the sync file format, see Input formats: sync. If multiple files are provided as input, their data at a given position in the genome is simple added as different samples to the same Variant.

In other words, our internal data representation consists of six counts (integer numbers), for ACGTND, as shown above. All input formats are converted to this, for instance by counting the number of bases at the given position.

All downstream steps work on that count data. In most commands, we first apply Filtering, for instance to subset the analysis to a certain region, or apply some numerical quality filters. See there also for an explanation of how we identify SNPs from these counts. Then, the data stream is potentially assembled into windows along the genome, in which the statistics are computed. See Windowing for details.

Terminology

We often use the terms "locus" and "position" in the genome interchangeably, and for some formats even refer to this as "coordinates". They are all used identically here, typically in form of a chromosome name followed by a position (Chr:123) or interval (Chr:123-456). Note that we always assume positions to start at 1, unless noted otherwise (which is for instance the case for the BED file format).

In order to avoid confusion, we have decided to use the term "read depth" throughout to mean the number of reads that are present at a particular position in the genome. This is opposed to the ambiguous term "coverage", which is instead used by PoPoolation and npstat. However, this might lead to confusion about whether this is meant to indicate "coverage depth" (that is, "read depth") or "coverage breadth". Ideally, we would even want to distinguish the average depth (e.g., 30X "coverage depth"), from the specific "read depth" at a particular locus. As we only have the latter use case here, we have hence decided to simply use "read depth" throughout.

Clone this wiki locally