Overview
Most of the options are listed in the order in which they are applied, so the list can be read from top to bottom to see what happens to the data at each point. We walk through the options in that order here as well.
The general flow of data is as follows:
That is, we start with the given input files, see Input. All input is read per position along the genome, and converted into a common data structure, which we named a (potential) Variant. It stores the chromosome and position, as well as the reference base and potentially the alternative base at the position. Then, for each sample (pooled population) of the input, we store the counts of all four nucleotides (ACGT), as well as counts for "any" (N) and "deletion" (D). This structure is very similar to the sync file format, see Input: sync. If multiple files are provided as input, their data at a given position in the genome is simply added as different samples to the same Variant.
In other words, our internal data representation consists of six counts (integer numbers), for ACGTND, as shown above. All input formats are converted to this, for instance by counting the number of bases at the given position.
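To make the data structure more concrete, here is a minimal Python sketch of a Variant with six counts per sample. It is an illustration only, not the actual implementation: the class names, field names, and the helper functions `count_bases` and `merge_inputs` are hypothetical, and treating every character other than A, C, G, T, and D as "any" (N) is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SampleCounts:
    """Six integer counts for one sample (pooled population) at one position."""
    a: int = 0
    c: int = 0
    g: int = 0
    t: int = 0
    n: int = 0  # "any"
    d: int = 0  # "deletion"


@dataclass
class Variant:
    """One position along the genome, with one SampleCounts entry per sample."""
    chromosome: str
    position: int  # 1-based
    reference_base: str = "N"
    alternative_base: Optional[str] = None
    samples: List[SampleCounts] = field(default_factory=list)


def count_bases(bases: str) -> SampleCounts:
    """Tally a string of observed bases (hypothetical input) into six counts."""
    counts = SampleCounts()
    for base in bases.upper():
        # Assumption for this sketch: anything that is not A, C, G, T, or D counts as N.
        attr = base.lower() if base in "ACGTD" else "n"
        setattr(counts, attr, getattr(counts, attr) + 1)
    return counts


def merge_inputs(per_file: List[Variant]) -> Variant:
    """Combine the Variants of several input files at one position.

    The samples of each file are appended as additional samples of the same Variant.
    """
    first = per_file[0]
    merged = Variant(
        first.chromosome, first.position,
        first.reference_base, first.alternative_base,
    )
    for variant in per_file:
        if (variant.chromosome, variant.position) != (first.chromosome, first.position):
            raise ValueError("All Variants must refer to the same position")
        merged.samples.extend(variant.samples)
    return merged
```

For example, merging two single-sample inputs at the same position yields one Variant with two samples, each keeping its own six counts.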
All downstream steps work on that count data. In most commands, we first apply Filtering, for instance to subset the analysis to a certain region, or apply some numerical quality filters. See there also for an explanation of how we identify SNPs from these counts. Then, the data stream is potentially assembled into windows along the genome, in which the statistics are computed. See Windowing for details.
Most commands offer options to specify the output file paths and names, namely --out-dir, --file-prefix, --file-suffix, and --compress. Generally, the command name (such as fst) is used as the base name for the output files, which are then stored at
<out-dir>/<file-prefix><base-name><file-suffix>.<ext>[.gz]
using the correct extension in each case, as well as .gz if the output is compressed. The prefix and suffix hence make it possible to distinguish the output files for different input files.
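The naming scheme can be sketched in a few lines of Python. This only mirrors the pattern described above and is not the tool's own code; the function `output_path` and its parameter names are hypothetical, and the `csv` extension in the usage example is an assumed placeholder.

```python
from pathlib import Path


def output_path(out_dir: str, base_name: str, ext: str,
                file_prefix: str = "", file_suffix: str = "",
                compress: bool = False) -> Path:
    """Build <out-dir>/<file-prefix><base-name><file-suffix>.<ext>[.gz]."""
    name = f"{file_prefix}{base_name}{file_suffix}.{ext}"
    if compress:
        # Compressed output gets an additional .gz extension.
        name += ".gz"
    return Path(out_dir) / name


# Two runs of the same command on different inputs, told apart by the prefix.
print(output_path("results", "fst", "csv", file_prefix="run1-"))
print(output_path("results", "fst", "csv", file_prefix="run2-", compress=True))
```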
We often use the terms "locus" and "position" in the genome interchangeably, and for some formats even refer to this as "coordinates". They are all used identically here, typically in the form of a chromosome name followed by a position (Chr:123) or interval (Chr:123-456). We always measure positions and intervals along the genome in base pairs. Note that we always assume positions to start at 1, unless noted otherwise (which is for instance the case for the BED file format).
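As an illustration of these conventions, the following Python sketch parses the two notations and converts a BED interval to 1-based coordinates. The names `Region`, `parse_locus`, and `from_bed` are hypothetical, and treating intervals as inclusive on both ends is an assumption of this sketch. The BED conversion relies on BED intervals being 0-based and half-open.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive (assumption of this sketch)


# Matches "Chr:123" as well as "Chr:123-456".
_LOCUS = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)(?:-(?P<end>\d+))?$")


def parse_locus(text: str) -> Region:
    """Parse a position (Chr:123) or an interval (Chr:123-456)."""
    match = _LOCUS.match(text.strip())
    if match is None:
        raise ValueError(f"Not a valid locus: {text!r}")
    start = int(match["start"])
    # A single position is treated as an interval of length one.
    end = int(match["end"]) if match["end"] is not None else start
    if start < 1 or end < start:
        raise ValueError(f"Positions start at 1, and end must not precede start: {text!r}")
    return Region(match["chrom"], start, end)


def from_bed(chromosome: str, bed_start: int, bed_end: int) -> Region:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive coordinates."""
    return Region(chromosome, bed_start + 1, bed_end)
```

With these definitions, the BED interval with start 122 and end 456 on Chr describes the same bases as Chr:123-456.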
In order to avoid confusion, we use the term "read depth" throughout to mean the number of reads that are present at a particular position in the genome. This is opposed to the term "coverage", which is used instead by PoPoolation and npstat. That term is ambiguous, however, as it can mean either "coverage depth" (that is, "read depth") or "coverage breadth". Ideally, we would even want to distinguish the average depth (e.g., 30X "coverage depth") from the specific "read depth" at a particular locus. As we only have the latter use case here, we simply use "read depth" throughout.