Overview
Most of the options are listed in the same order in which they are applied. That way, the list of options can be read from top to bottom to see what happens to the flow of data at each point. Here, we walk through the options in that order as well.
The general flow of data is as follows: read the input, convert it to per-position counts, filter, assemble windows, compute statistics, and write the output.
That is, we start with the given input files, see Input. All input is read per position along the genome, and converted into a common data structure, which we named a (potential) Variant. It stores the chromosome and position, as well as the reference base and potentially the alternative base at the position. Then, for each sample (pooled population) of the input, we store the counts of all four nucleotides (ACGT), as well as counts for "any" (N) and "deletion" (D). This structure is very similar to the sync file format, see Input: sync. If multiple files are provided as input, their data at a given position in the genome is simply added as different samples to the same Variant.
In other words, our internal data representation consists of six counts (integer numbers), for ACGTND, as shown above. All input formats are converted to this, for instance by counting the number of bases at the given position. This allows us to extract the relevant information from different types of input, and work on it in a uniform way.
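As an illustration, here is a minimal sketch of such a per-position structure. The names and fields are simplified assumptions for clarity, not the actual grenedalf/genesis classes:

```cpp
// Illustrative sketch of the per-position data structure described above.
// Names and fields are simplified assumptions, not the actual grenedalf code.
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Nucleotide counts of one sample (pooled population) at one position.
struct SampleCounts
{
    uint32_t a_count = 0;  // count of 'A' bases
    uint32_t c_count = 0;  // count of 'C' bases
    uint32_t g_count = 0;  // count of 'G' bases
    uint32_t t_count = 0;  // count of 'T' bases
    uint32_t n_count = 0;  // count of "any" base 'N'
    uint32_t d_count = 0;  // count of deletions 'D'
};

// A (potential) variant: one position in the genome, with the counts of all
// samples at that position. Multiple input files simply contribute
// additional entries to `samples`.
struct Variant
{
    std::string chromosome;
    size_t      position = 0;          // 1-based position on the chromosome
    char        reference_base   = 'N';
    char        alternative_base = 'N';
    std::vector<SampleCounts> samples;
};
```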
All downstream steps work on that count data. In most commands, we first apply Filtering, for instance to subset the analysis to a certain region, or apply some numerical quality filters. See there also for an explanation of how we identify SNPs from these counts. Then, the data stream is potentially assembled into windows along the genome, in which the statistics are computed. See Windowing for details.
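To make the flow concrete, the following sketch shows one way such a stream of Variants could be filtered and assembled into fixed-size windows. It reuses the illustrative structures from above; the filter thresholds and window logic are made-up examples, not the grenedalf implementation:

```cpp
// Illustrative sketch of the downstream flow: filter the per-position Variant
// stream, then assemble it into fixed-size windows along a single chromosome.
// Assumes the Variant / SampleCounts sketch from above; thresholds are examples.
#include <cstddef>
#include <cstdint>
#include <vector>

// Example numerical filter: keep a position only if its total read depth
// (sum of ACGT counts over all samples) is within a given range.
bool passes_depth_filter( Variant const& var, uint32_t min_depth, uint32_t max_depth )
{
    uint32_t depth = 0;
    for( auto const& s : var.samples ) {
        depth += s.a_count + s.c_count + s.g_count + s.t_count;
    }
    return depth >= min_depth && depth <= max_depth;
}

// Assemble filtered positions into non-overlapping windows of fixed width
// (e.g., 1000 bp) along one chromosome; statistics are then computed per window.
std::vector<std::vector<Variant>> make_windows(
    std::vector<Variant> const& stream, size_t window_size
) {
    std::vector<std::vector<Variant>> windows;
    for( auto const& var : stream ) {
        if( ! passes_depth_filter( var, 10, 1000 )) {
            continue;
        }
        // Index of the window that this (1-based) position falls into.
        size_t const index = ( var.position - 1 ) / window_size;
        if( windows.size() <= index ) {
            windows.resize( index + 1 );
        }
        windows[ index ].push_back( var );
    }
    return windows;
}
```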
Finally, the statistics are computed on the windows assembled from the data stream. For details of these, please refer to the individual commands, as well as to our publication and in particular our equations document, where we derive the Pool-seq estimators for Theta Pi, Theta Watterson, Tajima's D, and several flavors of FST. If you are interested in additional statistics, please let us know.
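For orientation only, the classical sample-based forms of these estimators are sketched below; these are not the Pool-seq derivations used by grenedalf, which additionally correct for pool size and read depth (see the equations document):

```latex
% Classical (non-Pool-seq) estimators, for orientation only.
\[
  \hat\theta_W = \frac{S}{a_n},
  \qquad a_n = \sum_{i=1}^{n-1} \frac{1}{i},
\]
\[
  \hat\theta_\pi = \sum_{\text{sites}} \frac{n}{n-1} \Bigl( 1 - \sum_{\text{alleles } i} p_i^2 \Bigr),
  \qquad
  D = \frac{\hat\theta_\pi - \hat\theta_W}{\sqrt{\widehat{\mathrm{Var}}\bigl(\hat\theta_\pi - \hat\theta_W\bigr)}},
\]
% S: number of segregating sites in the window, n: number of sampled sequences,
% p_i: allele frequencies at a site.
```

Here, S is the number of segregating sites in the window, n the number of sampled sequences, and p_i the allele frequencies at a site. The Pool-seq estimators replace these with pool-size and read-depth aware terms, as derived in the equations document.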
The results are then written to files as described in Output. See there also for an explanation of the columns of the output tables. Lastly, we also provide an Example that works through a typical Evolve-and-Resequence Pool-sequencing data set.
We often use the terms "locus" and "position" in the genome interchangeably, and for some formats even refer to this as "coordinates". They are all used identically here, typically in the form of a chromosome name followed by a position (Chr:123) or interval (Chr:123-456). We always measure positions and intervals along the genome in base pairs. Furthermore, we always assume positions on each chromosome to start at 1, unless noted otherwise (which is for instance the case for the BED file format).
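As a small illustration of this notation, the following sketch parses such a region string into a 1-based, inclusive interval. It is an assumed example for clarity, not the parser used by grenedalf:

```cpp
// Illustrative parser for the coordinate notation described above:
// "Chr:123" (single position) or "Chr:123-456" (interval),
// interpreted as 1-based and inclusive. Not the actual grenedalf parser.
#include <cstddef>
#include <stdexcept>
#include <string>

struct GenomeRegion
{
    std::string chromosome;
    size_t start = 0;  // 1-based, inclusive
    size_t end   = 0;  // 1-based, inclusive
};

GenomeRegion parse_region( std::string const& text )
{
    auto const colon = text.find( ':' );
    if( colon == std::string::npos ) {
        throw std::invalid_argument( "Expected format Chr:123 or Chr:123-456" );
    }
    GenomeRegion region;
    region.chromosome = text.substr( 0, colon );
    auto const rest = text.substr( colon + 1 );
    auto const dash = rest.find( '-' );
    if( dash == std::string::npos ) {
        // Single position, e.g. "Chr:123".
        region.start = region.end = std::stoul( rest );
    } else {
        // Interval, e.g. "Chr:123-456".
        region.start = std::stoul( rest.substr( 0, dash ));
        region.end   = std::stoul( rest.substr( dash + 1 ));
    }
    return region;
}
```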
In order to avoid confusion, we have decided to use the term "read depth" throughout to mean the number of reads that are present at a particular position in the genome. Most often in our context, this number is used to indicate the per-position read depth, that is, the number of bases (in ACGT) that are present at that position and hence used to compute the reference and alternative counts/frequencies, as described above.
This is opposed to the ambiguous term "coverage", which is instead used by PoPoolation and npstat for this number. However, that term might lead to confusion about whether it indicates "coverage depth" (what we call "read depth" here) or "coverage breadth" (how many positions of the genome have some coverage). Ideally, we would also want to distinguish a third meaning, the average depth across the genome (e.g., 30X "coverage depth"), from the specific "read depth" at a particular locus. As we only have the latter use case in grenedalf, we have decided to use "read depth" as the term for the per-position number of reads (or bases).
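In terms of the count structure sketched earlier, the per-position read depth of a sample and the resulting reference frequency could be computed along these lines (again an illustrative assumption, not the grenedalf code):

```cpp
// Minimal sketch, assuming the Variant / SampleCounts structures from above:
// the per-position read depth counts only ACGT bases, so 'N' and deletions
// do not contribute, and frequencies are taken relative to that depth.
#include <cstdint>

uint32_t base_count( SampleCounts const& s, char base )
{
    switch( base ) {
        case 'A': return s.a_count;
        case 'C': return s.c_count;
        case 'G': return s.g_count;
        case 'T': return s.t_count;
        default:  return 0;
    }
}

// Read depth of one sample at one position: bases in ACGT only.
uint32_t read_depth( SampleCounts const& s )
{
    return s.a_count + s.c_count + s.g_count + s.t_count;
}

// Frequency of the reference base of the Variant in a given sample.
double reference_frequency( Variant const& var, SampleCounts const& s )
{
    auto const depth = read_depth( s );
    return depth == 0
        ? 0.0
        : static_cast<double>( base_count( s, var.reference_base )) / depth;
}
```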