Skip to content

Example

Lucas Czech edited this page Jun 6, 2024 · 3 revisions

We here walk through the usage of grenedalf with some exemplary data. See the example directory in the main grenedalf directory for the data.

Data description

As an exemplary data set, we are using samples from our GrENE-net project. In short, GrENE-net is a large distributed Evolve-and-Resequence experiment in Arabidopsis thaliana. The founder generation consists of a common seed mix with accessions of A. thaliana from all over its native range, representing its genetic diversity. The mix was planted in 45 climatically diverse locations, in 12 biological replicate plots per location. Over several years (and hence several generations), during the flowering season, pooled samples of the flowers have been sequenced for each surviving plot at the locations. The scale of the GrENE-net project was what motivated us to develop grenedalf in the first place, as existing tools were simply incapable of processing the number of samples in a reasonable time. This makes GrENE-net an ideal showcase experiment for grenedalf, as we have full genomic information on the founder seed mix, as well as pooled samples from each generation thereafter, along with the pool sizes (number of flowers in each pool).

At the time of writing (06/2024), more than 2,000 pooled samples have been processed, and the manuscript presenting GrENE-net is in preparation. In the likely case we forget to update this page once the manuscript is out, if you are interested in the details, please check out grene-net.org, or do a quick online search to see if the paper is out yet. We however also describe preliminary analyses of the project here:

Monitoring rapid evolution of plant populations at scale with Pool-Sequencing. Lucas Czech, Yunru Peng, Jeffrey P. Spence, Patricia L.M. Lang, Tatiana Bellagio, Julia Hildebrandt, Katrin Fritschi, Rebecca Schwab, Beth A. Rowan, GrENE-net consortium, Detlef Weigel, J.F. Scheepens, François Vasseur, Moises Exposito-Alonso. bioRxiv, 2022.02.02.477408; doi:10.1101/2022.02.02.477408

For this example, we are here picking three of these samples, selected so that we have two similar biological replicates (same generation, same location, different plots), as well as one distinct third sample:

ID Location Site Plot Date Pool size
MLFH570120180919 Utrecht 57 01 2018-09-19 38
MLFH570320181003 Utrecht 57 03 2018-10-03 16
MLFH320120180131 Mallorca 32 01 2018-01-31 28

With this, we expect to see some variation in the pools, and, as these samples are from the first sampled generation of the experiment, maybe even some selection towards locally adapted genotypes.

Data processing

The raw fastq files from the project have been processed with our grenepipe workflow (by now the naming scheme of all our tools is probably pretty obvious, and probably also slightly confusing). The grenepipe workflow offers a variety of standard tools for the common tasks of read trimming, read mapping, and variant calling, as well as a plethora of auxiliary steps, such as damage assessment and quality control. It was mainly written for individual-based data, but can be used for pooled data as well. However, as we describe in our preliminary study, running typical variant calling tools that were written for individual data often produce sub-optimal results for pool sequencing data. This is because Pool-seq breaks the statistical assumptions that are made by these tools to call the variants.

We hence recommend to either use a variant caller that was specifically written for Pool-seq, or to simply base the analyses on the raw mapping results instead. This is by far the simplest solution, and provides all the information that we need. To this end, we used grenepipe to run up to the mapping step, as described here. This results in bam files, which we use as input here.

Evolve-and-Resequence

In Evolve-and-Resequence experiments, such as GrENE-net, the founder generation is usually known. That means we have information on the haloptypes of the founders, and can use this to obtain more reliable allele frequencies of our later generations. This can particularly help in low coverage pool sequencing experiments, and yield accurate allele frequencies even under those conditions. For details, see these two publications:

Accurate Allele Frequencies from Ultra-low Coverage Pool-Seq Samples in Evolve-and-Resequence Experiments.
Tilk S, Bergland A, Goodman A, Schmidt P, Petrov D, Greenblum S.
G3: Genes|Genomes|Genetics, 9(12), 4159 LP - 4168, 2019. doi:10.1534/g3.119.400755

Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data.
Kessner D, Turner T, Novembre J.
Molecular Biology and Evolution 30 (5): 1145–58, 2013. doi:10.1093/molbev/mst016

We used the implementation of the first of these two, called HAF-pipe in order to get more accurate allele frequencies for our data as well. HAF-pipe takes as input the bam files that we produced above, as well as a vcf describing the variation in the founder generation. We used a high-quality imputed vcf of our GrENE-net founders, which however is outside the scope of this example. To run HAF-pipe, we again used grenepipe, as described here.

The outcome of running HAF-pipe is a set of frequency table files. In itself, the tool unfortunately creates a separate file for each chromosome. However, our implementation in grenepipe stitches these back together, and creates a table containing all chromosomes for the samples in one table. This is also the data that we use in the example here, in form of simple csv table files.

The HAF-pipe output tables contain allele frequencies. In order to work with those in grenedalf, we need to convert them to integer counts instead, such as we are getting from counting base pairs at each position in a sam file. By default, if only provided with frequencies and no corresponding read depth or other information, our frequency table input processing simply multiplies the frequencies by 1,000,000 to turn them into counts of sufficient resolution. This is obviously higher than any realistic read depth, and just meant as a default to avoid loss of precision. With HAF-pipe, we can do better though: Using their Effective Coverage Calculator, we can get an estimate of the effective coverage (what we call read depth here) of the tables. For our high-quality GrENE-net data, with a properly imputed founder VCF, and an original read depth in the input bam files of 14, for the first generation of the experiment, we get an effective read depth of 1,200. This will be useful later.

Subsetting

In order to keep the size of this example manageable, we have subset the bam and the csv files to a region of 1 mio base pairs, centered on the Flowering Locus C (FLC) on chromosome 5, which is thought to be an important driver of differentiation seen in our experiment. The locus is at chromosome 5, 3173382-3179448 bp, see here for details on the locus.

We selected the 5:2676415-3676415 region from the bam files as follows:

for FILE in ./*.bam ; do
    NAME="$(basename ${FILE} .bam)"
    samtools index ${FILE}
    samtools view -h ${FILE} 5:2676415-3676415 | gzip > ./${NAME}.sam.gz
done

This uses samtools. We are storing the files as compressed sam files instead of bam, just because those are slightly smaller for our exemplary data. It is recommended to use the -b flag though to produce bam files for the general use case.

Similarly, for the HAF-pipe tables, we used the following command for subsetting:

for FILE in ./*.csv ; do
    NAME="$(basename ${FILE} .csv)"
    awk -F, 'NR == 1 || ($1 == 5 && $2 >= 2676415 && $2 <= 3676415)' ${FILE} | gzip > ./${NAME}.csv.gz
done

These are finally all the files we need to get going with the example. They are located in the example directory in the grenedalf repository.

Preparation

The three sam.gz files contain the mapped reads of the three samples we are interested in. In our case, we do not want to use the sam read groups here, and simply take all reads per file as a single pooled sample. If your data preparation makes used of sam read groups, you can instead use those (via --sam-split-by-rg), and their names will be used for the sample names instead.

Here though, using each sam file as a sample, we do not have sample names given by the format itself. Hence, the file base name (without path and extension) is used by grenedalf as the sample name instead. We can use that to create a table with the pool sizes for each sample. In our experiment, the following number of flowers were samples for each of the three pools:

MLFH570120180919,38
MLFH570320181003,16
MLFH320120180131,28

This file, which is provided in example/pool-sizes.csv, is then needed to tell grenedalf the pool sizes of each sample, so that their sampling noise can be corrected for correctly. This file also works for the HAF-pipe frequency csv tables, as those use the same sample names.

Note that some file formats such as sync or mpileup can contain multiple samples in one file, but by default do not provide samples names. Hence, in order to distinguish the different samples in a file from each other, they are also named using the file base name, appended by .1 to .n for all samples in the file. We hence recommend to either use the extension for sync headers, see here, or to provide a proper renaming of the samples via --rename-samples-file.

Calculate diversity

Finally, we can run grenedalf to compute our statistics.

To compute the diversity metrics, within the example directory, we run:

../bin/grenedalf diversity \
    --sam-path ./*.sam.gz \
    --pool-sizes pool-sizes.csv \
    --window-type genome \
    --window-average-policy valid-loci \
    --filter-sample-min-count 2 \
    --filter-sample-min-read-depth 4 \
    --no-extra-columns

We here provide the sam files via glob pattern matching (using the *) to the command. It is also possible to provide a directory instead, and all files with fitting extensions will be used.

To run on the HAF-pipe tables instead, we simply need to replace the --sam-path with an appropriate --frequency-table-path instead. Note though that because our simple example directory mixes the data files with the pool-sizes.csv table, the latter could be confused by the option to be interpreted as an additional sample table when simply providing the whole directory. To avoid this, we generally do not recommend to mix different types of data or tables in the same directory, or to provide the files via some glob pattern matching instead, such as ./MLFH*.csv.gz to match the exact file names we want.

Furthermore, as mentioned above, we want to use the proper effective coverage (aka read depth) of 1,200 when reading frequencies from HAF-pipe. Hence, we use the following options instead of --sam-path to read our HAF-pipe tables:

--frequency-table-path ./MLFH*.csv.gz
--frequency-table-depth-factor 1200

With this, we obtain a whole-genome estimate of Theta Pi, Theta Watteron, and Tajima's D. Note that because our input only contained a region of chromosome 5, this is not a true whole genome value of course, but just limited to that region. Instead of using --window-type genome to just compute a single value for the whole genome (or rather, whole available input data), we could alternatively have supplied

--window-type regions
--window-region 5:2676415-3676415

to get the same result. That's the region our input has data for, and so computing a value for that region as a whole yields the same result here, when using --window-average-policy valid-loci. Note that different policies for how window averages are computed to obtain per-base diversity estimates can change this, as the full genome of course has a different window size than the limited region. See Window averaging policy for details.

Calculate FST

Next, we want to compute FST. Similar to above, we can run:

../bin/grenedalf fst \
    --sam-path ./*.sam.gz \
    --window-type genome \
    --pool-sizes pool-sizes.csv \
    --method unbiased-nei \
    --window-average-policy valid-loci \
    --no-extra-columns

This gives us all-to-all pairwise whole-genome (or rather for all input data) estimates of FST, i.e., between all pairs of samples. For the special case of all-to-all whole-genome, we produce a simple matrix output in addition to the normal output, which is easier to work with:

fst MLFH320120180131 MLFH570120180919 MLFH570320181003
MLFH320120180131 0 0.106876 0.184315
MLFH570120180919 0.106876 0 0.0729988
MLFH570320181003 0.184315 0.0729988 0

We can see that the differentiation between the two samples from the same location is lower than the other two pairs, as expected.

We can also compute FST in intervals along the genome, by using

--window-type interval
--window-interval-width 100000

instead of the above whole-genome window. With this, we here get 10 windows, which again show that the differentiation is generally lowest for the two samples from the same location.