Skip to content
Lucas Czech edited this page Jul 19, 2024 · 7 revisions

We offer several standard and widely used file formats, along with most of their idiosyncratic options, as input to grenedalf commands. For efficiency, grenedalf generally streams through the input files, and runs all computations in one pass through the data. That is, it linearly reads through the files, and only keeps the data in memory that is needed. See Windowing for details on the memory requirements.

This one-pass approach necessitates the data to be sorted by genome coordinates. When working with a single input file, the order of chromosomes does not matter. However, in order to simultaneously stream through multiple files, we also need to know the order of chromosomes, to synchronize the positions. If chromosomes in the data are not sorted lexicographically, a reference genome fasta file, or a dict or fai file needs to be provided to specify the order.

The internal data representation of grenedalf is similar to the sync format (see below), in that we store nucleotide base counts for all four bases ACGT per sample and position in the genome. This allows multiallelic SNPs to be considered, which would not be possible with a simple reference vs alternative (or major vs minor) allele frequency. Thus, any of the file formats below is internally converted to a simple count of nucleotide bases.

SAM/BAM/CRAM

We support all three file formats, sam/bam/cram, with their typical idiosyncratic options, mostly following the approach of samtools. As these formats represent reads mapped against a reference genome, we convert them into our internal data structure of ACGT counts by simply tallying up the bases of each read at each position, for all reads and bases that pass the minimum quality thresholds (if provided).

The advantage of sam/bam/cram files is that the usually contain the most "raw" state of the data after mapping, that comes with the fewest statistical assumptions on allele frequencies or variants. This is opposed to, e.g., VCF, which is often not adequate for pool sequence data. See below for details. Hence, using sam/bam/cram as input to grenedalf is a good place to start. We however recommend to convert those files to sync if multiple analyses are to be run on the same data, as the sync format is significantly faster to work with.

By default, we interpret all reads contained in a file as being one sample, and name that sample using the file base name (without directory and extensions). Hence, a file /path/to/my-data.sam.gz would result in a sample in grenedalf called my-data. If on the other hand the read groups that can be defined in these formats are important, use --sam-split-by-rg to split the reads by read groups, and name the samples according to the read group names.

(m)pileup

For completeness, we support (m)pileup files, although we do not recommend to use them. This is because generally, sam/bam/cram files are use to create (m)pileup files in the first place anyway, at lower storage costs, and with more flexibility in terms of the options. For instance, at the moment, we do not consider the mapping quality in (m)pileup files, meaning that low quality mappings need to be discarded in the sam/bam/cram to (m)pileup conversion step already.

As the format does not offer an internal way to naming samples, we simply use the file base name (without directory and extensions), appended by .1 to .n for all samples contained in the mpileup file. For instance, with /path/to/my-data.mpileup, the samples would be named my-data.1 to my-data.3, for three samples. Use the --rename-samples-file table to rename those if needed for convenience.

sync

This simple format has been suggested by PoPoolation2, see here, as a space-efficient and very fast format for pooled data. It simply contains the nucleotide base counts for each pooled sample at each position in the genome:

  • Column 1: reference chromosome/contig
  • Column 2: position on the reference chromosome/contig
  • Column 3: reference base
  • Columns 4...: base counts for all populations in the form A:T:C:G:N:D, with N being "any" base, and D being deletions.

For example:

Chr1  79  G   0:0:0:15:0:0    0:0:0:38:0:0
Chr1  80  A   12:0:0:0:0:0    38:0:0:0:0:0
Chr1  81  A   14:0:0:0:0:0    43:0:0:0:0:0
Chr1  82  A   14:0:0:0:0:0    42:0:0:0:0:0

Note that the numbers are counts, although PoPoolation2 refers to them as "allele frequencies". Typically, when converting sam/bam/cram to sync, these counts simply represent the number of reads that have that particular base at the position in the genome. This implies that the total read depth per position is simply the sum of the ATCG counts in the sync format.

Similar to the mpileup above, the sync format does not offer an internal way to naming samples. By default, we here again also use the file base name (without directory and extensions), appended by .1 to .n for all samples contained in the sync file. For instance, with /path/to/my-data.sync, the samples would be named my-data.1 to my-data.3, for three samples. Use the --rename-samples-file table to rename those if needed for convenience, or, better, the below extension, to provide proper sample names.

We support two extensions of the format that have not been part of the original specification:

  • Header line in the format #chr pos ref sample1 sample2 ..., tab separated. The original format has no means of naming the samples. Using this format as the first line in the file allows to specify sample names (here, sample1 and sample2). The names of the first three columns (chr, pos, ref) have to be exactly those.
  • Missing data entries in the format .:.:.:.:.:., that is, each count replaced by a dot. The original sync format has no means of distinguishing missing data, and circumvented this by using 0 counts in positions where a sample did not have data, which is indistinguishable from, e.g., a sample where low counts have been filtered out. Note that (as of the time of writing) all values need to be dots, or none, i.e., it is not possible to only specify a particular base count as missing.

When working with grenedalf, we recommend to make use of these extensions, in order to keep the relevant data close to the sample names, and to not lose information on missing data. For this reason, both extensions are by default activated in the sync command that produces these files.

In the current absence of more flexible pool-sequencing-specific file formats, we strongly suggest to convert sam/bam files to sync files when multiple analyses need to be run on pooled data. This is the most compact format, as we do not store every read and every base, but only their counts overall. Hence, this is by far the fastest file format to read with grenedalf, and we additionally support gzipped sync files, to safe storage space.

VCF/BCF

For completeness, we support to read VCF files. This is meant for use cases where a variant caller has been used on pooled populations to create calls for each pool. That is, each sample in the VCF represents a whole pool of individuals. This hence differs from the originally intended use case of the format. As the VCF was developed for individuals, and not for pools, we strongly recommend to not use it for this purpose. Having a VCF of pooled data typically means that some form of variant caller has been used, of which most are not suitable for pool sequencing data in the first place. We generally find that callers that were developed for diploid (or some low ploidy) individuals do not work well for pool data by simply using, e.g., the pool size as the ploidy. The statistical assumptions made by typical variant callers are broken for pooled data, and hence the calls are not necessarily proper. See here for our examination of this.

Still, we offer this input in case some tool has been used that produces reasonable pool sequencing variant calls. When reading VCF, grenedalf treats each column (each sample) of the file as one pool of individuals Then, the AD (alleleic depth) FORMAT field for the reference and alternative allele of each sample is then used to determine the counts of the reference vs alternative(s) for each pool.

Note: In case that some pool-sequencing-specific variant caller outputs a VCF that does not store the relevant data in the AD FORMAT field, please let us know by opening an issue, so that we can include support for that format as well.

Frequency tables

This input file type allows to read from a generic table format, such as csv, with each line containing the data at one position in the genome. We offer a lot of flexibility of types of input data here - as long as some counts or frequencies per sample can be calculated, we can use it. The naming of the columns is also rather flexible, with defaults that should be able to parse most tables with reasonable column names - but which can be overwritten to match other naming conventions if needed.

At minimum, the table needs to contain a column for the chromosome name, and one for the positions on the chromosomes. Additionally, reference and alternative base columns can be provided. Alternatively, the --reference-genome-fasta-file can be used to specify the reference bases (which however leaves the alternative bases unspecified, so we use the transition base of the reference). If neither is given, we use some fixed defaults, which are hence meaningless - but for our typical statistics, the identity of the bases (ACGT) does not matter anyway.

Then, the actual per-sample data columns need to be provided:

  • In the simplest case, the table can provide reference and alternative count columns for each sample, again simply as count data similar to the sync format.
  • Alternatively, a reference count and the read depth of the sample can be provided, in which case the alternative count is their difference, or the other way round (alternative and read depth).
  • Furthermore, instead of counts, a frequency can be provided (either of the reference or alternative allele, see --frequency-table-freq-is-ref), along with a read depth column. This allows us to compute reference and alternative counts again.
  • Lastly, if only a frequency column per sample is provided, that is, in the absence of a read depth column, we instead simply multiply the frequency by a large number (at the time of writing, we use 1,000,000 by default, see --frequency-table-cov-factor) to convert to counts. Obviously, this default read depth is higher than to be expected in any realistic data, but is meant to capture frequencies without losing too much precision due to the conversion to integer counts. It is of course more meaningful to use the actual read depth, if available. Hence, when using this format, it is important to check that the statistic does not depend on read depth - if so, it might be better to set a lower read depth factor that is more realistic. This whole approach is a workaround to use, for instance, HAF-pipe (article, code), see below for details.

All these columns need to be specified by a header line, so that we can interpret the values of each column correctly. The columns need to be separated by the --frequency-table-separator-char, which defaults to comma. Here are some exemplary header lines:

  • chromosome,position,reference,alternative,SampleA-reference-count,SampleA-alternative-count,...
  • chr,pos,ref,alt,sample1.freq,sample1.cov ...

As noted above, by default we parse these header column names and guess their corresponding meaning. See the --frequency-table-... options for details on the default column names that we expect, and use these options to overwrite those if needed for your format.

Note that because this table format only offers ref vs alt counts/frequencies, it can only be used for biallelic SNP data. If more than one alternative needs to be considered, use the sync format instead. Lastly, missing values can be specified via ., na, or nan in the table by default. For a different value to be used as a marker for missing data, see the --frequency-table-missing-value option.

HAF-pipe allele frequencies

HAF-pipe is a tool to obtain haplotype-inferred allele frequencies from pool-seq data, for studies where the founder population is known. This is for instance the case in Evolve-and-Resequence experiments, and gives more reliable frequencies with a higher effective coverage than can be obtained from raw read counts. See the article and the code repository for details. Note the usage of "coverage" there to indicate the (effective) number of reads that cover a particular position, as opposed to the meaning of coverage as the proportion of the genome that is covered by reads.

The basic setup of HAF-pipe unfortunately needs to be run individually per chromosome, and outputs allele frequency tables (afSite) that only contain a position on the chromosome and the allele frequency at this position. In order to use those in grenedalf, we first need to add back the chromosome column. Assuming we have three chromosomes chr1, chr2, and chr3, and three corresponding output files of HAF-pipe sample1.chr1.afSite, sample1.chr2.afSite, and sample1.chr3.afSite, we can add the missing chromosome columns and join all files into a single sample with these shell commands:

# We expect files with names sample1.*.afSite
SAMPLE=sample1

# Initialize the output file with the header
echo "chr,pos,${SAMPLE}.af" > ${SAMPLE}.csv

# Traverse all HAF-pipe allele frequency files
for file in sample1.*.afSite; do
    # Extract the chromosome name from the file name
    chrom=$(echo "$file" | rev | cut -d '.' -f2 | rev)

    # Use awk to prepend the extracted name to every line, skip adding header for each file
    awk -v chrom="$chrom" 'BEGIN {FS=OFS=","} {if(NR==1) next; print chrom,$0}' "$file" >> ${SAMPLE}.csv
done

This needs to be run for every sample. It is in principle also possible to extend the above script to produce one output table for multiple samples combined. However, in that case, one needs to ensure that they all contain the exact same positions, or to take care of writing lines that correspond to the same position of each file. Furthermore, the input sample naming scheme etc will be different for each project. Hence, we do not provide a one-size-fits-all script here.

Alternatively, we have implemented the whole workflow of HAF-pipe, including the mapping to create the per-sample bam files, in our grenepipe pipeline. See here for details on the HAF-pipe setup in grenepipe. This yields per-sample frequency files with all chromosomes, as well as one large table containing all chromosomes for all samples.

These frequency tables can then be used in grenedalf as input for the --frequency-table-path, as explained above. See there for details. Using the column names as in the above script (chr, pos, ${SAMPLE}.af) will allow this to work out of the box. In order to calculate the overall "effective coverage" that the HAF-pipe haplotype-corrected allele frequencies represent, see here. Using this effective coverage as the --frequency-table-depth-factor then allows to transform the frequencies into counts that contain the correct (effective) coverage of the HAF-pipe frequencies.