-
Notifications
You must be signed in to change notification settings - Fork 3
Subcommand: diversity
Synopsis: Compute pool-sequencing corrected diversity measures Theta Pi, Theta Watterson, and Tajima's D. Usage: Required options:
Documentation for grenedalf v0.6.2 |
Table of contents: |
Compute Pool-seq corrected estimators of nucleotide diversity. In short, we correct for the two sources of noise in pool sequencing: First, when selecting individuals from a population, we have sampling bias due to the finite number of individuals in the pool; second, when selecting reads in the sequencing, a second level of sampling bias is introduced due to the finite coverage per position.
We compute three common estimators that correct for these noises:
- Theta Pi
- Theta Watterson
- Tajima's D
For details on the statistics and the derivation of the estimators, please refer to our publication and in particular our equations document.
As described there, we recommend to be careful when numerically interpreting Tajima's D for Pool seq data, as the estimator is not well suited to work well with the noises introduced by the nested sampling process.
The estimators use the following filter settings internally as well as part of their corrections: --filter-sample-min-count
, --filter-sample-min-read-depth
, and --filter-sample-max-read-depth
. See the equations document above for details on how these parameters influence the estimators. Note that the total
filters are applied for filtering here as well (as far as provided), but have no influence on the estimators themselves. We offer them here for convenience only.
The --filter-sample-min-count
has to be set to exactly 2 when computing Tajima's D. This is a consequence of the equations, which only work out when exactly filtering out singleton counts, according to Kofler et al. Furthermore, the correction terms for Theta Pi and Theta Watterson make use of the --filter-sample-min-read-depth
, by computing a sum over values in that range. Due to the exact way that this correction works, it is recommended to use a minimum read depth of at least twice the minimum per-base count:
--filter-sample-min-count 2
--filter-sample-min-read-depth 4
Lastly, the values per window are computed using the window averaging as described here, in order to scale the results per base pair. See Output for a description of the columns of the produced table.
--sam-path
-
TEXT:PATH(existing)=[] ...
List of sam/bam/cram files or directories to process. For directories, only files with the extension.sam[.gz]|.bam|.cram
are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times. --sam-min-map-qual
-
UINT:UINT in [0 - 90]=0 Needs: --sam-path
Minimum phred-scaled mapping quality score [0-90] for a read in sam/bam/cram files to be considered. Any read that is below the given value of mapping quality will be completely discarded, and its bases not taken into account. Default is 0, meaning no filtering by base quality. --sam-min-base-qual
-
UINT:UINT in [0 - 90]=0 Needs: --sam-path
Minimum phred-scaled quality score [0-90] for a base in sam/bam/cram files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by base quality. --sam-split-by-rg
-
FLAG Needs: --sam-path
Instead of considering the whole sam/bam/cram file as one large colletion of reads, use the@RG
read group tag to split reads. Each read group is then considered a sample. Reads with an invalid (not in the header) read group tag or without a tag are ignored. --sam-flags-include-all
-
TEXT Needs: --sam-path
Only use reads with all bits in the given value present in the FLAG field of the read. This is equivalent to the-f
/--require-flags
setting insamtools view
, and uses the same flag names and their corresponding binary values. The value can be specified in hex by beginning with0x
(i.e.,/^0x[0-9A-F]+/
), in octal by beginning with0
(i.e.,/^0[0-7]+/
), as a decimal number not beginning with '0', or as a comma-, plus-, space-, or vertiacal-bar-separated list of flag names as specified by samtools. We are more lenient in parsing flag names thansamtools
, and allow different capitalization and delimiteres such as dashes and underscores in the flag names as well. --sam-flags-include-any
-
TEXT Needs: --sam-path
Only use reads with any bits set in the given value present in the FLAG field of the read. This is equivalent to the--rf
/--incl-flags
/--include-flags
setting insamtools view
. See--sam-flags-include-all
above for how to specify the value. --sam-flags-exclude-all
-
TEXT Needs: --sam-path
Do not use reads with all bits set in the given value present in the FLAG field of the read. This is equivalent to the-G
setting insamtools view
. See--sam-flags-include-all
above for how to specify the value. --sam-flags-exclude-any
-
TEXT Needs: --sam-path
Do not use reads with any bits set in the given value present in the FLAG field of the read. This is equivalent to the-F
/--excl-flags
/--exclude-flags
setting insamtools view
. See--sam-flags-include-all
above for how to specify the value.
--pileup-path
-
TEXT:PATH(existing)=[] ...
List of (m)pileup files or directories to process. For directories, only files with the extension.(plp|mplp|pileup|mpileup)[.gz]
are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times. --pileup-min-base-qual
-
UINT:UINT in [0 - 90]=0 Needs: --pileup-path
Minimum phred quality score [0-90] for a base in (m)pileup files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by phred quality score. --pileup-quality-encoding
-
TEXT:{sanger,illumina-1.3,illumina-1.5,illumina-1.8,solexa}=sanger Needs: --pileup-path
Encoding of the quality scores of the bases in (m)pileup files, when using--pileup-min-base-qual
. Default is"sanger"
, which seems to be the most common these days. Both"sanger"
and"illumina-1.8"
are identical and use an ASCII offset of 33, while"illumina-1.3"
and"illumina-1.5"
are identical with an ASCII offset of 64 (we provide different names for completeness). Lastly,"solexa"
has an offset of 64, but uses a different equation (not phred score) for the encoding.
--sync-path
-
TEXT:PATH(existing)=[] ...
List of sync (as specified by PoPoolation2) files or directories to process. For directories, only files with the extension.sync[.gz]
are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.
--vcf-path
-
TEXT:PATH(existing)=[] ...
List of vcf/bcf files or directories to process. For directories, only files with the extension.vcf[.gz]|.bcf
are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times. This expects that the input file has the per-sample VCF FORMAT fieldAD
(alleleic depth) given, containing the counts of the reference and alternative base. This assumes that the data that was used to create the VCF file was actually a pool of individuals (e.g., from pool sequencing) for each sample (column) of the VCF file. We then interpret theAD
field as the allele counts of each pool of individuals. Note that only SNP positions are used; positions that contain indels and other non-SNP variants are skipped.
--frequency-table-path
-
TEXT:PATH(existing)=[] ...
List of frequency table files or directories to process. For directories, only files with the extension.(csv|tsv)[.gz]
are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times. --frequency-table-separator-char
-
TEXT:{comma,tab,space,semicolon}=comma Needs: --frequency-table-path
Separator char between fields of the frequency table input. --frequency-table-missing-value
-
TEXT Needs: --frequency-table-path
Marker for denoting missing values in the table. By default, we use.
,nan
, andna
. --frequency-table-depth-factor
-
FLOAT:POSITIVE=0 Needs: --frequency-table-path
For frequency table input that only contains allele frequencies, without any information on read depth, we need to transform those frequencies into counts for our internal processing. This number is multiplied by the frequency to obtain these pseudo-counts. By default, we use 1000000, to get a reasonable interger approximation of the floating point frequency. This is of course above any typical read depth, but allows for more accurate counts when using for instance haplotype-corrected frequencies such as those from HAF-pipe. --frequency-table-freq-is-ref
-
FLAG Needs: --frequency-table-path
For frequency table input that contains allele frequencies, we need to decide whether those frequencies represent the reference or the alternative allele. By default, we assume the latter, i.e., values are interpreted as alternative allele frequencies. Use this flag to instead interpret them as reference allele frequencies. --frequency-table-chr-column
-
TEXT Needs: --frequency-table-path
Specify the name of the chromosome column in the header, case sensitive. By default, we look for columns named "chromosome", "chrom", "chr", or "contig", case insensitive. --frequency-table-pos-column
-
TEXT Needs: --frequency-table-path
Specify the name of the position column in the header, case sensitive. By default, we look for columns named "position" or "pos", case insensitive. --frequency-table-ref-base-column
-
TEXT Needs: --frequency-table-path
Specify the name of the reference base column in the header, case sensitive. By default, we look for columns named "reference", "referencebase", "ref", or "refbase", case insensitive, and ignoring any extra punctuation marks. --frequency-table-alt-base-column
-
TEXT Needs: --frequency-table-path
Specify the name of the alternative base column in the header, case sensitive. By default, we look for columns named "alternative", "alternativebase", "alt", or "altbase", case insensitive, and ignoring any extra punctuation marks. --frequency-table-sample-ref-count-column
-
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample reference count columns in the header, case sensitive. By default, we look leniently for column names that combine any of "reference", "referencebase", "ref", or "refbase" with any of "counts", "count", "cnt", or "ct", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.ref_cnt" indicates the reference count column for sample "S1". --frequency-table-sample-alt-count-column
-
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample alternative count columns in the header, case sensitive. By default, we look leniently for column names that combine any of "alternative", "alternativebase", "alt", or "altbase" with any of "counts", "count", "cnt", or "ct", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.alt_cnt" indicates the alternative count column for sample "S1". --frequency-table-sample-freq-column
-
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample frequency columns in the header, case sensitive. By default, we look for column names having "frequency", "freq", "maf", "af", or "allelefrequency", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.freq" indicates the frequency column for sample "S1". Note that when the input data contains frequencies, but no reference or alternative base columns, such as HAF-pipe output tables, we cannot know the bases, and will hence guess. To properly set the reference bases, consider providing the--reference-genome-fasta
option. --frequency-table-sample-depth-column
-
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample read depth columns in the header, case sensitive. By default, we look for column names having "readdepth", "depth", "coverage", "cov", or "ad", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.read-depth" indicates the read depth column for sample "S1".
--multi-file-locus-set
-
TEXT:{union,intersection}=union
When multiple input files are provided, select whether the union of all their loci is used (outer join), or their intersection (inner join). For their union, input files that do not have data at a particular locus are considered as missing at that locus. Note that we allow to use multiple input files even with different file types. --make-gapless
-
FLAG
By default, we only operate on the positions for which there is data. In particular, positions that are absent in the input are completely ignored; they do not even show up in themissing
column of output tables. This is because for the statistics, data being absend or (marked as) missing is merely a sementic distinction, but it does not change the results. However, it might make processing with downstream tools easier if the output contains all positions, for instance when usingsingle
windows. With this option, all absent positions are filled in as missing data, so that they show up in themissing
column and as entries in single windows. If a referene genome or dictionary is given, this might also include positions beyond where there is input data, up until the length of each chromosome. Note that this can lead to large ouput tables when processing single positions. --reference-genome-fasta
-
TEXT:FILE Excludes: --reference-genome-dict --reference-genome-fai
Provide a reference genome in.fasta[.gz]
format. This allows to correctly assign the reference bases in file formats that do not store them, and serves as an integrity check in those that do. It further is used as a sequence dictionary to determine the chromosome order and length, on behalf of a dict or fai file. --reference-genome-dict
-
TEXT:FILE Excludes: --reference-genome-fasta --reference-genome-fai
Provide a reference genome sequence dictionary in.dict
format. It is used to determine the chromosome order and length, without having to provide the full reference genome. --reference-genome-fai
-
TEXT:FILE Excludes: --reference-genome-fasta --reference-genome-dict
Provide a reference genome sequence dictionary in.fai
format. It is used to determine the chromosome order and length, without having to provide the full reference genome.
--rename-samples-list
-
TEXT:FILE
Allows to rename samples, by providing a file that lists the old and new sample names, one per line, separating old and new names by a tab.
By default, we use sample names as provided in the input files. Some file types however do not contain sample names, such as (m)pileup or sync files (unless the non-standard sync header line is provided). For such file types, sample names are automatically assigned by using their input file base name (without path and extension), followed by a dot and numbers 1..n for all samples in that file. For instance, samples in/path/to/sample.sync
are namedsample.1
,sample.2
, etc.
Using this option, those names can be renamed as needed. Use verbose output (--verbose
) to show a list of all sample names. We then use these names in the output as well as in the--filter-samples-include
and--filter-samples-exclude
options. --filter-samples-include
-
TEXT Excludes: --filter-samples-exclude
Sample names to include (all other samples are excluded); either (1) a comma- or tab-separated list given on the command line (in a typical shell, this list has to be enclosed in quotation marks), or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option is applied after potentially renaming the samples with--rename-samples-list
. --filter-samples-exclude
-
TEXT Excludes: --filter-samples-include
Sample names to exclude (all other samples are included); either (1) a comma- or tab-separated list given on the command line (in a typical shell, this list has to be enclosed in quotation marks), or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option is applied after potentially renaming the samples with--rename-samples-list
. --sample-group-merge-table
-
TEXT:FILE
When the input contains multiple samples (either within a single input file, or by providing multiple input files), these can be merged into new samples, by summing up their nucleotide base counts at each position. This has essentially the same effect as having merged the raw fastq files or the mapped sam/bam files of the samples, that is, all reads from those samples are treated as if they were a single sample. For this grouping, the option takes a simple table file (comma- or tab-separated), with the sample names (after the above renaming, if provided) in the first column, and their assigned group names in the second column. All samples in the same group are then merged into a grouped sample, and the group names are used as the new sample names for the output. Note that the--pool-sizes
option then need to contain the summed up pool sizes for each group, using the group names.
--filter-region
-
TEXT=[] ...
Genomic region to filter for, in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", or "chr:start..end". Positions are 1-based and inclusive (closed intervals). The filter keeps all listed positions, and removes all that are not listed. Multiple region options can be provided, see also--filter-region-set
. --filter-region-list
-
TEXT:FILE=[] ...
Genomic regions to filter for, as a file with one region per line, either in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", "chr:start..end", or tab- or space-delimited "chr position" or "chr start end". Positions are 1-based and inclusive (closed intervals). The filter keeps all listed positions, and removes all that are not listed. Multiple region options can be provided, see also--filter-region-set
. --filter-region-bed
-
TEXT:FILE=[] ...
Genomic regions to filter for, as a BED file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open[)
interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work. The filter keeps all listed positions, and removes all that are not listed. --filter-region-gff
-
TEXT:FILE=[] ...
Genomic regions to filter for, as a GFF2/GFF3/GTF file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. The filter keeps all listed positions, and removes all that are not listed. --filter-region-map-bim
-
TEXT:FILE=[] ...
Genomic positions to filter for, as a MAP or BIM file as used in PLINK. This only uses the chromosome and coordinate per line, and ignores everything else in the file. The filter keeps all listed positions, and removes all that are not listed. --filter-region-vcf
-
TEXT:FILE=[] ...
Genomic positions to filter for, as a VCF/BCF file (such as a known-variants file). This only uses the chromosome and position per line, and ignores everything else in the file. The filter keeps all listed positions, and removes all that are not listed. --filter-region-fasta
-
TEXT:FILE=[] ...
Genomic positions to filter for, as a FASTA-like mask file (such as used by vcftools). The file contains a sequence of integer digits[0-9]
, one for each position on the chromosomes, which specify if the position should be filtered out or not. Any positions with digits above the--filter-region-fasta-min
value are removed. Note that this conceptually differs from a mask file, and merely uses the same format. --filter-region-fasta-min
-
UINT:INT in [0 - 9]=0 Needs: --filter-region-fasta
When using--filter-region-mask-fasta
, set the cutoff threshold for the filtered digits. Only positions with that value or lower will be kept. The default is 0, meaning that all positions with digits greater than 0 will be removed. --filter-region-fasta-invert
-
:INT in [0 - 9] Needs: --filter-region-fasta
When using--filter-region-mask-fasta
, invert the mask. This option has the same effect as the equivalent in vcftools, but instead of specifying the file, this here is a flag. When it is set, the mask specified above is inverted. --filter-region-set
-
TEXT:{union,intersection}=union
It is possible to provide multiple of the above region filter options, even of different types. In that case, decide on how to combine the loci of these filters.
--filter-mask-samples-bed-list
-
TEXT:FILE Excludes: --filter-mask-samples-fasta-list
For each sample, genomic positions to mask (mark as missing), as a set of BED files.
See the below--filter-mask-total-bed
for details. Here, individual BED files can be provided for each sample, for fine-grained control over the masking. The option takes a path to a file that contains a comma- or tab-separated list of sample names and BED file paths, with one name/path pair per line, in any order of lines. --filter-mask-samples-bed-invert
-
FLAG Needs: --filter-mask-samples-bed-list
When using--filter-mask-samples-bed-list
, set this flag to invert the specified mask. Needs one of--reference-genome-fasta
,--reference-genome-dict
,--reference-genome-fai
to determine chromosome lengths. --filter-mask-samples-fasta-list
-
TEXT:FILE Excludes: --filter-mask-samples-bed-list
For each sample, genomic positions to mask, as a FASTA-like mask file.
See the below--filter-mask-total-fasta
for details. Here, individual FASTA files can be provided for each sample, for fine-grained control over the masking. The option takes a path to a file that contains a comma- or tab-separated list of sample names and FASTA file paths, with one name/path pair per line, in any order of lines. --filter-mask-samples-fasta-min
-
UINT:INT in [0 - 9]=0 Needs: --filter-mask-samples-fasta-list
When using--filter-mask-samples-fasta-list
, set the cutoff threshold for the masked digits. All positions above that value are masked. The default is 0, meaning that only exactly the positons with value 0 will not be masked. --filter-mask-samples-fasta-invert
-
FLAG Needs: --filter-mask-samples-fasta-list
When using--filter-mask-samples-fasta-list
, invert the mask. When this flag is set, the mask specified above is inverted. --filter-mask-total-bed
-
TEXT:FILE Excludes: --filter-mask-total-fasta
Genomic positions to mask (mark as missing), as a BED file.
The regions listed in the BED file are masked; this is in line with, e.g., smcpp, but is the inverse of the above usage of a BED file for selection regions, where instead the listed regions are kept. Note that this also conceptually differs from the region BED above. We here do not remove the masked positions, but instead just mark them as masked, so that they can still contribute to, e.g., denominators in the statistics for certain settings.
This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open[)
interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work. --filter-mask-total-bed-invert
-
FLAG Needs: --filter-mask-total-bed
When using--filter-mask-total-bed
, set this flag to invert the specified mask. Needs one of--reference-genome-fasta
,--reference-genome-dict
,--reference-genome-fai
to determine chromosome lengths. --filter-mask-total-fasta
-
TEXT:FILE Excludes: --filter-mask-total-bed
Genomic positions to mask, as a FASTA-like mask file (such as used by vcftools).
The file contains a sequence of integer digits[0-9]
, one for each position on the chromosomes, which specify if the position should be masked or not. Any positions with digits above the--filter-mask-total-fasta-min
value are tagged as being masked. Note that this conceptually differs from the region fasta above. We here do not remove the the masked positions, but instead just mark them as masked, so that they can still contribute to, e.g., denominators in the statistics for certain settings. --filter-mask-total-fasta-min
-
UINT:INT in [0 - 9]=0 Needs: --filter-mask-total-fasta
When using--filter-mask-total-fasta
, set the cutoff threshold for the masked digits. All positions above that value are masked. The default is 0, meaning that only exactly the positons with value 0 will not be masked. --filter-mask-total-fasta-invert
-
FLAG Needs: --filter-mask-total-fasta
When using--filter-mask-total-fasta
, invert the mask. This option has the same effect as the equivalent in vcftools, but instead of specifying the file, this here is a flag. When it is set, the mask specified above is inverted.
--filter-sample-min-count
-
UINT:POSITIVE=0
Required. Minimum base count for a nucleotide (inACGT
) to be considered as an allele. Counts below that are set to zero, and hence ignored as an allele/variant. For example, singleton read sequencing errors can be filtered out this way. --filter-sample-max-count
-
UINT=0
Maximum base count for a nucleotide (inACGT
) to be considered as an allele. Counts above that are set to zero, and hence ignored as an allele/variant. For example, spuriously high read counts can be filtered out this way. --filter-sample-deletions-limit
-
FLAG
Maximum number of deletions at a position before being filtered out. If this is set to a value greater than 0, and the number of deletions at the position is equal to or greater than this value, the sample is filtered out. --filter-sample-min-read-depth
-
UINT=0
Minimum read depth expected for a position in a sample to be considered covered. If the sum of nucleotide counts (inACGT
) at a given position in a sample is less than the provided value, the sample is ignored at this position. --filter-sample-max-read-depth
-
UINT=0
Maximum read depth expected for a position in a sample to be considered covered. If the sum of nucleotide counts (inACGT
) at a given position in a sample is greater than the provided value, the sample is ignored at this position. This can for example be used to filter spuriously high read depth positions. --filter-total-min-read-depth
-
UINT=0
Minimum read depth expected for a position in total to be considered covered. If the sum of nucleotide counts (inACGT
) at a given position in total (across all samples) is less than the provided value, the position is ignored. --filter-total-max-read-depth
-
UINT=0
Maximum read depth expected for a position in total to be considered covered. If the sum of nucleotide counts (inACGT
) at a given position in total (across all samples) is greater than the provided value, the position is ignored. This can for example be used to filter spuriously high read depth positions. --filter-total-deletions-limit
-
FLAG
Maximum number of deletions at a position before being filtered out, summed across all samples. If this is set to a value greater than 0, and the number of deletions at the position is equal to or greater than this value, the position is filtered out. --filter-total-only-biallelic-snps
-
FLAG
Filter out any positions that do not have exactly two alleles across all samples. That is, after applying all previous filters, if not exactly two counts (inACGT
) are non-zero in total across all samples, the position is not considered a biallelic SNP, and ignored. --filter-total-snp-min-count
-
UINT=0
When filtering for positions that are SNPs, use this minimum count (summed across all samples) to identify what is considered a SNP. Positions where the counts are below this are filtered out. --filter-total-snp-max-count
-
UINT=0
When filtering for positions that are SNPs, use this maximum count (summed across all samples) to identify what is considered a SNP. Positions where the counts are above this are filtered out; probably not relevant in practice, but offered for completeness. --filter-total-snp-min-frequency
-
FLOAT=0
Minimum allele frequency that needs to be reached for a position to be used. Positions where the allele frequencyaf
across all samples, or1 - af
, is below this value, are ignored. If both the reference and alternative base are known, allele frequencies are computed based on those; if only the reference base is known, the most frequent non-reference base is used as the alternative; if neither is known, the first and second most frequent bases are used to compute the frequency.
--subsample-max-read-depth
-
UINT=0
If provided, the nucleotide counts of each sample are subsampled so that they do not exceed this given maximum total read depth (sum of the four nucleotide countsACGT
, as well as the anyN
and deletedD
counts). If they are below this value anyway, they are not changed. This transformation is useful to limit the maximum read depth. For instance, the diversity estimators for Theta Pi and Theta Watterson have terms that depend on read depth. In particular when merging samples such as with--sample-group-merge-table
, having an upper limit can hence avoid long compute times. Furthermore, a very low Tajima's D, usually indicative of a selective sweep, may be found as an artifact in highly covered regions, as such regions have just more sequencing errors. To avoid these kinds of biases we recommend to subsample to an uniform read depth. This transformation is applied after the numerical filters, so that, e.g., filters for high read depth are able to remove any unwanted positions first. See--subsample-method
for the subsampling method. --subsample-method
-
TEXT:{subscale,subsample-with-replacement,subsample-without-replacement}=subscale Needs: --subsample-max-read-depth
When using--subsample-max-read-depth
, decide which method to use. The defaultsubscale
simply re-scales the base counts to the given max read depth, and hence maintains the allele frequencies (within integer precision). We recommend to use this to subsample to, e.g., a max read depth of 10,000, which is a good compromise in most cases.The two alternative options re-sample instead, with and without replacement, by drawing from a multinomial or multivariate hypergeometric distribution, respectively, based on the original counts of the sample.
--window-type
-
TEXT:{interval,queue,single,regions,chromosomes,genome}=interval
Required. Type of window to use. Depending on the type, additional options might need to be provided.
(1)interval
: Typical sliding window over intervals of fixed length (in bases) along the genome.
(2)queue
: Sliding window, but instead of using a fixed length of bases along the genome, it uses a fixed number of positions of the input data. Typically used for windowing over variant positions such as (biallelic) SNPs, and useful for example when SNPs are sparse in the genome.
(3)single
: Treat each position of the input data as an individual window of size 1. This is typically used to process single SNPs, and equivalent tointerval
orqueue
with a width/count of 1, except that positions that are removed by some filter are skipped.
(4)regions
: Windows corresponding to some regions of interest, such as genes. The regions need to be provided, and can be overlapping or nested as needed.
(5)chromosomes
: Each window covers a whole chromosome.
(6)genome
: The whole genome is treated as a single window. --window-interval-width
-
UINT=0
Required when using--window-type interval
: Width of each window along the chromosome, in bases. --window-interval-stride
-
UINT=0
When using--window-type interval
: Stride between windows along the chromosome, that is how far to move to get to the next window. If set to 0 (default), this is set to the same value as the--window-interval-width
. --window-queue-count
-
UINT=0
Required when using--window-type queue
: Number of positions in the genome in each window. This is most commonly used when also filtering for variant positions such as (biallelic) SNPs (which most commands do implicitly), so that each window of the analysis conists of a fixed number of SNPs, instead of a fixed length along the genome. --window-queue-stride
-
UINT=0
When using--window-type queue
: Stride of positions between windows along the chromosome, that is how many positions does the window move forward each time. If set to 0 (default), this is set to the same value as the--window-queue-count
, meaning that each new window consists of new positions. --window-region
-
TEXT=[] ...
When using--window-type regions
: Genomic region to process as windows, in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", or "chr:start..end". Positions are 1-based and inclusive (closed intervals). Multiple region options can be provided to add region windows to be processed. --window-region-list
-
TEXT:FILE=[] ...
When using--window-type regions
: Genomic regions to process as windows, as a file with one region per line, either in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", "chr:start..end", or tab- or space-delimited "chr position" or "chr start end". Positions are 1-based and inclusive (closed intervals). Multiple region options can be provided to add region windows to be processed. --window-region-bed
-
TEXT:FILE=[] ...
When using--window-type regions
: Genomic regions to process as windows, as a BED file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open[)
interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work. Multiple region options can be provided to add region windows to be processed. --window-region-gff
-
TEXT:FILE=[] ...
When using--window-type regions
: Genomic regions to process as windows, as a GFF2/GFF3/GTF file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Multiple region options can be provided to add region windows to be processed. --window-region-skip-empty
-
FLAG
When using--window-type regions
: In cases where there is no data in the input files for a region window, by default, we produce some "empty" or NaN output. With this option however, regions without data are skipped in the output.
--window-average-policy
-
TEXT:{window-length,available-loci,valid-loci,valid-snps,sum,provided-loci}
Required. Denominator to use when computing the average of a metric in a window:
(1)window-length
: Simply use the window length, which likely underestimates the metric, in particular in regions with low coverage and high missing data.
(2)available-loci
: Use the number of positions for which there was data at all (that is, absent or missing data is excluded), independent of all other filter settings.
(3)valid-loci
: Use the number of positions that passed all quality and numerical filters (that is, excluding the SNP-related filters). This uses all positions of high quality, and is the recommended policy when the input contains data for all positions.
(4)valid-snps
: Use the number of SNPs only. This might overestimate the metric, but can be useful when the data only consists of SNPs.
(5)sum
: Simply report the sum of the per-site values, with no averaging applied to it. This can be used to apply custom averaging later.
(6)provided-loci
: Use the exact loci provided via--window-average-loci-bed
or--window-average-loci-fasta
to determine the denominator for the window averaging, by counting all positions set in this mask in the given window. --window-average-loci-bed
-
TEXT:FILE Needs: --window-average-policy Excludes: --window-average-loci-fasta
Genomic positions to use for--window-average-policy provided-loci
, as a BED file. The regions listed in the BED file are counted towards the window average denominator.
This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open[)
interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work. --window-average-loci-bed-invert
-
FLAG Needs: --window-average-loci-bed
When using--window-average-loci-bed
, invert the set of loci. When this flag is set, the loci that are not set are used for the denominator. Needs one of--reference-genome-fasta
,--reference-genome-dict
,--reference-genome-fai
to determine chromosome lengths. --window-average-loci-fasta
-
TEXT:FILE Needs: --window-average-policy Excludes: --window-average-loci-bed
Genomic positions to use for--window-average-policy provided-loci
, as a FASTA-like mask file (such as used by vcftools).
The file contains a sequence of integer digits[0-9]
, one for each position on the chromosomes, which specify if the position should be counted towards the window denominator. Any positions with digits above the--window-average-loci-fasta-min
value are used. --window-average-loci-fasta-min
-
UINT:INT in [0 - 9]=0 Needs: --window-average-loci-fasta
When using--window-average-loci-fasta
, set the cutoff threshold for the counted digits. All positions above that value are counted. The default is 0, meaning that only exactly the positons with value 0 will not be counted. --window-average-loci-fasta-invert
-
FLAG Needs: --window-average-loci-fasta
When using--window-average-loci-fasta
, invert the set of loci. When it is set, all positions in the FASTA-like file below or equal to the threshold are counted towards the window average denominator.
--pool-sizes
-
TEXT
Required. Pool sizes for all samples that are used (not filtered out). These are the number of haploids, so 100 diploid individuals correspond to a pool size of 200. Either
(1) a single pool size that is used for all samples, specified on the command line, or
(2) a path to a file that contains a comma- or tab-separated list of sample names and pool sizes, with one name/size pair per line, in any order of lines. --tajima-d-denominator-policy
-
TEXT:{empirical-min-read-depth,provided-min-read-depth,popoolation-bugs,pool-size,uncorrected}=empirical-min-read-depth
Estimator for the expected number of distinct individuals sequenced in the denominator of the Achaz (2008) correction of Tajima's D, following its adaptation by PoPoolation. With pool seq data, there is no simple way to obtain a statistic that is numerically comparable to the classic Tajima's D with individual data. Hence, all of the below are simplicications that introduce some bias.
(1)empirical-min-read-depth
: Use the lowest empirical read depth found in each window, and the pool size, to compute the expected number of individuals sequenced. This is a conservative estimator that we recommend by default.
(2)provided-min-read-depth
: Same as (1), but use the user-provided--filter-sample-min-read-depth
instead of the empirical minum read depth. This is what PoPoolation uses.
(3)popoolation-bugs
: Same as (2), but additionally re-introduce their bugs. We offer this for comparability with PoPoolation.
(4)pool-size
: Directly use the pool size as an estimate of the number of individuals, instead of computing the expected value. This assumes the number of individuals sequenced to be equal to the pool size, and is good under high read depths.
(5)uncorrected
: The Achaz correction is not applied, so that the result is simply Theta Pi minus Theta Watterson. Hence, magnitudes of values are not comparable to classic Tajima's D. Still, using their sign, and comparing them across windows can be useful. --no-theta-pi
-
FLAG
Do not compute or output Theta Pi. Note that if Tajima's D is computed, we also need to compute the two thetas, but will then not print them in the output. --no-theta-watterson
-
FLAG
Do not compute or output Theta Watterson. Note that if Tajima's D is computed, we also need to compute the two thetas, but will then not print them in the output. --no-tajima-d
-
FLAG
Do not compute Tajmias' D. --no-extra-columns
-
FLAG
Do not output the extra columns containing counts for each position and sample pair that summarize the effects of the filtering. Only the window coordinates and the fst values are printed in that case.
--separator-char
-
TEXT:{comma,tab,space,semicolon}=comma Excludes: --popoolation-format
Separator char between fields of output tabular data. --na-entry
-
TEXT=nan Excludes: --popoolation-format
Set the text to use in the output for n/a and NaN entries (e.g., resulting from positions with no counts, or windows with no variants). This is useful to match formatting expectations of downstream software. --popoolation-format
-
FLAG Excludes: --separator-char --na-entry
If set, instead of writing one output table for all measures and all samples, write the results in separate files for each sample and for each measure of Theta Pi, Theta Watterson, and Tajima's D, following the format of PoPoolation.
--out-dir
-
TEXT=.
Directory to write files to --file-prefix
-
TEXT
File prefix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. --file-suffix
-
TEXT
File suffix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. --compress
-
FLAG
If set, compress the output files using gzip. Output file extensions are automatically extended by.gz
.
--allow-file-overwriting
-
FLAG
Allow to overwrite existing output files instead of aborting the command. By default, we abort if any output file already exists, to avoid overwriting by mistake. --verbose
-
FLAG
Produce more verbose output. --threads
-
UINT
Number of threads to use for calculations. If not set, we guess a reasonable number of threads, by looking at the environmental variables (1)OMP_NUM_THREADS
(OpenMP) and (2)SLURM_CPUS_PER_TASK
(slurm), as well as (3) the hardware concurrency (number of CPU cores), taking hyperthreads into account, in the given order of precedence. --log-file
-
TEXT
Write all output to a log file, in addition to standard output to the terminal.
When using this method, please do not forget to cite
Lucas Czech, Jeffrey Spence, Moises Exposito-Alonso. grenedalf: population genetic statistics for the next generation of pool sequencing. Bioinformatics, vol. 40, no. 8, 2024. doi:10.1093/bioinformatics/btae508
Robert Kofler, Pablo Orozco-terWengel, Nicola De Maio, Ram Vinay Pandey, Viola Nolte, Andreas Futschik, Carolin Kosiol, Christian Schlötterer. PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals. PLoS ONE, vol. 6, no. 1, 2011. doi:10.1371/journal.pone.0015925