xerxes
is a command line software tool for population genetic analyses of Poseidon packages. It is written in Haskell and openly available on GitHub.
To download the latest stable release version of xerxes
click here:
📥 Linux |
📥 macOS (ARM64) |
📥 macOS (X64) |
📥 Windows
So in Linux you can run the following commands to get started:
# download the current stable release binary
wget https://github.com/poseidon-framework/poseidon-analysis-hs/releases/latest/download/xerxes-Linux
# make it executable
chmod +x xerxes-Linux
# run it
./xerxes-Linux -h
On GitHub you will also find older release versions and instructions to build xerxes from source. The relevant changes from one version to the next are documented in this changelog.
With xerxes --help
and xerxes <subcommand> --help
you can get information about each subcommand and parameter directly on the command line. The guide below explains the subcommands in more detail. It is available in .pdf format for the current and previous versions here:
- 🗎 Guide for xerxes v1.0.1.0 to v1.0.1.1 (shown below)
- 🗎 Guide for xerxes v1.0.0.2
- 🗎 Guide for xerxes v0.2.0.0
See the Poseidon website (https://www.poseidon-adna.org/#/xerxes) or the GitHub repository (https://github.com/poseidon-framework/poseidon-analysis-hs) for up-to-date installation instructions.
Xerxes allows you to analyse genotype data across Poseidon packages, including your own, by pre-loading sets of packages via a --baseDir
(or -d
) parameter. From the pre-loaded packages it selects the ones relevant for the statistics you requested, and then streams through only these. It thus computes arbitrary F-Statistics across groups and individuals distributed in many packages, without the need to explicitly merge data first.
In this process xerxes
takes care of merging PLINK and EIGENSTRAT data on the fly and works across different SNP sets, like Human-Origins and 1240k. It flips alleles automatically across genotype files, and throws an error if the alleles in different packages are incongruent with each other.
Here is an example command for computing several F-Statistics:
xerxes fstats -d ... -d ... \
--stat "F4(<Chimp.REF>, <Altai_published.DG>, Yoruba, French)" \
--stat "F3(<Chimp.REF>, <Altai_snpAD.DG>, Spanish)" \
--statFile fstats.txt
--statConfig fstats.yaml
-f outputfile.txt
First, the two options -d ...
exemplify that you need to provide at least one base directory with source Poseidon packages, but can also give multiple. Second, F-Statistics can be entered in three different ways:
- Directly via the command line using
--stat
. - With a simple text file using
--statFile
. - With a powerful configuration file format via
--statConfig
.
These three modes of input can be mixed and matched, and even given multiple times. They are explained below.
Last, the option -f
can be used to write the output table into a tab-separated text file, beyond just printing a table into the standard out when the program finishes. Note that there are more options, which you can view using xerxes fstats --help
:
Usage: xerxes fstats (-d|--baseDir DIR) [-j|--jackknife ARG]
[-e|--excludeChroms ARG]
(--stat ARG | --statConfig ARG | --statFile ARG)
[--noTransitions] [-f|--tableOutFile ARG]
[--blockTableFile ARG]
Compute f-statistics on groups and invidiuals within and across Poseidon
packages
Available options:
-h,--help Show this help text
-d,--baseDir DIR A base directory to search for Poseidon packages.
-j,--jackknife ARG Jackknife setting. If given an integer number, this
defines the block size in SNPs. Set to "CHR" if you
want jackknife blocks defined as entire chromosomes.
The default is at 5000 SNPs
-e,--excludeChroms ARG List of chromosome names to exclude chromosomes,
given as comma-separated list. Defaults to X, Y, MT,
chrX, chrY, chrMT, 23,24,90
--stat ARG Specify a summary statistic to be computed. Can be
given multiple times. Possible options are: F4(a, b,
c, d), F3(a, b, c), F3star(a, b, c), F2(a, b), PWM(a,
b), FST(a, b), Het(a) and some more special options
described at
https://poseidon-framework.github.io/#/xerxes?id=fstats-command.
Valid entities used in the statistics are group names
as specified in the *.fam, *.ind or *.janno failes,
individual names using the syntax "<Ind_name>", so
enclosing them in angular brackets, and entire
packages like "*Package1*" using the Poseidon package
title. You can mix entity types, like in
"F4(<Ind1>,Group2,*Pac*,<Ind4>)". Group or individual
names are separated by commas, and a comma can be
followed by any number of spaces.
--statConfig ARG Specify a yaml file for the Fstatistics and group
configurations
--statFile ARG Specify a file with F-Statistics specified similarly
as specified for option --stat. One line per
statistics, and no new-line at the end
--maxSnps ARG Stop after a maximum nr of snps has been processed.
Useful for short test runs
--noTransitions Skip transition SNPs and use only transversions
-f,--tableOutFile ARG a file to which results are written as tab-separated
file
--blockTableFile ARG a file to which the per-Block results are written as
tab-separated file
The following statistics are allowed in the --stat
, --statFile
and --statConfig
options. In all of the following, the symbols a
, b
, c
or d
stand for arbitrary entities allowed in Poseidon, so groups (such as French
), individuals (such as <MA1.SG>
) or packages (such as *2012_PattersonGenetics*
).
F2vanilla(a,b)
: F2-Statistics - Vanilla version. Computed usingF2vanilla(a,b) = (a-b)^2
across the genome.F2(a,b)
: F2-Statistics (bias-corrected version). Computed asF2(a, b) = F2vanilla(a,b) - hA/sA - hB/sB
, wheresA
is the number of non-missing alleles in entitya
, andhA = nA * nA' / sA * (sA - 1)
is an estimator of half the heterozygosity (seeHet(a)
), and likewise forsB
andnB
etc.F3vanilla(a,b,c)
: F3-Statistics - Vanilla version, recommended if used as Outgroup-F3 statistics or with groupc
being pseudo-haploid: Are computed asF3(a,b,c) = (c-a)(c-b)
across all SNPs.F3(a,b,c)
: F3-Statistics (bias-corrected version). Computed asF3(a,b,c) = F3vanilla(a,b) - hC/sC
.F3star(a,b,c)
: F3-Statistics as defined in @Patterson2012 - normalised and bias-corrected version, recommended for Admixture-F3 tests. Are computed by i) first subtracting per SNP from the vanilla-F3-Statistic a bias-correction termhC/sC
, as above for F2, and ii) then normalising the genome-wide estimate by a genome-wide estimate of the heterozygosity of entityc
(Het(c)
), in order to make results comparable between different groupsc
.F4(a,b,c,d)
: F4 statistics. Are computed by averaging the quantity(a-b)(c-d)
across all SNPs. No bias correction is necessary for this statistic.Het(a)
: An estimate of the heterozygosity across all SNPs, computed as2*hA
, withhA
defined as above forF2
.FST(a,b)
: An estimate of FST across the genome, following the estimator presented in @Bhatia2013 and implemented in the ADMIXTOOLS package. This amounts to a ratio of genome-wide ranges, where the numerator is an unbiased estimate ofF2
(see above), and the denominator isPWM(a,b)
(see below).FSTvanilla(a,b)
: Similar toFST(a,b)
but without the bias correction in the numerator, mainly useful for teaching and learning.PWM(a,b)
: The pairwise mismatch rate between entitiesa
andb
, computed from allele frequencies asa (1 - b) + (1 - a) b
.
Most of these equations can also be found in @Patterson2012. See also Appendix A of this paper for the unbiased estimators used above.
For each of the "slots" A, B, C or D, you can enter:
- Individuals, using the syntax
<Individual_Name>
- Groups, using no special syntax
Group_Name
- Packages, using syntax
*Package_Name*
(This can be useful, for example, if you happen to have a homogenous set of individuals from multiple groups in one package and want to consider all of these as one group.)
This is the simplest option to instruct the program to compute a specified statistic. Each statistic requires a separate input using --stat
using this input method. Example:
xerxes fstats -d ... -d ... --stat "F3(French, Spanish, <Chimp.REF>) --stat "FST(French, Spanish)"
You can prepare a text file, e.g. fstats.txt
, into which you write the above statistics, one statistic per line. Example:
F4(<Chimp.REF>, <Altai_published.DG>, Yoruba, French)
F4(<Chimp.REF>, <Altai_snpAD.DG>, Spanish, French)
F4(Mbuti,Nganasan,Saami.DG,Finnish)
You can then load these statistics using the option --statFile fstats.txt
.
This is the most powerful way to input F-Statistics. Example:
groupDefs:
CEU2: ["CEU.SG", "-<NA12889.SG>", "-<NA12890.SG>"]
FIN2: ["FIN.SG", "-<HG00383.SG>", "-<HG00384.SG>"]
GBR2: ["GBR.SG", "-<HG01791.SG>", "-<HG02215.SG>"]
IBS2: ["IBS.SG", "-<HG02238.SG>", "-<HG02239.SG>"]
fstats:
- type: F2
a: ["French", "Spanish"]
b: ["Han", "CEU2"]
# Ascertainment is optional
- type: F3 # This will create 3x2x1 = 6 Statistics
a: ["French", "Spanish", "Mbuti"]
b: ["Han", "CEU2"]
c: ["<Chimp.REF>"]
ascertainment:
outgroup: "<Chimp.REF>" # ascertaining on outgroup-polarised derived allele frequency
reference: "CEU2"
lower: 0.05
upper: 0.95
- type: F4 # This will create 5x5x4x1 = 100 Statistics
a: ["<I0156.SG>", "<I0157.SG>", "<I0159.SG>", "<I0160.SG>", "<I0161.SG>"]
b: ["<I0156.SG>", "<I0157.SG>", "<I0159.SG>", "<I0160.SG>", "<I0161.SG>"]
c: ["CEU2", "FIN2", "GBR2", "IBS2"]
d: ["<Chimp.REF>"]
ascertainment:
# A missing outgroup means: ascertain on minor allele frequency
reference: "CEU.SG"
lower: 0.00
upper: 0.10
You can save this into a text file, for example named fstats_config.yaml
, and load it via --statConfig fstats_config.yaml
.
The top level structure of this YAML file is an object with two fields: groupDefs
(which is optional) and fstats
(which is mandatory).
You can specify ad-hoc group definitions using the syntax above. Every group consists of a name (used as object key) and then a JSON- or YAML-list of signed entities, following the same syntax as trident forge
. Briefly: Individuals, Groups and Packages can be added or excluded (prefixed by a -
) in order. In the example above, two individuals are removed from each group.
Note that currently, groups can be defined only independently, so not incremental to each other. That means, you cannot currently use an already defined new group name in the entity list of a following group name.
Each statistic defined in the fstats
section of the YAML file, actually defines a loop over multiple populations in each statistic. In the example above, there are 6 F3-Statistics, each using a different combination of the input groups defined in each of the a:
, b:
and c:
slots. There are also 100 (!) F4 statistics, following all combinations of 5x5x4x1 slots defined in a:
, b:
, c:
and d:
.
In addition, every statistic section allows for a definition of an ascertainment specification, using a special key ascertainment:
, which is optional. If given, you can specify an optional outgroup
, a reference
group in which to ascertain SNPs, and lower
and upper
allele frequency bounds. If specified, only SNPs for which the reference
group has an allele frequency within the given bounds are used to compute the statistic (note that normalisation is still using all non-missing SNPs for that given statistic). If an outgroup
is defined, then the outgroup-polarised derived allele frequency is used. If no outgroup
is defined, then the minor allele frequency is used instead. If an outgroup is defined, any sites where the outgroup is polymorphic are treated as missing.
The final output of the fstats
command looks like this:
.-----------.-----------.---------------.--------.-------.---------.
| Statistic | a | b | c | d | NrSites |
:===========:===========:===============:========:=======:=========:
| F3 | French | Italian_North | Mbuti | | 593124 |
| F3 | French | Han | Mbuti | | 593124 |
| F3 | Sardinian | Pima | French | | 593124 |
| F4 | French | Russian | Han | Mbuti | 593124 |
| F4 | Sardinian | French | Pima | Mbuti | 593124 |
'-----------'-----------'---------------'--------'-------'---------' ->
----------------.--------------------.------------------.---------------------.
Estimate_Total | Estimate_Jackknife | StdErr_Jackknife | Z_score_Jackknife |
================:====================:==================:=====================:
5.9698e-2 | 5.9698e-2 | 5.1423e-4 | 116.0908951980249 |
5.0233e-2 | 5.0233e-2 | 5.0324e-4 | 99.81843057232513 |
-1.2483e-3 | -1.2483e-3 | 9.2510e-5 | -13.493505348221081 |
-1.6778e-3 | -1.6778e-3 | 9.1419e-5 | -18.35262346091248 |
-1.4384e-3 | -1.4384e-3 | 1.1525e-4 | -12.481084899924868 |
----------------'--------------------'------------------'---------------------'
This output table lists each statistic, the slots a
, b
, c
and d
, the number of sites with non-missing data for that statistic, ascertainment information (outgroup, reference, lower and upper bound, if given), the genome-wide estimate, its standard error and its Z-score. If you specify an output file using option --tableOutFile
or -f
, these results are also written as a simple, tab-separated file.
Additionally, an option --blockOutFile
can be specified. This creates a file to which a table with estimates per Jackknife block is written.
Specific cases of statistics yield zero by construction:
F2(a,b)
,F2vanilla(a,b)
,FST(a,b)
andFSTvanilla(a,b)
wherea=b
.F3(a,b,c)
andF3vanilla(a,b,c)
wherec=a
orc=b
F4(a,b,c,d)
wherea=b
orc=d
Even though the bias-correction technically can result in non-zero and even negative values, we automatically detect these cases and output zero for them. This can be useful, for example, when looping over pairs of populations for a pairwise matrix of FST
, where we then want the diagonal to be zero to yield a proper distance matrix.
Genotype ploidy in input samples is important for many of the statistics, because the bias-correction terms require the number of chromosomes. Ploidy information is automatically read through the field of Genotype_Ploidy
in the .janno
file. A warning is printed if that information is missing, in which case we assume diploid genotypes. But often with low-coverage data from ancient DNA we create pseudo-haploid genotypes, so in that case it is important to provide that information correctly through the .janno
file.
In specific cases statistics are illegal with only a single haplotype. Specifically:
F2(a,b)
andFST(a,b)
is undefined if either one ofa
orb
contains only a single haplotype.F3(a,b,c)
is undefined ifc
contains only a single haplotype.Het(a)
unsurprisingly is undefined ifa
contains only a single haplotype.
These cases are detected and an error is thrown. For F2
, F3
and FST
you can use the "vanilla" versions of the statistics if that makes sense. This is particularly relevant for so-called "Outgroup-F3-Statistics", where we sometimes use a single haploid reference genome in position c
. Use F3vanilla
in that case.
The repository comes with a detailed whitepaper that describes some more mathematical details of the methods implemented here.
The RAS command computes pairwise RAS statistics between a collection of "left" entities, and a collection of "right" entities. Every entity is either a group name or an individual, with similar syntax as for the F-Statistics above, so French
is a group, and <IND001>
is an individual.
The input of left-pops and right-pops uses a YAML file via --popConfigFile
. Here is an example:
groupDefs:
group1: a,b,-c,-<d>
group2: e,f,-<g>
popLefts:
- <I13721>
- <I14000>
- <I13722>
- <Iceman.SG>
popRights:
- Mbuti
- Mixe
- Spanish
outgroup: <Chimp.REF>
In this case, two groups are defined on the fly: group1
comprises groups a
and b
, but excludes group c
and individual d
. Note that inclusions and exclusions are executed in order. group2
comprises of group e
and group f
, but excludes individual <g>
.
As in RAScalculator @Flegontov2019, the allele frequency ascertainment is done across right populations only.
The are a couple of options, as specified in the CLI help (xerxes ras --help
):
Usage: xerxes ras (-d|--baseDir DIR) [-j|--jackknife ARG]
[-e|--excludeChroms ARG] --popConfigFile ARG
(--minAC ARG | --minFreq ARG | --noMinFreq)
(--maxAC ARG | --maxFreq ARG | --noMaxFreq)
[-m|--maxMissingness ARG] [--blockTableFile ARG]
[--f4TableOutFile ARG] [--noTransitions] [--bedFile ARG]
Compute RAS statistics on groups and individuals within and across Poseidon
packages
Available options:
-h,--help Show this help text
-d,--baseDir DIR A base directory to search for Poseidon packages.
-j,--jackknife ARG Jackknife setting. If given an integer number, this
defines the block size in SNPs. Set to "CHR" if you
want jackknife blocks defined as entire chromosomes.
The default is at 5000 SNPs
-e,--excludeChroms ARG List of chromosome names to exclude chromosomes,
given as comma-separated list. Defaults to X, Y, MT,
chrX, chrY, chrMT, 23,24,90
--popConfigFile ARG a file containing the population configuration
--minAC ARG define a minimal allele-count cutoff for the RAS
statistics.
--minFreq ARG define a minimal allele-frequency cutoff for the RAS
statistics.
--noMinFreq switch off the minimum allele frequency filter
--maxAC ARG define a maximal allele-count cutoff for the RAS
statistics.
--maxFreq ARG define a maximal allele-frequency cutoff for the RAS
statistics.
--noMaxFreq switch off the maximum allele frequency filter. This
cam help mimic Outgroup-F3
-m,--maxMissingness ARG define a maximal missingness for the right
populations in the RAS statistics. (default: 0.1)
--blockTableFile ARG a file to which the per-Block results are written as
tab-separated file
--f4TableOutFile ARG a file to which F4 computations are written as
tab-separated file
--maxSnps ARG Stop after a maximum nr of snps has been processed.
Useful for short test runs
--noTransitions Skip transition SNPs and use only transversions
--bedFile ARG An optional bed file that gives sites to be included
in the analysis.
The output gives both cumulative (up to allele-count k) and per-allele-frequency RAS (for allele count k) for every pair of left and rights. The standard output contains a pretty-printed table. A tab-separated file is written to the file specified using the option -f
.
xerxes ras
makes a few important assumptions:
- It assumes that the Right Populations are "nearly" completely non-missing. Any allele that is actually missing from the rights is in fact treated as homozygous-reference! A different approach would be to compute the actual frequencies on the non-missing right alleles, but then we cannot any more nicely accumulate over different ascertainment allele counts.
- If no outgroup is specified, the ascertainment operates on minor-allele frequency (as in
fstats
) - If an outgroup is specified and missing from a SNP, or if the SNP is polymorphic, the SNP is skipped as missing