# _________. ______. __________________.__
# <\______ \\ \\____ _ \.___ _ | |__ ____. _____. .__.
# | _/| ._// ^ \| _/| ___| | \\__ \ / ____/ __ \
# | \| / | |> | \ | | Y \/ __ \_\___ \\ ___/
# |______ \____|__ |\___/ . \___/ . |___| (____ /____ >\___ >
# . \/ . . \/ . . . . . \/ . \/ \/ . \/ v1.3
# . . . . . . .
# Copyright (c) 2016, Lloyd T. Elliott and Yee Whye Teh. .
# . . . .
# All rights reserved. .
-
Impute genetic data using a nonhomogeneous and nonparametric HMM.
-
Genetic sequence data are well described by hidden Markov models (HMMs) in which latent states correspond to clusters of similar mutation patterns. Theory from statistical genetics suggests that these HMMs are nonhomogeneous (their transition probabilities vary along the chromosome) and have large support for self transitions. BNPPhase is a new nonparametric model of genetic sequence data, based on the hierarchical Dirichlet process, which supports these self transitions and nonhomogeneity. BNPPhase provides a parameterization of the genetic process that is more parsimonious than other more general nonparametric models which have previously been applied to population genetics. BNPPhase uses truncation-free MCMC inference derived through a new auxiliary sampling scheme for Bayesian nonparametric HMMs. The popular finite model fastPHASE can be seen as a parametric truncation of the BNPPhase model.
-
Detail about the BNPPhase model is given in ‘L.T. Elliott, Y.W. Teh. A nonparametric HMM for genetic imputation and coalescent inference. 2016. Electronic Journal of Statistics.’
-
BNPPhase is released under the BSD 2-clause license, which is provided in
LICENSE.md
. The BNPPhase executable is bundled with a Scala library and libraries from the Apache Commons. Scala is provided under the BSD 3-clause license (provided inclasses/scala/LICENSE.md
and the libraries from the Apache commons are provided under the Apache Software License 2.0 (provided inclasses/commons/LICENSE.md
. -
perl
andjava
are required in order to run BNPPhase. This software has been tested withperl
version 5.20 andjava
openjdk version 1.8, on a Linux system running Fedora 22. This software should support a wide variety of systems (including cygwin) and it should support many earlier or later versions of perl and java. If you encounter problems running BNPPhase please contact us using the information at the bottom of this manual and we would be happy to assist you.
-
The following command performs imputation using the BNPPhase model. Advanced usage (in which control is given over all aspects of the prior, and also over which statistics are collected from the MCMC samples) is provided in the
ADVANCED USAGE
section below../BNPPhase -input -output -impute [-iters ] [-burnin ] [-init ] [-thin ] [-random_restarts ] [-seed ]
-
Performs imputation on the phased genotype file IN-FILE and stores the result in the directory OUT-DIR (this directory is created if it does not already exist). Imputed values are found by averaging over collected MCMC iterations.
-impute
This flag specifies that BNPPhase should perform imputation on
missing genotypes.
-iters <INT>
Set the number of MCMC iterations to collect to INT
.
Default value is 60
.
-burnin <INT>
Set the number of MCMC iterations to perform before starting
collection of MCMC iterations to INT
. Default value is 200
.
-init <INT>
Set the number of MCMC iterations to do after the SMC
initialization in which the trajectories are fixed and the parameters are
resampled to INT
. These MCMC iterations occur before the burnin
iterations. Default value is 10
.
-thin <INT>
Set the number of MCMC iterations to perform and discard
between each collected iteration to INT. Default value is 1
. Note that the
total number of MCMC iterations performed in which the trajectories are
resampled is burnin + (thin + 1) * iters
.
-random_restarts <INT>
The default number of random restarts is 1
.
-seed <INT>
Set the seed for the random number generator to INT
. If
this flag is not provided, then a seed will be read from the random device
/dev/random
.
- The input file must be a matrix of binary
0/1
coded SNPs. Each column represents a SNP, and each row represents a chromosome of a subject. The input is assumed to be phased: so if a diploid chromosome is being considered, then each subject must occupy two rows of the input file, and each row must be a0/1
sequence representing the alleles for each of the two phased chromosomes for that subject. The input file must be space separated.
Unobserved or held out alleles must be represented by a question mark ?
.
An example file is provided as follows:
-
example/input.txt
0 0 0 1 0 0 0 0 1 0 0 0 0 1 ? 1 0 0 1 0 1 0 0 1 ? 1 0 ? 1 ? 1 0 ? 0 1 1 0 ? 0 1 1 0 1 0 1 1 0 1 0 1
In this example, 10
sequences are provided and are marked at 5 locations.
The first two sequences are fully observed, and the third sequence has the
5
-th SNP held out.
-
The command specified in the
USAGE
section above creates a directory calledOUT-DIR
with the following structure:OUT-DIR/cmd OUT-DIR/solution/000000
The file OUT-DIR/cmd
contains a copy of the command line arguments used to
specify the call to the BNPPhase software. The file OUT-DIR/solution/000000
contains calls for the alleles from the input file which are either held out or
unobserved. The columns and rows of OUT-DIR/solution/000000
match the
columns and rows of the input file. Alleles that are observed in the input file
are replaced with a -1
in the file OUT-DIR/solution/000000
.
-
The following command performs imputation on the example file provided with this software:
./BNPPhase -input example/input.txt -output example/output -impute
After running this command, the accuracy of BNPPhase's imputed solution can be computed:
./bin/accuracy example/validate.txt example/output/solution/000000
The bin/accuracy
command compares the calls of BNPPhase to a specification
of the true values of alleles held out from example/input.txt
(in
example/validate.txt
), and outputs the proportion of calls that are
correct. (If bin/accuracy
detects an error in the alignment or
specification of the files passed to it, then it outputs a ?
or a X
).
The format of the validation file is the same of the format specified in the
INPUT FORMAT
section. All observed alleles must be marked as unobserved
(i.e., with an ?
) in the validation file. The above accuracy command
should produce output which looks something like this:
0.83333333
This indicates that the accuracy of BNPPhase on the example dataset is
~83%
.
./BNPPhase
# Basic options
-input <IN-FILE>
-output <OUT-DIR>
[-impute]
[-iters <INT>]
[-burnin INT>]
[-init <INT>]
[-thin <INT>]
[-random_restarts <INT>]
[-seed <INT>]
# System and I/O options
[-m <INT>(G|M)]
[-a]
# MCMC options
[-calls]
[-statistics]
[-mosaic]
[-trace]
# Model parameterization
[-mass (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)]
[-varmass (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)]
[-mu (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)]
[-varmu (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)]
[-gamma (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)]
[-r (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)]
[-b (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)]
[-likelihood <STRING>]
[-mass_prior <STRING>]
[-mu_prior <STRING>]
[-fixmass]
[-fixmu]
[-fixgamma]
[-fixb]
[-fixr]
- MCMC options. The following flags control what summary statistics are saved to disk in collected MCMC iterations.
-trace
If this flag is specified, calls for every iteration of the MCMC
will be recorded and saved to the OUT-DIR/trace/
directory. Here,
<OUT-DIR>
is the directory provided to the -output
flag. A file for
each iteration of the MCMC (including burnin and thinned samples) are
reported in this directory. The probabilities used to produce these calls are
given by the MCMC states (i.e., MCMC states are not averaged over multiple
iterations to produce the calls in the OUT-DIR/trace/
directory). The
initial MCMC iterations (in which the parameters are updated with fixed
trajectories) are not included in the OUT-DIR/trace/
directory. The format
used for the trace files is the same as the format described in the
OUTPUT FORMAT
section (i.e., with -1
for sample/location pairs that
appear in the input file).
-mosaic
If this flag is specified, the cluster numbers of the top level
Dirichlet process to which each sequence is assigned at each location are
recorded during the MCMC, forming a mosaic pattern of haplotypes [2]. This
mosaic pattern is saved to the directory <OUT-DIR>/mosaic/
.
-statistics
If this flag is specified, BNPPhase will record additional
statistics for each MCMC iteration. These statistics are saved in the
following files in the <OUT-DIR>
directory: beta
, gamma
, r
,
and statistics
. The beta
file contains the value of the location
dependent mean minor allele frequency. This is the parameter b
from [1].
Each line contains a space separated list of the values, one for each marker.
Each MCMC iteration is described by one of the lines (including the initial
state, which is on the first line). The gamma
file contains the location
dependent minor allele frequency pseudo-observations, in the same format as
the beta
file. These two files are only meaningful if the -likelihood
flag does not specify an allele likelihood other than the default
beta/Bernoulli likelihood (the -likelihood
flag is described below).
The r
file contains the values of the jump rates at each iteration (with
one line per iteration). The number of jump rates is one less than the number
of markers (there is no jump rate associated with the first marker), and that
is reflected in the number of columns of r
.
The sampled values of the latent parameters and some marginal statistics for
the BNPPhase model are contained in the statistics
file. Again, there is
one line per iteration and in addition the first line is a header describing
the columns.
-calls
If this flag is specified, then for each MCMC iteration, calls are
produced in the <OUT-DIR>/calls/
directory by combining marginals for
each previously collected MCMC iteration. The format of the files in this
directory is the same as the format described in the OUTPUT FORMAT
section.
- System and I/O options:
-m <INT>(G|M)
This flag sets the internal memory used by BNPPhase to
INT
. The suffix G
or M
denotes GBs
or MBs
respectively. By
default, the amount of memory used by BNPPhase is 1G
.
-a
This flag specifies that BNPPhase should conduct internal assertions and
verifications. If BNPPhase crashes or produces unexpected results, please
rerun with the -a
flag and send us a bug report along with a stack trace
(if any).
- Model specification. The following flags control the initialization and range
of latent variables used by the BNPPhase model. There are three ways of
specifying each flag:
<INIT>
(without angle brackets) will set the initial value of the variable toINIT
(a real number).<MIN>:<MAX>
will set the minimum and maximum values for a variable toMIN
andMAX
respectively. The prior on the variable will be truncated to the range[MIN, MAX]
.<MIN>:<INIT>:<MAX>
will set both the initial value of the variable and also its range. The naming convention of the parameters used in this software is slightly different than the naming convention used in [1]. Also, some of the default values of this software are different from the settings used in [1], so some care must be taken in setting these parameters.
-mass (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)
Specify the concentration
parameter of the top level Dirichlet process in the BNPPhase model. This is
the parameter alpha_0
from [1]. The default value for mass
is
1e-3:1e2:1e3
(i.e., initial value 10
, minimum value 1/1000
and
maximum value 1000
).
-varmass
Specify the variance of the prior on the concentration parameter
of the top level Dirichlet process. The default value for varmass
is 1
.
-mu (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)
Specify the concentration
parameter of the bottom level Dirichlet processes in the BNPPhase model. This
is the parameter alpha
from [1]. The default value for mu
is
1e-3:1:1e3
.
-varmu
Specify the variance of the prior on the concentration parameter of
the bottom level Dirichlet process. The default value for varmu
is 1
.
-gamma (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)
Specify the pseudo-
observations for the beta/Bernoulli likelihood model. This is the parameter
gamma
from [1]. The default value for gamma
is 1e-3:1e-2:1e3
.
-r (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)
Specify the prior on the jump
rate. This is the parameter r
from [1]. The jump rate varies by location,
but the initial jump rate will be set to the real number <INIT>
at all
locations for the initialisation of the MCMC.
-b (<INIT>|<MIN>:<MAX>|<MIN>:<INIT>:<MAX>)
Specify the range of the prior
on the pseudo-observations for the minor allele frequency. This is the
parameter beta
from [1]. The default value for beta
is
1e-5:1e-3:1e-1
.
-mass_prior <STRING>
Specify the prior to use on the concentration of the
top level Dirichlet process. Possible values for <STRING>
are normal
,
lognormal
, exponential
or loguniform
. The default value is
lognormal
. Note that if loguniform
is used, then the varmass
parameter is ignored. The varmass
parameter governs the variance of the
prior before truncation to the range specified by the mass
parameter
-mu_prior <STRING>
Specify the prior to use for the concentration of the
bottom level Dirichlet process. Possible values for <STRING>
are normal
,
lognormal
, exponential
or loguniform
. As for -mass_prior
,
-varmu
is ignored if <STRING>
is loguniform
.
-fixmass
This flag specifies that the concentration parameter mass
for
the top level Dirichlet processes should be fixed, and not resampled during
MCMC. The default value for mass
is 1
. If a different value is desired,
it can be provided with the -mass <INIT>
parameter (and then the value of
mass
will be fixed to the real number <INIT>
). If -fixmass
is
provided, then the <MIN>:<MAX>
or <MIN>:<INIT>:<MAX>
forms for the flag
-mass
must not be used. The remaining flags (all beginning with -fix
)
are of a similar form. The -fix
flag can be prepended to the names of other
parameters to institute a similar effect, as follows:
-fixmu
-fixgamma
-fixr
-
L.T. Elliott, Y.W. Teh. A nonparametric HMM for genetic imputation and coalescent inference. 2016. Electronic Journal of Statistics.
-
M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and R. S. Lander. High-resolution haplotype structure in the human genome. 2001. Nature Genetics.
If you would like to contact us for additional help, or to send us comments or bug reports, please visit the https://github.com/BigBayes/BNPPhase website and send us your message through github.