Skip to content

Commit

Permalink
Merge changes for release 1.13.0
Browse files Browse the repository at this point in the history
  • Loading branch information
iainrb committed Jun 23, 2016
1 parent 830fc8e commit ae2b288
Show file tree
Hide file tree
Showing 24 changed files with 672 additions and 340 deletions.
12 changes: 11 additions & 1 deletion README
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,12 @@ genotyping microarrays, supplemented with Sequenom genotyping.

Change log is located in: doc/CHANGELOG.md

To install, set the $INSTALL_ROOT environment variable to the path of the
desired installation directory, and run src/bash/install.sh. This installs
the Perl and Ruby components of the pipeline. It recursively installs
non-core Perl dependencies from CPAN, and requires approximately 120 MB of
disk space.

The gender check used as a QC metric is also available as a standalone
application; see below.

Expand All @@ -22,9 +28,13 @@ Pipeline:
Bcftools https://github.com/samtools/bcftools


Install script:
cpanm >= 1.7042 http://search.cpan.org/~miyagawa/Menlo-1.9003/script/cpanm-menlo


QC scripts:

Simtools >= 1.5 https://github.com/wtsi-npg/simtools
Simtools >= 2.2 https://github.com/wtsi-npg/simtools
R >= 2.11.1
R mixtools library http://cran.r-project.org/web/packages/mixtools/index.html

Expand Down
26 changes: 26 additions & 0 deletions doc/CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,32 @@ Added:
and plots


Release 1.13.0: 2016-06-20
--------------------------

Added:
- install.sh script to install pipeline and its Perl dependencies
- Documentation for Bayesian identity check

Changed:
- Modified ready_workflow.pl to better align with user SOP
- Use try/catch to handle unexpected errors in retrieving QC plex results
from iRODS
- Updated reference genome for Sequenom iRODS query
- Update perl-irods-wrap dependency to 2.4.0; removes unhelpful warning
messages to STDERR. This in turn requires baton version >= 0.16.4.

Removed:
- Script publish_infinium_file_list.pl; superseded by other publish scripts


Release 1.12.1: 2016-05-13
--------------------------

Fixed:
- Support repeat scans from Infinium database


Release 1.12.0: 2016-04-07
--------------------------

Expand Down
7 changes: 7 additions & 0 deletions doc/bayesian_identity/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# patterns to ignore for git updates
\.*
*~
*.aux
*.dvi
*.ps
*.log
Binary file added doc/bayesian_identity/bayesian_identity.pdf
Binary file not shown.
188 changes: 188 additions & 0 deletions doc/bayesian_identity/bayesian_identity.tex
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
\documentclass{article}
\usepackage{graphicx}
\usepackage{mathtools}
\DeclareGraphicsExtensions{.pdf,.png,.jpg}

\begin{document}

\title{Bayesian Identity Metric for Genotyping}
\author{Iain Bancarz, [email protected] }
\date{May 2016}
\maketitle


\section{Introduction}

This document describes the theoretical background to the Bayesian identity check, which is a QC metric introduced in pipeline release 1.11.5.

An identity check compares calls on a set of QC plex SNPs, between Infinium and another genotyping platform, such as Sequenom and/or Fluidigm. We observe \emph{concordance} between Infinium and the alternate platform on the QC plex, and use it to evaluate \emph{identity}. If concordance is too low, we are not confident that the production and QC calls derive from the same sample; this may indicate a sample swap or poor data quality.

Key questions include:

\begin{itemize}
\item How do we handle no-calls? How many non-null calls are required to evaluate identity?
\item What is an appropriate concordance threshold for sample identity?
\item How do we handle multiple QC plex runs, potentially on different platforms?
\end{itemize}

Previous versions of the identity check set an \textit{ad hoc} threshold on concordance, which was used to mark samples as passing or failing QC. The Bayesian approach seeks to place the identity check on a sound probabilistic footing, in which the underlying assumptions are made explicit.

\section{Bayesian Inference}

Bayesian inference is a widely used statistical method in which we specify \emph{prior} assumptions, and update the probability of an event based on observed data, finding what is known as the \emph{posterior} probability.

This is done using \emph{Bayes' rule}:

\begin{displaymath}
\Pr(H|E) = \frac{\Pr(E|H)\Pr(H)}{\Pr(E)}
\end{displaymath}

where:

\begin{itemize}
\item $H$ is a \emph{hypothesis}; ``samples are identical'' or ``samples are not identical''
\item $E$ is \emph{evidence} which has been observed; in this case, the production and QC calls
\item $\Pr(H|E)$ is probability of hypothesis given the evidence; this is the \emph{posterior probability} which interests us.
\item $\Pr(H)$ is probability of the hypothesis when evidence is not known; this is the \emph{prior probability} of the hypothesis.
\item $\Pr(E|H)$ is known as the \emph{likelihood}. This is the probability of evidence given the hypothesis; if the samples are identical, what is the probability of observing these calls?
\item $\Pr(E)$ is prior probability of the evidence, independent of the hypothesis. Serves as a normalising constant.
\end{itemize}


\section{Framework for sample identity}

We define the terms as follows:


\subsection*{Hypotheses}

\begin{itemize}
\item $H_1$: Samples are identical
\item $H_2$: Samples are not identical
\end{itemize}


\subsection*{Evidence}

\begin{itemize}
\item Production calls: $p_1, p_2, \dots , p_m$ where $p_i$ is the production call on the $i$th of $m$ SNPs in the QC plex, on the Infinium platform.
\item QC calls: $q_{11}, q_{12}, \dots , q_{mn},$ where $q_{ij}$ is the call on the $i$th SNP in the $j$th of $n$ QC runs, using Fluidigm, Sequenom, or any other platform.
\end{itemize}

\subsection*{Likelihood}

\subsubsection*{Defining events}

Suppose the evidence $E$ is made up of multiple, independent events $E_i$. In that case:

\begin{displaymath}
\Pr(E|H) = \prod_{i} \Pr(E_{i}|H)
\end{displaymath}

Let $E_i$ consist of a production call and $n$ QC calls on the $i$th SNP:

\begin{displaymath}
E_i = \{ p_i, q_{i1}, q_{i2}, \dots , q_{in} \}
\end{displaymath}

where as before, $p_i$ is the production call on Infinium, and $q_{ij}$ are the QC calls on Fluidigm, Sequenom, or both. Each QC call $q_{ij}$ is either equivalent to the production call $p_i$, or not equivalent. Our model will use the probability of equivalent calls.

\subsubsection*{Binomial distribution}

Let us consider whether each of the QC calls $q_{ij}$ is equivalent to the production call $p_i$. We have $n_i$ calls on the $i$th SNP, excluding no-calls so that $n_i \le n$. Suppose that out of these $n_i$ calls, we have $k_i$ matches. Let $u_{i}$ be the probability that production and QC calls are identical on the $i$th SNP, given that the samples are equivalent; in other words, given $H_{1}$. Let $v_{i}$ be the probability of identical calls, given that the samples are not equivalent, ie.\ $H_{2}$ holds.

So, we are interested in the likelihood of observing $k_i$ equivalent calls out of $n_i$ trials, with probability of equivalence $u_{i}$ or $v_{i}$. This is simply the binomial distribution. Omitting the subscripts for $i$ to simplify notation, the binomial distribution is:

\begin{displaymath}
\Pr(k) = \binom{n}{k} u^{k} (1 - u)^{(n - k)}
\end{displaymath}

and similarly for $v$, where:

\begin{displaymath}
\binom{n}{k} = \frac{m!}{k!(n-k)!}
\end{displaymath}

\subsubsection*{Probability of equivalence}

Now, for our two hypotheses $H_1$ and $H_2$, we have probabilities $u$ and $v$ as follows:

\begin{itemize}
\item $u_{i}$ is the probability of equivalent calls, given that the samples are identical. If the samples are identical, any mismatches must be the result of a genotyping error. We then have $u_{i} = 1 - r$, where $r$ is the expected error rate.
\item $v_{i}$ is the probability of equivalent calls, given that the samples are \emph{not} identical. So we have $v_{i} = \hat{v}_{i} + \theta$, where $\hat{v}_{i}$ is the ``true'' probability of equivalent calls on non-identical samples for the $i$th SNP, and $\theta$ is a correction factor to account for experimental error.
\end{itemize}

In general, $v_i$ will depend on the expected degree of relatedness between samples, as well as the heterozygosity and minor allele frequency (MAF) of the $i$th SNP.

\subsection*{Normalising constant}

We can now find the normalising constant $\Pr(E)$ as follows:

\begin{displaymath}
\Pr(E) = \Pr(E|H_{1})\Pr(H_{1}) + \Pr(E|H_{2})\Pr(H_{2})
\end{displaymath}

\subsection*{Computing the identity metric}

Our desired metric is the probability that the production and QC samples are identical. This happens exactly when hypothesis $H_1$ holds. We can compute the probability of $H_1$ given the evidence $E$ as follows:

\begin{displaymath}
\Pr(H_1|E) = \frac{\Pr(E|H_1)\Pr(H_1)}{\Pr(E)}
= \frac{\Pr(E|H_1)\Pr(H_1)}{\Pr(E|H_{1})\Pr(H_{1}) + \Pr(E|H_{2})\Pr(H_{2})}
\end{displaymath}

where:

\begin{displaymath}
\Pr(E|H_1) = \prod_{i} \Pr(E_{i}|H_1)
= \prod_{i} b(k_i; n_i,u_i)
\end{displaymath}

where $b(k_i; n_i, u_i)$ is the binomial distribution for $k_i$ successes in $n_i$ trials with probability $u_i$, and similarly for $H_2$ and $v_i$.

\section{Required parameters}

\subsection*{Model inputs}

In order to compute the identity metric as above, the following parameters must be supplied or estimated from data:

\begin{itemize}
\item Prior probability of sample mismatch, $\Pr(H_2)$. We also have $\Pr(H_1) = 1 - \Pr(H_2)$.
\item Error rate of genotype calls, $r$. Probability of equivalent calls on identical samples, $u = 1 - r$.
\item Probability of identical calls on non-identical samples for the $i$th SNP, $v_i$.
\end{itemize}

The attached graphs simulate the effects of varying the above parameters while keeping the others at constant default values. The defaults used were:

\begin{itemize}
\item $\Pr(H_2) = 0.01$
\item $r = 0.01$
\item $v_i = 0.40625$ for all SNPs
\end{itemize}

The default $v_i$ assumes heterozygosity 0.5 and minor allele frequency 0.25. Unless otherwise stated, the number of SNPs in the QC plex was 24.

\subsection*{Pass/fail threshold}

To apply the metric in practice, we need a threshold on the posterior probability $\Pr(H_1|E)$. Samples pass if they are above the threshold, and fail otherwise.

Here are two plausible ways of choosing a threshold:

\begin{itemize}
\item \textbf{Optimistic:} Require convincing evidence that a sample swap \emph{has} occurred. Set threshold to a ``significant'' probability of non-identical samples, for example 0.95.
\item \textbf{Pessimistic:} Require convincing evidence that a swap \emph{has not} occurred. Specifically, insist that posterior probability of identity is higher than prior probability; that is, the QC plex calls have made us \emph{more certain} that the samples are not identical. Set threshold to $1 - \Pr(H_2)$, where as above $H_2$ is the sample mismatch hypothesis. With default parameters, this gives us a threshold of 0.99.
\end{itemize}

In the pipeline implementation of the Bayesian identity check, the second, ``pessimistic'' scenario has been adopted.

\section{Graphs for simulated data}

\includegraphics[scale=0.5]{identity_by_plex_size}

\includegraphics[scale=0.5]{identity_by_smp}

\includegraphics[scale=0.5]{identity_by_xer}

\includegraphics[scale=0.5]{identity_by_eq_prob.png}

\end{document}
Binary file added doc/bayesian_identity/identity_by_eq_prob.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/bayesian_identity/identity_by_plex_size.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/bayesian_identity/identity_by_smp.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/bayesian_identity/identity_by_xer.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
131 changes: 131 additions & 0 deletions src/bash/install.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
#!/bin/bash

# Author: Iain Bancarz <[email protected]>, June 2016

# script to install Perl dependencies of the genotyping pipeline using cpanm
# also installs Perl and Ruby components of the pipeline
# does *not* install baton and other non-Perl dependencies; these are specified in the genotyping modulefile

# similar in purpose to npg_irods/scripts/travis_install.sh

if [ -z $INSTALL_ROOT ]; then
echo "INSTALL_ROOT environment variable must specify the path to a directory for installation; exiting." >&2
exit 1
elif [ ! -e $INSTALL_ROOT ]; then
echo "INSTALL_ROOT environment variable path $INSTALL_ROOT does not exist; exiting." >&2
exit 1
elif [ ! -d $INSTALL_ROOT ]; then
echo "INSTALL_ROOT environment variable path $INSTALL_ROOT is not a directory; exiting." >&2
exit 1
else
echo "Installing pipeline to root directory $INSTALL_ROOT"
fi

WTSI_DNAP_UTILITIES_VERSION="0.5.1"
ML_WAREHOUSE_VERSION="2.1"
ALIEN_TIDYP_VERSION="1.4.7"
NPG_TRACKING_VERSION="85.3"
NPG_IRODS_VERSION="2.4.0"
RUBY_VERSION="1.8.7-p330"
LIB_RUBY_VERSION="0.3.0"

TEMP_NAME=`mktemp -d genotyping_temp.XXXXXXXX`
TEMP=`readlink -f $TEMP_NAME` # full path to temp directory
export PATH=$TEMP:$PATH # ensure cpanm download is on PATH
START_DIR=$PWD
cd $TEMP

# use wget to download tarballs to install
URLS=(https://github.com/wtsi-npg/perl-dnap-utilities/releases/download/$WTSI_DNAP_UTILITIES_VERSION/WTSI-DNAP-Utilities-$WTSI_DNAP_UTILITIES_VERSION.tar.gz \
https://github.com/wtsi-npg/ml_warehouse/releases/download/$ML_WAREHOUSE_VERSION/ml_warehouse-$ML_WAREHOUSE_VERSION.tar.gz \
http://search.cpan.org/CPAN/authors/id/K/KM/KMX/Alien-Tidyp-v$ALIEN_TIDYP_VERSION.tar.gz \
https://github.com/wtsi-npg/npg_tracking/releases/download/$NPG_TRACKING_VERSION/npg-tracking-$NPG_TRACKING_VERSION.tar.gz \
https://github.com/wtsi-npg/perl-irods-wrap/releases/download/$NPG_IRODS_VERSION/WTSI-NPG-iRODS-$NPG_IRODS_VERSION.tar.gz )

for URL in ${URLS[@]}; do
wget $URL
if [ $? -ne 0 ]; then
echo "Failed to download $URL; non-zero exit status from wget; install halted" >&2
exit 1
fi
done

# clone genotyping to access src/perl
git clone https://github.com/wtsi-npg/genotyping.git
if [ $? -ne 0 ]; then
echo "Failed to clone genotyping repository; install halted" >&2
exit 1
fi

eval $(perl -Mlocal::lib=$INSTALL_ROOT) # set environment variables

# use local installation of cpanm
module load cpanm/1.7042

# install prerequisites from tarfiles
TARFILES=(WTSI-DNAP-Utilities-$WTSI_DNAP_UTILITIES_VERSION.tar.gz \
ml_warehouse-$ML_WAREHOUSE_VERSION.tar.gz \
Alien-Tidyp-v$ALIEN_TIDYP_VERSION.tar.gz \
npg-tracking-$NPG_TRACKING_VERSION.tar.gz \
WTSI-NPG-iRODS-$NPG_IRODS_VERSION.tar.gz)

for FILE in ${TARFILES[@]}; do
cpanm --installdeps $FILE --self-contained --notest
if [ $? -ne 0 ]; then
echo "cpanm --installdeps failed for $FILE; install halted" >&2
exit 1
fi
cpanm --install $FILE --notest
if [ $? -ne 0 ]; then
echo "cpanm --install failed for $FILE; install halted" >&2
exit 1
fi
done

# install pipeline Perl
# first ensure we are on 'master' branch and find pipeline version
cd genotyping/src/perl
git checkout master
cpanm --installdeps . --self-contained --notest
if [ $? -ne 0 ]; then
echo "cpanm --installdeps failed for genotyping; install halted" >&2
exit 1
fi
cpanm --install . --notest
if [ $? -ne 0 ]; then
echo "cpanm --install failed for genotyping; install halted" >&2
exit 1
fi

# now set Ruby environment variables and install
export RUBY_HOME=/software/gapi/pkg/ruby/$RUBY_VERSION
export PATH=$RUBY_HOME/bin:$PATH
export MANPATH=$RUBY_HOME/share/man:$MANPATH
export GEM_HOME=$INSTALL_ROOT
export GEM_PATH=/software/gapi/pkg/lib-ruby/$LIB_RUBY_VERSION
export GEM_PATH=$INSTALL_ROOT:$GEM_PATH
export PATH=/software/gapi/pkg/lib-ruby/$LIB_RUBY_VERSION/bin:$PATH

cd ../ruby/genotyping-workflows/
rake gem
if [ $? -ne 0 ]; then
echo "'rake gem' failed for genotyping; install halted" >&2
exit 1
fi
GENOTYPING_VERSION=`git describe --dirty --always`
GENOTYPING_GEM="pkg/genotyping-workflows-$GENOTYPING_VERSION.gem"
if [ ! -f $GENOTYPING_GEM ]; then
echo "Expected Ruby gem $GENOTYPING_GEM not found; install halted" >&2
exit 1
fi
gem install $GENOTYPING_GEM
if [ $? -ne 0 ]; then
echo "'gem install' failed for genotyping; install halted" >&2
exit 1
fi

# clean up temporary directory
cd $START_DIR
rm -Rf $TEMP

exit 0
2 changes: 1 addition & 1 deletion src/perl/Build.PL
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ my $build = Build->new
'Try::Tiny' => '>= 0.22',
'URI' => '>= 1.67',
'WTSI::DNAP::Warehouse::Schema' => '>= 1.1',
'WTSI::NPG::iRODS' => '>= 0.15.0'
'WTSI::NPG::iRODS' => '>= 2.1.0'
},
recommends => {
'UUID' => '>= 0.24',
Expand Down
Loading

0 comments on commit ae2b288

Please sign in to comment.