Merge changes for release 1.13.0

wtsi-npg · Jun 23, 2016 · ae2b288 · ae2b288
1 parent 830fc8e
commit ae2b288
Show file tree

Hide file tree

Showing 24 changed files with 672 additions and 340 deletions.
diff --git a/README b/README
@@ -7,6 +7,12 @@ genotyping microarrays, supplemented with Sequenom genotyping.
 
 Change log is located in: doc/CHANGELOG.md
 
+To install, set the $INSTALL_ROOT environment variable to the path of the
+desired installation directory, and run src/bash/install.sh. This installs
+the Perl and Ruby components of the pipeline. It recursively installs
+non-core Perl dependencies from CPAN, and requires approximately 120 MB of
+disk space.
+
 The gender check used as a QC metric is also available as a standalone
 application; see below.
 
@@ -22,9 +28,13 @@ Pipeline:
  Bcftools               https://github.com/samtools/bcftools
 
 
+Install script:
+ cpanm >= 1.7042        http://search.cpan.org/~miyagawa/Menlo-1.9003/script/cpanm-menlo
+
+
 QC scripts:
 
- Simtools >= 1.5	https://github.com/wtsi-npg/simtools
+ Simtools >= 2.2	https://github.com/wtsi-npg/simtools
  R >= 2.11.1
  R mixtools library	http://cran.r-project.org/web/packages/mixtools/index.html
 

diff --git a/doc/CHANGELOG.md b/doc/CHANGELOG.md
@@ -12,6 +12,32 @@ Added:
 and plots
 
 
+Release 1.13.0: 2016-06-20
+--------------------------
+
+Added:
+- install.sh script to install pipeline and its Perl dependencies
+- Documentation for Bayesian identity check
+
+Changed:
+- Modified ready_workflow.pl to better align with user SOP
+- Use try/catch to handle unexpected errors in retrieving QC plex results
+from iRODS
+- Updated reference genome for Sequenom iRODS query
+- Update perl-irods-wrap dependency to 2.4.0; removes unhelpful warning
+messages to STDERR. This in turn requires baton version >= 0.16.4.
+
+Removed:
+- Script publish_infinium_file_list.pl; superseded by other publish scripts
+
+
+Release 1.12.1: 2016-05-13
+--------------------------
+
+Fixed:
+- Support repeat scans from Infinium database
+
+
 Release 1.12.0: 2016-04-07
 --------------------------
 

diff --git a/doc/bayesian_identity/.gitignore b/doc/bayesian_identity/.gitignore
@@ -0,0 +1,7 @@
+# patterns to ignore for git updates
+\.*
+*~
+*.aux
+*.dvi
+*.ps
+*.log
diff --git a/doc/bayesian_identity/bayesian_identity.pdf b/doc/bayesian_identity/bayesian_identity.pdf
diff --git a/doc/bayesian_identity/bayesian_identity.tex b/doc/bayesian_identity/bayesian_identity.tex
@@ -0,0 +1,188 @@
+\documentclass{article}
+\usepackage{graphicx}
+\usepackage{mathtools}
+\DeclareGraphicsExtensions{.pdf,.png,.jpg}
+
+\begin{document}
+
+\title{Bayesian Identity Metric for Genotyping}
+\author{Iain Bancarz, [email protected] }
+\date{May 2016}
+\maketitle
+
+
+\section{Introduction}
+
+This document describes the theoretical background to the Bayesian identity check, which is a QC metric introduced in pipeline release 1.11.5.
+
+An identity check compares calls on a set of QC plex SNPs, between Infinium and another genotyping platform, such as Sequenom and/or Fluidigm. We observe \emph{concordance} between Infinium and the alternate platform on the QC plex, and use it to evaluate \emph{identity}. If concordance is too low, we are not confident that the production and QC calls derive from the same sample; this may indicate a sample swap or poor data quality.
+
+Key questions include:
+
+\begin{itemize}
+\item How do we handle no-calls? How many non-null calls are required to evaluate identity?
+\item What is an appropriate concordance threshold for sample identity?
+\item How do we handle multiple QC plex runs, potentially on different platforms?
+\end{itemize}
+
+Previous versions of the identity check set an \textit{ad hoc} threshold on concordance, which was used to mark samples as passing or failing QC. The Bayesian approach seeks to place the identity check on a sound probabilistic footing, in which the underlying assumptions are made explicit.
+
+\section{Bayesian Inference}
+
+Bayesian inference is a widely used statistical method in which we specify \emph{prior} assumptions, and update the probability of an event based on observed data, finding what is known as the \emph{posterior} probability.
+
+This is done using \emph{Bayes' rule}:
+
+\begin{displaymath}
+\Pr(H|E) = \frac{\Pr(E|H)\Pr(H)}{\Pr(E)}
+\end{displaymath}
+
+where:
+
+\begin{itemize}
+\item $H$ is a \emph{hypothesis}; ``samples are identical'' or ``samples are not identical''
+\item $E$ is \emph{evidence} which has been observed; in this case, the production and QC calls
+\item $\Pr(H|E)$ is probability of hypothesis given the evidence; this is the \emph{posterior probability} which interests us.
+\item $\Pr(H)$ is probability of the hypothesis when evidence is not known; this is the \emph{prior probability} of the hypothesis.
+\item $\Pr(E|H)$ is known as the \emph{likelihood}. This is the probability of evidence given the hypothesis; if the samples are identical, what is the probability of observing these calls?
+\item $\Pr(E)$ is prior probability of the evidence, independent of the hypothesis. Serves as a normalising constant.
+\end{itemize}
+
+
+\section{Framework for sample identity}
+
+We define the terms as follows:
+
+
+\subsection*{Hypotheses}
+
+\begin{itemize}
+\item $H_1$: Samples are identical
+\item $H_2$: Samples are not identical
+\end{itemize}
+
+
+\subsection*{Evidence}
+
+\begin{itemize}
+\item Production calls: $p_1, p_2, \dots , p_m$ where $p_i$ is the production call on the $i$th of $m$ SNPs in the QC plex, on the Infinium platform.
+\item QC calls: $q_{11}, q_{12}, \dots , q_{mn},$ where $q_{ij}$ is the call on the $i$th SNP in the $j$th of $n$ QC runs, using Fluidigm, Sequenom, or any other platform.
+\end{itemize}
+
+\subsection*{Likelihood}
+
+\subsubsection*{Defining events}
+
+Suppose the evidence $E$ is made up of multiple, independent events $E_i$. In that case:
+
+\begin{displaymath}
+\Pr(E|H) = \prod_{i} \Pr(E_{i}|H)
+\end{displaymath}
+
+Let $E_i$ consist of a production call and $n$ QC calls on the $i$th SNP: 
+
+\begin{displaymath}
+E_i = \{ p_i, q_{i1}, q_{i2}, \dots , q_{in} \}
+\end{displaymath}
+
+where as before, $p_i$ is the production call on Infinium, and $q_{ij}$ are the QC calls on Fluidigm, Sequenom, or both. Each QC call $q_{ij}$ is either equivalent to the production call $p_i$, or not equivalent. Our model will use the probability of equivalent calls.
+
+\subsubsection*{Binomial distribution}
+
+Let us consider whether each of the QC calls $q_{ij}$ is equivalent to the production call $p_i$. We have $n_i$ calls on the $i$th SNP, excluding no-calls so that $n_i \le n$. Suppose that out of these $n_i$ calls, we have $k_i$ matches. Let $u_{i}$ be the probability that production and QC calls are identical on the $i$th SNP, given that the samples are equivalent; in other words, given $H_{1}$. Let $v_{i}$ be the probability of identical calls, given that the samples are not equivalent, ie.\ $H_{2}$ holds.
+
+So, we are interested in the likelihood of observing $k_i$ equivalent calls out of $n_i$ trials, with probability of equivalence $u_{i}$ or $v_{i}$. This is simply the binomial distribution. Omitting the subscripts for $i$ to simplify notation, the binomial distribution is:
+
+\begin{displaymath}
+\Pr(k) = \binom{n}{k} u^{k} (1 - u)^{(n - k)}
+\end{displaymath}
+
+and similarly for $v$, where:
+
+\begin{displaymath}
+\binom{n}{k} = \frac{m!}{k!(n-k)!}
+\end{displaymath}
+
+\subsubsection*{Probability of equivalence}
+
+Now, for our two hypotheses $H_1$ and $H_2$, we have probabilities $u$ and $v$ as follows:
+
+\begin{itemize}
+\item $u_{i}$ is the probability of equivalent calls, given that the samples are identical. If the samples are identical, any mismatches must be the result of a genotyping error. We then have $u_{i} = 1 - r$, where $r$ is the expected error rate.
+\item $v_{i}$ is the probability of equivalent calls, given that the samples are \emph{not} identical. So we have $v_{i} = \hat{v}_{i} + \theta$, where $\hat{v}_{i}$ is the ``true'' probability of equivalent calls on non-identical samples for the $i$th SNP, and $\theta$ is a correction factor to account for experimental error.
+\end{itemize}
+
+In general, $v_i$ will depend on the expected degree of relatedness between samples, as well as the heterozygosity and minor allele frequency (MAF) of the $i$th SNP.
+
+\subsection*{Normalising constant}
+
+We can now find the normalising constant $\Pr(E)$ as follows:
+
+\begin{displaymath}
+\Pr(E) = \Pr(E|H_{1})\Pr(H_{1}) + \Pr(E|H_{2})\Pr(H_{2})
+\end{displaymath}
+
+\subsection*{Computing the identity metric}
+
+Our desired metric is the probability that the production and QC samples are identical. This happens exactly when hypothesis $H_1$ holds. We can compute the probability of $H_1$ given the evidence $E$ as follows:
+
+\begin{displaymath}
+\Pr(H_1|E) = \frac{\Pr(E|H_1)\Pr(H_1)}{\Pr(E)}
+= \frac{\Pr(E|H_1)\Pr(H_1)}{\Pr(E|H_{1})\Pr(H_{1}) + \Pr(E|H_{2})\Pr(H_{2})}
+\end{displaymath}
+
+where:
+
+\begin{displaymath}
+\Pr(E|H_1) = \prod_{i} \Pr(E_{i}|H_1)
+= \prod_{i} b(k_i; n_i,u_i)
+\end{displaymath}
+
+where $b(k_i; n_i, u_i)$ is the binomial distribution for $k_i$ successes in $n_i$ trials with probability $u_i$, and similarly for $H_2$ and $v_i$.
+
+\section{Required parameters}
+
+\subsection*{Model inputs}
+
+In order to compute the identity metric as above, the following parameters must be supplied or estimated from data:
+
+\begin{itemize}
+\item Prior probability of sample mismatch, $\Pr(H_2)$. We also have $\Pr(H_1) = 1 - \Pr(H_2)$.
+\item Error rate of genotype calls, $r$. Probability of equivalent calls on identical samples, $u = 1 - r$.
+\item Probability of identical calls on non-identical samples for the $i$th SNP, $v_i$.
+\end{itemize}
+
+The attached graphs simulate the effects of varying the above parameters while keeping the others at constant default values. The defaults used were:
+
+\begin{itemize}
+\item $\Pr(H_2) = 0.01$
+\item $r = 0.01$
+\item $v_i = 0.40625$ for all SNPs
+\end{itemize}
+
+The default $v_i$ assumes heterozygosity 0.5 and minor allele frequency 0.25. Unless otherwise stated, the number of SNPs in the QC plex was 24.
+
+\subsection*{Pass/fail threshold}
+
+To apply the metric in practice, we need a threshold on the posterior probability $\Pr(H_1|E)$. Samples pass if they are above the threshold, and fail otherwise.
+
+Here are two plausible ways of choosing a threshold:
+
+\begin{itemize}
+\item \textbf{Optimistic:} Require convincing evidence that a sample swap \emph{has} occurred. Set threshold to a ``significant'' probability of non-identical samples, for example 0.95.
+\item \textbf{Pessimistic:} Require convincing evidence that a swap \emph{has not} occurred. Specifically, insist that posterior probability of identity is higher than prior probability; that is, the QC plex calls have made us \emph{more certain} that the samples are not identical. Set threshold to $1 - \Pr(H_2)$, where as above $H_2$ is the sample mismatch hypothesis. With default parameters, this gives us a threshold of 0.99.
+\end{itemize}
+
+In the pipeline implementation of the Bayesian identity check, the second, ``pessimistic'' scenario has been adopted.
+
+\section{Graphs for simulated data}
+
+\includegraphics[scale=0.5]{identity_by_plex_size}
+
+\includegraphics[scale=0.5]{identity_by_smp}
+
+\includegraphics[scale=0.5]{identity_by_xer}
+
+\includegraphics[scale=0.5]{identity_by_eq_prob.png}
+
+\end{document}
diff --git a/doc/bayesian_identity/identity_by_eq_prob.png b/doc/bayesian_identity/identity_by_eq_prob.png
diff --git a/doc/bayesian_identity/identity_by_plex_size.png b/doc/bayesian_identity/identity_by_plex_size.png
diff --git a/doc/bayesian_identity/identity_by_smp.png b/doc/bayesian_identity/identity_by_smp.png
diff --git a/doc/bayesian_identity/identity_by_xer.png b/doc/bayesian_identity/identity_by_xer.png
diff --git a/src/bash/install.sh b/src/bash/install.sh
@@ -0,0 +1,131 @@
+#!/bin/bash
+
+# Author: Iain Bancarz <[email protected]>, June 2016
+
+# script to install Perl dependencies of the genotyping pipeline using cpanm
+# also installs Perl and Ruby components of the pipeline
+# does *not* install baton and other non-Perl dependencies; these are specified in the genotyping modulefile
+
+# similar in purpose to npg_irods/scripts/travis_install.sh
+
+if [ -z $INSTALL_ROOT ]; then
+    echo "INSTALL_ROOT environment variable must specify the path to a directory for installation; exiting." >&2
+    exit 1
+elif [ ! -e $INSTALL_ROOT ]; then
+    echo "INSTALL_ROOT environment variable path $INSTALL_ROOT does not exist; exiting." >&2
+    exit 1
+elif [ ! -d $INSTALL_ROOT ]; then
+    echo "INSTALL_ROOT environment variable path $INSTALL_ROOT is not a directory; exiting." >&2
+    exit 1
+else
+    echo "Installing pipeline to root directory $INSTALL_ROOT"
+fi
+
+WTSI_DNAP_UTILITIES_VERSION="0.5.1"
+ML_WAREHOUSE_VERSION="2.1"
+ALIEN_TIDYP_VERSION="1.4.7"
+NPG_TRACKING_VERSION="85.3"
+NPG_IRODS_VERSION="2.4.0"
+RUBY_VERSION="1.8.7-p330"
+LIB_RUBY_VERSION="0.3.0"
+
+TEMP_NAME=`mktemp -d genotyping_temp.XXXXXXXX`
+TEMP=`readlink -f $TEMP_NAME` # full path to temp directory
+export PATH=$TEMP:$PATH # ensure cpanm download is on PATH
+START_DIR=$PWD
+cd $TEMP
+
+# use wget to download tarballs to install
+URLS=(https://github.com/wtsi-npg/perl-dnap-utilities/releases/download/$WTSI_DNAP_UTILITIES_VERSION/WTSI-DNAP-Utilities-$WTSI_DNAP_UTILITIES_VERSION.tar.gz \
+https://github.com/wtsi-npg/ml_warehouse/releases/download/$ML_WAREHOUSE_VERSION/ml_warehouse-$ML_WAREHOUSE_VERSION.tar.gz \
+http://search.cpan.org/CPAN/authors/id/K/KM/KMX/Alien-Tidyp-v$ALIEN_TIDYP_VERSION.tar.gz \
+https://github.com/wtsi-npg/npg_tracking/releases/download/$NPG_TRACKING_VERSION/npg-tracking-$NPG_TRACKING_VERSION.tar.gz \
+https://github.com/wtsi-npg/perl-irods-wrap/releases/download/$NPG_IRODS_VERSION/WTSI-NPG-iRODS-$NPG_IRODS_VERSION.tar.gz )
+
+for URL in ${URLS[@]}; do
+    wget $URL
+    if [ $? -ne 0 ]; then
+        echo "Failed to download $URL; non-zero exit status from wget; install halted" >&2
+        exit 1
+    fi
+done
+
+# clone genotyping to access src/perl 
+git clone https://github.com/wtsi-npg/genotyping.git
+if [ $? -ne 0 ]; then
+    echo "Failed to clone genotyping repository; install halted" >&2
+    exit 1
+fi
+
+eval $(perl -Mlocal::lib=$INSTALL_ROOT) # set environment variables
+
+# use local installation of cpanm
+module load cpanm/1.7042
+
+# install prerequisites from tarfiles
+TARFILES=(WTSI-DNAP-Utilities-$WTSI_DNAP_UTILITIES_VERSION.tar.gz \
+ml_warehouse-$ML_WAREHOUSE_VERSION.tar.gz \
+Alien-Tidyp-v$ALIEN_TIDYP_VERSION.tar.gz \
+npg-tracking-$NPG_TRACKING_VERSION.tar.gz \
+WTSI-NPG-iRODS-$NPG_IRODS_VERSION.tar.gz)
+
+for FILE in ${TARFILES[@]}; do
+    cpanm --installdeps $FILE --self-contained  --notest
+    if [ $? -ne 0 ]; then
+        echo "cpanm --installdeps failed for $FILE; install halted" >&2
+        exit 1
+    fi
+    cpanm --install $FILE --notest
+    if [ $? -ne 0 ]; then
+        echo "cpanm --install failed for $FILE; install halted" >&2
+        exit 1
+    fi
+done
+
+# install pipeline Perl
+# first ensure we are on 'master' branch and find pipeline version
+cd genotyping/src/perl
+git checkout master
+cpanm --installdeps . --self-contained --notest 
+if [ $? -ne 0 ]; then
+    echo "cpanm --installdeps failed for genotyping; install halted" >&2
+    exit 1
+fi
+cpanm --install . --notest
+if [ $? -ne 0 ]; then
+    echo "cpanm --install failed for genotyping; install halted" >&2
+    exit 1
+fi
+
+# now set Ruby environment variables and install
+export RUBY_HOME=/software/gapi/pkg/ruby/$RUBY_VERSION
+export PATH=$RUBY_HOME/bin:$PATH
+export MANPATH=$RUBY_HOME/share/man:$MANPATH
+export GEM_HOME=$INSTALL_ROOT
+export GEM_PATH=/software/gapi/pkg/lib-ruby/$LIB_RUBY_VERSION
+export GEM_PATH=$INSTALL_ROOT:$GEM_PATH
+export PATH=/software/gapi/pkg/lib-ruby/$LIB_RUBY_VERSION/bin:$PATH
+
+cd ../ruby/genotyping-workflows/
+rake gem
+if [ $? -ne 0 ]; then
+    echo "'rake gem' failed for genotyping; install halted" >&2
+    exit 1
+fi
+GENOTYPING_VERSION=`git describe --dirty --always`
+GENOTYPING_GEM="pkg/genotyping-workflows-$GENOTYPING_VERSION.gem"
+if [ ! -f $GENOTYPING_GEM ]; then
+    echo "Expected Ruby gem $GENOTYPING_GEM not found; install halted" >&2
+    exit 1
+fi
+gem install $GENOTYPING_GEM
+if [ $? -ne 0 ]; then
+    echo "'gem install' failed for genotyping; install halted" >&2
+    exit 1
+fi
+
+# clean up temporary directory
+cd $START_DIR
+rm -Rf $TEMP
+
+exit 0
diff --git a/src/perl/Build.PL b/src/perl/Build.PL
@@ -45,7 +45,7 @@ my $build = Build->new
                          'Try::Tiny'                           => '>= 0.22',
                          'URI'                                 => '>= 1.67',
                          'WTSI::DNAP::Warehouse::Schema'       => '>= 1.1',
-                         'WTSI::NPG::iRODS'                    => '>= 0.15.0'
+                         'WTSI::NPG::iRODS'                    => '>= 2.1.0'
                         },
    recommends        => {
                          'UUID' => '>= 0.24',