corrections in manual
timydaley committed Oct 29, 2015
1 parent 19b8739 commit 9580631
Showing 3 changed files with 96 additions and 52 deletions.
11 changes: 11 additions & 0 deletions docs/biblio.bib
@@ -1,3 +1,14 @@
@article{heck1975explicit,
title={Explicit Calculation of the Rarefaction Diversity Measurement and the Determination of Sufficient Sample Size},
author={Heck, Jr, Kenneth L and van Belle, Gerald and Simberloff, Daniel},
journal={Ecology},
volume={56},
number={6},
pages={1459--1461},
year={1975},
publisher={JSTOR}
}

@article{willis2015inference,
title={Inference for changes in biodiversity},
author={Willis, Amy and Bunge, John and Whitman, Thea},
Binary file modified docs/manual.pdf
Binary file not shown.
137 changes: 85 additions & 52 deletions docs/manual.tex
@@ -26,8 +26,8 @@
\titleformat*{\paragraph}{\large\bfseries}


\title{The preseq Manual}
\author{Timothy Daley \and Victoria Helus \and Andrew Smith }
\title{The \textbf{preseq} Manual}
\author{Timothy Daley \and Victoria Helus \and Chao Deng \and Andrew Smith }

\begin{document}
\maketitle
@@ -42,39 +42,45 @@ \section{Quick Start}



The \textbf{preseq} package is aimed at predicting
the yield of distinct reads from a genomic library
from an initial sequencing experiment. The estimates
The \textbf{preseq} package aims to help researchers
design and optimize sequencing experiments by using
population sampling models to infer properties of the
population or the behavior under deeper sampling based
upon a small initial sequencing experiment. The estimates
can then be used to examine the utility of further
sequencing, optimize the sequencing depth,
or to screen multiple libraries to avoid low complexity
samples.~\\[-.2cm]

\noindent The three main programs are \fn{c\_curve}, \fn{lc\_extrap},
and \fn{gc\_extrap}.
\fn{c\_curve} samples reads without replacement from the
given mapped sequenced read file or duplicate count file to estimate the yield
of the experiment and the subsampled experiments. These estimates
are used construct the complexity
curve of the experiment. \fn{lc\_extrap} uses rational function approximations
\noindent The four main programs are \fn{c\_curve},
\fn{lc\_extrap}, \fn{gc\_extrap}, and \fn{bound\_pop}.
\fn{c\_curve} interpolates the expected complexity
curve based upon a hypergeometric formula and
is primarily used to check predictions from
\fn{lc\_extrap} and \fn{gc\_extrap}.
\fn{lc\_extrap} uses rational function approximations
of Good \& Toulmin's~\cite{good1956number} non-parametric
empirical Bayes estimator to predict the yield
empirical Bayes estimator to predict the library complexity
of future experiments, in essence looking into the future
for hypothetical experiments. \fn{lc\_extrap} is used to predict
the yield and then \fn{c\_curve} can be used to check the yield
from the larger experiment.
for hypothetical experiments.
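
\noindent As a hedged sketch of the estimator being approximated (the
symbols $N$, $n_j$, $t$, and $\Delta(t)$ are notation assumed here for
illustration and are not defined elsewhere in this manual): let $n_j$ be
the number of distinct reads observed exactly $j$ times in an initial
experiment of $N$ reads. Good \& Toulmin's estimate of the expected number
of new distinct reads gained by sequencing a further $tN$ reads is
\[
  \Delta(t) = \sum_{j \geq 1} (-1)^{j+1} \, t^{j} \, n_j .
\]
The raw alternating series is known to behave poorly for $t > 1$, which is
one motivation for replacing it with rational function approximations.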

\fn{gc\_extrap} uses rational function approximations
to Good \& Toulmin's estimator to predict the genomic
coverage, i.e. the number of bases covered at least once,
\fn{gc\_extrap} uses an approach similar to that of \fn{lc\_extrap}
to predict the genome coverage,
i.e. the number of bases covered at least once,
from deeper sequencing in a single cell or low input sequencing
experiment based on the observed coverage counts.
The option is available to predict the coverage based on binned
An option is available to predict the coverage based on binned
coverage counts to speed up the estimates.
\fn{gc\_extrap} requires mapped read or bed format
input, so the tool \fn{bam2mr} is provided to convert
BAM format reads to mapped read format.

\fn{bound\_pop} uses a non-parametric moment-based
approach to conservatively estimate the total number
of classes in the sample, also called the species
richness of the population that is sampled.


\newpage

\section{Installation}
@@ -83,7 +89,8 @@ \section{Installation}
\paragraph{Download}
\label{sub:download}~\\~\\[-.2cm]
\raggedright{\textbf{preseq} is available at }
\url{http://smithlab.cmb.usc.edu/software/}.
\url{http://smithlabresearch.org/software/preseq/}
or \url{https://github.com/smithlabcode/preseq}.


\paragraph{System Requirements}
@@ -92,56 +99,66 @@ \section{Installation}
\textbf{preseq} runs on Unix-type systems
with GNU Scientific Library (GSL), available
at ~\url{http://www.gnu.org/software/gsl/}.
If the input file is in BAM format, SAMTools is
required, available at ~\url{http://samtools.sourceforge.net/}.
If the input is
a text file of counts in a single column or is
If the input file is in BAM format, the SAMTools
API is required but is included in all binaries and
source code.
If the input is a text file of counts in a single column or is
in BED format,
SAMTools is not required.
It has been tested on Linux and
Mac OS-X.

\paragraph{Installation}~\\~\\[-.2cm]
\label{sub:install}
Download the source code and decompress
it with
If the source code was downloaded from the Smithlab
website the first step is to decompress it using the
command
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
$ tar -jxvf preseq.tar.bz2
\end{alltt} \endgroup
To download the source code from GitHub, use
the command
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
$ git clone --recursive https://github.com/smithlabcode/preseq.git
\end{alltt} \endgroup
%
Enter the \textbf{preseq/} directory and run
In both cases, enter the \textbf{preseq/} directory and run
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
$ make all
\end{alltt}\endgroup
to compile all the code.

The input file may possibly be in BAM format. If the root directory
of SAMTools is \$SAMTools, instead run
If one wishes to link to a SAMTools API not
included with the source code, and the
SAMTools API is located at \$SAMTools, instead run
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
$ make all SAMTOOLS_DIR=$SAMTools
\end{alltt}\endgroup
Output after typing this command should include the flag \fn{-DHAVE\_SAMTOOLS} if the linking is successful. If compiled successfully, the executable file is available
in \textbf{preseq/}.

If a BAM file is used as input without first having run \begingroup \fontsize{9pt}{11pt}\selectfont \fn{\$ make all SAMTOOLS\_DIR=/loc/of/SAMTools}\endgroup, then the following error will occur: \begingroup \fontsize{9pt}{12pt}\selectfont \fn{terminate called after throwing an instance of 'std::string'}\endgroup.
If a BAM file is used as input without successful linking to
SAMTools, then the following error will occur:
\begingroup \fontsize{9pt}{12pt}\selectfont \fn{terminate called after throwing an instance of 'std::string'}\endgroup.

\newpage

\section{Using preseq}
\section{Using \textbf{preseq}}
\label{sec:usage}

\paragraph{Basic usage}~\\~\\[-.2cm]
\label{sub:basic}
To generate the complexity plot of a genomic
To generate the complexity curve of a genomic
library from a read file in BED or BAM format or a duplicate count file,
use the function \fn{c\_curve}. Use
\fn{-o} to specify the output name.
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
$ ./preseq c_curve -o complexity_output.txt input.bed
\end{alltt}\endgroup

To estimate the future yield
of a genomic library
using an initial experiment in BED format,
To predict the complexity curve
of a sequencing library
using an initial experiment in BED format,
use the function \fn{lc\_extrap}.
The required options are \fn{-o} to specify
the output of the yield estimates and
@@ -159,7 +176,7 @@ \section{Using preseq}
coverage is a highly variable and uncertain function
of sequencing depth. Some regions may be missing
due to locus dropout or preferentially amplified during
MDA (multiple displacement amplification).
whole genome amplification.
\fn{gc\_extrap} allows the level of genomic coverage from deep
sequencing to be predicted based on an initial sample.
The input file format needs to be mapped read (MR) or BED,
@@ -198,7 +215,9 @@ \section{File Format}
mapped fragments are counted. This means that both ends
of a discordantly mapped read will each be counted separately.
If a large number of reads are discordant, then
the default single end should be used. In this case only the mapping
the default single end should be used or the discordantly
mapped reads removed prior to running \textbf{preseq}.
In this case only the mapping
location of the first mate
is used as the unique molecular identifier~\cite{kivioja2011counting}.

@@ -223,7 +242,9 @@ \section{File Format}
\end{alltt}\endgroup
More complicated unique molecular identifiers
can be used, such as mapping position plus a random barcode,
but are too complicated to detail in this manual. For questions with such usage, please contact us at \href{mailto:[email protected]}{\nolinkurl{[email protected]}}
but are too complicated to detail in this manual.
For questions with such usage, please contact us at
\href{mailto:[email protected]}{\nolinkurl{[email protected]}}

\paragraph{Mapped read format for \fn{gc\_extrap}}~\\~\\[-.2cm]

@@ -247,9 +268,8 @@ \section{Detailed usage}
\label{sec:complexityplot}

\fn{c\_curve} is used to compute the
expected complexity curve of a mapped read file by
subsampling smaller experiments without replacement
and counting the distinct reads.
expected complexity curve of a mapped read file
with a hypergeometric formula~\cite{heck1975explicit}.
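As a hedged sketch of a formula of this kind (the symbols $N$, $n$, $n_j$,
and $S$ are notation assumed here, and \verb|\binom| assumes that
\texttt{amsmath} is loaded): if $n_j$ distinct reads were each observed
exactly $j$ times among $N$ total reads, so that $S = \sum_{j} n_j$ distinct
reads were observed, then the expected number of distinct reads in a
subsample of $n \leq N$ reads drawn without replacement is
\[
  \mathrm{E}[S_n] = S - \sum_{j \geq 1} n_j \,
    \frac{\binom{N-j}{n}}{\binom{N}{n}} ,
\]
with the convention that $\binom{N-j}{n} = 0$ whenever $N - j < n$.
Each term is the probability that a read seen $j$ times is missed entirely
by the subsample, so no random subsampling is needed to evaluate the curve.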
Output is a text file with two
columns. The first gives the total number
of reads and the second the corresponding number
@@ -265,6 +285,8 @@ \section{Detailed usage}
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
\end{description}

\newpage

\paragraph{lc\_extrap}~\\~\\[-.2cm]
\label{sec:librarycomplexity}

@@ -297,8 +319,11 @@ \section{Detailed usage}
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-H, -hist\endgroup] Input is a text file of the observed histogram
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate yield without bootstrapping for confidence intervals
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-D, -defects\endgroup] Defects mode, estimates the complexity curve without checking for instabilities in the curve. Should only be used on datasets that fail estimation when defects mode is off.
\end{description}

\newpage

\paragraph{gc\_extrap}~\\~\\[-.2cm]
\label{sec:genomiccoverage}

@@ -333,6 +358,8 @@ \section{Detailed usage}
\item[\begingroup \fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate genomic coverage without bootstrapping for confidence intervals
\end{description}

\newpage

\paragraph{bound\_pop}~\\~\\[-.2cm]
\label{sec:lib_size}

@@ -468,7 +495,10 @@ \section{lc\_extrap Examples}
10 146334
\end{alltt}\endgroup

The following command will give output of the same format as the above examples.\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} $./preseq lc_extrap -o future_yield.txt -H histogram.txt \end{alltt}\endgroup
The following command will give output of the same format as the above examples.
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
$ ./preseq lc_extrap -o future_yield.txt -H histogram.txt
\end{alltt}\endgroup

Similarly, both \fn{lc\_extrap} and \fn{c\_curve} allow the option to input read counts (text file should contain ONLY the observed counts in a single column). For example, if a dataset had the following counts histogram:

@@ -490,7 +520,10 @@ \section{lc\_extrap Examples}
1
\end{alltt}\endgroup

Command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode): \begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} $./preseq lc_extrap -o future_yield.txt -V counts.txt \end{alltt}\endgroup
The command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode):
\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
$ ./preseq lc_extrap -o future_yield.txt -V counts.txt
\end{alltt}\endgroup

\newpage

@@ -665,11 +698,11 @@ \section{bound\_pop Example}

\newpage

\section{preseq Application Examples}
\section{\textbf{preseq} Application Examples}

\subsection*{Screening multiple libraries}
\label{sec:multlib}
This section provides a more detailed example using data from different experiments to illustrate how preseq might be applied. Because it is important to avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue with sequencing. The data in this example comes from a study (accession number SRA061610) using single cell sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing} and focuses on three libraries coming from different experiments from the study (SRX205369, SRX205370, SRX205372).
This section provides a more detailed example using data from different experiments to illustrate how \textbf{preseq} might be applied. Because it is important to avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue with sequencing. The data in this example comes from a study (accession number SRA061610) using single cell sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing} and focuses on three libraries coming from different experiments from the study (SRX205369, SRX205370, SRX205372).

These libraries help show what would be considered a relatively poor library and a relatively good library, as well as compare the complexity curves obtained from running \fn{c\_curve} and \fn{lc\_extrap}, to show how \fn{lc\_extrap} would help in the decision to sequence further. The black diagonal line represents an ideal library, in which every read is a distinct read (though this cannot be achieved in reality). The full experiments were down sampled at 5\% to obtain a mock initial experiment of the libraries, as shown here, where we have the complexity curves of the initial experiments generated by \fn{c\_curve}:
~\newline
@@ -853,7 +886,7 @@ \subsection*{Estimating and analyzing TCR$\beta$ richness}

\section{FAQ}

\Que{When compiling the preseq binary, I receive the error
\Que{When compiling the \textbf{preseq} binary, I receive the error

\fn{fatal error: gsl/gsl\_cdf.h: No such file or directory
}
@@ -864,7 +897,7 @@ \section{FAQ}



\Que{When compiling the preseq binary, I receive the error
\Que{When compiling the \textbf{preseq} binary, I receive the error

\fn{Undefined symbols for architecture x86\_64: ~\\
\tab"\_packInt16", referenced from:~\\
@@ -883,14 +916,14 @@ \section{FAQ}



\Que{I compile the preseq binary but receive the error
\Que{I compile the \textbf{preseq} binary but receive the error

\fn{terminate called after throwing an instance of 'std::string'}
}

\Ans{This error typically occurs because either the flag -B was not included to
specify bam input or because the linking to SAMTools was not included when
compiling preseq. To ensure that the linking was done properly, check for the flag
compiling \textbf{preseq}. To ensure that the linking was done properly, check for the flag
\fn{-DHAVE\_SAMTOOLS}.}

\Que{When running \fn{lc\_extrap}, I receive the error
@@ -950,12 +983,12 @@ \section{FAQ}
\vspace{5mm}
If none of these solutions worked, please email us at
\href{mailto:[email protected]}{\nolinkurl{[email protected]}}
and please include the standard output from running preseq in
and please include the standard output from running \textbf{preseq} in
verbose mode (specifically the duplicate counts histogram) so
that we can look into the problem and rectify problems in future
versions. Also, feel free to email us with any other questions or
concerns.
The preseq software is still under development so we would appreciate any
The \textbf{preseq} software is still under development so we would appreciate any
advice, comments, or notification of any possible bugs. Thanks!

\newpage