corrections in manual

smithlabcode · Oct 29, 2015 · 9580631
1 parent 19b8739
commit 9580631

Showing 3 changed files with 96 additions and 52 deletions.
diff --git a/docs/biblio.bib b/docs/biblio.bib
@@ -1,3 +1,14 @@
+@article{heck1975explicit,
+  title={Explicit Calculation of the Rarefaction Diversity Measurement and the Determination of Sufficient Sample Size},
+  author={Heck, Jr, Kenneth L and van Belle, Gerald and Simberloff, Daniel},
+  journal={Ecology},
+  volume={56},
+  number={6},
+  pages={1459--1461},
+  year={1975},
+  publisher={JSTOR}
+}
+
 @article{willis2015inference,
   title={Inference for changes in biodiversity},
   author={Willis, Amy and Bunge, John and Whitman, Thea},

diff --git a/docs/manual.pdf b/docs/manual.pdf
diff --git a/docs/manual.tex b/docs/manual.tex
@@ -26,8 +26,8 @@
 \titleformat*{\paragraph}{\large\bfseries}
 
 
-\title{The preseq Manual}
-\author{Timothy Daley \and Victoria Helus \and Andrew Smith }
+\title{The \textbf{preseq} Manual}
+\author{Timothy Daley \and Victoria Helus \and Chao Deng \and Andrew Smith }
 
 \begin{document}
 \maketitle
@@ -42,39 +42,45 @@ \section{Quick Start}
 
 
 
-The \textbf{preseq} package is aimed at predicting
-the yield of distinct reads from a genomic library
-from an initial sequencing experiment.  The estimates
+The \textbf{preseq} package aims to help researchers
+design and optimize sequencing experiments by using
+population sampling models to infer properties of the
+population or the behavior under deeper sampling based 
+upon a small initial sequencing experiment.  The estimates
 can then be used to examine the utility of further
 sequencing, optimize the sequencing depth,
 or to screen multiple libraries to avoid low complexity
 samples.~\\[-.2cm]
 
-\noindent The three main programs are \fn{c\_curve}, \fn{lc\_extrap},
-and \fn{gc\_extrap}.
-\fn{c\_curve} samples reads without replacement from the 
-given mapped sequenced read file or duplicate count file to estimate the yield
-of the experiment and the subsampled experiments.  These estimates
-are used construct the complexity
-curve of the experiment.  \fn{lc\_extrap} uses rational function approximations
+\noindent The four main programs are \fn{c\_curve}, 
+\fn{lc\_extrap}, \fn{gc\_extrap}, and \fn{bound\_pop}.
+\fn{c\_curve}  interpolates the expected complexity
+curve based upon a hypergeometric formula and
+is primarily used to check predictions from 
+\fn{lc\_extrap} and \fn{gc\_extrap}.  
+\fn{lc\_extrap} uses rational function approximations
 of Good \& Toulmin's~\cite{good1956number} non-parametric
-empirical Bayes estimator to predict the yield
+empirical Bayes estimator to predict the library complexity
 of future experiments, in essence looking into the future
-for hypothetical experiments.  \fn{lc\_extrap} is used to predict 
-the yield and then \fn{c\_curve} can be used to check the yield
-from the larger experiment.
+for hypothetical experiments.  
 
-\fn{gc\_extrap} uses rational function approximations
-to Good \& Toulmin's estimator to predict the genomic
-coverage, i.e. the number of bases covered at least once,
+\fn{gc\_extrap} uses an approach similar to \fn{lc\_extrap}
+to predict the genome coverage, 
+i.e. the number of bases covered at least once,
 from deeper sequencing in a single cell or low input sequencing
 experiment based on the observed coverage counts.
-The option is available to predict the coverage based on binned
+An option is available to predict the coverage based on binned
 coverage counts to speed up the estimates.  
 \fn{gc\_extrap} requires mapped read or bed format
 input, so the tool \fn{bam2mr} is provided to convert
 bam format read to mapped read format.
 
+\fn{bound\_pop} uses a non-parametric moment-based
+approach to conservatively estimate the total number
+of classes in the sample, also called the species
+richness of the population that is sampled.
+
+
 \newpage
 
 \section{Installation}
@@ -83,7 +89,8 @@ \section{Installation}
 \paragraph{Download}
 \label{sub:download}~\\~\\[-.2cm]
 \raggedright{\textbf{preseq} is available at }
-\url{http://smithlab.cmb.usc.edu/software/}.
+\url{http://smithlabresearch.org/software/preseq/}
+or \url{https://github.com/smithlabcode/preseq}.
 
 
 \paragraph{System Requirements}
@@ -92,56 +99,66 @@ \section{Installation}
 \textbf{preseq} runs on Unix-type system
 with GNU Scientific Library (GSL), available
 at ~\url{http://www.gnu.org/software/gsl/}.  
-If the input file is in BAM format, SAMTools is
-required, available at ~\url{http://samtools.sourceforge.net/}.
-If the input is 
-a text file of counts in a single column or is 
+If the input file is in BAM format, the SAMTools
+API is required but is included in all binaries and 
+source code.
+If the input is a text file of counts in a single column or is 
 in BED format, 
 SAMTools is not required.
 It has been tested on Linux and 
 Mac OS-X.  
 
 \paragraph{Installation}~\\~\\[-.2cm]
 \label{sub:install}
-Download the source code and decompress
-it with 
+If the source code was downloaded from the Smithlab
+website, the first step is to decompress it using the
+command
 \begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
  $ tar -jxvf preseq.tar.bz2
 \end{alltt} \endgroup
+To download the source code from GitHub, use
+the command
+\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
+ $ git clone --recursive git://github.com/smithlabcode/preseq.git
+\end{alltt} \endgroup
 % 
-Enter the \textbf{preseq/} directory and run
+In both cases, enter the \textbf{preseq/} directory and run
 \begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
 $ make all
 \end{alltt}\endgroup
+to compile all the code.
 
-The input file may possibly be in BAM format. If the root directory 
-of SAMTools is \$SAMTools, instead run
+To link to a SAMTools API other than the one
+included with the source code, and if that
+SAMTools API is located at \$SAMTools, instead run
 \begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
 $ make all SAMTOOLS_DIR=$SAMTools
 \end{alltt}\endgroup
 Output after typing this command should include the flag \fn{-DHAVE\_SAMTOOLS} if the linking is successful. If compiled successfully, the executable file is available
 in \textbf{preseq/}. 
 
-If a BAM file is used as input without first having run \begingroup \fontsize{9pt}{11pt}\selectfont  \fn{\$ make all SAMTOOLS\_DIR=/loc/of/SAMTools}\endgroup, then the following error will occur: \begingroup \fontsize{9pt}{12pt}\selectfont \fn{terminate called after throwing an instance of 'std::string'}\endgroup. 
+If a BAM file is used as input without successful linking to 
+SAMTools, then the following error will occur: 
+\begingroup \fontsize{9pt}{12pt}\selectfont \fn{terminate called after throwing an instance of 'std::string'}\endgroup. 
 
 \newpage
 
-\section{Using preseq}
+\section{Using \textbf{preseq}}
 \label{sec:usage}
 
 \paragraph{Basic usage}~\\~\\[-.2cm]
 \label{sub:basic}
-To generate the complexity plot of a genomic
+To generate the complexity curve of a genomic
 library from a read file in BED or BAM format or a duplicate count file,
 use the function \fn{c\_curve}.  Use
 \fn{-o} to specify the output name.
 \begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt}
 $ ./preseq c_curve -o complexity_output.txt input.bed
 \end{alltt}\endgroup
 
-To estimate the future yield 
-of a genomic library
-using an initial experiment in BED  format,
+To predict the complexity curve 
+of a sequencing library
+using an initial experiment in BED format,
 use the function \fn{lc\_extrap}.
 The required options are \fn{-o} to specify
 the output of the yield estimates and 
@@ -159,7 +176,7 @@ \section{Using preseq}
 coverage is highly variable and uncertain function
 of sequencing depth.  Some regions may be missing
 due to locus dropout or preferentially amplified during
-MDA (multiple displacement amplification).  
+whole genome amplification.  
 \fn{gc\_extrap} allows the level genomic coverage from deep
 sequencing to be predicted based on an initial sample.
 The input file format need to be a mapped read (MR) or BED,
@@ -198,7 +215,9 @@ \section{File Format}
 mapped fragments are counted.  This means that both ends
 of a disconcordantly mapped read will each be counted separately.
 If a large number of reads are disconcordant, then
-the default single end should be used.  In this case only the mapping 
+the default single end should be used or the disconcordantly
+mapped reads removed prior to running \textbf{preseq}.
+In this case only the mapping 
 location of the first mate
 is used as the unique molecular identifier~\cite{kivioja2011counting}.
 
@@ -223,7 +242,9 @@ \section{File Format}
 \end{alltt}\endgroup
 More complicated unique molecular identifiers
 can be used, such as mapping position plus a random barcode,
-but are too complicated to detail in this manual. For questions with such usage, please contact us at \href{mailto:[email protected]}{\nolinkurl{[email protected]}}
+but are too complicated to detail in this manual. 
+For questions with such usage, please contact us at 
+\href{mailto:[email protected]}{\nolinkurl{[email protected]}}
 
 \paragraph{Mapped read format for \fn{gc\_extrap}}~\\~\\[-.2cm]
 
@@ -247,9 +268,8 @@ \section{Detailed usage}
 \label{sec:complexityplot}
 
 \fn{c\_curve} is used to compute the 
-expected complexity curve of a mapped read file by 
-subsampling smaller experiments without replacement 
-and counting the distinct reads.
+expected complexity curve of a mapped read file 
+with a hypergeometric formula~\cite{heck1975explicit}.
 Output is a text file with two 
 columns.  The first gives the total number
 of reads and the second the corresponding number
@@ -265,6 +285,8 @@ \section{Detailed usage}
 \item[\begingroup \fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
 \end{description}
 
+\newpage
+
 \paragraph{lc\_extrap}~\\~\\[-.2cm]
 \label{sec:librarycomplexity}
 
@@ -297,8 +319,11 @@ \section{Detailed usage}
 \item[\begingroup \fontsize{9pt}{12pt}\selectfont-H, -hist\endgroup] Input is a text file of the observed histogram
 \item[\begingroup \fontsize{9pt}{12pt}\selectfont-V, -vals\endgroup] Input is a text file of read counts
 \item[\begingroup \fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate yield without bootstrapping for confidence intervals
+\item[\begingroup \fontsize{9pt}{12pt}\selectfont-D, -defects\endgroup] Defects mode, estimates the complexity curve without checking for instabilities in the curve.  Should only be used on datasets that fail estimation without defects.
 \end{description}
 
+\newpage
+
 \paragraph{gc\_extrap}~\\~\\[-.2cm]
 \label{sec:genomiccoverage}
 
@@ -333,6 +358,8 @@ \section{Detailed usage}
 \item[\begingroup \fontsize{9pt}{12pt}\selectfont-Q, -quick\endgroup] Quick mode, option to estimate genomic coverage without bootstrapping for confidence intervals
 \end{description}
 
+\newpage
+
 \paragraph{bound\_pop}~\\~\\[-.2cm]
 \label{sec:lib_size}
 
@@ -468,7 +495,10 @@ \section{lc\_extrap Examples}
 10      146334
 \end{alltt}\endgroup
 
-The following command will give output of the same format as the above examples.\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} $./preseq lc_extrap -o future_yield.txt -H histogram.txt \end{alltt}\endgroup
+The following command will give output of the same format as the above examples.
+\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} 
+$./preseq lc_extrap -o future_yield.txt -H histogram.txt 
+\end{alltt}\endgroup
 
 Similarly, both \fn{lc\_extrap} and \fn{c\_curve} allow the option to input read counts (text file should contain ONLY the observed counts in a single column). For example, if a dataset had the following counts histogram:
 
@@ -490,7 +520,10 @@ \section{lc\_extrap Examples}
 1
 \end{alltt}\endgroup
 
-Command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode): \begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} $./preseq lc_extrap -o future_yield.txt -V counts.txt \end{alltt}\endgroup
+Command should be run with the \fn{-V} flag (not to be confused with \fn{-v} for verbose mode): 
+\begingroup \fontsize{9pt}{12pt}\selectfont \begin{alltt} 
+$./preseq lc_extrap -o future_yield.txt -V counts.txt 
+\end{alltt}\endgroup
 
 \newpage
 
@@ -665,11 +698,11 @@ \section{bound\_pop Example}
 
 \newpage
 
-\section{preseq Application Examples}
+\section{\textbf{preseq} Application Examples}
 
 \subsection*{Screening multiple libraries}
 \label{sec:multlib}
-This section provides a more detailed example using data from different experiments to illustrate how preseq might be applied. Because it is important to avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue with sequencing. The data in this example comes from a study (accession number SRA061610) using single cell sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing} and focuses on three libraries coming from different experiments from the study (SRX205369, SRX205370, SRX205372). 
+This section provides a more detailed example using data from different experiments to illustrate how \textbf{preseq} might be applied. Because it is important to avoid spending time on low complexity samples, it is important to decide after observing an initial experiment whether or not it is beneficial to continue with sequencing. The data in this example comes from a study (accession number SRA061610) using single cell sperm cells amplified by Multiple Annealing and Looping Based Amplification Cycles (MALBAC)~\cite{lu2012probing} and focuses on three libraries coming from different experiments from the study (SRX205369, SRX205370, SRX205372). 
 
 These libraries help show what would be considered a relatively poor library and a relatively good library, as well as compare the complexity curves obtained from running \fn{c\_curve} and \fn{lc\_extrap}, to show how \fn{lc\_extrap} would help in the decision to sequence further. The black diagonal line represents an ideal library, in which every read is a distinct read (though this cannot be achieved in reality). The full experiments were down sampled at 5\% to obtain a mock initial experiment of the libraries, as shown here, where we have the complexity curves  of the initial experiments generated by \fn{c\_curve}:
 ~\newline
@@ -853,7 +886,7 @@ \subsection*{Estimating and analyzing TCR$\beta$ richness}
 
 \section{FAQ}
 
-\Que{When compiling the preseq binary, I receive the error
+\Que{When compiling the \textbf{preseq} binary, I receive the error
 
 \fn{fatal error: gsl/gsl\_cdf.h: No such file or directory
 }
@@ -864,7 +897,7 @@ \section{FAQ}
 
 
 
-\Que{When compiling the preseq binary, I receive the error
+\Que{When compiling the \textbf{preseq} binary, I receive the error
 
 \fn{Undefined symbols for architecture x86\_64: ~\\
 \tab"\_packInt16", referenced from:~\\
@@ -883,14 +916,14 @@ \section{FAQ}
 
 
 
-\Que{I compile the preseq binary but receive the error 
+\Que{I compile the \textbf{preseq} binary but receive the error 
 
 \fn{terminate called after throwing an instance of 'std::string'}
 }
 
 \Ans{This error is typically called because either the flag -B was not included to 
 specify bam input or because the linking to SAMTools was not included when
-compiling preseq.  To ensure that the linking was done properly, check for the flag
+compiling \textbf{preseq}.  To ensure that the linking was done properly, check for the flag
 \fn{-DHAVE\_SAMTOOLS}.}
 
 \Que{When running \fn{lc\_extrap}, I receive the error 
@@ -950,12 +983,12 @@ \section{FAQ}
 \vspace{5mm}
 If none of these solutions worked, please email us at 
 \href{mailto:[email protected]}{\nolinkurl{[email protected]}}
-and please include the standard output from running preseq in
+and please include the standard output from running \textbf{preseq} in
 verbose mode (specifically the duplicate counts histogram) so 
 that we can look into the problem and rectify problems in future
 versions.  Also, feel free to email us with any other questions or
 concerns.
-The preseq software is still under development so we would appreciate any 
+The \textbf{preseq} software is still under development so we would appreciate any 
 advice, comments, or notification of any possible bugs. Thanks!
 
 \newpage
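
Note on the revised \fn{c\_curve} description: the new text says the expected complexity curve is computed with a hypergeometric formula~\cite{heck1975explicit} rather than by subsampling. As an illustration only, the rarefaction formula from that reference can be sketched in Python as below. This is a minimal sketch, not the preseq implementation: the function names `log_binom` and `expected_distinct`, and the dictionary representation of the duplicate counts histogram, are assumptions made for this example.

```python
from math import exp, lgamma


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def expected_distinct(counts_hist, n):
    """Expected number of distinct reads in a subsample of n reads.

    counts_hist maps j -> number of distinct reads observed exactly j times
    (a hypothetical in-memory form of a duplicate counts histogram).
    Uses the rarefaction formula
        E[S_n] = S - sum_j n_j * C(N - j, n) / C(N, n),
    where N is the total number of reads and S the number of distinct reads.
    """
    total = sum(j * n_j for j, n_j in counts_hist.items())
    distinct = sum(counts_hist.values())
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and the total read count")

    log_denom = log_binom(total, n)
    expected_missing = 0.0
    for j, n_j in counts_hist.items():
        # A read seen j times is absent from the subsample only if all n
        # sampled reads come from the other (total - j) reads.
        if total - j >= n:
            expected_missing += n_j * exp(log_binom(total - j, n) - log_denom)
    return distinct - expected_missing


if __name__ == "__main__":
    # Toy histogram: two reads seen once, one read seen twice (4 reads total).
    hist = {1: 2, 2: 1}
    for size in range(5):
        print(size, round(expected_distinct(hist, size), 4))
```

Evaluating the function on a grid of subsample sizes gives a two-column table (total reads, expected distinct reads) of the same shape as the output the manual describes for \fn{c\_curve}. Working with log binomial coefficients keeps the ratio stable for the large read counts typical of sequencing experiments.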