\documentclass[11pt]{article}
% Guidelines
% Abstract (short)
% Biography (~30 words for each author)
% Keywords (up to six):
% 2000 and 5000
% No limit on the number of figures
\usepackage[margin=1in]{geometry}
\usepackage{endfloat}
\usepackage[square,numbers,sort&compress]{natbib}
<<load-packages-options,cache=FALSE,echo=FALSE,message=FALSE,warning=FALSE>>=
library(ggplot2)
library(gdata)
library(reshape2)
library(plyr)
#Set some knitr options
opts_knit$set(progress = TRUE, verbose = TRUE)
opts_chunk$set(cache=FALSE, fig.width=16, echo=TRUE, eval=TRUE, message=FALSE, warning=FALSE)
@
\begin{document}
\title{Comparability and reproducibility of biomedical data}
\author{Raphael Gottardo and Yunda Huang\\ Fred Hutchinson Cancer Research Center}
\date{\today}
\maketitle
\begin{abstract}
With the development of novel assay technologies, biomedical experiments and analyses have evolved substantially. Today, a typical experiment simultaneously measures hundreds to thousands of individual features (\textit{e.g.}, genes) in dozens of biological conditions, resulting in gigabytes of data that need to be processed and analyzed. Because of the multiple steps involved in data generation and analysis, and the lack of details provided, it can be difficult for independent researchers to reproduce a published study. With the recent outrage following the halt of a cancer clinical trial due to the lack of reproducibility of the underlying published study, researchers are now facing heavy pressure to ensure that their results are reproducible. Despite this pressure, too many published studies remain non-reproducible, mainly due to the lack of availability of experimental protocols, data and/or computer code.
Scientific discovery is an iterative process, where a published study generates new knowledge and data, resulting in new follow-up studies or clinical trials based on these results. As such, it is important for the results of a study to be quickly confirmed or discarded to avoid wasting time and money on novel projects. The availability of high-quality, reproducible data will also lead to more powerful analyses (or meta-analyses) where multiple datasets are combined to generate new knowledge. In this article, we review some of the recent developments in the area of biomedical reproducibility and comparability, and discuss some of the areas where the overall field could be improved.\\
\noindent \textbf{Keywords:} Analysis pipeline, Accuracy, Open science, Precision, Protocol, Standardization
\end{abstract}
\section{Introduction}
% Some text to motivate the reproducibility of data analysis
Over the past two decades, the biomedical field has been transformed by the advent of new high throughput technologies such as gene expression microarrays, protein arrays, flow cytometry and next generation sequencing. Experiments and protocols have become increasingly complex, involving the use of instruments that can be very sensitive to specific settings. For example, small changes in the photomultiplier tube (PMT) voltage of a flow cytometer or a microarray scanner could drastically change the output of an experiment \cite{Lyng:2004ht}. It is thus crucial that protocols be well described, standardized and shared in order for an experiment to be reproducible and comparable within and between laboratories.
Furthermore, these novel biomedical technologies generate large high-dimensional data sets from individual experiments. The growth of such data has highlighted the importance of implementing data management and analysis plans as an integral part of experimental design. In consequence, data analysis procedures contribute significantly to the reproducibility or non-reproducibility of an experiment or publication. Unfortunately, as of today, too many published studies remain irreproducible due to the lack of sharing of data, computer code, or software required to reproduce the study results. This lack of reproducibility has had significant impact, leading to the halt of a cancer clinical trial when key gene expression signatures used for decision making were found to be artifacts of analysis errors and could not be independently reproduced by researchers \cite{Hutson:2010ih}. Had the data and computer code been made available, the results of the study could have been invalidated more rapidly, which could have saved funding, avoided giving patients false hope, and most importantly ensured patients received effective treatment \cite{Baggerly:2011ca}. Fortunately, over the past decade, computers, software tools, and online resources have drastically improved, to the point that it is easier than ever to share data and code and to construct fully reproducible data analysis pipelines.
In this paper, we review some of the fundamental issues involved in the comparability and reproducibility (C\&R) of biomedical data, ranging from assay standardization to reproducible data analysis. Our intent is not to exhaustively review all possible problems with all existing assays, but rather to select a few concrete examples based on our own experience and present some thoughts and solutions towards the overall concept of comparability and reproducibility. Our paper is divided into two main sections, one on experiment reproducibility and one on analysis reproducibility, though the two topics overlap significantly.
\section{Reproducibility of assay and primary data}
\label{sec:experiment}
\subsection{Overview of data generation process and its impact on C\&R}
We examine a prototypical biomedical data generation process to illustrate factors that may negatively impact the C\&R of the data throughout different stages of the process. As shown in Figure \ref{fig:circle}, a data generation process can be roughly broken down into three core stages (Steps 1-3) of information transformation from signals contained in biological samples to numeric values captured in datasets for analysis. In step 1, biological samples are measured and raw instrument data are generated. There are several factors that may influence the C\&R of data at this stage. These include some obvious factors such as the specific type of technologies (\textit{e.g.}, hybridization-based or sequence-based gene expression) \cite[e.g.,][]{Yauk2004, LiuJenssen2007, Kuo2006, GitDvinge2010} or platforms (\textit{e.g.}, Affymetrix, Illumina or Operon) \cite[e.g.,][]{Larkin:2005hb,Baumbusch2008, WangHowel2011, LiuKuo2011, ChangWei2012}, the standard operating procedures for biological sample preparation, experiment layout, and measurement \cite[e.g.,][]{AlMulla2004, Ach:2007er}, as well as other conditions that are often not specified in the experiment protocol. For example, the level of experience or expertise of the technicians performing the experiment \cite{Duewer2009, Todd:2012cx}, or the origin of the reagents (\textit{e.g.}, batch effects \cite{Scherer2009, Leek:2010jq}) are also possible sources for differences between independent experimental results. Therefore, in step 1, to increase the C\&R of data, all these factors should be thought out and optimally controlled and standardized whenever possible. When factors such as technicians or reagent batches may not be standardizable across multiple studies or labs, a measuring system comprised of a specific platform using a specific technology should strive to minimize variations caused by these factors and increase robustness against changes in these factors. In step 2, raw information from an instrument is calibrated and quantified into numeric values. This stage often involves image analyses for information alignment and/or dimension reduction. Consequently, the specific algorithms used to make such transformations, their implementation in software, and the specific data storage structures, including data formats (\textit{i.e.}, databases or flat files) and variable naming conventions are vital to maintaining data consistency and should be standardized to a maximal level for effective C\&R of the data. We will refer to the data derived from this stage as primary data versus the secondary data generated after step 3.
In some specific cases, primary data are derived directly from the instrument, but in many cases the extremely large size of the raw data (\textit{e.g.}, raw images) makes it prohibitive to share them, and the lack of true raw data is accepted.
Lastly, in step 3, data from step 2 are further (pre-)processed before study objective-driven analyses are conducted. This latter stage often involves further data alignment such as background adjustment, or data aggregation such as per-biomarker summarization from multiple subset measurements. Certain quality assurance and control processing may also occur to remove unreliable data and reduce any systematic variations between data points. As in step 2, the specification and implementation of the algorithms and the data storage structures should be tracked in the effort to maintain the C\&R of the data. In section \ref{sec:data}, we will discuss some of the tools available to share step 2 data and associated computer code for data processing and analysis.
\begin{figure}
\begin{center}
\includegraphics[width=7in]{./figure/OverallChart.pdf}
\caption{Life cycle of scientific discoveries. The overall cycle is broken down into five different steps. For each step, guidelines for full reproducibility are given in the corresponding blue rectangles. After completion of all steps according to the reproducibility guidelines, the results would rapidly lead to confirmed (or discarded) discoveries. The confirmed discoveries would then be translated into new knowledge and data supporting novel studies.}
\label{fig:circle}
\end{center}
\end{figure}
\subsection{Metrics to quantify C\&R}
We use accuracy and precision as two building-block metrics to illustrate the concept of C\&R. While the exact definition of C\&R may vary depending on the context, accuracy and precision are two well defined statistical concepts. Specifically, accuracy indicates how close a measurement is to its true (actual) value, whereas precision indicates how close measurements are to each other. Deviation from accuracy (\textit{i.e.}, bias) is often introduced by systematic sources of error. For example, factors mentioned earlier such as the measuring system, or a poor antibody, may be a primary source of bias that cannot be removed by repeating or averaging large numbers of measurements. On the other hand, precision (\textit{i.e.}, variability) of data can generally be improved by increasing the number of measurements. For this reason, biological and technical replicates are recommended in an experimental design to help distinguish biological variation from technical variation. In general, there is a trade-off between accuracy and precision, in the sense that one cannot optimize both simultaneously. For example, in microarray image analysis, spots can either be summarized by the estimated foreground intensity or the background corrected intensity (foreground minus the background). Foreground intensities are typically less variable but more biased than background corrected intensities. In this context, many research groups have proposed pre-processing techniques that aim at finding a good compromise between the two \citep{Gottardo2006b,Scharpf:2007jj}. A hypothetical example is shown in Figure \ref{fig:variance-bias}, where comparable and reproducible data do not necessarily require unbiased measurements as long as they are ``consistently inaccurate'' (Protocol C). Imagine a hypothetical gene expression device that always measures the expression of a gene as being zero. The experiment is highly reproducible but completely biased, and thus useless. It is not atypical for an experimentalist to compute a coefficient of correlation between two series of experiments and to be very pleased when he/she obtains a value close to 1. Unfortunately, the large correlation could be explained by the fact that the measurements are biased and both are correlated with the same experimental artifact. So it is important that when C\&R is evaluated, accuracy is also taken into consideration. Therefore, to ensure meaningful integrative analysis of biomedical data from multiple sources, we encourage the inclusion of a ``gold standard'' of measurement whenever possible, such as the inclusion of \textit{established} positive and negative controls in the experiments. In this way, any signals identified from comparable and reproducible data can also be scrutinized against the gold standard for true scientific value.
\begin{figure}
\begin{center}
<<"data-for-reproducibility", fig.width=8,fig.height=6,dev=c('pdf', 'postscript'),echo=FALSE,dev.args=list(pointsize=16)>>=
# Set a seed for reproducibility
set.seed(6)
# True estimate
true.mean<-6
# Number of replicates
n<-100
biased.low.var<-rnorm(n,true.mean+3,sd=1)
biased.high.var<-rnorm(n,true.mean+2,sd=1.5)
unbiased.low.var<-rnorm(n,true.mean,sd=2)
unbiased.high.var<-rnorm(n,true.mean,sd=3)
data <- data.frame(
  Replicates = c(unbiased.low.var, unbiased.high.var, biased.low.var, biased.high.var),
  variance = c(rep("Low variance", n), rep("High variance", n), rep("Low variance", n), rep("High variance", n)),
  bias = c(rep("Unbiased", n), rep("Unbiased", n), rep("Biased", n), rep("Biased", n)),
  Protocol = c(rep("A", n), rep("B", n), rep("C", n), rep("D", n)),
  Exp = rep("x", 4 * n))
# Boxplots of the four protocols, faceted by bias and variance; the horizontal line marks the true value
ggplot() +
  geom_boxplot(data = data, aes(y = Replicates, x = Exp, fill = Protocol, alpha = Protocol), outlier.size = 0) +
  geom_abline(intercept = true.mean, slope = 0, size = 2, color = "black", alpha = .5) +
  facet_grid(variance ~ bias) + theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text.y = element_blank(), axis.text.x = element_blank(),
        axis.title.x = element_blank(), axis.ticks = element_blank()) +
  annotate("text", 0, 6, label = "Truth", hjust = -0.5, vjust = -0.5)
@
\caption{Precision-accuracy trade-off. Four different protocols are compared. Protocol B exhibits large variance (wide box) with small bias (close to the true value on average) while protocol C has small variance but large bias. Overall, protocol D exhibits a good variance-bias trade-off and should be preferred.}
\label{fig:variance-bias}
\end{center}
\end{figure}
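As a simple numerical illustration, the short chunk below (a minimal sketch based on the simulated data underlying Figure \ref{fig:variance-bias}) summarizes the accuracy and precision of each hypothetical protocol by its empirical bias and standard deviation.
<<"bias-precision-summary">>=
# Empirical bias (accuracy) and standard deviation (precision) of the
# simulated replicates, for each hypothetical protocol of the
# precision-accuracy figure
ddply(data, "Protocol", summarise,
      bias = mean(Replicates) - true.mean,
      precision = sd(Replicates))
@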
\subsection{Methods to correct for experimental bias}
In the presence of possible experiment-specific bias, data pre-processing methods can be used to improve C\&R. It is common practice to reduce non-biological sources of variation via pre-processing techniques such as background correction, batch effect removal or normalization. Many of these methods were established during the early days of microarrays, at a time when experimental procedures were still being optimized and technical variability was omnipresent. Such methods include lowess normalization \cite{Dudoit2002}, quantile normalization \cite{Bolstad2003}, ComBat for batch effect removal \cite{Johnson:2007fp} and gcRMA for removing non-specific binding of oligonucleotides \citep{Wu2004}, to cite a few. Due to the positive impact these methods have had on C\&R, many other fields have adopted similar pre-processing techniques, \textit{e.g.}, flow cytometry \cite{Hahne2010} and next generation sequencing \cite{Robinson:2010dd}. Most of these methods rely on the assumption that the majority of biomarkers (genes or proteins) are not differentially expressed and that the numbers of up- and down-regulated biomarkers are roughly equal across samples. Such an assumption can be reasonable when the number of biomarkers measured in each sample is large, but may not be satisfied in lower-dimensional biomedical data. In the latter case, internal or external validation data are usually used to correct for experimental bias that may be related to measurement, instrument or sampling design \citep{Buonaccorsi:2009vx}. When there is a lack of standard for a quantity's true value \citep[e.g.,][]{Maecker2005a} and validation data are infeasible to generate, calibration methods based on paired samples \citep{Huang:vw} can be adopted to adjust for experimental bias. For example, in the field of flow cytometry, true gold standards do not exist yet, and it is thus difficult to evaluate C\&R. The FlowCAP group (flowcap.flowsite.org) is currently working with the Human Immunology Project group \cite{Maecker2012a} to derive objective criteria and gold standards that will be used to standardize and evaluate pre-processing of flow cytometry data.
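To make the idea concrete, the unevaluated chunk below gives a minimal sketch of quantile normalization in the spirit of \cite{Bolstad2003}; the expression matrix \texttt{exprs} (one row per biomarker, one column per sample) is a hypothetical placeholder and the sketch ignores the handling of ties.
<<"quantile-normalization-sketch", eval=FALSE>>=
# Minimal sketch of quantile normalization: every sample (column) is forced
# to share the same empirical distribution, namely the average of the sorted
# expression profiles across samples.
quantile.normalize <- function(exprs) {
  ranks <- apply(exprs, 2, rank, ties.method = "first")
  mean.profile <- rowMeans(apply(exprs, 2, sort))
  apply(ranks, 2, function(r) mean.profile[r])
}
# exprs.norm <- quantile.normalize(exprs)
@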
\subsection{Standards and data sharing}
As datasets get richer with more data, more variables, and more metadata, it is important to define standards that can be used to capture and distribute all necessary information toward achieving reproducibility \cite{Quackenbush:2004hk}. Several standards have been proposed for biomedical data that achieve these goals, including MIAME for gene expression \cite{Brazma:2001gv}, MINSEQE for sequencing experiments \cite{Society:wo} and MIFlowCyt for flow cytometry \cite{Anonymous:2010fo}. In addition to assay protocol information and primary and secondary data, it is important that any preprocessing done to the data be fully described (\textit{e.g.}, normalization for microarrays).
Unfortunately, too many assays still lack data standards (\textit{e.g.}, bead array multiplex assays), or, when data standards are available, manufacturers and/or software companies have been slow to adopt them. For example, despite the availability of data standards for defining preprocessing for flow cytometry, no analysis software has yet fully adopted this format, and it is very difficult to share reproducible analyses across software platforms. We, the flow informatics community, basically had to reverse engineer commercial software file formats and write custom open source software that can read them \cite{Finak}.
% Data sharing policy for organization
Funding agencies have been very supportive of the creation and adoption of standards for biomedical data, by funding many of the existing standardization efforts. For example, as part of the Human Immunology Project Consortium (HIPC), a project funded by the NIH, we and other bioinformaticians are currently working towards the definition of novel standards for immunological data. Similarly, the Collaboration for AIDS Vaccine Discovery (CAVD), funded by the Bill and Melinda Gates Foundation (BMGF), has set up immune monitoring consortia to establish validated T-cell and antibody immunological assays across a network of Good Clinical Laboratory Practice (GCLP) certified laboratories that could monitor the anticipated pipeline of HIV vaccine trials emanating from the field. Once data and data formats have been standardized, it is important to make these data publicly available for the benefit of science, and to this end, funding agencies have an important role to play. Most funding agencies, including the National Science Foundation (NSF) and the NIH, clearly encourage investigators to share data and/or have defined policies to that effect. Similarly, charitable organizations such as the BMGF and the Wellcome Trust are also actively working with grantees to maximize the amount of data available to the research community.
Example projects with good data sharing policies and established databases for sharing data, in which we are personally involved, are the HIV Vaccine Trials Network (HVTN), HIPC, and the CAVD. In addition to helping retrieve data more efficiently (\textit{e.g.}, via queries), databases can help minimize human errors in data manipulation by ensuring that raw and processed data, along with metadata, are automatically uploaded with minimal manual intervention. Databases can also help maintain data consistency by checking that certain standards are followed or by performing basic data quality checks.
For example, the Immunological Portal database (ImmPort.org) provides data templates that help investigators upload their data in a standardized format.
It is thus a good idea to use specialized databases whenever possible to store and share data. Despite this global effort, many policies are still either too vague or not properly enforced, and data are treated as the private property of investigators who aim to maximize their publication record at the expense of the widest possible use of the data. This situation threatens to limit both the progress of this research and its application for public health benefit. We feel that it is important for funding agencies to set stricter and clearer data sharing policies, particularly for sensitive data (\textit{e.g.}, individual genomes and clinical data) where policies are often vague or industrial partnerships make the creation of such policies very difficult. In these cases, despite their sensitive nature, these data could and should be shared as long as they are properly de-identified to protect the patients' identities under the Health Insurance Portability and Accountability Act (HIPAA).
Once data and all necessary information are made available, these data need to be appropriately cited when the study and its results are published. To this end, it is crucial that journals set data sharing policies or guidelines, and that authors follow these guidelines. Unfortunately, as mentioned in a recent study \cite{AlsheikhAli:2011hd}, too few journals have clear policies for data deposition and even fewer make it mandatory for publication. That study found that even when data deposition is a requirement, the majority of authors did not fully follow the instructions. For example, it is common for researchers to share processed data only, which makes it nearly impossible to reproduce the results or use different analysis tools that require primary data. In the field of genomics, for instance, many researchers share processed sequence file formats (\textit{e.g.}, wiggle files), which prevents anyone from analyzing the data with an algorithm that requires primary data (\textit{e.g.}, raw or aligned reads).
\section{Reproducibility of assay results and derived data}
\label{sec:data}
Here we discuss some of the tools available to researchers to perform reproducible analyses and share processed data, computer code and final results. Analysis of data generated by high throughput experiments can be extremely complex, involving multiple steps from data formatting and pre-processing to statistical inference. Thus it is important that all steps be recorded for full reproducibility. This can be difficult to do with a point-and-click software interface, where there is no easy way to save intermediate results; moreover, the ``manual'' analysis of a high throughput data set typically requires the use of multiple software tools and is very time consuming. In addition, it is not clear how robust the conclusions of a study are to small perturbations in any of these analysis steps. As such, it is valuable to be able to quickly redo an analysis after tuning some parameters to optimize the analysis, something that is not practical within a point-and-click environment.
\subsection{Tools for reproducible analyses}
In recent years, several open-source, community-based projects have emerged that enable researchers to construct and share complete and fully reproducible data analysis pipelines.
The Bioconductor project \cite{Gentleman2004}, based on the R statistical language \cite{Ihaka1996}, provides more than 500 software packages for the analysis of a wide range of biomedical data, from gene expression microarrays to flow cytometry and next generation sequencing. These packages can be combined via scripts written in the R language to form complex data analysis pipelines, connect to data repositories, and generate high quality graphics. The resulting R scripts can then be used to record and later reproduce the analysis (along with all input parameters). Because all steps of the analysis are automated when the script is executed, it is easy to assess the robustness of the results when tuning some parameters. Other similar projects with perhaps more focused capabilities include BioPython \cite{Cock:2009hj} and BioPerl \cite{Stajich:2002bf}, which are based on the Python and Perl languages, respectively (to our knowledge, neither BioPython nor BioPerl has tools for the analysis of flow cytometry data).
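As an illustration, the unevaluated chunk below sketches a minimal scripted analysis using the Bioconductor package limma; the expression matrix \texttt{exprs} and the two-group design are hypothetical placeholders, but every step of the analysis is captured in the script and can be re-run by anyone with access to the data.
<<"limma-pipeline-sketch", eval=FALSE>>=
# Minimal sketch of a scripted differential expression analysis with limma,
# assuming a hypothetical matrix `exprs` (genes in rows, samples in columns)
# and a two-group design with three samples per group.
library(limma)
group <- factor(c(rep("control", 3), rep("treated", 3)))
design <- model.matrix(~ group)
fit <- eBayes(lmFit(exprs, design))   # per-gene linear models, moderated t-tests
topTable(fit, coef = 2, number = 10)  # ten top-ranked differentially expressed genes
@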
Even though several graphical user interfaces (\textit{e.g.}, RStudio for R) are available for writing computer scripts based on R/Bioconductor (or BioPerl, BioPython), the learning curve can still be steep for novice users. More user-friendly tools are now available to construct reproducible data analysis pipelines using combinations of available modules that are, for the most part, wrappers of packages written in R, Perl or Python (or some other language). For example, a popular platform for gene expression analysis, GenePattern, versions every pipeline and its methods, ensuring that each version of a pipeline (and its results) remains static \cite{Reich2006}. A more recent project, GenomeSpace (genomespace.org), funded by the National Human Genome Research Institute (NHGRI), can now combine GenePattern with other popular bioinformatics tools including Galaxy, Cytoscape and the UCSC genome browser. As such, users can perform all of their analyses using a single platform. In the clinical and immunological field, LabKey is a popular web-based tool for storing immunological data (via a database) and building complex analysis pipelines that can be shared with other users \cite{Nelson2011a}. LabKey is currently being used by large research networks including the CAVD, the HVTN, and the Immune Tolerance Network, to name a few.
\subsection{Standards and code sharing}
In the same fashion that experimental protocols need to be published in order for an experiment to be reproduced, computer code, software and data should also be published along with the results of a data analysis. Ideally, software would be open source and computer code would be well packaged and standardized to facilitate exchange and usability. Both Bioconductor and GenePattern, mentioned above, provide facilities for users to package and share code with other users. Bioconductor is based on the R packaging system, which is highly standardized and has been a driving force behind the wide adoption of both R and Bioconductor. Bioconductor goes even further by 1) ensuring that all submitted packages are peer-reviewed and 2) providing version control repositories and build systems where source code is maintained and versioned, and binaries are automatically built for all operating systems.
Among other things, the peer review process ensures that packages follow some basic guidelines, are well documented, work as advertised and are useful to the community. The open source model and the versioning system provide full access to algorithms and their implementations, which is crucial for full reproducibility.
For users who want to version and share software code outside of the Bioconductor (or similar) project, there exist many free, web-based hosting services to store, version and share code (and even data). One of our favorite platforms is GitHub, which the company markets as ``Social Coding for all''. GitHub makes it easy for anyone to store and version control computer code, packages, documents, webpages and even wikis to document their code. The social aspect of GitHub makes it easy for users to work in teams on a common project, software or manuscript. GitHub is free for all open source projects.
Unfortunately, very few journals have code sharing/software policies and even fewer require that the code/software be open source. For example, BMC Bioinformatics, for which one of us is an associate editor, only has policies for software articles, and even for these the source code is not required, only an executable. PLoS One requires authors of manuscripts in which software is the central part of the paper to release the software and make the code open source upon submission. Although this policy is clearer, it is still up to the editor/reviewers to decide whether software was a central part of the paper. In a day and age when most experiments generate large amounts of data, software is always going to play a central role, so why not make this policy universal for all submissions involving data analysis?
This being said, based on our own experience, we feel that reviewers are pushing in the right direction by asking that code be open source and released along with the paper. So even if journals have no clear policies yet, we, the community, can enforce that code be released every time we review a paper.
\subsection{Authoring tools}
Several tools have been proposed to automatically incorporate reproducible data analysis pipelines or computer code into documents. An example is the GenePattern Word plugin that can be used to embed analysis pipelines in a document and rerun them on any GenePattern server from the Word application \cite{Mesirov2010}. Another example that is popular among statisticians and bioinformaticians is the Sweave literate programming language \cite{Leisch:2002wh}, which allows one to create dynamic reports by embedding R code in \LaTeX{} documents. This is our preferred approach because it is open source and does not depend on proprietary software. As an example, every Bioconductor package is required to have fully reproducible documentation (called a vignette) written in the Sweave language. Recent software development tools such as RStudio (rstudio.org) and knitr (yihui.name/knitr) have made working with Sweave even more accessible, which should reduce the learning curve for most users. In fact, this article was written using the Sweave language and processed using RStudio, and the source file (along with all versions of it) is available from GitHub.
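As a concrete illustration, the unevaluated chunk below sketches how a Sweave/knitr source file such as this one can be turned into a final PDF from within R; the file names refer to this article's own source.
<<"knitr-compile-sketch", eval=FALSE>>=
# Minimal sketch (not run): execute the embedded R code and build the PDF.
library(knitr)
knit("BiB-reproducibility.Rnw")       # runs all code chunks, writes a .tex file
library(tools)
texi2pdf("BiB-reproducibility.tex")   # compiles the .tex file into a PDF
@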
Ideally, all materials, including the Sweave source file, computer code and data, which Gentleman and Temple Lang refer to as a \textit{compendium} \cite{Gentleman:2007bm}, would be made available along with the final version of the manuscript and be open access, allowing anyone to reproduce the results or identify potential problems in the analysis. An obvious option would be to package the code, data and Sweave source file into an R package for ease of distribution, as is commonly done for Bioconductor data packages. Anyone could directly install this package in R and have access to all necessary materials.
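The unevaluated chunk below sketches this packaging step; \texttt{myCompendium} is a hypothetical package name.
<<"compendium-sketch", eval=FALSE>>=
# Minimal sketch (not run): bundle the objects and functions of the current
# analysis session into a package skeleton.
package.skeleton(name = "myCompendium", list = ls())
# The Sweave/knitr source (e.g., BiB-reproducibility.Rnw) can then be added
# as a package vignette so that building the package re-runs the analysis.
@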
Journals that promote this openness should further improve their impact relative to non-open journals by giving more credibility to the published results, in the same fashion that open access journals typically have greater impact factors \cite{Eysenbach:2006jo}. Unfortunately, very few journals are currently pushing for full reproducibility and even fewer have clear reproducibility policies. An example of a journal moving in the right direction is Biostatistics, for which one of us is an associate editor. Biostatistics now has a reproducibility guideline and works with authors towards making sure that published results are reproducible when data and code are provided (as described in the guideline). When data and code are provided and results can be reproduced by the associate editor, the article is marked with an ``R'' for reproducible.
\section{Conclusion}
We have reviewed some of the key steps involved in the C\&R of biomedical data, from protocols to code and data sharing. Even though experiments, protocols and data analyses have become more complex than ever before, tools and methods for C\&R have also significantly improved. Unfortunately, we are still far from the ideal situation where every study can be reproduced and relevant data can be compared and pooled across laboratories or institutions. Besides experiment and protocol consistency, there is still a lot of work to be done in terms of data and analysis standardization that would not only improve reproducibility but also facilitate data exchange and meta-analyses. Perhaps one way to achieve this is for experimental and computational groups to work together when developing novel assays, standards and analysis tools. This is something that is integral to the CAVD and HIPC projects mentioned previously. For example, both the CAVD and HIPC have bioinformatics/biostatistics and assay subcommittees that work together to optimize and standardize novel assays and analysis tools.
In terms of data, code and software sharing, we cannot yet rely on goodwill and self-discipline when it comes to sharing publication material and making studies fully reproducible. As such, we feel that today the most important step toward improving C\&R is for funding agencies, publishers and researchers to work together by setting very strict reproducibility guidelines and policies. Such policies could potentially save a great deal of money and resources by making sure that scientific errors can quickly be discovered and corrected instead of giving birth to new scientific projects and clinical trials based on erroneous results. Of course, no one should be afraid of making their publication material available because someone might identify a flaw in the study. As Alexander Pope said, ``To err is human, to forgive divine''; we all learn from our mistakes, and this is the only way science can move forward.
\section{Acknowledgments}
This work was supported by a Bill and Melinda Gates Foundation grant, the Vaccine Immunology Statistical Center, and NIH grants U01 AI068635-01 and U19 AI089986-01.
\bibliographystyle{unsrtnat}
\bibliography{BiB-reproducibility}
\end{document}