Clarifications, section restructuring, improved definitions
Point 0 of Reviewer 2
TheChymera committed May 15, 2024
1 parent c71f96d commit f89b312
Showing 3 changed files with 22 additions and 22 deletions.
13 changes: 7 additions & 6 deletions publishing/article/background.tex
@@ -7,8 +7,9 @@ \section{Background}
\subsection{Reexecutable Research}

Independent verification of published results is a crucial step for establishing and maintaining trust in shared scientific understanding \cite{rpp, Ioannidis2005}.
The basic feasibility of \textit{de novo} research output generation from the earliest recorded data provenance is known as reexecutability, and has remained largely unexplored as a distinct phenomenon in the broader sphere of research reproducibility.
Reexecutability is distinguished from reproducibility, in that the latter refers to obtaining consistent results when re-testing the same phenomenon \textit{NASrepro}, while the former refers to being able to obtain \textit{any} results, automatically, while re-using the same data and instructions.
The property of a research workflow to automatically produce an output — analogous, even if incoherent, with the original — based on the same input data and same instruction set is known as reexecutability.
This property, though conceptually simple, has remained largely unexplored as a distinct phenomenon in the broader sphere of “research reproducibility”.
The core distinction between reexecutability and reproducibility is that the latter refers to obtaining consistent results when re-testing the same phenomenon \cite{NASrepro}, while the former refers to being able to obtain \textit{any} results, automatically, while re-using the same data and instructions.
While the scope of \textit{reexecution} is thus much narrower than that of \textit{reproduction}, it constitutes a more well-defined and therefore tractable issue in improving the quality and sustainability of research.
In all cases, reexecutability increases the feasibility of reproduction assessments, as it enables high-iteration re-testing of whatever parts of a study are automated.
Further, in the case of complex analysis processes with vast parameter spaces, reexecutability is a prerequisite for detailed reproducibility assessments.
@@ -21,9 +22,9 @@ \subsection{Reexecutable Research}
Reexecutability is an emergent topic in research, with a few extant efforts attempting to provide solutions and tackle associated challenges.
Such efforts stem both from journals and independent researchers interested in the capabilities which reexecutable research processes offer to the ongoing development of their work.
Among these, an effort by the eLife journal \cite{eliferep} provides dynamic article figures based on the top-most data processing output and executable code conforming to journal standards.
NeuroLibre~\cite{neurolibre} provides a Jupyter Notebook based online platform for publishing executable books along with a selection of reexecutabiliety assets, namely code, data, and a reexecution runtime.
Jupyter Notebooks are also used independently of journal support, yet such usage is indicative of a focus on interactivity for top-most analysis steps rather than full reexecution, commonly not providing either data or software dependency tracking \cite{samuel2024}.
Independent researcher efforts at crating reexecution systems offer more comprehensive and flexible solutions, yet remain constrained in scope and generalizability.
NeuroLibre~\cite{neurolibre} provides a Jupyter Notebook based online platform for publishing executable books along with a selection of related assets, namely code, data, and a reexecution runtime.
Jupyter Notebooks are also used independently of journal support, yet such usage is indicative of a focus on interactivity for top-most analysis steps rather than full reexecution, and characterized by a widespread lack of either data or software dependency specification \cite{samuel2024}.
Independent researcher efforts at creating reexecution systems offer more comprehensive and flexible solutions, yet remain constrained in scope and generalizability.
For example, they may provide reference implementations which are either applied to comparatively simple analysis processes \cite{Dar2019} or tackle complex processes, but assume environment management capabilities which may not be widespread \cite{repsep}.

In order to optimally leverage extant efforts pertaining to full article reexecution and in order to test reexecutability in the face of high task complexity, we have selected a novel neuroimaging study, identified as OPFVTA (OPtogenetic Functional imaging of Ventral Tegmental Area projections) \cite{opfvta}.
@@ -45,7 +46,7 @@ \subsection{Data Analysis}

Computationally, in the case of the OPFVTA article as well as the general case, the various data analysis workflow steps are sharply distinguished by their time cost.
By far the most expensive element is a substage of data preprocessing known as registration.
This commonly relies on iterative gradient descent and can additionally require high-density sampling depending on the feature density of the data.
This process commonly relies on iterative gradient descent and can additionally require high-density sampling depending on the feature density of the data.
The second most costly step is the first-level GLM, the cost of which emerges from the high number of voxels modeled individually for each subject and session.

The impact of these time costs on reexecution is that rapid-feedback development and debugging can be stifled if the reexecution is monolithic.
23 changes: 11 additions & 12 deletions publishing/article/results.tex
@@ -5,25 +5,24 @@ \section{Results}

\subsection{Repository}

The repository constituting the output of our work is published openly and with version control based on Git \cite{git} via GitHub, a social coding platform \cite{me} and via Gin, an academic code and data sharing platform \cite{me-gin}.
The most up to date instructions for reexecuting our work (the original as well as this article) are found in the \texttt{README.md} file on the repository.
While the key focus on reexecution means that the software internal to the article workflows is provided via containers, requirements remain for fetching the required data remain.
The repository constituting the output of our work is published openly and with version control based on Git \cite{git} via GitHub, a social coding platform \cite{me} and via Gin, an academic code and data sharing platform \cite{me-gin}.
The most up-to-date instructions for accessing and reexecuting our work (the original as well as this article) are found in the \texttt{README.md} file on the repository.
While the key focus on reexecution means that the software internal to the article workflows is provided via containers, software requirements remain for fetching the software, data, and containers themselves.
These include, prominently, Git, DataLad \cite{datalad}, and a container management system (Docker, Podman, or Singularity).

\subsection{Repository Structure}

In order to prevent resource duplication and divergence, and to improve the modularity in view of potential re-use of this system, we have constructed a parent repository which leverages Git and DataLad to link all reexecution requirements.
This framework uses Git submodules for resource referencing, and DataLad in order to permit Git integration with data resources.
In order to prevent resource duplication and divergence, and to improve the modularity in view of potential re-use of this system, we have bundled access to all elements of our work into a parent repository.
This structure (\cref{fig:topology}) uses Git submodules for referencing individual elements relevant for the workflow, and DataLad in order to permit Git integration with data resources.

These submodules include the original article, the raw data it operates on, and a reference mouse brain templates package.
Additionally, the top-level repository directly tracks the code required to coordinate the OPFVTA article reexecution and subsequent generation of \emph{this} article.
The code unique to the reexecution framework consists of container image generation and container execution instructions, as well as a Make system for process coordination (\cref{fig:topology}).
The code unique to the reexecution framework consists of container image generation and container execution instructions, as well as a Makefile, and is tracked directly via Git.

This repository structure enhances the original reference article by directly linking the data at the repository level, as opposed to relying on its installation via a package manager.
Notably, however, the article source code itself is not duplicated or further edited here, but handled as a Git submodule, with all proposed improvements being recorded in the original upstream repository.
The OPFVTA article source code itself is not duplicated as part of our work, but handled as a Git submodule, with all proposed improvements being contributed to the original upstream repository.
The layout constructed for this study thus provides robust provenance tracking and constitutes an implementation of the YODA principles (a recursive acronym for “YODAs Organigram on Data Analysis” \cite{yoda}).

The Make system is structured into a top-level Makefile, which can be used for container image regeneration and upload, article reexecution in a containerized environment, and meta-article production.
There are independent entry points for both \emph{this} and the original article — making both articles reexecutable (\cref{fig:workflow}).
The Make system (\cref{fig:workflow}) is structured into a top-level Makefile, which can be used for container image regeneration and upload, article reexecution in a containerized environment, and meta-article production.
There are independent entry points for both \emph{this} and the original article — making both articles reexecutable.
Versioning of the original article reexecution is done via file names (as seen in the \texttt{outputs/} subdirectories of \cref{fig:topology}) in order to preserve shell accessibility to what are equivalent resources.
Versioning of the meta-article is handled via Git, so that the most recent version of the work is unambiguously exposed.

@@ -37,7 +36,7 @@ \subsection{Repository Structure}
\centering
\includegraphics[clip,width=0.99\textwidth]{figs/topology.pdf}
\caption{
\textbf{The directory topology of the new reexecution system nests all resources and includes a Make system for process coordination.}
\textbf{The directory topology of the reexecution repository \cite{me}, highlighting Git submodules.}
Depicted is the directory tree topology of the repository coordinating OPFVTA reexecution.
Nested directories are represented by nested boxes, and Git submodules are highlighted in orange.
The article reexecution PDF results are highlighted in light green, and the PDF of the resulting meta-article (i.e. this article) is highlighted in light blue.
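The software requirements named in the \texttt{results.tex} changes above (Git, DataLad, and one of Docker, Podman, or Singularity) could be checked before a reexecution is attempted. The following is a minimal Python sketch under stated assumptions and is not part of the repository: the function name `missing_prerequisites` is hypothetical, and each tool is assumed to be installed under its lowercase command name.

```python
import shutil

# Tools named in the text. Treating them as lowercase command names on PATH
# is an assumption made for this illustration.
REQUIRED = ("git", "datalad")
CONTAINER_RUNTIMES = ("docker", "podman", "singularity")


def missing_prerequisites(which=shutil.which):
    """Return descriptions of prerequisites that cannot be found.

    `which` maps a command name to its path, or to None if it is absent.
    It defaults to shutil.which and can be replaced for testing.
    """
    # Git and DataLad are both needed.
    missing = [tool for tool in REQUIRED if which(tool) is None]
    # Any single container runtime is sufficient.
    if not any(which(runtime) is not None for runtime in CONTAINER_RUNTIMES):
        missing.append("one of: " + ", ".join(CONTAINER_RUNTIMES))
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing prerequisites: " + "; ".join(problems))
    else:
        print("All prerequisites found.")
```

Passing the lookup function as a parameter keeps the check testable without altering the local environment. Finding an executable does not guarantee that it works, so this is only a first-pass check.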
8 changes: 4 additions & 4 deletions publishing/common/abstract.tex
@@ -1,7 +1,7 @@
The value of research articles is increasingly contingent on the results of complex data analyses which substantiate their claims.
Compared to data production, data analysis more readily lends itself to a higher standard of both full transparency and repeated operator-independent execution.
This higher standard can be approached via fully reexecutable research outputs, which contain the entire instruction set for end-to-end generation of an entire article solely from the earliest feasible provenance point, in a programmatically executable format.
In this study, we make use of a peer-reviewed neuroimaging article which provides complete but fragile reexecution instructions, as a starting point to formulate a new reexecution system which is both robust and portable.
The value of research articles is increasingly contingent on complex data analysis results which substantiate their claims.
Compared to data production, data analysis more readily lends itself to a higher standard of transparency and repeated operator-independent execution.
This higher standard can be approached via fully reexecutable research outputs, which contain the entire instruction set for automatic end-to-end generation of an entire article from the earliest feasible provenance point.
In this study, we make use of a peer-reviewed neuroimaging article which provides complete but fragile reexecution instructions, as a starting point to draft a new reexecution system which is both robust and portable.
We render this system modular as a core design aspect, so that reexecutable article code, data, and environment specifications could potentially be substituted or adapted.
In conjunction with this system, which forms the demonstrative product of this study, we detail the core challenges with full article reexecution and specify a number of best practices which permitted us to mitigate them.
We further show how the capabilities of our system can subsequently be used to provide reproducibility assessments, both via simple statistical metrics and by visually highlighting divergent elements for human inspection.
