Proofread — typos, style, clarification, grammar.
TheChymera committed Dec 22, 2023
1 parent b0e0307 commit 81900ae
Showing 5 changed files with 76 additions and 121 deletions.
44 changes: 18 additions & 26 deletions publishing/article/background.tex
@@ -8,23 +8,21 @@ \subsection{Reexecutable Research}

%TODO yoh cite hurr-durr reproduction crisis article
Independent verification of published results is a crucial step for establishing and maintaining trust in shared scientific understanding \cite{rpp}.
The basic feasibility of \textit{de novo} research output generation from the earliest recorded data provenance is known as reexecutability, and has remained largely unexplored as a distinct phenomenon in the broader sphere of research reproducibility.
While the scope of \textit{reexecution} is narrower than that of \textit{reproduction}, it constitutes a better-defined and therefore more tractable issue in improving the quality and sustainability of research.
In all cases, reexecutability increases the feasibility of reproduction assessments.
Further, in the case of complex analysis processes with vast parameter spaces, reexecutability is a prerequisite for detailed reproducibility assessments.
Lastly, reexecution constitutes a capability in and of itself, with ample utility in education, training, and resource reuse for novel research purposes (colloquially, “hacking”) — which may accrue even in the absence of accurate result reproduction.

%TODO yoh Is there a review of people sharing their code? If not we can cite a bunch of people who brag about putting their stuff on GH
%TODO asmacdo +1 cool
%chr I could not find a review showing this, I could manually cite a bunch of papers.... but no idea if that's that helpful or just bloat our bib.
Free and Open Source Software \cite{foss} has significantly permeated the world of research, and it is presently not uncommon for researchers to publish part of the analysis instructions used in generating published results under free and open licenses.
However, such analysis instructions are commonly disconnected from the research output document, which is manually constructed from static inputs.
Notably, without fully reexecutable instructions, data analysis outputs and the positive claims which they support are not verifiably linked to the methods which generate them.

% Also cite for relevance of topic → doi:10.52294/001c.85104
Reexecutability is an emergent topic in research, with a few extant efforts attempting to provide solutions and tackle associated challenges.
Such efforts stem both from journals and independent researchers interested in the capabilities which reexecutable research processes offer to the ongoing development of their work.
Among these, an effort by the eLife journal \cite{eliferep} provides dynamic article figures based on the top-most data processing output and executable code conforming to journal standards.
NeuroLibre~\cite{neurolibre} provides a Jupyter Notebook based online platform for publishing executable books along with a selection of reexecutability assets, namely code, data, and a reexecution runtime.
Independent researcher efforts offer more comprehensive and flexible solutions, yet provide reference implementations which are either applied to comparatively simple analysis processes \cite{Dar2019} or tackle complex processes, but assume environment management capabilities which may not be widespread \cite{repsep}.

In order to optimally leverage extant efforts pertaining to full article reexecution, and to test reexecutability in the face of high task complexity, we have selected a novel neuroimaging study, identified as OPFVTA based on author naming conventions \cite{opfvta}.
@@ -40,25 +38,23 @@ \subsection{Data Analysis}
Data evaluation consists of various types of statistical modeling, commonly applied in sequence at various hierarchical steps.

The OPFVTA article, which this study uses as an example, primarily studies effective connectivity, which is resolved via stimulus-evoked neuroimaging analysis.
The stimulus-evoked paradigm is widespread across the field of neuroimaging, and thus the data analysis workflow (both in terms of \emph{data processing} and \emph{data evaluation}) provides significant analogy to numerous other studies.
The data evaluation step for this sort of study is subdivided into “level one” (i.e. within-subject) analysis, and “level two” (i.e. across-subject) analysis, with the results of the latter being further reusable for higher-level analyses \cite{Friston1995}.
In the simplest terms, these steps represent iterative applications of General Linear Modeling (GLM), at increasingly higher orders of abstraction.
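Schematically, and in generic notation rather than the notation of the reference article, the two levels can be written as
\[
Y_s = X_s \beta_s + \varepsilon_s ,
\qquad
\hat{c}_s = c^{\top} \hat{\beta}_s ,
\qquad
\hat{c} = X_g \beta_g + \eta ,
\]
where $Y_s$ is the voxel-wise time series of subject $s$, $X_s$ the corresponding design matrix, $c$ a contrast vector applied to the first-level parameter estimates $\hat{\beta}_s$, and $X_g$ the group-level design matrix modeling the vector of subject-level contrast estimates $\hat{c}$.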

% Insert and reference example workflow figure

Computationally, in the case of the OPFVTA article as well as the general case, the various data analysis workflow steps are sharply distinguished by their time cost.
By far the most expensive element is a substage of data preprocessing known as registration.
This commonly relies on iterative gradient descent and can additionally require high-density sampling depending on the feature density of the data.
The second most costly step is the first-level GLM, the cost of which emerges from the high number of voxels modeled individually for each subject and session.
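For illustration of what such a voxel-wise first-level fit involves, a minimal generic model using nilearn \cite{nilearn} could look as follows; this is a sketch rather than the SAMRI-based implementation of the reference article, and all file and condition names are placeholders:
\begin{verbatim}
# Generic first-level GLM sketch (illustrative only; not the OPFVTA/SAMRI code).
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel

# Placeholder stimulation events for one functional run.
events = pd.DataFrame({
    "onset": [30.0, 210.0, 390.0],      # seconds
    "duration": [20.0, 20.0, 20.0],
    "trial_type": ["stim", "stim", "stim"],
})

# A separate model of this kind is fit for every subject (and session),
# with the regression estimated at every voxel -- hence the time cost.
model = FirstLevelModel(t_r=1.0, hrf_model="spm")
model = model.fit("sub-01_task-stim_bold.nii.gz", events=events)

# The resulting statistical map is what feeds the second-level analysis.
z_map = model.compute_contrast("stim", output_type="z_score")
z_map.to_filename("sub-01_stim_zstat.nii.gz")
\end{verbatim}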

The impact of these time costs on reexecution is that rapid-feedback development and debugging can be stifled if the reexecution is monolithic.
While ascertaining the effect of changes in registration instructions on the final result unavoidably necessitates the reexecution of registration and all subsequent steps — editing natural-language commentary in the article text, or adapting figure styles, should not.
To this end, the reference article employs a hierarchical Bash-script structure, consisting of two steps.
The first step, consisting of data preprocessing and all data evaluation steps which operate in voxel space, is handled by one dedicated sub-script.
The second step handles document-specific element generation, i.e. inline statistics, figures, and TeX-based article generation.
The nomenclature introduced by the authors to distinguish these two phases is “low-iteration” and “high-iteration”, respectively \cite{repsep}.

Analysis dependency tracking — i.e. monitoring whether files required for the next hierarchical step have changed, and thus whether that step needs to be reexecuted — is handled for the high-iteration analysis script via the RepSeP infrastructure, but not for the low-iteration script.
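As a generic illustration of such timestamp-based dependency tracking (a minimal sketch, not the RepSeP implementation; the script and file names are placeholders), the check could look as follows:
\begin{verbatim}
# Minimal sketch of file-based dependency tracking (not the RepSeP implementation).
from pathlib import Path
import subprocess

def needs_rerun(target: Path, dependencies: list[Path]) -> bool:
    """Re-run a step if its output is missing or older than any of its inputs."""
    if not target.exists():
        return True
    return any(dep.stat().st_mtime > target.stat().st_mtime for dep in dependencies)

# Placeholder names for the two phases and their key artefacts.
if needs_rerun(Path("data/l2_statistic.nii.gz"), [Path("data/preprocessing.done")]):
    subprocess.run(["bash", "low_iteration.sh"], check=True)   # expensive, voxel-space phase
subprocess.run(["bash", "high_iteration.sh"], check=True)      # cheap, document-generation phase
\end{verbatim}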


\subsection{Software Dependency Management}
@@ -75,14 +71,14 @@ \subsection{Software Dependency Management}
This affords a homogeneous environment for dependency resolution, as specified by the Package Manager Standard \cite{pms}.
Additionally, the reference article contextualizes its raw data resource as a dependency, integrating data provision in the same network as software provision.

While the top-level ebuild (i.e. the direct software dependency requirements of the workflow) is included in the article repository and distributed alongside it, the ebuilds which specify dependencies further down the tree are all distributed via separate repositories.
These repositories are version controlled, meaning that their state at any time point is documented, and they can thus be restored to represent the environment as it would have been generated at any point in the past.
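For illustration, and with placeholder path and date values rather than ones prescribed by the reference article, restoring such a repository to its state at a given date could be scripted as follows:
\begin{verbatim}
# Sketch: restore a version-controlled ebuild repository to its state at a past date.
# The repository path and the date are placeholders.
import subprocess

REPO = "/var/db/repos/science"
DATE = "2021-01-01"

rev = subprocess.run(
    ["git", "-C", REPO, "rev-list", "-1", "--before=" + DATE, "HEAD"],
    check=True, capture_output=True, text=True,
).stdout.strip()
subprocess.run(["git", "-C", REPO, "checkout", rev], check=True)
\end{verbatim}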


\subsection{Software Dependencies}

The aforementioned infrastructure is relied upon to provide a full set of widely adopted neuroimaging tools, including but not limited to ANTs \cite{ants}, nipype \cite{nipype}, FSL \cite{fsl}, AFNI \cite{afni}, and nilearn \cite{nilearn}.
Nipype in particular provides workflow management tools, rendering the individual sub-steps of the data analysis process open to introspection and isolated reexecution.
Additionally, the OPFVTA study employs a higher-level workflow package, SAMRI \cite{samri,irsabi}, which provides workflows optimized for the preprocessing and evaluation of animal neuroimaging data.
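By way of illustration, a generic two-node nipype workflow whose cached sub-steps can be inspected and reexecuted individually could look as follows; this is a sketch rather than one of the SAMRI workflows, it assumes FSL is installed, and the file names are placeholders:
\begin{verbatim}
# Generic nipype workflow sketch (not a SAMRI workflow); assumes FSL is installed.
from nipype import Node, Workflow
from nipype.interfaces import fsl

# Each Node caches its results under the workflow base_dir, so unchanged
# sub-steps are not recomputed when the workflow is reexecuted.
skullstrip = Node(fsl.BET(in_file="sub-01_T1w.nii.gz"), name="skullstrip")
smooth = Node(fsl.IsotropicSmooth(fwhm=4.0), name="smooth")

wf = Workflow(name="minimal_preprocessing", base_dir="work")
wf.connect(skullstrip, "out_file", smooth, "in_file")
wf.run()
\end{verbatim}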


@@ -98,16 +94,12 @@ \subsection{Containers}
Container technology is widespread in industry applications, and many container images are made available via public image repositories.
While container technology has gained significant popularity specifically via the Docker toolset, it refers to an overarching effort by numerous organizations, now best represented via a Linux Foundation project, the “Open Container Initiative” (OCI).
The OCI governing body has produced an open specification for containers, which can be used by various container runtimes and toolsets.
Generally, OCI-compliant container images can be executed analogously with Docker, Podman, or other OCI compliant tools.
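As a minimal illustration of this interchangeability, the same image can be run with whichever OCI runtime happens to be installed; the image name below is a placeholder rather than an image published alongside the reference article:
\begin{verbatim}
# Sketch: run the same OCI image with whichever OCI runtime is available.
import shutil
import subprocess

IMAGE = "example.org/opfvta/reexecution:latest"  # placeholder image name

runtime = next((r for r in ("podman", "docker") if shutil.which(r)), None)
if runtime is None:
    raise RuntimeError("no OCI runtime (podman or docker) found on PATH")

# The arguments are identical regardless of the runtime chosen.
subprocess.run([runtime, "run", "--rm", IMAGE], check=True)
\end{verbatim}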

While OCI images are nearly ubiquitous in the software industry, Singularity (recently renamed to Apptainer) is a toolset that was developed specifically for high-performance computing (HPC) and tailored to research environments.
A significant adaptation of Singularity to HPC environments is its capability to run without root privileges.
However, recent advances in container technology, such as rootless operation of OCI runtimes like Podman, have provided similar capabilities.
Further, Singularity permits the conversion of OCI images into Singularity images, and recent versions of Apptainer have also added support for natively running OCI containers — thus making reuse of images between the two technologies increasingly convenient.
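Continuing the sketch above, and again with a placeholder image name, the same OCI image can be converted and run under Apptainer without root privileges:
\begin{verbatim}
# Sketch: reuse the same (placeholder) OCI image under Apptainer.
import subprocess

IMAGE = "example.org/opfvta/reexecution:latest"  # placeholder image name

# Convert the OCI image into a local SIF file, then run it.
subprocess.run(["apptainer", "build", "article.sif", "docker://" + IMAGE], check=True)
subprocess.run(["apptainer", "run", "article.sif"], check=True)
\end{verbatim}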


Container technology thus represents a solution to providing stable reusable environments for complex processes, such as the automatic generation of research articles.
In particular, containers provide a convenient way of making advanced package management solutions — as seen in the original OPFVTA article — available to users who may lack them on their host systems.
