Merge pull request #56 from con/enh-formatting
Remove trailing spaces from subsubsection titles
TheChymera authored Oct 18, 2023
2 parents 7c8ce1a + 6e339d7 commit 55dbced
1 changed file: article/results.tex (9 additions, 9 deletions)
\subsection{Best Practice Guidelines}

As part of this work we have contributed substantial changes to the original OPFVTA repository, based on which we formulate a number of best practice guidelines, highly relevant in the production of reexecutable research outputs.

\subsubsection{Errors should be fatal more often than not}
The high complexity of a full analysis workflow makes missing output data intractable, as there are many steps at which data may have been dropped.
Manually configuring each individual invocation to propagate exit codes through the Make system is cumbersome; instead, a single global mechanism should be relied on so that errors become fatal and immediately visible.
This is best accomplished by adding a \texttt{set -eu} line at the top of every Bash script for which this concern is relevant.
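The effect of this line can be sketched with a minimal, purely illustrative script: with \texttt{set -eu}, the first failing command aborts execution and the failure surfaces as a nonzero exit status instead of being silently swallowed.

```shell
# Minimal sketch: a script that begins with `set -eu` aborts at the
# first failing command instead of silently continuing.
cat > /tmp/fragile_step.sh <<'EOF'
#!/usr/bin/env bash
set -eu
echo "preprocessing started"
false                          # a failing command aborts the script here
echo "preprocessing finished"  # never reached
EOF

bash /tmp/fragile_step.sh > /tmp/fragile_step.log 2>&1 && rc=0 || rc=$?
echo "exit status: $rc"
```

Here the script exits with status 1 at the \texttt{false} line, and the final \texttt{echo} is never reached, so the caller (e.g.\ Make) sees the failure immediately.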

\subsubsection{Avoid assuming or hard-coding absolute paths to resources}
Ensuring layout compatibility across different article execution environments requires that executable resources can locate the code and data they depend on.
With hard-coded absolute paths, this would require editing executable files to match the higher-level process coordination of each environment, making article code less portable.
This problem is best avoided by parameterizing paths to external resources in the article code, allowing them to be specified dynamically or via an environment variable at execution time.
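Such parameterization can be sketched as follows; the variable name \texttt{OPFVTA\_DATA\_DIR} and the fallback location are assumptions for illustration, not names from the article code.

```shell
# Hypothetical sketch: the data location is taken from an environment
# variable (OPFVTA_DATA_DIR is an assumed name), with a fallback
# default, instead of a hard-coded absolute path.
DATA_DIR="${OPFVTA_DATA_DIR:-$HOME/.local/share/opfvta}"
echo "reading data from: $DATA_DIR"
```

A coordinating process can then relocate the data freely by exporting the variable before invoking the script.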

\subsubsection{Avoid assuming a directory context for execution}
Within the article code, resources may be linked via relative paths, i.e.\ paths resolved based on their hierarchical location with the execution base path as the reference point.
However, if the execution base is variable, as it may be if the environment is modified, these relative paths become fragile.
Relative paths may also become fragile during debugging, where executing individual scripts from their own base directory might be preferable to emulating the otherwise more common execution base.
A good way of making scripts utilising relative paths more robust is ensuring that they always set their base execution directory to their parent directory.
This can be accomplished in Bash via prepending \texttt{cd \textquotedbl\$(dirname \textquotedbl\$0\textquotedbl)\textquotedbl} to the script content.
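A small, purely illustrative demonstration of this preamble: no matter which directory the script is invoked from, it first changes into its own parent directory, so its relative paths keep resolving the same way.

```shell
# Sketch: prepend `cd "$(dirname "$0")"` so a script always executes
# from its own directory, keeping its relative paths valid.
mkdir -p /tmp/article_demo
cat > /tmp/article_demo/step.sh <<'EOF'
#!/usr/bin/env bash
set -eu
cd "$(dirname "$0")"        # jump to the script's own directory
echo "running from: $PWD"
EOF
chmod +x /tmp/article_demo/step.sh

# Invoke it from an unrelated directory; it still reports its own dir.
(cd / && /tmp/article_demo/step.sh)   # prints: running from: /tmp/article_demo
```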

\subsubsection{Hierarchical granularity greatly benefits efficiency}
The high time cost of executing the full analysis workflow makes debugging errors very time-consuming.
Ideally, it should not be necessary to re-execute the entire workflow in order to test whether an error has been resolved.
For this it is beneficial to segment the workflow into as many hierarchical steps as feasible.
This is not always possible, and is particularly not feasible for higher-level statistical analysis, where hierarchical contingency can easily be lost.
However, particularly for such steps as preprocessing, and first-level general linear modelling, this is commonly feasible, and should be done.
An easy implementation of this is having the workflow coordination system check for the presence of each hierarchical result, and, if present, proceed to the next hierarchical step.
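This check can be sketched as a small shell helper; the function name, output paths, and placeholder commands below are illustrative assumptions, not the article's actual Make rules.

```shell
# Hypothetical sketch of result-gated step execution: a step runs only
# if its expected output is missing, so a re-run resumes after the
# last completed hierarchical step.
rm -f /tmp/demo_preproc.done

run_step() {
    out_file="$1"; shift
    if [ -e "$out_file" ]; then
        echo "skipping, found: $out_file"
    else
        "$@" && touch "$out_file"   # mark the step complete on success
    fi
}

run_step /tmp/demo_preproc.done echo "running preprocessing"   # executes
run_step /tmp/demo_preproc.done echo "running preprocessing"   # skipped
```

Make implements the same idea natively via file targets and prerequisites; the helper above only makes the mechanism explicit.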

\subsubsection{Container image size should be kept small}
Because container contents are not persistent, addressing issues in a container image requires rebuilding it, which can be a time-consuming process.
The smaller the container image, the faster this rebuild cycle becomes.
In particular, when using containers, it is thus advisable to \textit{not} provide data via a package manager or via manual download inside the build script.
A suitable alternative is to assure provision of these resources outside of the container image, e.g. by bind-mounting Git (and DataLad) directories present on the host machine.
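A bind-mounting invocation can be sketched as below; the image name \texttt{opfvta-reexec} and the host directories are illustrative assumptions, not the study's actual names.

```shell
# Sketch with assumed paths and image name: data is bind-mounted from
# the host rather than baked into the image, so the image stays small
# and the data remains inspectable and updatable on the host.
DOCKER_ARGS="-v $HOME/opfvta-data:/data:ro -v $HOME/opfvta-code:/code"
echo "docker run $DOCKER_ARGS opfvta-reexec make all"
```

The \texttt{:ro} suffix mounts the data read-only, preventing the containerized workflow from mutating the version-controlled input data.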

\subsubsection{Containers should fit the scope of the underlying workflow steps}
In order to not artificially extend the workload of rebuilding a container image, it is further advisable to not create a bundled container image if separate containers can be used for separate hierarchical steps of the workflow.
Container image granularity is of course capped by article workflow granularity, but it is paramount to not compromise the former at the container level.
At a minimum, as seen in this study, the article reexecution container image should be distinct from container images required for producing a summary meta-article.

\subsubsection{Do not write debug-relevant data inside the container}
Debug-relevant data, such as intermediary processing outputs or debugging logs, should not be deleted by the workflow, and should further be written to persistent storage.
When using containers, if such data is written to a hard-coded internal path, as it would be on a persistent operating system, it will disappear once the container is removed.
This data is vital for debugging, and thus should not be lost.
This can be avoided by making sure that the paths used for intermediary and debugging outputs are bind-mounted to real directories on the parent system, from which they can be freely inspected.

\subsubsection{Parameterize scratch directories}
Complex workflows commonly generate large amounts of scratch data: intermediary processing outputs whose only utility is being read by subsequent steps.
If these data are written to a hard-coded path, multiple concurrent executions will lead to race conditions, compromising one or more execution attempts.
This is avoided by parameterizing the path and/or setting a default value based on a unique string (e.g.\ generated from a timestamp).
When using containers, this should be done at the container initiation level, as the relevant path is the path on the parent system, and not the path inside the container.
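The parameterization-with-unique-default pattern can be sketched as follows; \texttt{SCRATCH\_DIR} is an assumed variable name, and the timestamp-plus-PID default is one possible uniqueness scheme.

```shell
# Sketch: the scratch path is parameterized (SCRATCH_DIR is an assumed
# name) and defaults to a unique location derived from a timestamp and
# the process ID, so concurrent runs cannot collide.
SCRATCH_DIR="${SCRATCH_DIR:-/tmp/scratch-$(date +%Y%m%d%H%M%S)-$$}"
mkdir -p "$SCRATCH_DIR"
echo "scratch directory: $SCRATCH_DIR"
```

In a containerized setup, this snippet would run on the host before container start, and the resulting path would then be bind-mounted into the container.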

\subsubsection{Dependency versions inside container environments should be frozen as soon as feasible}
The need for image rebuilding also means that assuring functionality in view of frequent updates is more difficult inside containers than on a persistent system.
This is compounded by the frequent and API-breaking update schedules of many scientific software packages.
While dependency version freezing is not without cost in terms of assuring continued real-life functionality for an article, it is paramount that this be done as soon as all required processing capabilities are provided.
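Version freezing can be sketched as a container build-recipe fragment; the base image, package names, and version numbers below are purely illustrative assumptions, not the article's actual dependencies.

```dockerfile
# Hypothetical build-recipe excerpt: every dependency is pinned to an
# exact version, so a rebuild months later reproduces the same
# environment. Names and versions are illustrative only.
FROM python:3.11-slim
RUN pip install --no-cache-dir \
        nipype==1.8.6 \
        nibabel==5.2.1
```

Unpinned installs in the same recipe would instead resolve to whatever versions are current at rebuild time, silently changing the analysis environment.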