Commit

Update orderly2.Rmd
Fix a couple of typos
david-mears-2 authored Nov 19, 2024
1 parent fb87926 commit 5fe4bf5
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions vignettes/orderly2.Rmd
@@ -24,7 +24,7 @@ Many analyses only involve a single person and a single machine; in this case th

If you update the data, the whole pipeline should rerun. But if you update the code for the forecast, then only the forecasts and report should rerun.

- In our experience, this model works well for a single-user setting but falls over in a collaborative setting, especially where the analysis is partitioned by person; so Alice is handling the data pipeline, Bob is running fits, while Carol is organising forecasts and the final report. In this context, changes upstream affect downstream analysis, and require the same set of care around integration as you might be used to with version controlling source code.
+ In our experience, this model works well for a single-user setting but falls over in a collaborative setting, especially where the analysis is partitioned by person; so Alice is handling the data pipeline, Bob is running fits, while Carol is organising forecasts and the final report. In this context, changes upstream affect downstream analysis, and require the same sort of care around integration as you might be used to with version controlling source code.

For example, if Alice is dealing with a change in the incoming data format which is going to break the analysis at the same time that Bob is trying to get the model fits working, Bob should not be trying to integrate both his code changes *and* Alice's new data. We typically deal with this for source code by using branches within git; for the code Bob would work on a branch that is isolated from Alice's changes. But in most contexts like this you will not have (and _should not have_) the data and analysis products in git. What is needed is a way of versioning the outputs of each step of analysis and controlling when these are integrated into subsequent analyses.

@@ -80,6 +80,6 @@ Our approach flips the perspective around a bit, based on our experiences with c
We quickly found that this was impossible, but provided a few systems were in place one could be satisfied with this statement to a given level of trust in a system. So if a piece of analysis comes from a server where the primary way people run analyses is through our web front-end (currently [OrderlyWeb](https://github.com/mrc-ide/orderly-web), soon to be [Packit](https://github.com/mrc-ide/packit)) we *know* that the analysis was run end-to-end with no modification and that `orderly2` preserves inputs alongside outputs so the files that are present in the final packet **were** the files that went into the analysis, and the recorded R and package versions **were** the full set that were used.

- Because this system naturally involves running on multiple machines (typically we will have the analysts' laptops, a server and perhaps a HPC environment), and because of the way that `orderly2` treats paths, practically there is very little problem getting analyses working in multiple places, trivially satisfying the typical reproducibility aim, even though it is not what people are typically focussed on.
+ Because this system naturally involves running on multiple machines (typically we will have the analysts' laptops, a server and perhaps an HPC environment), and because of the way that `orderly2` treats paths, practically there is very little problem getting analyses working in multiple places, trivially satisfying the typical reproducibility aim, even though it is not what people are typically focussed on.

This shift in focus has proved valuable. In any analysis that is run on more than one occasion (e.g., regular reporting, or simply updating a figure for a final submission of a manuscript after revision), the outputs may change. Understanding *why* these changes have happened is important. Because `orderly2` automatically saves a lot of metadata about what was run it is easy to find out why things might have changed. Further, you can start interrogating the graph among packets to find out what effect that change has had; so find all the previously run packets that pulled in the old version of a data set, or that used the previous release of a package.
