diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..8b137891 --- /dev/null +++ b/.nojekyll @@ -0,0 +1 @@ + diff --git a/404.html b/404.html new file mode 100644 index 00000000..511502e9 --- /dev/null +++ b/404.html @@ -0,0 +1,92 @@ + + +
+ + + + +YEAR: 2021 +COPYRIGHT HOLDER: Imperial College of Science, Technology and Medicine ++ +
One of the core purposes of orderly2
is to allow
+collaborative workflows. For this to work we need to be able to find
+packets that were run by someone else, somewhere else, or to share
+packets that we have run with others. To do that, orderly2
+has the idea of “locations”.
Locations behave in a similar way to “remotes” in git
,
+and some of the terminology for interacting with them is similar
+(notably we can pull
from and perhaps push
to
+a location). However, we have used a different name because every
+orderly2
working directory has at least one location - the
+local
location - which is not remote at all. We also
+imagine that almost everyone using orderly2
will be using
+git
and as git
remotes and
+orderly2
locations refer to totally different things this
+slightly different terminology may help keep things clearer.
Packets that are distributed by locations might have been created by
+orderly2
, but may have been created by some other system
+that implements the outpack
spec (see
+vignette("metadata")
). As a result, throughout this
+documentation you will see reference to software or details to do with
+outpack
but if you only use orderly2
then you
+can read that as being synonymous with orderly2
.
orderly2
supports two types of locations by default:
path
: these are any orderly2
working copy
+that you can find on your filesystem. Note that this could be a copy
+that other people can see (for example on a network share, or on a
+cloud-synced file system such as Dropbox or OneDrive); see Sharing
+packets with collaborators using a shared file system for more
+details
http
: these require running an HTTP API, either via outpack_server
+or packit
+The location system is somewhat extensible, but the details of this +are subject to change.
+All the documentation below will behave the same regardless of where +the location is stored and the mechanism of transport, so we can focus +instead on workflows.
+Suppose we have two researchers, Alice and Bob, +who are collaborating on a piece of analysis. For the purposes of this +example (and this is the most common situation we see in practice) they share a +source tree with their analysis in git. +
+Both Alice and Bob clone this repo onto their machines, using their
+favourite git client (at the terminal with git
, via GitHub
+desktop, or as here with gert
)
At the moment, including hidden files (except .git
),
+Alice sees:
+fs::dir_tree(path_alice, all = TRUE, glob = ".git", invert = TRUE)
+## /tmp/RtmpQJXbGT/file1a4dfc14dbe/alice
+## ├── .git
+## │ ├── FETCH_HEAD
+## │ ├── HEAD
+## │ ├── config
+## │ ├── description
+## │ ├── hooks
+## │ │ └── README.sample
+## │ ├── index
+## │ ├── info
+## │ │ └── exclude
+## │ ├── logs
+## │ │ ├── HEAD
+## │ │ └── refs
+## │ │ ├── heads
+## │ │ │ └── main
+## │ │ └── remotes
+## │ │ └── origin
+## │ │ └── main
+## │ ├── objects
+## │ │ ├── 6a
+## │ │ │ └── bda73589d6fe4d944867153123cc610efb8e01
+## │ │ ├── 98
+## │ │ │ └── 6a194cd5085e4eff51c94d6e95ebb5a1264645
+## │ │ ├── cf
+## │ │ │ └── e1d005594122646d1fb581caea88c822b9ffb3
+## │ │ ├── da
+## │ │ │ └── 368e53e01c1f8f7c2572a2018dc21fb677ce87
+## │ │ ├── df
+## │ │ │ └── 2a40a4b9a055b4ea68f35e6f205963d0fb70b3
+## │ │ ├── f1
+## │ │ │ └── 516ae02d3fa109a893a0ec5fb9a6027b0930b5
+## │ │ ├── info
+## │ │ └── pack
+## │ └── refs
+## │ ├── heads
+## │ │ └── main
+## │ ├── remotes
+## │ │ └── origin
+## │ │ ├── HEAD
+## │ │ └── main
+## │ └── tags
+## ├── orderly_config.yml
+## └── src
+## └── data
+## └── data.R
At this point, after cloning Alice does not have the
+.outpack
directory (see below)
+orderly2::orderly_list_src()
+## [1] "data"
Alice needs to run orderly2::orderly_init()
first:
+orderly2::orderly_init(path_alice)
+## ✔ Created orderly root at '/tmp/RtmpQJXbGT/file1a4dfc14dbe/alice'
+## ✔ Wrote '.gitignore'
after which Alice can use orderly commands:
+
+orderly2::orderly_list_src()
+## [1] "data"
The plan is to work with Bob, sharing results on their shared “Packit” server, so Alice
+adds that to her orderly2
configuration with
+orderly2::orderly_location_add_packit()
:
+orderly2::orderly_location_add_packit("server", "http://packit.example.com")
Once done, she can run any analysis:
+
+id <- orderly2::orderly_run("data")
+## ℹ Starting packet 'data' `20241022-124926-465a973b` at 2024-10-22 12:49:26.28292
+## > orderly2::orderly_artefact("data.rds", description = "Final data")
+## > saveRDS(mtcars, "data.rds")
+## ✔ Finished running data.R
+## ℹ Finished 20241022-124926-465a973b at 2024-10-22 12:49:26.344551 (0.06163096 secs)
Perhaps it takes several goes for Alice to be happy with the +analysis, but at some point she has something ready to share. She can +then “push” the final packet up onto their server:
+
+orderly2::orderly_location_push(id, "server")
Now, consider Bob. He also needs the source code cloned, orderly +initialised, and the server added:
+
+orderly2::orderly_init(path_bob)
+## ✔ Created orderly root at '/tmp/RtmpQJXbGT/file1a4dfc14dbe/bob'
+## ✔ Wrote '.gitignore'
+orderly2::orderly_location_add_packit("server", "http://packit.example.com")
(Here, Bob has also used the name server
to refer to
+their shared server, but could have used anything; this is similar to
+git’s use of origin
.)
Bob can now query for packets available on the server:
+
+orderly2::orderly_metadata_extract(
+ name = "data",
+ allow_remote = TRUE, pull_metadata = TRUE)
+## id name parameters
+## 1 20241022-124926-465a973b data
Having seen there is a new “data” packet here, he can pull this down +locally (TODO: mrc-4414 makes this nicer):
+
+orderly2::orderly_location_pull_packet(id)
+## ℹ Looking for suitable files already on disk
+## ℹ Need to fetch 2 files (1.3 kB) from 1 location
+## ⠙ Fetching file 1/2 (95 B) from 'server' | ETA: 0s [3ms]
+## ✔ Fetched 2 files (1.3 kB) from 'server' in 25ms.
+##
Now Bob is in a position to develop against the same packet that +Alice ran (20241022-124926-465a973b).
+We have seen several broad patterns of distributing packets.
+Central server, push allowed: This is the case as +above; this is very flexible but requires that everyone is relaxed about +how packets are created as you are ultimately trusting that your +colleagues are exercising good environment hygiene. Just because rules +are not being enforced by the computer though, doesn’t mean that you +might not have ideas within the group about who can or should push. It +may be useful to only push from your HPC system after running +computationally demanding tasks there.
+Central server, final copy is run here: In this
+case, Alice and Bob can both run things on the server, and these are the
+“canonical” copies; nothing is pushed to the server. Reports can be run
+on the server by ssh
-ing onto the server and running
+packets manually (TODO: mrc-4412) or using the web interface (this was
+possible with OrderlyWeb
and will be possible again with
+packit
).
This approach works well where a primary goal is confidence that the +packets everyone works with as dependencies are created under a “clean” +environment with no unexpected global dependencies. With the web +version, you can also enforce things like only running off the default +branch. Alice and Bob will then end up with a collection of packets in +their local archives that are a mix of canonical ones (from the server) +and locally-created ones, but the local ones are always a dead end as +they never get shared with anyone. As a result, Alice and Bob may delete +their archive directories without any great concern.
+When running their own packets, to make sure that the packet will run
+in a similar way to a packet run on the server, Bob may want to ensure
+that only dependencies that could be found on the server will be
+considered. To do that, he can pass the location
argument to
+orderly_run()
, which controls how search is performed:
orderly_run(..., location = "server")
and then pass this in to orderly2::orderly_run
when
+running a report. In developing, Bob may find passing this into
+orderly2::orderly_interactive_set_search_options()
helpful
+as that will mean that any calls to
+orderly2::orderly_dependency()
will use only packets from
+the server.
Staging and production servers: You might set up +multiple servers, one or more for staging, and one for production. The +staging servers are never the source of truth and you might (or might +not) allow people to push to them, or have less strict rules about what +gets run on them (perhaps allowing running packets from a feature +branch). In this setting, users will end up with packets that were +created locally, from staging machines and from the production server in +their local archive. The staging server will probably end up with a mix +of its own packets and packets from production. The same approach with +specifying search options as above will be useful here for choosing what +is included in any final packet.
+When you pull packets from a server (with
+orderly2::orderly_location_pull_packet()
), depending on
+your settings you will either end up with just the packet that you
+request, or that packet plus all of its recursive
+dependencies. This is controlled by the configuration option
+core.require_complete_tree
, which is FALSE
by
+default; this is suitable for most users, but servers should be set with
+core.require_complete_tree = TRUE
.
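The difference between pulling a single packet and pulling a complete tree can be sketched in plain R. This is only a conceptual model, not how orderly2 implements it: the dependency map and the ids "a" to "d" are made up, and complete_tree() is a hypothetical helper.

```r
# Hypothetical dependency map: each packet id maps to the ids it depends on.
depends <- list(
  a = character(0),
  b = "a",
  c = "b",
  d = c("b", "c"))

# Collect a packet plus all of its recursive dependencies.
complete_tree <- function(id, depends) {
  seen <- character(0)
  todo <- id
  while (length(todo) > 0) {
    current <- todo[[1]]
    todo <- todo[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      todo <- c(todo, depends[[current]])
    }
  }
  seen
}

# Model of core.require_complete_tree = FALSE: only the requested packet
needed_default <- "d"
# Model of core.require_complete_tree = TRUE: the packet and everything upstream
needed_complete <- complete_tree("d", depends)
```

Here needed_complete contains all four ids, whereas needed_default contains only "d".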
In most of the situations above, the users are never holding the +canonical source of the packets - so it does not matter if they don’t +have all of their dependencies. They do hold all the metadata for all +packets (again, recursively) and could download the dependencies later +if they wanted to inspect them. Typically though, this would just be +extra space used up on disk.
+When you push a packet to a server, it always pushes the complete +tree.
+When pulling packets, first the user pulls metadata from the server.
+When orderly2::orderly_location_pull_metadata
is run, all
+metadata from the server is updated on the client (mrc-4444 will make
+this lazier). We now know everything the server holds, and the hashes of
+all the files contained in each packet.
It is very likely that we already know of some of the files within a
+packet being requested, so the client first looks to see if these exist
+within its archive, verifying that they have not been changed. It then
+requests from the server all the files that it cannot resolve locally.
+If doing a recursive pull (with
+core.require_complete_tree = TRUE
) then we look at the
+union over all files in the set of packets that are currently missing
+from our archive. This is frequently much less than the full set of
+files within the packets.
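The file-resolution step described above amounts to a set difference over file hashes. The following is a minimal sketch of that idea in plain R, assuming made-up packet names and hashes; files_to_fetch() is a hypothetical helper and not part of orderly2.

```r
# Hypothetical metadata: the file hashes listed by each packet
packet_files <- list(
  p1 = c("sha256:aaa", "sha256:bbb"),
  p2 = c("sha256:bbb", "sha256:ccc"))

# Hashes of files that already exist (and have been verified) locally
local_files <- c("sha256:bbb")

# Union of the files over the requested packets, minus what we already hold
files_to_fetch <- function(ids, packet_files, local_files) {
  wanted <- unique(unlist(packet_files[ids], use.names = FALSE))
  setdiff(wanted, local_files)
}

single <- files_to_fetch("p1", packet_files, local_files)
several <- files_to_fetch(c("p1", "p2"), packet_files, local_files)
```

Because the shared file "sha256:bbb" is already present locally, it is never requested, and a file that appears in several packets is only counted once.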
The algorithm is similar for a push:
+This multi-step process means that we avoid copying data that is +already known about, as well as avoiding moving the same data multiple +times. It also means that if the process of pushing (or pulling) is +interrupted it can be safely resumed as the state is always +consistent.
+.outpack
+It is important that the .outpack
directory is
+not shared via git; we warn about this now, and you can use
+orderly2::orderly_gitignore_update()
to automatically create
+a suitable .gitignore
file that will prevent it being
+accidentally committed.
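If you want a quick sanity check of your own, something like the following could be used to see whether a .gitignore mentions .outpack. This is a rough sketch: outpack_is_ignored() is a hypothetical helper, and it only looks for a few literal lines rather than applying git's full pattern-matching rules.

```r
# Hypothetical helper: TRUE if the .gitignore in `root` has a line that
# literally names the .outpack directory
outpack_is_ignored <- function(root) {
  path <- file.path(root, ".gitignore")
  if (!file.exists(path)) {
    return(FALSE)
  }
  lines <- trimws(readLines(path, warn = FALSE))
  any(lines %in% c(".outpack", ".outpack/", "/.outpack", "/.outpack/"))
}

# Try it on a temporary directory, before and after writing a .gitignore
root <- tempfile()
dir.create(root)
before <- outpack_is_ignored(root)
writeLines(c("archive", ".outpack"), file.path(root, ".gitignore"))
after <- outpack_is_ignored(root)
```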
However,
+If Alice and Bob were starting on a new machine they would:
+orderly2::orderly_init()
from within the
+directory
orderly2
commands as normal
Without running orderly2::orderly_init()
, they will get
+an error prompting them to initialise the repo.
There is no requirement for different repositories to share
+configuration options passed to orderly2::orderly_init()
;
+so you can enable or disable the file store or have different sets of
+locations enabled, depending on your workflows. This means that there
+is, in practice, only an association by convention between a
+set of related orderly2
locations and their source tree and
+you can have an orderly2
repository that points at
+locations that refer to different source trees!
One of the simplest ways to share packets with a collaborator is +through a shared file system, for example on a network share, or on a +cloud-synced file system such as Dropbox or OneDrive. You can do this as +follows.
Create a new folder for your orderly remote in the shared file
+system. Here Alice has synced to
+path_sharepoint_alice
.
Initialise an orderly location on the shared file system
+
+orderly2::orderly_init(
+ root = path_sharepoint_alice,
+ path_archive = NULL,
+ use_file_store = TRUE,
+ require_complete_tree = TRUE
+)
+## ✔ Created orderly root at '/tmp/RtmpQJXbGT/file1a4dfc14dbe/sharepoint'
Create an orderly store with a file store and a complete tree. See
+orderly2::orderly_init()
for more details.
Add this as a location
+
+orderly2::orderly_location_add_path("sharepoint", path = path_sharepoint_alice)
Push any packets you want to share
+
+orderly2::orderly_location_push(id, "sharepoint")
Then these will be available for your collaborator to pull. Note that
+the data is pushed into a file store in the .outpack
+directory. .dot
files are considered hidden by your
+operating system, so if you have “show hidden files” off in your file
+browser you will not see the pushed packet. But your collaborator can
+now pull the shared file. To do so, they will have to:
Sync the same drive to a location on their machine. Here Bob has
+synced to path_sharepoint_bob
Add the location
+
+orderly2::orderly_location_add_path("alices_orderly", path_sharepoint_bob)
Pull the metadata and use the packets as desired
+
+orderly2::orderly_location_pull_metadata("alices_orderly")
One of the core aims of orderly2
is to allow
+collaborative analysis; to do this the end of one piece of work is an
+input for another piece of work, perhaps someone else’s. To make this
+work in practice, one orderly2
report can “depend” on some
+completed packet (or several completed packets) in order to pull in
+files as inputs.
There are two levels at which it is useful to think about +dependencies:
+This perspective differs somewhat from workflow managers where it is +common to talk about “outdated dependencies” and have some single idea +of an end result that a chain of dependencies builds up to.
+This vignette walks through some of the practical issues around
+creating and working with dependencies between reports, starting from
+simple cases (these will be familiar to users of orderly1
)
+through to more advanced cases. We then cover how to interrogate the
+dependency graph and our ideas for extending this in future, and some
+practical issues around how dependencies interact with different
+locations (there is some overlap here with
+vignette("collaboration")
, which we will highlight).
Here, we show how to practically use dependencies in a few common +scenarios of increasing complexity. The code examples are purposefully +too-simple in order to keep the presentation straightforward, see the +end of this document for a discussion of how complex these pieces of +code might “optimally” be.
+The primary mechanism for using dependencies is to call
+orderly2::orderly_dependency()
from within an orderly file;
+this finds a suitable completed packet and copies files that are found
+from within that packet into your current report.
## src
+## ├── analysis
+## │ └── analysis.R
+## └── data
+## ├── data.R
+## └── data.csv
+and src/analysis/analysis.R
contains:
+orderly2::orderly_dependency("data", "latest()", "data.rds")
+d <- readRDS("data.rds")
+png("analysis.png")
+plot(y ~ x, d)
+dev.off()
Here, we’ve used orderly2::orderly_dependency()
to pull
+in the file data.rds
from the most recent version
+(latest()
) of the data
packet, then we’ve used
+that file as normal to make a plot, which we’ve saved as
+analysis.png
(this is very similar to the example from
+vignette("introduction")
, to get us started).
+id1 <- orderly2::orderly_run("data")
+## ℹ Starting packet 'data' `20241022-124929-be678140` at 2024-10-22 12:49:29.748093
+## > d <- read.csv("data.csv")
+## > d$z <- resid(lm(y ~ x, d))
+## > saveRDS(d, "data.rds")
+## ✔ Finished running data.R
+## ℹ Finished 20241022-124929-be678140 at 2024-10-22 12:49:29.784305 (0.03621149 secs)
+id2 <- orderly2::orderly_run("analysis")
+## ℹ Starting packet 'analysis' `20241022-124929-d0f4f0bc` at 2024-10-22 12:49:29.820493
+## > orderly2::orderly_dependency("data", "latest()", "data.rds")
+## ℹ Depending on data @ `20241022-124929-be678140` (via latest(name == "data"))
+## > d <- readRDS("data.rds")
+## > png("analysis.png")
+## > plot(y ~ x, d)
+## > dev.off()
+## agg_png
+## 2
+## ✔ Finished running analysis.R
+## ℹ Finished 20241022-124929-d0f4f0bc at 2024-10-22 12:49:29.957371 (0.1368778 secs)
When we look at the metadata for the packet created from the
+analysis
report, we can see it has used
+20241022-124929-be678140
as its dependency:
+orderly2::orderly_metadata(id2)$depends
+## packet query files
+## 1 20241022-124929-be678140 latest(name == "data") data.rds....
(indeed it had to, there is only one copy of the data
+packet to pick from).
In the above example, our query was as simple as it could be — the
+most recently created packet with the name data
. One common
+pattern we see is that an analysis might have a parameter (for example a
+country name) and a downstream analysis might share that parameter and
+want to pull in data for a country.
## src
+## ├── analysis
+## │ └── analysis.R
+## └── data
+## └── data.R
+with src/data/data.R
containing:
+orderly2::orderly_parameters(cyl = NULL)
+d <- mtcars[mtcars$cyl == cyl, ]
+saveRDS(d, "data.rds")
We can run this for several values of cyl
:
+orderly2::orderly_run("data", list(cyl = 4))
+## ℹ Starting packet 'data' `20241022-124930-55c408eb` at 2024-10-22 12:49:30.33949
+## ℹ Parameters:
+## • cyl: 4
+## > orderly2::orderly_parameters(cyl = NULL)
+## > d <- mtcars[mtcars$cyl == cyl, ]
+## > saveRDS(d, "data.rds")
+## ✔ Finished running data.R
+## ℹ Finished 20241022-124930-55c408eb at 2024-10-22 12:49:30.37047 (0.03098011 secs)
+## [1] "20241022-124930-55c408eb"
+orderly2::orderly_run("data", list(cyl = 6))
+## ℹ Starting packet 'data' `20241022-124930-63fc3cbb` at 2024-10-22 12:49:30.39806
+## ℹ Parameters:
+## • cyl: 6
+## > orderly2::orderly_parameters(cyl = NULL)
+## > d <- mtcars[mtcars$cyl == cyl, ]
+## > saveRDS(d, "data.rds")
+## ✔ Finished running data.R
+## ℹ Finished 20241022-124930-63fc3cbb at 2024-10-22 12:49:30.425681 (0.02762103 secs)
+## [1] "20241022-124930-63fc3cbb"
+orderly2::orderly_run("data", list(cyl = 8))
+## ℹ Starting packet 'data' `20241022-124930-72a0f2b1` at 2024-10-22 12:49:30.452069
+## ℹ Parameters:
+## • cyl: 8
+## > orderly2::orderly_parameters(cyl = NULL)
+## > d <- mtcars[mtcars$cyl == cyl, ]
+## > saveRDS(d, "data.rds")
+## ✔ Finished running data.R
+## ℹ Finished 20241022-124930-72a0f2b1 at 2024-10-22 12:49:30.482232 (0.03016353 secs)
+## [1] "20241022-124930-72a0f2b1"
Our follow-on analysis contains:
+
+orderly2::orderly_parameters(cyl = NULL)
+orderly2::orderly_dependency(
+ "data",
+ "latest(parameter:cyl == this:cyl)",
+ "data.rds")
+d <- readRDS("data.rds")
+png("analysis.png")
+plot(mpg ~ disp, d)
+dev.off()
Here the query latest(parameter:cyl == this:cyl)
says
+“find the most recent packet where its parameter “cyl”
+(parameter:cyl
) is the same as the parameter in the
+currently running report (this:cyl
).
+orderly2::orderly_run("analysis", list(cyl = 4))
+## ℹ Starting packet 'analysis' `20241022-124930-a32b4c27` at 2024-10-22 12:49:30.64182
+## ℹ Parameters:
+## • cyl: 4
+## > orderly2::orderly_parameters(cyl = NULL)
+## > orderly2::orderly_dependency(
+## + "data",
+## + "latest(parameter:cyl == this:cyl)",
+## + "data.rds")
+## ℹ Depending on data @ `20241022-124930-55c408eb` (via latest(parameter:cyl == this:cyl && name == "data"))
+## > d <- readRDS("data.rds")
+## > png("analysis.png")
+## > plot(mpg ~ disp, d)
+## > dev.off()
+## agg_png
+## 2
+## ✔ Finished running analysis.R
+## ℹ Finished 20241022-124930-a32b4c27 at 2024-10-22 12:49:30.700454 (0.05863333 secs)
+## [1] "20241022-124930-a32b4c27"
If your query fails to resolve a candidate it will error:
+
+orderly2::orderly_run("analysis", list(cyl = 9000))
+## ℹ Starting packet 'analysis' `20241022-124930-c9ed80d5` at 2024-10-22 12:49:30.793096
+## ℹ Parameters:
+## • cyl: 9000
+## > orderly2::orderly_parameters(cyl = NULL)
+## > orderly2::orderly_dependency(
+## + "data",
+## + "latest(parameter:cyl == this:cyl)",
+## + "data.rds")
+## ✖ Error running analysis.R
+## ℹ Finished 20241022-124930-c9ed80d5 at 2024-10-22 12:49:30.860223 (0.06712675 secs)
+## Error in `orderly2::orderly_run()`:
+## ! Failed to run report
+## Caused by error in `outpack_packet_use_dependency()`:
+## ! Failed to find packet for query 'latest(parameter:cyl == this:cyl &&
+## name == "data")'
+## ℹ See 'rlang::last_error()$explanation' for details
The error message here tries to be fairly self explanatory; we have
+failed to find a packet that satisfies our
+query latest(parameter:cyl == this:cyl && name == "data")
;
+note that the report name data
has become part of this
+query, so there are two conditions being matched on.
The error suggests running
+rlang::last_error()$explanation
for more information, which
+we can do:
+rlang::last_error()$explanation
+## Evaluated query: 'latest(A && B)' and found 0 packets
+## • A (parameter:cyl == this:cyl): 0 packets
+##
+## • B (name == "data"): 3 packets
This is an orderly_query_explain
object, which tries to
+come up with reasons why your query might not have matched; we’ll expand
+this in the future so let us know what you might like to see.
This tells you that your query can be decomposed into two subqueries
+A
(the match against the parameter cyl
being
+9000), which matched no packets and B
(the match against
+the packet name being data
), which matched 3 packets. If
+each subquery matches packets but some pairs don’t then it will
+try to guide you towards problematic pairs.
You can also ask orderly2
to explain any query for
+you:
+orderly2::orderly_query_explain(
+ quote(latest(parameter:cyl == 9000)), name = "data")
+## Evaluated query: 'latest(A && B)' and found 0 packets
+## • A (parameter:cyl == 9000): 0 packets
+##
+## • B (name == "data"): 3 packets
If you save this object you can explore it in more detail:
+
+explanation <- orderly2::orderly_query_explain(
+ quote(latest(parameter:cyl == 9000)), name = "data")
+explanation$parts$B
+## $name
+## [1] "B"
+##
+## $str
+## [1] "name == \"data\""
+##
+## $expr
+## name == "data"
+##
+## $n
+## [1] 3
+##
+## $found
+## [1] "20241022-124930-55c408eb" "20241022-124930-63fc3cbb"
+## [3] "20241022-124930-72a0f2b1"
(this would have worked with
+rlang::last_error()$explanation$parts$A
too).
You can also use orderly2::orderly_metadata_extract
to
+work out what values you might have looked for:
+orderly2::orderly_metadata_extract(
+ name = "data",
+ extract = c(cyl = "parameters.cyl is number"))
+## id cyl
+## 1 20241022-124930-55c408eb 4
+## 2 20241022-124930-63fc3cbb 6
+## 3 20241022-124930-72a0f2b1 8
Above we saw two types of filtering candidates: latest()
+selected the most recent packet while
+latest(parameter:cyl == this:cyl)
found a packet whose
+parameter matched one of our parameters.
We could have used latest(parameter:cyl == 4)
to hard
+code in a specific parameter value, and used
+latest(parameter:cyl == environment:cyl)
to match against
+whatever value cyl
took in the evaluating environment.
Instead of a query, you can provide a single id (e.g.,
+20241022-124930-a32b4c27
), which would mean that even as
+new copies of the data
packet are created, this dependency
+will always resolve to the same value.
You can chain together logical operations with
+&&
(both sides must be true) or ||
+(either side must be true), and group conditions with parentheses. In
+addition to ==
, the usual complement of comparison
+operators will work. So you might have complex queries like
+latest((parameter:x == 1 || parameter:x == 2) && parameter:y > 10)
but in practice most people have queries that are a series of
+restrictions with &&
.
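One way to build intuition for these queries is to mimic them on a plain data frame. The sketch below is only a conceptual model and not how orderly2 evaluates queries: the data frame copies the ids and cyl values shown earlier, and latest() is a hypothetical stand-in that picks the highest id among the matches.

```r
# Packet metadata, copied from the orderly_metadata_extract() output above
packets <- data.frame(
  id = c("20241022-124930-55c408eb",
         "20241022-124930-63fc3cbb",
         "20241022-124930-72a0f2b1"),
  name = "data",
  cyl = c(4, 6, 8))

# Hypothetical stand-in for latest(): the highest id, or NA if nothing matched
latest <- function(ids) {
  if (length(ids) == 0) NA_character_ else max(ids)
}

is_data <- packets$name == "data"

# Model of latest(parameter:cyl == 4 && name == "data")
one <- latest(packets$id[packets$cyl == 4 & is_data])

# Model of latest((parameter:cyl == 4 || parameter:cyl == 6) && name == "data")
either <- latest(packets$id[(packets$cyl == 4 | packets$cyl == 6) & is_data])

# Model of latest(parameter:cyl == 9000 && name == "data"): no candidates
none <- latest(packets$id[packets$cyl == 9000 & is_data])

# Per-subquery match counts, in the spirit of orderly_query_explain()
counts <- c(A = sum(packets$cyl == 9000), B = sum(is_data))
```

The counts here (0 for A and 3 for B) line up with the explanation printed earlier for the failing query.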
One common pattern is the map-reduce pattern over a set of orderly
+reports. With this, a set of packets are created over a vector of
+parameters, or perhaps a chain of different reports for each parameter,
+then they are all combined together. For some parameter p
+that takes values “x”, “y” and “z”, this might look like:
B(p = "x") -- C(p = "x")
+ / \
+A - B(p = "y") -- C(p = "y") - D
+ \ /
+ B(p = "z") -- C(p = "z")
+So here, D will want to combine all of the three copies of the
+C
packet, one for each of p
as “x”, “y” and
+“z”.
Especially if there are only three values and these are hard coded, +you might just write it out as
+
+orderly2::orderly_dependency("C", quote(latest(parameter:p == "x")),
+ c("data/x.rds" = "result.rds"))
+orderly2::orderly_dependency("C", quote(latest(parameter:p == "y")),
+ c("data/y.rds" = "result.rds"))
+orderly2::orderly_dependency("C", quote(latest(parameter:p == "z")),
+ c("data/z.rds" = "result.rds"))
Note here that in each call we vary the second argument to select a
+different parameter value, and in the third argument we are naming our
+destination file a different name (so we end up with three files in
+data/
).
You can write this out as a for
loop:
+for (p in c("x", "y", "z")) {
+ orderly2::orderly_dependency("C", quote(latest(parameter:p == environment:p)),
+ c("data/${p}.rds" = "result.rds"))
+}
Here, in the second argument we use environment:p
to
+fetch the value of p
from the calling environment - this is
+the looping value so will take all three values. In the name of the
+third argument, we use the special interpolation format
+${p}
to substitute in the value of p
to build
+a filename.
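To see which destination filenames the loop produces, the substitution can be imitated in base R. This is just an illustration of the resulting names using gsub(); it is not the mechanism orderly2 itself uses for interpolation.

```r
# The destination template used in the loop above
template <- "data/${p}.rds"

# Substitute each looping value of p into the template
dest <- vapply(
  c("x", "y", "z"),
  function(p) gsub("${p}", p, template, fixed = TRUE),
  character(1),
  USE.NAMES = FALSE)
```

Each iteration therefore copies result.rds from a different C packet to its own file under data/.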
By default, any packet that you have unpacked on your local archive
+is considered a candidate for inclusion by
+orderly_dependency()
. This is not always what you want.
The locations that are selected, and the packets within them that are
+considered as candidates can be controlled by the
+search_options
argument to
+orderly2::orderly_run
(note that the argument is to
+orderly_run()
, not to orderly_dependency()
+because this is an effect controlled by the runner of the
+report, not the writer of the report).
There are three components here that affect how packets are +selected
+location
: is a character vector of locations, matching
+your location names. Only packets that can be found at these locations
+will be considered. So if you have a mix of locally created packets as
+well as ones that other people can see, specifying
+location = "server"
would limit to packets that are
+available on the server, which means that you will end up with
+dependencies that your colleagues would also get.
allow_remote
: controls if we are willing to download
+files from a location in order to satisfy a dependency. If
+TRUE
, then when you run the report, it might download files
+if more recent packets are available on a location than what you have
+locally.
pull_metadata
: only has an effect if
+allow_remote
is also TRUE
; this causes the
+metadata to be refreshed before dependency resolution.
There is further discussion of the details in
+?orderly_run
If you are used to systems like targets
, it is easy to
+make reports smaller than they need to be. There’s no real need to make
+these very small, and picking the right size is a challenge.
If they are too small, you’ll end up writing a lot of code to +orchestrate running different reports and pulling things together. +You’ll end up spending a lot of time worrying about whether things are “up to date” +with one another because really a group of things always wants to run +together.
+If they’re too big then you might end up doing more work than you +want to do, because in order to make a change to part of a piece of +analysis you must run the whole thing again.
+This vignette provides a how-to style introduction to
+orderly2
, an overview of key ingredients to writing orderly
+reports, and a summary of key features and ideas. It may be useful to
+look at vignette("orderly2")
for a more roundabout
+discussion of what orderly2
is trying to achieve, or
+vignette("migrating")
if you are familiar with version 1 of
+orderly as this explains concepts in terms of differences from the
+previous version.
+install.packages(
+ "orderly2",
+ repos = c("https://mrc-ide.r-universe.dev", "https://cloud.r-project.org"))
The first step is to initialise an empty orderly2
+repository. An orderly2
repository is a directory with the
+file orderly_config.yml
within it, and since version 2 also
+a directory .outpack/
. Files within the
+.outpack/
directory should never be directly modified by
+users and this directory should be excluded from version control (see
+orderly2::orderly_gitignore_update
).
Create an orderly2 repository by calling
+orderly2::orderly_init()
:
+path <- tempfile() # we'll use a temporary directory here - see note below
+orderly2::orderly_init(path)
+## ✔ Created orderly root at '/tmp/RtmpwywFDU/file1ad71149fd79'
which creates a few files:
+## .
+## ├── .outpack
+## │ ├── config.json
+## │ ├── location
+## │ └── metadata
+## └── orderly_config.yml
+This step should be performed on a completely empty directory,
+otherwise an error will be thrown. Later, you will re-initialise an
+orderly2
repository when cloning to a new machine, such as
+when working with others; this is discussed in
+vignette("collaboration")
.
The orderly_config.yml
file contains very little by
+default:
For this vignette, the created orderly root is in R’s per-session
+temporary directory, which will be deleted once R exits. If you want to
+use a directory that will persist across restarting R (which you would
+certainly want when using orderly2
on a real project!) you
+should replace this with a path within your home directory, or other
+location that you control.
For the rest of the vignette we will evaluate commands from within +this directory, by changing the directory to the path we’ve created:
+
+setwd(path)
An orderly report is a directory src/<name>
+containing an orderly file <name>.R
. That file may
+have special commands in it, but for now we’ll create one that is as
+simple as possible; we’ll create some random data and save it to disk.
+This seems silly, but imagine this standing in for something like:
Our directory structure (ignoring .outpack
) looks
+like:
## .
+## ├── orderly_config.yml
+## └── src
+## └── incoming_data
+## ├── data.csv
+## └── incoming_data.R
+and src/incoming_data/incoming_data.R
contains:
To run the report and create a new packet, use
+orderly2::orderly_run()
:
+id <- orderly2::orderly_run("incoming_data")
+## ℹ Starting packet 'incoming_data' `20241022-124934-262628cf` at 2024-10-22 12:49:34.155388
+## > d <- read.csv("data.csv")
+## > d$z <- resid(lm(y ~ x, d))
+## > saveRDS(d, "data.rds")
+## ✔ Finished running incoming_data.R
+## ℹ Finished 20241022-124934-262628cf at 2024-10-22 12:49:34.21933 (0.06394196 secs)
+id
+## [1] "20241022-124934-262628cf"
The id
that is created is a new identifier for the
+packet that will be both unique among all packets (within reason) and
+chronologically sortable. A packet that has an id that sorts after
+another packet’s id was started after that packet.
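The sortable property can be checked with ids that appear in this documentation. The sketch below assumes, based on the ids and start times shown in the examples, that the first 15 characters encode the date and time to the second; treating that time as UTC is a further assumption.

```r
# Three ids taken from examples in this documentation, deliberately shuffled
ids <- c("20241022-124930-63fc3cbb",
         "20241022-124929-be678140",
         "20241022-124930-55c408eb")

# Plain string sorting (radix sort uses C-locale ordering) gives time order
sorted <- sort(ids, method = "radix")

# Assumption: the leading "YYYYMMDD-HHMMSS" part is the start time
started <- as.POSIXct(substr(sorted, 1, 15),
                      format = "%Y%m%d-%H%M%S", tz = "UTC")
```

After sorting, the packet started at 12:49:29 comes first, followed by the two started at 12:49:30.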
Having run the report, our directory structure looks like:
+## .
+## ├── archive
+## │ └── incoming_data
+## │ └── 20241022-124934-262628cf
+## │ ├── data.csv
+## │ ├── data.rds
+## │ └── incoming_data.R
+## ├── draft
+## │ └── incoming_data
+## ├── orderly_config.yml
+## └── src
+## └── incoming_data
+## ├── data.csv
+## └── incoming_data.R
+A few things have changed here:
+data.rds
; see the script above)
incoming_data.R
and data.csv
, the original
+input that have come from our source treedraft/incoming_data
which
+was created when orderly ran the report in the first placeIn addition, quite a few files have changed within the
+.outpack
directory, but these are not covered here.
That’s it! Notice that the initial script is just a plain R script,
+and you can develop it interactively from within the
+src/incoming_data
directory. Note however, that any paths
+referred to within will be relative to src/incoming_data
+and not the orderly repository root. This is important
+as all reports only see the world relative to their
+incoming_data.R
file.
Once created, you can then refer to this report by id and pull its
+files wherever you need them, both in the context of another orderly
+report or just to copy to your desktop to email someone. For example, to
+copy the file data.rds
that we created to some location
+outside of orderly’s control you could do
+dest <- tempfile()
+fs::dir_create(dest)
+orderly2::orderly_copy_files(id, files = c("final.rds" = "data.rds"),
+ dest = dest)
which copies data.rds
to some new temporary directory
+dest
with name final.rds
. This uses
+orderly2
’s outpack_
functions, which are
+designed to interact with outpack archives regardless of how they were
+created (orderly2
is a program that creates
+outpack
archives). Typically these are lower-level than
+orderly_
functions.
Creating a new dataset is mostly useful if someone else can use it. +To do this we introduce the first of the special orderly commands that +you can use from an orderly file
+The src/
directory now looks like:
## src
+## ├── analysis
+## │ └── analysis.R
+## └── incoming_data
+## ├── data.csv
+## └── incoming_data.R
+and src/analysis/analysis.R
contains:
+orderly2::orderly_dependency("incoming_data", "latest()",
+ c("incoming.rds" = "data.rds"))
+d <- readRDS("incoming.rds")
+png("analysis.png")
+plot(y ~ x, d)
+dev.off()
Here, we’ve used orderly2::orderly_dependency()
to pull
+in the file data.rds
from the most recent version
+(latest()
) of the incoming_data
packet with the
+filename incoming.rds
, then we’ve used that file as normal
+to make a plot, which we’ve saved as analysis.png
.
We can run this just as before, using
+orderly2::orderly_run()
:
+id <- orderly2::orderly_run("analysis")
+## ℹ Starting packet 'analysis' `20241022-124934-ad20b612` at 2024-10-22 12:49:34.680723
+## > orderly2::orderly_dependency("incoming_data", "latest()",
+## + c("incoming.rds" = "data.rds"))
+## ℹ Depending on incoming_data @ `20241022-124934-262628cf` (via latest(name == "incoming_data"))
+## > d <- readRDS("incoming.rds")
+## > png("analysis.png")
+## > plot(y ~ x, d)
+## > dev.off()
+## agg_png
+## 2
+## ✔ Finished running analysis.R
+## ℹ Finished 20241022-124934-ad20b612 at 2024-10-22 12:49:34.766681 (0.08595872 secs)
For more information on dependencies, see
+vignette("dependencies")
.
The function orderly2::orderly_dependency()
is one of several functions designed
+to operate while the packet runs. These functions all act by adding
+metadata to the final packet, and perhaps by copying files into the
+directory.
orderly2::orderly_description()
: Provide a longer name
+and description for your report; this can be reflected in tooling that
+uses orderly metadata to be much more informative than your short
+name.
orderly2::orderly_parameters()
: Declares parameters
+that can be passed in to control the behaviour of the report. Parameters
+are key-value pairs of simple data (booleans, numbers, strings) which
+your report can respond to. They can also be used in queries to
+orderly2::orderly_dependency()
to find packets that satisfy
+some criteria.
orderly2::orderly_resource()
: Declares that a file is a
+resource; a file that is an input to the report, and which
+comes from this source directory. By default, orderly treats all files
+in the directory as a resource, but it can be useful to mark these
+explicitly, and necessary to do so in “strict mode” (see below). Files
+that have been marked as a resource are immutable and
+may not be deleted or modified.
orderly2::orderly_shared_resource()
: Copies a file from
+the “shared resources” directory shared/
, which can be data
+files or source code located at the root of the orderly repository. This
+can be a reasonable way of sharing data or commonly used code among
+several reports.
orderly2::orderly_artefact()
: Declares that a file (or
+set of files) will be created by this report, before it is even run.
+Doing this makes it easier to check that the report behaves as expected
+and can allow reasoning about what a related set of reports will do
+without running them. By declaring something as an artefact (especially
+in conjunction with “strict mode”) it is also easier to clean up
+src
directories that have been used in interactive
+development (see below).
orderly2::orderly_dependency()
: Copy files from one
+packet into this packet as it runs, as seen above.
orderly2::orderly_strict_mode()
: Declares that this
+report will be run in “strict mode” (see below).
In addition, there is also a function
+orderly2::orderly_run_info()
that can be used while running
+a report that returns information about the currently running report
+(its id, resolved dependencies etc).
Let’s add some additional annotations to the previous reports:
+
+orderly2::orderly_strict_mode()
+orderly2::orderly_resource("data.csv")
+orderly2::orderly_artefact("Processed data", "data.rds")
+
+d <- read.csv("data.csv")
+d$z <- resid(lm(y ~ x, d))
+saveRDS(d, "data.rds")
Here, we’ve added a block of special orderly commands; these could go
+anywhere, for example above the files that they refer to. If strict mode
+is enabled (see below) then orderly2::orderly_resource
+calls must go before the files are used as they will only be made
+available at that point (see below).
+id <- orderly2::orderly_run("incoming_data")
+## ℹ Starting packet 'incoming_data' `20241022-124934-fbb8d21d` at 2024-10-22 12:49:34.987005
+## > orderly2::orderly_strict_mode()
+## > orderly2::orderly_resource("data.csv")
+## > orderly2::orderly_artefact("Processed data", "data.rds")
+## Warning: Please use a named argument for the description in 'orderly_artefact()'
+## In future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## > d <- read.csv("data.csv")
+## > d$z <- resid(lm(y ~ x, d))
+## > saveRDS(d, "data.rds")
+## ✔ Finished running incoming_data.R
+## ! 1 warning found:
+## • Please use a named argument for the description in 'orderly_artefact()' In
+## future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## ℹ Finished 20241022-124934-fbb8d21d at 2024-10-22 12:49:35.036081 (0.04907632 secs)
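The warning in the output above asks for the description to be passed by name. A sketch of the same declaration written that way, assuming the argument names `description` and `files` quoted in the warning text:

```r
# Naming both arguments keeps the call valid even if the argument order changes
orderly2::orderly_artefact(description = "Processed data", files = "data.rds")
```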
Much of the flexibility that comes from the orderly graph comes from +using parameterised reports; these are reports that take a set of +parameters and then change behaviour based on these parameters. +Downstream reports can depend on a parameterised report and filter based +on suitable parameters.
+For example, consider a simple report where we generate samples based +on some parameter:
+
+orderly2::orderly_parameters(n_samples = 10)
+x <- seq_len(n_samples)
+d <- data.frame(x = x, y = x + rnorm(n_samples))
+saveRDS(d, "data.rds")
This creates a report that has a single parameter
+n_samples
with a default value of 10. We could have
+used
+orderly2::orderly_parameters(n_samples = NULL)
to define a parameter with no default, or defined multiple parameters +with
+
+orderly2::orderly_parameters(n_samples = 10, distribution = "normal")
You can do anything in your report that switches on the value of a +parameter:
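As an illustration, a report that declares the `distribution` parameter shown above might branch on its value as follows. This is a minimal sketch rather than the report used in this vignette: the `"uniform"` option is an assumption, and the two assignments at the top stand in for values that `orderly2::orderly_parameters()` would normally provide.

```r
# Stand-ins for values that orderly2::orderly_parameters() would provide
n_samples <- 10
distribution <- "normal"

# Choose the noise generator from the parameter value; fail on anything else
noise <- switch(distribution,
                normal = rnorm(n_samples),
                uniform = runif(n_samples),
                stop(sprintf("Unknown distribution '%s'", distribution)))

x <- seq_len(n_samples)
d <- data.frame(x = x, y = x + noise)
```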
+However, you should see parameters as relatively heavyweight things +and try to have a consistent set over all packets created from a report. +In this report we use it to control the size of the generated data +set.
+
+id <- orderly2::orderly_run("random", list(n_samples = 15))
+## ℹ Starting packet 'random' `20241022-124935-41a98cb6` at 2024-10-22 12:49:35.260934
+## ℹ Parameters:
+## • n_samples: 15
+## > orderly2::orderly_parameters(n_samples = 10)
+## > x <- seq_len(n_samples)
+## > d <- data.frame(x = x, y = x + rnorm(n_samples))
+## > saveRDS(d, "data.rds")
+## ✔ Finished running random.R
+## ℹ Finished 20241022-124935-41a98cb6 at 2024-10-22 12:49:35.291909 (0.03097415 secs)
Our resulting file has 15 rows, as the parameter we passed in +affected the report:
+
+orderly2::orderly_copy_files(id, files = c("random.rds" = "data.rds"),
+ dest = dest)
+readRDS(file.path(dest, "random.rds"))
+## x y
+## 1 1 0.4463006
+## 2 2 2.6289820
+## 3 3 5.0650249
+## 4 4 2.3690106
+## 5 5 5.5124269
+## 6 6 4.1369885
+## 7 7 6.4779875
+## 8 8 7.9473981
+## 9 9 9.5429963
+## 10 10 9.0859252
+## 11 11 11.4681544
+## 12 12 12.3629513
+## 13 13 11.6954565
+## 14 14 14.7377763
+## 15 15 16.8885049
You can use these parameters in orderly’s search functions. For +example we can find the most recent version of a packet by running:
+
+orderly2::orderly_search('latest(name == "random")')
+## [1] "20241022-124935-41a98cb6"
But we can also pass in parameter queries here:
+
+orderly2::orderly_search('latest(name == "random" && parameter:n_samples > 10)')
+## [1] "20241022-124935-41a98cb6"
These can be used within orderly2::orderly_dependency()
+(the name == "random"
part is implied by the first
+name
argument), for example
+orderly2::orderly_dependency("random", "latest(parameter:n_samples > 10)",
+ c("randm.rds" = "data.rds"))
In this case if the report that you are querying from also
+has parameters you can use these within the query, using the
+this
prefix. So suppose our downstream report simply uses
+n
for the number of samples we might write:
+orderly2::orderly_dependency("random", "latest(parameter:n_samples == this:n)",
+ c("randm.rds" = "data.rds"))
to depend on the most recent packet called random
where
+it has a parameter n_samples
which has the same value as
+the current report’s parameter n
.
See the outpack query documentation for much more detail on this.
+Sometimes it is useful to share data between different reports, for +example some common source utilities that don’t warrant their own +package, or some common data.
+To do this, create a directory shared
at the orderly
+root and put in it any files or directories you might want to share.
Suppose our shared directory contains a file
+data.csv
:
## .
+## ├── archive
+## │ ├── analysis
+## │ │ └── 20241022-124934-ad20b612
+## │ │ ├── analysis.R
+## │ │ ├── analysis.png
+## │ │ └── incoming.rds
+## │ ├── incoming_data
+## │ │ ├── 20241022-124934-262628cf
+## │ │ │ ├── data.csv
+## │ │ │ ├── data.rds
+## │ │ │ └── incoming_data.R
+## │ │ └── 20241022-124934-fbb8d21d
+## │ │ ├── data.csv
+## │ │ ├── data.rds
+## │ │ └── incoming_data.R
+## │ └── random
+## │ └── 20241022-124935-41a98cb6
+## │ ├── data.rds
+## │ └── random.R
+## ├── draft
+## │ ├── analysis
+## │ ├── incoming_data
+## │ └── random
+## ├── orderly_config.yml
+## ├── shared
+## │ └── data.csv
+## └── src
+## ├── analysis
+## │ └── analysis.R
+## ├── incoming_data
+## │ ├── data.csv
+## │ └── incoming_data.R
+## └── random
+## └── random.R
+We can then write an orderly report use_shared
that uses
+this shared file, with its use_shared.R
containing:
+orderly2::orderly_shared_resource("data.csv")
+orderly2::orderly_artefact("analysis", "analysis.png")
+
+d <- read.csv("data.csv")
+png("analysis.png")
+plot(y ~ x, d)
+dev.off()
We can run this:
+
+id <- orderly2::orderly_run("use_shared")
+## ℹ Starting packet 'use_shared' `20241022-124935-c53e3256` at 2024-10-22 12:49:35.774852
+## > orderly2::orderly_shared_resource("data.csv")
+## > orderly2::orderly_artefact("analysis", "analysis.png")
+## Warning: Please use a named argument for the description in 'orderly_artefact()'
+## In future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## > d <- read.csv("data.csv")
+## > png("analysis.png")
+## > plot(y ~ x, d)
+## > dev.off()
+## agg_png
+## 2
+## ✔ Finished running use_shared.R
+## ! 1 warning found:
+## • Please use a named argument for the description in 'orderly_artefact()' In
+## future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## ℹ Finished 20241022-124935-c53e3256 at 2024-10-22 12:49:35.826686 (0.05183363 secs)
In the resulting archive, the file that was used from the shared +directory is present:
+## archive/use_shared
+## └── 20241022-124935-c53e3256
+## ├── analysis.png
+## ├── data.csv
+## └── use_shared.R
+This is a general property of orderly: it tries to save all the +inputs alongside the final results of the analysis, so that later on you +can check to see what went into an analysis and what might have changed +between versions.
+The previous version of orderly (orderly1
; see
+vignette("migrating")
) was very fussy about all input being
+strictly declared before a report could be run, so that it was clear
+what was really required in order to run something. From version 2 this
+is relaxed by default, but you can opt into most of the old behaviours
+and checks by adding
+orderly2::orderly_strict_mode()
anywhere within your orderly file (conventionally at the top). We may +make this more granular in future, but by adding this we:
+src/<reportname>/
) to the draft directory where the
+report runs (draft/<reportname>/<packet-id>
)
+that were declared with orderly2::orderly_resource
; this
+leaves behind any extra files left over in development
Using strict mode also helps orderly2
clean up the
+src/<reportname>
directory more effectively after
+interactive development (see next section).
Set your working directory to src/<reportname>
and
+any orderly script should be fully executable (e.g., source with
+RStudio’s Source
button, or R’s source()
+function). Dependencies will be copied over as needed.
After doing this, you will have a mix of files within your source
+directory. We recommend a per-source-directory .gitignore
+which will keep these files out of version control (see below). orderly2
+also provides functions for reporting on and cleaning up generated files in this
+directory (see below).
For example, suppose that we have interactively run our
+incoming_data/incoming_data.R
script, we would leave behind
+generated files. We can report on this with
+orderly2::orderly_cleanup_status
:
+orderly2::orderly_cleanup_status("incoming_data")
+## ✖ incoming_data is not clean:
+## ℹ 1 file can be deleted by running 'orderly2::orderly_cleanup("incoming_data")':
+## • data.rds
If you have files here that are unknown to orderly it will tell you +about them and prompt you to tell it about them explicitly.
+You can clean up generated files by running (as suggested in the +message):
+
+orderly2::orderly_cleanup("incoming_data")
+## ℹ Deleting 1 file from 'incoming_data':
+## • data.rds
There is a dry_run = TRUE
argument you can pass if you
+want to see what would be deleted without using the status function.
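A sketch of that call, assuming `dry_run` is passed directly to `orderly2::orderly_cleanup()` as the text describes:

```r
# Report what would be deleted from src/incoming_data without removing anything
orderly2::orderly_cleanup("incoming_data", dry_run = TRUE)
```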
You can also keep these files out of git by using the
+orderly2::orderly_gitignore_update
function:
+orderly2::orderly_gitignore_update("incoming_data")
+## ✔ Wrote 'src/incoming_data/.gitignore'
This creates (or updates) a .gitignore
file within the
+report so that generated files will not be included by git. If you have
+already accidentally committed them then the gitignore has no real
+effect and you should do some git surgery, see the git manuals or this
+handy, if profane, guide.
If you delete packets from your archive/
directory then
+this puts orderly2
into an inconsistent state with its
+metadata store. Sometimes this does not matter (e.g., if you delete old
+copies that would never be candidates for inclusion with
+orderly2::orderly_dependency
you will never notice).
+However, if you delete the most recent copy of a packet and then try and
+depend on it, you will get an error.
At the moment, we have two copies of the incoming_data
+task:
+orderly2::orderly_metadata_extract(
+ name = "incoming_data",
+ extract = c(time = "time.start"))
+## id time
+## 1 20241022-124934-262628cf 2024-10-22 12:49:34
+## 2 20241022-124934-fbb8d21d 2024-10-22 12:49:34
When we run the analysis
task, it will pull in the most
+recent version (20241022-124934-fbb8d21d
). However, if you
+had deleted this manually (e.g., to save space or accidentally) or
+corrupted it (e.g., by opening some output in Excel and letting it save
+changes) it will not be able to be included, and running
+analysis
will fail:
+orderly2::orderly_run("analysis")
+## ℹ Starting packet 'analysis' `20241022-124936-6c9ca234` at 2024-10-22 12:49:36.428663
+## > orderly2::orderly_dependency("incoming_data", "latest()",
+## + c("incoming.rds" = "data.rds"))
+## ✖ Error running analysis.R
+## ℹ Finished 20241022-124936-6c9ca234 at 2024-10-22 12:49:36.513207 (0.08454394 secs)
+## Error in `orderly2::orderly_run()`:
+## ! Failed to run report
+## Caused by error in `orderly_copy_files()`:
+## ! Unable to copy files, due to deleted packet 20241022-124934-fbb8d21d
+## ℹ Consider 'orderly2::orderly_validate_archive("20241022-124934-fbb8d21d",
+## action = "orphan")' to remove this packet from consideration
+## Caused by error:
+## ! File not found in archive
+## ✖ data.rds
The error here tries to be fairly informative, telling us that we
+failed because when copying files from
+20241022-124934-fbb8d21d
we found that the packet was
+corrupt, because the file data.rds
was not found in the
+archive. It also suggests a fix; we can tell orderly2
that
+20241022-124934-fbb8d21d
is “orphaned” and should not be
+considered for inclusion when we look for dependencies.
We can carry out the suggestion and just validate this packet by +running
+
+orderly2::orderly_validate_archive("20241022-124934-fbb8d21d", action = "orphan")
or we can validate all the packets we have:
+
+orderly2::orderly_validate_archive(action = "orphan")
+## ✔ 20241022-124934-262628cf (incoming_data) is valid
+## ✔ 20241022-124934-ad20b612 (analysis) is valid
+## ✖ 20241022-124934-fbb8d21d (incoming_data) is invalid due to its files
+## ✔ 20241022-124935-41a98cb6 (random) is valid
+## ✔ 20241022-124935-c53e3256 (use_shared) is valid
If we had the option core.require_complete_tree
enabled,
+then this process would also look for any packets that used our
+now-deleted packet and orphan those too, as we no longer have a complete
+tree that includes them.
If you want to remove references to the orphaned packets, you can use
+orderly2::orderly_prune_orphans()
to remove them
+entirely:
+orderly2::orderly_prune_orphans()
+## ℹ Pruning 1 orphan packet
Some guidelines:
+Make sure to exclude some files from git
by listing them
+in .gitignore
:
.outpack/
- nothing in here is suitable for version
+control
archive/
- if you have core.path_archive
+set to a non-null value, this should be excluded. The default is
+archive
+draft/
- the temporary draft directory
orderly_envir.yml
- used for setting machine-specific
+configuration
You absolutely should version control some files:
+src/
- the main source of your analyses
orderly_config.yml
- this high level configuration is
+suitable for sharing
orderly_config.yml
) should probably be version
+controlled
Your source repository will end up in multiple people’s machines,
+each of which is configured differently. The configuration options set
+via orderly2::orderly_config_set
are designed to be
+(potentially) different for different users, so this configuration must
+not be version controlled. It also means that reports/packets can’t
+directly refer to values set here. This includes the directory in which
+archived packets are saved (if enabled) and the names of locations
+(equivalent to git remotes).
You may find it useful to include scripts that help users set up
+common locations, but like with git, different users may use different
+names for the same remote (e.g., one user may have a location called
+data
while for another it is called
+data-incoming
, depending on their perspective about the use
+of the location).
orderly2
will always try and save information about the
+current state of the git source repository alongside the packet
+metadata. This includes the current branch, commit (sha) and remote url.
+This is to try and create links between the final version of the packet
+and the upstream source repository.
As alluded to above, the .outpack
directory contains
+lots of information about packets that have been run, but is typically
+“out of bounds” for normal use. This is effectively the “database” of
+information about packets that have been run. Understanding how this
+directory is structured is not required for using orderly, but is
+included here for the avoidance of mystery! See the outpack
+documentation (vignette("outpack")
) for more details about
+the ideas here.
After all the work above, our directory structure looks like:
+## .outpack
+## ├── config.json
+## ├── index
+## │ └── outpack.rds
+## ├── location
+## │ ├── local
+## │ │ ├── 20241022-124934-262628cf
+## │ │ ├── 20241022-124934-ad20b612
+## │ │ ├── 20241022-124935-41a98cb6
+## │ │ └── 20241022-124935-c53e3256
+## │ └── orphan
+## └── metadata
+## ├── 20241022-124934-262628cf
+## ├── 20241022-124934-ad20b612
+## ├── 20241022-124935-41a98cb6
+## └── 20241022-124935-c53e3256
+As can be perhaps inferred from the filenames, the files
+.outpack/metadata/<packet-id>
are the metadata for
+each packet as it has been run. The files
+.outpack/location/<location-id>/<packet-id>
+hold information about when the packet was first known about by a
+location (here the location is the special “local” location).
The default orderly configuration is to store the final files in a
+directory called archive/
, but alternatively (or
+additionally) you can use a content-
+addressable file store. With this enabled, the .outpack
+directory looks like:
## .outpack
+## ├── config.json
+## ├── files
+## │ └── sha256
+## │ ├── 0a
+## │ │ └── a82571c21c4e5f1f435e8bef2328dda5ef47e177d78d63d1c4ec647a5a388a
+## │ ├── 25
+## │ │ └── 4947c281b203719c72949745123a1d017e2f9b50c048b1d24a0803d73ba0b8
+## │ ├── 40
+## │ │ └── f8c6c3993686ee8acf38ce936911d6f65372d489e78748ec6d1f27d277daa7
+## │ ├── 42
+## │ │ └── 438203a847c7cd496e3e603ddeeceb4e136f1bd18f1efb81332fa093964f7b
+## │ ├── 51
+## │ │ └── f8070f04f57a8040ad215474b976bcddc891bfc544e6c6a0519e74d79cb104
+## │ ├── 5f
+## │ │ └── 96f49230c2791c05706f24cb2335cd0fad5d3625dc6bca124c44a51857f3f8
+## │ ├── a6
+## │ │ └── 80ab7c65a52327a3d9c5499d114f513f18eabe7f63a98f9fc308c2b3744c82
+## │ ├── d9
+## │ │ └── 1699ae410cbd811e1f028f8a732e5162b7df854eec08d921141f965851272d
+## │ ├── de
+## │ │ └── 1329bc0e9f8ee0c8b3376eb09b71e93587f707dfeaa01d1a03ae32f97c928a
+## │ └── ec
+## │ └── b53285781a4d36c65168c80ee14f2af2c885423c6166b9425f40c3c6cd8297
+## ├── index
+## │ └── outpack.rds
+## ├── location
+## │ ├── local
+## │ │ ├── 20241022-124934-262628cf
+## │ │ ├── 20241022-124934-ad20b612
+## │ │ ├── 20241022-124935-41a98cb6
+## │ │ └── 20241022-124935-c53e3256
+## │ └── orphan
+## └── metadata
+## ├── 20241022-124934-262628cf
+## ├── 20241022-124934-ad20b612
+## ├── 20241022-124935-41a98cb6
+## └── 20241022-124935-c53e3256
+The files under .outpack/files/
should never be modified
+or deleted. This approach to storage naturally deduplicates the file
+archive, so that a large file used in many places is only ever stored
+once.
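Judging from the listing above, each file is stored under a directory named for the hash algorithm, then a directory named for the first two hex characters of the digest, with the remaining characters as the filename. A small sketch of that mapping; the helper name `file_store_path` is hypothetical, and the layout is inferred from the listing rather than taken from a specification:

```r
# Hypothetical helper: map a hex digest to its path within the file store
file_store_path <- function(digest, algorithm = "sha256", root = ".outpack") {
  file.path(root, "files", algorithm,
            substr(digest, 1, 2),               # two-character directory
            substr(digest, 3, nchar(digest)))   # remainder is the filename
}

# Digest reassembled from the first entry in the listing above
p <- file_store_path(
  "0aa82571c21c4e5f1f435e8bef2328dda5ef47e177d78d63d1c4ec647a5a388a")
print(p)
```

Because the path depends only on the content hash, two packets that contain identical files resolve to the same path, which is where the deduplication comes from.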
orderly
and
+outpack
+The orderly2
package is built on a metadata and file
+storage system called outpack
; we will be implementing
+support for working with these metadata archives in other languages (see
+outpack_server
+for our server implementation in Rust and outpack-py
+in Python). The metadata is discussed in more detail in
+vignette("metadata")
and we will document the general ideas
+more fully at mrc-ide/outpack
The orderly2
package is the reference implementation at
+the moment of the outpack specification; a collection of schemas and
+directory structures that outpack requires. Once we release (or possibly
+before), we will split this specification from the package, though the
+package will continue to bundle a copy.
We make use of JSON schema to +document the schemas used.
+This vignette outlines the basic structure of files within the
+.outpack/
directories, and is not itself an overview of how
+outpack works; the primary audience is people working on outpack itself
+(though a small introduction is provided below).
Each “packet” is conceptually a directory, corresponding to a +particular analysis or data product, though this is not necessarily how +it is stored. The internal representation includes:
+Every packet is referenced uniquely by a primary key. We use a key +format that encodes the current date and time, as well as random data to +avoid collisions.
+There exists some dependency graph among packets, as one packet +depends on another. Each edge of this graph has a hard link (from one +packet to another by an id) and also a query (e.g., latest packet with +some name) which was used to find the packet. This means that there are +many ways of looking at or thinking about the dependency graph.
+Not all packets are available locally, some are on other outpack +repositories, typically (but not always) on other machines and accessed +over an HTTP API. These are conceptually similar to git “remotes”.
+We will need to distinguish between packets which are “unpacked” +(that is, packets with every file available in the current archive) and +packets that are merely known about (those for which we have the +metadata but not the files). We will sometimes refer to these unpacked +packets as “local” as they are known to the “local” location which is +special.
+We use the terms “archive” and “repository” fairly +interchangeably below and will try and nail that down.
+Each packet must have a few things:
+model_fits
). This cannot be changed
+(or rather changes cannot be tracked) and there is not currently a way
+of namespacing this between different repositories
output/data.csv
) and also a hash (e.g.,
+sha256:69f6cf230416cf40828da251a0dad17cbbf078587883e826f3345ff08d1aaa7d
)
In addition it may contain information about:
There are a few types of “persona” of outpack user that we imagine +exist and which guide some decisions about layout below. At the extremes +we have:
+This impacts two configuration options and associated parts of the +directory structure below:
+We expect the first persona wants the human readable archive and not +to contain a full tree, while the second wants the opposite.
This section discusses the files and directories that make outpack +work, but not so much how these come to be; see below for that.
+A typical .outpack
directory layout looks like this:
.outpack/
+ config.json
+ files/
+ location/
+ metadata/
+archive/
+(note that archive/
and .outpack
here are
+at the same level). Not all of these directories will necessarily be
+present; indeed the only required file is
+.outpack/config.json
.
.outpack/config.json
)
+The outpack configuration schema is defined in config.json
The configuration format is still subject to change…
+.outpack/metadata/
)
+Each file within this directory has a filename that is an outpack id
+(matching the regular expression
+^[0-9]{8}-[0-9]{6}-[0-9a-f]{8}$
, see below). Each file is a
+JSON file conforming to the schema metadata.json
.
Being present here means that an outpack implementation can report +information back about a packet (when it was created, what files it +contains, etc), but packet metadata are not very meaningful on their +own; we want to know where they might have come from (a location that is +distributing this packet) and if we have a copy of the packet +locally.
+.outpack/location/
)
Each directory here has a name that matches the regular expression ^[0-9a-f]{8}$
+(e.g., 457f4f2a
) and is a “location id” (see below)
+corresponding to a “location”. Each file within this directory has an
+outpack id as name, and contains json about when that location unpacked
+(or installed) the packet, and the hash of the metadata. This file
+conforms to the schema location.json
.
.outpack/files
)
+If the configuration option core.use_file_store
is
+true
, then outpack keeps a content addressable file store
+of all files that it knows about. This is much more space efficient than
+having the entire packet unpacked as it automatically deduplicates
+shared content among packets (e.g., if a large file is present in two
+packets it will only be stored once). The file store layout is described
+below.
This storage format is not human-readable (and indeed present only
+within the hidden directory .outpack
). It can be enabled on
+either server or user
archive/
by default)
+If the configuration option core.path_archive
is
+non-null
then there will be a directory with that path
+containing unpacked packets. Each packet will be available at the
+path
archive/<name>/<id>/<files...>
+With <name>
being the “name” of the packet,
+<id>
being its outpack id. There will be several
+files per packet, possibly themselves in directories. This storage
+approach is designed to be human readable, and will typically only be
+enabled where the outpack repository is being used on a laptop where a
+user wants to interactively work with files.
In order to make a packet available locally, you need to import the +metadata and the files, then mark the packet as available. This will be +roughly the same if you are creating a packet (i.e., you are the first +place where a packet has ever existed) or if you are importing a packet +from elsewhere.
+Making the packet available allows it to be used as a dependency, +allows serving that packet if you are acting as a location (over the +http or file protocols), and guarantees that the files are actually +present locally.
+You can simply copy metadata as the file
+.outpack/metadata/<packet id>
if it does not yet
+exist. This does not make it available to anything yet as it is not
+known from any location. Dangling metadata (that is, metadata present in
+this directory but not known anywhere) is currently mostly ignored.
If the repository uses a file store, you should fill this first, +because it is much easier to think about. You can easily get the +difference between files used by a packet (the list of files in the +packet manifest) and what you already have in the file store by looking +up each hash in turn. You should then request any missing files and +insert them into the store. This may leave “dangling” files for a while +(files referred to by no packet) but that is not a problem.
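The comparison described here amounts to a set difference on hashes. A minimal sketch, in which `manifest` and `store_hashes` are hypothetical stand-ins for the packet's file list and for the hashes already present in the file store (the digests are shortened for readability):

```r
# Hypothetical packet manifest: local path and content hash for each file
manifest <- data.frame(
  path = c("data.csv", "data.rds", "incoming_data.R"),
  hash = c("sha256:aaaa", "sha256:bbbb", "sha256:cccc"))

# Hypothetical set of hashes already present in the file store
store_hashes <- c("sha256:aaaa", "sha256:dddd")

# Hashes that must be requested from the location before unpacking;
# unique() avoids fetching the same content twice
to_fetch <- setdiff(unique(manifest$hash), store_hashes)
print(to_fetch)
```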
+If the repository has a human-readable archive and +uses a file store, then after the files are all present in the file +store it is easy enough to check them out of the file store to the +requested path (the local relative path in the packet manifest). Because +you update the file store first, all files are guaranteed to be +present.
+If the repository only uses a human readable +archive, the simplest thing is to request each file from the remote. +However, it might be more efficient to check locally for any previously +fetched copies of files with the same content, verify that they have not +been modified, and then copy those into place rather than +re-downloading.
+For your local location id, write out a file
+.outpack/location/<local location id>/<packet id>
+conforming to the location.json
+schema, and containing the packet id, the time that it was marked as
+unpacked and the hash of the metadata.
We only need both files and metadata once the packet is marked as +unpacked; note that some configurations guarantee that every packet is +unpacked in a complete tree.
+You can import files first, or metadata; neither order has much +disadvantage. You should only mark a packet as unpacked and +known locally once both components are present.
+Outpack ids match the regular expression
+^[0-9]{8}-[0-9]{6}-[0-9a-f]{8}$
; they encode a UTC
+date-time as the prefix YYYYMMDD-HHMMSS-
and are
+followed by 8 hexadecimal digits. In the R implementation, we encode the
+current second as the first four digits (2 bytes) and append 2 bytes of
+cryptographically random data.
The id tries to balance a reasonable degree of collision resistance +(65536 combinations per millisecond), lexicographic sortability and a +reasonable degree of meaningfulness.
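As a rough illustration of the format described above, the following base R sketch checks a string against the id pattern and recovers the UTC timestamp from its prefix. The helper names `is_outpack_id` and `outpack_id_time` are hypothetical and are not part of `orderly2`.

```r
# Pattern quoted from the text above.
outpack_id_pattern <- "^[0-9]{8}-[0-9]{6}-[0-9a-f]{8}$"

# TRUE for each element of `x` that looks like an outpack id.
# (Hypothetical helper, not an orderly2 function.)
is_outpack_id <- function(x) {
  grepl(outpack_id_pattern, x)
}

# Recover the UTC timestamp (to the second) from the id prefix.
# (Hypothetical helper, not an orderly2 function.)
outpack_id_time <- function(id) {
  stopifnot(all(is_outpack_id(id)))
  as.POSIXct(substr(id, 1, 15), format = "%Y%m%d-%H%M%S", tz = "UTC")
}

ids <- c("20241022-124945-fb73a61c", "20241022-124944-e3db05e1")
sort(ids)                  # plain string sorting orders ids by creation time
outpack_id_time(ids[[2]])  # the timestamp encoded in the prefix
```

Because the timestamp comes first and has a fixed width, sorting ids as ordinary strings is enough to order packets by creation time.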
+Location ids are meaningless 4-byte (8 character) hex strings. They
+are immutable once created and are different between different machines
+even if they point to the same location. This location id is then mapped
+(via .outpack/config.json
) to a location name
+which is a human-readable name (e.g., production
or
+staging
). There is no requirement that this name is the
+same for different machines.
One of these directories represents the local location; you can find +that mapping within the configuration.
+Outpack typically uses sha256
hashes, but we want to be
+able to change this in future. So wherever a hash is presented, the
+algorithm is included as part of the string. For example
sha256:69f6cf230416cf40828da251a0dad17cbbf078587883e826f3345ff08d1aaa7d
+If we had instead used the md5 algorithm we would have written
+md5:bd57f7123c6bfb95c3234ff56373b7f4
+The schema currently assumes that the hash value is represented as a +hex string.
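A minimal sketch of splitting such a string into its algorithm and hex value is shown below. The function name `hash_parse` and the exact regular expression are assumptions made for illustration, not the package's own implementation.

```r
# Split "<algorithm>:<hex value>" into its two parts.
# (Hypothetical helper; the pattern is an assumption based on the
# examples in the text.)
hash_parse <- function(hash) {
  re <- "^([a-z0-9]+):([0-9a-f]+)$"
  stopifnot(is.character(hash), length(hash) == 1, grepl(re, hash))
  list(algorithm = sub(re, "\\1", hash),
       value = sub(re, "\\2", hash))
}

h <- hash_parse("md5:bd57f7123c6bfb95c3234ff56373b7f4")
h$algorithm
h$value
```

Keeping the algorithm in the string means a reader of the metadata never has to guess which algorithm produced a given value.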
+We store information about times in a few places (e.g., times that a
+packet was run, imported, etc). Rather than trying to deal with strings,
+we always store time in seconds since
+1970-01-01 00:00:00 UTC
(including fractional seconds, to
+whatever accuracy your system allows).
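In R this representation is what you get from converting a date-time to a number. The short sketch below, using only base R, shows the conversion in both directions.

```r
# Current time as (fractional) seconds since the epoch.
now <- as.numeric(Sys.time())

# A fixed example, converted to seconds and back again.
t <- as.POSIXct("2024-10-22 12:49:44", tz = "UTC")
secs <- as.numeric(t)
back <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(back, "%Y-%m-%d %H:%M:%S")
```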
The file store is designed to be simple, and is not as sophisticated +as that in git, whose object store does a similar thing.
+The general layout looks like:
+.outpack/files
+ sha256/
+ 5d/
+ dfaf1f4a2e15e8fe46dbed145bf2f84bba1b3367d0a56f73de08f8585dd153
+ ...
+ 77/
+ ...
+The structure is hopefully fairly obvious. Paths have the format:
+<algorithm>/<first two hex characters>/<remaining hex characters>
+The reason for the second level is to prevent performance degradation +with directories containing millions of files, again copying git.
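The mapping from a hash to a path in the store can be sketched in a few lines of base R. The same lookup gives a simple way to work out which files still need to be requested when unpacking a packet. Both function names are hypothetical, and the default root simply follows the layout shown above.

```r
# Path within the file store for a hash of the form "<algorithm>:<hex>".
# (Hypothetical helper, not an orderly2 function.)
file_store_path <- function(hash, root = file.path(".outpack", "files")) {
  parts <- strsplit(hash, ":", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)
  value <- parts[[2]]
  file.path(root, parts[[1]],
            substr(value, 1, 2),
            substr(value, 3, nchar(value)))
}

# Hashes from a packet manifest that are not yet present in the store.
# (Hypothetical helper, not an orderly2 function.)
file_store_missing <- function(hashes, root = file.path(".outpack", "files")) {
  hashes <- unique(hashes)
  present <- vapply(hashes,
                    function(h) file.exists(file_store_path(h, root)),
                    logical(1))
  hashes[!present]
}

file_store_path("sha256:5ddfaf1f4a2e15e8fe46dbed145bf2f84bba1b3367d0a56f73de08f8585dd153")
```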
+The store is designed to cope with different hashing algorithms,
+though the R implementation of outpack
only supports
+sha256
for now.
Multiple hashing algorithms could be supported by hard linking +content into multiple places within the tree, so we might link
+sha256/5d/dfaf1f4a2e15e8fe46dbed145bf2f84bba1b3367d0a56f73de08f8585dd153
+as
+md5/84/0bc6ad3ae479dccc1c49a1910b37bd
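The hard-linking idea can be sketched with base R as follows. This is only an illustration under stated assumptions: the function name `store_link_md5` is hypothetical, md5 is used because `tools::md5sum()` ships with R, and the fallback to a plain copy when hard links are unavailable is a choice made here, not something the spec requires.

```r
# Make an existing file also reachable under its md5 hash in the store.
# (Hypothetical helper, not an orderly2 function.)
store_link_md5 <- function(path, root) {
  value <- unname(tools::md5sum(path))
  dest <- file.path(root, "md5",
                    substr(value, 1, 2),
                    substr(value, 3, nchar(value)))
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    # Prefer a hard link so that the content is stored only once.
    ok <- suppressWarnings(file.link(path, dest))
    if (!ok) {
      # Assumption: fall back to a copy where hard links are not supported.
      file.copy(path, dest)
    }
  }
  dest
}

src <- tempfile()
writeBin(charToRaw("hello"), src)
store_link_md5(src, root = tempfile())
```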
+The new version of orderly (codename orderly2
for now)
+is very different to the previously released version on CRAN (orderly
+1.4.3; September 2021) or the last development version of the 1.x line
+(orderly 1.6.x; June 2023). These changes constitute a ground-up rewrite
+in order to bring out the best features we found that orderly enabled
+within workflows, while removing some features we felt have outlived
+their usefulness. This is a disruptive change, but we hope that it will be
+worth it.
This vignette is divided into two parts; one covers the conceptual
+differences between orderly1
and orderly2
+while the second covers the mechanical process of migrating from an
+existing orderly source tree and archive to take advantage of the new
+features.
If you have never used version 1.x of orderly, you should not read
+this document unless you are curious about the history of design
+decisions. Instead you should read the introductory vignette
+(vignette("orderly2")
).
The most obvious user-facing change is that there is (almost) no YAML, with the definition
+of inputs and outputs for a report now defined within an orderly file,
+<reportname>.R
. So an orderly report that previously
+had an orderly.yml
file that looked like
parameters:
+ n_min:
+ default: 10
+script: script.R
+source:
+ - functions.R
+resources:
+ - metadata.csv
+depends:
+ raw_data:
+ id: latest
+ use:
+ raw_data.csv: data.csv
+artefacts:
+ data:
+ description: Processed data
+ filenames: data.rds
would end up within an orderly file that looks like:
+
+orderly2::orderly_parameters(n_min = 10)
+orderly2::orderly_dependency("raw_data", "latest",
+ files = c("raw_data.csv" = "data.csv"))
+orderly2::orderly_resource("metadata.csv")
+orderly2::orderly_artefact("Processed data", "data.rds")
+source("functions.R")
We think this is much clearer, and comes with documentation and +autocomplete support in most IDEs.
+In fact, for simple reports, no special functions
+are required, though you’ll find that some will be useful (see
+vignette("orderly2")
)
Some specific changes:
+packages: is no longer used;
+instead just use library() as in an ordinary script. We
+will record the state of the session regardless so you will get a record
+of what was used
+sources: is no longer used to list scripts you
+want to source; instead use source() as normal
+global_resources has become
+orderly2::orderly_shared_resource (these aren’t really
+global so much as shared). Note that the directory they are in is now
+always shared/ at the orderly root; you may not configure
+it.
+This change has widespread implications:
+for
loop over a series of parameter values, or
+conditionally depending on other reports
+In version 1, we had built-in support for accessing data from SQL
+databases; this has moved within the orderly.db
+plugin. All major features are supported.
orderly2
no longer requires a separate
+orderly_commit()
call after orderly_run()
; we
+no longer make a distinction between local draft and archive packets.
+Instead, we have added finer-grained control over where dependencies are
+resolved from (locally, or from some subset of your servers), which
+generalises the way that draft/archive was used in practice. See
+?orderly_run
for more details on how dependencies are
+resolved.
This has implications for deleting things; the draft directory was
+always an easy target for deletion, but now after deletion you will need
+to tell orderly2
that you have deleted things. See
+vignette("introduction")
for details on this (section
+“Deleting things from the archive”).
We have had two different, but unsatisfactory, mechanisms for +developing an orderly report:
+orderly_test_start
 (up to version 1.1.0)
+orderly_develop_start
 (from 1.1.0 onwards)
+These worked by doing the initial setup and copying dependencies
+etc. to a location where you could work (a new draft for
+orderly_test_start
and the source directory for
+orderly_develop_start
). From orderly2
you can
+just work directly within the source directory, and so long as your
+working directory is set to src/<report-name>
, all
+orderly2
commands will work as expected.
As with orderly1
, you will need to be careful not to
+commit (to git) results of running your analysis, and we encourage
+per-report .gitignore
files to help with this.
The biggest change, but perhaps the least visible, is that orderly is +now built on an open spec outpack which can be +implemented for any language. We will develop a Python implementation of +this, and possibly other languages.
+This takes control of all the metadata. As such there is a split
+between “orderly_
” and “outpack_
” functions in
+this package, for more information see the last section of
+vignette("introduction")
and also
+vignette("outpack")
.
“Global” resources have become “shared” resources,
+and always live in shared/
at the orderly root (i.e., this
+is no longer configurable). The reason for this is that we want reports
+to be able to be run fairly independently of the orderly2
+configuration (the only exception to this are plugins). In practice
+people did not really vary this.
orderly1
+There are two parts to a migration: updating the canonical copy of
+your orderly archive (ideally you only have one of these) and updating
+your source tree. These steps should be done via the outpack.orderly
+package.
You should migrate your archive first. Do this for every archive that
+you want to retain (you might have archives stored locally, on
+production servers and on staging servers). Archive migration happens
+out of place; that is, we do not modify anything in the
+original location. If your archive is old and has been used with very
+old versions of orderly1
it is possible that this process
+will have a few hiccups. Please let us know if that is the case. The
+result of this process is that you will end up with a new directory that
+contains a new archive conforming to the outpack
spec and
+containing orderly2
metadata.
Next, migrate your source tree. This will be done in place
+so should be done on a fresh clone of your source git repository. For
+each report, we will examine your orderly.yml
files and
+your script files (often script.R
), delete these, and then
+write out a new orderly file that will adapt your report to work for
+orderly2
. It is possible that this will not be perfect and
+might need some minor tweaking but hopefully it will be reasonable. One
+thing that is not preserved (and we probably cannot do so) is the
+comments from the yaml
but as these often refer to
+yaml
formatting or orderly1
features hopefully
+this is not too much of a problem. You will probably want to manually
+tweak the generated code anyway, to take advantage of some of the new
+orderly2
features such as being able to compute
+dependencies.
If you are using OrderlyWeb, you probably need to pause before +migrating, as the replacement is not yet ready.
+We will merge orderly2
into the orderly
+package, so once we are ready for release you can use that. However, we
+anticipate a period in which legacy orderly1
+systems coexist with orderly2 while we develop it
. To help with this we
+have a small helper package orderly.helper
+which can smooth over these namespace differences; this may be useful if
+you interact with both versions.
orderly2
is a package designed with two complementary
+goals in mind:
In this vignette we will expand on these two aims, and show that the
+first one is a prerequisite for the second. The second is more
+interesting though and we start there. If you just want to get started
+using orderly2
, you might prefer
+vignette("introduction")
and if you are already familiar
+with version 1 you might prefer vignette("migrating")
.
Many analyses only involve a single person and a single machine; in +this case there are any number of workflow tools that will make +orchestrating this analysis easy. In a workflow model you have a graph +of dependencies over your analysis, over which data flows. So for +example you might have
+[raw data] -> [processed data] -> [model fits] -> [forecasts] -> [report]
+If you update the data, the whole pipeline should rerun. But if you +update the code for the forecast, then only the forecasts and report +should rerun.
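The rerun rule described above can be sketched for this linear pipeline in a few lines of R. The function `needs_rerun` is purely illustrative and assumes a simple chain of steps rather than a general dependency graph.

```r
# The example pipeline from the text, in dependency order.
pipeline <- c("raw data", "processed data", "model fits", "forecasts", "report")

# Steps that must rerun after `changed` is updated: the step itself
# and everything downstream of it. (Illustrative only; assumes a
# linear chain.)
needs_rerun <- function(changed, steps = pipeline) {
  i <- match(changed, steps)
  stopifnot(length(i) == 1, !is.na(i))
  steps[i:length(steps)]
}

needs_rerun("raw data")   # the whole pipeline
needs_rerun("forecasts")  # only the forecasts and the report
```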
+In our experience, this model works well for a single-user setting +but falls over in a collaborative setting, especially where the analysis +is partitioned by person; so Alice is handling the data pipeline, Bob is +running fits, while Carol is organising forecasts and the final report. +In this context, changes upstream affect downstream analysis, and +require the same set of care around integration as you might be used to +with version controlling source code.
+For example, if Alice is dealing with a change in the incoming data +format which is going to break the analysis at the same time that Bob is +trying to get the model fits working, Bob should not be trying to +integrate both his code changes and Alice’s new data. We +typically deal with this for source code by using branches within git; +for the code Bob would work on a branch that is isolated from Alice’s +changes. But in most contexts like this you will not have (and +should not have) the data and analysis products in git. What is +needed is a way of versioning the outputs of each step of analysis and +controlling when these are integrated into subsequent analyses.
+Another way of looking at the problem is that we seek a way of making +analysis composable in the same way that functions and +OOP achieve for programs, or the way that docker and containerisation +have achieved for deploying software. To do this we need a way of +putting interfaces around pieces of analysis and to allow people to +refer to them and fetch them from somewhere where they have been +run.
+The conceptual pieces that are needed here are:
+We refer to a transportable unit of analysis as a “packet”. This
+conceptually is a directory of files created by running some code, and
+is our atomic unit of work from the point of view of
+orderly
. Each packet has an underlying source form, which
+anyone can run. However, most of the time people will use
+pre-run packets that they or their collaborators have run as inputs to
+onward analyses (see vignette("dependencies")
and
+vignette("collaboration")
for more details).
Any degree of collaboration in the style above requires +reproducibility, but there are several aspects of this.
+With the system we describe here, even though everyone can +typically run any step of an analysis, they typically don’t. +This differs from workflow tools, which users may be familiar with.
+Workflow systems have been hugely influential in scientific +computing, from people co-opting build systems like make through to +sophisticated systems designed for parallel running of large and complex +workflows such as nextflow. The +general approach is to define interdependencies among parts of an +analysis, forming a graph +over parts of an analysis and track inputs and outputs through the +workflow.
+This model of computation has lots of good points:
+We have designed orderly2
for working patterns that do
+not suit the above. Some motivating reasons include:
In all these cases the missing piece we need is a way of versioning +the nodes within the computational graph, and shifting the emphasis from +automatically rerunning portions of the graph to tracking how data has +flowed through the graph. This in turn shifts the reproducibility +emphasis from “everyone will run the same code and get the same +results” to “everyone could run the same code, but will instead +work with the results”.
+For those familiar with docker, our approach is similar to working +with pre-built docker images, whereas the workflow approach is more +similar to working directly with Dockerfiles; in many situations the end +result is the same, but the approaches differ in guarantees, in where +the computation happens, and in how users refer to versions.
+[*] We discourage trying to force determinism by manually +setting seeds, as this has the potential to violate the statistical +properties of random number streams, and is fragile at best. +
+Reproducibility means different things to different people, even +within the narrow sense of “rerun an analysis and retrieve the same +results”. In the last decade, the idea that one should be able +to rerun a piece of analysis and retrieve the same results has slightly +morphed into one must rerun a piece of analysis. Similarly, the +emphasis on the utility of reproducibility has shifted from authors +being able to rerun their own work (or have confidence that they could +rerun it) to some hypothetical third party wanting to rerun an +analysis.
+Our approach flips the perspective around a bit, based on our +experiences with collaborative research projects, and draws from an +(overly) ambitious aim we had:
+++Can we prove that a given set of inputs produced a given set of +outputs?
+
We quickly found that this was impossible, but provided a few systems
+were in place one could be satisfied with this statement to a given
+level of trust in a system. So if a piece of analysis comes from a
+server where the primary way people run analyses is through our web
+front-end (currently OrderlyWeb, soon to be
+Packit) we know
+that the analysis was run end-to-end with no modification and that
+orderly2
preserves inputs alongside outputs so the files
+that are present in the final packet were the files
+that went into the analysis, and the recorded R and package versions
+were the full set that were used.
Because this system naturally involves running on multiple machines
+(typically we will have the analysts’ laptops, a server and perhaps an
+HPC environment), and because of the way that orderly2
+treats paths, practically there is very little problem getting analyses
+working in multiple places, trivially satisfying the typical
+reproducibility aim, even though it is not what people are typically
+focussed on.
This shift in focus has proved valuable. In any analysis that is run
+on more than one occasion (e.g., regular reporting, or simply updating a
+figure for a final submission of a manuscript after revision), the
+outputs may change. Understanding why these changes have
+happened is important. Because orderly2
automatically saves
+a lot of metadata about what was run it is easy to find out why things
+might have changed. Further, you can start interrogating the graph among
+packets to find out what effect that change has had; so find all the
+previously run packets that pulled in the old version of a data set, or
+that used the previous release of a package.
You may not need to read this: the intended readers are
+authors of orderly2
plugins, not users of such
+plugins.
In order to make orderly2
more extensible without
+bloating the core, we have designed a simple plugin interface. Our first
+use case for this is shifting all of orderly1
’s database
+functionality out of the main package, but other uses are possible!
This vignette is intended to primarily serve as a design document, +and will be of interest to the small number of people who might want to +write a new plugin, or to edit an existing one.
+A plugin is provided by a package, and may be the only thing
+that the package provides. The plugin name must (currently) be the same as
+the package name. The only functions that the package needs to call are
+orderly2::orderly_plugin
and
+orderly2::orderly_plugin_register
which create and register
+the plugin, respectively.
To make a plugin available for an orderly project, two new bits of
+configuration may be present in orderly_config.yml
- one
+declares the plugin will be used, the other configures the plugin.
To use a plugin for an individual report, functions from the plugin +should be used, which configure and use the plugin.
+Finally, we can save information back into the final
+orderly2
metadata about what the plugin did.
With the yaml-less design of orderly2
(see
+vignette("migrating")
if you are familiar with
+orderly1
), the line between a plugin and just package code
+is fairly blurred, but reasons for writing a plugin are typically that
+you want to make something easier in reports, and you want that action
+reflected in the orderly metadata.
As an example, we’ll implement a stripped down version of the +database plugin that inspired this work (see orderly.db for a +fuller implementation). To make this work we need functions:
+orderly_config.yml
+that describe where to find the database
+We’ll start with the report side of things, describing what we want +to happen, then work on the implementation.
+Here is the directory structure of our minimal project
+## .
+## ├── orderly_config.yml
+## └── src
+## └── example
+## └── example.R
+The orderly_config.yml
file contains the information
+shared by all possible uses of the plugin - in this case the connection
+information for the database:
Our plugin is called example.db
and is listed within the
+plugins
section, along with its configuration; in this case
+indicating the path where the SQLite file can be loaded from.
The example.R
file contains information about use of the
+database for this specific report; in this case, making the results of
+the query SELECT * from mtcars WHERE cyl == 4
against the
+database available as some R object dat
+dat <- example.db::query("SELECT * FROM mtcars WHERE cyl == 4")
+orderly2::orderly_artefact("Summary of data", "data.rds")
+
+saveRDS(summary(dat), "data.rds")
Normally, we imagine some calculation here but this is kept minimal +for the purpose of demonstration.
+To implement this we need to:
+orderly_config.yml
+query()
used in example.R
+to do the query itself
+There are lots of package skeleton tools out there, and if you do not
+have a favourite, usethis::create_package()
will probably
+do a reasonable job. The only thing your package needs to do is to
+contain Imports: orderly2
in its DESCRIPTION
+file.
A simple package may have a structure like
+## .
+## ├── DESCRIPTION
+## ├── NAMESPACE
+## └── R
+## └── plugin.R
+Here, our DESCRIPTION
file contains:
Package: example.db
+Version: 0.0.1
+License: CC0
+Title: Orderly Database Example Plugin
+Description: Simple example of an orderly plugin.
+Authors@R: person('Orderly Authors', role = c('aut', 'cre'),
+ email = 'email@example.com')
+Imports: orderly2
+and the NAMESPACE
and R/plugin.R
files are
+shown below.
The only required function that a plugin needs to provide is one to
+process the data from orderly_config.yml
. This is probably
+primarily concerned with validation so can be fairly simple at first,
+later we’ll expand this to report errors nicely:
+db_config <- function(data, filename) {
+ data
+}
The arguments here are
+data
: the deserialised section of the
+orderly_config.yml
 specific to this plugin
+filename
: the full path to
+orderly_config.yml
+The return value here should be the data
argument with
+any auxiliary data added after validation.
Finally, for our minimal example, we need the function that actually
+does the query; in our example above this is
+example.db::query
:
+query <- function(sql) {
+ ctx <- orderly2::orderly_plugin_context("example.db")
+ dbname <- ctx$config$path
+ con <- DBI::dbConnect(RSQLite::SQLite(), dbname)
+ on.exit(DBI::dbDisconnect(con))
+ DBI::dbGetQuery(con, sql)
+}
The arguments here are whatever you want the user to provide –
+nothing here is special to orderly2
. The important function
+here to call is orderly2::orderly_plugin_context
which
+returns information that you can use to make the plugin work. This is
+explained in ?orderly2::orderly_plugin_context
, but in this
+example we use just one element, config
, the configuration
+for this plugin (i.e., the return value from our function
+db_config
); see
+orderly2::orderly_plugin_context
for other context that can
+be accessed here.
The last bit of package code is to register the plugin, we do this by
+calling orderly2::orderly_plugin_register
within
+.onLoad()
which is a special R function called when a
+package is loaded. This means that whenever your package is loaded
+(regardless of whether it is attached) it will register the plugin.
+.onLoad <- function(...) {
+ orderly2::orderly_plugin_register(
+ name = "example.db",
+ config = db_config)
+}
(It is important that the name
argument here matches
+your package name, as orderly2 will trigger loading the package based on
+this name in the configuration; we may support multiple plugins within
+one package later.)
Note that our query
function here does not appear within
+this registration, just the function to read and process the
+configuration.
Our final (minimal) package code is:
+
+db_config <- function(data, filename) {
+ data
+}
+
+query <- function(sql) {
+ ctx <- orderly2::orderly_plugin_context("example.db")
+ dbname <- ctx$config$path
+ con <- DBI::dbConnect(RSQLite::SQLite(), dbname)
+ on.exit(DBI::dbDisconnect(con))
+ DBI::dbGetQuery(con, sql)
+}
+
+.onLoad <- function(...) {
+ orderly2::orderly_plugin_register(
+ name = "example.db",
+ config = db_config)
+}
and the NAMESPACE
file contains
export(query)
+In order to test your package, it needs to be loaded. You can do this
+by either installing the package or by using
+pkgload::load_all()
(you may find doing so with
+pkgload::load_all(export_all = FALSE)
gives the most
+reliable experience).
+pkgload::load_all()
## ℹ Loading example.db
+Now, we can run the report:
+
+orderly2::orderly_run("example", root = path_root)
+## ℹ Starting packet 'example' `20241022-124944-e3db05e1` at 2024-10-22 12:49:44.896146
+## > dat <- example.db::query("SELECT * FROM mtcars WHERE cyl == 4")
+## > orderly2::orderly_artefact("Summary of data", "data.rds")
+## Warning: Please use a named argument for the description in 'orderly_artefact()'
+## In future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## > saveRDS(summary(dat), "data.rds")
+## ✔ Finished running example.R
+## ! 1 warning found:
+## • Please use a named argument for the description in 'orderly_artefact()' In
+## future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## ℹ Finished 20241022-124944-e3db05e1 at 2024-10-22 12:49:45.101785 (0.2056396 secs)
+## [1] "20241022-124944-e3db05e1"
The plugin above is fairly fragile because it does not do any
+validation on the input data from orderly_config.yml
or
+orderly.yml
. This is fairly annoying to do as yaml is
+incredibly flexible and reporting back information to the user about
+what might have gone wrong is hard.
In our case, we expect a single key-value pair in
+orderly_config.yml
with the key being path
and
+the value being the path to a SQLite database. We can easily expand our
+configuration function to give the user clearer feedback when they
+misconfigure the plugin:
+db_config <- function(data, filename) {
+ if (!is.list(data) || is.null(names(data)) || length(data) == 0) {
+ stop("Expected a named list for orderly_config.yml:example.db")
+ }
+ if (length(data$path) != 1 || !is.character(data$path)) {
+ stop("Expected a string for orderly_config.yml:example.db:path")
+ }
+ if (!file.exists(data$path)) {
+ stop(sprintf(
+ "The database '%s' does not exist (orderly_config:example.db:path)",
+ data$path))
+ }
+ data
+}
This should do an acceptable job of preventing poor input while +suggesting to the user where they might look within the configuration to +fix it. Note that we return the configuration data here, and you can +augment (or otherwise change) this data as you need.
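To see what the user would be told in each failure case, the sketch below repeats the `db_config` definition from above (so that it runs on its own) and captures the error messages. The wrapper `try_config` and the example inputs are illustrative only.

```r
# Copied from the definition above so that this example is self-contained.
db_config <- function(data, filename) {
  if (!is.list(data) || is.null(names(data)) || length(data) == 0) {
    stop("Expected a named list for orderly_config.yml:example.db")
  }
  if (length(data$path) != 1 || !is.character(data$path)) {
    stop("Expected a string for orderly_config.yml:example.db:path")
  }
  if (!file.exists(data$path)) {
    stop(sprintf(
      "The database '%s' does not exist (orderly_config:example.db:path)",
      data$path))
  }
  data
}

# Return the error message rather than stopping (illustrative helper).
try_config <- function(data) {
  tryCatch(db_config(data, "orderly_config.yml"),
           error = function(e) conditionMessage(e))
}

try_config(list())          # no keys at all
try_config(list(path = 1))  # path is not a string
try_config(list(path = file.path(tempdir(), "no-such-database.sqlite")))
```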
+Nothing about what the plugin does is saved into the report metadata +unless you save it. Partly this is because the orderly.yml, which is +saved into the final directory, serves as some sort of record. However, +you probably want to know something about the data that you returned +here. For example we might want to save
+orderly.yml
 file
+To save metadata, use the function
+orderly2::orderly_plugin_add_metadata
; this takes as
+arguments your plugin name, any string you like to structure the saved
+metadata (here we’ll use query
) and whatever data you want
+to save:
+query <- function(sql) {
+ ctx <- orderly2::orderly_plugin_context("example.db")
+ dbname <- ctx$config$path
+ con <- DBI::dbConnect(RSQLite::SQLite(), dbname)
+ on.exit(DBI::dbDisconnect(con))
+ d <- DBI::dbGetQuery(con, sql)
+ info <- list(sql = sql, rows = nrow(d), cols = names(d))
+ orderly2::orderly_plugin_add_metadata("example.db", "query", info)
+ d
+}
This function is otherwise the same as the minimal version above.
+We also need to provide a serialisation function to ensure that the
+metadata is saved as expected. Because we saved our metadata under the
+key query
, we will get a list back with an element
+query
and then an unnamed list with as many elements as
+there were query
calls in a given report.
+db_serialise <- function(data) {
+ for (i in seq_along(data$query)) {
+ # Always save cols as a vector, even if length 1:
+ data$query[[i]]$cols <- I(data$query[[i]]$cols)
+ }
+ jsonlite::toJSON(data$query, auto_unbox = TRUE)
+}
Here, we ensure that everything except cols
that is
+length 1 (which will be everything) gets turned into a scalar (so
+1
not [1]
) and then serialise with
+jsonlite::toJSON
with auto_unbox
as
+TRUE
.
Taking this a step further, we can also specify a schema that this metadata will +conform to
+{
+ "$schema": "http://json-schema.org/draft-07/schema#",
+
+ "type": "array",
+ "items": {
+ "type": "object",
+ "properties": {
+ "sql": {
+ "type": "string"
+ },
+ "rows": {
+ "type": "number"
+ },
+ "cols": {
+ "type": "array",
+ "items": {
+ "type": "string"
+ }
+ }
+ },
+ "required": ["sql", "rows", "cols"],
+ "additionalProperties": false
+ }
+}
We save this file as inst/schema.json
within the package
+(any path within inst
is fine).
Finally, we can also add a deserialisation hook to convert the loaded
+metadata into a nice data.frame
:
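A sketch of what such a hook could look like is given here. It assumes that the hook receives the parsed metadata as an unnamed list with one element per query, each holding `sql`, `rows` and `cols` as described by the schema; the exact shape passed in by `orderly2` may differ, so treat this as illustrative rather than as the package's own definition.

```r
# Convert a list of per-query records into a data.frame with one row
# per query. `cols` is kept as a list column because each query can
# return a different set of column names.
# (Sketch only; the input shape is an assumption.)
db_deserialise <- function(data) {
  data.frame(
    sql = vapply(data, function(x) x$sql, character(1)),
    rows = vapply(data, function(x) as.numeric(x$rows), numeric(1)),
    cols = I(lapply(data, function(x) as.character(unlist(x$cols)))),
    stringsAsFactors = FALSE)
}

example <- list(
  list(sql = "SELECT * FROM mtcars WHERE cyl == 4",
       rows = 11,
       cols = list("mpg", "cyl")))
db_deserialise(example)
```

Wrapping the column names in `I()` keeps them as a single list column, so the result still has exactly one row per query.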
Now, when we register the plugin, we provide the path to this schema, +along with the serialisation and deserialisation functions:
+
+.onLoad <- function(...) {
+ orderly2::orderly_plugin_register(
+ name = "example.db",
+ config = db_config,
+ serialise = db_serialise,
+ deserialise = db_deserialise,
+ schema = "schema.json")
+}
Now, when the orderly metadata is saved (just after running the
+script part of a report) we will validate output that was passed into
+orderly2::orderly_plugin_add_metadata
against the schema,
+if jsonvalidate
is installed (currently this requires our
+development version) and if the R option
+outpack.schema_validate
is set to TRUE
(e.g.,
+by running options(outpack.schema_validate = TRUE)
).
Our final package has structure:
+## .
+## ├── archive
+## │ └── example
+## │ └── 20241022-124944-e3db05e1
+## │ ├── data.rds
+## │ └── example.R
+## ├── draft
+## │ └── example
+## ├── orderly_config.yml
+## └── src
+## └── example
+## └── example.R
+The DESCRIPTION
file and NAMESPACE
are
+unchanged from above, and the schema is shown just above.
The plugin.R
file contains the code collected from
+above:
+db_config <- function(data, filename) {
+ if (!is.list(data) || is.null(names(data)) || length(data) == 0) {
+ stop("Expected a named list for orderly_config.yml:example.db")
+ }
+ if (length(data$path) != 1 || !is.character(data$path)) {
+ stop("Expected a string for orderly_config.yml:example.db:path")
+ }
+ if (!file.exists(data$path)) {
+ stop(sprintf(
+ "The database '%s' does not exist (orderly_config:example.db:path)",
+ data$path))
+ }
+ data
+}
+
+query <- function(sql) {
+ ctx <- orderly2::orderly_plugin_context("example.db")
+ dbname <- ctx$config$path
+ con <- DBI::dbConnect(RSQLite::SQLite(), dbname)
+ on.exit(DBI::dbDisconnect(con))
+ d <- DBI::dbGetQuery(con, sql)
+ info <- list(sql = sql, rows = nrow(d), cols = names(d))
+ orderly2::orderly_plugin_add_metadata("example.db", "query", info)
+ d
+}
+
+.onLoad <- function(...) {
+ orderly2::orderly_plugin_register(
+ name = "example.db",
+ config = db_config,
+ serialise = db_serialise,
+ deserialise = db_deserialise,
+ schema = "schema.json")
+}
(this code could be in any .R file in the package, or across +several).
+
+id <- orderly2::orderly_run("example", root = path_root)
+## ℹ Starting packet 'example' `20241022-124945-fb73a61c` at 2024-10-22 12:49:45.986707
+## > dat <- example.db::query("SELECT * FROM mtcars WHERE cyl == 4")
+## > orderly2::orderly_artefact("Summary of data", "data.rds")
+## Warning: Please use a named argument for the description in 'orderly_artefact()'
+## In future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## > saveRDS(summary(dat), "data.rds")
+## ✔ Finished running example.R
+## ! 1 warning found:
+## • Please use a named argument for the description in 'orderly_artefact()' In
+## future versions of orderly, we will change the order of the arguments to
+## 'orderly_artefact()' so that 'files' comes first. If you name your calls to
+## 'description' then you will be compatible when we make this change.
+## ℹ Finished 20241022-124945-fb73a61c at 2024-10-22 12:49:46.043002 (0.05629492 secs)
+meta <- orderly2::orderly_metadata(id, root = path_root)
+meta$custom$example.db
+## sql rows cols
+## 1 SELECT * FROM mtcars WHERE cyl == 4 11 mpg, cyl....
Our need for this functionality is similar to this example - pulling +out the database functionality from the original version of orderly into +something that is more independent, as it turns out to be useful only in +a fraction of orderly use-cases. We can imagine other potential uses +though, such as:
+orderly_config.yml
would
+contain account connection details and orderly.yml
would
+contain mapping between the remote data/files and local files. Rather
+than writing to the environment as we do above, use the
+path
argument to copy files into the correct place.
+These all follow the same basic pattern of requiring some +configuration in order to be able to connect to the resource service, +some specification of what resources are to be fetched, and some action +to actually fetch the resource and put it into place.
+
+knitr::opts_chunk$set(
+ collapse = TRUE)
This vignette provides a detailed overview of the query
+language used to find packets and use them as dependencies of other
+packets. You may prefer to start with
+vignette("dependencies")
Orderly includes a query DSL (domain specific language), extending
+the one used by version 1 of orderly (see orderly1::orderly_search()
).
Queries are used in identifying ids to pull in as dependencies, so +rather than providing an identifier, you might want to depend on
The simplest query is
+
+latest()
which finds the most recent packet; this is unlikely to be very +useful without scoping - see below.
+More complex queries are expressed in a syntax that is valid R (this +is also valid Julia and close to valid Python). A complex query is +composed of “tests”
+
+name == "some_name"
+parameter:x > 1
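Because a query is valid R, it can be captured with `quote()` and inspected like any other R expression. The sketch below uses only base R; it shows how R parses a query and is not a description of how orderly2 evaluates one.

```r
# Capture a query without evaluating it
expr <- quote(name == "data" && parameter:x > 3)

# The outermost operator is `&&`, joining the two tests
top <- as.character(expr[[1]])

# The right-hand test is `parameter:x > 3`
rhs <- expr[[3]]

# `parameter:x` is itself a call to `:` with two symbols as arguments
lookup <- rhs[[2]]
parts <- vapply(as.list(lookup), as.character, "")
```

The `:` operator binds more tightly than the comparison operators in R, which is why `parameter:x > 3` groups as a lookup compared against a literal.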
Every “test” uses a comparison operator (<
,
+>
, <=
, >=
,
+==
, or !=
) and the left and right hand side
+can be one of:
a lookup into the packet metadata (parameter:x is the value of a parameter called x, name is the name of the packet, and id is the id of a packet)
+a lookup into the provided data pars (this:x is the value of pars$x)
+a literal value (e.g., "some_name", 1, or TRUE)
Tests can be grouped together using (
, !
,
+&&
, and ||
as you might expect:
parameter:x == 1 || parameter:x == 2 finds packets where the parameter x was 1 or 2
+name == "data" && parameter:x > 3 finds packets called “data” where parameter x is greater than 3
+(parameter:y == 2) && !(parameter:x == 1 || parameter:x == 2) finds packets where parameter y is 2 and parameter x is anything other than 1 or 2 (could also be written (parameter:y == 2) && (parameter:x != 1 && parameter:x != 2))
There are four other functions (the fourth, uses, is covered with the dependency examples further on)
+latest(expr) finds the latest packet satisfying expr - it always returns a length 1 character, but this is NA_character_ if no suitable packet is found. If no expr is given then the latest of all packets is returned.
+single(expr) is like latest(expr) except that it is an error if expr does not evaluate to exactly one packet id
+usedby(expr, FALSE) where expr is either a literal id or an expression which returns 1 id. This finds all packets which were used in the generation of the packet with the id returned from expr (see dependencies section for more details).
+There are two shorthand queries:
+latest is equivalent to latest() (most useful when applied with a scope)
+A packet id (a string matching the regular expression ^([0-9]{8}-[0-9]{6}-[[:xdigit:]]{8})$) is equivalent to single(id == "<id>") where "<id>" is the string provided
+WARNING: we may remove this
+Scoping queries can be used as a shorthand for filtering the returned
+packets. In the future they could be used to reduce the set of packets
+that are searched over to speed up query evaluation. They join together
+with the main query as (scope) && (expr)
, except
+when the expr
is a call to latest
or
+single
. In this case they combine as
+latest((scope) && (expr))
or
+single((scope) && (expr))
. This is useful if you
+want to limit the search to a particular name or location but perform
+some more detailed search.
For example, the query
+
+orderly_query(quote(parameter:x == 1), scope = quote(name == "data"))
is equivalent to
+
+orderly_query(quote(parameter:x == 1 && name == "data"))
orderly2
uses this functionality when resolving
+dependencies with orderly2::orderly_dependency
.
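The combination rule described above can be sketched in base R. The function `combine_scope` is hypothetical and written only for this illustration; it is not part of orderly2 and makes no claim about how orderly2 implements scoping internally.

```r
# Hypothetical helper: join a scope onto a query following the rule above
combine_scope <- function(expr, scope) {
  # Treat the bare `latest` shorthand as `latest()`
  if (identical(expr, quote(latest))) {
    expr <- quote(latest())
  }
  is_wrapper <- is.call(expr) &&
    (identical(expr[[1]], quote(latest)) || identical(expr[[1]], quote(single)))
  if (!is_wrapper) {
    # Ordinary query: (scope) && (expr)
    return(bquote((.(scope)) && (.(expr))))
  }
  if (length(expr) == 1) {
    # latest() with no argument: the scope becomes the whole condition
    inner <- scope
  } else {
    # latest(expr) or single(expr): push the scope inside the call
    inner <- bquote((.(scope)) && (.(expr[[2]])))
  }
  as.call(list(expr[[1]], inner))
}

scope <- quote(name == "data")
plain <- combine_scope(quote(parameter:x == 1), scope)
wrapped <- combine_scope(quote(latest(parameter:x == 1)), scope)
bare <- combine_scope(quote(latest), scope)
```

Here `plain` is `(name == "data") && (parameter:x == 1)`, `wrapped` is `latest((name == "data") && (parameter:x == 1))`, and `bare` is `latest(name == "data")`.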
Very often users will want to scope by name, so instead of passing a
+scope
argument there is a shorthand name
+argument for usability.
+orderly_query(quote(parameter:x == 1), name = "data")
which is equivalent to
+
+orderly_query(quote(parameter:x == 1), scope = quote(name == "data"))
If we have 2 packets, where B depends on output from A (i.e. we call
+(id_a, ...)
when running packet B) we can draw this as.
We could equivalently say
+With the tree of dependencies among our packets we might want to
+search for packets which have been used by another packet. We can use
+the query function usedby(id)
to list all packets which are
+used by id
. This will search recursively through all
+packets used by id
and its parents and its parents’ parents
+and so on.
The optional second arg immediate
is FALSE
+by default; if set to TRUE
then we search only for
+immediate (i.e. level 1) dependencies.
Being able to search through dependencies like this means if we have +some packet structure like
+ +and we want to know the id
of A
which was
+used by C
we can find this using
+orderly_search
+orderly_search(quote(usedby(latest(name == "C"))), name = "A")
usedby
can be combined with groupings and scope:
+orderly_search(quote(usedby(latest(name == "C")) && parameter:year == 2022),
+ name = "A")
The depth that usedby
will recurse can be controlled by
+setting the depth
e.g.
+orderly_search(quote(usedby(latest(name == "C"), depth = 1)), name = "A")
will search for just immediate parents of C
.
+depth
can be any positive integer; by default
+depth
will recurse until it finds all parents.
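The recursion and the effect of `depth` can be illustrated with a toy dependency table in base R. Everything here is hypothetical: the `deps` list (a chain in which C used B and B used A) and the `used_by` function are stand-ins for illustration, not orderly2's index or implementation.

```r
# Hypothetical dependency table: each packet maps to the packets it used directly
deps <- list(A = character(), B = "A", C = "B")

# Walk up the tree one level at a time, stopping after `depth` levels
used_by <- function(id, deps, depth = Inf) {
  found <- character()
  frontier <- id
  level <- 0
  while (length(frontier) > 0 && level < depth) {
    parents <- unique(unlist(deps[frontier], use.names = FALSE))
    frontier <- setdiff(parents, found)  # only follow packets not yet seen
    found <- c(found, frontier)
    level <- level + 1
  }
  found
}

all_parents <- used_by("C", deps)             # B, then A
immediate <- used_by("C", deps, depth = 1)    # B only
```

With the default `depth` the search continues until no new parents are found; with `depth = 1` it stops after the immediate parents.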
usedby
can be simplified by using subqueries. Subqueries
+are denoted by curly braces {}
and can either be named and
+passed in subquery
arg or can be anonymous. The query below
+is equivalent to the above but uses a subquery for C
.
+orderly_search(quote(usedby({C}) && parameter:year == 2022),
+ name = "A",
+ subquery = list(C = quote(latest(name == "C"))))
There are two important things to note about usedby
:
The query inside usedby will search the entire index, ignoring any scope or name parameters. This is because we want to find all packets which are used by latest C. If the subquery C was scoped this would return no results.
+The query inside usedby must return a single result. To ensure this it must either be a literal id, a call to latest or a call to single
+As well as searching up the dependency tree using usedby
+we can search down with the uses
function. In the same
+setup above with reports A
, B
and
+C
if we want to know the id
of C
+which uses A
we can find this by using
+orderly_search(quote(uses(latest(name == "A"))),
+ name = "C")
uses
and usedby
can be combined to search
+more complex arrangements of dependencies. If we have something like
If we want to search for the version of E
which depends
+on the version of A
which was used in the latest
+C
we can do this via
orderly_search(
+ quote(latest(uses(single(usedby(latest(name == "C")) && name == "A")))),
+ name = "E")
+This searches up the tree from C
to A
and
+then down the tree to find the version of E
. Note that it
+is important that we added the name == "A"
condition here; if
+that was missing usedby(latest(name == "C"))
would also
+return B
and single
would throw an error
+because we have multiple packets.
We can also search up the tree and then down to find A
+from D
e.g.
orderly_query(
+ quote(usedby(single(uses(name == "D")))),
+ name = "A")
+Note that as E
is the only packet which uses D
+we do not need to add a name == "E"
clause.
We can combine usedby
and uses
in more
+complex searches, such as to find D
from C
we
+could run
orderly_query(
+  quote(usedby(single(uses(single(usedby(latest(name == "C")) && name == "A")) && name == "E"))),
+ name = "D")
+As discussed in the
+orderly introduction, you do not want to commit any files from
+.outpack/
, draft/
or archive/
+(if used) to git as this will create all sorts of problems down the
+line.
If you were directed here, it is probably because you have ended up +with these files in git and want to undo this situation. The least +painful way depends on your situation.
+We have now put in guard rails to try and prevent this happening, but
+it could still happen to you if you modify the .gitignore
+file or force-add files for example.
Once you are in this situation, orderly2
will shout at
+you:
+orderly2::orderly_run("data")
+## Error in `orderly2::orderly_run()`:
+## ! Detected 6 outpack files committed to git
+## ✖ Detected files were found in '.outpack/' and 'archive/'
+## ℹ For tips on resolving this, please see
+## <https://mrc-ide.github.io/orderly2/articles/troubleshooting.html>
+## ℹ To turn this into a warning and continue anyway set the option
+## 'orderly_git_error_is_warning' to TRUE by running
+## options(orderly_git_error_is_warning = TRUE)
which may have directed you to this very page. If you just want to +continue working anyway, then run the suggested command:
+
+options(orderly_git_error_is_warning = TRUE)
after which things will work with a warning the first time in that +session:
+
+orderly2::orderly_run("data")
+## Warning in orderly2::orderly_run("data"): Detected 6 outpack files committed to git
+## ✖ Detected files were found in '.outpack/' and 'archive/'
+## ℹ For tips on resolving this, please see
+## <https://mrc-ide.github.io/orderly2/articles/troubleshooting.html>
+## This warning is displayed once per session.
+## ✔ Wrote '.gitignore'
+## ℹ Starting packet 'data' `20241022-124952-6bc202bc` at 2024-10-22 12:49:52.425409
+## > orderly2::orderly_artefact("data.rds", description = "Final data")
+## > saveRDS(mtcars, "data.rds")
+## ✔ Finished running data.R
+## ℹ Finished 20241022-124952-6bc202bc at 2024-10-22 12:49:52.461774 (0.03636503 secs)
+## [1] "20241022-124952-6bc202bc"
subsequent calls will not display the warning:
+
+orderly2::orderly_run("data")
+## ℹ Starting packet 'data' `20241022-124952-8e2a4858` at 2024-10-22 12:49:52.560022
+## > orderly2::orderly_artefact("data.rds", description = "Final data")
+## > saveRDS(mtcars, "data.rds")
+## ✔ Finished running data.R
+## ℹ Finished 20241022-124952-8e2a4858 at 2024-10-22 12:49:52.583371 (0.02334881 secs)
+## [1] "20241022-124952-8e2a4858"
The rest of this section discusses how you might permanently fix the +issue.
+This is the case if you have just started a project, and are not yet +collaborating on it with anyone else (or if that person is willing to +re-clone their sources). The simplest thing to do is:
Delete the .git directory entirely
+Run orderly2::orderly_gitignore_update("(root)") to set up a reasonable .gitignore that will prevent this situation happening again
+Run git add . to add everything back in (review this with git status to make sure you’re happy)
+Run git commit -m "Initial commit" to create a single commit that contains all the files currently in your repo with no history, and also with no .outpack files
+If you have previously pushed this repo to GitHub or similar then you will need to set that up again
+git remote add origin https://github.com/user/repo (replacing user/repo with your path, or using git@github.com:user/repo if you use ssh to talk with GitHub)
+git branch -M main assuming you are using main for your default branch, which is now most common
+git push --force -u origin main
Note that this is destructive and will require +coordination with any collaborators as you have changed history.
+If you do care about your history, but you also have only committed a
+few files (e.g., you have committed files from .outpack/
+which are small but not a large 100MB file in archive/
that
+is preventing you pushing
+to GitHub) then you could just delete the offending files from git
+without updating the history, or affecting your local copies.
git rm -r --cached .outpack (repeating with draft and archive as needed; the -r flag is required because these are directories)
+orderly2::orderly_gitignore_update("(root)") to set up a reasonable .gitignore that will prevent this situation happening again
+git add .gitignore to also stage this
+git commit -m "Delete unwanted outpack files"
+You can then push this without any issues.
+If you are working on a branch, and the unwanted files were committed
+on that branch, the simplest thing to do is to copy the changes you have
+made somewhere safe, create a new branch against the current
+main
and copy those changes over there. You could do this
+somewhat automatically by generating and applying a patch:
git diff main -- src > changes.patch
+git checkout main
+git checkout -b changes-attempt2
+git apply changes.patch
+git push -u origin changes-attempt2
+If the unwanted files have been committed onto your default branch, +then you will have to do some potentially gory history rewriting. See this +StackOverflow question, the git docs +and the currently +recommended tool for doing this. Good luck!
+