# Dynamic branching {#dynamic}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(
  drake_make_menu = FALSE,
  drake_clean_menu = FALSE,
  warnPartialMatchArgs = FALSE,
  crayon.enabled = FALSE,
  readr.show_progress = FALSE,
  tidyverse.quiet = TRUE
)
```
```{r, echo = FALSE, message = FALSE}
library(broom)
library(drake)
library(gapminder)
library(tidyverse)
```
## A note about versions
Dynamic branching was first released in `drake` version 7.8.0, and it behaves differently in subsequent versions. This manual describes how dynamic branching works in the development version of `drake` (to become version 7.9.0 in early January 2020). If you are using version 7.8.0, please refer to [this version of the chapter](https://github.com/ropensci-books/drake/blob/c4dfa6dd71b5ffa4c6027633ae048d2ab0513c6d/dynamic.Rmd) instead.
## Motivation
In large workflows, you may need more targets than you can easily type in a plan, and you may not be able to fully specify all targets in advance. Dynamic branching is an interface to declare new targets while `make()` is running. It lets you create more compact plans and graphs, it is easier to use than [static branching](#static), and it improves the startup speed of `make()` and friends.
## Which kind of branching should I use?
With dynamic branching, `make()` is faster to initialize, and you have far more flexibility. With [static branching](#static), you have meaningful target names, and it is easier to predict what the plan is going to do in advance. There is a ton of room for overlap and personal judgement, and you can even use both kinds of branching together.
## Dynamic targets
A dynamic target is a [vector](https://vctrs.r-lib.org/) of *sub-targets*. We let `make()` figure out which sub-targets to create and how to aggregate them.
As an example, let's fit a regression model to each continent in the [Gapminder data](https://github.com/jennybc/gapminder). To activate dynamic branching, use the `dynamic` argument of `target()`.
```{r}
library(broom)
library(drake)
library(gapminder)
library(tidyverse)
# Split the Gapminder data by continent.
gapminder_continents <- function() {
  gapminder %>%
    mutate(gdpPercap = scale(gdpPercap)) %>%
    split(f = .$continent)
}
# Fit a model to a continent.
fit_model <- function(continent_data) {
  data <- continent_data[[1]]
  data %>%
    lm(formula = gdpPercap ~ year) %>%
    tidy() %>%
    mutate(continent = data$continent[1]) %>%
    select(continent, term, statistic, p.value)
}
plan <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents))
)
make(plan)
```
Every sub-target has the same data type as the dynamic target it belongs to. In other words, `model` and a sub-target such as `model_23022788` are both data frames, and `readd(model)` and friends automatically concatenate all the `model_*` sub-targets.
```{r}
readd(model)
```
This behavior is powered by the [`vctrs`](https://vctrs.r-lib.org/) package. A dynamic target like `model` above is really a "`vctr`" of sub-targets. Under the hood, the aggregated value of `model` is what you get from calling `vec_c()` on all the `model_*` sub-targets. When you dynamically `map()` over a non-dynamic object, you are taking slices with `vec_slice()`. (When you `map()` over a dynamic target, each element is a sub-target and `vec_slice()` is not necessary.)
```{r}
library(vctrs)
# same as readd(model)
s <- subtargets(model)
vec_c(
  readd(s[1], character_only = TRUE),
  readd(s[2], character_only = TRUE),
  readd(s[3], character_only = TRUE),
  readd(s[4], character_only = TRUE),
  readd(s[5], character_only = TRUE)
)
loadd(model)
# Second slice if you were to map() over mtcars.
vec_slice(mtcars, 2)
# Fifth slice if you were to map() over letters.
vec_slice(letters, 5)
```
You can use `vec_c()` and `vec_slice()` to anticipate edge cases in dynamic branching.
```{r}
# If you map() over a list, each sub-target is a single-element list.
vec_slice(list(1, 2), 1)
```
```{r}
# If each sub-target has multiple elements,
# the aggregated target (e.g. from readd())
# will have more elements than sub-targets.
subtarget1 <- c(1, 2)
subtarget2 <- c(3, 4)
vec_c(subtarget1, subtarget2)
```
Back in our plan, `target(fit_model(continents), dynamic = map(continents))` is equivalent to running the commands `fit_model(continents[1])` through `fit_model(continents[5])`. Because `continents` is a list of data frames, each of `continents[1]` through `continents[5]` is itself a list containing a single data frame, which is why `fit_model()` needs the line `data <- continent_data[[1]]`.
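To make the need for `[[1]]` concrete, here is a minimal sketch that does not involve `drake` at all. The object `toy_continents` is a made-up stand-in for the `continents` target, and `first_slice` is a hypothetical name for the slice that the first sub-target would receive.

```{r}
library(vctrs)
# Hypothetical stand-in for `continents`: a named list of data frames.
toy_continents <- list(
  Africa = data.frame(year = c(1952, 1957), gdpPercap = c(1, 2)),
  Asia = data.frame(year = c(1952, 1957), gdpPercap = c(3, 4))
)
# Slicing a list returns a shorter list, not the element inside it.
first_slice <- vec_slice(toy_continents, 1)
class(first_slice)
length(first_slice)
# Double brackets extract the data frame itself.
class(first_slice[[1]])
```

The same holds for base R indexing: `toy_continents[1]` is a list and `toy_continents[[1]]` is a data frame.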
To post-process our models, we can work with either the individual sub-targets or the whole vector of all the models. Below, `year` uses the former and `intercept` uses the latter.
```{r}
plan <- drake_plan(
  continents = gapminder_continents(),
  model = target(fit_model(continents), dynamic = map(continents)),
  # Filter each model individually:
  year = target(filter(model, term == "year"), dynamic = map(model)),
  # Aggregate all the models, then filter the whole vector:
  intercept = filter(model, term != "year")
)
make(plan)
```
```{r}
readd(year)
```
```{r}
readd(intercept)
```
If automatic concatenation of sub-targets is confusing (e.g. if some sub-targets are `NULL`, as in <https://github.com/ropensci-books/drake/issues/142>) you can read the dynamic target as a named list (only in `drake` version 7.10.0 and above).
```{r}
readd(model, subtarget_list = TRUE) # Requires drake >= 7.10.0.
```
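One reason `NULL` sub-targets are confusing is that `vec_c()` silently drops `NULL` inputs, so the aggregated value can have fewer elements than there are sub-targets. The sketch below uses made-up values (`sub1`, `sub2`, and `sub3` are hypothetical and are not real sub-targets from the plan above).

```{r}
library(vctrs)
sub1 <- 1
sub2 <- NULL # a sub-target that returned nothing
sub3 <- 3
aggregated <- vec_c(sub1, sub2, sub3)
aggregated
# Fewer elements than sub-targets:
length(aggregated)
```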
Alternatively, you can identify an individual sub-target by its index.
```{r}
subtargets(model)
readd(model, subtargets = 2) # equivalent to readd() on a single model_* sub-target
```
If you don't know the index offhand, you can find out using the sub-target's name.
```{r, echo = FALSE}
subtarget <- subtargets(model)[2]
```
```{r}
print(subtarget)
which(subtarget == subtargets(model))
```
If the sub-target errored out and `subtargets()` fails, the individual sub-target metadata will have a `subtarget_index` field.
```{r, eval = FALSE}
diagnose(subtarget, character_only = TRUE)$subtarget_index
#> [1] 2
```
Either way, once you have the sub-target's index, you can retrieve the section of data that the sub-target took as input. Below, we load the part of `continents` that the second sub-target of `model` used during `make()`.
```{r}
vctrs::vec_slice(readd(continents), 2)
```
If `continents` were dynamic, we could have just used `readd(continents, subtargets = 2)`. But `continents` was a static target, so we needed to replicate `drake`'s dynamic branching behavior using `vctrs`.
## Dynamic transformations
Dynamic branching supports the transformations `map()`, `cross()`, and `group()`. These transformations tell `drake` how to create sub-targets.
### `map()`
`map()` iterates over the [vector slices](https://vctrs.r-lib.org/reference/vec_slice.html) of the targets you supply as arguments. We saw above how `map()` iterates over lists. If you give it a data frame, it will map over the rows.
```{r}
plan <- drake_plan(
  subset = head(gapminder),
  row = target(subset, dynamic = map(subset))
)
make(plan)
```
```{r}
readd(row_9939cae3)
```
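As a rough illustration of row-wise slicing outside of `drake`, the sketch below splits a small made-up data frame (`toy`, a hypothetical stand-in for `subset`) into one-row slices with `vec_slice()` and then reassembles them with `vec_c()`. It only mimics the slicing and aggregation described in this chapter and does not create any sub-targets.

```{r}
library(vctrs)
# Hypothetical stand-in for the `subset` target.
toy <- data.frame(
  country = c("A", "B", "C"),
  pop = c(10, 20, 30),
  stringsAsFactors = FALSE
)
# One slice per row.
rows <- lapply(seq_len(vec_size(toy)), function(i) vec_slice(toy, i))
length(rows)
rows[[2]]
# Combining the slices recovers all the rows.
combined <- do.call(vec_c, rows)
combined
```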
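If you supply multiple targets, `map()` iterates over their slices in parallel: the first sub-target gets the first slice of each target, the second sub-target gets the second slice of each, and so on.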
```{r}
plan <- drake_plan(
  numbers = seq_len(2),
  letters = c("a", "b"),
  zipped = target(paste0(numbers, letters), dynamic = map(numbers, letters))
)
make(plan)
```
```{r}
readd(zipped)
```
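For intuition, the pairing can be imitated in base R with one call per position. This is only a sketch: `nums` and `lets` are hypothetical stand-ins for the `numbers` and `letters` targets, and no sub-targets are created.

```{r}
nums <- seq_len(2)
lets <- c("a", "b")
# One call per index: element i of `nums` is paired with element i of `lets`.
paired <- vapply(
  seq_along(nums),
  function(i) paste0(nums[i], lets[i]),
  FUN.VALUE = character(1)
)
paired
```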
### `cross()`
`cross()` creates a new sub-target for each combination of targets you supply as arguments.
```{r}
plan <- drake_plan(
  numbers = seq_len(2),
  letters = c("a", "b"),
  combo = target(paste0(numbers, letters), dynamic = cross(numbers, letters))
)
make(plan)
```
```{r}
readd(combo)
```
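A comparable set of combinations can be sketched in base R with `expand.grid()`. Here `nums` and `lets` are hypothetical stand-ins for the targets, and the row order of `expand.grid()` is not assumed to match the order of `drake`'s sub-targets, so the result is sorted before printing.

```{r}
nums <- seq_len(2)
lets <- c("a", "b")
# Every pairing of an element of `nums` with an element of `lets`.
grid <- expand.grid(nums = nums, lets = lets, stringsAsFactors = FALSE)
combos <- paste0(grid$nums, grid$lets)
# 2 * 2 = 4 combinations in total.
length(combos)
sort(combos)
```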
### `group()`
With `group()`, you can create multiple aggregates of a given target. Use the `.by` argument to set a grouping variable.
```{r}
plan <- drake_plan(
  data = gapminder,
  by = data$continent,
  gdp = target(
    tibble(median = median(data$gdpPercap), continent = by[1]),
    dynamic = group(data, .by = by)
  )
)
make(plan)
```
```{r}
readd(gdp)
```
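The grouping idea resembles a split-apply pattern in base R. The sketch below uses made-up data (`toy` and `grouping` are hypothetical stand-ins for `data` and `by`) and computes one median per group. It is an analogy for what each sub-target computes, not a description of how `drake` implements `group()`.

```{r}
toy <- data.frame(
  continent = c("Asia", "Europe", "Asia", "Europe"),
  gdpPercap = c(1, 10, 3, 20),
  stringsAsFactors = FALSE
)
grouping <- toy$continent
# One piece of `toy` per distinct value of the grouping vector.
pieces <- split(toy, grouping)
# One result per piece, analogous to one sub-target per group.
medians <- vapply(
  pieces,
  function(piece) median(piece$gdpPercap),
  FUN.VALUE = numeric(1)
)
medians
```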
## Trace
All dynamic transforms have a `.trace` argument to record optional metadata for each sub-target. In the example from `group()`, the trace is another way to keep track of the continent of each median GDP value.
```{r}
plan <- drake_plan(
  data = gapminder,
  by = data$continent,
  gdp = target(
    median(data$gdpPercap),
    dynamic = group(data, .by = by, .trace = by)
  )
)
make(plan)
```
The `gdp` target no longer contains any explicit reference to continent.
```{r}
readd(gdp)
```
However, we can look up the continents in the trace.
```{r}
read_trace("by", gdp)
```
## `max_expand`
Suppose we want a model for each *country*.
```{r}
gapminder_countries <- function() {
  gapminder %>%
    mutate(gdpPercap = scale(gdpPercap)) %>%
    split(f = .$country)
}
plan <- drake_plan(
  countries = gapminder_countries(),
  model = target(fit_model(countries), dynamic = map(countries))
)
```
The Gapminder dataset has 142 countries, which can get overwhelming. In the early stages of the workflow when we are still debugging and testing, we can limit the number of sub-targets using the `max_expand` argument of `make()`.
```{r}
make(plan, max_expand = 2)
```
```{r}
readd(model)
```
Then, when we are confident and ready, we can scale up to the full number of models.
```{r, eval = FALSE}
make(plan)
```