---
title: Model Benchmark
format: gfm
---
This repository compares the time-to-train of two gradient-boosted decision tree (GBDT) frameworks, LightGBM and XGBoost, across different hardware and versions. The comparison is intended to help the CCAO make two decisions:

1. Which GBDT framework to use for its 2024 automated valuation models
2. Whether to purchase or rent additional hardware (a GPU) to speed up model training
Below are the results of our tests:
> :warning: **NOTE:**
>
> The performance statistics presented here are only for cross-model comparison and do not reflect real model results; they are included solely to show that each GBDT framework and version produces similar results given the same data.
## Results
```{r setup, echo=FALSE, message=FALSE}
library(arrow)
library(dplyr)
library(gt)
library(kableExtra)
library(purrr)
library(readr)
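# Hardware/framework specs for each benchmark run; percent columns arrive as text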
model_specs <- read_csv("input/model_specs.csv") %>%
mutate(
across(starts_with("Approx") & where(is.character), ~ parse_number(.x))
) %>%
select(Key:`Approx. Peak Device Utilization`)
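# Performance metrics, one parquet file per benchmark run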
perf_df <- map_dfr(
list.files("output/performance", pattern = "*.parquet", full.names = TRUE),
read_parquet
)
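# Wall-clock timings for each run, pivoted to one column per benchmark stage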
time_df <- map_dfr(
list.files("output/timing", pattern = "*.parquet", full.names = TRUE),
read_parquet
) %>%
select(-c(tic, toc)) %>%
tidyr::pivot_wider(id_cols = run_id, names_from = stage, values_from = time)
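# Combine specs, timings, and performance metrics into one table per run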
full_data <- model_specs %>%
left_join(time_df, by = c("Key" = "run_id")) %>%
left_join(perf_df, by = c("Key" = "run_id")) %>%
select(-c(Key, time)) %>%
rename(
`Wall Time (Full Run)` = train,
`Wall Time (Prediction)` = predict,
`Wall Time (SHAP)` = shap,
r2 = rsq
) %>%
rename_with(toupper, rmse:mki) %>%
mutate(
`Approx. Peak Device Utilization` = `Approx. Peak Device Utilization` / 100,
MAPE = MAPE / 100
)
```
```{r create_table, echo=FALSE, message=FALSE}
full_data %>%
select(-`Hardware Specifications`) %>%
gt() %>%
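# Footnote marks flag rows whose categorical feature handling differs (see notes below the table)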
tab_footnote(
footnote = "1",
locations = cells_body(Type, c(3, 7, 8, 11, 12))
) %>%
tab_footnote(footnote = "2", locations = cells_body(Type, c(6, 10))) %>%
fmt_percent(starts_with("Approx."), decimals = 0) %>%
fmt_percent(MAPE, decimals = 2) %>%
fmt_duration(
starts_with("Wall Time"),
input_units = "seconds",
output_units = c("hours", "minutes", "seconds")
) %>%
fmt_currency(c(contains("RMSE"), contains("MAE")), decimals = 0) %>%
fmt_number(R2:MKI, decimals = 3) %>%
extract_body() %>%
# Replace spaces with non-breaking spaces so durations don't wrap
mutate(across(starts_with("Wall Time"), ~ gsub(" ", "\u00a0", .x))) %>%
rename(`Peak Device Utilization` = `Approx. Peak Device Utilization`) %>%
knitr::kable(format = "markdown", align = "llllrrrrrrrrrr")
```
1. Categorical features with more than 50 unique values are hashed; otherwise they are one-hot encoded.
2. Categorical features with more than 50 unique values are hashed; otherwise they are handled natively by the framework.
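As an illustration of these encoding rules, below is a minimal R sketch assuming a single factor column. The crc32 hash from the {digest} package and the bucket count are assumptions for illustration, not necessarily what the benchmark scripts use.

```r
library(digest)

# Hedged sketch: hash categoricals with more than `threshold` levels,
# one-hot encode the rest. Hash choice and bucket count are illustrative.
encode_categorical <- function(x, threshold = 50, n_buckets = 256) {
  stopifnot(is.factor(x))
  if (nlevels(x) > threshold) {
    # Feature hashing: map each category label into one of n_buckets
    hashes <- vapply(as.character(x), digest, character(1), algo = "crc32")
    strtoi(substr(hashes, 1, 6), base = 16L) %% n_buckets
  } else {
    # One-hot encoding: one indicator column per category level
    stats::model.matrix(~ x - 1)
  }
}

# e.g. encode_categorical(factor(c("a", "b", "a")))
```

For the natively-handled case (footnote 2), LightGBM can instead consume factor columns directly via its `categorical_feature` parameter.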
## Cost estimates
```{r costs, message=FALSE, echo=FALSE}
# g5.4xlarge prices as of 2023-10-06 in us-east-1
price_full <- 1.624
price_spot <- 0.6716
# Runs in this repo don't use early stopping, but most runs during CV will
# and will therefore finish faster. Multiply by this factor to approximate
# the typical number of iterations reached when stopping early
early_stop_mult <- 0.8
# Average number of CV iterations per run
n_iters <- 60
# Total number of folds per CV iteration
n_folds <- floor((9 * 12) / 15)
# Time per CV fold, based on proportion of the total training data relative
# to a full run (all data)
n_folds_mult <- seq_len(n_folds) / n_folds
# Total number of runs per year
n_runs <- 200
# Estimated MSRP of an A40
a40_cost <- 7000
# Create an estimated cost table
est_costs <- full_data %>%
slice(c(2, 3, 8, 12, 12)) %>%
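# The last row is sliced twice so the AWS run can be priced at both normal and spot rates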
bind_cols(tibble(instance = c(NA, NA, NA, "Normal", "Spot"))) %>%
rename(
time_run = `Wall Time (Full Run)`,
time_pred = `Wall Time (Prediction)`,
time_shap = `Wall Time (SHAP)`
) %>%
rowwise() %>%
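# Each CV fold trains on a growing share of the data, so scale the full-run
# time by each fold's share, then by the number of CV iterations and the
# early-stopping multiplier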
mutate(
time_cv = sum(time_run * n_folds_mult) * n_iters * early_stop_mult,
time_total = time_cv + time_pred + (time_shap * (1.1e6 / 50000)),
) %>%
ungroup() %>%
mutate(
cpr = case_when(
Server == "NVIDIA" ~ a40_cost / n_runs,
Server == "AWS" & instance == "Normal" ~ (time_total / 3600) * price_full,
Server == "AWS" & instance == "Spot" ~ (time_total / 3600) * price_spot,
TRUE ~ NA_real_
),
tot_cost = case_when(
Server == "NVIDIA" ~ a40_cost,
Server == "AWS" ~ cpr * n_runs,
TRUE ~ NA_real_
)
) %>%
select(
Server:Device,
`Instance Type` = instance,
`Est. Time Per Run` = time_total,
`Est. Cost Per Run` = cpr,
`Est. Total 2024 Cost` = tot_cost
)
est_costs %>%
gt() %>%
tab_footnote(
footnote = "1",
locations = cells_body(`Server`, 3)
) %>%
tab_footnote(
footnote = "2",
locations = cells_body(`Server`, 4:5)
) %>%
fmt_duration(
`Est. Time Per Run`,
input_units = "seconds",
output_units = c("hours", "minutes", "seconds")
) %>%
fmt_currency(contains("Cost"), decimals = 2) %>%
extract_body() %>%
mutate(
across(everything(), ~ tidyr::replace_na(.x, "-")),
across(everything(), ~ replace(.x, .x == "NA", "-"))
) %>%
knitr::kable(format = "markdown", align = "llllrrr")
```
1. Estimate assumes a fixed cost for an NVIDIA A40 of `r scales::dollar(a40_cost)`.
2. Estimates use AWS costs for `g5.4xlarge` instances created ephemerally using AWS Batch + Fargate. As of 2023-10-06, costs are:
- Normal hourly pricing: `r scales::dollar(price_full)`
- Spot hourly pricing: `r scales::dollar(price_spot)`
3. All estimates assume 200 total runs in 2024; the cost per run for the NVIDIA option decreases as the number of runs increases.
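In symbols, the per-run time estimate computed in the chunk above is:

$$
t_{\text{total}} = \underbrace{m \, n_{\text{iters}} \sum_{k=1}^{n_{\text{folds}}} \frac{k}{n_{\text{folds}}} \, t_{\text{run}}}_{\text{cross-validation}} + t_{\text{pred}} + \frac{1{,}100{,}000}{50{,}000} \, t_{\text{shap}}
$$

where $m = 0.8$ is the early-stopping multiplier, $n_{\text{iters}} = 60$, $n_{\text{folds}} = 7$, and the SHAP term scales the 50K-row benchmark time up to the roughly 1.1M-row assessment set.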
## Hardware
These tests were run on three different machines: an on-prem modeling server used by the CCAO, a test server provided by NVIDIA via the [LaunchPad](https://www.nvidia.com/en-us/launchpad/) program, and a standard AWS `g5.4xlarge` EC2 instance. The machines have the following specifications:
| | CCAO | NVIDIA | g5.4xlarge |
|--------------|------------------------------------------|----------------------------------------|----------------------------------------|
| **CPU** | Xeon Silver 4208 CPU @ 2.10GHz, 16 cores | Xeon Gold 6354 CPU @ 3.00GHz, 16 cores | AMD EPYC 7R32 @ 3.3GHz, 16 cores |
| **Memory** | 128GiB | 512GiB | 64GiB |
| **GPU** | - | NVIDIA A40, 48GB | NVIDIA A10G, 24GB |
| **OS** | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| **Compiler** | gcc (11.4.0) -O3 -march=native | gcc (11.4.0) -O3 -march=native | gcc (11.4.0) -O3 -march=native |
## Tasks
The tasks performed for this benchmark (as shown in the results table) are as follows:
- **Full Run** - The model is trained on the combined training and test data, yielding the model used for prediction on unseen data.
- **Prediction** - The trained model predicts on the assessment data.
- **SHAP** - The trained model computes SHAP values for the first 50K rows of the assessment data.
- For performance metrics, the model is trained on the training set only, then predicts on a holdout test set. Performance is calculated from the test set predictions.
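The timing parquet files consumed in the setup chunk contain `run_id`, `stage`, `tic`, `toc`, and `time` columns. A minimal sketch of how one such record could be produced with the {tictoc} package (a hypothetical helper, not the repo's actual benchmark code):

```r
library(tictoc)
library(tibble)

# Hypothetical helper: time one benchmark stage and return a row shaped
# like the records in output/timing/*.parquet
time_stage <- function(run_id, stage, expr) {
  tic()
  force(expr)  # evaluate the lazily-passed expression between tic/toc
  t <- toc(quiet = TRUE)
  tibble(
    run_id = run_id,
    stage = stage,
    tic = t$tic,
    toc = t$toc,
    time = t$toc - t$tic
  )
}

# e.g. time_stage("example_run", "train", Sys.sleep(2))
```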
## Inputs
The benchmark uses the following inputs:
- Input data from the 2023 CCAO residential valuation model:
- [`input/training_data.parquet`](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/res/2023/training_data.parquet) - 424,950 rows
- [`input/assessment_data.parquet`](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/res/2023/assessment_data.parquet) - 1,099,226 rows
- (Hyper)parameters used for this benchmark can be found in [`params.yaml`](./params.yaml)
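To reproduce the benchmark inputs locally, the public parquet files linked above can be downloaded and read with {arrow}; a minimal sketch (destination path is illustrative):

```r
library(arrow)

# Fetch the public 2023 training data linked above and read it locally
url <- paste0(
  "https://ccao-data-public-us-east-1.s3.amazonaws.com/",
  "models/inputs/res/2023/training_data.parquet"
)
dest <- tempfile(fileext = ".parquet")
download.file(url, dest, mode = "wb")
training_data <- read_parquet(dest)
nrow(training_data)  # 424,950 per the list above
```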