-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathmodels_01_error_minimization.Rmd
195 lines (132 loc) · 10 KB
/
models_01_error_minimization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
---
title: 'Model Fitting: Error Minimization'
output:
html_notebook: default
html_document:
df_print: paged
pdf_document: default
author: Marius 't Hart
---
```{r setup, cache=FALSE, include=FALSE}
library(knitr)
opts_chunk$set(comment='', eval=FALSE)
```
Computational models provide a (mathematically) precise explanation for observed data. Some people will say that all models are wrong, but then again, some models are more wrong than others. Here you will learn how to build and fit some simple models using optimization techniques, and then pick the least wrong model.
# Simple linear example
For some models, there is an analytical solution, that doesn't require model fitting. We'll start with that as we can then verify that a fitted model is pretty good. A simple model problem that has an analytic solution is a linear model. Here we'll consider a linear model with two parameters, an intercept _a_, and a slope _b_:
$y = a + bx$
First, we'll generate some data:
```{r}
# for reproducibility:
set.seed(1937)
# these are the intercept and slope we'll use:
a <- 5.2
b <- 3.1
# x will have random numbers:
x <- runif(n=25, min=0, max=5)
# y will depend on the generative model plus some uniform noise:
y <- (b*x) + a + rnorm(25)
# let's see our artificial data:
plot(x,y, xlim=c(0,5), ylim=c(0,25))
```
We can find the optimal intercept and slope given the data, by using the function `lm()`:
```{r}
data <- data.frame(x,y)
linmod <- lm(y ~ x, data)
summary(linmod)
```
Not only does that fit the data pretty well (look at those p-values!) but the linmod object has a property `coefficients` that allows us to get a vector corresponding to a & b:
```{r}
print(linmod$coefficients)
```
You'll notice that these arent't exactly the ones that we put in, which is because we also added uniform noise to the dependent variable. The intercept and slope that we get here can be shown to be the optimal intercept and slope, if you want to minimize the residual errors between the model and the data. That is, you can't find an intercept and/or slope that further reduces the errors between model and data.
There's two things to unpack here. First, any model fit can only be as good as `lm()` but not better (and probably a bit worse). We'll see this later. Second, what exactly do we mean with errors between model and data?
That, we can illustrate:
```{r}
plot(x,y, xlim=c(0,5), ylim=c(0,25))
abline(linmod, col='blue')
ab <- linmod$coefficients
y_pred <- ab[1] + (x*ab[2])
segments(x0=x, y0=y, x1=x, y1=y_pred, col='red')
```
The blue line represents prediction of y coordinates produced by `lm()`, based on x coordinates. But since the real data is at the black circles, that line is off somewhat for every observation. It is off by the red lines. Those are the errors, and we want them to be as small as possible.
# Model fitting
Instead of using `lm()` which has only limited application, we want to do something more general. There are algorithms that can search through the possible values of the parameter of a model until they can't reduce the errors of the model anymore. To do this, those algorithms need a function that calculates the error of the model, so that is the first thing we need. However, it should be independent of the amount of data that goes into fitting the model, so we take the mean error. If a model is good, the mean error should work as well for old data as for new data, regardless of how many data points there are. Sometimes you see people minimize the mean squared error, and sometimes the root mean squared error. Here we'll do both.
Let's say we strongly suspect that the data is explained by a line with a slope and intercept.
Here is a function that takes the parameters for a line (intercept and slope) as well as a data.frame with x and y coordinates of the data, and returns the MSE (mean squared error):
```{r}
lineMSE <- function(par, data) {
# predicted y coordinates:
# par[1] + (par[2]*data$x)
# error with real y coordinates:
# data$y - ( par[1] + (par[2]*data$x) )
# squared errors:
# ( data$y - ( par[1] + (par[2]*data$x) ) )^2
# mean squared errors:
# mean( ( data$y - ( par[1] + (par[2]*data$x) ) )^2 )
return( mean( ( data$y - ( par[1] + (par[2]*data$x) ) )^2 ) )
}
```
This can easily be extended into a function that returns the RMSE (root mean squared error) for a model that assumes a line:
```{r}
lineRMSE <- function(par, data) {
return( sqrt( mean( ( data$y - ( par[1] + (par[2]*data$x) ) )^2 ) ) )
}
```
Let's test these functions with the parameters that were already optimized by `lm()`. We can use the MSE:
```{r}
lineMSE(par=ab, data=data)
```
Or the RMSE:
```{r}
lineRMSE(par=ab, data=data)
```
Which is indeed the square root of the error returned by `lineMSE()`.
## Loss functions
These two functions specify how "bad" a model is if it has specific values for the parameters (intercept & sope) and is applied to a specific dataset. Each uses a way to take the errors between the data and the prediction to calculate a "loss". The trick is then to find parameter values that minimize the loss. There are other ways to specify loss.
### Mean Absolute Error
We could also use the mean absolute error (MAE). This has advantages if errors are on a ratio scale, for example in business, 100 dollars loss, is really 10 times as bad as 10 dollars loss. It's also less sensitive to ouliers. It doesn't always work though, and has some issues (e.g. when the prediction is perfect or MAE=0, the gradient is undefined and the model can't be evaluated - not sure how often this happens). So unless you have realy good reasons to use MAE, you should clean up your data and pick either MSE or RMSE.
### Huber Loss
One way to overcome the disadvantages of MAE, is to use something like MAE for larger errors (which makes it less sensitive to ouliers), and use MSE for smaller errors (which ensures the function can always be differentiated). The cut-off for what larger or smaller errors are can be set. We'll not cover this here.
### R Squared
This is related to a regular correlation coefficient, or better the residual variance in ANOVAs:
$R^2 = 1 - \frac{MSE(model)}{MSE(mean)}$
That is: how good is the model relative to a baseline model that just predicts the mean of the data? Your model could be worse than the mean, so there's no lower limit to this value. However, usually, we make somewhat smart models, and those are going to be better than the mean. In that case, MSE(model) will be less than MSE(mean), and $R^2$ will be between 0 and 1, and either way better models will have a higher $R^2$. That also means that it shouldn't be minimized, but maximized (multiply by -1 in functions that minimize something).
### Adjusted R Squared
_To be written..._
$R_{adj}^2 = \frac{MSE(mean)-MSE(model)}{MSE(mean)}$ Is this correct...? **CHECK!**
### Other metrics
There are some other, infrequently used metrics, that I will just list here.
- Adjusted R Squared ($R^2$)
- Mean Square Percentage Error (MSPE)
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Logarithmic Error (RMSLE)
# Error minimization with `optim()`
There's a function that comes with R that can be used to minimize errors by searching model parameters: `optim()`.
```{r}
help(optim)
```
As you can see in the help page, we need to provide starting parameters (`par`), and it seems pretty clear that the slope is positive, and likely the intercept, so we'll put them at 1 and 1. We also need to provide the name of a function (`fn`) which is going to be one of the two error functions we just created. In place of the dots, we can provide any further arguments for the function, which in this case will just be the data frame with data, called `data`. For now, we can probably ignore the other arguments to `optim()` and just see what it does.
```{r}
lineFitMSE <- optim( par = c(1,1),
fn = lineMSE,
data = data )
print(lineFitMSE)
```
The `$par` property of the model fit shows us the parameter values where `optim()` thought it could not really improve on the fit anymore. These are fairly close to the best parameter values, but not exactly the same:
```{r}
lineFitMSE$par - ab
```
This means that `optim()` can get us pretty close to the optim-al solution. In some specific situations there might be a better way, and if that is the case, you should use that. But if there is no perfect analytical solution, finding pretty good parameters is a pretty good solution.
The `$value` property is the value the error function returns with these parameter values. The `$count()` property says how often the error function was evaluated (and how often the gradient function was evaluated, but since we didn't specify one, it was never evaluated). The other properties, I'm not sure about.
And we can try this for the RMSE as well:
```{r}
lineFitRMSE <- optim( par = c(1,1),
fn = lineRMSE,
data = data )
print(lineFitRMSE)
```
This gives pretty much the same output, so it seems that it doesn't matter if we use MSE or RMSE. And that makes some sense: the parameter values that minimize MSE will also minimize RMSE and vice versa (the same is true for the $R^2$ metric mentioned above, since it is based on MSE). There might be slight differences, as `optim()` will usually stop trying to improve the parameters if the decrease in error is smaller than some set number. Since RMSE is smaller than MSE, it might give a slightly better fit, but it should never matter much.
In other words: pick whichever makes sense to you. For the next few tutorials, we'll stick to MSE, and the reason is that we can use it to calculate a model evaluation metric (AIC: Akaike's Information Criterion) which will be introduced in the tutorial on model evaluation.
# Take home
What you should take away from this tutorial is that if you can describe your data with a model that doesn't have an analytical solution (or even if it does) you can find the parameters that minimize the error between the model's predictions and the real data. And that in R, we can use `optim()` for that, and this requires specifying an error function using MSE.