forked from ScPoEcon/ScPoEconometrics
-
Notifications
You must be signed in to change notification settings - Fork 0
/
04-linear-reg.Rmd
88 lines (63 loc) · 2.73 KB
/
04-linear-reg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
# Linear Regression {#linreg}
Notes:
stop at R-squared
* different data: missing variable
* non-linear realtionship
## Data on Cars
We will look at the built-in `cars` dataset. Let's get a view of this by just typing `View(cars)` in Rstudio. You can see something like this:
```{r,echo=FALSE}
head(cars)
```
We have a `data.frame` with two columns: `speed` and `dist`. Type `help(cars)` to find out more about the dataset. There you could read that
>The data give the speed of cars (mph) and the distances taken to stop (ft).
It's good practice to know the extent of a dataset. You could just type
```{r}
dim(cars)
```
to find out that we have 50 rows and 2 columns. A central question that we want to ask now is the following:
### How are `speed` and `dist` related?
The simplest way to start plot the data. Remembering that we view each row of a data.frame as an observation, we could just label one axis of a graph `speed`, and the other one `dist`, and go through our table above row by row. We just have to read off the x/y coordinates and mark them in the graph. In `R`:
```{r}
plot(dist ~ speed, data = cars,
xlab = "Speed (in Miles Per Hour)",
ylab = "Stopping Distance (in Feet)",
main = "Stopping Distance vs Speed",
pch = 20,
cex = 2,
col = "red")
```
Here, each dot represents one observation. In this case, one particular measurement `speed` and `dist` for a car. Now, again:
```{block2, type='INFO'}
How are `speed` and `dist` related? How could one best *summarize* this relationship?
```
### non-linear data
```{r}
with(mtcars,plot(hp,mpg))
```
1. scatter plot
1. label observations
1. how do the data come to us? spreadsheet
1. approx link x and y by a line
1. OLS gives the best line for this
1. $y_i = a+b x_i$. find a,b s.t. dist is minimal
1. write out sum of least-squares and call it MSE: u_1 + u_2 + ... / N
1. plot fitted values - see imperfect approximation
1. R-squared: goodness of fit / measure of goodness
1. 1 - sum of squared errors / SST
1. how much of total variance is explained by the model?
1. regression on mean
1. How come there are residuals?
1. measurement error?
1. there is more to this than just x
1. misspecification
1. There is statistical uncertainty about those estimates
1. plot a second data set with a less clear interpretation
1. do you *really* think there is a linear relationship?
1. SE tells us whethe rwe really think this is a positive slope
1. poor R2 and large standard error
1. How **confident** are you about this relationship? Is there enought data?
1. SE is ameasure of precision depending on N
## Try to find the Slope!
```{r}
knitr::include_url("https://gallery.shinyapps.io/simple_regression/")
```