# 4. Clustering and classification
```{r}
date()
```
Let's load the Boston data and explore its dimensions. The dataset contains information collected by the U.S. Census Service about housing in the Boston area.
```{r}
library(MASS)
data("Boston")
str(Boston)
dim(Boston)
```
The dataset has 506 observations and 14 variables. All the variables are numeric, and they describe phenomena such as the per capita crime rate by town (crim), the proportion of non-retail business acres per town (indus), and the average number of rooms per dwelling (rm).
Let's now summarize the variables in the data, and then look at a graphical overview of them.
```{r}
summary(Boston)
```
It seems that the first two variables (crim, the per capita crime rate, and zn, the proportion of residential land zoned for lots over 25,000 sq. ft.) vary over quite a large range but are strongly skewed towards small values. The rest of the variables seem to be more evenly distributed between their minimum and maximum values.
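To illustrate the skew, we can take a quick look at histograms of crim and zn (a small addition, not part of the original summary):
```{r}
# Histograms of crim and zn; both pile up near zero, illustrating the skew
par(mfrow = c(1, 2))
hist(Boston$crim, main = "crim", xlab = "crim")
hist(Boston$zn, main = "zn", xlab = "zn")
```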
Let's now look at a graphical overview of the variables.
```{r}
pairs(Boston)
```
It's hard to get a good overview of so many variables at a glance, so we need to examine the relationships between them in more detail. Let's use a correlation plot to first look at the relationships between all the variables, and then take a closer look at the most interesting ones.
```{r}
library(corrplot)
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper")
```
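For exact values, we can also pull the relevant coefficients out of the correlation matrix (a quick numeric check, not part of the original analysis, to back up the visual reading below):
```{r}
# Correlations of crim and indus with the variables discussed below
round(cor_matrix[c("crim", "indus"), c("rad", "tax", "nox", "dis")], 2)
```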
It seems that crim is positively correlated with rad and tax, whereas the indus variable is positively correlated with nox and tax, but negatively correlated with dis. Let's plot these variables against each other.
We'll begin with crim, rad and tax, which measure the per capita crime rate by town, the index of accessibility to radial highways, and the full-value property tax rate per $10,000, respectively.
```{r}
pairs(~crim+rad+tax,data=Boston)
```
High values of both rad and tax seem to go together with a higher crime rate, but there is a break in the relationship at mid-range values of these variables. Let's next look at the indus variable against nox, tax and dis. These variables measure the proportion of non-retail business acres per town (indus), nitric oxides concentration (nox), the property tax rate (tax), and the weighted distances to five Boston employment centres (dis).
```{r}
pairs(~indus+nox+tax+dis, data=Boston)
```
Interestingly, the proportion of non-retail business acres per town indeed seems to grow with nitric oxides concentration. The positive relationship with property tax is less clear, while the negative relationship with distance to the Boston employment centres seems quite strong.
Let's now standardize the dataset and create a categorical crime rate variable based on quantiles, in order to predict it with linear discriminant analysis (LDA).
We'll begin by scaling and summarizing the data again.
```{r}
boston_scaled <- scale(Boston)
summary(boston_scaled)
```
The variables now have zero means, and their values have been scaled by the original variables' standard deviations. The value distributions now tell us how far each observation lies from the variable's mean in units of standard deviation, which makes the variables comparable with each other. For instance, the maximum of the crim variable is now 9.924110, while the maximum of the tax variable is only 1.7964. The positive values of the tax variable are thus closer to the variable's mean, so it has fewer extreme values than crim. We can also see this from boxplots of these variables (scaling does not change the shape of a distribution, so the original variables are shown).
```{r}
par(mfrow =c(1,2))
boxplot(Boston$crim, xlab="Boston$crim")
boxplot(Boston$tax, xlab="Boston$tax")
```
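As a sanity check on what scale() does, we can verify that it matches the manual formula (x - mean) / sd, here for the crim column (not part of the original analysis):
```{r}
# scale() centers each column by its mean and divides by its standard deviation
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(boston_scaled[, "crim"]), manual_crim)
```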
Let's change the scaled data into a data frame and create a categorical variable of crime rate using the quantiles as break points.
```{r}
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
# We'll label the values of the new variable "low", "med_low", "med_high", and "high"
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
```
Finally, let's replace the old crime variable in the scaled dataset with this new variable and tabulate its values.
```{r}
boston_scaled$crim <- crime
table(boston_scaled$crim)
```
To evaluate our classification, let's divide the data into sets used for training and testing the classifier.
```{r}
library(dplyr)
# Randomly choose 80% of the observations and select these for the training set
n <- nrow(boston_scaled)
ind <- sample(n, size=n * 0.8)
train <- boston_scaled[ind,]
# Use the rest for the test set
test <- boston_scaled[-ind,]
```
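Note that the split is random, so the exact numbers in the confusion matrix further below will vary between runs. For a reproducible split one could fix the random seed before sampling, for example (the seed value here is arbitrary):
```{r, eval=FALSE}
# Fixing the seed would make the 80/20 split, and thus the results, reproducible
set.seed(42)
ind <- sample(n, size = n * 0.8)
```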
Now we're ready to fit the LDA model to the training data and evaluate it using the test data. Let's also draw the biplot of the model.
```{r}
# linear discriminant analysis
lda.fit <- lda(crim ~ ., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
```
From the model results below we can see that the first linear discriminant, which separates the high category from the rest, explains over 94% of the between-group variance (see the proportion of trace at the end of the summary). The second discriminant seems to discriminate among the low, med_low and med_high categories. However, most of the variation in the data seems to lie between the high category and the rest.
```{r}
lda.fit
```
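The proportion of trace figures printed above can also be computed directly from the singular values stored in the fitted object (a quick check; lda() stores them in the svd component):
```{r}
# Proportion of between-group variance captured by each linear discriminant
round(lda.fit$svd^2 / sum(lda.fit$svd^2), 4)
```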
Let's now test the model on unseen data. First we'll save the correct categories of the crime variable and remove them from the test data. Then we'll predict the test data and display a confusion matrix of the results.
```{r}
correct_classes <- test$crim
test <- dplyr::select(test, -crim)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
```
Our classifier is able to correctly predict 22/22 of the high values in the test data, and 20/30 of the med_high values. However, it fares worse on the low end, with 17/24 of the med_low values predicted correctly, and only 12/26 low values predicted correctly. This might be due to most of the variance in the data being between high values and the rest.
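As a single summary figure, we can also compute the overall accuracy on the test set (a small addition; the exact value depends on the random split):
```{r}
# Share of test observations whose crime category was predicted correctly
mean(correct_classes == lda.pred$class)
```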
To end this exercise we'll reload the Boston data to do k-means clustering. We'll begin by scaling the data again. Let's also look at a summary of the data to see that everything is ok.
```{r}
data("Boston")
boston_scaled <- scale(Boston)
summary(boston_scaled)
```
Everything seems to be as before, so we can go on. Next we'll calculate the Euclidean distances between the observations and then run k-means (which uses Euclidean distances by default and computes them internally, so the distance matrix is shown here only for illustration).
Let's first try with 3 clusters.
```{r}
# Euclidean distance matrix (not passed to kmeans, which computes distances itself)
dist_eu <- dist(boston_scaled)
# k-means with 3 clusters
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
```
Again it's difficult to get a good overview of so many variables. Let's look in more detail at the last four.
```{r}
pairs(boston_scaled[,10:14], col = km$cluster)
```
The model seems to capture some clusters of values in the data well. For instance, it places the high values of the tax variable into a single cluster. However, it seems to split the low values across two different clusters, even though they lie quite close together, as can be seen from the plots.
Let's now investigate what the optimal number of clusters is by checking the elbow plot, using the total within-cluster sum of squares as the error measure.
```{r}
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
plot(x = 1:k_max, y = twcss)
```
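Connecting the points can make the bend easier to spot (an optional tweak to the plot above):
```{r}
# Same elbow plot with the points connected by lines
plot(x = 1:k_max, y = twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```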
From the plot we can see that the "elbow" is at two clusters, so we'll re-run k-means with two clusters. Let's again look at the last four variables.
```{r}
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[,10:14], col = km$cluster)
```
This model seems to do somewhat better at capturing different clusters of values in the data. For instance, observations with low values of the tax variable are now much more clearly allocated to the same cluster. However, the model still cannot tell apart, for instance, observations with low values of black from those with high values of black. This might be due to those observations being more clearly separated from each other along some other variable.