-
Notifications
You must be signed in to change notification settings - Fork 53
/
2024_SISBID_Clustering_Lab.Rmd
255 lines (213 loc) · 8.02 KB
/
2024_SISBID_Clustering_Lab.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
---
title: '2024 SISBID Clustering Lab'
author: "Genevera I. Allen & Yufeng Liu"
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Data set - Author Data.
This data set consists of word counts from chapters written by four authors.
This lab will put together concepts from both dimension reduction and clustering.
There are ultimately 3 goals to this lab:
* Correctly cluster author texts in an unsupervised manner.
* Determine which words are responsible for correctly separating the author texts.
* Visualize the author texts, words and the results of your analysis.
#### Problem 1 - Visualization
* Problem 1a - We wish to plot the author texts as well as the words via a 2D scatterplot. Which method would be best to use? Why?
* Problem 1b - Apply PCA to visualize the author texts. Explain the results.
* Problem 1c - Apply MDS to visualize the author texts. Interpret the results.
* Problem 1d - Can you use MDS to help determine which distance is appropriate for this data? Which one is best and why?
* Problem 1e - Apply MDS with your chosen distance to visualize the words. Interpret the results.
#### Problem 2 - K-means
* Problem 2a - Apply K-means with K=4 to this data.
* Problem 2b - How well does K-mean do at separating the authors?
* Problem 2c - Is K-means an appropriate clustering algorithm for this data? Why or Why not?
#### Problem 3 - Hierarchical Clustering
* Problem 3a - Apply hierarchical clustering to this data set.
* Problem 3b - Which distance is best to use? Why?
* Problem 3c - Which linkage is best to use? Why?
* Problem 3d - Do any linkages perform particularly poorly? Explain this result.
* Problem 3e - Visualize your hierarchical clustering results.
#### Problem 4 - Biclustering
* Problem 4a - Apply the cluster heatmap method to visualize this data. Which distance and linkage functions did you use?
* Problem 4b - Interpret the cluster heatmap. Which words are important for distinguishing author texts?
#### Problem 5 - NMF
* Problem 5a - Apply NMF with K = 4 and use W to assign cluster labels to each observation.
* Problem 5b - How well does NMF perform? Interpret and explain this result.
* Problem 5c - Can you use the NMF to determine which words are important for distinguishing author texts? How? What did you find?
#### Problem 6 - Wrap-up
* Problem 6a - Overall, which method is the best at clustering the author texts? Why is this the case?
* Problem 6b - Which words are key for distinguishing the author texts? How did you determine these?
* Problem 6c - Overall, which is the best method for providing a visual summary of the data?
# R scripts to help out with the Clustering Lab
## Don't peek at this if you want to practice coding on your own!!
Load packages
```{r,message= FALSE}
library(NMF)
library(ggplot2)
library(umap)
```
Load dataset: Author data
```{r}
load("UnsupL_SISBID_2024.Rdata")
# understand the data a bit
dim(author)
colnames(author)
unique(rownames(author))
TrueAuth = as.factor(rownames(author))
```
```{r}
hist(author[,colnames(author)=="the"],breaks=25,main="Frequency of word \"the\"",xlab = "Word Counts")
hist(author[,colnames(author)=="a"],breaks=25,main="Frequency of word \"a\"",xlab = "Word Counts")
hist(author[,colnames(author)=="and"],breaks=25,main="Frequency of word \"and\"",xlab = "Word Counts")
hist(author[,colnames(author)=="things"],breaks=25,main="Frequency of word \"things\"",xlab = "Word Counts")
```
Take out bookID
```{r}
AuthorData = author[,1:69]
```
### Problem 1 - Visualization
- how to visulaize texts? words? in 2-dimensions
Trying PCA
```{r}
sv = svd(AuthorData)
V = sv$v
Z = AuthorData%*%V
# projected matrix
PCData = data.frame(cbind(Z[,1],Z[,2],rownames(AuthorData)),stringsAsFactors = FALSE)
colnames(PCData) = c("PC1","PC2","Author")
PCData$PC1 = as.numeric(PCData$PC1)
PCData$PC2 = as.numeric(PCData$PC2)
# plot
ggplot(PCData) +
geom_point(mapping=aes(x = PC1,y= PC2,color = Author,shape= Author))
```
Why doesn't this work well?
Trying MDS (classical)
Can you use MDS to decide which distance is best to understand this data?
Visualizing author texts
```{r}
Dmat = dist(log(AuthorData+1),method="canberra")
mdsres = cmdscale(Dmat,k=2)
# MDS matrix
MDSData = data.frame(cbind(mdsres[,1],mdsres[,2],rownames(AuthorData)),
stringsAsFactors = FALSE)
colnames(MDSData) = c("MDS1","MDS2","Author")
MDSData$MDS1 = as.numeric(MDSData$MDS1)
MDSData$MDS2 = as.numeric(MDSData$MDS2)
# plot
ggplot(MDSData) +
geom_point(mapping=aes(x = MDS1,y= MDS2,color = Author,shape= Author))
```
Trying UMAP
```{r}
AuthorUMAP = umap(AuthorData)
# UMAP matrix
UMAPData = data.frame(cbind(AuthorUMAP$layout[,1],AuthorUMAP$layout[,2],
rownames(AuthorData)),stringsAsFactors = FALSE)
colnames(UMAPData) = c("UMAP1","UMAP2","Author")
UMAPData$UMAP1 = as.numeric(UMAPData$UMAP1)
UMAPData$UMAP2 = as.numeric(UMAPData$UMAP2)
# plot
ggplot(UMAPData) +
geom_point(mapping=aes(x = UMAP1,y= UMAP2,color = Author,shape= Author))
```
Visualizing words
```{r}
Dmat = dist(t(AuthorData),method="canberra")
mdsresW = cmdscale(Dmat,k=2)
# MDS matrix for words
MDSDataW = data.frame(cbind(mdsresW[,1],mdsresW[,2],colnames(AuthorData)),
stringsAsFactors = FALSE)
colnames(MDSDataW) = c("MDS1","MDS2","Word")
MDSDataW$MDS1 = as.numeric(MDSDataW$MDS1)
MDSDataW$MDS2 = as.numeric(MDSDataW$MDS2)
ggplot(MDSDataW) +
geom_text(mapping=aes(x = MDS1,y= MDS2,label = Word))
```
### Problem 2 - K-means
```{r}
K = 4
km = kmeans(AuthorData,centers=K)
table(km$cluster,TrueAuth)
```
Visualization of K-means clustering results via MDS matrix
```{r}
PredData = data.frame(cbind(MDSData[,1:2],km$cluster,rownames(AuthorData)))
colnames(PredData) = c("MDS1","MDS2","PredictedLabel","TrueAuthor")
PredData$PredictedLabel = factor(PredData$PredictedLabel)
ggplot(PredData) +
geom_point(mapping=aes(x = MDS1,y= MDS2,color = PredictedLabel,shape= TrueAuthor)) +
ggtitle("Clustering results by K-means with K = 4") +
theme(plot.title = element_text(hjust = 0.5))
```
### Problem 3 - Hierarchical Clustering
Which distance is appropraite? Why?
canberra distance & complete linkage
```{r}
Dmat = dist(AuthorData,method="canberra")
com.hc = hclust(Dmat,method="complete")
res.com = cutree(com.hc,4)
table(res.com,TrueAuth)
```
```{r}
plot(com.hc,cex=.5)
```
Which linkage is best? Why?
canberra distance & ward.D linkage
```{r}
Dmat = dist(AuthorData,method="canberra")
ward.hc = hclust(Dmat,method="ward.D")
res.ward = cutree(ward.hc,4)
table(res.ward,TrueAuth)
```
```{r}
plot(ward.hc,cex=.5)
```
We can see that canberra distance and ward.D linkage give excellent clustering results.
Do any preform terribly? Why?
Visualizing hierarchical clustering results using MDS.
```{r}
PredData = data.frame(cbind(MDSData[,1:2],res.ward,rownames(AuthorData)))
colnames(PredData) = c("MDS1","MDS2","PredictedLabel","TrueAuthor")
PredData$PredictedLabel = factor(PredData$PredictedLabel)
ggplot(PredData) +
geom_point(mapping=aes(x = MDS1,y= MDS2,color = PredictedLabel,shape= TrueAuthor)) +
ggtitle("Clustering results by hierarchical clustering with K = 4") +
theme(plot.title = element_text(hjust = 0.5))
```
### Problem 4 - Biclustering
Cluster heatmap
```{r}
heatmap(AuthorData,
distfun=function(x)dist(x,method="canberra"),
hclustfun=function(x)hclust(x,method="ward.D"))
```
```{r}
heatmap(log(AuthorData+1),
distfun=function(x)dist(x,method="canberra"),
hclustfun=function(x)hclust(x,method="ward.D"))
```
### Problem 5 - NMF
```{r}
K = 4
nmffit = nmf(AuthorData,rank=K)
W = basis(nmffit)
H = coef(nmffit)
```
```{r}
cmap = apply(W,1,which.max)
table(cmap,TrueAuth)
```
```{r}
basismap(nmffit,annRow=rownames(AuthorData),scale="col",legend=FALSE)
coefmap(nmffit,annCol=colnames(AuthorData),scale="col",legend=FALSE)
```
```{r}
basismap(nmffit,annRow=rownames(AuthorData),scale="col",legend=T)
coefmap(nmffit,annCol=colnames(AuthorData),scale="col",legend=T)
```
Which words are most important for distinguishing authors?