Outline:
- Unsupervised learning.
- Without labels.
- Clustering similar things.
- K-means.
What is cluster analysis:
- A cluster is a collection of data objects that are similar to one another and dissimilar to objects in other clusters.
- Cluster analysis is grouping similar data instances according to their features.
Examples:
- Marketing.
- Land use: identifying areas with similar land use.
- Insurance.
- City-planning.
- Earth-quake studies.
- Games.
What is good clustering:
- High intra-class similarity, low inter-class similarity.
- Using a metric: d(i, j).
- A separate "quality" function measures the "goodness" of a cluster.
Requirements:
- Scalability.
- Different types of attributes.
- Dynamic data.
- High dimensionality.
- Interpretability and usability.
- Incorporation of user-specified constraints.
Dissimilarity:
- The dissimilarity matrix stores d(i, j) for every pair of objects; since it is symmetric, only half of it (one triangle) needs to be stored (see the sketch below).
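A minimal sketch of the "half matrix" idea using SciPy's condensed distance form (the toy data is made up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: four instances with two features each (made-up values).
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])

# pdist stores only one triangle of the symmetric dissimilarity
# matrix as a flat "condensed" vector of n*(n-1)/2 distances --
# the "half matrix" mentioned above.
condensed = pdist(X, metric="euclidean")
print(condensed)

# squareform expands it back to the full symmetric matrix if needed.
print(squareform(condensed))
```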
Categories:
- Partitioning. k-means, mixture model, k-medoids.
- Hierarchical. Built bottom-up (agglomerative) or top-down (divisive).
- Density-based. DBSCAN, OPTICS. Finds clusters of arbitrary shape (see the sketch below).
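To illustrate the density-based category, a hedged sketch using scikit-learn's DBSCAN on two interleaved half-moons, a shape that center-based methods cannot separate; the eps and min_samples values are illustrative, not recommendations:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles: clusters of arbitrary shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps (neighborhood radius) and min_samples need tuning per dataset.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points
```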
Partitioning algorithms:
- Building k clusters.
- Finding the globally optimal partition is computationally intractable, so heuristics are used in practice.
- Heuristics: k-means, k-medoids, Gaussian mixture (a k-means sketch follows).
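A minimal NumPy sketch of the k-means heuristic (Lloyd's algorithm); the function name and defaults are my own, and a production implementation would add multiple restarts and better initialization:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: a heuristic that finds a local,
    not the global, optimum."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points;
        # keep the old center if a cluster went empty.
        centers_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(centers_new, centers):
            break  # converged to a local optimum
        centers = centers_new
    return labels, centers
```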
Problems of k-means:
- Relies on distance from the cluster center, which means clusters are implicitly assumed to be spherical.
- Assignments are hard: each item is either in a cluster or out of it, with no soft membership (a soft alternative is sketched after this list).
- Sensitive to outliers.
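Since the slide lists mixture models among the heuristics, a hedged sketch of the soft alternative using scikit-learn's GaussianMixture (the data is made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping blobs (made-up data).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# Unlike k-means, each point receives a membership probability per
# cluster instead of a hard in/out assignment.
print(gmm.predict_proba(X)[:3])
```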
Evaluation of clustering:
- Assess cluster tendency (are there any clusters in the data?)
- Determine the number of clusters in a dataset (how many clusters are there?).
- Measuring clustering quality.
Assess cluster tendency:
- Clustering makes no sense if there is no structure in the data.
The Hopkins statistic:
- Sample n points from D and compute the distance w_i from each to its nearest neighbor in D.
- Generate n points uniformly distributed in the space of D and compute the distance u_i from each to its nearest neighbor in D.
- Compute the Hopkins statistic: H = sum(u_i) / (sum(u_i) + sum(w_i)).
- A measure of whether the data is uniformly distributed: H is around 0.5 for uniform data and approaches 1 for clustered data (under this convention; some texts define the ratio the other way around).
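A sketch of the Hopkins statistic under the convention above; the helper name and the sample size m are my own choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, m=50, seed=0):
    """H = sum(u) / (sum(u) + sum(w)): ~0.5 for uniform data,
    close to 1 for clustered data (under this convention)."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    nn = NearestNeighbors(n_neighbors=2).fit(D)

    # w_i: nearest-neighbor distances of m real points; take the
    # second neighbor because the nearest one is the point itself.
    sample = D[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample)[0][:, 1]

    # u_i: nearest real neighbor of m points drawn uniformly from
    # D's bounding box.
    uniform = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())
```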
Determining the number of clusters in a dataset:
- Elbow method: plot the total within-cluster variance against the number of clusters; it always decreases as the number grows, so pick the "elbow" where the decrease flattens.
- Run the clustering for different parameter values and compare (see the sketch below).
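A minimal elbow-method sketch with scikit-learn; the synthetic blobs and the range of cluster counts are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# inertia_ is the within-cluster sum of squared distances; it always
# decreases with k, so look for the "elbow" where the drop flattens.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```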
Evaluating clustering:
- High intra-class similarity, low inter-class similarity.
- Methods: intrinsic evaluation (no ground-truth data is available; quality is judged from the data alone) and extrinsic evaluation (ground-truth labels are available, and the clustering is compared against them).
The silhouette coefficient:
- For a point i, a(i) is the mean distance to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster.
- s(i) = (b(i) - a(i)) / max(a(i), b(i)); values close to 1 indicate good clustering.
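A hedged usage sketch: scikit-learn's silhouette_score averages s(i) over all points (the synthetic data is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all points; close to 1 = well-separated clusters.
print(silhouette_score(X, labels))
```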
Hierarchical clustering:
- Bottom-up or agglomerative clustering: repeatedly take the two closest clusters and merge them.
- Problem: merges cannot be undone, so clusters that were merged early can never be separated later.
- Problem: the greedy merge order yields just one of many possible hierarchies.
- The resulting picture is called a dendrogram (see the sketch below).
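A minimal agglomerative-clustering sketch with SciPy that also draws the dendrogram; the data and the "ward" linkage choice are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
# Two well-separated blobs (made-up data).
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(5.0, 1.0, size=(20, 2))])

# Bottom-up: repeatedly merge the two closest clusters; "ward" is
# one of several possible linkage criteria.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```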
Hierarchical clustering of words:
- But words do not come with a natural metric (a distance function between words).
- Instead, look at transitions between words: words that are followed by similar words are treated as similar (a sketch follows).
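A hedged sketch of the transition idea: represent each word by the distribution of the words that follow it, then cluster those profiles hierarchically (the corpus, cluster count, and linkage are all made-up choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy corpus (made up); in practice this would be a large text.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Represent each word by counts of the words that follow it, so
# similarity comes from transitions rather than from a metric on
# the words themselves.
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for a, b in zip(corpus, corpus[1:]):
    counts[index[a], index[b]] += 1

# Row-normalize to transition profiles, then cluster bottom-up.
profiles = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
Z = linkage(profiles, method="average")
for word, cluster in zip(vocab, fcluster(Z, t=3, criterion="maxclust")):
    print(cluster, word)
```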