K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
- Training examples
${x^{(1)}, x^{(2)}, \cdots, x^{(m)}} \in \mathbb{R^n}$ , with no bias terms -
$\mathrm{K}$ clusters
- Randomly Initalise
$\mathrm{K}$ cluster centroids$\mu_1,\mu_2, \cdots,\mu_\mathrm{K} \in \mathbb{R^n}$ - Then,
$(2) \text{ and } (3)$ belong to the cluster assignment step
$(4) \text{ and } (5)$ belong to the move centroid step
Here, k-means can also be applied to non-separated clusters as well.
-
$c^{(i)}$ - index of cluster${1,2,\cdots,\mathrm{K}}$ to which example$x^{(i)}$ is currently assigned -
$\mu_k$ - cluster centroid$k \space (\mu_k \in \mathbb{R^n})$ -
$\mu_{c^{(i)}}$ - cluster centroid of cluster o which$x^{(i)}$ has been assigned
Based on the above notation, the cost function is given as,
$$
J(c^{(1)}, \cdots, c^{(m)}, \mu_1,\cdots, \mu_\mathrm{K}) = \frac{1}{m}\sum_{i=1}^m \lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2
$$
There the objective is to minimise the cost based on
The cost function for k-means is also called the distortion function
According to the algorithm, we do the following, $$ \begin{align} &\text{Cluster Assignment Step - }\ & \quad\text{minimize J wrt } {c^{(1)}, \cdots, c^{(m)}} \text{ while holding } {\mu_1, \cdots, \mu_k} \text{constant}\ &\text{Move Centroid Step - }\ & \quad\text{minimize J wrt } {\mu_1, \cdots, \mu_k}\ \end{align} $$
The cost function vs. no of iterations for k-means is ==always== monotonically decreasing
There are many ways to randomly initialise
-
Manually visualize the data.
-
Elbow Method
- On plotting the distortion as a function of number of clusters,
$\mathbb{K}$ , this methods says that the optimal number of clusters at the point the elbow occurs as can be seen for line B in the plot below. It is a reasonable way of choosing the number of clusters. But this method does not always work because the sometimes the plot would look like line A which does not have clear elbow to choose.
- On plotting the distortion as a function of number of clusters,
-
Sometimes it is ambiguous as to how many clusters exist in the dataset and in such cases it’s more useful to choose the number of clusters on the basis of the end goal.
- Link to Voronoi Cells
- Link to example