KMeans Applet

A simple PHP/Javascript applet for demonstrating the mathematics behind the K-Means Clustering algorithm.

Algorithm Notes

This applet demonstrates a basic implementation of K-means clustering based on Lloyd's algorithm [1]. The algorithm finds a set of means or centroids m_k, k = 1,2,...,K such that each data point x_i, i = 1,2,...,N can be partitioned based on their assignment to one of these K centroids. The centroids are continuously updated such that the distances between each point and their assigned centroids are minimized.

We begin by randomly assigning K points from the data to serve as our initial centroids. More efficient initializations leading to faster convergences can be found [2,3], but for a simple two-dimensional case we can make do with this basic procedure. Once the K centroids have been initialized, we assign each of the x_i to a particular centroid, this assignment indicated by C_i, based on which centroid is closest to x_i by the Euclidean distance.

After assignment, we obtain an update of the centroid by averaging across all currently assigned data points:

Then all points are again reassigned to their new clusters based on the updated positions of the centroids. At each step we measure the total (Euclidean) distance between each point and their assigned centroid. To control the range of distances, we obtain their logarithms.

Finally, convergence is achieved once the minimum distance per cluster:

has changed by no more than a set tolerance level. This applet uses a very small tolerance of 1e-100, but we also set a maximum iteration count of 20 in case the algorithm struggles to find a local minimum (which Lloyd's Algorithm has been known to experience in many cases). On the US Arrests data set (see citations), we reach convergence within around 2 to 5 iterations only.

Features

This applet was written in Javascript by Dominic Dayta. Plots were produced using the Plotly Javascript library.

The user may interact with the applet by clicking on the Iterate button, which performs the updating steps of Lloyd's algorithm. Three plots are provided to track the progress of the algorithm towards convergence.

The first plot visualizes the data points across its dimensions (only two are allowed for this applet) and colored based on their cluster assignments. Each cluster's nearby neighborhood is displayed as a circular backdrop whose radius is proportional to its number of members. As the user moves through iterations, these neighborhoods shift in location across the space and so do the data points change cluster assignments. Because of the way we initialize the clusters, it is possible that in the beginning, clusters may overlap or even be completely inscribed within another cluster. This is immediately corrected as soon the algorithm begins updating.

The second plot displays only the centroids as they are updated across iterations. The data points are provided in the backdrop to provide an idea of how the centroids are moving, and the members that are being included and excluded at each iteration.

The third and last plot tracks the improvements in within-group distances as the algorithm updates. Because the objective of the algorithm is to minimize this value to convergence, the algorithm stops and declares convergence when this line flattens out.

A table is also provided listing at each iteration the clusters and their centroid coordinates, along with the number of members, and the average distance of each member to the centroid.

Citations

Lloyd, Stuart P. (1982). "Least squares quantization in PCM." IEEE Transactions on Information Theory. 28 (2): 129–137.
Hamerly, Greg; Elkan, Charles (2002). "Alternatives to the k-means algorithm that find better clusterings." Proceedings of the eleventh international conference on Information and knowledge management (CIKM).
Celebi, M. E.; Kingravi, H. A.; Vela, P. A. (2013). "A comparative study of efficient initialization methods for the k-means clustering algorithm". Expert Systems with Applications. 40 (1): 200–210.
McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley. [Contains US Arrests data adopted from the R datasets.]

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
css		css
data		data
img		img
js		js
.gitignore		.gitignore
index.php		index.php
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KMeans Applet

Algorithm Notes

Features

Citations

About

Releases

Packages

Languages

dominicdayta/kmeans

Folders and files

Latest commit

History

Repository files navigation

KMeans Applet

Algorithm Notes

Features

Citations

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages