The k-means clustering algorithm is simple: given a set of data points, it finds a number ("k") of centroids positioned so that every data point lies close to one of them. Each centroid represents one cluster, and each data point is labelled as belonging to the cluster whose centroid is nearest.
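The labelling rule described above can be sketched in a few lines of Python. This is only an illustration, assuming Euclidean distance; the function names `nearest_centroid` and `label_points` are made up for this example and are not part of the class documented here.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def label_points(points, centroids):
    """Label every point with the index of its nearest centroid."""
    return [nearest_centroid(p, centroids) for p in points]

# Two centroids and three 2D points (illustrative values only).
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, -1.0)]
print(label_points(points, centroids))  # [0, 1, 0]
```

Nothing in the sketch depends on the points being 2D; `math.dist` accepts coordinates of any length as long as both arguments have the same dimension.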
This can be used for unsupervised classification of data. It can also be used "on-line": results are available after only a few data points have been added, and as you add more data points the algorithm updates the cluster positions and labels accordingly.
Data of any dimensionality can be clustered (the examples below use 2D data).
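One common way to get on-line behaviour is the sequential variant of k-means, in which each new point nudges its nearest centroid towards it by a running-mean step. The sketch below shows that variant only as an illustration; it is an assumption, not necessarily how the class documented here updates its clusters, and the name `OnlineKMeans` and its members are hypothetical.

```python
import math

class OnlineKMeans:
    """Hypothetical sequential k-means; not the API documented on this page."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # one coordinate list per cluster
        self.counts = []     # number of points absorbed by each cluster

    def add(self, datum):
        """Absorb one point and return the index of the cluster it joined."""
        datum = [float(x) for x in datum]
        if len(self.centroids) < self.k:
            # Assumption: seed the centroids with the first k points.
            self.centroids.append(datum)
            self.counts.append(1)
            return len(self.centroids) - 1
        i = min(range(self.k), key=lambda j: math.dist(datum, self.centroids[j]))
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        # Running-mean update: move the centroid a fraction of the way to the point.
        self.centroids[i] = [c + rate * (x - c) for c, x in zip(self.centroids[i], datum)]
        return i

model = OnlineKMeans(k=2)
for p in [(0, 0), (10, 10), (2, 0), (10, 12)]:
    model.add(p)
print(model.centroids)  # [[1.0, 0.0], [10.0, 11.0]]
```

Because each centroid is a running mean, this variant never needs to store the data points themselves. The trade-off is that labels already given to earlier points are not revisited.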
Create new instance
k | Define number of clusters
Add data points
datum |
Run the learning step
centroid positions
data stored internally
assignments
datum |
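The operations listed above (create an instance with k, add data points, run the learning step, then read back the centroid positions, the stored data and the assignments) might be mirrored in Python roughly as follows. Every name here (`KMeansSketch`, `add`, `update`, `classify`, `max_iter`) is hypothetical and chosen for illustration, as is the choice to seed the centroids from the first k points. The sketch uses the standard batch procedure of alternating labelling and centroid averaging until the labels stop changing.

```python
import math

class KMeansSketch:
    """Hypothetical stand-in for the interface outlined above."""

    def __init__(self, k):
        self.k = k             # number of clusters
        self.data = []         # data stored internally
        self.centroids = []    # centroid positions
        self.assignments = []  # cluster index for each stored point

    def add(self, datum):
        """Store one data point; call update() to re-run the learning step."""
        self.data.append(tuple(float(x) for x in datum))

    def classify(self, datum):
        """Return the index of the centroid nearest to `datum`."""
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(datum, self.centroids[i]))

    def update(self, max_iter=100):
        """Alternate labelling and averaging until the labels are stable."""
        if len(self.data) < self.k:
            return  # not enough points to place k centroids yet
        if not self.centroids:
            # Assumption: seed with the first k points.
            self.centroids = list(self.data[:self.k])
        for _ in range(max_iter):
            new_assignments = [self.classify(p) for p in self.data]
            if new_assignments == self.assignments:
                break
            self.assignments = new_assignments
            for i in range(self.k):
                members = [p for p, a in zip(self.data, self.assignments) if a == i]
                if members:  # leave an empty cluster's centroid where it is
                    self.centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))

km = KMeansSketch(2)
for p in [(0, 0), (0, 2), (10, 10), (10, 12)]:
    km.add(p)
km.update()
print(km.centroids)    # [(0.0, 1.0), (10.0, 11.0)]
print(km.assignments)  # [0, 0, 1, 1]
```

Keeping the data points allows every label to be revised whenever the learning step runs, at the cost of memory that grows with the number of points added.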