Tuesday, October 19, 2010


KMeans clustering algorithm is one of the the most popular technique in the field of pattern recognition, data mining and unsupervised learning. All though, it gives no guarantee (theoretically) about accuracy, its speed and simplicity are very appealing for practical applications. KMeans algorithm is significantly sensitive to initial selection of cluster centers. Usually algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. However, to reduce sensitivity of KMeans towards initialization of cluster center, KMeans is run multiple times and only one that minimize the sum of squared distances is selected. This method performs better but still does not guarantee accuracy. There are many datasets for which KMeans generates arbitrary bad clustering [1]. I assume, like me most of you also would have struggled with initialization of KMeans.

KMeans++ algorithm overcomes this weakness in KMeans and makes it even more effective. KMeans++ algorithm proposes a simple probabilistic means of initialization for KMeans clustering that not only has the best known theoretical guarantees on expected outcome quality, it reportedly works very well in practice.

Algorithm :

KMeans++ algorithm uses simple probabilistic method for generating initial centers for K-means from set of points X. At any given time, let D(x) denote the shortest distance from a data point x to the closest center we have already chosen. Then algorithm defined as follows :

  1. Sample the first center c[1] from a uniform distribution over X.
  2. For k = 2 to K
      Sample the k-th center c[k] from a multinomial over X where a point x has has probability θx, defined by:

      \theta_x = \<span class=

Note : Probability of selecting x as next cluster is proportional to D(x) ^ 2.

Code :
  1. C implementation of KMeans++ is provided by authors : Code1 Code2
  2. Java Implementation
  3. Matlab Code
KMeans++ in OpenCV :

To use KMeans++ in OpenCV follow this link.

References :

[1] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding.

Links :
  1. Paper and Code
  2. Wikipedia

1 comment:

  1. Hi thanks for it.
    I am just confused how we will choose the next centroid.
    Let's suppose we have chosen c1 (first centroid) uniformly at random. I calculate prob of each point with c1 and it is {p1,p2,....pn}.

    What is the criteria to choose the next point? e.g. shld I choose the the one with max prob? min prob? or just with random prob (see next)

    for all point i \in n
    if(rand (0.0,1.0) < p(i))
    c2= p(i)