Class KMeans

  • All Implemented Interfaces:
    java.io.Serializable, Clustering<double[]>

    public class KMeans
    extends PartitionClustering<double[]>
    K-Means clustering. The algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.

    However, the k-means algorithm has at least two major theoretic shortcomings:

    • First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.
    • Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal learn.
    In this implementation, we use k-means++ which addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.

    We also use k-d trees to speed up each k-means step as described in the filter algorithm by Kanungo, et al.

    K-means is a hard clustering method, i.e. each sample is assigned to a specific cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assign samples to different clusters with different probabilities.

    References

    1. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE TRANS. PAMI, 2002.
    2. D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
    3. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
    Author:
    Haifeng Li
    See Also:
    CLARANS, BBDTree, Serialized Form
    • Constructor Summary

      Constructors 
      Constructor Description
      KMeans​(double[][] data, int k)
      Constructor.
      KMeans​(double[][] data, int k, int maxIter)
      Constructor.
      KMeans​(double[][] data, int k, int maxIter, int runs)
      Clustering data into k clusters.
    • Constructor Detail

      • KMeans

        public KMeans​(double[][] data,
                      int k)
        Constructor. Clustering data into k clusters up to 100 iterations.
        Parameters:
        data - the input data of which each row is a sample.
        k - the number of clusters.
      • KMeans

        public KMeans​(double[][] data,
                      int k,
                      int maxIter)
        Constructor. Clustering data into k clusters.
        Parameters:
        data - the input data of which each row is a sample.
        k - the number of clusters.
        maxIter - the maximum number of iterations for each running.
      • KMeans

        public KMeans​(double[][] data,
                      int k,
                      int maxIter,
                      int runs)
        Clustering data into k clusters. Run the algorithm for given times and return the best one with smallest distortion.
        Parameters:
        data - the input data of which each row is a sample.
        k - the number of clusters.
        maxIter - the maximum number of iterations for each running.
        runs - the number of runs of K-Means algorithm.
    • Method Detail

      • distortion

        public double distortion()
        Returns the distortion.
      • centroids

        public double[][] centroids()
        Returns the centroids.
      • predict

        public int predict​(double[] x)
        Cluster a new instance.
        Parameters:
        x - a new instance.
        Returns:
        the cluster label, which is the index of nearest centroid.
      • lloyd

        public static KMeans lloyd​(double[][] data,
                                   int k)
        The implementation of Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN). The algorithm runs up to 100 iterations.
        Parameters:
        data - the input data of which each row is a sample.
        k - the number of clusters.
      • lloyd

        public static KMeans lloyd​(double[][] data,
                                   int k,
                                   int maxIter)
        The implementation of Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN).
        Parameters:
        data - the input data of which each row is a sample.
        k - the number of clusters.
        maxIter - the maximum number of iterations for each running.
      • lloyd

        public static KMeans lloyd​(double[][] data,
                                   int k,
                                   int maxIter,
                                   int runs)
        The implementation of Lloyd algorithm as a benchmark. Run the algorithm multiple times and return the best one in terms of smallest distortion. The data may contain missing values (i.e. Double.NaN).
        Parameters:
        data - the input data of which each row is a sample.
        k - the number of clusters.
        maxIter - the maximum number of iterations for each running.
        runs - the number of runs of K-Means algorithm.
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object