Class PartitionClustering<T>

  • Type Parameters:
    T - the type of input object.
    All Implemented Interfaces:
    java.io.Serializable, Clustering<T>
    Direct Known Subclasses:
    CLARANS, KMeans

    public abstract class PartitionClustering<T>
    extends java.lang.Object
    implements Clustering<T>
    Abstract class of partition clustering. Partition methods break the observations into distinct non-overlapping groups.
    Author:
    Haifeng Li
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected int k
      The number of clusters.
      protected int[] size
      The number of samples in each cluster.
      protected int[] y
      The cluster labels of data.
    • Method Summary

      Modifier and Type Method Description
      int[] getClusterLabel()
      Returns the cluster labels of data.
      int[] getClusterSize()
      Returns the size of clusters.
      int getNumClusters()
      Returns the number of clusters.
      static int[] seed(double[][] data, int k, ClusteringDistance distance)
      Initialize cluster membership of input objects with the K-Means++ algorithm.
      static <T> double seed(Distance<T> distance, T[] data, T[] medoids, int[] y, double[] d)
      Initialize cluster membership of input objects with the K-Means++ algorithm.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • k

        protected int k
        The number of clusters.
      • y

        protected int[] y
        The cluster labels of data.
      • size

        protected int[] size
        The number of samples in each cluster.
    • Constructor Detail

      • PartitionClustering

        public PartitionClustering()
    • Method Detail

      • getNumClusters

        public int getNumClusters()
        Returns the number of clusters.
      • getClusterLabel

        public int[] getClusterLabel()
        Returns the cluster labels of data.
      • getClusterSize

        public int[] getClusterSize()
        Returns the size of clusters.
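
        The three getters above simply expose the protected fields listed in the Field Summary. A minimal self-contained sketch of that pattern (the class name and constructor below are illustrative, not part of the Smile API):

```java
// Minimal sketch of how a subclass exposes the protected fields
// through the getters above. PartitionSketch is an illustrative
// name, not a Smile class.
public class PartitionSketch {
    protected int k;       // the number of clusters
    protected int[] y;     // the cluster label of each sample
    protected int[] size;  // the number of samples in each cluster

    public PartitionSketch(int k, int[] y) {
        this.k = k;
        this.y = y;
        this.size = new int[k];
        for (int label : y) {
            size[label]++;  // tally cluster sizes from the labels
        }
    }

    public int getNumClusters() { return k; }
    public int[] getClusterLabel() { return y; }
    public int[] getClusterSize() { return size; }

    public static void main(String[] args) {
        PartitionSketch p = new PartitionSketch(3, new int[]{0, 1, 1, 2, 1});
        System.out.println(java.util.Arrays.toString(p.getClusterSize()));
        // prints [1, 3, 1]
    }
}
```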
      • seed

        public static int[] seed​(double[][] data,
                                 int k,
                                 ClusteringDistance distance)
        Initialize cluster membership of input objects with the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed.

        K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered; each subsequent center is then chosen from the remaining data points with probability proportional to its squared distance to the closest center already chosen.

        The exact algorithm is as follows:

        1. Choose one center uniformly at random from among the data points.
        2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
        3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D²(x).
        4. Repeat Steps 2 and 3 until k centers have been chosen.
        5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
        This seeding method yields considerable improvement in the final error of k-means. Although the initial selection takes extra time, the k-means iterations converge much faster after this seeding, so the algorithm actually lowers the total computation time as well.
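
        The five steps above can be sketched as a self-contained Java method. This is an illustrative implementation of steps 1-4 using squared Euclidean distance, not the Smile library's own seed() code:

```java
import java.util.Arrays;
import java.util.Random;

// Self-contained sketch of K-Means++ seeding (steps 1-4 above) for
// double[][] data. Illustrative only; the Smile library's seed()
// additionally supports other distance measures.
public class KMeansPlusPlus {
    public static int[] chooseCenters(double[][] data, int k, long randomSeed) {
        Random rng = new Random(randomSeed);
        int n = data.length;
        int[] centers = new int[k];
        // d[i] holds D²(x_i): squared distance to the nearest chosen center.
        double[] d = new double[n];
        Arrays.fill(d, Double.POSITIVE_INFINITY);

        centers[0] = rng.nextInt(n);         // step 1: first center uniformly at random
        for (int j = 1; j < k; j++) {
            // step 2: update D²(x) against the most recently chosen center
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                double dist = squaredDistance(data[i], data[centers[j - 1]]);
                if (dist < d[i]) d[i] = dist;
                sum += d[i];
            }
            // step 3: sample the next center with probability proportional to D²(x)
            double cutoff = rng.nextDouble() * sum;
            double running = 0.0;
            int next = n - 1;
            for (int i = 0; i < n; i++) {
                running += d[i];
                if (running >= cutoff) { next = i; break; }
            }
            centers[j] = next;               // step 4: repeat until k centers chosen
        }
        return centers;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            s += diff * diff;
        }
        return s;
    }
}
```

        With two well-separated pairs of points, the sampling weight concentrates on the far cluster, so the second center almost always lands there.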

        References

        1. D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
        2. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
        Parameters:
        data - the data objects to be clustered.
        k - the number of clusters.
        distance - the distance measure for computing cluster membership.
        Returns:
        the cluster labels.
      • seed

        public static <T> double seed​(Distance<T> distance,
                                      T[] data,
                                      T[] medoids,
                                      int[] y,
                                      double[] d)
        Initialize cluster membership of input objects with the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed.

        K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered; each subsequent center is then chosen from the remaining data points with probability proportional to its squared distance to the closest center already chosen.

        The exact algorithm is as follows:

        1. Choose one center uniformly at random from among the data points.
        2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
        3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D²(x).
        4. Repeat Steps 2 and 3 until k centers have been chosen.
        5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
        This seeding method yields considerable improvement in the final error of k-means. Although the initial selection takes extra time, the k-means iterations converge much faster after this seeding, so the algorithm actually lowers the total computation time as well.

        References

        1. D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
        2. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
        Type Parameters:
        T - the type of input object.
        Parameters:
        distance - the distance measure for computing cluster membership.
        data - the data objects, an array of size n.
        medoids - an array of size k to store the cluster medoids on output.
        y - an array of size n to store the cluster labels on output.
        d - an array of size n to store the distance of each sample to its nearest medoid.
        Returns:
        the initial cluster distortion.
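
        The medoids, y, and d parameters are output arrays that this overload fills in, with the return value being the resulting distortion. Below is a self-contained sketch of that calling contract, using a hypothetical DistanceFn stand-in for Smile's Distance<T> interface, and simplified to weight candidates by raw rather than squared distance:

```java
// Sketch of the output-array contract of the generic seed() overload.
// DistanceFn is a hypothetical stand-in for Smile's Distance<T>.
interface DistanceFn<T> {
    double d(T a, T b);
}

public class SeedContract {
    // Mirrors the documented signature: fills medoids, y, and d on output
    // and returns the initial distortion (sum of distances to the nearest medoid).
    static <T> double seed(DistanceFn<T> distance, T[] data, T[] medoids,
                           int[] y, double[] d, java.util.Random rng) {
        int n = data.length, k = medoids.length;
        java.util.Arrays.fill(d, Double.POSITIVE_INFINITY);
        medoids[0] = data[rng.nextInt(n)];
        for (int j = 0; j < k; j++) {
            if (j > 0) {
                // Choose the next medoid with weight d[i] (simplified;
                // the K-Means++ description above weights by squared distance).
                double sum = 0.0;
                for (double di : d) sum += di;
                double cutoff = rng.nextDouble() * sum, running = 0.0;
                int next = n - 1;
                for (int i = 0; i < n; i++) {
                    running += d[i];
                    if (running >= cutoff) { next = i; break; }
                }
                medoids[j] = data[next];
            }
            // Update the nearest-medoid label and distance for every sample.
            for (int i = 0; i < n; i++) {
                double dist = distance.d(data[i], medoids[j]);
                if (dist < d[i]) { d[i] = dist; y[i] = j; }
            }
        }
        double distortion = 0.0;
        for (double di : d) distortion += di;
        return distortion;
    }
}
```

        The caller allocates medoids as an array of size k and y and d as arrays of size n; after the call, y[i] is the cluster of sample i and d[i] its distance to the nearest medoid.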