Class PartitionClustering<T>

- java.lang.Object
  - smile.clustering.PartitionClustering<T>

Type Parameters:
- T - the type of input object.

All Implemented Interfaces:
- java.io.Serializable, Clustering<T>

public abstract class PartitionClustering<T> extends java.lang.Object implements Clustering<T>
Abstract class of partition clustering. Partition methods break the observations into distinct non-overlapping groups.

Author:
- Haifeng Li

See Also:
- Serialized Form
Field Summary

Fields:
- protected int k - The number of clusters.
- protected int[] size - The number of samples in each cluster.
- protected int[] y - The cluster labels of data.
Fields inherited from interface smile.clustering.Clustering:
- OUTLIER

Constructor Summary

Constructors:
- PartitionClustering()
Method Summary

Methods:
- int[] getClusterLabel() - Returns the cluster labels of data.
- int[] getClusterSize() - Returns the number of samples in each cluster.
- int getNumClusters() - Returns the number of clusters.
- static int[] seed(double[][] data, int k, ClusteringDistance distance) - Initializes cluster membership of input objects with the K-Means++ algorithm.
- static <T> double seed(Distance<T> distance, T[] data, T[] medoids, int[] y, double[] d) - Initializes cluster membership of input objects with the K-Means++ algorithm.
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface smile.clustering.Clustering:
predict
Method Detail
-
getNumClusters
public int getNumClusters()
Returns the number of clusters.
-
getClusterLabel
public int[] getClusterLabel()
Returns the cluster labels of data.
-
getClusterSize
public int[] getClusterSize()
Returns the number of samples in each cluster.
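For example, a minimal sketch of these accessors, assuming the concrete smile.clustering.KMeans subclass and its (data, k) constructor from the same library:

    import java.util.Arrays;
    import smile.clustering.KMeans;

    public class ClusterSummaryDemo {
        public static void main(String[] args) {
            double[][] data = {
                {1.0, 1.1}, {0.9, 1.0}, {8.0, 8.2}, {8.1, 7.9}
            };
            // KMeans is assumed here as one concrete PartitionClustering subclass.
            KMeans kmeans = new KMeans(data, 2);

            int k = kmeans.getNumClusters();         // number of clusters
            int[] labels = kmeans.getClusterLabel(); // label of each sample
            int[] sizes = kmeans.getClusterSize();   // samples per cluster

            for (int i = 0; i < k; i++) {
                System.out.println("cluster " + i + ": " + sizes[i] + " samples");
            }
            System.out.println(Arrays.toString(labels));
        }
    }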
-
seed
public static int[] seed(double[][] data, int k, ClusteringDistance distance)
Initialize cluster membership of input objects with the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed. K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the closest cluster center already chosen.
The exact algorithm is as follows:
- Choose one center uniformly at random from among the data points.
- For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
- Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D²(x).
- Repeat Steps 2 and 3 until k centers have been chosen.
- Now that the initial centers have been chosen, proceed using standard k-means clustering.
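The sketch below walks through these steps for dense data, assuming squared Euclidean distance. It is illustrative only, not the library's implementation, which also supports the other distance options selected by the ClusteringDistance argument.

    import java.util.Arrays;
    import java.util.Random;

    public class KMeansPlusPlusSketch {

        // Chooses k initial centers and returns the resulting cluster labels.
        static int[] seed(double[][] data, int k, Random rng) {
            int n = data.length;
            int[] centers = new int[k];  // indices of the chosen centers
            double[] d = new double[n];  // D²(x) to the nearest chosen center
            int[] y = new int[n];        // label of the nearest chosen center
            Arrays.fill(d, Double.POSITIVE_INFINITY);

            centers[0] = rng.nextInt(n); // step 1: first center uniformly at random

            for (int j = 0; j < k; j++) {
                // Step 2: refresh D²(x) and labels against the latest center.
                for (int i = 0; i < n; i++) {
                    double dist = squared(data[i], data[centers[j]]);
                    if (dist < d[i]) {
                        d[i] = dist;
                        y[i] = j;
                    }
                }
                if (j + 1 == k) break;   // step 4: k centers have been chosen

                // Step 3: sample the next center with probability proportional to D²(x).
                double cutoff = rng.nextDouble() * Arrays.stream(d).sum();
                double cum = 0.0;
                int next = n - 1;
                for (int i = 0; i < n; i++) {
                    cum += d[i];
                    if (cum >= cutoff) { next = i; break; }
                }
                centers[j + 1] = next;
            }
            return y;  // step 5, the standard k-means iterations, runs in the caller
        }

        private static double squared(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                s += diff * diff;
            }
            return s;
        }
    }

Note that points already chosen as centers have D²(x) = 0 and therefore cannot be drawn again, which matches the "remaining data points" wording above.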
References
- D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
- Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
Parameters:
- data - the data objects to be clustered.
- k - the number of clusters.
- distance - the distance metric.

Returns:
- the cluster labels.
-
seed
public static <T> double seed(Distance<T> distance, T[] data, T[] medoids, int[] y, double[] d)
Initialize cluster membership of input objects with the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed. K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the closest cluster center already chosen.
The exact algorithm is as follows:
- Choose one center uniformly at random from among the data points.
- For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
- Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D²(x).
- Repeat Steps 2 and 3 until k centers have been chosen.
- Now that the initial centers have been chosen, proceed using standard k-means clustering.
References
- D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
- Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
Type Parameters:
- T - the type of input object.

Parameters:
- distance - the distance metric.
- data - the data objects array of size n.
- medoids - an array of size k to store the cluster medoids on output.
- y - an array of size n to store the cluster labels on output.
- d - an array of size n to store the distance of each sample to the nearest medoid.

Returns:
- the initial cluster distortion.
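As an illustration of the output parameters, a minimal sketch assuming smile.math.distance.EuclideanDistance, a Distance<double[]> implementation shipped with the library:

    import smile.clustering.PartitionClustering;
    import smile.math.distance.EuclideanDistance;

    public class SeedDemo {
        public static void main(String[] args) {
            double[][] data = {
                {1.0, 1.1}, {0.9, 1.0}, {8.0, 8.2}, {8.1, 7.9}
            };
            int k = 2;
            double[][] medoids = new double[k][];  // filled with medoids on output
            int[] y = new int[data.length];        // filled with labels on output
            double[] d = new double[data.length];  // distance to nearest medoid

            double distortion = PartitionClustering.seed(
                    new EuclideanDistance(), data, medoids, y, d);

            System.out.println("initial distortion: " + distortion);
            // y and medoids can now seed a k-medoids style algorithm.
        }
    }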