Package smile.clustering
Class KMeans
- java.lang.Object
-
- smile.clustering.PartitionClustering<double[]>
-
- smile.clustering.KMeans
-
- All Implemented Interfaces:
java.io.Serializable
,Clustering<double[]>
public class KMeans extends PartitionClustering<double[]>
K-Means clustering. The algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.However, the k-means algorithm has at least two major theoretic shortcomings:
- First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.
- Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal learn.
We also use k-d trees to speed up each k-means step as described in the filter algorithm by Kanungo, et al.
K-means is a hard clustering method, i.e. each sample is assigned to a specific cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assign samples to different clusters with different probabilities.
References
- Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE TRANS. PAMI, 2002.
- D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
- Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
- Author:
- Haifeng Li
- See Also:
CLARANS
,BBDTree
, Serialized Form
-
-
Field Summary
-
Fields inherited from class smile.clustering.PartitionClustering
k, size, y
-
Fields inherited from interface smile.clustering.Clustering
OUTLIER
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description double[][]
centroids()
Returns the centroids.double
distortion()
Returns the distortion.static KMeans
lloyd(double[][] data, int k)
The implementation of Lloyd algorithm as a benchmark.static KMeans
lloyd(double[][] data, int k, int maxIter)
The implementation of Lloyd algorithm as a benchmark.static KMeans
lloyd(double[][] data, int k, int maxIter, int runs)
The implementation of Lloyd algorithm as a benchmark.int
predict(double[] x)
Cluster a new instance.java.lang.String
toString()
-
Methods inherited from class smile.clustering.PartitionClustering
getClusterLabel, getClusterSize, getNumClusters, seed, seed
-
-
-
-
Constructor Detail
-
KMeans
public KMeans(double[][] data, int k)
Constructor. Clustering data into k clusters up to 100 iterations.- Parameters:
data
- the input data of which each row is a sample.k
- the number of clusters.
-
KMeans
public KMeans(double[][] data, int k, int maxIter)
Constructor. Clustering data into k clusters.- Parameters:
data
- the input data of which each row is a sample.k
- the number of clusters.maxIter
- the maximum number of iterations for each running.
-
KMeans
public KMeans(double[][] data, int k, int maxIter, int runs)
Clustering data into k clusters. Run the algorithm for given times and return the best one with smallest distortion.- Parameters:
data
- the input data of which each row is a sample.k
- the number of clusters.maxIter
- the maximum number of iterations for each running.runs
- the number of runs of K-Means algorithm.
-
-
Method Detail
-
distortion
public double distortion()
Returns the distortion.
-
centroids
public double[][] centroids()
Returns the centroids.
-
predict
public int predict(double[] x)
Cluster a new instance.- Parameters:
x
- a new instance.- Returns:
- the cluster label, which is the index of nearest centroid.
-
lloyd
public static KMeans lloyd(double[][] data, int k)
The implementation of Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN). The algorithm runs up to 100 iterations.- Parameters:
data
- the input data of which each row is a sample.k
- the number of clusters.
-
lloyd
public static KMeans lloyd(double[][] data, int k, int maxIter)
The implementation of Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN).- Parameters:
data
- the input data of which each row is a sample.k
- the number of clusters.maxIter
- the maximum number of iterations for each running.
-
lloyd
public static KMeans lloyd(double[][] data, int k, int maxIter, int runs)
The implementation of Lloyd algorithm as a benchmark. Run the algorithm multiple times and return the best one in terms of smallest distortion. The data may contain missing values (i.e. Double.NaN).- Parameters:
data
- the input data of which each row is a sample.k
- the number of clusters.maxIter
- the maximum number of iterations for each running.runs
- the number of runs of K-Means algorithm.
-
toString
public java.lang.String toString()
- Overrides:
toString
in classjava.lang.Object
-
-