Cluster analysis is a set of data reduction techniques designed to group similar observations in a data set, such that observations in the same group are as similar to each other as possible and observations in different groups are as different from each other as possible. Other data reduction techniques, such as factor analysis (FA) and principal component analysis (PCA), group by similarities across the variables (columns) of a data set; cluster analysis instead groups observations by similarities across rows.
K-means is a method of cluster analysis that groups observations by minimizing the Euclidean distances between them. Euclidean distance is analogous to the length of the hypotenuse of a right triangle: the differences between two observations on two variables (x and y) are plugged into the Pythagorean equation to solve for the shortest distance between the two points (the length of the hypotenuse). Euclidean distances extend to any number of dimensions n, and the distances refer to numerical differences on any measured continuous variable, not just spatial or geometric distances. Because Euclidean distance is defined this way, all variables used to determine clusters with k-means must be continuous.
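As a minimal sketch of the distance described above, the following Python function applies the Pythagorean idea to observations measured on any number of continuous variables. The function name euclidean_distance and the example values are illustrative assumptions, not part of any package mentioned in this article.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two observations with the same variables."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    # Square the difference on each variable, add them up, take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two variables: the familiar 3-4-5 right triangle.
print(euclidean_distance([0, 0], [3, 4]))              # 5.0
# Four variables: same formula, more squared differences under the root.
print(euclidean_distance([1, 2, 3, 4], [2, 4, 5, 8]))  # 5.0
```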
To perform k-means clustering, the algorithm first randomly assigns k initial centers (k is specified by the user), either by randomly choosing points in the Euclidean space defined by all n variables or by sampling k of the available observations to serve as initial centers. It then assigns each observation to the nearest center. Next, it calculates a new center for each cluster as the centroid (the mean of each clustering variable) of the observations now in that cluster. K-means then reassigns observations to the nearest of the new centers, so some observations change cluster. These two steps repeat until an iteration no longer reassigns any observation to a different cluster. At that point the algorithm is considered to have converged, and the final cluster assignments constitute the clustering solution.
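The steps above can be sketched in a few lines of Python. This is a simplified illustration of the assign-and-update loop, not the Hartigan-Wong algorithm or the implementation used by any of the packages discussed here. The function name kmeans, the parameters max_iter and seed, and the rule of keeping the old center when a cluster becomes empty are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance (same ordering as the distance itself)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initial centers: sample k of the available observations.
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(point, centers[j]))
            for point in data
        ]
        if new_labels == labels:
            break  # no observation changed cluster: converged
        labels = new_labels
        # Update step: each center becomes the mean of its cluster's observations.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical data: two variables measured on six observations.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

Because the starting centers are random, different seeds can lead to different solutions, which is why software typically offers repeated random starts.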
Several k-means algorithms are available. A commonly used one (the default in R's kmeans() function) is the Hartigan-Wong algorithm, which aims to minimize the Euclidean distances of all points to their nearest cluster centers by minimizing the within-cluster sum of squared errors (SSE).
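To make the quantity being minimized concrete, here is a small Python sketch that computes the within-cluster SSE for a given set of cluster assignments and centers. The function name within_cluster_sse and the toy data are assumptions chosen for illustration.

```python
def within_cluster_sse(data, labels, centers):
    """Sum of squared Euclidean distances from each observation to its cluster center."""
    total = 0.0
    for point, label in zip(data, labels):
        center = centers[label]
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total

# Hypothetical solution: four observations in two clusters.
data = [(1.0, 1.0), (3.0, 1.0), (10.0, 5.0), (10.0, 9.0)]
labels = [0, 0, 1, 1]
centers = [(2.0, 1.0), (10.0, 7.0)]  # the mean of each cluster
print(within_cluster_sse(data, labels, centers))  # 10.0
```

In this example each observation in the first cluster contributes 1 and each observation in the second contributes 4, for a total of 10.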
K-means is implemented in many statistical software programs:
In R, use the function kmeans(x, centers, iter.max = 10, nstart = 1), which is part of the stats package. The data object to be clustered is passed as x. The number of clusters k is specified by the user in centers = #. kmeans() repeats the procedure nstart = # times with different initial centers (randomly sampled from the entire data set) and selects the best run (smallest SSE). iter.max = # sets the maximum number of iterations allowed per run (default is 10).
In Stata, use the command: cluster kmeans [varlist], k(#) [options]. Use [varlist] to declare the clustering variables and k(#) to declare k. Other options allow similarity measures other than Euclidean distance to be specified.
In SAS, use the command: PROC FASTCLUS maxclusters=k; var [varlist]. This requires specifying k and the clustering variables in [varlist].
In SPSS, use the menu path: Analyze -> Classify -> K-Means Cluster. Additional help files are available online.
K-means clustering requires all variables to be continuous. Other methods that do not require all variables to be continuous, including some hierarchical clustering methods, have different assumptions and are discussed in the resource list below. K-means clustering also requires a priori specification of the number of clusters, k. Although k can be chosen empirically from the data (using a scree plot of the within-group SSE against each cluster solution), the decision should be guided by theory, and a poor choice can lead to erroneous clusters. See Peeples' online walkthrough, R script for K-means cluster analysis, in the resources below for examples of choosing a cluster solution.
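The numbers behind such a scree plot can be sketched as follows: for each candidate k, run k-means from several random starts and record the smallest within-group SSE. This is an illustrative Python sketch under simplifying assumptions (a basic assign-and-update loop, hypothetical function names kmeans_sse and sse_by_k, and a toy one-variable data set), not the procedure used in Peeples' script.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(data, k, rng, max_iter=100):
    """One k-means run from randomly sampled observations; returns its SSE."""
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(data, labels))

def sse_by_k(data, max_k, n_starts=10, seed=0):
    """Best (smallest) SSE over n_starts random starts, for k = 1..max_k."""
    rng = random.Random(seed)
    return {
        k: min(kmeans_sse(data, k, rng) for _ in range(n_starts))
        for k in range(1, max_k + 1)
    }

# Toy data: one variable, two obvious groups.
data = [(0.0,), (2.0,), (10.0,), (12.0,)]
print(sse_by_k(data, max_k=4))  # {1: 104.0, 2: 4.0, 3: 2.0, 4: 0.0}
```

Here the SSE drops sharply from k = 1 to k = 2 and only slightly afterwards, the kind of "elbow" a scree plot is meant to reveal. SSE always reaches zero when every observation is its own cluster, which is one reason the empirical curve alone should not decide k.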
The choice of clustering variables is also of particular importance. In general, cluster analysis methods assume that the variables chosen to determine clusters are a comprehensive representation of the underlying construct of interest that groups similar observations together. While variable selection remains a debated topic, the consensus in the field recommends clustering on as many variables as possible, so long as the set fits this description. Variables that account for little of the variance in Euclidean distances between observations will contribute less to cluster assignment. Sensitivity analyses are recommended, using different cluster solutions and sets of clustering variables, to determine the robustness of the clustering algorithm.
By default, k-means aims to minimize the within-group sum of squared errors as measured by Euclidean distances, but this is not always justified when data assumptions are not met. Consult the textbooks and online guides in the resources section below, especially Robinson's R blog post, K-means clustering is not a free lunch, for examples of the problems encountered with k-means clustering when assumptions are violated.
Finally, cluster analysis methods are similar to other data reduction techniques in that they are largely exploratory tools, so results should be interpreted with caution. Many techniques exist for validating cluster analysis results, including internal validation with cross-validation or bootstrapping, validation against conceptual groups theorized a priori or against expert opinion, and external validation with separate data sets. A common application of cluster analysis is as a tool for predicting the cluster membership of future observations from existing data, but it does not describe why observations are grouped that way. For this reason, cluster analysis is often used together with factor analysis: cluster analysis describes how observations are similar, and factor analysis describes why they are similar. Ultimately, the validity of cluster analysis results should be judged by theory and by the usefulness of the cluster descriptions.
Textbooks & Chapters
Aldenderfer MS and Blashfield RK (1984). Cluster analysis. Sage University Paper series on Quantitative Applications in the Social Sciences, Series No. 07-044. Newbury Park, California: Sage Publications. This Sage green book on cluster analysis is a classic reference text on the theory and methods of cluster analysis, as well as a guide for reporting results.
Everitt BS, Landau S, Leese M, Stahl D (2011). Cluster analysis, 5th edition. Wiley series. Full and up-to-date descriptions of the various types of cluster analysis methods as the field has evolved.
Lorr M. (1983). Cluster analysis for social scientists. Jossey-Bass Social and Behavioral Science Series. Lorr's classic text describes methods using data typically found in the social sciences - k-means data assumptions are often difficult to satisfy with data in the social sciences, and alternatives are discussed.
Hauser J and Rybakowski J (1997). Three groups of male alcoholics. Drug and Alcohol Dependence; 48(3): 243-50. An example of clustering behavior types in addiction research.
Breuhl S, et al. (1999). Using cluster analysis to validate the IHS diagnostic criteria for migraine and tension-type headache. Headache; 39(3): 181-9. A study validating diagnostic criteria using k-means on symptom patterns.
Guthrie E, et al. (2003). The cluster analysis of symptoms and health-oriented behavior distinguishes subgroups of patients with severe irritable bowel syndrome. Gut; 52(11): 1616-22. Care-seeking behavior patterns are differentiated through cluster analysis.
MacQueen J (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Math. Statist. and Prob., Vol. 2, No. 1. An early paper on the statistical methods behind the k-means clustering algorithm, from one of its early developers.
Salim SZ and Ismail MA. (1984). K-Means Type Algorithms: A Generalized Theorem of Convergence and Characterization of Local Optimality. IEEE Trans Pattern Anal Mach Intell; 6 (1): 81-7. Methodological considerations and recommendations for the use of k-means clustering.
Saeed F, et al. (2012). Combining K-means clusterings of chemical structures using a cluster-based similarity partitioning algorithm. Communications in Computer and Information Science; 322: 304-312. A recent article on improving the performance of k-means cluster solutions through multiple-iteration and combination approaches.
Various walkthroughs of using R software to perform k-means cluster analysis with applied examples and sample code.
1. statmethods.net: Quick-R: Cluster Analysis http://www.statmethods.net/advstats/cluster.html
2. R-statistics blog: K-means Clustering http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
3. Peeples MA (2011). R script for K-means cluster analysis http://www.mattpeeples.net/kmeans.html
4. Robinson D (2015). R-bloggers: K-means clustering is not a free lunch http://www.r-bloggers.com/k-means-clustering-is-not-a-free-lunch/
York University: R commands for cluster analysis http://wiki.math.yorku.ca/index.php/R:_Cluster_analysis
R kmeans() help file https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
Associated data reduction techniques
Exploratory Factor Analysis (EFA) on Advanced Epidemiology
Principal Component Analysis (PCA) on Advanced Epidemiology