Overview
Cluster analysis is a set of data reduction techniques that group similar observations in a data set, such that observations in the same group are as similar to each other as possible, and observations in different groups are as different from each other as possible. Compared to other data reduction techniques such as factor analysis (FA) and principal components analysis (PCA), which group by similarities across variables (the columns of a data set), cluster analysis groups observations by similarities across rows.
Description
K-means is a method of cluster analysis that groups observations by minimizing the Euclidean distances between them. Computing a Euclidean distance is analogous to measuring the hypotenuse of a right triangle: the differences between two observations on two variables (x and y) are entered into the Pythagorean equation to give the shortest distance between the two points (the length of the hypotenuse). Euclidean distance extends to any number n of dimensions, and the distances refer to numerical differences on any measured continuous variable, not just spatial or geometric distances. This definition of Euclidean distance therefore requires that all variables used in k-means clustering be continuous.
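To make the n-dimensional idea concrete, here is a minimal Python sketch; the function name euclidean is invented for this illustration and does not come from any package:

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two observations, given as equal-length
    sequences of continuous variable values. Works for any number of
    dimensions: the two-variable case is the familiar Pythagorean formula."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two dimensions: a 3-4-5 right triangle, so the hypotenuse is 5.
print(euclidean((0, 0), (3, 4)))  # 5.0

# Four dimensions: same formula, more measured variables per observation.
print(euclidean((1.0, 2.0, 3.0, 4.0), (2.0, 2.0, 3.0, 4.0)))  # 1.0
```

The same function applies whether the variables are spatial coordinates or any other continuous measurements, which is why k-means requires all clustering variables to be continuous.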
Procedure
To perform k-means clustering, the algorithm first chooses k initial centers (k is specified by the user), either by randomly selecting points in the Euclidean space defined by all n variables or by sampling k of the available observations to serve as initial centers. It then assigns each observation to its closest center. Next, it computes the new center of each cluster as the centroid (the mean of the clustering variables) of that cluster's observations. K-means then reassigns each observation to the now-closest center (some observations change cluster) and recomputes the centroids. This process repeats until an iteration no longer moves any observation to a new cluster. At that point the algorithm is considered to have converged, and the final cluster assignments constitute the cluster solution.
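The iterative procedure above can be sketched in plain Python. This is a toy illustration of the standard assign-then-update (Lloyd-style) iteration, not any package's actual implementation; the names kmeans and euclidean2 are invented for the example:

```python
import random

def euclidean2(a, b):
    # Squared Euclidean distance; omitting the square root does not
    # change which center is closest to an observation.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0):
    """Toy k-means: sample k observations as initial centers, then
    alternate assignment and centroid updates until no observation
    changes cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k observations as first centers
    assignment = [None] * len(points)
    while True:
        # Step 1: assign each observation to its closest center.
        new_assignment = [min(range(k), key=lambda j: euclidean2(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:     # converged: no reassignments
            return centers, assignment
        assignment = new_assignment
        # Step 2: recompute each center as the centroid (variable-wise
        # mean) of the observations now assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                      # keep the old center if a cluster empties
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))

# Six observations on two continuous variables, forming two obvious groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(data, k=2)
```

On this well-separated toy data the iteration recovers the two groups regardless of which observations are sampled as initial centers; real data rarely behave so cleanly, which is why multiple random starts are used (see the Software section).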
Several k-means algorithms are available. The standard algorithm is the Hartigan-Wong algorithm, which aims to minimize the Euclidean distance of each point from its nearest cluster center by minimizing the within-cluster sum of squared errors (SSE).
Software
K-means is implemented in many statistical software programs:
In R, use the function kmeans(x, centers, iter.max = 10, nstart = 1) from the base stats package. The data object to be clustered is declared in x. The number of clusters k is specified by the user in centers = #. kmeans() repeats the algorithm with different initial centers (randomly sampled from the data set) nstart = # times and keeps the best run (smallest SSE). iter.max = # sets the maximum number of iterations per run (default is 10).
In STATA, use the command: cluster kmeans [varlist], k(#) [options]. Use [varlist] to declare the clustering variables and k(#) to declare k. Options allow similarity measures other than Euclidean distance to be specified.
In SAS, use the command: PROC FASTCLUS maxclusters = k; var [varlist]. This requires specifying k and the clustering variables in [varlist].
In SPSS, use the menu path: Analyze > Classify > K-Means Cluster. Additional help files are available online.
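The multiple-starts idea behind R's nstart argument (run the algorithm from several random initializations and keep the run with the smallest within-cluster SSE) can be sketched in Python. This is a hypothetical illustration, not how any of the packages above are implemented; lloyd, sse, and kmeans_nstart are invented names:

```python
import random

def sse(points, centers, labels):
    """Within-cluster sum of squared (Euclidean) errors for one solution."""
    return sum(sum((x - y) ** 2 for x, y in zip(p, centers[l]))
               for p, l in zip(points, labels))

def lloyd(points, k, rng):
    # One k-means run: sample k observations as centers, iterate to convergence.
    centers, labels = rng.sample(points, k), None
    while True:
        new = [min(range(k),
                   key=lambda j: sum((x - y) ** 2 for x, y in zip(p, centers[j])))
               for p in points]
        if new == labels:
            return centers, labels
        labels = new
        for j in range(k):
            m = [p for p, l in zip(points, labels) if l == j]
            if m:
                centers[j] = tuple(sum(v) / len(m) for v in zip(*m))

def kmeans_nstart(points, k, nstart=10, seed=0):
    """Run lloyd() nstart times and keep the lowest-SSE solution,
    analogous to nstart in R's kmeans()."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, rng) for _ in range(nstart)]
    return min(runs, key=lambda run: sse(points, *run))

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
best_centers, best_labels = kmeans_nstart(data, k=2)
```

Because each run can converge to a different local optimum, keeping the smallest-SSE run makes the final solution much less sensitive to the random initialization.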
Considerations
K-means clustering requires that all variables be continuous. Other methods that do not require all variables to be continuous, including some hierarchical clustering methods, have different assumptions and are discussed in the resources list below. K-means clustering also requires the number of clusters, k, to be specified a priori. Although this can be informed empirically by the data (using a scree plot of within-group SSE against each cluster solution), the decision should be made on theoretical grounds, and poor choices can lead to spurious clusters. See Peeples' online R walkthrough, R Script for K-Means Cluster Analysis (below), for examples of selecting a cluster solution.
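The scree-plot idea can be illustrated with a tiny one-dimensional example: fit k-means for several values of k and compare the within-group SSE, looking for the "elbow" where the SSE stops dropping sharply. This is a toy sketch (lloyd_1d is an invented name) and, as noted above, no substitute for theory-driven selection of k:

```python
import random

def lloyd_1d(xs, k, seed=0):
    """Tiny 1-D k-means; returns the within-group SSE of the converged
    solution, which is all a scree plot needs."""
    rng = random.Random(seed)
    centers, labels = rng.sample(xs, k), None
    while True:
        new = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        if new == labels:  # converged: report the within-group SSE
            return sum((x - centers[l]) ** 2 for x, l in zip(xs, labels))
        labels = new
        for j in range(k):
            m = [x for x, l in zip(xs, labels) if l == j]
            if m:
                centers[j] = sum(m) / len(m)

# Two clearly separated groups of values on one continuous variable.
xs = [1.0, 1.1, 0.9, 1.2, 9.0, 9.1, 8.9, 9.2]
scree = {k: lloyd_1d(xs, k) for k in (1, 2, 3)}
# The SSE drop is large from k=1 to k=2 and small thereafter,
# so the elbow of the scree plot suggests k=2 for this data.
```

Plotting scree[k] against k gives the scree plot described above; the elbow is where adding another cluster stops buying a large reduction in SSE.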
The choice of clustering variables is also of particular importance. In general, cluster analysis methods assume that the variables chosen to determine clusters are a comprehensive representation of the underlying construct of interest that groups similar observations. While variable selection remains a controversial topic, the consensus recommends including as many variables as fit this description; variables that explain little of the variance in Euclidean distances between observations will contribute less to cluster assignment. Sensitivity analyses using different cluster solutions and sets of clustering variables are recommended to assess the robustness of the clustering algorithm.
By default, k-means aims to minimize the within-group sum of squared errors as measured by Euclidean distance, but this is not always justified when data assumptions are not met. Consult the textbooks and online guides in the resources section below, particularly Robinson's R-bloggers post on why k-means clustering is not a "free lunch," for examples of problems encountered with k-means clustering when assumptions are violated.
Finally, cluster analysis methods are, like other data reduction techniques, largely exploratory tools, and results should be interpreted with caution. There are many techniques for validating cluster analysis results, including internal validation with cross-validation or bootstrapping, validation against a priori theorized conceptual groups or expert opinion, and external validation with separate datasets. A common use of cluster analysis is to predict cluster membership of future observations using existing data, but this does not explain why the observations are grouped as they are. Cluster analysis is therefore often used in conjunction with factor analysis, where cluster analysis describes how observations are similar and factor analysis describes why they are similar. Ultimately, the validity of cluster analysis results should be judged by theory and by the usefulness of the cluster descriptions.
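As one concrete example of internal validation, a bootstrap stability check refits the clustering on resampled data and examines how far the cluster centers move: small, consistent shifts suggest a stable solution. The following is a minimal one-dimensional Python sketch (centers_1d is an invented name), not a full validation procedure:

```python
import random

def centers_1d(xs, k, rng):
    """Minimal 1-D k-means returning the sorted cluster centers."""
    centers, labels = rng.sample(xs, k), None
    while True:
        new = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        if new == labels:
            return sorted(centers)
        labels = new
        for j in range(k):
            m = [x for x, l in zip(xs, labels) if l == j]
            if m:
                centers[j] = sum(m) / len(m)

rng = random.Random(0)
xs = [1.0, 1.1, 0.9, 1.2, 9.0, 9.1, 8.9, 9.2]
base = centers_1d(xs, 2, rng)          # centers fit on the original data

# Refit on bootstrap resamples (sampling observations with replacement)
# and record how far the matched centers move from the original fit.
shifts = []
for _ in range(20):
    boot = [rng.choice(xs) for _ in xs]
    c = centers_1d(boot, 2, rng)
    shifts.append(max(abs(a - b) for a, b in zip(base, c)))
```

Inspecting the distribution of shifts (or, with suitable libraries, an agreement index between original and resampled cluster assignments) gives a rough sense of how much the solution depends on the particular observations sampled.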
Readings
Textbooks & Chapters

Aldenderfer MS and Blashfield RK (1984). Cluster Analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, Series No. 07-044. Newbury Park, CA: Sage Publications. This "green book" on cluster analysis is a classic reference on the theory and methods of cluster analysis, as well as a guide to reporting results.

Everitt BS, Landau S, Leese M, Stahl D (2011). Cluster Analysis, 5th Edition. Wiley series. Complete and up-to-date descriptions of the various types of cluster analysis methods as the field has evolved.

Lorr M (1983). Cluster Analysis for Social Scientists. Jossey-Bass Social and Behavioral Science Series. Lorr's classic text describes methods for data typically found in the social sciences; k-means data assumptions are often difficult to satisfy with such data, and alternatives are discussed.
Application articles

Hauser J and Rybakowski J (1997). Three clusters of male alcoholics. Drug and Alcohol Dependence; 48(3): 243-50. An example of clustering behavior types in addiction research.

Bruehl S, et al. (1999). Use of cluster analysis to validate IHS diagnostic criteria for migraine and tension-type headache. Headache; 39(3): 181-9. A study validating diagnostic criteria using k-means on symptom patterns.

Guthrie E, et al. (2003). Cluster analysis of symptoms and health seeking behaviour differentiates subgroups of patients with severe irritable bowel syndrome. Gut; 52(11): 1616-22. Care-seeking behavioral patterns are differentiated through cluster analysis.
Methodological articles

MacQueen J (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Math. Statist. and Prob., Vol. 2, No. 1. Early paper on the k-means clustering algorithm from one of its original developers.

Selim SZ and Ismail MA (1984). K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell; 6(1): 81-7. Methodological considerations and recommendations for the use of k-means clustering.

Saeed F, et al. (2012). Combining K-means clusterings of chemical structures using a cluster-based similarity partitioning algorithm. Communications in Computer and Information Science; 322: 304-312. A recent article on improving the performance of k-means cluster solutions through multiple-iteration and combination approaches.
Web pages
Various walkthroughs of using R software to perform kmeans cluster analysis with applied examples and sample code.

1. statmethods.net: Quick-R: Cluster Analysis http://www.statmethods.net/advstats/cluster.html

2. R-statistics blog: K-means Clustering http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/

3. Peeples MA (2011). R Script for K-Means Cluster Analysis http://www.mattpeeples.net/kmeans.html

4. Robinson D (2015). R-bloggers: K-means clustering is not a free lunch http://www.r-bloggers.com/k-means-clustering-is-not-a-free-lunch/
Technical R resources

York University: Cluster analysis R commands http://wiki.math.yorku.ca/index.php/R:_Cluster_analysis

R kmeans() help file https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
Related data reduction techniques

Exploratory Factor Analysis (EFA) in Advanced Epidemiology

Principal Component Analysis (PCA) in Advanced Epidemiology