This page briefly describes the methods of Exploratory Factor Analysis (EFA) and provides an annotated list of resources.
Factor analysis is a century-old family of techniques used to identify the structure (dimensionality) of observed data and to uncover the underlying constructs that give rise to observed phenomena. The techniques identify and examine clusters of inter-correlated variables; these clusters are called factors or latent variables (see Figure 1). In statistical terms, factor analysis is a method of modeling the population covariance matrix of a set of variables using sample data. Factor analysis is used for theory development, psychometric instrument development, and data reduction.
Figure 1. Example of the factor structure of common psychiatric disorders. Common disorders appear to represent two latent dimensions, internalizing and externalizing disorders. From Krueger, R. F., 1999, The Structure of Common Mental Disorders. Archives of General Psychiatry. 56: 921-926.
Factor analysis was developed in 1904 by the psychologist and statistician Charles Spearman (famous for Spearman's correlation coefficient) in his work on the underlying dimensions of intelligence. Until the introduction of statistical computing, its use was hampered by laborious hand calculations; since then the technique has flourished.
There are two main types of factor analysis: exploratory and confirmatory. In exploratory factor analysis (EFA, the focus of this resource page), each observed variable is potentially a measure of every factor, and the goal is to determine the strongest relationships between observed variables and factors. Confirmatory factor analysis (CFA) posits a simple factor structure in which each variable can be a measure of only one factor, and the correlation structure of the data is tested against the hypothesized structure using goodness-of-fit tests. Figure 2 is a graphical representation of EFA and CFA.
Figure 2. EFA (left) and CFA (right). Adapted from Wall, M., September 20, 2012, Session 3 Guest Lecture in Epidemiology of Drug and Alcohol Problems, Hassin, D., Columbia University Mailman School of Public Health
There are different factor analysis techniques for different measurement and data scenarios:
1. Observed variables are continuous, latent variables are assumed to be continuous
2. Observed are continuous, latent are categorical
3. Observed are categorical, latent are continuous
4. Observed are categorical, latent are categorical
This resource page focuses on scenarios 1 and 3.
Figures 3 and 4 below illustrate some of the basic premises of measurement theory as they relate to factor analysis:
Factors, or latent variables, systematically influence observed variables (i.e., when we measure observed variables, those measurements / observations are at least in part caused by latent variables)
Inter-individual differences (i.e., variance) in observed variables are due to latent variables and measurement error
Each type of factor (common, specific; see below) contributes part of the variance, in addition to measurement error
Figure 3. Elements that affect the observed variables. Figure adapted from Tucker, LR and MacCallum, RC. 1997, Exploratory Factor Analysis: http://www.unc.edu/~rcm/book/factornew.htm
Figure 3 shows that three things affect the observed variables. Two are types of latent variables, or factors. The first are common factors, which affect more than one of the observed variables (e.g., math ability may affect an addition test score, a multiplication test score, and a division test score). The second are specific factors, which affect only one of the observed variables (a common factor can become a specific factor if you remove all but one of the observed variables it affects). The third is measurement error, which is not latent but is often due to unsystematic events that affect measurement. Measurement error is closely related to reliability.
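The three sources of influence described above can be sketched as a small simulation in base R. This is only an illustration: the loadings (0.8, 0.7, 0.6), the standard deviations, and the helper name make_test are arbitrary assumptions, not values from any of the cited sources.

```r
# Illustrative simulation; every numeric value here is an arbitrary assumption
set.seed(42)
n <- 1000

# Common factor: influences every test below
math_ability <- rnorm(n)

# Hypothetical helper: score = loading * common factor + specific factor + error
make_test <- function(loading) {
  specific_factor   <- rnorm(n, sd = 0.4)  # affects this test only
  measurement_error <- rnorm(n, sd = 0.4)  # unsystematic noise
  loading * math_ability + specific_factor + measurement_error
}

tests <- data.frame(
  addition       = make_test(0.8),
  multiplication = make_test(0.7),
  division       = make_test(0.6)
)

# The tests are correlated only because they share the common factor
test_cor <- cor(tests)
print(round(test_cor, 2))
```

In real data the specific factor and the measurement error cannot be separated this cleanly, which is why they are usually reported together.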
Each of the elements that affect the observed variables also contributes to the variance of those variables. Figure 4 shows that the variance of a given observed variable is due in part to factors that also affect other observed variables (common variance), factors that affect only that variable (specific variance), and measurement error (error variance). The common variance is sometimes called the communality, and the specific variance and error variance are often combined and called the uniqueness.
Figure 4. Variance structure of observed variables. Figure from James Neill, 2013, Exploratory Factor Analysis, Lecture 5, Survey Research and Design in Psychology. http://www.slideshare.net/jtneill/exploratory-factor-analysis
The figure also shows a major difference between factor analysis and principal component analysis. In principal component analysis, the goal is to account for as much of the total variance in the observed variables as possible; linear combinations of observed variables are used to create components. In factor analysis, the goal is to explain the covariance among variables; the observed variables are defined as linear combinations of the factors.
The main point is that factor analytic theory is about accounting for the covariation among observed variables. When observed variables are correlated with each other, factor analytic theory says that the correlation is at least partially due to the influence of common latent variables.
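As a rough illustration of this variance structure, the sketch below fits a two-factor model with base R's factanal() to ability.cov, a covariance matrix of six ability tests that ships with R. The choice of dataset and of two factors is an assumption made for the example, not something prescribed by the sources above.

```r
# ability.cov ships with base R (datasets package)
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "none")

L <- unclass(fit$loadings)        # 6 x 2 matrix of factor loadings
communality <- rowSums(L^2)       # variance shared through the common factors
uniqueness  <- fit$uniquenesses   # specific variance + error variance

# For standardized variables, communality + uniqueness is approximately 1
print(round(cbind(communality, uniqueness,
                  total = communality + uniqueness), 3))

# The factor model tries to reproduce the observed correlations
R_observed <- cov2cor(ability.cov$cov)
R_implied  <- L %*% t(L) + diag(uniqueness)
max_residual <- max(abs(R_observed - R_implied))
print(round(max_residual, 3))
```

The off-diagonal entries of R_implied come only from the loadings, which is the sense in which the common factors account for the covariation among the observed variables.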
Factor analysis has the following assumptions, which can be explored in more detail in the resources linked below:
Sample size (e.g. 20 observations per variable)
Level of measurement (e.g. the measurement / data scenarios mentioned above)
Outliers (factor analysis is sensitive to outliers)
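Two of these checks can be sketched quickly in base R. The example below uses simulated data with one planted outlier; screening with Mahalanobis distance and a chi-square cutoff at 0.999 is a common convention assumed here for illustration, not a method named by this page.

```r
# Simulated data, purely for illustration
set.seed(7)
n_obs  <- 200
n_vars <- 5
X <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
X[1, ] <- 8                      # plant one gross multivariate outlier

# Sample size: observations per variable (rule of thumb above: about 20)
obs_per_var <- n_obs / n_vars
meets_rule  <- obs_per_var >= 20
print(obs_per_var)

# Outliers: squared Mahalanobis distance of each row from the centroid
d2      <- mahalanobis(X, center = colMeans(X), cov = cov(X))
cutoff  <- qchisq(0.999, df = n_vars)   # assumed screening threshold
flagged <- which(d2 > cutoff)
print(flagged)
```

Flagged rows are candidates for inspection rather than automatic deletion.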
Eigenvalues and factor loadings
[Note: this matrix algebra review can help you understand what is going on with eigenvalues and factor loadings under the hood, but is not strictly necessary to interpret the results of the factor analysis.]
Factors are extracted from correlation matrices by transforming such matrices by eigenvectors. An eigenvector of a square matrix is a vector that, when pre-multiplied by the square matrix, yields a vector that is a scalar multiple of the original vector. This scalar multiple is the eigenvalue.
The eigenvalue represents the variance that each factor accounts for. Each extracted factor has an eigenvalue (the scalar multiple of the original vector). The first extracted factor will try to absorb as much variance as possible, so successive eigenvalues are smaller than the first. Factors with eigenvalues above 1 are generally considered stable, since each accounts for more variance than a single observed variable. The sum of all eigenvalues is the number of observed variables in the model.
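These properties can be checked directly with base R's eigen(). The sketch below uses the correlation matrix derived from the built-in ability.cov data, chosen only for illustration. It works on the full (unreduced) correlation matrix; some extraction methods work on a reduced matrix instead.

```r
R   <- cov2cor(ability.cov$cov)   # 6 x 6 correlation matrix
eig <- eigen(R)

v1      <- eig$vectors[, 1]       # first eigenvector
lambda1 <- eig$values[1]          # its eigenvalue (a scalar, not necessarily an integer)

# Pre-multiplying the eigenvector by R only rescales it by the eigenvalue
lhs <- as.vector(R %*% v1)
rhs <- lambda1 * v1
print(all.equal(lhs, rhs))

# Eigenvalues come out in decreasing order and sum to the number of variables
print(round(eig$values, 3))
print(sum(eig$values))
```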
Figure 5. Scree plot, from James Neill, 2013, Exploratory Factor Analysis, Lecture 5, Survey Research and Design in Psychology. http://www.slideshare.net/jtneill/exploratory-factor-analysis
Each variable contributes a variance of 1. Eigenvalues are then allocated to the factors according to the amount of variance each explains. Scree plots (Figure 5) are often output in factor analysis software and are line graphs of eigenvalues. They depict the amount of variance explained by each factor, and the cut-off is the number of factors just before the bend in the scree plot, e.g., about 2 or 3 factors in Figure 5. Eigenvalues and scree plots can help you determine how many factors best fit your data.
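A scree plot can also be drawn by hand from the eigenvalues, which makes it clear what the software is plotting. The sketch below uses base R graphics and the built-in ability.cov data; the dataset and the dashed reference line at 1 are choices made for this example.

```r
R  <- cov2cor(ability.cov$cov)
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

# Each variable contributes a variance of 1, so the proportions sum to 1
scree_table <- data.frame(
  factor     = seq_along(ev),
  eigenvalue = ev,
  prop_var   = ev / sum(ev)
)
print(round(scree_table, 3))

# Line graph of eigenvalues; look for the bend
plot(scree_table$factor, scree_table$eigenvalue, type = "b",
     xlab = "Factor number", ylab = "Eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)            # reference line at eigenvalue = 1

n_above_one <- sum(ev > 1)        # count of eigenvalues above 1
print(n_above_one)
```

The bend in the plot and the count of eigenvalues above 1 do not always agree, so both are best treated as guides.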
Factor loadings are a matrix of the relationships between the observed variables and the factors you specify. In geometric terms, loadings are the numerical coefficients corresponding to the directional paths that connect common factors to observed variables. They form the basis for interpreting the latent variables. Higher loadings mean that the observed variable is more strongly related to the factor. As a rule of thumb, consider loadings above 0.3.
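As a minimal sketch of reading a loading matrix with the 0.3 rule of thumb, the example below uses base R's factanal() on the built-in ability.cov data. The dataset and the two-factor solution are assumptions made for illustration.

```r
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

# print() for loadings can blank out small values and sort variables by factor
print(fit$loadings, cutoff = 0.3, sort = TRUE)

# The same rule applied by hand: which variables load on which factor?
L <- unclass(fit$loadings)
salient <- abs(L) > 0.3
loads_on <- lapply(seq_len(ncol(L)), function(j) rownames(L)[salient[, j]])
names(loads_on) <- colnames(L)
print(loads_on)
```

Variables that appear under more than one factor are cross-loading, which is allowed in EFA.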
Factors are rotated (literally, in geometric space) to make interpretation easier. There are two types of rotation: orthogonal (perpendicular), in which the factors are not allowed to correlate with one another, and oblique, in which the factors are free to take any position in the factor space and may correlate with one another. Examples of orthogonal rotation include Varimax, Quartimax, and Equamax. Examples of oblique rotation include Oblimin, Promax, and Geomin. For information on choosing a rotation method, see the resources below.
After rotation, the factors are re-arranged to pass optimally through clusters of shared variance so that they can be more easily interpreted. This is comparable to choosing a reference group in regression. Figure 6 illustrates factor rotation using Varimax, but is for conceptual purposes only; rotations take place under the hood of your software.
Figure 6. Example of an orthogonal varimax rotation. Observed variables related to wine characteristics. From Abdi, Hervé. http://www.utdallas.edu/~herve/Abdi-rotations-pretty.pdf
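The difference between the two rotation types can be sketched with base R's varimax() (orthogonal) and promax() (oblique), applied to unrotated loadings from factanal() on the built-in ability.cov data. The dataset and the two-factor solution are assumptions made for this example.

```r
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "none")
L_unrotated <- unclass(fit$loadings)

vm <- varimax(L_unrotated)   # orthogonal rotation
pm <- promax(L_unrotated)    # oblique rotation

L_varimax <- unclass(vm$loadings)
L_promax  <- unclass(pm$loadings)

# Orthogonal rotation moves the axes but leaves each communality unchanged
h2_unrotated <- rowSums(L_unrotated^2)
h2_varimax   <- rowSums(L_varimax^2)
print(all.equal(h2_unrotated, h2_varimax))

# Oblique rotation lets the factors correlate; the factor correlation
# matrix is recovered from the rotation matrix
Phi <- solve(t(pm$rotmat) %*% pm$rotmat)
print(round(Phi, 2))

print(round(cbind(L_varimax, L_promax), 2))
```

A nonzero off-diagonal entry in Phi is the correlation between the two obliquely rotated factors.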
EFA with dichotomous items
A Pearson correlation matrix is not appropriate for categorical or dichotomous items. To perform EFA on such data, you need to build an appropriate correlation matrix, called tetrachoric (for dichotomous items) or polychoric (for other categorical items). A tetrachoric correlation is the Pearson correlation between the assumed underlying continuous variables, inferred from a 2 × 2 table under the assumption of bivariate normality. The polychoric correlation generalizes this to an n × m table.
The idea illustrated in Figure 7 is that dichotomous items reflect underlying continuous constructs. Basically, in building a tetrachoric correlation matrix, you are estimating a model based on the proportions that fall within each area shown in the lower right corner of Figure 7. The computer tries numerous thresholds and combinations.
Figure 7. Representation of the observed dichotomous variables (depressed yes / no) and a continuous latent construct. The lower corner shows how the latter is modeled by the former.
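The logic behind a single tetrachoric correlation can be sketched in base R. The example simulates two correlated latent variables, dichotomizes them, and then finds the latent correlation that reproduces the observed "yes/yes" proportion. The true correlation (0.6), the thresholds, and the helper name p11_model are assumptions for illustration. This is a hand-rolled sketch of the idea, not the routine used by MPlus or by any R package.

```r
set.seed(123)
n <- 5000
true_rho <- 0.6

# Latent continuous variables with correlation true_rho
z1 <- rnorm(n)
z2 <- true_rho * z1 + sqrt(1 - true_rho^2) * rnorm(n)

# Observed dichotomous items: 1 when the latent value exceeds a threshold
item1 <- as.integer(z1 > 0.5)
item2 <- as.integer(z2 > -0.3)

# Pearson correlation of the 0/1 items understates the latent correlation
pearson_r <- cor(item1, item2)

# Thresholds implied by the marginal proportions, assuming normality
t1 <- qnorm(1 - mean(item1))
t2 <- qnorm(1 - mean(item2))
p11_obs <- mean(item1 == 1 & item2 == 1)

# Hypothetical helper: model probability that both latent values exceed
# their thresholds under a bivariate normal with correlation rho
p11_model <- function(rho) {
  integrate(function(x) dnorm(x) * pnorm((rho * x - t2) / sqrt(1 - rho^2)),
            lower = t1, upper = Inf)$value
}

# Tetrachoric estimate: the rho that reproduces the observed proportion
tetrachoric_r <- uniroot(function(rho) p11_model(rho) - p11_obs,
                         interval = c(-0.99, 0.99), tol = 1e-6)$root

print(round(c(pearson = pearson_r, tetrachoric = tetrachoric_r), 3))
```

For a full matrix of items this calculation is repeated for every pair, which is what dedicated software automates.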
As of spring 2013, MPlus is the gold standard for performing EFA on dichotomous items, but it can also be done in R. See the resources below, especially the documentation for the psych package.
Textbooks & Chapters
Exploratory Factor Analysis: An online book manuscript by Ledyard Tucker and Robert MacCallum that provides a comprehensive technical treatment of the factor analysis model, as well as methods for performing exploratory factor analysis.
Methodology (theory and background)
R demo code from the presentation by Prins:
# quick demo of exploratory factor analysis
library(psych)
head(Harman.Holzinger) # 9 × 9 correlation matrix of cognitive ability tests, N = 696
pa <- fa(Harman.Holzinger, 4, fm="pa", rotate="varimax", SMC=FALSE)
print(pa, sort=TRUE)
# prints results; sort = TRUE shows loadings ordered by absolute value.
# u2 is the uniqueness and h2 is the communality. See the values listed in ?fa for getting specific results.
scree(Harman.Holzinger)
# creates a scree plot, a line graph of eigenvalues. It shows the amount of variance
# explained by each factor, and the cut-off is the number of factors just before the bend
# in the scree plot, e.g., about 2 or 3 factors in Figure 5. Eigenvalues and scree plots can
# guide you in determining how many factors best fit your data.
fa.diagram(pa, sort=TRUE, cut=.3, simple=TRUE, error=FALSE, digits=1, e.size=.05, rsize=0.15)
# a familiar-looking diagram of the relationships between factors and observed variables
# code for dichotomous items
# supply the path to your CSV file as the first argument of read.csv()
your.data <- read.csv(, header=TRUE, stringsAsFactors=FALSE)
# note: the geominQ rotation relies on the GPArotation package being installed
your.fa <- fa.poly(your.data, nfactors=3, n.obs=184, n.iter=1, rotate="geominQ", scores="tenBerge", SMC=TRUE, symmetric=TRUE, warnings=TRUE, fm="wls",
alpha=.1, p=.05, oblique.scores=TRUE)
# the main differences here are the rotation (you need to choose an oblique method; geominQ is
# closest to MPlus), the factoring method (weighted least squares, or wls, is closest to MPlus
# but not identical), and scores = "tenBerge".
# newer versions of psych may flag fa.poly as deprecated in favor of fa() with cor = "poly"
# if you want to create the tetrachoric correlation matrix yourself, use the polycor package (function polychor)