Overview
This page briefly describes Ridge Regression and provides an annotated list of resources.
Description
Definition of the Problem
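Recall that in ordinary least squares (OLS), the regression coefficients are estimated in matrix form as:
β̂ = (X'X)⁻¹X'Y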
In this equation, X'X represents the correlation matrix of all predictors; X is a matrix of dimensions n × p, where n = the number of observations and p = the number of predictors in the regression model; Y is the outcome vector of length n; and X' denotes the transpose of X.
So if the inverse of X’X cannot be calculated, the OLS coefficients are indeterminate. In other words, the parameter estimates are very unstable (i.e., they have very high variances) and thus are not interpretable.
What causes (X'X)⁻¹ to be incalculable?
The number of parameters in the model exceeds the number of observations (p > n)
Multicollinearity
Diagnosing multicollinearity
The simplest way to check for multicollinearity is to build a correlation matrix of all the predictors and determine whether correlation coefficients are close to 1. However, this is somewhat subjective and does not provide any information about the severity of the multicollinearity.
Additional methods commonly used to measure multicollinearity include:
1. Check for large condition numbers (CNs). The CN is usually calculated by dividing the maximum eigenvalue of the predictor correlation matrix by the minimum eigenvalue: λmax / λmin. A CN > 5 indicates multicollinearity; a CN > 30 indicates severe multicollinearity.
2. Check for high variance inflation factors (VIFs). As a rule of thumb, a VIF > 10 indicates multicollinearity.
In SAS, VIFs can be requested with the /VIF option on the MODEL statement in PROC REG.
In R, they can be calculated on a regression object with the vif() function; note that this function requires the car or HH package (a brief sketch is given below).
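For illustration, a minimal R sketch of these diagnostics is shown below; the data frame mydata and the variables y, x1, x2, and x3 are hypothetical, and the car package is assumed to be installed:
library(car)                                  # provides the vif() function
fit <- lm(y ~ x1 + x2 + x3, data = mydata)    # ordinary least squares fit
vif(fit)                                      # rule of thumb: VIF > 10 signals multicollinearity
R <- cor(mydata[, c("x1", "x2", "x3")])       # correlation matrix of the predictors
ev <- eigen(R)$values
max(ev) / min(ev)                             # condition number (lambda_max / lambda_min)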
Options for dealing with multicollinearity
There are many ways to address multicollinearity, and each method has its advantages and disadvantages. Common approaches include variable selection, principal component regression, and ridge regression. Variable selection simply involves dropping predictors that are strongly correlated with other predictors in the model. However, this is not always feasible: a variable that contributes to collinearity may be a primary predictor of interest, a potential confounder, or a mediator that must be adjusted for in order to estimate the direct effect of a predictor on the outcome. Fortunately, both principal component regression and ridge regression allow all explanatory variables of interest to be retained, even if they are highly collinear, and the two methods give practically identical results. However, ridge regression preserves the OLS interpretation of the regression parameters, while principal component regression does not. So if the question of interest is the relationship between each of the predictors in the model and the outcome, ridge regression may be more useful than principal component regression. Ridge regression also provides information about which coefficients are most sensitive to multicollinearity.
Ridge Regression
Ridge regression targets the predictor correlation matrix X'X discussed above. Specifically, ridge regression modifies X'X so that its determinant is no longer 0, which ensures that (X'X)⁻¹ can be calculated. Modifying the matrix in this way effectively eliminates the collinearity, leading to more precise, and therefore more interpretable, parameter estimates. But in statistics there is always a tradeoff between variance and bias, so this decrease in variance comes at a price: an increase in bias. However, the bias introduced by ridge regression is almost always toward the null. Ridge regression is therefore considered a shrinkage method, since it typically shrinks the beta coefficients toward 0.
How is X'X modified in ridge regression?
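In matrix form, the ridge estimator of the regression coefficients is:
β̂_ridge = (X'X + kI)⁻¹X'Y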
The above equation should look familiar, as it corresponds to the OLS formula for estimating regression parameters with the exception of the addition of kI to the X’X matrix. In this equation, I represents the identity matrix and k is the ridge parameter. Multiplying k by I and adding that product to X’X is equivalent to adding the value of k to the diagonal elements of X’X.
How does the modification of X’X eliminate multicollinearity?
In the case of multicollinearity, the columns of the correlation matrix are not independent of one another. This is a problem because a matrix with linearly dependent columns has a determinant of 0. Therefore, the dependencies between the columns must be broken so that the inverse of X'X can be calculated. Adding a positive value k to the diagonal elements of X'X breaks any dependency between these columns. It also causes the estimated regression coefficients to shrink toward zero; the larger the value of k, the greater the shrinkage. The intercept is the only coefficient that is not penalized in this way.
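As a small illustration, the following R sketch builds a hypothetical 2 × 2 correlation matrix for two perfectly correlated predictors; the unmodified matrix has determinant 0 and cannot be inverted, but adding k to its diagonal makes it invertible:
XtX <- matrix(c(1, 1, 1, 1), nrow = 2)   # correlation matrix of two perfectly correlated predictors
det(XtX)                                 # 0, so solve(XtX) would fail
k <- 0.1
solve(XtX + k * diag(2))                 # invertible after adding k to the diagonal elements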
Choosing k
Hoerl and Kennard (1970) showed that there is always a value of k > 0 such that the mean squared error (MSE) of the ridge estimates is smaller than the MSE obtained with OLS. However, the ideal value of k cannot be determined exactly, because it ultimately depends on the unknown parameters. Thus, the ideal value of k can only be estimated from the data.
There are many ways to estimate the ideal value of k. However, there is currently no consensus as to which method is best. The traditional method for selecting k is the ridge trace plot introduced by Hoerl and Kennard (1970). This is a graphical means of selecting k. Estimated coefficients and VIFs are plotted against a range of specified values of k.
From this plot, Hoerl and Kennard propose selecting the value of k that:
Stabilizes the system to reflect an orthogonal (i.e., statistically independent) system.
Leads to coefficients with reasonable values
Ensures that coefficients with improper signs at k = 0 have switched to the correct sign
Ensures that the residual sum of squares is not inflated to an unreasonable value
However, these criteria are very subjective, so it is best to use another method in addition to the ridge trace plot. A more objective method is generalized cross-validation (GCV). Cross-validation involves splitting the data into subsets and computing the coefficient estimates for each subset, using the same value of k across subsets. This is then repeated several times with different values of k, and the value of k that minimizes the differences in coefficient estimates across the subsets is selected. However, this is computationally intensive. GCV is simply a weighted version of this procedure, and Golub et al. (1979) showed that the model with the smallest prediction errors can be obtained simply by choosing the value of k that minimizes the GCV equation shown below (note: Golub et al., 1979, refer to k as λ in their paper).
GCV(k) = [(1/n) · ||(I − A(k))Y||²] / [(1/n) · tr(I − A(k))]², with A(k) = X(X'X + kI)⁻¹X'
where n is the number of observations, Y is the outcome vector, I is the identity matrix, tr denotes the trace of a matrix, and A(k) is the matrix that maps Y onto the ridge-fitted values.
However, this does not have to be calculated manually. The value of k that minimizes this equation can be calculated using R.
Ridge regression implementation example
Predictors of interest: inorganic arsenic (InAs), monomethyl arsenic (MMA), and dimethyl arsenic (DMA), all measured in blood and log-transformed
Outcome: glutathione measured in blood (bGSH)
proc reg data=fox;
   model bGSH = ln_bInAs ln_bMMA ln_bDMA sex cig ln_age / vif clb;  * request VIFs and confidence limits for the parameter estimates;
run;
In this case the VIFs are all very close to 10 so it may or may not be acceptable to use OLS. However, notice how wide the confidence intervals are for the parameter estimates. In addition, the parameter estimate for ln_bDMA is quite large. We seldom see such large coefficients in environmental health studies. Therefore, in this situation, ridge regression can be used as a diagnostic tool to determine whether these OLS estimates are reasonable.
Example of a ridge trace plot in SAS:
SAS ridge trace plots have two panels. The upper panel shows the VIF for each predictor as the ridge parameter (k) increases. Each VIF should decrease toward 1 as k increases, since the multicollinearity is being resolved. In this case, we see that the VIFs approach 1 when k is about 0.2.
The lower panel shows the actual values of the ridge coefficients as k increases (SAS automatically standardizes these coefficients for you). At a certain value of k these coefficients should stabilize (again, we see this at values of k > 0.2). Almost all of these parameters shrink toward zero as k increases. Some parameter estimates may change sign. Notice that this is the case in my ridge trace plot for the variable ln_bMMA, shown in red. When k is 0 (the OLS estimate), the association between ln_bMMA and bGSH is positive. However, once k is introduced into the model and the multicollinearity is resolved, the coefficient is actually negative (this sign change occurs at a k value of 0.24). This ridge trace plot therefore suggests that using the OLS estimates could lead to incorrect conclusions about the relationship between this arsenic metabolite (blood MMA) and the outcome, blood glutathione (bGSH).
I made the above diagram using the following SAS code:
proc reg data=fox outvif outest=fox_ridge ridge=0 to 1 by .02;  * ridge coefficients and VIFs for k = 0 to 1 in steps of 0.02 are written to fox_ridge;
   model bGSH = ln_bInAs ln_bMMA ln_bDMA sex cig ln_age;
run;
Notice that fox is the name of my dataset and fox_ridge is the name of a new dataset I am creating, which contains the calculated ridge coefficients for each value of k. You need to supply your model and also the values of k that you want to examine. I examined all values of k between 0 and 1 in increments of 0.02, but note that these are small values of k. Because the VIFs for my predictors were close to 10, the multicollinearity was not severe in this situation, so I did not have to investigate large values of k.
You can also look at a table of all of your ridge coefficients and VIFs for each value of k by using the following statement:
proc print data=fox_ridge;
run;
Instructions for calculating the GCV criterion in R:
1. Install and load the 'MASS' package so that you can use the lm.ridge() function
2. Create a regression object with the lm.ridge() function. For example:
fox_ridge <- lm.ridge(bGSH ~ ln_bInAs + ln_bMMA + ln_bDMA + sex + cig + ln_age, data = fox, lambda = seq(5, 100, 1))
## Notice that I have specified a range of values for k (called lambda in R). GCV tends to choose values of k close to 0, so it is best to narrow the possible range of k values.
3. Find the GCV criterion for each value of k by appending $GCV to your regression object. For example:
fox_ridge$GCV
4. Choose the value of k that gives the smallest GCV criterion (a short sketch follows this list)
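A minimal sketch of this last step, assuming the fox_ridge object created in step 2 and the MASS package loaded:
which.min(fox_ridge$GCV)   # position (and, as its name, the lambda value) of the smallest GCV criterion
select(fox_ridge)          # MASS also prints the GCV-minimizing lambda directly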
NOTE: SAS and R scale things differently. If you use the instructions provided here, which are specific to each program, you will get very similar ridge regression coefficients from each software package. However, SAS and R recommend different k values (because of the different scaling), so you should not use the k value recommended by SAS to calculate ridge coefficients in R, nor the k value recommended by R to calculate ridge coefficients in SAS.
Glossary:
Transpose: The transpose of a matrix A (written A') is simply the matrix A with its rows and columns exchanged. The rows of A become the columns of A', and the columns of A become the rows of A'.
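For example, the transpose of the 2×3 matrix
1 | 2 | 3 |
4 | 5 | 6 |
is the 3×2 matrix
1 | 4 |
2 | 5 |
3 | 6 |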
Eigenvalue: A number (λ) such that, when a matrix (A) is multiplied by some non-zero vector (C), the result equals λ multiplied by C. In other words, λ is an eigenvalue of A if AC = λC for some non-zero vector C.
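For example, if A is the 2×2 matrix with rows (2, 1) and (0, 3), and C = (1, 0)', then AC = (2, 0)' = 2C, so λ = 2 is an eigenvalue of A.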
Identity matrix (also known as the unit matrix): A square n × n matrix with values of 1 on the diagonal and values of 0 in all other cells. The identity matrix essentially serves as the value 1 in matrix operations. Examples of identity matrices are given below:
2×2 identity matrix
1 | 0 |
0 | 1 |
3×3 identity matrix
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
4×4 identity matrix
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
Readings
Textbooks & Chapters
A useful resource for understanding regression in terms of linear algebra:
Appendix B (pp. 841-852), on matrices and their relationship to regression analysis, in: Kleinbaum, Kupper, Nizam and Muller. Applied Regression Analysis and Other Multivariable Methods. Belmont, CA: Thomson, 2008.
Chapter 8 of the following e-book is helpful in understanding the problem of multicollinearity in relation to matrices and how ridge regression solves this problem:
Sections 8.1.5, 8.1.6 of http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/csa/node171.html#SECTION025115000000000000000
For more details ...
Gruber, Marvin H.J. Improving Efficiency by Shrinkage: The James-Stein and Ridge Regression Estimators. New York: Marcel Dekker, Inc, 1998.
Methodological articles
Hoerl AE and Kennard RW (2000). Ridge regression: biased estimation for nonorthogonal problems. Technometrics; 42(1): 80.
Hoerl and Kennard (1968, 1970) wrote the original papers on Ridge Regression. In 2000 they published this more user-friendly and timely paper on the subject.
Selection of k:
Golub G, Heath M, Wahba G (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics; 21(2): 215-223. This is the primary resource for understanding generalized cross-validation for choosing k, but it is fairly technical. A simpler explanation can be found in the resource listed under Web pages.
Draper NR and Van Nostrand RC (1979). Ridge regression and James-Stein estimation: review and comments. Technometrics; 21(4): 451-466. This paper takes a more critical look at ridge regression and describes the pros and cons of some of the different methods for selecting the ridge parameter.
Khalaf G and Shukur G (2005). Choosing ridge parameter for regression problems. Communications in Statistics – Theory and Methods; 34: 1177-1182. This paper gives a nice, brief overview of ridge regression and also presents the results of a simulation comparing ridge regression with OLS and comparing various methods of selecting k.
Commentary on variable selection vs. shrinkage methods:
Greenland S (2008). Invited commentary: variable selection versus shrinkage in the control of multiple confounders. American Journal of Epidemiology; 167(5): 523-529.
Application articles
Holford TR, Zheng T, Mayne ST, et al. (2000). Joint effects of nine polychlorinated biphenyl (PCB) congeners on breast cancer risk. Int J Epidemiol; 29: 975-982.
This paper compares several methods of dealing with multicollinearity, including ridge regression.
Huang D, Guan P, Guo J, et al. (2008). Investigation of the effects of climatic fluctuations on the incidence of bacillary dysentery in northeast China using ridge regression and hierarchical cluster analysis. BMC Infectious Diseases; 8: 130.
This paper uses a combination of ridge regression and hierarchical cluster analysis to examine the influence of correlated climate variables on the incidence of bacillary dysentery.
Web pages
A useful resource for understanding regression in terms of linear algebra:
http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat401/Notes/401-multreg.pdf
Tutorials explaining basic matrix manipulations / concepts of linear algebra:
https://www.khanacademy.org/math/linear-algebra/matrix_transformations
Slides from a ridge regression lecture by Dr. Patrick Breheny (BST 764: Applied Statistical Modeling for Medicine and Public Health) at the University of Kentucky:
http://web.as.uky.edu/statistics/users/pbreheny/764-F11/notes/9-1.pdf
A nice website that explains cross-validation and generalized cross-validation in clearer language than the Golub article:
http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/csa/node123.html
Courses
http://stat.columbia.edu/~cunningham/syllabi/STAT_W4400_2015spring_syllabus.pdf
Tutorials
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_ridge.htm