In this post, the following questions are covered. Let me know if there are other aspects that you wish to see included.
- What is multicollinearity?
- How does it affect models?
- How to measure multicollinearity?
- What are the possible responses to multicollinearity?
What is multicollinearity?
The Pearson correlation coefficient measures the strength of the linear relationship between two numeric variables. Multicollinearity extends this idea to more than two variables: it is present when one feature can be closely approximated by a linear combination of the other features.
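As a quick illustration (a minimal sketch on made-up data; the names x1, x2, x3 are just for this example), the snippet below builds a feature that is almost a linear combination of two others and checks how well the remaining features predict it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 - x2 + rng.normal(scale=0.1, size=n)  # almost a linear combination of x1 and x2

# Pairwise correlations only look at two variables at a time ...
print(np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False).round(2))

# ... multicollinearity asks how well one feature is explained by all the others.
features = np.column_stack([x1, x2])
r2 = LinearRegression().fit(features, x3).score(features, x3)
print(f"R^2 of x3 ~ x1 + x2: {r2:.3f}")
```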
How does it affect models?
The presence of multicollinearity among features affects the coefficients of linear models and their standard errors.
Consider a simple example: X1, X2, X3 used to model and predict y. If the features were mutually independent, i.e. no multicollinearity, then models like y~X1, y~X1,X2 and y~X1,X2,X3 would produce the same coefficients for the shared features: the coefficient of X1 would not change when X2 is added in the second model, and the coefficients of X1 and X2 would not change when X3 is added in the third. On the contrary, if multicollinearity is present, the coefficients change as more features are included. The extent of the change depends on the extent of multicollinearity; the sign of a coefficient can even flip when other features are added. So how does this affect the model? The interpretability of each feature’s impact on the label suffers.
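Here is a small sketch of this effect on synthetic data (the correlation strength and true coefficients are made up for illustration): with X1 and X2 highly correlated, the estimated coefficient of X1 shifts noticeably once X2 enters the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)   # x2 is highly correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)     # true coefficients are 1 and 1

m1 = LinearRegression().fit(x1.reshape(-1, 1), y)
m2 = LinearRegression().fit(np.column_stack([x1, x2]), y)

print("y ~ X1      :", m1.coef_.round(2))   # X1 absorbs X2's effect, coefficient near 1.95
print("y ~ X1 + X2 :", m2.coef_.round(2))   # coefficient of X1 drops back toward 1
```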
Secondly, the standard errors of the coefficients increase. In linear regression, the standard error is an inverse measure of how precise an estimate is, so this widens the margin of error when we set up confidence intervals around model predictions.
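To see the standard errors directly, statsmodels reports them alongside the coefficients (again a sketch on synthetic data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)   # correlated with x1
x3 = rng.normal(size=n)                          # independent of x1 and x2
y = x1 + x2 + x3 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
fit = sm.OLS(y, X).fit()
# bse holds the standard errors; the correlated pair (x1, x2) gets much
# larger standard errors than the independent feature x3.
print(fit.bse.round(3))
```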
So far, we have seen how parametric models are affected. This applies to linear regression, regularized regression (Lasso, Ridge) and logistic regression. Does multicollinearity affect decision tree regressors? Yes, but not in the same way. Multicollinearity amounts to redundant features, which leads to sub-optimal splits and distorted estimates of feature importance. Therefore, it also affects the interpretability of the tree.
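For trees, the effect shows up in how importance gets divided across redundant features. A sketch with a random forest (the exact split of importance will vary with the seed and model settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)
y = 2 * x1 + x3 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2, x3])
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# x1's importance is diluted because splits on x2 work almost as well,
# so neither of the redundant features looks as important as it really is.
print(forest.feature_importances_.round(2))
```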
How to measure multicollinearity?
There are two levels at which this is measured. At the feature level, it is measured using the VIF (Variance Inflation Factor), defined as 1 / (1 - R2), where R2 comes from a model fit to predict that feature using the rest of the features. For example, the VIF of X1 is based on the R2 of the model X1~X2,X3. As R2 increases, the VIF increases, showing that X1 is highly predictable from X2 and X3. A cut-off of 10 is commonly used, above which action is suggested.
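statsmodels provides variance_inflation_factor for this calculation. A sketch on synthetic data (the design matrix includes a constant so each VIF reflects a regression with an intercept):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 500
X = pd.DataFrame({
    "X1": rng.normal(size=n),
    "X2": rng.normal(size=n),
})
X["X3"] = 0.8 * X["X1"] + 0.6 * X["X2"] + rng.normal(scale=0.2, size=n)

exog = sm.add_constant(X)          # add the intercept column
vifs = {col: variance_inflation_factor(exog.values, i)
        for i, col in enumerate(exog.columns) if col != "const"}
print(vifs)                        # features with VIF > 10 are candidates for action
```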
At the model level, the condition number is a measure of overall multicollinearity among the features. It is the ratio of the largest to the smallest eigenvalue of the covariance matrix of the feature set; a condition number above 1000 indicates a need for action.
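Following the definition above, it can be computed from the eigenvalues of the feature covariance matrix (a sketch; note that statsmodels' OLS summary reports a condition number based on singular values of the design matrix instead, so its scale differs):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.1 * x2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3])

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))   # eigenvalues of the covariance matrix
cond = eigvals.max() / eigvals.min()                     # ratio of largest to smallest eigenvalue
print(f"condition number: {cond:.1f}")
```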
What are the possible responses to multicollinearity?
There are several possible approaches. Some are:
A simple remedy is to mean-center or standardize the feature set, then check whether this reduces the VIFs sufficiently.
A second approach is to remove features with a VIF above 10. Arrange the features in descending order of VIF and remove the one with the highest value. Recalculate the VIFs of the retained features, again remove the highest one if it exceeds 10, and repeat until all retained features have a VIF below 10.
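A sketch of this loop (drop_high_vif is a hypothetical helper; it standardizes first, in line with the previous remedy, and then drops the worst offender until every retained feature is below the cut-off):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively remove the feature with the highest VIF above the threshold."""
    X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
    while True:
        exog = sm.add_constant(X)
        vifs = pd.Series(
            [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            return X                     # all retained features are below the cut-off
        X = X.drop(columns=[worst])      # drop the worst offender and recompute
```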
A third approach is to use PCA. The feature set is projected onto its principal components, and those components are used for model building. Because the components are orthogonal, this removes multicollinearity completely. The disadvantage is that we swap the original features for principal components, so the interpretability of the predictors is lost.
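A sketch with scikit-learn (standardizing first so no single feature dominates the components):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

components = PCA().fit_transform(StandardScaler().fit_transform(X))

# Principal components are orthogonal, so their correlation matrix is (near) the identity:
print(np.corrcoef(components, rowvar=False).round(2))
```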
Hope this helps. Let me know if there are any specific aspects of this post you would like to see in more detail, or if there is interest in seeing these demonstrated end to end in Python.
If you like such content, consider subscribing to get alerts when new posts are published.