Why is feature selection important?
By selecting an effective subset of features, you:
- Reduce training time, as the size of the feature vectors is smaller.
- Create simpler, more interpretable models.
- Improve accuracy, especially if the original feature set contains features that are no better than noise, or multiple features that encode the same information.
- Reduce the risk of overfitting. Simpler models with smaller feature sets are less likely to overfit to the training data.
I hope you’re convinced that feature selection is a good idea. While there are many different approaches to feature selection, here we focus on a fairly straightforward one: min-redundancy max-relevance (MRMR).
Why use MRMR?
It improves on relevance-only approaches, such as ranking features by an F-test between each feature and the target. With MRMR, if two features carry similar information, only the more relevant one is ranked as important.
What is MRMR?
The min-redundancy max-relevance algorithm was proposed by Chris Ding and Hanchuan Peng in their paper “Minimum Redundancy Feature Selection from Microarray Gene Expression Data” in 2005.
MRMR is an algorithm that ranks features based on their importance in predicting the target variable, where importance has a relevance and redundancy component.
- Relevance: how well does the feature correlate with the target? For categorical variables, this is based on mutual information; for continuous variables, it is the F-statistic of that feature against the target.
- Redundancy: how well does the feature correlate with the features selected in previous iterations? The idea is that if a candidate feature correlates strongly with an already-selected feature, it won't add much new information to the feature set. For categorical variables, this is based on mutual information. For continuous variables, this can be the Pearson correlation coefficient or the distance between the feature vectors.
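For continuous features, both components are cheap to compute with NumPy and scikit-learn. Here's a minimal sketch on synthetic data (the feature indices and names are just for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Relevance: F-statistic of each feature against the target.
relevance, _ = f_classif(X, y)

# Redundancy: absolute Pearson correlation between a candidate feature (index 4)
# and the already-selected features (indices 0 and 2), averaged.
selected = [0, 2]
candidate = 4
abs_corr = np.abs(np.corrcoef(X, rowvar=False))
redundancy = abs_corr[candidate, selected].mean()

print(relevance, redundancy)
```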
MRMR works iteratively, selecting one feature at a time. At each step, it selects the unselected feature with the maximum score, computed using either a difference (relevance minus redundancy) or a quotient (relevance divided by redundancy) scheme.
The first feature chosen is simply the most relevant one, with no redundancy term: there are no previously selected features for it to be redundant with.
To use MRMR for feature selection, take the top k features of the resulting ranking.
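Putting it all together, a greedy MRMR loop for continuous features might look like the sketch below. This is a simplified illustration of the idea, not a reference implementation; the function name and defaults are my own choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif


def mrmr_rank(X, y, k, scheme="difference"):
    """Greedy MRMR for continuous features: return the indices of the
    top-k features, in the order they were selected."""
    relevance, _ = f_classif(X, y)                    # relevance: F-statistic vs. the target
    abs_corr = np.abs(np.corrcoef(X, rowvar=False))   # pairwise |Pearson correlation| between features

    selected = [int(np.argmax(relevance))]            # first pick: most relevant, no redundancy term
    candidates = set(range(X.shape[1])) - set(selected)

    while len(selected) < k and candidates:
        scores = {}
        for i in candidates:
            redundancy = abs_corr[i, selected].mean() # mean correlation with already-selected features
            if scheme == "difference":
                scores[i] = relevance[i] - redundancy
            else:                                     # "quotient"
                scores[i] = relevance[i] / (redundancy + 1e-12)
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
print(mrmr_rank(X, y, k=5))  # indices of the top-5 features
```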
Toy Dataset Example
As an illustration, I took one of the sklearn toy datasets to run the algorithm on: the wine dataset. This dataset contains measurements for three different types of wines grown in the same region in Italy. It contains 13 numeric features.
I used four different methods to rank the 13 features: F-statistic, quotient-based MRMR, difference-based MRMR, and the feature importance attribute of a random forest model trained on all the data. The results are below. (The code for this comparison can be found here.)
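In rough strokes, the comparison looks like this (reusing the mrmr_rank sketch above; the random forest settings here are just illustrative, see the linked code for the full version):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X, y, names = data.data, data.target, data.feature_names

# F-statistic-only ranking
f_stats, _ = f_classif(X, y)
f_rank = list(np.argsort(f_stats)[::-1])

# MRMR rankings over all 13 features (difference and quotient schemes)
mrmr_diff = mrmr_rank(X, y, k=X.shape[1], scheme="difference")
mrmr_quot = mrmr_rank(X, y, k=X.shape[1], scheme="quotient")

# Random forest feature importances (illustrative hyperparameters)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
rf_rank = list(np.argsort(rf.feature_importances_)[::-1])

for label, ranking in [("F-statistic", f_rank), ("MRMR diff", mrmr_diff),
                       ("MRMR quot", mrmr_quot), ("Random forest", rf_rank)]:
    print(label, [names[i] for i in ranking[:5]])
```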
The three statistics-based approaches all produce rankings similar to the random forest feature importances. I was hoping the MRMR approaches would outperform the F-statistic-only approach, but they perform about the same for this dataset, so 🤷. My guess is that there isn't much correlation between the features, so the relevance component dominates the MRMR score.
Caveats
No algorithm is perfect, and that includes MRMR.
- MRMR only works for supervised (labeled) datasets.
- The features must be either all categorical or all numerical. MRMR is still usable if you have a mix, but some preprocessing is required, such as binning the numerical values (see the sketch below).
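For instance, one way to bin numeric columns so that everything becomes categorical is scikit-learn's KBinsDiscretizer; the column split and data here are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X_numeric = rng.normal(size=(100, 2))               # two numeric features (made-up data)
X_categorical = rng.integers(0, 3, size=(100, 1))   # one already-categorical feature

# Bin the numeric columns into ordinal categories so the whole feature set is categorical.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X_numeric)

X_all_categorical = np.hstack([X_binned, X_categorical])
```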
Hopefully you found this article interesting and informative!
References
[1] C. Ding, H. Peng, Minimum Redundancy Feature Selection from Microarray Gene Expression Data, Journal of Bioinformatics and Computational Biology, Vol. 3, No. 2, pp. 185–205, 2005.