20.2. Feature Selection#
An overabundance of features can harm a model’s predictive ability and interpretability. The more features a model uses, the more training data is required to optimize it, and in the absence of adequate training data the model can overfit and yield sub-optimal performance. Models with a large number of features also tend to produce complex decision boundaries in which the relationship between individual predictors and the outcome is difficult to decipher. To this end, we explore strategies that help gauge the relative importance of the features used to represent a dataset for a machine learning problem and eliminate the redundant ones that can have undesirable effects on the resulting model.
Removing low variance features#
A feature that has little to no variation in the distribution of values it assumes is called a zero or near-zero variance predictor. Predictors with one or only a handful of unique values are uninformative with regard to the dependent variable. Some models, such as tree-based models, can be resistant to the presence of such predictors: if these predictors are never used in any of the splits during the construction of the trees, the resulting decision function remains functionally independent of them. Other models, such as regression models, find them problematic because they estimate parameters that weigh every predictor in the data when making a prediction. Hence such low-variance features should be eliminated during feature selection. Feature selection can be approached through a few different techniques, with the choice depending on what the problem and the model being used demand.
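As an illustration, here is a minimal sketch that uses scikit-learn’s `VarianceThreshold` transformer to drop a zero-variance column from a toy feature matrix; the data and the threshold value are illustrative placeholders.

```python
# A minimal sketch of removing near-zero variance features with VarianceThreshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 2.0, 1],
    [0, 1.5, 0],
    [0, 3.2, 1],
    [0, 2.8, 0],
])  # the first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.0)  # drop features whose variance is 0
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of retained features: [False  True  True]
print(X_reduced.shape)         # (4, 2): the constant column has been removed
```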
Feature Selection Methods#
Feature Selection methods are intended to remove non-informative features from the model’s consideration. Techniques that do not take the outcome of the prediction problem into account when eliminating features are unsupervised in nature; filtering out predictors that are highly correlated with other predictors, or that have sparse and unbalanced distributions such as near-zero variance, is a good example. Supervised methods, in contrast, select predictors based on their impact on prediction accuracy or model complexity.
In general, apart from models with built-in feature selection mechanisms (such as decision trees), feature selection methods fall into two main categories.
Filter Methods#
Filter methods evaluate the importance of features outside of their relevance to a specific predictive model and only move forward with the ones that pass certain statistical criteria. For example, each predictor available in the data could be individually checked for its relationship with the dependent variable, and only predictors that exhibit a strong relationship would be retained and passed on to the model. Filter methods are computationally inexpensive, even though the selection criterion may not be directly related to the effectiveness of the model. Also, since filter methods evaluate predictors separately, important interactions between independent variables are not captured. Let us look into some of these methods now.
Univariate Feature Selection#
These feature selection techniques evaluate the strength of the relationship between individual predictors and the dependent variable based on univariate statistical tests appropriate for their respective variable types. These tests include the correlation test, Analysis of Variance (ANOVA), the chi-squared test, and mutual information.
Pearson’s Correlation Coefficient : A popular feature selection method for regression problems with numerical predictors is to compute the Pearson’s correlation coefficient between each such feature and the target variable, and then select the most relevant features, namely those that exhibit high correlation.
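A minimal sketch of this idea, assuming a small synthetic regression dataset, ranks features by the absolute value of their Pearson correlation with the target using `scipy.stats.pearsonr`:

```python
# A minimal sketch of ranking numerical features by absolute Pearson correlation.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 4))                              # four candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n_samples)   # target depends on features 0 and 2

# Correlate each feature with the target and rank by |r|
correlations = [pearsonr(X[:, j], y)[0] for j in range(X.shape[1])]
ranking = np.argsort(np.abs(correlations))[::-1]

print("correlations:", np.round(correlations, 2))
print("features ranked by |r|:", ranking)   # features 0 and 2 are expected to come first
```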
Chi-squared test : For feature selection over categorical variables, we use Pearson’s chi-squared statistical test of independence. It tests the independence of a categorical feature from the target variable to establish the feature’s usefulness for classification tasks. We start with a contingency table that enumerates all possible values of the two variables whose mutual independence is being tested, and records the frequency of each combination of their values. The following formula is used to calculate the chi-squared statistic between two categorical variables:
\(\chi^{2} = \sum_{i=1}^{k}\sum_{j=1}^{m}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\) where, \(i\) indexes the \(k\) categories of the feature variable
\(j\) indexes the \(m\) classes of the target variable
\(O_{ij}\) denotes the observed frequency in the contingency table
\(E_{ij}\) denotes the expected frequency in the contingency table, under the assumption of independence.
We can calculate the chi-squared statistic of each feature with respect to the target variable and rank the features by this value, with larger chi-squared statistics indicating stronger dependence and hence higher relevance, prior to selecting a subset of these features for model fitting.
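As a sketch, scikit-learn’s `SelectKBest` can be combined with its `chi2` scoring function to perform this ranking; since `chi2` expects non-negative inputs, the categorical columns are one-hot encoded first. The toy data and the choice of `k` below are assumptions for illustration only.

```python
# A minimal sketch of chi-squared feature selection over one-hot encoded categories.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
colors = rng.choice(["red", "green", "blue"], size=300)
sizes = rng.choice(["S", "M", "L"], size=300)
y = (colors == "red").astype(int)   # the target depends on colour only

X_cat = np.column_stack([colors, sizes])
X_encoded = OneHotEncoder().fit_transform(X_cat)   # 6 dummy columns

selector = SelectKBest(score_func=chi2, k=3).fit(X_encoded, y)
print(np.round(selector.scores_, 1))   # chi-squared statistic per encoded column
print(selector.get_support())          # mask of the k retained columns (colour dummies)
```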
Mutual Information : Another technique for feature selection over categorical features is the entropy-based measure of mutual information. Derived from the field of information theory, the mutual information between two variables measures the reduction in uncertainty about one variable when the value of the other is known. Uncertainty is mathematically represented through the measure of entropy. The entropy of a random variable is the average level of “information”, or uncertainty, inherent to the possible outcomes of the variable. The entropy of a target variable \(Y\) assuming \(c\) different values is given by the following formula:
\(H(Y) = \sum_{i=1}^{c}-p_{i}\log_{2}p_{i}\) ; where \(p_{i}\) is the probability of the target variable \(Y\) assuming value \(i\)
In addition, we need the conditional entropy \(H(Y|X)\) of the target variable \(Y\) given the value of a categorical feature \(X\). Finally, the information gain over the target variable \(Y\) given the feature variable \(X\) can be calculated as \(IG(Y,X) = H(Y) - H(Y|X)\).
Mutual information is a measure of dependence: a higher value of \(IG(Y,X)\) implies that knowing the feature \(X\) helps in predicting the target variable \(Y\). Hence, after calculating the mutual information of the target variable with respect to every feature in the dataset, we can select the features that exhibit higher mutual information, indicating a strong dependence of the target variable on them and making them the most appropriate for the predictive task at hand.
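A minimal sketch of this measure, assuming a small synthetic dataset of ordinally encoded categorical features, uses scikit-learn’s `mutual_info_classif` to score each feature against the target:

```python
# A minimal sketch of mutual-information scoring for categorical features.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
weather = rng.choice(["sunny", "rainy", "cloudy"], size=500)
day = rng.choice(["weekday", "weekend"], size=500)
y = (weather == "rainy").astype(int)   # the target depends on the weather only

X = OrdinalEncoder().fit_transform(np.column_stack([weather, day]))

# discrete_features=True tells the estimator to treat the columns as categorical
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(np.round(scores, 3))   # the weather column should carry much higher mutual information
```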
ANOVA : Another statistical test that can be used for the selection of categorical predictors with respect to a numerical response variable is ANOVA, which stands for Analysis of Variance. A one-way ANOVA tests whether the mean of the response variable is the same across the groups defined by the categories of the feature. If the group means do not differ, the feature in question has no impact on the response variable and therefore need not be considered in training. We perform the following steps to carry out a one-way ANOVA between a categorical feature and a response variable (a sketch follows the steps below):

Define the null and the alternate hypothesis, with the null hypothesis asserting that all group means are equal.
Calculate the sum of squares between and within groups to capture inter-group and intra-group variations.
Calculate the degrees of freedom for the between-group and within-group sums of squares.
Calculate the F-value to compare the variance between the groups and the variance within them.
Finally, the F-value (and its associated p-value) indicates whether or not to reject the null hypothesis. Should the null hypothesis be rejected, the mean of the response differs across the groups of the categorical feature, which in turn indicates that the corresponding feature impacts the response variable.
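The sketch below, using `scipy.stats.f_oneway` on synthetic data, illustrates the test for one hypothetical categorical feature with three levels; the group locations and the 0.05 cut-off are illustrative assumptions.

```python
# A minimal sketch of a one-way ANOVA between a categorical feature and a numerical response.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Response values grouped by the three levels of a hypothetical categorical feature.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=12.0, scale=2.0, size=50)
group_c = rng.normal(loc=15.0, scale=2.0, size=50)

f_value, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_value:.2f}, p = {p_value:.4f}")

# A small p-value (e.g. below 0.05) rejects the null hypothesis of equal group means,
# suggesting this categorical feature influences the response and is worth keeping.
```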
Wrapper Methods#
Wrapper methods are a family of feature selection techniques that employ a search algorithm to explore possible combinations of features and evaluate each feature subset based on the performance of a model trained on it. The feature set that yields the best model is deemed the most relevant to the problem. There are trade-offs to using wrapper methods for feature selection. On the plus side, these techniques are not model agnostic and can therefore tailor the feature selection process to the model of choice for a predictive problem; wrapper methods can also better capture interactions between features and find the optimal choice of feature subset. On the downside, wrapper methods involve multiple rounds of training to choose the optimal set of features, which in turn can cause higher computational expense, dependence on the model and the data and, in consequence, overfitting. Here we look into two popular wrapper methods:
Forward Selection : In classical forward selection, the predictors are evaluated and added one at a time, using a statistical hypothesis test, subject to passing some pre-defined significance threshold. For example, in regression models we can conduct stepwise forward selection by using p-values to test the statistical significance of the additional variation in the outcome explained by the model when a candidate feature is included. Such a hierarchical regression process with forward selection would proceed as follows (a sketch of the procedure is given after the steps):
At first, the model begins with only the intercept term.
For each feature under consideration, we train a model that includes that feature and summarize the statistical significance of the additional variation it explains with a p-value.
Next, we consider all features whose corresponding model produced a p-value lower than a pre-defined threshold. If no feature passes this criterion, we exit the search procedure. Otherwise, among the features with acceptable p-values, we select the one that yielded the lowest p-value; this is the feature we include and the model we move forward with in the search.
We repeat the previous steps with the remaining features until we can no longer find a feature whose addition to the model produces an acceptable p-value.
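Below is a minimal sketch of this p-value-driven procedure using ordinary least squares from statsmodels; the synthetic data, the column names and the 0.05 significance threshold are assumptions for illustration.

```python
# A minimal sketch of p-value-driven forward selection for a linear regression.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{j}" for j in range(5)])
y = 2.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=200)

selected, remaining, threshold = [], list(X.columns), 0.05
while remaining:
    # p-value of each candidate feature when added to the current model
    pvals = {}
    for col in remaining:
        model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
        pvals[col] = model.pvalues[col]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= threshold:      # no remaining feature is significant: stop
        break
    selected.append(best)
    remaining.remove(best)

print("selected features:", selected)   # x1 and x3 are expected to be picked first
```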
Note that there are a few issues with this process:
The forward selection search procedure takes a greedy approach, progressively adding the new feature that produces the highest statistical significance given the current state of the model. However, these are locally optimal choices that do not guarantee arriving at the globally optimal solution, since past choices are never re-evaluated down the line.
During forward selection, the objective being optimized is statistical significance rather than the performance metrics that are evaluated during actual training and testing.
The model trained after feature selection can be overfit, since the computation of parameter estimates and subsequent performance metrics does not take the selection process into account.
Recursive Feature Elimination : Classical forward selection involves refitting several models at each step of the search for the ideal feature subset, and the overall procedure is costly. To avoid this, an alternative is a backward selection algorithm called recursive feature elimination. Here we begin by building a model on the entire set of predictors and computing an importance score for every predictor. The least important predictors are then removed, the model is refit with the remaining features, and the importance scores are recomputed. In practice, the number of features to be retained is pre-determined, and this number itself becomes a tuning parameter for recursive feature elimination; the feature subset size is tuned by optimizing the performance metric. The whole process is captured in Algorithm 20.1.
Algorithm 20.1 (Backward Selection with Recursive Feature Elimination)
Inputs Training Data and model of choice
Output Optimal subset of features \(S_{i}\) after feature selection
Train the model of choice on training data using all available features \(P\)
Calculate the trained model’s performance
Rank features as per some pre-defined feature importance metric
for each subset size \(i\), where \(i = 1 \ldots |P|\) do
Keep the \(i\) most important features \(\rightarrow S_{i}\)
Train the model on the training set using these \(S_{i}\) features
Calculate and record the model performance
end
Examine the performance profile over all iterations of \(S_{i}\)
Determine the appropriate number of predictors : the feature set \(S_{i}\) producing the best performance
Fit and return the final model based on the optimal feature set \(S_{i}\)
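A minimal sketch of recursive feature elimination is shown below, using scikit-learn’s `RFECV` to tune the subset size by cross-validation around a logistic regression; the synthetic classification data and the scoring choice are illustrative assumptions, and `RFECV` is one library implementation in the spirit of Algorithm 20.1 rather than a literal transcription of it.

```python
# A minimal sketch of recursive feature elimination with cross-validated subset-size tuning.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

estimator = LogisticRegression(max_iter=1000)
selector = RFECV(estimator, step=1, cv=5, scoring="accuracy").fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
print("feature ranking (1 = selected):", selector.ranking_)
```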