
20. Feature Engineering and Feature Selection

Satadisha Saha Bhowmick

Statistical models have become ubiquitous in our efforts to understand and predict a variety of phenomena in modern society. A model is built from existing data by learning a mathematical representation that generalizes over and explains the observed outcomes, with a view to drawing inferences or making predictions about future, unobserved data.

The variables that go into a model are called predictors, features, or independent variables, depending on context. The quantity being modelled is the response, or dependent variable. Features represent the observed examples given to the model and are key to its success. Through a process called model fitting, different sets of predictors generalize the predictive task at hand to varying degrees of effectiveness, depending on their association with the outcome variable. Features that have no relationship with the outcome are redundant, and the resulting data representation is irrelevant for the purpose of modelling.
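As a concrete illustration of this vocabulary, the short sketch below fits a linear model to a synthetic dataset; the use of scikit-learn, the array shapes, and the coefficients are assumptions made purely for illustration, not material from this chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # predictors / features / independent variables
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)  # response / dependent variable

model = LinearRegression().fit(X, y)  # model fitting: learn the X -> y mapping
print(model.coef_)  # each coefficient reflects a feature's association with the response
```

Here the third column of X has no relationship with y, so its fitted coefficient is near zero: a redundant feature in the sense described above.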

The observation that the examples fed to a model can be represented in different ways, each affecting its performance, leads to the notion of Feature Engineering. It is a process that uses domain knowledge to transform the raw data collected for a predictive problem into new features, or variables, that better represent the underlying patterns in the data and lead to more effective models. Often, applying the right transformations to the original data also yields simpler models and more interpretable features. Typically, such transformations are unsupervised.
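As a sketch of what such unsupervised transformations might look like in practice, the snippet below applies a log transform, a date decomposition, and one-hot encoding with pandas; the column names and values are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [32_000, 61_500, 148_000, 27_400],
    "signup_date": pd.to_datetime(["2021-01-05", "2021-06-20",
                                   "2022-03-11", "2022-11-02"]),
    "city": ["Boston", "Austin", "Boston", "Denver"],
})

# Log transform: compress a right-skewed variable into a more symmetric one.
df["log_income"] = np.log1p(df["income"])

# Decompose a timestamp into components a model can use directly.
df["signup_month"] = df["signup_date"].dt.month
df["signup_dow"] = df["signup_date"].dt.dayofweek

# One-hot encode a categorical variable into binary indicator columns.
df = pd.get_dummies(df, columns=["city"], prefix="city")
print(df.columns.tolist())
```

None of these transformations consult an outcome variable, which is what makes them unsupervised: they reshape the representation of the examples, not the relationship to a target.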

Although many feature transformations are possible for any given dataset, it is important to select them carefully. In machine learning, a large number of features requires an adequately large number of data points for the resulting model to be properly fitted. Models trained on an overabundance of features with insufficient data (relative to the number of predictors) often overfit, leading to suboptimal performance. This is where feature selection becomes necessary. Feature Selection is the process of choosing the most predictive features from a large feature space in order to make both the training process and the resulting model more efficient. In supervised approaches to feature selection, different feature combinations are evaluated in conjunction with a trained model, whose performance in predicting the target variable measures their impact. Unsupervised approaches, by contrast, aim to identify similarities among the predictors themselves without using labeled data.
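As a rough sketch of both styles, the snippet below pairs a supervised selector (recursive feature elimination) with an unsupervised filter (a variance threshold) from scikit-learn on a synthetic task; the dataset, threshold, and other hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Supervised: recursive feature elimination repeatedly refits a model
# and drops the weakest predictors, judged by their fitted coefficients,
# so the selection is driven by performance on the target variable.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("RFE kept features:", rfe.support_.nonzero()[0])

# Unsupervised: filter on the predictors alone, ignoring the labels y.
# A constant column is appended purely to give the filter something to drop.
X_aug = np.hstack([X, np.full((500, 1), 3.0)])
vt = VarianceThreshold(threshold=0.1).fit(X_aug)
print("Variance filter kept:", vt.get_support().nonzero()[0])
```

The supervised selector needs y to score candidate subsets, while the variance filter inspects only the spread of each predictor, dropping the near-constant column regardless of any outcome.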