20.1. Feature Engineering#
In this chapter, we focus on the data preparation step and examine various methodologies for extracting features from data, which are essential for building any intelligent learning system. Raw data can be either structured or unstructured, and feature engineering strategies vary depending on the type and format of the data to which they are applied.
Features in a Machine Learning Pipeline#
An end-to-end machine learning pipeline starts with a data retrieval process that gathers the raw data to be ingested by the model during the subsequent learning task. This is followed by a data preparation process in which different techniques are tried to engineer meaningful features that the model of choice can utilize during training. The trained model is then deployed for prediction or classification on unseen data, which must undergo the same transformations applied during data preparation before being fed to the model at test time.
Feature Engineering Techniques in Structured Data#
Structured data is standardized, clearly defined in format, and easy to organize, search and analyze. The values stored in structured data can be either numeric or categorical. We will now look into specific feature engineering techniques for each of these data types.
Feature Engineering on Numeric Data#
Numeric or quantitative data consist of scalar values that record measurements or observations, often in prespecified units. Raw numeric data can be fed directly into most models, but depending on the problem and domain they can often be transformed into better features. In this section, we will look into a few strategies we can leverage for feature engineering on numeric data, using the datasets at our disposal to demonstrate each technique.
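The code cells that follow assume a handful of standard imports; this is a minimal setup sketch, assuming pandas, NumPy and scikit-learn are installed in the environment.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, StandardScaler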
diabetes_df = pd.read_csv("../../data/diabetes.csv")
diabetes_df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
Normalization: Feature scaling is an important issue to address when feeding numeric data to models. To train a model and enhance its predictive capacity, features should preferably lie within a similar range rather than on vastly disparate scales. Min-max normalization is a common way of feature scaling in which all values are mapped to the range [0, 1] via \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\). The transformation does not change the shape of the feature’s underlying distribution, but it is sensitive to outliers, since extreme values determine the minimum and maximum and therefore the resulting scale.
column = 'Glucose'
diabetes_df['Glucose_normalized'] = MinMaxScaler().fit_transform(np.array(diabetes_df[column]).reshape(-1,1))
diabetes_df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Glucose_normalized |
---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 0.743719 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 0.427136 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 0.919598 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 0.447236 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 0.688442 |
Standardization: Another form of feature scaling is standardization, or z-score normalization, which takes into account the standard deviation of the feature distribution. To standardize a feature column, the mean is subtracted from every data point and the result is divided by the standard deviation of the feature distribution: \(z = \frac{x - \mu}{\sigma}\). The transformed feature has zero mean and unit variance. Since standardization does not confine the transformed values to a specific range, it is less sensitive to outliers than min-max scaling, although extreme values still influence the estimated mean and standard deviation. The resulting z-scores are most meaningful when the underlying feature distribution is roughly Gaussian, which may not always be the case.
column = 'BMI'
diabetes_df['BMI_standardized'] = StandardScaler().fit_transform(np.array(diabetes_df[column]).reshape(-1,1))
diabetes_df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Glucose_normalized | BMI_standardized |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 0.743719 | 0.204013 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 0.427136 | -0.684422 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 0.919598 | -1.103255 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 0.447236 | -0.494043 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 0.688442 | 1.409746 |
Binarization: Often features represent raw counts or frequencies whose exact values matter less to the problem at hand than whether a certain phenomenon occurred at all. In such cases, binarizing a numeric feature sidesteps the scaling issues addressed by the previous techniques by transforming the original feature into an indicator variable.
column = 'Pregnancies'
was_pregnant = np.array(diabetes_df[column])  # copy of the raw pregnancy counts
was_pregnant[was_pregnant >= 1] = 1           # any non-zero count becomes 1
diabetes_df['was_pregnant'] = was_pregnant
diabetes_df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Glucose_normalized | BMI_standardized | was_pregnant |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 0.743719 | 0.204013 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 0.427136 | -0.684422 | 1 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 0.919598 | -1.103255 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 0.447236 | -0.494043 | 1 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 0.688442 | 1.409746 | 0 |
Rounding: When dealing with continuous numeric attributes, the model often does not require scalar values to be maintained at high precision. In such cases it makes sense to round off high-precision floats.
diabetes_df['rounded_DiabetesPedigreeFunction'] = diabetes_df['DiabetesPedigreeFunction'].round(2)
diabetes_df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Glucose_normalized | BMI_standardized | was_pregnant | rounded_DiabetesPedigreeFunction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 0.743719 | 0.204013 | 1 | 0.63 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 0.427136 | -0.684422 | 1 | 0.35 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 0.919598 | -1.103255 | 1 | 0.67 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 0.447236 | -0.494043 | 1 | 0.17 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 0.688442 | 1.409746 | 0 | 2.29 |
Custom Features: Domain knowledge can often help aggregate multiple raw features into new custom features that capture context more directly relevant to the predictive task at hand.
housing_df = pd.read_csv("../../data/Housing.csv")
housing_df['total_size'] = housing_df['floor_size']+housing_df['garage_size']
housing_df['price_per_area_unit'] = (housing_df['sold_price']/housing_df['total_size']).round(2)
housing_df.head()
|   | floor_size | bed_room_count | built_year | sold_date | sold_price | room_count | garage_size | parking_lot | total_size | price_per_area_unit |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2068 | 3 | 2003 | Aug2015 | 195500 | 6 | 768 | 3 | 2836 | 68.94 |
1 | 3372 | 3 | 1999 | Dec2015 | 385000 | 6 | 480 | 2 | 3852 | 99.95 |
2 | 3130 | 3 | 1999 | Jan2017 | 188000 | 7 | 400 | 2 | 3530 | 53.26 |
3 | 3991 | 3 | 1999 | Nov2014 | 375000 | 8 | 400 | 2 | 4391 | 85.40 |
4 | 1450 | 2 | 1999 | Jan2015 | 136000 | 7 | 200 | 1 | 1650 | 82.42 |
Polynomial Transformations: Polynomial expansions of continuous-valued features are common transformations for obtaining higher-order features that can be linearly combined in the eventual optimization function. For example, for a continuous predictor \(x\), an order-\(p\) polynomial expansion yields the additional features \(x^2, x^3, \ldots, x^p\), which a linear model can then combine as:
\(f(x) = \sum_{i=1}^p \beta_{i}x^{i}\), where \(p\) is a hyperparameter that can be selected during tuning.
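A minimal sketch of such an expansion using scikit-learn’s PolynomialFeatures, applied here to the floor_size column of housing_df with \(p=3\); both the column and the degree are illustrative choices.

from sklearn.preprocessing import PolynomialFeatures

# Degree-3 expansion of a single continuous column: yields floor_size,
# floor_size^2 and floor_size^3 (include_bias=False drops the constant term).
poly = PolynomialFeatures(degree=3, include_bias=False)
expanded = poly.fit_transform(housing_df[['floor_size']])
pd.DataFrame(expanded, columns=poly.get_feature_names_out()).head()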
Trigonometric Transformations: Sometimes features found in datasets are cyclical in nature. Time-series data such as wind or tidal measurements typically contain cyclical variables whose values repeat periodically. It is vital for such features to be transformed into a representation in which the model can exploit their cyclical nature to improve its predictive capability, and trigonometric transformations are commonly used for this. A feature variable \(t\) can be converted into a pair of cyclical features:
\(x = \sin\left(\frac{2\pi t}{\max(t)}\right)\) and \(y = \cos\left(\frac{2\pi t}{\max(t)}\right)\)
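As a sketch, the sold_date column of housing_df can be turned into cyclical month features. This assumes the dates follow the 'MonYYYY' pattern seen in the preview above (e.g. 'Aug2015'), with \(\max(t) = 12\) for months.

# Parse the month out of strings like "Aug2015" and encode it cyclically,
# so that December (12) and January (1) end up close together.
months = pd.to_datetime(housing_df['sold_date'], format='%b%Y').dt.month
cyclical = pd.DataFrame({
    'sold_month_sin': np.sin(2 * np.pi * months / 12),
    'sold_month_cos': np.cos(2 * np.pi * months / 12),
})
cyclical.head()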
Logarithmic Transformations: Log transforms are applied to features with skewed distributions in order to reduce the skew. Taking the logarithm of the values in a feature column compresses its range before the transformed feature is fed to the model. Note, however, that logarithmic transformations do not work on features with non-positive values; a common workaround for features containing zeros is to use \(\log(1+x)\) (e.g. np.log1p).
print(housing_df['sold_price'].max(), housing_df['sold_price'].min())
housing_df['sold_price_log'] = np.log(housing_df['sold_price'])
housing_df.head()
550000 87000
|   | floor_size | bed_room_count | built_year | sold_date | sold_price | room_count | garage_size | parking_lot | total_size | price_per_area_unit | sold_price_log |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2068 | 3 | 2003 | Aug2015 | 195500 | 6 | 768 | 3 | 2836 | 68.94 | 12.183316 |
1 | 3372 | 3 | 1999 | Dec2015 | 385000 | 6 | 480 | 2 | 3852 | 99.95 | 12.860999 |
2 | 3130 | 3 | 1999 | Jan2017 | 188000 | 7 | 400 | 2 | 3530 | 53.26 | 12.144197 |
3 | 3991 | 3 | 1999 | Nov2014 | 375000 | 8 | 400 | 2 | 4391 | 85.40 | 12.834681 |
4 | 1450 | 2 | 1999 | Jan2015 | 136000 | 7 | 200 | 1 | 1650 | 82.42 | 11.820410 |
Feature Engineering on Categorical Data#
Categorical or nominal predictors are those that contain qualitative data, for example education level, state of residence, or even zip code, which, although numeric in appearance, qualifies as categorical data. Categorical variables can hold either ordered or unordered data, depending on whether the values can be organized according to some inherent ordering among them. For example, if we look into the student scores dataset, the feature ‘parental level of education’ shows a clear ordering among its categorical values; hence this feature consists of ordinal data.
student_scores_df = pd.read_csv("../../data/student_scores_data.csv")
student_scores_df.head(15)
|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
---|---|---|---|---|---|---|---|---|
0 | female | group D | some college | standard | completed | 59 | 70 | 78 |
1 | male | group D | associate's degree | standard | none | 96 | 93 | 87 |
2 | female | group D | some college | free/reduced | none | 57 | 76 | 77 |
3 | male | group B | some college | free/reduced | none | 70 | 70 | 63 |
4 | female | group D | associate's degree | standard | none | 83 | 85 | 86 |
5 | male | group C | some high school | standard | none | 68 | 57 | 54 |
6 | female | group E | associate's degree | standard | none | 82 | 83 | 80 |
7 | female | group B | some high school | standard | none | 46 | 61 | 58 |
8 | male | group C | some high school | standard | none | 80 | 75 | 73 |
9 | female | group C | bachelor's degree | standard | completed | 57 | 69 | 77 |
10 | male | group B | some high school | standard | none | 74 | 69 | 69 |
11 | male | group B | master's degree | standard | none | 53 | 50 | 49 |
12 | male | group B | bachelor's degree | free/reduced | none | 76 | 74 | 76 |
13 | male | group A | some college | standard | none | 70 | 73 | 70 |
14 | male | group C | master's degree | free/reduced | none | 55 | 54 | 52 |
On the other hand, ‘gender’ is a categorical feature with values that do not have any natural ordering within them. Ordered and unordered features require different preprocessing approaches for the underlying information to be fed into a model.
Although tree-based models are capable of handling raw categorical data, the majority of models require numeric predictors as input. Hence, in this section, we will look into a few strategies for engineering model-friendly features from categorical data.
One-hot Encoding: The simplest way to handle categorical data is to create a vector of indicator (dummy) variables, one for each category. These variables are artificially added to the feature set to capture the presence of each possible value of a categorical feature. To illustrate this, consider the categorical feature ‘race/ethnicity’ in the student scores dataset: we enumerate its possible values and convert them into binary dummy variables. It is also acceptable to create these dummy variables for all but one of the values; the value left out can be directly inferred from the states of the other variables, so including it would introduce multicollinearity. Even though this encoding strategy increases the dimensionality of the data, unlike some of the techniques we will examine later, it does not impose an ordering that does not exist among the categories.
set(student_scores_df['race/ethnicity'].values)
{'group A', 'group B', 'group C', 'group D', 'group E'}
def encode_and_bind(original_dataframe, feature_to_encode):
    # function to generate one-hot encoded features from categorical values
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
feature = 'race/ethnicity'
encoded_df = encode_and_bind(student_scores_df, feature)
encoded_df.head(15)
|   | gender | parental level of education | lunch | test preparation course | math score | reading score | writing score | race/ethnicity_group A | race/ethnicity_group B | race/ethnicity_group C | race/ethnicity_group D | race/ethnicity_group E |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | female | some college | standard | completed | 59 | 70 | 78 | False | False | False | True | False |
1 | male | associate's degree | standard | none | 96 | 93 | 87 | False | False | False | True | False |
2 | female | some college | free/reduced | none | 57 | 76 | 77 | False | False | False | True | False |
3 | male | some college | free/reduced | none | 70 | 70 | 63 | False | True | False | False | False |
4 | female | associate's degree | standard | none | 83 | 85 | 86 | False | False | False | True | False |
5 | male | some high school | standard | none | 68 | 57 | 54 | False | False | True | False | False |
6 | female | associate's degree | standard | none | 82 | 83 | 80 | False | False | False | False | True |
7 | female | some high school | standard | none | 46 | 61 | 58 | False | True | False | False | False |
8 | male | some high school | standard | none | 80 | 75 | 73 | False | False | True | False | False |
9 | female | bachelor's degree | standard | completed | 57 | 69 | 77 | False | False | True | False | False |
10 | male | some high school | standard | none | 74 | 69 | 69 | False | True | False | False | False |
11 | male | master's degree | standard | none | 53 | 50 | 49 | False | True | False | False | False |
12 | male | bachelor's degree | free/reduced | none | 76 | 74 | 76 | False | True | False | False | False |
13 | male | some college | standard | none | 70 | 73 | 70 | True | False | False | False | False |
14 | male | master's degree | free/reduced | none | 55 | 54 | 52 | False | False | True | False | False |
An obvious drawback of one-hot encoding arises when the set of possible values for a categorical feature becomes very large. For example, encoding a categorical feature like ZIP code for the United States could involve up to 41K values, and applying the one-hot encoding strategy would lead to an overabundance of dummy variables relative to the number of datapoints available for effective generalization. Moreover, because population is unevenly distributed across locations, certain zip codes appear much more frequently than others, producing a long-tailed distribution in which many categories occur only rarely in the collected data.
An issue with such a long-tailed feature distribution is that resampling might altogether exclude some infrequent categories from the analysis. This leads to dummy-variable columns of all zeros for those categories, which can cause numerical errors in many models and also renders them unable to produce accurate predictions for test samples that do contain these categories. Feature columns with a single value are called zero-variance predictors and do not provide a meaningful representation for the predictive task at hand. While we can create the full set of indicator variables and filter out those showing near-zero variance, such columns cannot be known a priori. As an alternative, infrequent categories can be pooled together into an “Other” category, as sketched below. Another way to combine categories is to use a hash function and group them into a reduced set of hash buckets.
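A minimal sketch of both ideas, using the ‘race/ethnicity’ column of the student scores data as a stand-in for a genuinely high-cardinality feature; the 1% rarity threshold and the bucket count are arbitrary choices for illustration.

import hashlib

feature = 'race/ethnicity'

# Pool categories that cover less than 1% of the rows into an "Other" bucket
# (with only five fairly balanced groups here, nothing may actually be pooled).
category_freqs = student_scores_df[feature].value_counts(normalize=True)
rare_categories = category_freqs[category_freqs < 0.01].index
pooled = student_scores_df[feature].replace(list(rare_categories), 'Other')

# Alternatively, hash each category string into a fixed number of buckets;
# a stable hash (unlike Python's built-in hash) keeps the mapping reproducible.
n_buckets = 8
hashed = student_scores_df[feature].apply(
    lambda v: int(hashlib.md5(v.encode()).hexdigest(), 16) % n_buckets)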
Label Encoding: As an alternative to one-hot encoding, label encoding does not add any additional feature columns to the data; it simply maps each unique category to an integer. Such a numerical mapping, however, introduces an ordering among the transformed values that might not exist among the original categories.
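A minimal sketch using scikit-learn’s LabelEncoder on the ‘lunch’ column (the column choice is illustrative; scikit-learn intends LabelEncoder for target labels, and an OrdinalEncoder with an arbitrary category order behaves similarly for input features).

from sklearn.preprocessing import LabelEncoder

# Map each distinct category to an integer; the integers imply an arbitrary
# ordering that does not exist among the original categories.
lunch_labels = LabelEncoder().fit_transform(student_scores_df['lunch'])
pd.DataFrame({'lunch': student_scores_df['lunch'],
              'lunch_label': lunch_labels}).head()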
Ordinal Encoding: Ordered categorical values do exist, however. For example, ‘parental level of education’ has categories that can be ranked by the degree of education the students’ parents completed. When categories have a natural ordering among them, a numerical mapping of categories to values that preserves that ordering makes sense and can also improve the underlying predictive task. Such an encoding is called an ordinal encoding. As with label encoding, the dimensionality of the data is not increased by this transformation.
# distinct categories present in the data
parental_education_levels = set(student_scores_df["parental level of education"])
# explicit ordering of the categories from least to most education
parental_education_levels_categories = ['some high school', 'high school', 'some college', "associate's degree", "bachelor's degree", "master's degree"]
encoder = OrdinalEncoder(categories=[parental_education_levels_categories])
student_scores_df['parental_education_levels'] = encoder.fit_transform(student_scores_df[["parental level of education"]])
student_scores_df.head(15)
|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | parental_education_levels |
---|---|---|---|---|---|---|---|---|---|
0 | female | group D | some college | standard | completed | 59 | 70 | 78 | 2.0 |
1 | male | group D | associate's degree | standard | none | 96 | 93 | 87 | 3.0 |
2 | female | group D | some college | free/reduced | none | 57 | 76 | 77 | 2.0 |
3 | male | group B | some college | free/reduced | none | 70 | 70 | 63 | 2.0 |
4 | female | group D | associate's degree | standard | none | 83 | 85 | 86 | 3.0 |
5 | male | group C | some high school | standard | none | 68 | 57 | 54 | 0.0 |
6 | female | group E | associate's degree | standard | none | 82 | 83 | 80 | 3.0 |
7 | female | group B | some high school | standard | none | 46 | 61 | 58 | 0.0 |
8 | male | group C | some high school | standard | none | 80 | 75 | 73 | 0.0 |
9 | female | group C | bachelor's degree | standard | completed | 57 | 69 | 77 | 4.0 |
10 | male | group B | some high school | standard | none | 74 | 69 | 69 | 0.0 |
11 | male | group B | master's degree | standard | none | 53 | 50 | 49 | 5.0 |
12 | male | group B | bachelor's degree | free/reduced | none | 76 | 74 | 76 | 4.0 |
13 | male | group A | some college | standard | none | 70 | 73 | 70 | 2.0 |
14 | male | group C | master's degree | free/reduced | none | 55 | 54 | 52 | 5.0 |
Feature Engineering on Unstructured Text Data#
Data practitioners often have to deal with data containing textual fields, or unstructured text data, for certain learning tasks. Such data can be gathered from questionnaires, articles, reviews, tweets or large-scale text corpora (for example, a collection of Shakespearean sonnets). For these datasets, the words or phrases populating the open text fields act as predictors for the machine learning task at hand, so we need a process that transforms their presence or absence into a numerical representation of the text. This technique is referred to as Text Vectorization. Prior to vectorization, data practitioners conduct a handful of text pre-processing and cleaning steps: normalizing the text (case folding and special-character removal) followed by tokenization. Pre-processing also typically includes running the corpus through a stemming or lemmatization function, which reduces the inflected forms in which words or tokens appear in the text to their root forms, as well as removing stopwords that often introduce noise into text analysis.
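A minimal sketch of such a pre-processing pipeline, assuming the NLTK library is available for its stopword list and Porter stemmer; the regular expression and whitespace tokenizer are simplifying choices.

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)  # one-time download of the stopword list

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Case-fold, strip special characters, tokenize, remove stopwords and stem."""
    text = re.sub(r'[^a-z\s]', ' ', text.lower())  # normalization
    tokens = text.split()                          # simple whitespace tokenization
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]

preprocess("The cats were running quickly across the gardens!")
# e.g. ['cat', 'run', 'quickli', 'across', 'garden']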
Text vectorization begins with setting up a vocabulary \(V\) that comprises all the distinct words encountered in the text corpus. Next we explore strategies that convert text data into \(|V|\)-dimensional vectors of binary or real-valued features.
One-hot encoding#
The simplest vectorization technique is to treat the words in the corpus vocabulary as categorical features and to associate an indicator variable with each word in the feature vector. However, one-hot encoding can only signify the presence or absence of certain words in a text. In many text applications, the frequency of words plays an important role in measuring their relative importance within the corpus, as well as to the predictive task at hand, which is why we often prefer the alternative strategies described next.
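As a sketch, scikit-learn’s CountVectorizer with binary=True produces such presence/absence vectors; the three-sentence corpus below is a toy example and is reused in the following cells.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat killed curiosity",
    "curiosity killed the cat",
    "the dog chased the cat",
]

# binary=True records only whether each vocabulary word occurs in a document
onehot_vectorizer = CountVectorizer(binary=True)
onehot_vectors = onehot_vectorizer.fit_transform(corpus)
print(onehot_vectorizer.get_feature_names_out())
print(onehot_vectors.toarray())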
Bag of Words representation#
This is a popular representation of text data, frequently utilized in the field of Information Retrieval. The bag-of-words (BOW) representation converts a text document into a flat, \(|V|\)-dimensional real-valued vector of word counts or frequencies. While we can encode the relative importance of words within a corpus through frequency features, this representation treats text as an unordered collection of tokens. Since the ordering of words in text conveys both meaning and context, the bag-of-words representation cannot encode semantic information.
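A minimal sketch reusing the toy corpus from the previous cell; each document now becomes a vector of raw word counts rather than binary indicators.

from sklearn.feature_extraction.text import CountVectorizer

# Each document in the toy corpus becomes a |V|-dimensional vector of counts.
bow_vectorizer = CountVectorizer()
bow_vectors = bow_vectorizer.fit_transform(corpus)
print(bow_vectorizer.get_feature_names_out())
print(bow_vectors.toarray())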
TF-IDF model#
This method is an improvement over the previously described BOW model, which simply records word counts in feature vectors. The TF-IDF statistic considers two different kinds of frequencies:
Term Frequency (tf) - For a word \(w\) and document \(D\) in the corpus, \(tf(w,D)\) represents the frequency or raw count of the word in the document.
Inverse Document Frequency (idf) - This frequency signals the informativeness of a term in the context of the whole corpus. The intuition is that, much like stopwords, a word that appears in most or all documents of a corpus carries little information about the content of any individual document, whereas rare words are more interesting because they provide distinctive information. Hence, \(idf\) applies a log transform to the inverse of a word’s document frequency: if the word \(w\) appears in \(df(w)\) documents of a corpus of \(N\) documents, then \(idf(w,D) = \log(\frac{N}{df(w)})\).
The combined \(tfidf\) statistic is calculated as the product of the above two frequencies:
\(tfidf(w,D) = tf(w,D)\times idf(w,D)\)
In the TF-IDF model, each document in the corpus is represented by a feature vector containing the \(tfidf\) score of every vocabulary word for that document. These scores are normalized to values between 0 and 1, and the resulting document vectors can be fed directly into a learning algorithm for the downstream prediction task.
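A minimal sketch reusing the toy corpus; note that scikit-learn’s TfidfVectorizer uses a smoothed idf variant and L2-normalizes each document vector by default, so its scores differ slightly from the textbook formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

# Each document becomes a vector of normalized tf-idf scores.
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_vectors.toarray().round(3))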
N-gram representation#
As mentioned earlier, treating texts as unordered collections of words only yields lexical features that do not capture meaning or context. For example, consider the following two sentences:
The cat killed curiosity.
Curiosity killed the cat.
These two sentences carry opposite meanings, yet they have exactly the same bag-of-words representation. A modification of the bag-of-words (BOW) representation that addresses this deficit is the \(n\)-gram model. Individual words are called unigrams, while a sequence of \(n\) consecutive words within a document is called an \(n\)-gram. Instead of creating a vocabulary of unigrams, this representation creates a vocabulary of all distinct \(n\)-grams within the corpus and then recalculates the TF-IDF statistics described above over \(n\)-grams for each document. Unlike unigrams, \(n\)-grams retain the ordering of words within these short phrases and are therefore a better representation for capturing semantic information in text.
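A minimal sketch with the two example sentences, counting unigrams and bigrams together (TfidfVectorizer accepts the same ngram_range argument); the bigram columns now distinguish the two sentences, which share identical unigram counts.

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) builds the vocabulary from both unigrams and bigrams.
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_vectors = ngram_vectorizer.fit_transform([
    "the cat killed curiosity",
    "curiosity killed the cat",
])
print(ngram_vectorizer.get_feature_names_out())
print(ngram_vectors.toarray())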