INTRODUCTION TO FEATURE ENGINEERING
Feature engineering is the process of preparing and transforming input data so that it can be used for training Machine Learning models.
From the feature engineering point of view, there are the following types of features:
- numerical
- categorical
- date and time
- mixed.
The term mixed features can cause some ambiguity. An example of a mixed variable value is the seat on a plane ticket: 17B. Here 17 is the row number, and B identifies the seat's position in the row. Therefore, we can derive two new features from it: seat row number and seat position in the row.
It is important to mention that we must split the input set of observations into a training set and a test (validation) set to minimize the chances of overfitting. During the training phase, the chosen algorithm must not be aware of the existence of the observations in the test set.
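As a minimal sketch, the split described above can be done with Scikit-learn's train_test_split; the feature matrix and target below are synthetic placeholders, and the 30% test fraction is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 observations, 2 features (placeholder data)
y = np.array([0, 1] * 5)          # binary target (placeholder data)

# Hold out 30% of the observations; the model never sees X_test during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```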
Features have the following characteristics
- missing data
- rare labels for categorical features
- outliers for numeric features
Rare labels are those categorical feature values that appear only in a small subset of observations.
Categorical features with high cardinality (i.e., with a significant number of distinct values) and rare labels can introduce noise and thus make it harder to train a model that includes them successfully.
A similar note applies to both categorical features with rare labels and numerical features with outliers: they introduce noise and lead to a poorly performing trained model.
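As a sketch of how rare labels can be detected with pandas, the snippet below flags categories whose relative frequency falls under a threshold; the 5% cutoff and the "Rare" grouping category are illustrative assumptions, not rules from the text.

```python
import pandas as pd

# Placeholder categorical column: "C" and "D" appear in very few observations.
s = pd.Series(["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2, name="city")

freq = s.value_counts(normalize=True)    # relative frequency of each label
rare = freq[freq < 0.05].index.tolist()  # labels under the (assumed) 5% threshold
print(rare)  # ['C', 'D']

# One common fix: group all rare labels into a single "Rare" category.
s_grouped = s.where(~s.isin(rare), "Rare")
print(s_grouped.value_counts())
```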
HANDLING MISSING VALUES
The first step in feature engineering is dealing with missing data. There are two main strategies for dealing with it:
- removing observations with missing data
- missing data imputation.
The first strategy is inefficient in many cases because it may lead to the removal of a substantial portion of the recorded observations.
Regarding the second strategy, there are several ways to fill missing data values:
- mean or median imputation
- arbitrary value imputation
- end of distribution imputation
- frequent category imputation
- random sample imputation
- adding a missing indicator feature per feature with missing values.
All the imputation methods mentioned above distort the original distributions to some degree. However, missing data imputation is necessary because most machine learning algorithms cannot deal with missing values in the features they process.
We can implement all the imputation methods mentioned previously with the Scikit-learn Python package. An alternative to it is the feature-engine package, built with feature engineering in mind. It can be installed in Python with the “pip install feature-engine” command. Feature-engine is a Python library with multiple transformers to engineer features for use in machine learning models. Feature-engine’s transformers follow Scikit-learn conventions, with fit() and transform() methods that first learn the transformation parameters from the data and then transform the data.
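As a minimal sketch of the fit/transform pattern described above, the snippet below performs mean imputation with Scikit-learn's SimpleImputer (feature-engine offers equivalent transformers with the same interface); the tiny input column is a placeholder.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a missing value (placeholder data).
X = np.array([[1.0], [2.0], [np.nan], [3.0]])

imputer = SimpleImputer(strategy="mean")  # also: "median", "most_frequent", "constant"
imputer.fit(X)                            # learns the column mean from the data
X_imputed = imputer.transform(X)          # fills NaN with the learned mean (2.0)
print(X_imputed.ravel())  # [1. 2. 2. 3.]
```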
After missing values imputation, the next step in feature engineering is categorical feature encoding.
CATEGORICAL FEATURE ENCODING
The most popular method for categorical feature encoding is one-hot encoding. The downside of this method is that it adds a massive number of new sparse features, mostly filled with zeros. A variation of this method is one-hot encoding of the top categories, where we consider only a subset of the most frequent categories, plus one category covering the rest of the categorical values.
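A minimal sketch of one-hot encoding with pandas; the column and category names are illustrative placeholders.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column; note how even three categories
# already triple the number of columns for this feature.
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))  # ['color_blue', 'color_green', 'color_red']
```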
Another method for encoding categorical features is ordinal encoding, also known as label or integer encoding. The idea is to assign an integer value to each category. In the basic scenario of this method, we assign integers arbitrarily, and help from experts with domain knowledge is valuable. Other scenarios are count or frequency encoding and target-guided ordinal encoding, also known as ordered categorical encoding. The idea of the latter is to create a monotonic relationship between the integer values assigned to the categories and the target value.
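The target-guided variant can be sketched with plain pandas as below: categories are ranked by the mean of the target and then mapped to integers, which produces the monotonic relationship mentioned above. The column names and data are illustrative placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "C", "C"],
    "target": [1, 1, 0, 1, 0, 0],
})

# Mean target per category: A=1.0, B=0.5, C=0.0 -> rank C < B < A.
order = df.groupby("city")["target"].mean().sort_values().index
mapping = {category: code for code, category in enumerate(order)}

df["city_encoded"] = df["city"].map(mapping)
print(mapping)  # {'C': 0, 'B': 1, 'A': 2}
```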
Other methods for encoding categorical features are
- mean encoding
- probability ratio encoding
- rare label encoding
- binary encoding.
After encoding categorical features, the next step is to perform mathematical feature transformation.
MATHEMATICAL FEATURE TRANSFORMATIONS
Ideally, features are distributed normally, i.e., follow the Gaussian curve. This helps achieve the best prediction accuracy with the trained model.
In practice, however, this is often not the case: the feature distribution is skewed at best and does not even resemble a normal distribution at worst. Therefore, we need a mathematical transformation to bring it as close as possible to a normal distribution.
The most popular method for mathematical feature transformation is Box-Cox, but there are other transformations: logarithmic, exponential, reciprocal, and Yeo-Johnson. We can implement all the mentioned transformations using the NumPy, SciPy, Scikit-learn, and feature-engine packages.
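A minimal sketch of a Box-Cox transformation with SciPy, applied to a synthetic right-skewed sample (Box-Cox requires strictly positive values); the sample size and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # heavily right-skewed, positive

# boxcox fits the lambda parameter by maximum likelihood and returns
# both the transformed sample and the fitted lambda.
x_bc, lam = stats.boxcox(x)

# The skewness of the transformed sample drops toward 0 (closer to normal).
print(stats.skew(x), stats.skew(x_bc))
```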
NUMERICAL FEATURES DISCRETIZATION
The next step in the feature engineering process is the discretization of numeric features. We can skip this step if we decide to stick with numerical values, for example with the regression class of algorithms. Discretization is the process of transforming numeric feature values into a set of contiguous intervals that span the range of the feature values. There are several types of discretization approaches:
- equal-width discretization
- equal-frequency discretization
- K-means discretization.
After performing discretization, the result is a categorical feature that we can encode in the ways mentioned above. We can also implement discretization in a more sophisticated way using classification trees. In addition, domain experts can provide valuable help with discretization. We can implement the discretization methods using both the Scikit-learn and feature-engine packages.
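The three approaches listed above map directly onto the strategies of Scikit-learn's KBinsDiscretizer, as sketched below on a placeholder column; the number of bins is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]])  # placeholder data

# "uniform" = equal-width, "quantile" = equal-frequency, "kmeans" = K-means bins.
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    print(strategy, disc.fit_transform(X).ravel())
```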
The next step in feature engineering is outlier handling. The intention is to handle outliers to minimize the effect of unwanted noise that may lead to poorly trained models. We can manage outliers using the following approaches:
- outlier trimming
- outlier capping with Interquartile Range (IQR)
- outlier capping with mean and std
- outlier capping with quantiles
- and arbitrary capping.
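The IQR-based capping approach from the list above can be sketched with pandas as follows; the 1.5 multiplier is the conventional fence factor, and the data is a placeholder.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the IQR "fences"

# Values beyond the fences are capped rather than removed.
capped = s.clip(lower=lower, upper=upper)
print(upper, capped.max())
```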
The next step in feature engineering is feature scaling. The idea is to scale feature values so that they are equally weighted by the learning algorithm. Normally, when the values of a numeric feature are large, its learned weights are small, and vice versa. After feature scaling, the learning process treats all features equally. There are several types of scaling:
- standardization
- mean normalization
- scaling to a minimum and maximum values
- maximum absolute scaling
- scaling to median and quantiles
- robust scaling
- scaling to vector unit length
Of all these types of scaling, standardization and mean normalization are the most popular.
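Two of the scaling types above can be sketched with Scikit-learn as follows: standardization with StandardScaler, and scaling to minimum and maximum values with MinMaxScaler. The single-column input is a placeholder.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # placeholder data

standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmaxed = MinMaxScaler().fit_transform(X)        # rescaled into [0, 1]

print(standardized.ravel())  # zero-centered values
print(minmaxed.ravel())      # approx. [0, 0.33, 0.67, 1]
```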
MIXED FEATURE ENGINEERING
We can engineer mixed features by extending the dataset with more features (numeric or categorical), depending on how many parts the values of such features have. After that, we treat the newly extracted features as described above.
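As a sketch, the plane-seat example from the introduction can be split into a numeric row number and a categorical seat position with pandas string methods; the column names and regular expressions are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"seat": ["17B", "3A", "22F"]})  # placeholder mixed feature

# Extract the numeric part (row) and the letter part (position in the row).
df["seat_row"] = df["seat"].str.extract(r"(\d+)", expand=False).astype(int)
df["seat_position"] = df["seat"].str.extract(r"([A-Z]+)", expand=False)
print(df)
```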
DATE AND TIME FEATURE ENGINEERING
The last piece in the puzzle of feature engineering is dealing with date and time features. Since they come in a mixed text format, for example “10:15:25 PM 30.1.2020 +02:00”, engineering such features means extracting more features from them, such as
- year
- month (as a number)
- day
- day of the week
- hour, minute, and second.
One necessary step is to normalize the time values, i.e., to first present them with the +00:00 time offset.
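A minimal sketch of this with pandas: the timestamp is first converted to the +00:00 (UTC) offset, and the individual parts are then read off via the .dt accessor. The example timestamp mirrors the one from the text.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2020-01-30 22:15:25+02:00"]))
ts_utc = ts.dt.tz_convert("UTC")  # normalize to the +00:00 offset first

features = pd.DataFrame({
    "year":        ts_utc.dt.year,
    "month":       ts_utc.dt.month,
    "day":         ts_utc.dt.day,
    "day_of_week": ts_utc.dt.dayofweek,  # Monday = 0
    "hour":        ts_utc.dt.hour,       # 22:15 +02:00 becomes 20:15 UTC
    "minute":      ts_utc.dt.minute,
})
print(features.iloc[0].to_dict())
```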
CONCLUSION FOR FEATURE ENGINEERING
That should be all for feature engineering.
https://www.udemy.com/course/feature-engineering-for-machine-learning/ extensively covers all the topics discussed above, and we highly recommend it to everyone interested in the implementation of a full Machine Learning pipeline.
WHAT COMES NEXT IN MACHINE LEARNING PIPELINE
The next step in the Machine Learning pipeline is feature extraction, and we will cover this topic in a follow-up post.