Feature Engineering in Machine Learning

May 13, 2020

INTRODUCTION TO FEATURE ENGINEERING

Feature engineering is the process of preparing and transforming input data so that it can be used to train Machine Learning models.

From the feature engineering point of view, there are the following types of features:

numerical
categorical
date and time
mixed.

The term “mixed features” can be ambiguous. An example of a mixed variable value is the seat on a plane ticket: 17B. Here 17 is the row number, and B identifies the position of the seat within the row. Therefore, we can introduce two new features: row number and seat position in the row.

It is important to split the input set of observations into a training set and a test (validation) set so that we can detect overfitting. During the training phase, the chosen algorithm must not be aware of the existence of the observations in the test set. The same applies to the feature engineering steps: any parameters they learn must come from the training set only.
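As a minimal sketch of such a split, assuming scikit-learn is available, we can hold out part of the observations with train_test_split. The toy arrays, the 30% test share, and the random seed are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 observations, 2 features, binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 30% of the observations; they are never shown to the model during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

Every later step that learns something from the data (imputation values, encodings, scaling parameters) would then be fitted on X_train only and merely applied to X_test.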

FEATURE CHARACTERISTICS

Features have the following characteristics:

missing data
cardinality
rare labels for categorical features
distribution
magnitude
outliers for numeric features

Rare labels are those categorical feature values that appear only in a small subset of observations.

Categorical features with high cardinality (i.e., with a significant number of distinct values) and rare labels can introduce noise and thus make it harder to train a model that includes them successfully.

The same applies to numerical features with outliers as to categorical features with rare labels: both introduce noise and can lead to a poorly performing model.
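To make cardinality and rare labels concrete, here is a small sketch with pandas. The city data and the 5% rarity threshold are assumptions chosen for illustration, not fixed rules.

```python
import pandas as pd

# Made-up categorical feature for illustration (100 observations).
city = pd.Series(
    ["London"] * 50 + ["Paris"] * 45 + ["Oslo"] * 3 + ["Riga"] * 2, name="city"
)

# Cardinality: the number of distinct values of the feature.
cardinality = city.nunique()

# Share of observations per label; labels below the assumed 5% threshold count as rare.
frequencies = city.value_counts(normalize=True)
rare_labels = sorted(frequencies[frequencies < 0.05].index)

print(cardinality)   # 4
print(rare_labels)   # ['Oslo', 'Riga']
```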

HANDLING MISSING VALUES

The first step in feature engineering is dealing with missing data. There are two main strategies for dealing with it:

removing observations with missing data
missing data imputation.

The first strategy is inefficient in many cases because it may lead to the removal of a substantial portion of the recorded observations.

Regarding the second strategy, there are several ways to fill missing data values:

mean or median imputation
arbitrary value imputation
end of distribution imputation
frequent category imputation
random sample imputation
adding a missing indicator feature per feature with missing values.

All of these imputation methods distort the original distributions to some degree. However, missing data imputation is necessary because most machine learning algorithms cannot deal with missing values in the features they process.

Many of the imputation methods mentioned above are available in the scikit-learn Python package, and the rest can be built on top of it. An alternative is feature-engine, a Python library with multiple transformers for engineering features for use in machine learning models. It can be installed with the “pip install feature-engine” command. Feature-engine’s transformers follow the scikit-learn convention of fit() and transform() methods: fit() learns the transformation parameters from the data, and transform() then applies them to the data.
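The fit() and transform() pattern can be sketched with scikit-learn's SimpleImputer, here combining median imputation with a missing indicator. The toy arrays are made up for illustration; note that the median is learned from the training set and only applied to the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test sets with one feature; np.nan marks a missing value.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])
X_test = np.array([[np.nan], [3.0]])

# add_indicator=True appends a binary "was missing" column for each feature
# that contained missing values during fit().
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)                        # learns the median (3.0) from the training set only
X_test_imputed = imputer.transform(X_test)  # fills the gap and adds the indicator column

print(imputer.statistics_)   # [3.]
print(X_test_imputed)        # first column: imputed feature, second column: missing indicator
```

The median is used here rather than the mean because the value 100.0 would pull the mean far away from the bulk of the data.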

After imputing missing values, the next step in feature engineering is categorical feature encoding.

CATEGORICAL FEATURE ENCODING

The most popular method for categorical feature encoding is one-hot encoding. The downside of this method is that it adds a massive number of new sparse features, mostly filled with zeros. A variation of this method is one-hot encoding of top categories, where we consider only a subset of the most frequent categories, plus one category covering the rest of the categorical values.
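One-hot encoding of top categories might look like the following pandas sketch. The color data, the cut-off of two top categories, and the name of the “other” bucket are assumptions for illustration.

```python
import pandas as pd

# Made-up categorical feature for illustration.
colors = pd.Series(
    ["red", "red", "red", "blue", "blue", "green", "teal", "pink"], name="color"
)

top_n = 2  # assumed cut-off: keep only the two most frequent categories
top_categories = colors.value_counts().nlargest(top_n).index

# Every value outside the top categories collapses into a single "other" bucket.
reduced = colors.where(colors.isin(top_categories), "other")

# One binary column per remaining category.
encoded = pd.get_dummies(reduced, prefix="color", dtype=int)
print(encoded.columns.tolist())   # ['color_blue', 'color_other', 'color_red']
```

Instead of five sparse columns, we end up with three, at the cost of no longer distinguishing between the infrequent colors.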

Another method for encoding categorical features is ordinal encoding, also known as label or integer encoding. The idea is to assign an integer value to each category. In the basic scenario, we assign the integers arbitrarily, although help from experts with domain knowledge is valuable. Other scenarios are count or frequency encoding and target-guided ordinal encoding, also known as ordered categorical encoding. The idea of the last one is to create a monotonic relationship between the integer values of the categories and the target value.
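A minimal sketch of count encoding and target-guided ordinal encoding with pandas follows. The data frame and its column names are made up for illustration; in practice, both mappings would be learned from the training set only.

```python
import pandas as pd

# Made-up data for illustration: one categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "target": [1, 1, 0, 1, 0, 0],
})

# Count encoding: replace each category with the number of times it occurs.
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)

# Target-guided ordinal encoding: order the categories by their mean target value
# and number them 0, 1, 2, ... so that the integers rise monotonically with the target.
target_means = df.groupby("city")["target"].mean().sort_values()
ordinal_map = {category: rank for rank, category in enumerate(target_means.index)}
df["city_ordinal"] = df["city"].map(ordinal_map)

print(ordinal_map)   # {'C': 0, 'B': 1, 'A': 2}
```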

Other methods for encoding categorical features are

mean encoding
probability ratio encoding
rare label encoding
binary encoding.
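Two of the methods listed above, rare label encoding and mean encoding, can be sketched with pandas as follows. The data, the 20% frequency threshold, and the “Rare” label are assumptions for illustration.

```python
import pandas as pd

# Made-up data for illustration: one categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A"] * 4 + ["B"] * 4 + ["C", "D"],
    "target": [1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
})

# Rare label encoding: labels seen in fewer than 20% of the observations
# (an assumed threshold) are grouped under a single "Rare" label.
frequencies = df["city"].value_counts(normalize=True)
frequent = frequencies[frequencies >= 0.2].index
df["city_grouped"] = df["city"].where(df["city"].isin(frequent), "Rare")

# Mean encoding: replace each category with the mean target value of that category.
means = df.groupby("city_grouped")["target"].mean()
df["city_mean_encoded"] = df["city_grouped"].map(means)

print(means.to_dict())   # {'A': 0.75, 'B': 0.25, 'Rare': 0.5}
```

Grouping rare labels first means that the target mean of each remaining category is computed from more than a handful of observations.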

After encoding categorical features, the next step is to perform mathematical feature transformation.

MATHEMATICAL FEATURE TRANSFORMATIONS

Ideally, features are normally distributed, i.e. they follow a Gaussian curve. Some models, linear ones in particular, tend to predict more accurately when this holds.

In practice, this is not always the case: the feature distribution is skewed at best and does not resemble a normal distribution at all at worst. Therefore, we need a mathematical transformation to bring it as close as possible to a normal distribution.

The most popular method for mathematical feature transformation is Box-Cox, but there are other transformations: logarithmic, exponential, reciprocal, and Yeo-Johnson. We can achieve all the mentioned transformations using the NumPy, SciPy, scikit-learn, and feature-engine packages.
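As an illustration, the sketch below applies the logarithmic, Box-Cox, and Yeo-Johnson transformations with NumPy and SciPy to a right-skewed sample and compares the skewness before and after. The log-normal sample and the random seed are assumptions chosen to make the effect visible.

```python
import numpy as np
from scipy import stats

# Made-up, right-skewed and strictly positive sample for illustration.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_x = np.log(x)                              # logarithmic transformation
boxcox_x, boxcox_lambda = stats.boxcox(x)      # Box-Cox: requires strictly positive values
yeojohnson_x, yj_lambda = stats.yeojohnson(x)  # Yeo-Johnson: also accepts zero and negative values

# Skewness close to 0 indicates a roughly symmetric distribution.
for name, values in [
    ("original", x),
    ("log", log_x),
    ("Box-Cox", boxcox_x),
    ("Yeo-Johnson", yeojohnson_x),
]:
    print(f"{name:12s} skewness = {stats.skew(values):.2f}")
```

Both SciPy functions estimate the transformation parameter lambda from the data when it is not supplied, so in a real pipeline that parameter would be estimated on the training set and reused on the test set.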

NUMERICAL FEATURES DISCRETIZATION

What comes next in the feature engineering process is the discretization of numeric features. We can skip this step if we decide to keep the numerical values, for example when using regression algorithms. Discretization is the process of transforming numeric feature values into a set of contiguous intervals that span the range of the feature values. There are several types of discretization approaches:

equal-width discretization
equal-frequency discretization
K-means discretization.

After discretization, the result is a categorical feature that we can encode in the ways mentioned above. A more sophisticated approach is to discretize with classification trees. Domain experts can also supply valuable help in choosing the intervals. We can implement the discretization methods with both the scikit-learn and feature-engine packages.
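The difference between equal-width and equal-frequency discretization shows up clearly on skewed data. The sketch below uses pandas (pd.cut and pd.qcut) rather than the packages named above, and the values and the choice of two bins are made up for illustration.

```python
import pandas as pd

# Made-up, skewed numeric feature for illustration.
values = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 200, 400])

# Equal-width: the range [10, 400] is cut into 2 intervals of the same width.
equal_width = pd.cut(values, bins=2, labels=False)

# Equal-frequency: interval edges are quantiles, so each interval holds the same share of rows.
equal_freq = pd.qcut(values, q=2, labels=False)

print(equal_width.value_counts().sort_index().tolist())   # [9, 1]
print(equal_freq.value_counts().sort_index().tolist())    # [5, 5]
```

With equal-width intervals, the two large values stretch the range so that almost every observation lands in the first interval; equal-frequency intervals avoid this.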

OUTLIER HANDLING

The next step in feature engineering is outlier handling. The intention is to minimize the effect of unwanted noise that may lead to poorly trained models. We can manage outliers using the following approaches:

outlier trimming
outlier capping with Interquartile Range (IQR)
outlier capping with mean and std
outlier capping with quantiles
arbitrary capping.
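As a sketch of the first two approaches, the example below computes IQR-based limits with NumPy and then either caps or trims the values that fall outside them. The sample values are made up, and the factor 1.5 is a common convention that we assume here, not a fixed rule.

```python
import numpy as np

# Made-up numeric feature with one obvious outlier (95.0).
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 95.0])

# Interquartile range and the limits derived from it (assumed factor: 1.5).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: values beyond the limits are replaced by the limits.
capped = np.clip(values, lower, upper)

# Trimming: observations beyond the limits are removed instead.
trimmed = values[(values >= lower) & (values <= upper)]

print(lower, upper)
print(capped)
print(len(trimmed))
```

Capping keeps the number of observations unchanged, while trimming discards the observation with the outlier altogether.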

FEATURE SCALING

The next step in feature engineering is feature scaling. The idea is to bring feature values onto comparable scales so that the learning algorithm weights them equally. Without scaling, features with large values tend to receive small learned weights, and vice versa. After feature scaling, the learning process treats the features on an equal footing. There are several types of scaling:

standardization
mean normalization
scaling to minimum and maximum values
maximum absolute scaling
scaling to median and quantiles
robust scaling
scaling to vector unit length
etc

Of all these types of scaling, standardization and mean normalization are the most popular ones.
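A short sketch of standardization, (x - mean) / std, and scaling to minimum and maximum values, (x - min) / (max - min), with scikit-learn follows. The arrays are made up for illustration; the scalers learn their parameters from the training set and apply them unchanged to the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features on very different scales.
X_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0], [4.0, 4000.0]])
X_test = np.array([[2.5, 2500.0]])

# fit() learns the mean and standard deviation, or the minimum and maximum, per feature.
standardizer = StandardScaler().fit(X_train)
min_max = MinMaxScaler().fit(X_train)

X_train_std = standardizer.transform(X_train)   # each column: mean 0, standard deviation 1
X_test_minmax = min_max.transform(X_test)       # each column mapped relative to the training range

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_minmax)   # [[0.5 0.5]]
```

After scaling, both features contribute values of the same order of magnitude, regardless of their original units.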

MIXED FEATURE ENGINEERING

We engineer mixed features by splitting each of them into several new features (numeric or categorical), depending on how many parts the values of such a feature have. After that, we treat the newly extracted features like any other features and process them as described above.
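For the plane seat example, the split could be sketched with a regular expression. The helper name split_seat and the assumed seat format (digits followed by a single letter) are hypothetical and chosen for illustration.

```python
import re
from typing import Tuple


def split_seat(seat: str) -> Tuple[int, str]:
    """Split a seat label such as '17B' into (row number, seat position).

    Hypothetical helper: assumes the format is digits followed by one letter.
    """
    match = re.fullmatch(r"(\d+)([A-Za-z])", seat.strip())
    if match is None:
        raise ValueError(f"Unexpected seat format: {seat!r}")
    return int(match.group(1)), match.group(2).upper()


row_number, seat_position = split_seat("17B")
print(row_number, seat_position)   # 17 B
```

The row number is then a numeric feature and the seat position a categorical one, and each is processed with the steps for its type.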

DATE AND TIME FEATURE ENGINEERING

The last piece in the puzzle of feature engineering is dealing with date and time features. Since they come in a mixed text format, for example “10:15:25 PM 30.1.2020 +02:00”, engineering such features means extracting more features, such as:

hour
minute
second
day-in-month
day-in-week
is-weekend
month (as number)
quarter
year

One necessary step is “normalizing” the time values, i.e. first converting them to the +00:00 time offset (UTC).
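The extraction can be sketched with Python's standard datetime module, using the example value from above. The format string is an assumption that matches this particular text layout, and the feature names are chosen for illustration.

```python
from datetime import datetime, timezone

raw = "10:15:25 PM 30.1.2020 +02:00"

# Parse the text (assumed layout: 12-hour time, day.month.year, UTC offset),
# then normalize the value to the +00:00 offset before extracting features.
local = datetime.strptime(raw, "%I:%M:%S %p %d.%m.%Y %z")
utc = local.astimezone(timezone.utc)

features = {
    "hour": utc.hour,
    "minute": utc.minute,
    "second": utc.second,
    "day_in_month": utc.day,
    "day_in_week": utc.weekday(),          # Monday = 0, ..., Sunday = 6
    "is_weekend": utc.weekday() >= 5,
    "month": utc.month,
    "quarter": (utc.month - 1) // 3 + 1,
    "year": utc.year,
}
print(features)
```

Because of the normalization, 10:15 PM at +02:00 becomes hour 20 in UTC; without it, values recorded in different time zones would not be comparable.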

CONCLUSION FOR FEATURE ENGINEERING

That should be all for feature engineering.

The course at https://www.udemy.com/course/feature-engineering-for-machine-learning/ extensively covers all the topics discussed above, and we highly recommend it to anyone interested in implementing a full Machine Learning pipeline.

WHAT COMES NEXT IN MACHINE LEARNING PIPELINE

The next step in the Machine Learning pipeline is feature extraction, and we will cover this topic in a follow-up post.