## Feature Selection Introduction

Feature selection is the process of choosing a subset of the features available in the observation data in order to build the best possible Machine Learning model. Well-implemented feature selection leads to faster training and inference, and often to better-performing trained models.

Other benefits of proper feature selection include easier interpretation of the results, less overfitting, and less feature redundancy.

## Feature Selection General Methods

There are three main types of feature selection methods:

- Filter methods
- Wrapper methods
- Embedded methods

## Filter Methods

Filter methods select features independently of the chosen Machine Learning training algorithm, so they are model agnostic. They rely only on the characteristics of the data in the features themselves.

### Removing irrelevant features

The first step is to remove irrelevant features. These are constant or quasi-constant features: they take the same value for all, or almost all, observations, so they carry little useful information and do not help distinguish one input record from another.
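A minimal sketch of this step, assuming scikit-learn is available. The toy matrix and the 0.1 variance cut-off are illustrative assumptions, not values prescribed by this post.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is constant, column 1 is quasi-constant
# (a single 1.0 among zeros), column 2 varies freely.
X = np.array([
    [1.0, 0.0, 3.1],
    [1.0, 0.0, 0.4],
    [1.0, 0.0, 2.2],
    [1.0, 0.0, 5.9],
    [1.0, 0.0, 1.7],
    [1.0, 1.0, 4.3],
    [1.0, 0.0, 0.8],
    [1.0, 0.0, 3.6],
    [1.0, 0.0, 2.9],
    [1.0, 0.0, 5.0],
])

# Assumed cut-off: keep only features whose variance is above 0.1.
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

# Indices of the columns that survived the filter.
kept = selector.get_support(indices=True)
print("kept columns:", kept)
print("variances:", selector.variances_)
```

A suitable threshold depends on the scale of each feature, so in practice it is worth checking the variances before fixing a cut-off.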

### Removing redundant features

After removing irrelevant features, the next step is to remove redundant features, that is, features that are duplicated or highly correlated with one another. When several features are highly correlated, keeping one of them preserves most of the information about the record, so we can exclude the rest from further processing. To find them, we can calculate the correlation between each pair of features, for example with Pearson’s correlation coefficient. A common rule of thumb treats a coefficient higher than 0.8 or lower than -0.8 as a sign of highly correlated features.
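The pairwise check can be sketched as follows, assuming pandas and NumPy. The data is synthetic and `drop_correlated` is a hypothetical helper written for this illustration. It keeps the first feature of each correlated group, which is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Synthetic features: two independent signals plus near-copies of them.
df = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=200),  # almost a linear copy of "a"
    "b": b,
    "b_neg": -b + 0.01 * rng.normal(size=200),        # strongly negatively correlated with "b"
})


def drop_correlated(frame, threshold=0.8):
    """Drop every feature whose |Pearson r| with an earlier kept feature exceeds threshold."""
    corr = frame.corr(method="pearson").abs()
    columns = list(frame.columns)
    to_drop = []
    for i, col in enumerate(columns):
        if col in to_drop:
            continue  # already removed, do not use it as a reference
        for other in columns[i + 1:]:
            if other not in to_drop and corr.loc[col, other] > threshold:
                to_drop.append(other)
    return frame.drop(columns=to_drop), to_drop


reduced, dropped = drop_correlated(df, threshold=0.8)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```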

### Rank remaining features and take an arbitrary number of top ones

The final step is to rank the remaining features according to the information they provide about the target and to keep a chosen number of top-ranked features. The important aspect here is to choose an appropriate ranking criterion. Possible ones are the Chi-squared test, the Fisher score, and other univariate measures such as the ROC-AUC of each feature taken on its own.
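As an illustration, here is a sketch of ranking with the Chi-squared test using scikit-learn's `SelectKBest`. The Iris dataset and `k=2` are assumptions made for the example. Note that the Chi-squared test requires non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

data = load_iris()
X, y = data.data, data.target

# Score every feature against the target and keep the two best ones.
selector = SelectKBest(score_func=chi2, k=2)
X_top = selector.fit_transform(X, y)

# Print features from the highest to the lowest Chi-squared score.
ranking = sorted(zip(data.feature_names, selector.scores_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")

top_indices = selector.get_support(indices=True)
```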

## Wrapper Methods

The main goal of wrapper methods is to select the optimal subset of all available features that produce a trained model with the best performance.

To implement any of the wrapper methods, we must first choose the machine learning algorithm.

There are three types of wrapper methods, which differ in how they search for the optimal feature subset.

### Step Forward Feature Selection

Let's say that N is the number of available features. The step forward wrapper method works in the following way:

- Evaluate all subsets with one feature.
- Choose the one that provides the best-performing trained model.
- Evaluate the performance of the models trained with all two-feature subsets that contain the already selected feature.
- Choose the two-feature subset with the best performance.
- Repeat, adding one feature at a time, until all N features are included in the model.
- From the resulting N subsets (one for each size from 1 to N), choose the one whose trained model performs best.
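The steps above can be sketched as a short loop. This is a minimal illustration, assuming scikit-learn, the Iris dataset, logistic regression as the chosen algorithm, and 5-fold cross-validated accuracy as the performance measure. None of these choices come from the post itself.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)


def score(features):
    """Mean cross-validated accuracy of the model on the given feature indices."""
    return cross_val_score(model, X[:, features], y, cv=5).mean()


selected = []                        # features chosen so far
remaining = list(range(X.shape[1]))  # features not yet chosen
history = []                         # best subset found for each size 1..N

while remaining:
    # Try adding each remaining feature to the current selection.
    best_score, best_feature = max((score(selected + [f]), f) for f in remaining)
    selected.append(best_feature)
    remaining.remove(best_feature)
    history.append((best_score, list(selected)))

# Final step: pick the best of the N nested subsets.
best_score, best_subset = max(history, key=lambda item: item[0])
print("best subset:", best_subset, "score:", round(best_score, 3))
```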

### Step Backward Feature Selection

This method mirrors step forward feature selection, but it starts from the full feature set and removes one feature at a time:

- The initial feature set contains all N available features.
- Generate N candidate feature sets by removing each feature in turn, so that each candidate has N-1 features.
- Train a model on each candidate and evaluate its performance.
- Keep the candidate with N-1 features that gives the best results as the next feature set.
- Repeat this procedure until the feature set contains only one feature.
- From the resulting N feature sets (one for each size from N down to 1), choose the one whose trained model performs best.
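scikit-learn ships a ready-made version of this search in `SequentialFeatureSelector`. The sketch below assumes the Iris dataset and logistic regression. Unlike the procedure above, it stops at a fixed subset size (here 2, an arbitrary choice) instead of comparing the best subset of every size.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from all features and repeatedly drop the one whose removal
# hurts the cross-validated score the least.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="backward",
    cv=5,
)
selector.fit(X, y)

kept = selector.get_support(indices=True)
X_selected = selector.transform(X)
print("kept feature indices:", kept)
```

Setting `direction="forward"` runs the step forward variant with the same interface.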

### Exhaustive Feature Selection

This method trains a model on every possible non-empty subset of the available features. We choose the subset with the best performance.
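A brute-force sketch, assuming scikit-learn, the Iris dataset (only 4 features, so only 15 subsets), and cross-validated accuracy as the performance measure:

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
n_features = X.shape[1]

# Evaluate every non-empty subset of feature indices.
results = {}
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        results[subset] = cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

best_subset = max(results, key=results.get)
print("subsets evaluated:", len(results))
print("best subset:", best_subset, "score:", round(results[best_subset], 3))
```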

### Note on Wrapper Methods

The step forward and step backward methods need on the order of N^2 model trainings, while exhaustive feature selection needs on the order of 2^N. Even for a moderate number of features, 2^N is so large that it exceeds the processing power of existing computing infrastructure. Therefore, exhaustive selection is rarely used in practice.
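To make the difference concrete, the following sketch counts model trainings for a few values of N. It assumes one training per evaluated subset and ignores cross-validation folds. The function names are hypothetical.

```python
def stepwise_trainings(n):
    """Step forward: n candidates in round 1, n-1 in round 2, ..., 1 in the last round."""
    return n * (n + 1) // 2


def exhaustive_trainings(n):
    """Every non-empty subset of n features."""
    return 2 ** n - 1


for n in (10, 20, 50):
    print(f"N={n}: stepwise={stepwise_trainings(n)}, exhaustive={exhaustive_trainings(n)}")
```

For N=50 the stepwise search needs 1,275 trainings, while the exhaustive search needs more than 10^15.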

## Embedded Methods

### Introduction

Embedded methods perform feature selection during the model training process. The training algorithm embeds the feature ranking as its primary or extended functionality.

The advantages of these methods are:

- They are faster than wrapper methods
- They are usually more accurate than filter methods
- They can detect interactions between features

Procedure:

- Train a machine learning algorithm
- Derive the feature importance
- Remove unimportant features
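This three-step procedure maps onto scikit-learn's `SelectFromModel`. The sketch below assumes a random forest as the trained algorithm, the breast cancer dataset, and the median importance as the cut-off. All three are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# 1. Train a machine learning algorithm (done inside fit_transform).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median")

# 2. and 3. Derive the importances and drop features below the median.
X_selected = selector.fit_transform(X, y)
importances = selector.estimator_.feature_importances_

print("features before:", X.shape[1])
print("features after:", X_selected.shape[1])
```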

There are three types of embedded methods for feature selection: Lasso regularization, linear models, and trees.

### Lasso Regularization

Lasso regularization is L1 regularization. We add it to the cost function that the training algorithm minimizes, as the sum of the absolute values of the weights of all features, multiplied by the regularization parameter λ. For small values of λ, most weight coefficients are non-zero. As λ increases, some of the weight coefficients become exactly 0, meaning that the corresponding features are not crucial for the model. Therefore, by gradually increasing λ and retraining the model, we get more and more weights equal to 0. The process stops at a chosen maximum value of λ, or when we reach the chosen number of features that we can remove from the observation set.
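The effect of the regularization parameter can be observed with a short experiment. This sketch assumes scikit-learn, where the parameter is called `alpha`, and a synthetic regression problem in which only 3 of 10 features are informative. The list of values tried is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which influence the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # put all features on the same scale

alphas = [0.01, 1.0, 10.0, 1000.0]
zero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))  # weights driven exactly to zero
    zero_counts.append(n_zero)
    print(f"alpha={alpha}: {n_zero} of {X.shape[1]} weights are exactly zero")
```

For a given `alpha`, the features to keep are those where `model.coef_ != 0`.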

### Regression Models

Feature selection using linear models assumes that the target depends linearly on the values of the available features (a multivariate dependency) and that the feature values are normally distributed. In that case, we train the model using logistic regression for classification or linear regression for regression. Training produces a weight for each feature involved. Provided the features are on a comparable scale, more important features get weights with larger absolute values, and vice versa. The selected feature set is then a chosen number of top-ranked features, with the absolute value of the weight as the ranking criterion.
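A sketch of this ranking for a classification task, assuming scikit-learn and the breast cancer dataset. The features are standardized first so that the weights are comparable, and keeping the top 5 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

model = LogisticRegression(max_iter=1000).fit(X_scaled, data.target)

# Rank features by the absolute value of their weight.
importance = np.abs(model.coef_[0])
top_k = 5
top_idx = np.argsort(importance)[::-1][:top_k]

for idx in top_idx:
    print(f"{data.feature_names[idx]}: |weight| = {importance[idx]:.3f}")
```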

### Trees

A decision tree is a machine learning algorithm for both classification and regression. Each node of the tree asks a question about the value of one feature and splits the observations into buckets accordingly. We can measure the importance of each feature by how pure the resulting buckets are. The impurity of a bucket can be measured with the Gini index or with entropy (whose reduction is the information gain), among other techniques. The more a feature decreases impurity, the more influential it is.
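A small worked example of impurity decrease, using the Gini index and plain Python. The labels and the two splits are made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a bucket: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two child buckets."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5

# A question on one feature separates the classes perfectly.
perfect = impurity_decrease(parent, ["a"] * 5, ["b"] * 5)

# A question on another feature barely changes the class mix.
weak = impurity_decrease(parent, ["a"] * 3 + ["b"] * 2, ["a"] * 2 + ["b"] * 3)

print(f"perfect split: {perfect:.2f}")  # 0.50
print(f"weak split: {weak:.2f}")        # 0.02
```

The first feature removes all impurity (0.5 down to 0), so it would be ranked as far more influential than the second.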

A variant of this method is the random forest. A random forest is a set of decision trees (usually hundreds of them), each built using random subsets of the available features. In this case, the importance of each feature is calculated as its average importance among all the trees in the forest.
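The averaging can be checked directly in scikit-learn. This is a sketch on the Iris dataset, with 200 trees as an arbitrary choice. In this setting, the mean of the per-tree importances matches the forest's own `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Importance of every feature in every individual tree.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees to get one importance per feature.
averaged = per_tree.mean(axis=0)

for name, value in sorted(zip(data.feature_names, averaged),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {value:.3f}")
```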

Another variant is the recursive use of random forests. After the first run of the random forest, we remove a few features as insignificant. Then we rerun the random forest and choose a second subset of unimportant features for removal. We repeat this procedure until we meet a chosen model performance criterion.
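scikit-learn's `RFE` implements a closely related loop. In this sketch, the breast cancer dataset, the step of 5 features per round, and the target of 5 remaining features are all assumptions. Note that `RFE` stops at a fixed number of features, not at a performance criterion. `RFECV` is the cross-validated variant that chooses the number of features from the scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

# Refit the forest repeatedly, dropping the 5 least important
# features each round, until only 5 features are left.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
rfe = RFE(estimator=forest, n_features_to_select=5, step=5)
X_selected = rfe.fit_transform(X, y)

print("kept feature indices:", rfe.get_support(indices=True))
print("elimination ranking (1 = kept):", rfe.ranking_)
```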

We can use the same feature selection methodology with gradient boosting trees instead of random forests.

## Conclusion

This should be all for feature selection.

The following course extensively covers all of the topics mentioned above, and we highly recommend it to anyone interested in implementing the full Machine Learning pipeline: https://www.udemy.com/course/feature-selection-for-machine-learning/

## What Comes Next in the Machine Learning Pipeline

The next step in the Machine Learning pipeline is the deployment of Machine Learning models, which we will cover in a follow-up post.