TabNet: New Kid on the Boosting Block

June 27, 2021

Image by Aaron Olson from Pixabay

Deep Learning has proved its value in NLP and computer vision: RNNs, LSTMs, and CNNs have been covered thoroughly over the last few years.

For the last year and a half, like it or not, Transformers have been taking over the Machine Learning space. First, they conquered NLP. Then they moved into computer vision.

Still, on tabular data the SOTA remained old-fashioned Machine Learning, namely XGBoost.

A key difference between unstructured data, so well covered with Deep Learning, and tabular data:

Tabular data is often heterogeneous. The features constructed from tables come from various unrelated sources, each with its own units (e.g., seconds vs. hours) and associated numerical scaling issues.
The features themselves are sparse: unlike data from images, audio, and language, there may be only slight variation within a table column. There are also typically more categorical features, for which the order (and value) of the categories is not important, and which, unlike numerical features, are discrete by nature. To handle this, we often have to perform careful preprocessing.
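
To make the preprocessing point concrete, here is a minimal sketch in plain Python of two steps commonly applied for conventional models: standardizing numeric columns that arrive in different units, and one-hot encoding a categorical column. The column names and values are made up for illustration, and the helper functions are hypothetical, not part of any library.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors.

    The categories are sorted only to get a reproducible column order;
    the order itself carries no meaning.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical toy table: two numeric columns in different units
# and one categorical column.
session_seconds = [30.0, 90.0, 150.0]
tenure_hours = [0.5, 1.5, 2.5]
plan = ["free", "pro", "free"]

scaled_seconds = standardize(session_seconds)
scaled_hours = standardize(tenure_hours)
categories, plan_encoded = one_hot(plan)
```

After standardization the two numeric columns live on the same scale even though one was measured in seconds and the other in hours, which is the kind of scaling issue described above.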

Some other differences between tabular data and the kinds of data on which Deep Learning is SOTA:

Features constructed from tabular data are often correlated, so a small subset of features is responsible for most of the predictive power.
Missing data appears in the form of NULL values in a database.
There is usually a strong class imbalance in the labels (in the supervised setting). For instance, users prefer only a small collection of movies in a movie catalog.
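
Two of these properties are easy to measure before any modeling. The sketch below, in plain Python, computes the share of NULL values in a column and inverse-frequency class weights, a common heuristic for compensating class imbalance. The data and the function names are hypothetical.

```python
from collections import Counter


def missing_rate(column):
    """Fraction of entries that are NULL (None in Python)."""
    return sum(v is None for v in column) / len(column)


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical data: a rating column with NULLs, and labels where
# only 2 of 10 catalog movies were watched.
ratings = [4.5, None, 3.0, None, 5.0]
watched = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

rate = missing_rate(ratings)
weights = balanced_class_weights(watched)
```

Here the rare positive class ends up weighted four times as heavily as the negative class, so each watched movie counts more in a weighted loss.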

Now, one variation of these attention-based networks is going deeper into traditional, old-fashioned Machine Learning and challenging XGBoost, a first-choice algorithm for many regular Kagglers.

Time to introduce the new kid on the boosting block: TabNet.

Introduced in: [1908.07442] TabNet: Attentive Interpretable Tabular Learning (arxiv.org)

“Enter Google’s TabNet in 2019. According to the paper, this Neural Network was able to outperform the leading tree-based models across a variety of benchmarks. Not only that, it is considerably more explainable than boosted tree models as it has built-in explainability. It can also be used without any feature preprocessing.”

In short: based on the examples it sees during training, TabNet learns which features to consider for each data row and which ones to ignore, and these choices can then be used for explainability. Just try to explain the results of any bagging or boosting algorithm, and you immediately run into trouble. Not with TabNet 🙂

Top advantages:

It can encode multiple data types, such as images, along with tabular data, and it uses nonlinearity to solve the task.
No need for Feature Engineering: you can throw in all the columns, and the model will pick the best features. It is also interpretable.

A few features:

TabNet inputs raw tabular data without any preprocessing and is trained using gradient descent-based optimization.
TabNet uses sequential attention to choose features at each decision step, enabling interpretability and better learning as the learning capacity is used for the most valuable features.
Feature selection is instance-wise, i.e., it can be different for each row of the training dataset.
TabNet employs a single deep learning architecture for feature selection and reasoning. That is known as soft feature selection.
The above design choices allow TabNet to enable two kinds of interpretability: local interpretability, which visualizes the importance of features and how they are combined for a single row, and global interpretability, which quantifies each feature’s contribution to the trained model across the dataset.
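
The sequential attention and the two kinds of interpretability can be sketched in a few lines of plain Python. In the TabNet paper, each decision step produces a sparse feature mask with sparsemax, and a prior term discourages reusing features that earlier steps already selected. The sketch below follows that idea under strong simplifications: the logits are made-up numbers rather than outputs of a learned network, and importances are plain averages of the masks, whereas the paper additionally weights each step by its contribution to the decision. All function names are hypothetical.

```python
def sparsemax(logits):
    """Sparse alternative to softmax: non-negative weights that sum to 1,
    with low-scoring entries set to exactly 0."""
    ordered = sorted(logits, reverse=True)
    running_sum, tau = 0.0, 0.0
    for k, z in enumerate(ordered, start=1):
        running_sum += z
        if 1 + k * z > running_sum:      # the k-th largest logit is in the support
            tau = (running_sum - 1) / k  # threshold for a support of size k
    return [max(z - tau, 0.0) for z in logits]


def sequential_masks(step_logits, gamma=1.0):
    """Compute one feature mask per decision step for a single row.

    step_logits: one list of per-feature logits per step (made-up inputs
    here; in TabNet they come from a learned attentive transformer).
    gamma: relaxation parameter. With gamma = 1.0, a feature that got
    full weight in one step has its logit scaled to zero in later steps.
    """
    prior = [1.0] * len(step_logits[0])
    masks = []
    for logits in step_logits:
        mask = sparsemax([p * z for p, z in zip(prior, logits)])
        masks.append(mask)
        prior = [p * (gamma - m) for p, m in zip(prior, mask)]
    return masks


def local_importance(masks):
    """Per-row importance: average of the step masks (simplified)."""
    return [sum(column) / len(masks) for column in zip(*masks)]


def global_importance(per_row):
    """Dataset-level importance: average of the per-row importances."""
    return [sum(column) / len(per_row) for column in zip(*per_row)]


# Two hypothetical rows, three features, two decision steps each.
masks_a = sequential_masks([[2.0, 1.0, 0.1], [2.0, 1.0, 0.1]])
masks_b = sequential_masks([[0.5, 1.0, 4.0], [0.5, 3.0, 4.0]])

local_a = local_importance(masks_a)
local_b = local_importance(masks_b)
overall = global_importance([local_a, local_b])
```

In this toy run the first row puts all of its first-step attention on feature 0, while the second row puts it on feature 2, which illustrates the instance-wise selection. Averaging the per-row importances gives the dataset-level view.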