Post Transformers-2020: Great Start in 2021

Transformers in AI in 2021

Transformers from the movie. We are not talking about them.

Very Short Introduction: Transformers

Transformers moved the NLP world away from RNNs and LSTMs and replaced them with the Attention mechanism, with a capital A.

2019: Effective Transformer Implementations

BERT and GPT-2 appeared as effective implementations. Both are already well documented.

BERT (language model) – Wikipedia

GPT-2 – Wikipedia

2020: Transformers Improved

Transformers successfully transformed the way we approach NLP tasks. However, the initial implementations reached the limits of the architecture very quickly. Because of self-attention, the compute and memory needed grow with the square of the number of input tokens processed. AI was also facing such massive neural networks for the first time, so data scientists started to analyze their behavior, and exciting findings appeared. For example, it turned out that only a small part of the neurons is effectively used at inference time once the model is trained. As a result, researchers figured out how to trim the unused parts, which increases speed and shrinks models by up to about 10 times while losing only 5-10% of accuracy. The ideas were first worked out in scientific papers in 2019 and then effectively implemented around the middle of 2020: compressed BERT and Poor Man’s BERT, alongside retrained variants such as RoBERTa.
As it turned out, and as expected, it is the base architecture that matters. So Google proposed T5, released early in 2020, which pairs a BERT-style encoder with a decoder. While BERT’s output is a class or a span of the input, in T5 both input and output are explicit texts.
Still, quadratic complexity remained the factor that limits the size of input and output texts.
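To make the quadratic growth concrete, here is a minimal pure-Python sketch of the attention weights for a single head. The function names (attention_weights, score_count) and the toy vectors are made up for illustration and are not taken from any library; real implementations use batched matrix operations.

```python
import math


def attention_weights(queries, keys):
    """Scaled dot-product attention weights for one head (illustrative only)."""
    dim = len(queries[0])
    weights = []
    for q in queries:
        # One score per (query, key) pair: this is where the n * n cost comes from.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def score_count(num_tokens):
    """Number of attention scores computed per head and per layer."""
    return num_tokens * num_tokens


# Hypothetical toy input: three tokens with two features each.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = attention_weights(toy_tokens, toy_tokens)

for n in (512, 1024, 2048, 4096):
    print(n, score_count(n))
```

Doubling the input length quadruples the number of scores, which is why going from 512 to 4096 tokens multiplies the cost of this step by 64.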
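Reformer relaxed the complexity of self-attention with locality-sensitive hashing (LSH): tokens are hashed into buckets, and attention is computed only within chunks of similar tokens, which brings the cost down from quadratic to roughly O(n log n).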

[2001.04451] Reformer: The Efficient Transformer (

Longformer was proposed with linear complexity and released as models that handle inputs of up to 4096 tokens. The magic: replace full self-attention with a sparse pattern, where each token attends to a sliding local window around it and only a few selected tokens get global attention. It was not very convincing initially, and it has not received much attention so far.
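A rough way to see why a sparse pattern scales linearly is to count the allowed (query, key) pairs. The sketch below assumes a symmetric local window plus a handful of global tokens; the function name count_allowed_pairs and the window size of 512 are illustrative assumptions, not values taken from the Longformer code.

```python
def count_allowed_pairs(num_tokens, window, global_tokens=()):
    """Count (query, key) pairs allowed by a local window plus global tokens."""
    half = window // 2
    global_set = set(global_tokens)
    count = 0
    for i in range(num_tokens):
        if i in global_set:
            # A global token attends to every position.
            count += num_tokens
            continue
        lo = max(0, i - half)
        hi = min(num_tokens - 1, i + half)
        local = hi - lo + 1
        # Every token also attends to global tokens outside its own window.
        extra = sum(1 for g in global_set if g < lo or g > hi)
        count += local + extra
    return count


full = 4096 * 4096
sparse = count_allowed_pairs(4096, window=512, global_tokens=(0,))
print(full, sparse)
```

With a fixed window, each extra token adds a roughly constant number of pairs, so the total grows in proportion to the input length instead of its square.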

Longformer — transformers 4.2.0 documentation (

GPT-3’s performance (paper published in May 2020) was ground-shaking, but the model was not open-sourced, and few implementation details were disclosed. The “promising” point here: Microsoft got its hands on it, with a claim to invest several billion dollars in it in the following years.

2021: At the Very Beginning: Transformers on Steroids

OpenAI: Text2Image and zero-shot image classification (CLIP). About the second one, I already wrote a few words.

Zero-Shot Image Classification – CLIP – AI Daily News (

The latest thing: Switch Transformer by Google, an extension of their T5 architecture with a stunning 1.6 trillion parameters. This is not one single dense model. Inside the network there are up to 2048 expert sub-networks that specialize automatically during training, plus a small routing network that sends each token to the most relevant expert. It is a colossal model overall, but because only one expert runs per token, it is claimed to be four times faster than T5-XXL.
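The routing idea can be sketched in a few lines of pure Python. This is a toy illustration of top-1 routing under simplifying assumptions (two experts, hand-written router weights, no training, no load balancing); the names route_top1 and switch_layer are hypothetical and do not come from Google's implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def switch_layer(tokens, router_weights, experts):
    """Send each token to exactly one expert and scale its output by the gate."""
    outputs = []
    for token in tokens:
        index, gate = route_top1(token, router_weights)
        expert_output = experts[index](token)  # only this one expert runs
        outputs.append([gate * value for value in expert_output])
    return outputs


# Hypothetical toy setup: two experts and one router row per expert.
toy_router = [[1.0, 0.0], [0.0, 1.0]]
toy_experts = [
    lambda t: [2.0 * v for v in t],  # expert 0 doubles the input
    lambda t: [-v for v in t],       # expert 1 negates the input
]

print(switch_layer([[3.0, 0.0], [0.0, 3.0]], toy_router, toy_experts))
```

The parameter count grows with the number of experts, while the work done per token stays close to that of a single expert, which is how such a model can be both huge and fast.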

Google Brain’s Switch Transformer Language Model Packs 1.6-Trillion Parameters | Synced (

Google trained a trillion-parameter AI language model | VentureBeat

Google Trains A Trillion Parameter Model, Largest Of Its Kind (

[2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (

What’s next? We’ll see.
Anyway: is anybody except Google, OpenAI, and partially Facebook doing AI on this planet at all?
