Size Does Matter

Introduction

In the world of NLP and language models, size does matter. Literally: the bigger, the better.

Since transformers appeared on the NLP horizon, they have become a de facto synonym for successful language model implementation, overshadowing recurrent neural networks and LSTMs. NLP got its equivalent of the ImageNet-based neural networks used for image recognition.

This came with a price: a huge number of learned parameters, which in turn demand huge amounts of RAM and compute.
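To make "huge" concrete, here is a back-of-the-envelope sketch of my own (not from any of the papers below): with 32-bit floats, each parameter takes 4 bytes, so the parameter count translates directly into the memory needed just to hold the weights, before counting activations or optimizer state.

```python
# Rough memory needed just to store model weights in 32-bit floats
# (4 bytes per parameter); training state and activations cost more.

def weights_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Gigabytes needed to hold n_params parameters."""
    return n_params * bytes_per_param / 1e9

# Parameter counts as cited in this article.
for name, n in [("GPT-2", 1.5e9), ("Meena", 2.6e9),
                ("T-NLG", 17e9), ("GPT-3", 175e9)]:
    print(f"{name}: {weights_gb(n):.0f} GB")
# GPT-2: 6 GB, Meena: 10 GB, T-NLG: 68 GB, GPT-3: 700 GB
```

Even the smallest entry on this list barely fits in a laptop's RAM, and the largest does not fit on any single accelerator available in 2020.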

This article is about the biggest transformers that appeared in 2020. Things are moving at exponential speed, so what happened in this area in 2019 is hardly relevant any more.

GPT-2

November 15, 2019. 1.5 B parameters.

OK, GPT-2 appeared in 2019; we’ll use it as “the biggest of 2019” to mark where we left off the previous year.

It is the successor of the GPT-1 model, and it wasn’t released in full for a long time, since OpenAI built up some controversy around it. OpenAI declared it a dangerously good text generation model that could enable automatic generation of fake news at scale.

It is maybe the last one that can run on our laptops for the next few years.

To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.

https://github.com/openai/gpt-2

I’m still confused whether I should take the quote above as a pro or a con.

Meena

January 28, 2020. 2.6 B parameters.

Meena was trained on a whopping 341 gigabytes of public social-media chatter—8.5 times as much data as OpenAI’s GPT-2. Google says Meena can talk about pretty much anything, and can even make up (bad) jokes.

https://www.technologyreview.com/2020/01/30/275995/google-says-its-new-chatbot-meena-is-the-best-in-the-world/

Can I talk to Meena? Not yet. Google says it won’t be releasing a public demo until it has vetted the model for safety and bias, which is probably a good thing. When Microsoft released its chatbot Tay on Twitter in 2016, it started spewing racist, misogynistic invective within hours. 

https://www.technologyreview.com/2020/01/30/275995/google-says-its-new-chatbot-meena-is-the-best-in-the-world/

Off topic: MS tried to enter the chatbot arena with Tay in early 2016. A few months later, they tried again with a “fixed” one. I haven’t heard of another try by MS since then.

Megatron BERT and GPT-2

April 2020. 3.9 B parameters for BERT and 8.3 B for GPT-2.

NVIDIA entered the NLP arena with Megatron, enhanced implementations of the BERT and GPT-2 architectures.

Smaller models with just 345 M parameters can be downloaded from the NVIDIA site at https://ngc.nvidia.com/catalog/models/nvidia:megatron_lm_345m and https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m

Blender

April 29, 2020. 9.4 B parameters.

Comes in three sizes:

  • Small, 90 M parameters
  • Medium, 2.7 B parameters and
  • Large, 9.4 B parameters

Available at

Highlight:

In addition, the model was fine-tuned using Blended Skill Talk (BST), which strengthened the model for the following skills:

– Engaging use of personality (PersonaChat)

– Engaging use of knowledge (Wizard of Wikipedia)

– Display of empathy (Empathetic Dialogues)

– Ability to blend all three seamlessly (BST)

T5

February 24, 2020. 11 B parameters.

we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia. Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content. This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training. C4 is available through TensorFlow Datasets.

https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html

T-NLG

February 10, 2020. 17 B parameters.

Finally, some news from MS.

I’m not sure if their model is publicly available, but they open-sourced their training techniques:

  • ZeRO
  • Hummingbird

Sambanova

March 26, 2020. 1,000 B parameters.

GPT-3

June 11, 2020. 175 B parameters.

Highlight:

It does not need fine-tuning for specific tasks. It already knows “everything”.

Examples of its creativity: https://read-the-samples.netlify.app/

GShard

June 30, 2020. 600 B parameters.

Next up: probably a 1 T-parameter model. Google already had a 137 B parameter model in 2017: https://arxiv.org/abs/1701.06538. What else is out there in research labs? We’ll know in a couple of months/years.

Many thanks for this post to

