Zero-Shot Image Classification – CLIP

A Long Introduction

At the beginning of 2020, it seemed to me that Deep Learning had shifted irreversibly to transfer learning. Nobody reasonable started a project by training a neural network from scratch.

If you wanted image recognition, all you had to do was pick the pre-trained model that suited your needs best and fine-tune it on a handful of images, i.e. a few tens, and you had a proper custom image classifier.

The next step was one-shot training: for a face recognition task, all you needed was a single image per person. In my project, using Siamese convolutional neural networks (already pre-trained, of course) trained on images of about 600 people, I achieved about 90% accuracy with only one reference image of each person. Naturally, there were misclassified images: the network had seen some people with beards and then had to classify them with their beards shaved. I wouldn't have recognized them either.
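To make the one-shot idea concrete: a Siamese network maps each face to an embedding vector, and a new photo is assigned to the person whose single reference embedding is closest. Here is a minimal sketch of that matching step, assuming the embeddings have already been computed by some pre-trained network. The vectors, the names, and the `identify` helper are made up for illustration and are not taken from my project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(query, references):
    """Return the name whose single reference embedding is most similar to the query."""
    return max(references, key=lambda name: cosine_similarity(query, references[name]))

# Hypothetical 3-dimensional embeddings; a real network outputs many more dimensions.
references = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.1]  # embedding of a new, unseen photo

print(identify(query, references))
```

Adding a new person then means storing one more reference embedding; nothing has to be retrained.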

NLP gained from this trend with transformers, a natural next step after RNNs and LSTMs combined with an attention mechanism. With multi-headed attention, they surpassed all previous results on NLP tasks. And again, we got two well-trained models: BERT and GPT-2.

BERT is a bidirectional language model pre-trained on a large text corpus to fill in masked gaps in the training sentences. This makes it very suitable for fine-tuning on all kinds of NLP tasks. You simply take everything BERT has learned about the language and use that knowledge of language semantics in a wide variety of ways.

The latest SOTA result, at least according to me, was zero-shot learning: you want to classify a piece of text, forcing the neural network to choose among several provided classes that the model was never trained on.

I have already written about this one:

Zero-Shot Text Classification

It was just a matter of time before these two most advanced branches of Deep Learning, image models and language models, merged. And it happened:

Finally, the Point

CLIP: Connecting Text and Images


A few more resources on the topic:

Open AI CLIP: learning visual concepts from natural language supervision | by Mostafa Ibrahim | Jan, 2021 | Towards Data Science

GitHub - openai/CLIP: Contrastive Language-Image Pretraining

OpenAI CLIP - Connecting Text and Images | Paper Explained - YouTube

OpenAI Unveils DALL·E and CLIP AI Models That Create and Classify Images | Technology News

Long story short:

  1. You want to classify an image into a category that was never used as a label in any kind of pre-training process.
  2. You describe the target categories in a few words.
  3. You ask the CLIP model which category describes the presented image best.
  4. You read the prediction.
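The four steps above boil down to comparing one image embedding against one text embedding per category. Here is a minimal sketch of that scoring step in plain Python, assuming the image and the descriptions have already been encoded. The vectors are toy values, `zero_shot_probs` is a hypothetical helper, and the scale factor of 100 is an assumption standing in for the temperature the real model learns.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_embedding, text_embeddings, scale=100.0):
    """Scaled cosine similarity between the image and each description, then softmax."""
    image = normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = normalize(text_embedding)
        logits.append(scale * sum(i * t for i, t in zip(image, text)))
    return softmax(logits)

labels = ["a diagram", "a dog", "a cat"]
image_embedding = [0.9, 0.1, 0.2]  # toy values, not real model output
text_embeddings = [
    [1.0, 0.0, 0.1],  # "a diagram"
    [0.0, 1.0, 0.0],  # "a dog"
    [0.1, 0.2, 1.0],  # "a cat"
]

probs = zero_shot_probs(image_embedding, text_embeddings)
prediction = labels[probs.index(max(probs))]
print(prediction)
```

Because the categories enter only as text, swapping in a different list of descriptions changes the classifier without touching the model.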

And a Little Bit of Code

import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"  # use the GPU when there is one
model, transform = clip.load("ViT-B/32", device=device)  # the model plus its image preprocessing
image = transform(Image.open("CLIP.png")).unsqueeze(0).to(device)  # a batch of one image
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)  # the candidate categories
with torch.no_grad():  # inference only, no gradients needed
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

Want to know the number of lines? It's 13.
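To turn the printed array into a readable answer, you pair each probability with its description and take the largest one. A small sketch in plain Python: the numbers are copied from the printed output of the snippet, and the variable names are my own.

```python
labels = ["a diagram", "a dog", "a cat"]
# Probabilities copied from the printed output; one row per input image.
probs = [[0.9927937, 0.00421068, 0.00299572]]

for row in probs:
    # Sort (label, probability) pairs from most to least likely.
    ranked = sorted(zip(labels, row), key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranked[0]
    print(f"{best_label}: {best_prob:.1%}")  # prints: a diagram: 99.3%
```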

Happy Christmas & NY 2021

What a start to the year, hm?

First (among other things), the American Congress. Then this.

OK, try to make the most of 2021 and enjoy it. I have already started, playing with this.

I expect this year to bring a model that will finally tell us whether Mona Lisa is smiling or not. And I guess we'll wait a couple more years for an explained result.

Happy NY holidays.
