A Long Introduction
At the beginning of 2020, it seemed to me that Deep Learning had shifted irreversibly to the transfer-learning paradigm. No reasonable practitioner started a project by training a neural network from scratch.
If you wanted image recognition, all you had to do was pick the model that best suited your needs, fine-tune it with a couple, i.e. a few tens, of images, and you had a proper custom image classifier.
The next step was one-shot learning: for face recognition, all you needed was a single image. Implemented properly, with only one image per person and data from about 600 people, I achieved about 90% accuracy in my project using Siamese convolutional neural networks, pre-trained, of course. There were misclassified images, of course: the network was exposed to images of people with beards and asked to classify them with the beards shaved off. I wouldn't have recognized them either.
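The one-shot idea above can be sketched in a few lines: keep a single reference embedding per person and assign a query face to the nearest one. The 4-dimensional vectors and the names below are made up for illustration; in a real system the embeddings would come from a pre-trained Siamese CNN encoder, which is not shown here.

```python
import numpy as np

# One reference embedding per known person (toy, hand-made vectors --
# a real system would produce these with a Siamese CNN encoder).
references = {
    "alice": np.array([0.9, 0.1, 0.0, 0.2]),
    "bob":   np.array([0.1, 0.8, 0.3, 0.0]),
}

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query):
    # One-shot matching: pick the identity whose single reference
    # embedding is most similar to the query embedding.
    return max(references, key=lambda name: cosine(references[name], query))

print(identify(np.array([0.85, 0.15, 0.05, 0.1])))  # closest to "alice"
```

The point is that no per-person training happens at all; adding a new person means adding one embedding to the dictionary.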
NLP gained from this new trend in AI with transformers, a natural extension of RNNs and LSTMs combined with an attention mechanism. With multi-headed attention, they surpassed all previous results on NLP tasks. And again we got two well-trained models: BERT and GPT-2.
BERT is a bidirectional language model pre-trained on a large text corpus to fill gaps in training sentences. This makes it very suitable for fine-tuning on all kinds of NLP tasks: you simply take everything BERT has learned about the language and reuse that knowledge of its semantics in a wide variety of ways.
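That gap-filling objective is easy to try out. A minimal sketch, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (neither is mentioned in this post, and the first call downloads the model):

```python
from transformers import pipeline

# Masked-language-model pipeline: BERT predicts the token hidden
# behind [MASK], which is exactly its pre-training objective.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

predictions = unmasker("Paris is the [MASK] of France.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```

Each prediction carries the filled-in token and its probability; the top candidate for this sentence should be something like "capital".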
The latest SOTA result, at least in my view, was zero-shot learning: you want to classify a text, forcing the neural network to categorize it among several provided classes that the model was never trained on.
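A minimal sketch of zero-shot text classification, assuming the Hugging Face `transformers` library and the `facebook/bart-large-mnli` checkpoint (my choice for illustration, not something named in this post; the first call downloads the model):

```python
from transformers import pipeline

# Zero-shot classification: an NLI model scores each candidate label
# against the text, although it was never trained on these classes.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The new GPU crushes every benchmark we threw at it.",
    candidate_labels=["technology", "cooking", "politics"],
)
print(result["labels"][0])  # the highest-scoring label
```

You provide the candidate classes at inference time as plain words, which is exactly the trick CLIP later applied to images.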
I had already written about this one:
It was just a matter of time before those two most advanced branches of Deep Learning would merge. And it happened:
Finally, the Point
A few more resources on the topic:
Long story short:
- You want to classify an image into some category that was never used in any kind of pre-training process.
- You describe the target categories with a few words.
- You ask the CLIP model which category best describes the presented image.
- You read the prediction.
And a Little Bit Of Code
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform = clip.load("ViT-B/32", device=device)

image = transform(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
```
Want to know the number of lines? It's 13.
Happy Christmas & NY 2021
What a start of the year, hm?
First (among other things), the American Congress. Then this.
Ok, let's try to make the most of it and enjoy 2021. I already started, playing with this.
I expect this year to bring a model that will finally tell us whether the Mona Lisa smiles or not. And, I guess, we'll wait a couple more years for an explained result.
Happy NY holidays.