Distillation of CLIP model and other experiments

Vinay Sisodia
PicCollage Company Blog
Aug 10, 2021 · 9 min read


Introduction

CLIP is a model released by OpenAI earlier this year. It was trained to learn “visual concepts from natural language supervision” on more than 400 million image-text pairs using an impressive amount of compute (256 GPUs for 2 weeks).

At PicCollage we have been researching ways to combine text and images. CLIP came in handy and we tested its performance on some of our content. It was VERY impressive — better than anything we had tried earlier. However, we soon began to notice a quirk of the model: it seemed to prioritize textual similarity over semantic similarity for a search query.

Given how powerful the model was, we also wanted to reduce its size and explore the possibility of deploying it on edge devices. Considering the magnitude of the dataset and compute required to train the original model, this seemed like a daunting task, but we wanted to give it a shot anyway.

This article expands on two experiments with CLIP performed at PicCollage:

  1. How to reduce the emphasis on textual similarity in order to get more relevant search results
  2. How to reduce the size of CLIP using model distillation

TLDR

We dealt with CLIP's over-emphasis on text in images (described below). We also performed model distillation of CLIP and ran the distilled models on iOS devices.

Over-emphasis on text in images

Searching for cat using CLIP returns roughly two kinds of results:
i) images containing the text cat or something similar
ii) images containing an actual cat
CLIP tends to have a higher score for the first type. We came up with two ways to resolve this issue and control the amount of "textness" in search results.

Model distillation of CLIP

We also used model distillation to reduce the size of the CLIP model (the ViT image encoder to be specific, not the language model) and got promising results. The original model is about 350MB with FP32 precision, while the distilled model is 48MB with FP32 precision and 24MB with FP16. Finally, we converted the distilled models to CoreML format to run on iOS and observed a negligible difference between the search results of the FP16 and FP32 versions.
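The CoreML conversion itself isn't covered in detail in this post, but a minimal sketch of how such a conversion could look with coremltools is shown below. The model handle (student_model), input name, and precision flag are illustrative assumptions, not our exact conversion pipeline.

import torch
import coremltools as ct

# Trace the distilled image encoder with a dummy input (shape assumed).
student_model.eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(student_model, example)

# Convert to an ML Program; FLOAT16 would give the smaller (~24MB) variant.
mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("DistilledCLIPImageEncoder.mlpackage")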

Over-emphasis on text in images

Let’s use a simple setup to demonstrate this problem. Consider three images:
i) an image with the word Cat in it
ii) an image with the word Gat in it
iii) an image with a cat in it

The images look like so:

CLIP assigns higher scores to these images compared to the image below for the word “cat”

Let’s say you search for cat. CLIP will convert this piece of text into a vector, say text_vector. The cosine similarities of text_vector with the vectors of the three images shown above are:

We can see that a search term and an image can be "similar" in two ways:
i) the image contains text similar to the search term: let’s refer to it as textual similarity
ii) the semantic meanings of the image and search term are similar: let’s refer to it as semantic similarity

When building a search functionality, one might prefer semantic similarity to textual similarity. We found that CLIP tends to give higher scores to textually similar images.

A similar problem has already been discussed online and was even addressed by Yannic Kilcher using prompts. However, we can't control the text typed in by a user, so using prompts may not be a good solution here.

Solutions to over-emphasis on text in images

We began with a hypothesis: there exists a direction in the shared vector space in which the “textness” property of images varies a lot whereas other (semantic) properties remain invariant. If we could find this direction, we could use a vector pointing in this direction and add it to all the image vectors (or the text vector) before normalizing them and calculating cosine similarities. Let’s call this vector the textness_bias vector.

In other words, before the following set of operations:

import numpy as np

# image_vectors has shape (N, 512); text_vector is assumed to be L2-normalized
image_vectors /= np.linalg.norm(image_vectors, axis=-1, keepdims=True)
cosine_similarities = image_vectors @ text_vector

we would do something like:

# add bias to the image vectors
image_vectors += scale * textness_bias
# or add bias to the text vector
text_vector += scale * textness_bias

The next question was: how to find this textness_bias vector? We came up with two different approaches which led to quite similar answers. The second approach (which is what we used) is described below:

Reducing “textness”: training a small model with no hidden layer

We created a dataset of images with and without text in them. The idea was to train a model and then use the weights of the model as an indicator of textness bias:

import torch.nn as nn

class Model(nn.Module):
    def __init__(self, dim=512):
        super(Model, self).__init__()
        self.linear = nn.Linear(dim, 2)

    def forward(self, x):
        return self.linear(x)

model = Model()
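Training this probe is a standard classification setup. The snippet below is a minimal sketch of how it could look; the tensors image_vectors and labels stand in for our actual dataset of CLIP image embeddings and their has-text labels, and the optimizer settings are placeholders.

import torch
import torch.nn as nn

# Placeholder data: (N, 512) CLIP image embeddings and binary labels
# (1 = image contains text, 0 = no text). The real dataset goes here.
image_vectors = torch.randn(1000, 512)
labels = torch.randint(0, 2, (1000,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    logits = model(image_vectors)   # (N, 2)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()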

Then we used the weight vector responsible for predicting the positive label as the textness bias. Another interesting finding was that adding the bias to the text vector was much more effective than adding it to the image vectors.

textness_bias = model.linear.weight[1]
text_vector += scale * textness_bias

The bigger the scale, the more emphasis CLIP puts on textual similarity. Let's take a look at some of the results.

Results of controlling textual similarity in search

For every search term, we varied the value of scale sequentially like so: -2, -1, 0, 1, 2. For each value of scale, we stored the top ten results in a single row. Thus for each search term, we got a grid of images where each row corresponded to a value of scale and contained top ten results for that scale. Notice how the preference for textual similarity increases as we go from top row to bottom row:

Results for searching “beach” and “halloween”
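For reference, the grid above can be produced with a loop along the following lines; it assumes image_vectors has already been L2-normalized as shown earlier, and that textness_bias has been pulled out of the probe as a NumPy array.

import numpy as np

textness_bias = model.linear.weight[1].detach().numpy()

rows = []
for scale in [-2, -1, 0, 1, 2]:
    # Add the textness bias to the text vector, then re-normalize.
    biased_text = text_vector + scale * textness_bias
    biased_text /= np.linalg.norm(biased_text)
    # Cosine similarities against the normalized image vectors (N, 512).
    scores = image_vectors @ biased_text
    # Top ten results for this scale make up one row of the grid.
    rows.append(np.argsort(-scores)[:10])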

Attempts at model distillation of ViT

Given how powerful CLIP was, we decided to reduce its size using model distillation. Some of the specifics of this task are:

  • Exactly what was distilled: The CLIP model is actually two models with disjoint sets of parameters: a ViT (converts an image to a vector) and a Transformer (converts text to a vector). We decided to perform model distillation of the ViT model (~350MB with FP32 precision). The size of the student ViT model was decided to be less than 50MB.
  • How was the student model defined: The original ViT model is defined by a class named VisualTransformer. The model is created like so:
teacher_clip = VisualTransformer(input_resolution=224, patch_size=32, width=768, layers=12, heads=12, output_dim=512)

To create the student model, we reduced the width and the number of layers by a factor of two. Unsure about how many heads to use, we defined two versions: one with the same number of heads as the teacher model and one with twice as many, in order to see how increasing the number of heads would affect the performance of the model.

student_clip_12_heads = VisualTransformer(input_resolution=224, patch_size=32, width=768//2, layers=12//2, heads=12, output_dim=512)
student_clip_24_heads = VisualTransformer(input_resolution=224, patch_size=32, width=768//2, layers=12//2, heads=24, output_dim=512)

We trained student_clip_12_heads first.

  • Data used for training: We began with a dataset of ~200,000 images taken from various sources. After about 10 epochs, once we began to see some promising results, the size was increased to 800,000+ images.
  • Loss function used: The sum of a KL-divergence loss and an L1 loss was used to train the model. For the first ten epochs, the temperature was set to 4, and it was reduced to 2 later. A sketch of this loss is shown below.
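One plausible way to write this loss is sketched here. It assumes the KL term is computed between temperature-softened softmax distributions over the embedding dimensions and the L1 term directly between the raw embeddings; the temperature-squared scaling is the usual distillation convention and may differ from our exact implementation.

import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, temperature=4.0):
    # KL divergence between softened distributions over the embedding dims.
    log_p_student = F.log_softmax(student_emb / temperature, dim=-1)
    p_teacher = F.softmax(teacher_emb / temperature, dim=-1)
    kld = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    # L1 loss directly on the embedding vectors.
    l1 = F.l1_loss(student_emb, teacher_emb)
    return kld + l1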

We trained student_clip_12_heads in the beginning and then fine-tuned the weights on student_clip_24_heads. One major challenge we faced was collecting data that covers all kinds of images. The original CLIP was trained on 400 million images. While collecting a dataset of that magnitude was impractical, we focused on images from standard open source datasets. In order to circumvent the need for a huge amount of images, we also tried using Zero Shot Distillation but it didn't work.
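Putting the pieces together, one distillation step might look roughly as follows. This assumes teacher_clip holds the pretrained CLIP ViT weights and uses the distillation_loss sketched above; the data loader and learning rate are placeholders.

import torch

teacher_clip.eval()  # the teacher stays frozen throughout
optimizer = torch.optim.Adam(student_clip_12_heads.parameters(), lr=1e-4)

for images in loader:  # batches of preprocessed (B, 3, 224, 224) image tensors
    with torch.no_grad():
        teacher_emb = teacher_clip(images)
    student_emb = student_clip_12_heads(images)
    loss = distillation_loss(student_emb, teacher_emb, temperature=4.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()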

Results of the distilled ViT model in CLIP

We used the COCO test dataset to evaluate the performance of the distilled CLIP model by looking at the top 20 results for each search term. We also evaluated the mean average precision (MAP) based on the top N results of both the original CLIP and the distilled CLIP, for N ranging from 10 to 20, for each search term. For each of these values of N, we found the MAP to be roughly 0.012. Such a low value indicates that the original and distilled CLIP do not return many common results. While this may sound discouraging, the results from the distilled CLIP model do look very promising. A quick look at the top 20 results from both models explains the low MAP despite both giving semantically meaningful results.

Results for the term “bird” by teacher model
Results for the term “bird” by student model
Results for the term “girl” by teacher model
Results for the term “girl” by student model

As you can see, both sets of results make sense, although there are hardly any results in common.
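As a concrete reading of the overlap metric described above, the sketch below treats the original model's top-N results as the relevant set and computes the average precision of the distilled model's top-N ranking. This is our interpretation of the metric, and the result lists here are hypothetical.

import numpy as np

def average_precision(student_topn, teacher_topn):
    # Treat the teacher's top-N image ids as the relevant set.
    relevant = set(teacher_topn)
    hits, precisions = 0, []
    for rank, image_id in enumerate(student_topn, start=1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

# MAP over all search terms for a given N (hypothetical result dicts):
# map_at_n = np.mean([average_precision(student_results[term][:N],
#                                        teacher_results[term][:N])
#                     for term in search_terms])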

Cases where the distilled model performs poorly

While the distilled ViT CLIP model shows promising results, there are a few cases where its performance falls short of the original model:

1. It performs poorly for cases not covered in the training dataset: This is an assumption based on a few observations; we are yet to conduct tests to validate it. For example, it performs poorly for search terms like flag. Another interesting case is that of the search term flock. The distilled model learned to associate the concept of numerosity with the word flock, but in the wrong way.

Results for “flock” by the teacher and student models. The student model shows large groups of animals instead of birds.

2. Color search gets less accurate and it cannot do OCR: We also noticed that the distilled model fails to compose concepts when doing color search. For example, when searching for white cat, the distilled model returns images of cats with some white color somewhere in the image instead of images of white cats. The original model seems to compose these concepts pretty well. Another observation was the inability of the distilled model to read text in images, something the original model is adept at. We assume this is also due to the training dataset not containing many images with text.

3. It seems to lose the property of multimodality: When searching for terms like Xmas or school, the original model returns multimodal results, such as a Christmas tree, a Santa hat and cake for Xmas, and books, a school sign and a school bus for school. We do not see this property in the results from the distilled model.

What’s next

The results from distilled ViT CLIP model look quite promising. Here are a few directions we plan to explore in this project:

1. Take pointers from a recent paper by Google which also involves distillation of CLIP

2. Use a different loss function and more data: Since the optimization metric relies on cosine similarity, we only need to put a constraint on minimizing the cosine distance between the output vectors instead of the L1 distance. We would like to test whether this improves the results (see the sketch after this list).

3. Hyperparameter optimization: Things like the learning rate, the LR scheduler, architectural choices (number of heads, number of residual attention layers), temperature, etc. can be tweaked to find a better combination.
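For point 2, the L1 term in the distillation loss sketched earlier could be swapped for a cosine-distance term along these lines; whether this actually helps is exactly what we want to test.

import torch.nn.functional as F

def cosine_distillation_loss(student_emb, teacher_emb):
    # 1 - cosine similarity, averaged over the batch.
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()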

Finally we hope that this work inspires others to conduct similar experiments and share their findings with the ML community.
