Summary:
Introduction:
Current image-captioning models are still not adept at captioning images containing novel objects. In this project, we explored a test-time training paradigm and the construction of a CIDEr predictor network, which can be used together to improve captioning quality for images containing novel objects. However, the results were not very impressive, and strong baselines beat our proposed approaches.
Intuition:
- To generate captions for out-of-distribution images without reference captions, we need to solve two problems: fluency of the captions and relevance of the words to the objects present in the image.
- To begin with, we can use Conceptual Captions (CC), a dataset of millions of image-caption pairs collected from the web. However, CC suffers from linguistic drift: the style of its captions is very different from that of captions in COCO/nocaps.
- Can we use a model like ViLBERT, which has been pre-trained with a variety of auxiliary tasks to align visual and linguistic representations? Say we fine-tune ViLBERT to predict CIDEr-D for in-domain images, accounting for the linguistic drift (a sketch follows below). Since it has been pre-trained on a lot of data, will it generalize to out-of-domain images?
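A minimal sketch of the predictor idea, assuming a pretrained multimodal encoder (ViLBERT in our case) that yields a pooled joint embedding for an (image, caption) pair. The encoder is stubbed out so the sketch runs end to end; all names and dimensions are illustrative, not our actual code.

    import torch
    import torch.nn as nn

    EMB_DIM = 1024  # size of the pooled joint embedding (illustrative)

    def encode_pair(image_feats, caption_ids):
        # Stand-in for the pretrained multimodal encoder (e.g. ViLBERT's
        # pooled output for an image-caption pair). Real code would run
        # both streams through the pretrained model.
        return torch.randn(image_feats.size(0), EMB_DIM)

    class CiderPredictor(nn.Module):
        # Small regression head mapping the joint embedding to a CIDEr-D score.
        def __init__(self, emb_dim=EMB_DIM):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

        def forward(self, joint_emb):
            return self.head(joint_emb).squeeze(-1)

    # One training step: regress onto the actual CIDEr-D computed against
    # reference captions on the in-domain train set (synthetic batch of 4).
    image_feats = torch.randn(4, 36, 2048)          # e.g. 36 region features per image
    caption_ids = torch.randint(0, 30522, (4, 20))  # tokenized generated captions
    actual_cider = torch.rand(4)                    # regression targets

    model = CiderPredictor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss = nn.MSELoss()(model(encode_pair(image_feats, caption_ids)), actual_cider)
    opt.zero_grad(); loss.backward(); opt.step()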
Note: For an easier test bed, we used the DCC-COCO split for novel object captioning. The train set consists of images from only 72 of the 80 COCO categories (we will refer to it as COCO-72); the test set covers the 8 novel categories (COCO-8).
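For reference, a rough sketch of how such a split can be built from the COCO caption annotations. The eight held-out words follow the DCC paper (Hendricks et al., 2016); the file path is illustrative, and the word matching is simplified (the original split also handles plurals and multi-word synonyms).

    import json

    HELD_OUT = {"bottle", "bus", "couch", "microwave",
                "pizza", "racket", "suitcase", "zebra"}

    with open("annotations/captions_train2014.json") as f:  # illustrative path
        anns = json.load(f)["annotations"]

    # Group captions by image; keep an image in COCO-72 only if none of its
    # captions mention a held-out word (simplified exact-word matching).
    caps = {}
    for a in anns:
        caps.setdefault(a["image_id"], []).append(a["caption"].lower())

    def mentions_held_out(caption):
        return any(w.strip(".,") in HELD_OUT for w in caption.split())

    coco72 = {img for img, cs in caps.items()
              if not any(mentions_held_out(c) for c in cs)}
    coco8_candidates = set(caps) - coco72  # images that mention novel objects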
Early experiments:
- When we train a CIDEr-D predictor on the train data and evaluate on in-domain images, the correlation between predicted and actual CIDEr-D is ~0.58. On images with novel objects, it still shows a positive correlation of around 0.44. We measure correlation because, for Self-Critical (SC) training, it is enough to tell which of two captions is better; we don't need absolute values (see the sanity-check snippet after this list).
- Just using a captioning model trained on CC doesn't work very well: it gives a CIDEr-D of 0.32 on the test set (Table 1, Row 1). This is not surprising, since it suffers from linguistic drift: captions in COCO/nocaps are written by humans, while captions in CC come from <alt> tags scraped from the web.
- Using a model pre-trained on CC and fine-tuned on COCO works better: it gives a CIDEr-D of 0.616 on the test set (Table 1, Row 2). We can now try to improve it further using our CIDEr-D predictor model.
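Since only the ranking of captions matters for SC training, rank (Spearman) correlation is at least as informative as Pearson here. A minimal sanity-check sketch, with synthetic scores standing in for the predictor's per-caption outputs:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    rng = np.random.default_rng(0)
    actual = rng.random(500)                              # actual CIDEr-D per caption
    predicted = actual + 0.3 * rng.standard_normal(500)   # noisy stand-in predictions

    r, _ = pearsonr(predicted, actual)      # linear agreement
    rho, _ = spearmanr(predicted, actual)   # ranking agreement (what SC needs)
    print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")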
Observation 1: The CIDEr predictor model isn't working as well as we thought
- In our initial experiments, we observed that the CIDEr-D predictor model gives improvements over the base model when used as the reward for self-critical training (Approach A). This seemed encouraging, since the predictor could then be used at test time to improve captions for unseen images. It was especially encouraging because Approach A appeared to work better than self-critical training on the train set with ground-truth reference captions (Approach B).
- However, we recently found that we had not been training Approach B (base model + self-critical training with ground-truth reference captions) properly: we had reused Approach A's hyper-parameters for it. Once we fixed this, Approach B outperforms Approach A. Moreover, running self-critical training with the CIDEr predictor model on top of Approach B does not give any substantial gains. (The sketch below shows the self-critical update shared by both approaches.)
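For concreteness, a sketch of the self-critical update used by both approaches; only the reward function differs between them. sample_caption, greedy_caption, and reward_fn are assumed interfaces, not our actual code.

    import torch

    def scst_loss(model, image, reward_fn):
        # reward_fn(image, caption) -> per-example reward:
        #   Approach A: score from the CIDEr-D predictor model
        #   Approach B: actual CIDEr-D vs. ground-truth reference captions
        sampled, logp = model.sample_caption(image)  # sample + its total log-prob
        with torch.no_grad():
            baseline = model.greedy_caption(image)   # greedy decode as baseline
            advantage = reward_fn(image, sampled) - reward_fn(image, baseline)
        # REINFORCE with the greedy baseline: raise the likelihood of samples
        # that score higher than the greedy caption.
        return -(advantage * logp).mean()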
Table 1 shows all the results discussed so far:
- Green tags are the baselines and the oracle.
- Red tags are faulty approaches.
- Orange tags are our experiments.