Summary:

Introduction:

Current image-captioning models are still not adept at captioning images that contain novel objects. In this project, we explored a test-time training paradigm and the construction of a CIDEr predictor network, which can be used together to improve captioning quality for images containing novel objects. However, the results were not very impressive, and strong baselines beat our proposed approaches.
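
To make the idea concrete, here is a minimal sketch of how a CIDEr predictor and test-time training could be combined: the predictor scores candidate captions for a single test image, and the captioner is briefly fine-tuned to increase that predicted score via a REINFORCE-style update. All component names and interfaces (`captioner.sample`, `captioner.log_prob`, `cider_predictor`) are illustrative assumptions, not the project's actual code.

```python
import torch

def test_time_adapt(captioner, cider_predictor, image, steps=5, lr=1e-5, k=5):
    """Adapt the captioner on one test image by maximizing the predicted CIDEr
    of its own sampled captions (a self-critical, test-time training loop)."""
    optimizer = torch.optim.Adam(captioner.parameters(), lr=lr)
    for _ in range(steps):
        captions = captioner.sample(image, num_samples=k)   # k candidate captions
        with torch.no_grad():
            rewards = cider_predictor(image, captions)       # predicted CIDEr per caption
        baseline = rewards.mean()                            # variance-reduction baseline
        log_probs = captioner.log_prob(image, captions)      # log p(caption | image)
        loss = -((rewards - baseline) * log_probs).mean()    # REINFORCE-style objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return captioner.sample(image, num_samples=1)[0]         # caption after adaptation
```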

Intuition:

Note: For an easier test bed, we used the DCC-COCO split for novel object captioning. The train set consists of images from only 72 of the 80 COCO categories (we will refer to it as COCO-72). The test set consists of images from the remaining 8 novel categories (COCO-8).
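
As a rough illustration of how such a split can be constructed, the sketch below removes from the training pool every image that contains one of the held-out classes, using the standard pycocotools API. The list of 8 classes is the one commonly cited for the DCC split and should be treated as an assumption if the project's exact split differs; the real DCC split is defined by released image lists rather than this instance-annotation filter.

```python
from pycocotools.coco import COCO

# Commonly cited DCC held-out classes (assumption; verify against the actual split).
HELD_OUT = ["bottle", "bus", "couch", "microwave", "pizza",
            "tennis racket", "suitcase", "zebra"]

coco = COCO("annotations/instances_train2014.json")  # path is illustrative
held_out_cat_ids = coco.getCatIds(catNms=HELD_OUT)

# Images containing at least one held-out category. Note that getImgIds with a
# list of catIds returns the intersection, so take per-category unions instead.
novel_img_ids = set()
for cat_id in held_out_cat_ids:
    novel_img_ids.update(coco.getImgIds(catIds=[cat_id]))

# COCO-72 training images: everything with none of the 8 held-out classes.
coco72_img_ids = set(coco.getImgIds()) - novel_img_ids
print(f"{len(coco72_img_ids)} training images remain after removing novel-object images")
```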

Early experiments:

Observation 1: The CIDEr predictor model isn't working as well as we thought it was
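
One way to quantify this kind of observation is to check how well the predictor's scores track the true CIDEr scores on held-out (image, caption) pairs; the sketch below is an assumed sanity check of that form, not the project's actual evaluation, and the data format is illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_predictor(cider_predictor, images, captions, true_cider):
    """Compare predicted CIDEr against ground-truth CIDEr on held-out pairs."""
    preds = np.array([float(cider_predictor(img, cap))
                      for img, cap in zip(images, captions)])
    true = np.asarray(true_cider)
    return {
        "pearson": pearsonr(preds, true)[0],    # linear correlation
        "spearman": spearmanr(preds, true)[0],  # rank correlation, which matters for reranking
    }
```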

Table 1 shows all the results discussed so far:

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/a2818bfb-292b-4ec2-9f87-7e9ba25376e9/Table1.png