<aside> <img src="https://img.icons8.com/ios/250/000000/bookmark-ribbon.png" alt="https://img.icons8.com/ios/250/000000/bookmark-ribbon.png" width="40px" /> This was our deep learning class project. With three other amazing teammates, we built a pipeline consisting of two models: the first, Show and Tell (Im2Txt), generates a caption from a given image; the second, Txt2Im, produces an image from a given caption. Beyond training and implementing the two models independently, chaining them together let us compare a given image with the pipeline-generated one and comment on how much information a caption of just a few words actually conveys.

</aside>

gizemt/Image2Caption2Image

Details

There have been vast improvements in machine translation applications with the use of encoder-decoder RNNs. An encoder RNN converts a sentence (or phrase) in the source language into a rich, fixed-length vector representation, and a decoder RNN then generates the corresponding sentence (or phrase) in the target language from that vector. In the Show-and-Tell caption-generating model, the encoder RNN is replaced with a CNN, and an image is the input to be encoded instead of a sentence. The CNN is pre-trained on an image classification task, and its last hidden layer, which provides a rich representation of the input image, is fed into the decoder RNN to generate captions.
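A minimal sketch of this encoder-decoder idea (PyTorch-style; a ResNet backbone stands in here for the classification CNN, and all class and parameter names are illustrative rather than taken from our repository):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Im2TxtSketch(nn.Module):
    """Show-and-Tell-style captioner: CNN encoder + LSTM decoder (illustrative only)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights="DEFAULT")    # pre-trained classification CNN
        cnn.fc = nn.Identity()                      # keep the last hidden layer, drop the classifier
        self.cnn = cnn
        self.img_proj = nn.Linear(2048, embed_dim)  # map image features into word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode the image once and prepend it as the first "token" seen by the decoder LSTM.
        feats = self.img_proj(self.cnn(images))              # (B, embed_dim)
        words = self.embed(captions)                         # (B, T, embed_dim)
        seq = torch.cat([feats.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                              # per-step vocabulary logits
```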

RNNs (LSTMs, to be more specific) can generate new sequences that are representative of their training sequences. In many applications these training and generated sequences are text; in others they can be images. If an image is represented as a sequence of patches drawn one after another, a generative LSTM can draw that sequence of patches while taking the sequence of words in the caption as input, as proposed in the alignDRAW model [2].
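The caption-conditioned drawing idea can be sketched roughly as follows (again PyTorch-style and heavily simplified: the real alignDRAW model is a variational, attention-based model, whereas this toy version only conveys the "draw a canvas step by step, conditioned on the caption" structure):

```python
import torch
import torch.nn as nn

class Txt2ImSketch(nn.Module):
    """Toy caption-conditioned drawer: an LSTM adds one patch to a canvas per step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_size=32, n_steps=16):
        super().__init__()
        self.n_steps, self.img_size = n_steps, img_size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # caption encoder
        self.draw_lstm = nn.LSTMCell(hidden_dim, hidden_dim)               # sequential drawing steps
        self.to_patch = nn.Linear(hidden_dim, img_size * img_size)         # one additive write per step

    def forward(self, captions):
        # Summarize the caption into a single conditioning vector.
        _, (txt_h, _) = self.text_lstm(self.embed(captions))
        cond = txt_h[-1]                                     # (B, hidden_dim)
        h = torch.zeros_like(cond)
        c = torch.zeros_like(cond)
        canvas = torch.zeros(cond.size(0), self.img_size ** 2, device=cond.device)
        # Each step reads the caption summary and adds another "patch" to the canvas.
        for _ in range(self.n_steps):
            h, c = self.draw_lstm(cond, (h, c))
            canvas = canvas + self.to_patch(h)
        return torch.sigmoid(canvas).view(-1, self.img_size, self.img_size)
```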

In this project, we combined these two models end-to-end and trained them on the Flickr8k dataset. Since both models in the original papers were trained on the MSCOCO dataset, we had to change the structure of the networks and train them from scratch (on top of the pre-trained CNN) on Flickr8k. We trained the two models separately and independently, but on the same dataset, and then fed the caption generated by the first (caption-generating) model into the second (image-drawing) model.
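Conceptually, the pipeline is just a composition of the two trained models. The glue code below is a hypothetical sketch: the `generate` and `decode` helpers are assumed for illustration, not names from our repository.

```python
def image_to_caption_to_image(image, im2txt_model, txt2im_model, vocab):
    # 1) Caption the input image (assumed greedy/beam-search helper on the captioner).
    token_ids = im2txt_model.generate(image)
    caption = " ".join(vocab.decode(token_ids))        # human-readable caption (assumed vocab helper)
    # 2) Feed the same token sequence to the drawing model to reconstruct an image.
    reconstruction = txt2im_model(token_ids.unsqueeze(0))
    # 3) Compare `image` and `reconstruction` to see how much the short caption preserved.
    return caption, reconstruction
```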

Image → Caption → Image