Neural Machine Translation

In the paper by Sutskever et al., an encoder and a decoder mechanism are combined to obtain an end-to-end neural machine translation model.

Encoder: $h_{t}=f_{enc}(h_{t-1},x_{t})$

Decoder: $s_{t}=f_{dec}(s_{t-1},y_{t-1},c)$ where $c=h_{T_{x}}$

Problem with this architecture: Here $c$ is a static fixed-size vector that must store all the information of the source sentence, which becomes a bottleneck when the source sentence is long. The context vector also provides no step-specific information to the decoder: each and every decoder step receives the same summary of the source sentence, which hurts the task.
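The two recurrences above can be sketched as follows. This is a minimal illustration with tanh cells; the weight matrices, dimensions, and toy inputs are all hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x = 4, 3  # hidden and input sizes (illustrative)
W_hh, W_hx = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))
W_ss, W_sy, W_sc = (rng.normal(size=(d_h, d_h)) for _ in range(3))

def f_enc(h_prev, x_t):
    # h_t = f_enc(h_{t-1}, x_t)
    return np.tanh(W_hh @ h_prev + W_hx @ x_t)

def f_dec(s_prev, y_prev, c):
    # s_t = f_dec(s_{t-1}, y_{t-1}, c); c is the SAME vector at every step
    return np.tanh(W_ss @ s_prev + W_sy @ y_prev + W_sc @ c)

# Encode a toy source sentence of T_x = 5 tokens
xs = rng.normal(size=(5, d_x))
h = np.zeros(d_h)
for x_t in xs:
    h = f_enc(h, x_t)
c = h  # static context: the final encoder hidden state h_{T_x}

# Every decoder step conditions on the same c — this is the bottleneck
s, y_prev = np.zeros(d_h), np.zeros(d_h)
for _ in range(3):
    s = f_dec(s, y_prev, c)
```

However long the source sentence, everything the decoder ever sees of it is the single vector `c`.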

NMT with Attention

To address the problem documented above, the idea is to build a new context vector at each decoder step. Instead of a static fixed-size vector, we now have a dynamic vector that is recomputed at every decoder step. This way the information bottleneck is bypassed.

Encoder: $h_{t}=f_{enc}(h_{t-1},x_{t})$

Decoder: $s_{t}=f_{dec}(s_{t-1},y_{t-1},\boxed{c_{t}})$

The concept of attention kicks in while we build the dynamic context vector. The first thing we do is compute an alignment score, a scalar that measures the similarity between the previous decoder hidden state and each of the encoder annotations.

After computing the alignment scores, we normalize them with a softmax so that they form a probability distribution over the encoder states.

$$ e_{t,i}=f_{att}(s_{t-1},h_{i})\\ a_{t,i}=\mathrm{softmax}(e_{t,i})\\ c_{t}=\sum_{i=1}^{T_{x}}a_{t,i}h_{i} $$

Where:

- $e_{t,i}$ is the alignment score between decoder step $t$ and encoder annotation $h_{i}$
- $a_{t,i}$ is the attention weight obtained by normalizing the scores
- $c_{t}$ is the dynamic context vector for decoder step $t$

This model overcomes the problem of bottlenecking.
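The three equations above can be sketched in a few lines. For simplicity this assumes a dot-product $f_{att}$ (the original additive attention uses a small feed-forward network instead); the shapes and random inputs are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, d = 5, 4
H = rng.normal(size=(T_x, d))   # encoder annotations h_1 .. h_{T_x}
s_prev = rng.normal(size=d)     # previous decoder state s_{t-1}

e = H @ s_prev                  # alignment scores e_{t,i} = f_att(s_{t-1}, h_i)
a = np.exp(e - e.max())
a /= a.sum()                    # softmax -> attention weights a_{t,i}
c_t = a @ H                     # context vector c_t = sum_i a_{t,i} h_i

assert np.isclose(a.sum(), 1.0)  # weights form a probability distribution
```

Note that `c_t` depends on `s_prev`, so a fresh context vector is produced at every decoder step.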

Image Caption with Attention

We should note here that the attention mechanism does not care about the format of the input data. In theory, we can use it for any modality, provided the idea remains the same.

For image captioning, we have a similar encoder-decoder architecture. The encoder is a CNN that extracts features from the images, while the decoder remains a recurrent model that builds the caption one step at a time.
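To see that the same attention step carries over, we can treat each spatial location of a conv feature map as one annotation vector and attend over them exactly as before. The feature-map shape, decoder state, and dot-product scoring below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
Hf, Wf, d = 7, 7, 4
feat = rng.normal(size=(Hf, Wf, d))    # e.g. a conv layer's output
A = feat.reshape(-1, d)                # 49 annotation vectors, like h_1..h_{T_x}
s_prev = rng.normal(size=d)            # decoder (RNN) state building the caption

e = A @ s_prev                         # score each image region (dot-product f_att)
a = np.exp(e - e.max())
a /= a.sum()                           # softmax over the 49 regions
c_t = a @ A                            # context: a weighted "glimpse" of the image
```

At each captioning step the weights `a` shift over the image, so the decoder looks at different regions while emitting different words.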