Motivation

Encoder-only, decoder-only, and encoder-decoder are all variants of the transformer architecture. In Attention Is All You Need, the transformer was introduced as an encoder-decoder model for machine translation. Since then, it has become the de facto architecture for almost all AI tasks. Some tasks are well suited to encoder-only models, others to decoder-only models, and still others to encoder-decoder models. In this blog post, I explain their differences.

Encoder-only models use a bidirectional approach to understand context from both the left and the right of a given token. In contrast, decoder-only models process text from left to right and are particularly good at text generation tasks. Encoder-decoder models combine both approaches, using an encoder to understand the input and a decoder to generate the output. They’re great at sequence-to-sequence tasks such as machine translation.
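
To make these attention patterns concrete, here is a minimal NumPy sketch of the boolean masks each architecture applies. This is my own illustration, not code from any of these models; the function names are made up, and real implementations also handle padding and batching.

```python
import numpy as np

def bidirectional_mask(seq_len):
    """Encoder-style self-attention: every token can attend to every other token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len):
    """Decoder-style self-attention: each token attends only to itself and tokens to its left."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def cross_attention_mask(tgt_len, src_len):
    """Encoder-decoder cross-attention: every decoder position can attend to the full encoder output."""
    return np.ones((tgt_len, src_len), dtype=bool)

print(bidirectional_mask(4).astype(int))
# [[1 1 1 1]
#  [1 1 1 1]
#  [1 1 1 1]
#  [1 1 1 1]]
print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```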

Comparison

Let’s compare the three architectures side by side in the table below.

| | Encoder-only | Decoder-only | Encoder-decoder |
| --- | --- | --- | --- |
| Example models | BERT, RoBERTa, DeBERTa | GPT, Llama, Claude | T5, BART, original transformer |
| Tasks | Text understanding tasks (e.g. classification, NER, and QA) | Text generation tasks | Sequence-to-sequence tasks (e.g. machine translation and summarization) |
| Attention | Bidirectional (tokens on both the left and the right) | Causal (only tokens on the left) | Encoder: bidirectional. Decoder: causal + cross-attention (decoder tokens can attend to all encoder outputs, i.e. the full input context) |
| Training method | Masked language modeling (MLM): randomly mask some tokens in the input, then predict them | Causal language modeling (CLM): predict the next token based on all previous tokens in the sequence | Sequence-to-sequence reconstruction: the decoder learns to generate the target sequence from the encoder's representation of the input |
| Training | Processes the entire input sequence in parallel | Processes the entire input sequence in parallel, using teacher forcing and causal masking | Encoder processes the entire input sequence in parallel; decoder processes the entire target sequence in parallel, using teacher forcing and causal masking |
| Inference | Single forward pass | Generates output sequentially, one token at a time (because each new token depends on all previously generated tokens) | Single forward pass through the encoder (output is cached); decoder generates output sequentially, one token at a time |
| Strengths | Understanding tasks, contextualized embeddings | Natural text generation, few-shot learning | Explicit input-output mapping |
| Weaknesses | Doesn’t generate text | No bidirectional understanding during generation | More complex, harder to scale |
| Popularity | Still used for language understanding tasks | Dominant for most applications | Less popular, often replaced by decoder-only |
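
To make the training objectives in the table more concrete, here is a toy sketch of how inputs and targets are typically constructed for each one. This is my own illustration with made-up tokens and helper names, not the actual preprocessing used by BERT, GPT, or T5; note how the shifted decoder input in the seq2seq case is what "teacher forcing" refers to.

```python
import random

def mlm_example(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    """Masked language modeling (encoder-only): randomly mask tokens; the model predicts the originals."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)    # loss is computed only at masked positions
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets

def clm_example(tokens):
    """Causal language modeling (decoder-only): predict the next token at every position."""
    return tokens[:-1], tokens[1:]   # target is the input shifted left by one

def seq2seq_example(source_tokens, target_tokens, bos="<s>"):
    """Sequence-to-sequence (encoder-decoder): encoder reads the source; decoder is teacher-forced on the target."""
    encoder_input = source_tokens
    decoder_input = [bos] + target_tokens[:-1]   # shifted target fed to the decoder (teacher forcing)
    decoder_target = target_tokens               # decoder predicts the unshifted target
    return encoder_input, decoder_input, decoder_target

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(mlm_example(sentence))
print(clm_example(sentence))
print(seq2seq_example(["le", "chat"], ["the", "cat"]))
```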

In practice, decoder-only models like GPT have become the dominant architecture for most AI tasks, including those originally handled by encoder-only and encoder-decoder models.

In the future, I hope to cover some of the concepts mentioned, including teacher forcing and causal masking.

References