Motivation

Encoder-only, decoder-only, and encoder-decoder are all variants of the transformer architecture. In Attention Is All You Need, the transformer was introduced as an encoder-decoder model for machine translation. Since then, it has become the de facto architecture for almost all AI tasks. Some tasks are well suited to encoder-only models, others to decoder-only models, and still others to encoder-decoder models. In this blog post, I explain their differences.

Encoder-only models use a bidirectional approach to understand context from both the left and the right of a given token. In contrast, decoder-only models process text from left to right and are particularly good at text generation tasks. Encoder-decoder models combine both approaches, using an encoder to understand the input and a decoder to generate the output. They’re great at sequence-to-sequence tasks such as machine translation.
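
To make these attention patterns concrete, here is a minimal NumPy sketch of the boolean masks each architecture applies. This is my own illustration, not code from any of these models; the function names are made up, and real implementations also handle padding and batching.

```python
import numpy as np

def bidirectional_mask(seq_len):
    """Encoder-style self-attention: every token can attend to every other token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len):
    """Decoder-style self-attention: each token attends only to itself and tokens to its left."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def cross_attention_mask(tgt_len, src_len):
    """Encoder-decoder cross-attention: every decoder position can attend to the full encoder output."""
    return np.ones((tgt_len, src_len), dtype=bool)

print(bidirectional_mask(4).astype(int))
# [[1 1 1 1]
#  [1 1 1 1]
#  [1 1 1 1]
#  [1 1 1 1]]
print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```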

Comparison

Let’s compare the three architectures side by side in the table below.

| | Encoder-only | Decoder-only | Encoder-decoder |
| --- | --- | --- | --- |
| Example models | BERT, RoBERTa, DeBERTa | GPT, Llama, Claude | T5, BART, original transformer |
| Tasks | Text understanding tasks (e.g. classification, NER, and QA) | Text generation tasks | Sequence-to-sequence tasks (e.g. machine translation and summarization) |
| Attention | Bidirectional (tokens on both the left and the right) | Causal (only tokens on the left) | Encoder: bidirectional. Decoder: causal + cross-attention (decoder tokens can attend to all encoder outputs, i.e. the full input context) |
| Training method | Masked language modeling (MLM): randomly mask some tokens in the input, then predict them | Causal language modeling (CLM): predict the next token based on all previous tokens in the sequence | Sequence-to-sequence reconstruction: the decoder learns to generate the target sequence from the encoder's representation of the input |
| Training | Processes the entire input sequence in parallel | Processes the entire input sequence in parallel, using teacher forcing and causal masking | Encoder processes the entire input sequence in parallel; decoder processes the entire target sequence in parallel, using teacher forcing and causal masking |
| Inference | Single forward pass | Generates output sequentially, one token at a time (because each new token depends on all previously generated tokens) | Single forward pass through the encoder (output is cached); decoder generates output sequentially, one token at a time |
| Strengths | Understanding tasks, contextualized embeddings | Natural text generation, few-shot learning | Explicit input-output mapping |
| Weaknesses | Doesn’t generate text | No bidirectional understanding during generation | More complex, harder to scale |
| Popularity | Still used for language understanding tasks | Dominant for most applications | Less popular, often replaced by decoder-only |
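
To make the training objectives in the table more concrete, here is a toy sketch of how inputs and targets are typically constructed for each one. This is my own illustration with made-up tokens and helper names, not the actual preprocessing used by BERT, GPT, or T5; note how the shifted decoder input in the seq2seq case is what "teacher forcing" refers to.

```python
import random

def mlm_example(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    """Masked language modeling (encoder-only): randomly mask tokens; the model predicts the originals."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)    # loss is computed only at masked positions
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets

def clm_example(tokens):
    """Causal language modeling (decoder-only): predict the next token at every position."""
    return tokens[:-1], tokens[1:]   # target is the input shifted left by one

def seq2seq_example(source_tokens, target_tokens, bos="<s>"):
    """Sequence-to-sequence (encoder-decoder): encoder reads the source; decoder is teacher-forced on the target."""
    encoder_input = source_tokens
    decoder_input = [bos] + target_tokens[:-1]   # shifted target fed to the decoder (teacher forcing)
    decoder_target = target_tokens               # decoder predicts the unshifted target
    return encoder_input, decoder_input, decoder_target

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(mlm_example(sentence))
print(clm_example(sentence))
print(seq2seq_example(["le", "chat"], ["the", "cat"]))
```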

In practice, decoder-only models like GPT have become the dominant architecture for most AI tasks, including those originally handled by encoder-only and encoder-decoder models.

In the future, I hope to cover some of the concepts mentioned, including teacher forcing and causal masking.

References