Encoder-only, decoder-only, and encoder-decoder are all variants of the transformer architecture. In Attention Is All You Need, the transformer was introduced as an encoder-decoder for machine translation. Since then, the transformer has become the de facto model for almost all AI tasks. Some tasks are best served by encoder-only models, others by decoder-only models, and still others by encoder-decoder models. In this blog post, I explain their differences.
Encoder-only models use a bidirectional approach to understand context from both the left and the right of a given token. In contrast, decoder-only models process text from left to right and are particularly good at text generation tasks. Encoder-decoder models combine both approaches, using an encoder to understand the input and a decoder to generate the output. They’re great at sequence-to-sequence tasks such as machine translation.
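To make the attention patterns concrete, here’s a minimal sketch (plain NumPy, function names are mine) of the masks that implement them: a bidirectional mask lets every token attend to every other token, while a causal mask blocks attention to anything on the right.

```python
import numpy as np

def bidirectional_mask(seq_len: int) -> np.ndarray:
    # Encoder-style attention: every token can attend to every token,
    # so nothing is masked out.
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len: int) -> np.ndarray:
    # Decoder-style attention: token i can only attend to tokens 0..i,
    # i.e. itself and everything to its left.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(bidirectional_mask(4).astype(int))
# [[1 1 1 1]
#  [1 1 1 1]
#  [1 1 1 1]
#  [1 1 1 1]]

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

An encoder-decoder model uses both: the bidirectional mask in the encoder, the causal mask in the decoder, plus cross-attention from the decoder to the encoder’s output.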
Let’s compare the three architectures side by side.
| | Encoder-only | Decoder-only | Encoder-decoder |
|---|---|---|---|
| Example models | BERT, RoBERTa, DeBERTa | GPT, Llama, Claude | T5, BART, original transformer |
| Tasks | Text understanding (e.g. classification, NER, QA) | Text generation | Sequence-to-sequence (e.g. machine translation, summarization) |
| Attention | Bidirectional (tokens on both the left and the right) | Causal (only tokens on the left) | Encoder: bidirectional. Decoder: causal + cross-attention (decoder tokens can attend to all encoder outputs, i.e. the full input context) |
| Training objective | Masked language modeling (MLM): randomly mask some tokens in the input, then predict them | Causal language modeling (CLM): predict the next token from all previous tokens in the sequence | Sequence-to-sequence reconstruction: corrupt or transform the input and train the decoder to produce the target (e.g. denoising in BART, span corruption in T5) |
| Training | Processes the entire input sequence in parallel | Processes the entire input sequence in parallel, using teacher forcing and causal masking | Encoder processes the entire input in parallel; decoder processes the entire target in parallel, using teacher forcing and causal masking |
| Inference | Single forward pass | Generates output sequentially, one token at a time (each new token depends on all previously generated tokens) | Single forward pass through the encoder (output is cached); the decoder then generates output sequentially, one token at a time |
| Strengths | Understanding tasks, contextualized embeddings | Natural text generation, few-shot learning | Explicit input-output mapping |
| Weaknesses | Doesn’t generate text | No bidirectional understanding during generation | More complex, harder to scale |
| Popularity | Still used for language understanding tasks | Dominant for most applications | Less popular, often replaced by decoder-only models |
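A few sketches make these rows concrete. First, the three families in code, using the Hugging Face transformers pipeline API. The checkpoints bert-base-uncased, gpt2, and t5-small are just small, commonly used examples I picked for illustration; swap in whatever you prefer (you’ll need transformers plus a backend such as PyTorch installed).

```python
from transformers import pipeline

# Encoder-only (BERT): an understanding task, answered in a single forward pass.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# Decoder-only (GPT-2): open-ended generation, one token at a time.
generate = pipeline("text-generation", model="gpt2")
print(generate("The transformer architecture", max_new_tokens=20)[0]["generated_text"])

# Encoder-decoder (T5): a sequence-to-sequence task (machine translation).
translate = pipeline("translation_en_to_fr", model="t5-small")
print(translate("The transformer was introduced for machine translation.")[0]["translation_text"])
```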
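The two training objectives are also easy to show with toy data. Below is a minimal sketch (plain Python, made-up tokens) of how labels are built for masked language modeling versus causal language modeling. Note that both objectives train on every position of the sequence in one parallel pass, which is what the teacher forcing and causal masking entries in the training row refer to.

```python
MASK = "[MASK]"
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Masked language modeling (encoder-only): hide some tokens (~15% at random
# in real training; positions 1 and 4 by hand here) and predict the originals
# using context from BOTH sides.
masked_positions = {1, 4}
mlm_inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
mlm_labels = {i: tokens[i] for i in masked_positions}
print(mlm_inputs)   # ['the', '[MASK]', 'sat', 'on', '[MASK]', 'mat']
print(mlm_labels)   # {1: 'cat', 4: 'the'}

# Causal language modeling (decoder-only): every position predicts the NEXT
# token, using only the tokens to its left.
clm_inputs = tokens[:-1]
clm_labels = tokens[1:]
for i, tgt in enumerate(clm_labels):
    print(f"{tokens[: i + 1]} -> predict {tgt!r}")
```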
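Inference is where the architectures really diverge, and the easiest way to see it is as a decoding loop. In the sketch below, encoder and decoder are random stand-ins I made up (not a real API); the control flow is the point: the encoder runs exactly once over the input and its output is cached, while the decoder must run once per generated token, because each new token depends on everything generated so far.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, BOS_ID, EOS_ID = 100, 0, 1

def encoder(input_ids):
    # Stand-in for a real encoder: one parallel pass over the whole input,
    # producing a "contextual" vector per input token.
    return rng.normal(size=(len(input_ids), 16))

def decoder(output_ids, memory):
    # Stand-in for a real decoder: in a real model this would apply causal
    # self-attention over output_ids and cross-attention over memory.
    # Here it just returns random next-token logits of the right shape.
    return rng.normal(size=(len(output_ids), VOCAB_SIZE))

def generate(input_ids, max_new_tokens=10):
    memory = encoder(input_ids)            # encoder runs exactly once (cached)
    output_ids = [BOS_ID]
    for _ in range(max_new_tokens):        # decoder runs once per new token
        logits = decoder(output_ids, memory)
        next_id = int(logits[-1].argmax())  # greedy decoding: most likely token
        output_ids.append(next_id)
        if next_id == EOS_ID:
            break
    return output_ids

print(generate([5, 6, 7, 8]))
```

A decoder-only model follows the same loop, just without the encoder and the cross-attention. This sequential loop is why generation is so much more expensive than the single forward pass an encoder-only model needs.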
In truth, decoder-only models like GPT have become the dominant architecture for most AI tasks, including the tasks originally handled by encoder-only and encoder-decoder models.
In the future, I hope to cover some of the concepts mentioned, including teacher forcing and causal masking.