A transformer processes its input step by step:
First, the text is split into tokens, each token is mapped to an embedding, and positional encodings are added to preserve word order, since attention on its own is order-agnostic.

In the encoder, stacked layers of multi-head self-attention and position-wise feed-forward networks transform these embeddings into contextual representations: each token "looks at", or attends to, every other token to capture its meaning in context.

The decoder then combines its own self-attention over previously generated tokens with cross-attention to the encoder outputs to predict the next token.

Stacking many encoder and decoder layers lets transformers build deep contextual understanding, which is why they form the backbone of modern language and vision models, including LLMs.
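The first two steps, adding sinusoidal positional encodings to embeddings and running scaled dot-product self-attention, can be sketched in NumPy. This is a minimal single-head illustration, not a full multi-head implementation; the projection matrices `Wq`, `Wk`, `Wv` are randomly initialized stand-ins for learned weights.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: sin at even dims, cos at odd dims."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)        # stabilize exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # every token scores every token
    return softmax(scores) @ v                     # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
# Embeddings enriched with positional information before attention.
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextual vector per token
```

A real encoder layer would also add residual connections, layer normalization, and a feed-forward network after attention, and would split the projections across several heads.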