<aside>

**TL;DR:** Attention lets a model weigh how relevant every token is to every other token. Transformers build an entire encoder–decoder architecture out of this mechanism, which is why they capture long-range context that older networks struggled with.

</aside>

Imagine you’re at a busy party, focusing on a friend’s voice amidst the chatter - the so-called “cocktail party effect.” In ML, this ability to selectively “listen” is exactly what attention mechanisms enable. Early neural networks struggled with long sentences and complex relationships, and the concept of “attention” first emerged in machine translation, where models had to decide which words to “focus on” to translate a sentence correctly.

🔸What Is Attention?

Imagine reading a complex paragraph: instead of trying to remember every single word equally, you naturally focus on the most relevant sentences to understand the meaning. In Transformers, each word (or token) is transformed into three vectors:

- **Query (Q)** - what this token is looking for in the other tokens.
- **Key (K)** - what this token offers for other tokens to match against.
- **Value (V)** - the actual information this token passes along once it is attended to.
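To make this concrete, here is a minimal NumPy sketch of the three projections. The weight matrices `W_q`, `W_k`, `W_v` and all dimensions are made-up toy values for illustration; in a real model the matrices are learned parameters.

```python
# One token's embedding projected into query, key, and value vectors.
# W_q, W_k, W_v are random placeholders standing in for learned weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                    # toy embedding and projection sizes

x = rng.normal(size=(d_model,))        # one token's embedding
W_q = rng.normal(size=(d_model, d_k))  # learned query projection (placeholder)
W_k = rng.normal(size=(d_model, d_k))  # learned key projection (placeholder)
W_v = rng.normal(size=(d_model, d_k))  # learned value projection (placeholder)

q, k, v = x @ W_q, x @ W_k, x @ W_v    # the token's query, key, and value
print(q.shape, k.shape, v.shape)       # (4,) (4,) (4,)
```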

The model computes how “compatible” a query is with each key using a dot product, scales the result by √d_k (to keep the dot products from growing so large that the softmax saturates), and then applies a softmax function. This converts raw scores into probabilities (attention weights) that determine how much each word should contribute when forming a new representation. Mathematically, it looks like this:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
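Here is a toy NumPy implementation of that formula - a sketch rather than a production kernel, with random inputs and shapes chosen purely for illustration:

```python
# Scaled dot-product attention: scores = QK^T / sqrt(d_k),
# weights = softmax(scores), output = weights @ V.
# Q, K, V have shape (seq_len, d_k).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V, weights                     # weighted sum of values

Q = K = V = np.random.default_rng(1).normal(size=(5, 4))  # 5 tokens, d_k = 4
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (5, 4): one new representation per token
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors - precisely the “how much should each word contribute” intuition above.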

This self-attention mechanism means every word can “look at” every other word, capturing context, syntax, and even distant relationships that older models struggled to remember.

🔸Transformer Architecture

Encoder–Decoder Structure

The Transformer is built on a classic encoder–decoder model:

- **Encoder** - reads the entire input sequence and turns it into a stack of context-rich representations.
- **Decoder** - generates the output sequence one token at a time, attending both to what it has produced so far and to the encoder’s representations.
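As a rough sketch of this structure, PyTorch ships a ready-made `torch.nn.Transformer` module. The hyperparameters below mirror the base configuration from the original paper; the batch and sequence sizes are arbitrary toy values:

```python
# Minimal encoder-decoder sketch using PyTorch's built-in Transformer.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)  # source: batch of 2, 10 tokens each
tgt = torch.randn(2, 7, 512)   # target generated so far: 7 tokens each
out = model(src, tgt)          # encoder reads src; decoder attends to it
print(out.shape)               # torch.Size([2, 7, 512])
```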

Both parts are built from stacked layers that use attention and feed-forward networks, but they differ slightly:

- **Encoder layers** contain self-attention followed by a position-wise feed-forward network.
- **Decoder layers** add two twists: their self-attention is *masked* so a position cannot peek at future tokens, and an extra cross-attention sub-layer lets them attend to the encoder’s output.
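The masking twist is easy to see in code. PyTorch’s `nn.Transformer.generate_square_subsequent_mask` builds the causal mask the decoder uses; a small sketch:

```python
# The causal ("subsequent") mask behind masked decoder self-attention:
# position i may only attend to positions <= i.
import torch
import torch.nn as nn

mask = nn.Transformer.generate_square_subsequent_mask(4)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
# The -inf entries are added to the attention scores before the softmax,
# driving the weights on future tokens to zero - this is what separates
# decoder self-attention from the encoder's unmasked version.
```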

Multi-Head Attention