A Revolution in One Equation
In June 2017, a team at Google published "Attention Is All You Need," a paper that would reshape artificial intelligence. At its heart was a single mechanism — self-attention — captured in one equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
This equation is the engine of every large language model, from GPT to Claude to Gemini. Understanding it — truly understanding it, matrix dimension by matrix dimension — is essential for anyone who wants to grasp how modern AI works.
The Setup: Tokens as Vectors
A Transformer begins by converting each input token (word or subword) into a vector in $\mathbb{R}^{d_{\text{model}}}$. For a sequence of $n$ tokens, we stack these vectors into a matrix:

$$X \in \mathbb{R}^{n \times d_{\text{model}}}$$

Each row of $X$ is one token's embedding — a high-dimensional point in semantic space.
Queries, Keys, and Values
Self-attention creates three different "views" of the input by multiplying $X$ with learned weight matrices:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

where:
- $W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$ — query projection
- $W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ — key projection
- $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ — value projection
This gives us:
- $Q \in \mathbb{R}^{n \times d_k}$ — the queries: "what am I looking for?"
- $K \in \mathbb{R}^{n \times d_k}$ — the keys: "what do I contain?"
- $V \in \mathbb{R}^{n \times d_v}$ — the values: "what information do I carry?"
The analogy is a lookup table. Each token broadcasts a query ("I need context about verbs"), checks it against every other token's key ("I am a verb"), and retrieves information from the matching token's value.
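These three projections are just matrix multiplications. A minimal NumPy sketch with toy sizes and random stand-ins for the trained weights (the dimensions here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 5, 16, 8          # toy sizes for illustration

X = rng.normal(size=(n, d_model))   # token embeddings, one row per token

# Learned projections (here: random stand-ins for trained weights)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries: "what am I looking for?"
K = X @ W_K   # keys:    "what do I contain?"
V = X @ W_V   # values:  "what information do I carry?"

print(Q.shape, K.shape, V.shape)   # each is (n, d_k) = (5, 8)
```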
The Attention Matrix
The core computation is the attention score matrix:

$$S = QK^\top \in \mathbb{R}^{n \times n}$$

Entry $S_{ij}$ is the dot product of query $i$ with key $j$:

$$S_{ij} = q_i \cdot k_j = \sum_{l=1}^{d_k} Q_{il} K_{jl}$$

This dot product measures the compatibility between token $i$'s query and token $j$'s key. Higher values mean stronger relevance.
The result is an $n \times n$ matrix where every token has scored its relationship with every other token — and with itself. This is what gives Transformers their remarkable ability to model long-range dependencies.
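In code, the whole score matrix is a single matrix product, and each entry can be checked against the dot-product definition (random toy data again):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.normal(size=(n, d_k))   # queries, one row per token
K = rng.normal(size=(n, d_k))   # keys, one row per token

S = Q @ K.T                     # S[i, j] = q_i . k_j, shape (n, n)

# Entry (1, 2) really is the dot product of query 1 with key 2
print(S.shape, np.isclose(S[1, 2], Q[1] @ K[2]))
```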
Why the Scaling Matters
Before applying softmax, we divide by $\sqrt{d_k}$:

$$\frac{S}{\sqrt{d_k}} = \frac{QK^\top}{\sqrt{d_k}}$$

Why? The answer lies in the statistics of dot products.
Assume the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. Their dot product $q \cdot k = \sum_{l=1}^{d_k} q_l k_l$ is a sum of $d_k$ independent terms, each with variance $1$, so:

$$\operatorname{Var}(q \cdot k) = d_k$$

As $d_k$ grows, the dot products grow in magnitude. Large dot products push the softmax into saturated regions where the gradients are extremely small — a phenomenon known as the vanishing gradient problem.
Dividing by $\sqrt{d_k}$ normalizes the variance back to $1$, keeping the softmax in its sensitive regime where gradients flow effectively. Without this scaling, training deep Transformers would be significantly harder.
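This variance argument is easy to verify empirically. The sketch below draws many random query/key pairs with unit-variance components and confirms that the raw dot products have variance close to $d_k$, while the scaled ones stay near $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 20_000

for d_k in (4, 64, 1024):
    q = rng.normal(size=(trials, d_k))    # components ~ N(0, 1)
    k = rng.normal(size=(trials, d_k))
    dots = np.sum(q * k, axis=1)          # raw dot products q . k
    scaled = dots / np.sqrt(d_k)          # scaled as in the attention formula
    # raw variance grows like d_k; scaled variance stays near 1
    print(f"d_k={d_k:>4}: Var(q.k)={dots.var():8.1f}, scaled={scaled.var():.3f}")
```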
The Softmax: From Scores to Probabilities
After scaling, we apply softmax row-wise:

$$A_{ij} = \frac{\exp\!\left(S_{ij}/\sqrt{d_k}\right)}{\sum_{l=1}^{n} \exp\!\left(S_{il}/\sqrt{d_k}\right)}$$

This converts each row of scores into a probability distribution. The attention weights are non-negative and sum to $1$ across each row:

$$\sum_{j=1}^{n} A_{ij} = 1$$
Each token now has a complete probability distribution over all tokens in the sequence, expressing how much it should "attend to" each position.
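A standard numerically stable implementation subtracts each row's maximum before exponentiating (softmax is invariant under this shift, and it prevents overflow for large scores):

```python
import numpy as np

def softmax_rows(S):
    """Numerically stable row-wise softmax."""
    S = S - S.max(axis=-1, keepdims=True)   # shift; softmax is shift-invariant
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

S = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
A = softmax_rows(S)
print(A.sum(axis=1))   # each row sums to 1
```

Note that a row of identical scores (the second row above) yields a uniform distribution: with no signal, every position is attended to equally.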
The Output: Weighted Value Aggregation
Finally, we multiply the attention weights by the value matrix:

$$\text{Output} = AV \in \mathbb{R}^{n \times d_v}$$

Row $i$ of the output is:

$$\text{output}_i = \sum_{j=1}^{n} A_{ij}\, v_j$$
Each output vector is a weighted combination of all value vectors, where the weights are determined by the attention scores. Tokens that are more relevant (higher attention weight) contribute more to the output.
This is the key insight: self-attention computes a context-dependent representation for each token by aggregating information from the entire sequence, weighted by relevance.
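The pieces above assemble into the full scaled dot-product attention formula. A minimal NumPy sketch on random toy data:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)              # scaled scores, (n, n)
    S = S - S.max(axis=-1, keepdims=True)   # stable softmax
    A = np.exp(S)
    A /= A.sum(axis=-1, keepdims=True)      # rows sum to 1
    return A @ V, A                         # weighted value aggregation

rng = np.random.default_rng(0)
n, d_k = 6, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, A = self_attention(Q, K, V)
print(out.shape)   # (6, 8): one context-dependent vector per token
```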
Multi-Head Attention
A single attention head captures one type of relationship. Multi-head attention runs $h$ attention heads in parallel, each with its own learned projection matrices:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$

where each $\text{head}_i = \mathrm{Attention}\!\left(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)}\right)$.
Typically $d_k = d_{\text{model}}/h$, so the total computation is comparable to a single head with full dimensionality. Each head can specialize: one might track syntactic dependencies, another semantic similarity, another positional relationships.
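A sketch of the multi-head computation, storing the per-head weights as stacks of shape $(h, d_{\text{model}}, d_k)$ for convenience (this layout is a choice of this example; real implementations typically fuse the heads into one big matrix multiply):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Multi-head attention from per-head weight stacks of shape (h, d_model, d_k)."""
    h, _, d_k = W_Q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        S = Q @ K.T / np.sqrt(d_k)
        S = S - S.max(axis=-1, keepdims=True)
        A = np.exp(S)
        A /= A.sum(axis=-1, keepdims=True)
        heads.append(A @ V)                        # each head: (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_O   # (n, h*d_k) -> (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h                                 # typical choice: d_k = d_model / h
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(rng.normal(size=(n, d_model)), W_Q, W_K, W_V, W_O)
print(out.shape)   # (5, 16): back to d_model after the output projection
```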
Computational Complexity
The attention matrix requires $O(n^2 d_k)$ operations to compute, and storing it requires $O(n^2)$ memory. This quadratic scaling in sequence length $n$ is the fundamental computational bottleneck of Transformers.
For a sequence of $n$ tokens, the attention computation involves approximately $2n^2 d_{\text{model}}$ multiply-add operations per layer for the $QK^\top$ and $AV$ products (summed over all $h$ heads), plus roughly $4n d_{\text{model}}^2$ for the projections. This is why efficient attention variants (FlashAttention, ring attention, linear attention) are active areas of research.
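A back-of-the-envelope cost model makes the quadratic term concrete. The counts below follow the rough accounting above (an assumption of this sketch, ignoring softmax, biases, and the feed-forward block):

```python
def attention_cost(n, d_model):
    """Rough multiply-add count per layer: projections scale as n*d^2,
    the QK^T and AV products as n^2*d."""
    projections = 4 * n * d_model**2    # W_Q, W_K, W_V, W_O
    attention = 2 * n**2 * d_model      # QK^T and AV, summed over heads
    return projections, attention

# As n grows at fixed d_model, the quadratic attention term takes over
for n in (1_024, 8_192, 65_536):
    proj, attn = attention_cost(n, d_model=1_024)
    print(f"n={n:>6}: projections {proj:.2e}, attention {attn:.2e}")
```

At short sequence lengths the $O(nd^2)$ projections dominate; past $n \approx 2d_{\text{model}}$ the $O(n^2 d)$ attention products do, which is exactly the regime long-context variants target.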
The Deeper Mathematics
Self-attention can be understood through several mathematical lenses:
- Kernel methods — The softmax attention matrix is a kernel matrix, where $k(q_i, k_j) = \exp(q_i \cdot k_j / \sqrt{d_k})$ measures similarity in a learned feature space
- Graph neural networks — The attention weights define a weighted, directed graph where each token is a node
- Dynamical systems — Stacking attention layers creates a discrete dynamical system where representations evolve toward task-relevant fixed points
These connections are not mere analogies. They provide theoretical tools for understanding why Transformers generalize, what they can and cannot express, and how to design better architectures.
From Matrix Multiplication to Understanding
It is remarkable that the machinery of matrix multiplication — the most basic operation in linear algebra — gives rise to systems that can write poetry, prove theorems, and engage in nuanced reasoning. The self-attention equation is, at its core, just a weighted average. But the weights are learned, the representations are high-dimensional, and the layers are deep.
The mathematics of attention reminds us that intelligence, natural or artificial, may be less about any single clever mechanism and more about the composition of simple operations at scale.
Applied mathematician and AI practitioner. Founder of MathLumen, exploring mathematics behind machine learning and scientific AI.

