A Revolution in One Equation
In June 2017, a team at Google published "Attention Is All You Need," a paper that would reshape artificial intelligence. At its heart was a single mechanism — self-attention — captured in one equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
This equation is the engine of every large language model, from GPT to Claude to Gemini. Understanding it — truly understanding it, matrix dimension by matrix dimension — is essential for anyone who wants to grasp how modern AI works.
The Setup: Tokens as Vectors
A Transformer begins by converting each input token (word or subword) into a vector in $\mathbb{R}^{d_{\text{model}}}$. For a sequence of $n$ tokens, we stack these vectors into a matrix:

$$X \in \mathbb{R}^{n \times d_{\text{model}}}$$

Each row of $X$ is one token's embedding — a high-dimensional point in semantic space.
Queries, Keys, and Values
Self-attention creates three different "views" of the input by multiplying $X$ with learned weight matrices:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

where:
- $W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$ — query projection
- $W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ — key projection
- $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ — value projection
This gives us:
- $Q \in \mathbb{R}^{n \times d_k}$ — the queries: "what am I looking for?"
- $K \in \mathbb{R}^{n \times d_k}$ — the keys: "what do I contain?"
- $V \in \mathbb{R}^{n \times d_v}$ — the values: "what information do I carry?"
The analogy is a lookup table. Each token broadcasts a query ("I need context about verbs"), checks it against every other token's key ("I am a verb"), and retrieves information from the matching token's value.
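These three projections are just matrix multiplications. A minimal NumPy sketch with toy sizes and random stand-ins for the trained weights (the dimensions here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 5, 16, 8          # toy sizes for illustration

X = rng.normal(size=(n, d_model))   # token embeddings, one row per token

# Learned projections (here: random stand-ins for trained weights)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries: "what am I looking for?"
K = X @ W_K   # keys:    "what do I contain?"
V = X @ W_V   # values:  "what information do I carry?"

print(Q.shape, K.shape, V.shape)   # each is (n, d_k) = (5, 8)
```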
The Attention Matrix
The core computation is the attention score matrix:

$$S = QK^\top \in \mathbb{R}^{n \times n}$$

Entry $S_{ij}$ is the dot product of query $i$ with key $j$:

$$S_{ij} = q_i \cdot k_j = \sum_{l=1}^{d_k} Q_{il} K_{jl}$$

This dot product measures the compatibility between token $i$'s query and token $j$'s key. Higher values mean stronger relevance.
The result is an $n \times n$ matrix where every token has scored its relationship with every other token — and with itself. This is what gives Transformers their remarkable ability to model long-range dependencies.
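In code, the whole score matrix is a single matrix product, and each entry can be checked against the dot-product definition (random toy data again):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.normal(size=(n, d_k))   # queries, one row per token
K = rng.normal(size=(n, d_k))   # keys, one row per token

S = Q @ K.T                     # S[i, j] = q_i . k_j, shape (n, n)

# Entry (1, 2) really is the dot product of query 1 with key 2
print(S.shape, np.isclose(S[1, 2], Q[1] @ K[2]))
```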
Why the Scaling Matters
Before applying softmax, we divide by $\sqrt{d_k}$:

$$\frac{S}{\sqrt{d_k}} = \frac{QK^\top}{\sqrt{d_k}}$$

Why? The answer lies in the statistics of dot products.
Assume the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. Their dot product $q \cdot k = \sum_{l=1}^{d_k} q_l k_l$ is a sum of $d_k$ independent terms, each with variance $1$, so:

$$\operatorname{Var}(q \cdot k) = d_k$$

As $d_k$ grows, the dot products grow in magnitude. Large dot products push the softmax into saturated regions where the gradients are extremely small — a phenomenon known as the vanishing gradient problem.
Dividing by $\sqrt{d_k}$ normalizes the variance back to $1$, keeping the softmax in its sensitive regime where gradients flow effectively. Without this scaling, training deep Transformers would be significantly harder.
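This variance argument is easy to verify empirically. The sketch below draws many random query/key pairs with unit-variance components and confirms that the raw dot products have variance close to $d_k$, while the scaled ones stay near $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 20_000

for d_k in (4, 64, 1024):
    q = rng.normal(size=(trials, d_k))    # components ~ N(0, 1)
    k = rng.normal(size=(trials, d_k))
    dots = np.sum(q * k, axis=1)          # raw dot products q . k
    scaled = dots / np.sqrt(d_k)          # scaled as in the attention formula
    # raw variance grows like d_k; scaled variance stays near 1
    print(f"d_k={d_k:>4}: Var(q.k)={dots.var():8.1f}, scaled={scaled.var():.3f}")
```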
The Softmax: From Scores to Probabilities
After scaling, we apply softmax row-wise:

$$A_{ij} = \frac{\exp\!\left(S_{ij}/\sqrt{d_k}\right)}{\sum_{l=1}^{n} \exp\!\left(S_{il}/\sqrt{d_k}\right)}$$

This converts each row of scores into a probability distribution. The attention weights are non-negative and sum to $1$ across each row:

$$\sum_{j=1}^{n} A_{ij} = 1$$
Each token now has a complete probability distribution over all tokens in the sequence, expressing how much it should "attend to" each position.
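A standard numerically stable implementation subtracts each row's maximum before exponentiating (softmax is invariant under this shift, and it prevents overflow for large scores):

```python
import numpy as np

def softmax_rows(S):
    """Numerically stable row-wise softmax."""
    S = S - S.max(axis=-1, keepdims=True)   # shift; softmax is shift-invariant
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

S = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
A = softmax_rows(S)
print(A.sum(axis=1))   # each row sums to 1
```

Note that a row of identical scores (the second row above) yields a uniform distribution: with no signal, every position is attended to equally.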
The Output: Weighted Value Aggregation
Finally, we multiply the attention weights by the value matrix:

$$\text{Output} = AV \in \mathbb{R}^{n \times d_v}$$

Row $i$ of the output is:

$$\text{output}_i = \sum_{j=1}^{n} A_{ij}\, v_j$$
Each output vector is a weighted combination of all value vectors, where the weights are determined by the attention scores. Tokens that are more relevant (higher attention weight) contribute more to the output.
This is the key insight: self-attention computes a context-dependent representation for each token by aggregating information from the entire sequence, weighted by relevance.
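The pieces above assemble into the full scaled dot-product attention formula. A minimal NumPy sketch on random toy data:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)              # scaled scores, (n, n)
    S = S - S.max(axis=-1, keepdims=True)   # stable softmax
    A = np.exp(S)
    A /= A.sum(axis=-1, keepdims=True)      # rows sum to 1
    return A @ V, A                         # weighted value aggregation

rng = np.random.default_rng(0)
n, d_k = 6, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, A = self_attention(Q, K, V)
print(out.shape)   # (6, 8): one context-dependent vector per token
```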
Multi-Head Attention
A single attention head captures one type of relationship. Multi-head attention runs $h$ attention heads in parallel, each with its own learned projection matrices:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$

where each $\text{head}_i = \mathrm{Attention}\!\left(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)}\right)$.
Typically $d_k = d_{\text{model}}/h$, so the total computation is comparable to a single head with full dimensionality. Each head can specialize: one might track syntactic dependencies, another semantic similarity, another positional relationships.
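A sketch of the multi-head computation, storing the per-head weights as stacks of shape $(h, d_{\text{model}}, d_k)$ for convenience (this layout is a choice of this example; real implementations typically fuse the heads into one big matrix multiply):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Multi-head attention from per-head weight stacks of shape (h, d_model, d_k)."""
    h, _, d_k = W_Q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        S = Q @ K.T / np.sqrt(d_k)
        S = S - S.max(axis=-1, keepdims=True)
        A = np.exp(S)
        A /= A.sum(axis=-1, keepdims=True)
        heads.append(A @ V)                        # each head: (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_O   # (n, h*d_k) -> (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h                                 # typical choice: d_k = d_model / h
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(rng.normal(size=(n, d_model)), W_Q, W_K, W_V, W_O)
print(out.shape)   # (5, 16): back to d_model after the output projection
```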
Computational Complexity
The attention matrix requires $O(n^2 d_k)$ operations to compute, and storing it requires $O(n^2)$ memory. This quadratic scaling in sequence length $n$ is the fundamental computational bottleneck of Transformers.
For a sequence of $n$ tokens, the attention computation involves approximately $2n^2 d_{\text{model}}$ multiply-add operations per layer for the $QK^\top$ and $AV$ products (summed over all $h$ heads), plus roughly $4n d_{\text{model}}^2$ for the projections. This is why efficient attention variants (FlashAttention, ring attention, linear attention) are active areas of research.
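A back-of-the-envelope cost model makes the quadratic term concrete. The counts below follow the rough accounting above (an assumption of this sketch, ignoring softmax, biases, and the feed-forward block):

```python
def attention_cost(n, d_model):
    """Rough multiply-add count per layer: projections scale as n*d^2,
    the QK^T and AV products as n^2*d."""
    projections = 4 * n * d_model**2    # W_Q, W_K, W_V, W_O
    attention = 2 * n**2 * d_model      # QK^T and AV, summed over heads
    return projections, attention

# As n grows at fixed d_model, the quadratic attention term takes over
for n in (1_024, 8_192, 65_536):
    proj, attn = attention_cost(n, d_model=1_024)
    print(f"n={n:>6}: projections {proj:.2e}, attention {attn:.2e}")
```

At short sequence lengths the $O(nd^2)$ projections dominate; past $n \approx 2d_{\text{model}}$ the $O(n^2 d)$ attention products do, which is exactly the regime long-context variants target.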
The Deeper Mathematics
Self-attention can be understood through several mathematical lenses:
- Kernel methods — The softmax attention matrix is a kernel matrix, where $k(q_i, k_j) = \exp(q_i \cdot k_j / \sqrt{d_k})$ measures similarity in a learned feature space
- Graph neural networks — The attention weights define a weighted, directed graph where each token is a node
- Dynamical systems — Stacking attention layers creates a discrete dynamical system where representations evolve toward task-relevant fixed points
These connections are not mere analogies. They provide theoretical tools for understanding why Transformers generalize, what they can and cannot express, and how to design better architectures.
From Matrix Multiplication to Understanding
It is remarkable that the machinery of matrix multiplication — the most basic operation in linear algebra — gives rise to systems that can write poetry, prove theorems, and engage in nuanced reasoning. The self-attention equation is, at its core, just a weighted average. But the weights are learned, the representations are high-dimensional, and the layers are deep.
The mathematics of attention reminds us that intelligence, natural or artificial, may be less about any single clever mechanism and more about the composition of simple operations at scale.
Applied mathematician and AI practitioner. Founder of MathLumen, exploring mathematics behind machine learning and scientific AI.

