1️⃣ Definition
For activation vector \( x \in \mathbb{R}^d \):
\[
\mu = \frac{1}{d}\sum_{i=1}^d x_i
\]
\[
\sigma = \sqrt{\frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2}
\]
\[
\text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma} + \beta
\]
- Normalization over feature dimension
- Applied per sample
- Same computation at train and inference
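The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the `eps` term is an assumption added for numerical safety (the formula above has none), and `gamma`/`beta` default to the identity affine map.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one activation vector over its feature dimension.

    eps is an assumption added for numerical safety; it is not in the formula above.
    """
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d  # biased variance: divide by d
    inv_sigma = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mu) * inv_sigma + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x)

# Statistics of the output: mean close to 0, variance close to 1.
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(mean_y, var_y)
```

Nothing here depends on other samples or on running statistics, which is why the same function serves at train and inference time.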
2️⃣ Geometry Interpretation
Before \( \gamma \) and \( \beta \) are applied, the normalized vector (written \( x \) below for brevity) satisfies two constraints:
(1) Zero mean
\[
\sum_i x_i = 0
\]
Vector lies in a \( (d-1) \)-dimensional hyperplane.
(2) Unit variance
\[
\|x\|^2 = d
\]
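With zero mean, unit variance means \( \frac{1}{d}\sum_i x_i^2 = 1 \), so the vector lies on a sphere of radius \( \sqrt{d} \).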
Resulting manifold
\[
\mathcal{M} = \{ x \mid \sum_i x_i = 0,\ \|x\|^2 = d \}
\]
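This is a \((d-2)\)-dimensional sphere of radius \( \sqrt{d} \): the intersection of the hyperplane with the sphere.
Before the affine map \( \gamma, \beta \), LayerNorm projects activations onto this curved manifold.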
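A quick numerical check of the two constraints, as a sketch. It assumes the plain normalization with no `gamma`, `beta`, or epsilon, and the input vector is arbitrary.

```python
import math

def normalize(x):
    """LayerNorm without gamma, beta, or eps: subtract the mean, divide by the std."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    return [(xi - mu) / sigma for xi in x]

x = [0.3, -1.2, 4.0, 2.5, -0.7]
x_hat = normalize(x)
d = len(x_hat)

coord_sum = sum(x_hat)               # constraint (1): should be close to 0
sq_norm = sum(v * v for v in x_hat)  # constraint (2): should be close to d
print(coord_sum, sq_norm, d)
```

Any input with nonzero variance lands on the same manifold, whatever its original mean or scale.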
3️⃣ Riemannian Optimization View
LayerNorm removes:
- Shift freedom
- Scale freedom
Optimization occurs on:
\[
\mathbb{R}^d / (\text{shift + scale})
\]
Up to a factor of \( 1/\sigma \), the gradient with respect to the pre-normalization input is a projection onto the tangent space:
\[
g_{\text{proj}} = P_{\text{tangent}} g
\]
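The LN backward pass removes the gradient's radial component (along \( x - \mu \)) and its component along the mean direction \( \mathbf{1} \).
Acts like geometric preconditioning.
This helps explain smoother optimization, reduced scale sensitivity, and deep training stability.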
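The projection claim can be checked numerically. The sketch below assumes a hypothetical scalar loss (a fixed linear readout `w` of the normalized vector), no epsilon, and central finite differences in place of an analytic backward pass.

```python
import math

def normalize(x):
    """LayerNorm without gamma, beta, or eps."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    return [(xi - mu) / sigma for xi in x]

def loss(x, w):
    """Hypothetical scalar loss: a fixed linear readout of the normalized vector."""
    return sum(wi * xi for wi, xi in zip(w, normalize(x)))

def numerical_grad(x, w, h=1e-6):
    """Central finite-difference gradient of loss with respect to x."""
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((loss(xp, w) - loss(xm, w)) / (2 * h))
    return grad

x = [0.5, -1.0, 2.0, 3.5]
w = [1.0, -2.0, 0.5, 3.0]
g = numerical_grad(x, w)

mu = sum(x) / len(x)
mean_component = sum(g)                                         # g . 1
radial_component = sum(gi * (xi - mu) for gi, xi in zip(g, x))  # g . (x - mu)
grad_sq_norm = sum(gi * gi for gi in g)
print(mean_component, radial_component, grad_sq_norm)
```

Both components come out near zero while the gradient itself does not vanish: only directions tangent to the manifold receive a learning signal.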
4️⃣ Why It Works in Transformers
Transformers rely on dot products:
\[
QK^T
\]
Dot product magnitude depends on vector norm.
LayerNorm:
- Stabilizes per-token norm
- Makes attention logits insensitive to the scale of the incoming activations
- Helps prevent exploding attention scores
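A toy illustration of the norm dependence, under simplifying assumptions: the vectors `q` and `k` stand in for token representations, the learned projections \( W_Q, W_K \) and the softmax are omitted, and normalization uses no `gamma`, `beta`, or epsilon.

```python
import math

def normalize(x):
    """LayerNorm without gamma, beta, or eps."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    return [(xi - mu) / sigma for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [1.0, -2.0, 0.5, 3.0]
k = [0.2, 1.5, -1.0, 2.0]

c = 100.0  # blow up the activation scale
q_big = [c * v for v in q]
k_big = [c * v for v in k]

raw_ratio = dot(q_big, k_big) / dot(q, k)  # raw score grows as c**2
normed_small = dot(normalize(q), normalize(k))
normed_big = dot(normalize(q_big), normalize(k_big))
print(raw_ratio, normed_small, normed_big)
```

After normalization the score is unchanged by the rescaling, and by Cauchy-Schwarz its magnitude is bounded by \( d \), since both vectors have norm \( \sqrt{d} \).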
5️⃣ LayerNorm vs BatchNorm
| Property | BatchNorm | LayerNorm |
|---|---|---|
| Normalizes over | Batch | Features |
| Depends on batch size | Yes | No |
| Train/test mismatch | Yes | No |
| Works with batch=1 (training) | No | Yes |
| Autoregressive compatible | No | Yes |
| Per-sequence independence | No | Yes |
BatchNorm introduces cross-sample coupling, because its statistics are computed over the batch:
\[
x_i^{\text{norm}} \text{ depends on } x_j \text{ for } j \neq i
\]
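The coupling can be seen directly by changing one sample and watching another. This is a sketch: `batch_norm_cols` is a hypothetical training-mode BatchNorm with no learned parameters or running statistics, and its `eps` is an assumption.

```python
import math

def layer_norm_rows(batch):
    """Normalize each sample (row) over its own features."""
    out = []
    for row in batch:
        mu = sum(row) / len(row)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in row) / len(row))
        out.append([(v - mu) / sigma for v in row])
    return out

def batch_norm_cols(batch, eps=1e-5):
    """Training-mode BatchNorm sketch: normalize each feature (column) over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / math.sqrt(var + eps)
    return out

batch_a = [[1.0, 2.0, 4.0], [0.0, 5.0, -1.0], [3.0, 1.0, 2.0]]
batch_b = [row[:] for row in batch_a]
batch_b[1] = [10.0, -7.0, 6.0]  # change only sample 1

# Does the output for sample 0 survive a change to sample 1?
ln_same = layer_norm_rows(batch_a)[0] == layer_norm_rows(batch_b)[0]
bn_same = batch_norm_cols(batch_a)[0] == batch_norm_cols(batch_b)[0]
print(ln_same, bn_same)
```

Sample 0 is untouched under LayerNorm but shifts under BatchNorm, which is the cross-sample dependence that conflicts with autoregressive and per-sequence independence.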
6️⃣ Pre-LN vs Post-LN
Post-LN
\[
\text{LN}(x + \text{Sublayer}(x))
\]
Harder to train in deep networks; typically needs learning-rate warmup.
Pre-LN
\[
x + \text{Sublayer}(\text{LN}(x))
\]
More stable gradient flow, because the residual path stays an identity map.
Many modern LLMs (e.g., LLaMA) use Pre-LN with RMSNorm.
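The two placements differ only in where the normalization sits relative to the residual addition. A minimal sketch, assuming a hypothetical `toy_sublayer` in place of attention or an MLP, and a LayerNorm with no learned parameters:

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without gamma/beta; eps is an assumption for numerical safety."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def post_ln_block(x, sublayer):
    """Post-LN: normalize after the residual addition."""
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual path is untouched."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP."""
    return [0.5 * v for v in x]

x = [10.0, -20.0, 30.0, 40.0]
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)
print(post)
print(pre)
```

In the Pre-LN block the input `x` reaches the output unchanged plus a correction, so gradients have a direct path through the stack; in the Post-LN block every layer renormalizes the residual stream.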
7️⃣ RMSNorm
\[
\text{RMSNorm}(x) = \gamma \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}}
\]
Removes mean subtraction.
Keeps scale invariance.
Its empirical success suggests that scale invariance matters more than mean-centering.
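A sketch of RMSNorm showing both properties. The `eps` term and its placement inside the square root are assumptions; the formula above has no epsilon.

```python
import math

def rms_norm(x, gamma=None, eps=1e-6):
    """Divide by the root mean square of x; no mean subtraction.

    eps and its placement are assumptions, not part of the formula above.
    """
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [1.0, 2.0, 3.0, 6.0]
y = rms_norm(x)
y_scaled = rms_norm([10.0 * v for v in x])  # same direction, 10x the scale

max_diff = max(abs(a - b) for a, b in zip(y, y_scaled))  # scale invariance
mean_y = sum(y) / len(y)                                 # not centered
mean_sq_y = sum(v * v for v in y) / len(y)               # close to 1
print(max_diff, mean_y, mean_sq_y)
```

The output keeps a nonzero mean, so the result lies on the sphere \( \|x\|^2 = d \) but not in the zero-mean hyperplane.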
8️⃣ Core Insights
- Enforces per-token scale stability.
- Constrains activations to a sphere.
- Changes optimization geometry.
- Stabilizes attention.
- Eliminates train/inference mismatch.
- Enables deep transformer scaling.
9️⃣ One-Sentence Mental Model
LayerNorm projects each token representation onto a scale-invariant manifold, stabilizing attention geometry and enabling deep transformer optimization.