📘 Layer Normalization — Geometry & Optimization Note

Structured knowledge-base summary (Optimization / Geometry View)

1️⃣ Definition

For activation vector \( x \in \mathbb{R}^d \):

\[ \mu = \frac{1}{d}\sum_{i=1}^d x_i \] \[ \sigma = \sqrt{\frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2} \] \[ \text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma} + \beta \]

Normalization over feature dimension
Applied per sample
Same computation at train and inference

2️⃣ Geometry Interpretation

LayerNorm imposes two constraints:

(1) Zero mean

\[ \sum_i x_i = 0 \]

Vector lies in a \( (d-1) \)-dimensional hyperplane.

(2) Unit variance

\[ \|x\|^2 = d \]

Vector lies on a sphere.

Resulting manifold

\[ \mathcal{M} = \{ x \mid \sum x_i = 0,\ \|x\|^2 = d \} \]

This is a \((d-2)\)-dimensional sphere.

LayerNorm projects activations onto a curved manifold.

3️⃣ Riemannian Optimization View

LayerNorm removes:

Shift freedom
Scale freedom

Optimization occurs on:

\[ \mathbb{R}^d / (\text{shift + scale}) \]

Effective gradient becomes projection:

\[ g_{\text{proj}} = P_{\text{tangent}} g \]

LN removes radial component and mean direction.

Acts like geometric preconditioning.

Explains smoother optimization, reduced scale sensitivity, and deep training stability.

4️⃣ Why It Works in Transformers

Transformers rely on dot products: \[ QK^T \] Dot product magnitude depends on vector norm. LayerNorm:

Stabilizes per-token norm
Makes attention scale-invariant
Prevents exploding attention scores

5️⃣ LayerNorm vs BatchNorm

Property	BatchNorm	LayerNorm
Normalizes over	Batch	Features
Depends on batch size	Yes	No
Train/test mismatch	Yes	No
Works with batch=1	No	Yes
Autoregressive compatible	No	Yes
Per-sequence independence	No	Yes

BatchNorm introduces cross-sample coupling:

\[ x_i^{norm} \text{ depends on } x_j \]

6️⃣ Pre-LN vs Post-LN

Post-LN

\[ x + \text{Sublayer}(x) \rightarrow \text{LN} \] Harder to train deep networks.

Pre-LN

\[ x + \text{Sublayer}(\text{LN}(x)) \] More stable gradient flow. Modern LLMs use Pre-LN + RMSNorm.

7️⃣ RMSNorm

\[ \text{RMSNorm}(x) = \gamma \frac{x}{\sqrt{\frac{1}{d}\sum x_i^2}} \] Removes mean subtraction. Keeps scale invariance. Suggests scale invariance is more important than mean-centering.

8️⃣ Core Insights

Enforces per-token scale stability.
Constrains activations to a sphere.
Changes optimization geometry.
Stabilizes attention.
Eliminates train/inference mismatch.
Enables deep transformer scaling.

🔟 One-Sentence Mental Model

LayerNorm projects each token representation onto a scale-invariant manifold, stabilizing attention geometry and enabling deep transformer optimization.