📘 Layer Normalization — Geometry & Optimization Note

Structured knowledge-base summary (Optimization / Geometry View)

1️⃣ Definition

For activation vector \( x \in \mathbb{R}^d \):

\[ \mu = \frac{1}{d}\sum_{i=1}^d x_i \]

\[ \sigma = \sqrt{\frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2} \]

\[ \text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma} + \beta \]
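The definition above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `layer_norm`, the list-based vector type, and the `eps` argument are assumptions added here. The formula above has no epsilon; practical implementations usually add a small one inside the square root to avoid division by zero.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=0.0):
    """Apply LayerNorm to one activation vector x (a list of floats)."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d  # per-feature scale
    beta = beta if beta is not None else [0.0] * d     # per-feature shift
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d          # biased variance (divide by d)
    sigma = math.sqrt(var + eps)                       # eps is an assumption, not in the formula
    return [g * (xi - mu) / sigma + b for xi, g, b in zip(x, gamma, beta)]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
y_mean = sum(y) / len(y)                 # close to 0
y_var = sum(v * v for v in y) / len(y)   # close to 1
```

With `eps=0.0` a constant input vector would divide by zero, which is why a small positive `eps` is the usual choice in practice.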

2️⃣ Geometry Interpretation

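LayerNorm imposes two constraints on the normalized vector \( (x - \mu)/\sigma \), before the affine parameters \( \gamma, \beta \) are applied. In this section \( x \) denotes that normalized vector: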

(1) Zero mean

\[ \sum_i x_i = 0 \]

The vector lies in a \( (d-1) \)-dimensional hyperplane orthogonal to the all-ones vector.

(2) Unit variance

\[ \|x\|^2 = d \]

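With zero mean, the variance equals \( \|x\|^2 / d \), so unit variance places the vector on a sphere of radius \( \sqrt{d} \).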

Resulting manifold

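\[ \mathcal{M} = \{ x \in \mathbb{R}^d \mid \sum_i x_i = 0,\ \|x\|^2 = d \} \]

This is a \((d-2)\)-dimensional sphere of radius \( \sqrt{d} \): the intersection of a sphere with a hyperplane through its center.

Before the affine parameters are applied, LayerNorm maps every activation vector onto this curved manifold.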
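The two constraints can be checked numerically. The sketch below is illustrative only: `normalize` is a hypothetical helper that applies the normalization without \( \gamma \) and \( \beta \), and the input values are arbitrary.

```python
import math

def normalize(x):
    """Normalization step of LayerNorm, without gamma and beta."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    return [(xi - mu) / sigma for xi in x]

x = [0.3, -1.2, 4.0, 2.5, -0.7]
x_hat = normalize(x)
d = len(x_hat)

coord_sum = sum(x_hat)               # hyperplane constraint: close to 0
sq_norm = sum(v * v for v in x_hat)  # sphere constraint: close to d

# A shifted and positively rescaled copy of x lands on the same point.
x_moved = [3.0 * xi + 7.0 for xi in x]
same_point = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(x_hat, normalize(x_moved))
)
```

The last check illustrates why the manifold has two fewer dimensions than the input space: one degree of freedom is lost to the shift and one to the scale.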

3️⃣ Riemannian Optimization View

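LayerNorm removes two degrees of freedom from each activation vector: the shift along the all-ones direction (the mean) and the overall positive scale.

Optimization therefore effectively takes place on the quotient space: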

\[ \mathbb{R}^d / (\text{shift + scale}) \]

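The effective gradient with respect to the input is the projection of the incoming gradient \( g \) onto the tangent space of \( \mathcal{M} \) (up to an overall factor of \( 1/\sigma \), and ignoring \( \gamma \)):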

\[ g_{\text{proj}} = P_{\text{tangent}} g \]

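LN removes the gradient's radial component and its component along the mean (all-ones) direction.

In this sense it acts like a geometric preconditioner.

This view helps explain the smoother optimization, reduced sensitivity to activation scale, and stable training of deep networks associated with LayerNorm.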
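The projection can be made concrete with the closed-form backward pass of the normalization step. The sketch below assumes \( \gamma = 1 \), \( \beta = 0 \), and no epsilon; `normalize` and `layer_norm_backward` are hypothetical names, and the upstream gradient `g` is arbitrary. It checks that the resulting input gradient has no component along the all-ones direction or along the normalized vector.

```python
import math

def normalize(x):
    """Return the normalized vector and sigma (no gamma, beta, or eps)."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    return [(xi - mu) / sigma for xi in x], sigma

def layer_norm_backward(x, g):
    """Gradient w.r.t. x of sum_i g[i] * x_hat[i]."""
    x_hat, sigma = normalize(x)
    d = len(x)
    g_mean = sum(g) / d                                      # part along the all-ones direction
    g_radial = sum(gi * hi for gi, hi in zip(g, x_hat)) / d  # part along x_hat
    # Subtract both parts, then rescale by 1 / sigma.
    return [(gi - g_mean - hi * g_radial) / sigma for gi, hi in zip(g, x_hat)]

x = [0.5, -1.0, 2.0, 3.5]
g = [1.0, -2.0, 0.5, 4.0]  # arbitrary upstream gradient
g_proj = layer_norm_backward(x, g)

x_hat, _ = normalize(x)
dot_ones = sum(g_proj)                                  # close to 0
dot_radial = sum(a * b for a, b in zip(g_proj, x_hat))  # close to 0
```

Both dot products vanish up to rounding, which is the numerical counterpart of the statement that the gradient is confined to the tangent space.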

4️⃣ Why It Works in Transformers

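Transformers rely on dot products:

\[ QK^T \]

The magnitude of a dot product depends on the norms of the vectors involved. LayerNorm fixes the scale of each token representation, so the attention logits do not drift with the scale of the activations.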
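A small numerical sketch of this norm dependence follows. It is a simplification: in a real Transformer, LayerNorm is applied to the token representation before the learned query and key projections, which are omitted here. The helpers `normalize` and `dot` and the vectors `q` and `k` are hypothetical.

```python
import math

def normalize(x):
    """Normalization step of LayerNorm, without gamma and beta."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    return [(xi - mu) / sigma for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [1.0, -2.0, 0.5, 3.0]
k = [0.5, 1.5, -1.0, 2.0]
q_big = [10.0 * v for v in q]  # same direction, 10x the norm

raw_logit = dot(q, k)
raw_logit_big = dot(q_big, k)  # grows 10x with the norm of q

ln_logit = dot(normalize(q), normalize(k))
ln_logit_big = dot(normalize(q_big), normalize(k))  # unchanged by the rescaling
```

After normalization both vectors have norm \( \sqrt{d} \), so by the Cauchy-Schwarz inequality the logit magnitude in this simplified setting is at most \( d \).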

5️⃣ LayerNorm vs BatchNorm

| Property | BatchNorm | LayerNorm |
|---|---|---|
| Normalizes over | Batch | Features |
| Depends on batch size | Yes | No |
| Train/test mismatch | Yes | No |
| Works with batch=1 | No | Yes |
| Autoregressive compatible | No | Yes |
| Per-sequence independence | No | Yes |

BatchNorm introduces cross-sample coupling. With \( i \) and \( j \) indexing samples in the same batch:

\[ x_i^{\text{norm}} \text{ depends on } x_j \quad (j \neq i) \]
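The coupling can be seen in a toy example. The sketch below uses training-mode batch statistics only (no running averages, no affine parameters, no epsilon); the function names and the two batches are hypothetical. The first sample is identical in both batches, and only the second sample changes.

```python
import math

def standardize(values):
    """Zero-mean, unit-variance rescaling of a list of floats."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

def layer_norm(batch):
    # Statistics over the features of each sample (each row).
    return [standardize(row) for row in batch]

def batch_norm(batch):
    # Statistics over the batch for each feature (each column).
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch_a = [[1.0, 2.0, 3.0], [4.0, 6.0, 9.0], [0.0, 1.0, 5.0]]
batch_b = [[1.0, 2.0, 3.0], [-5.0, 0.0, 7.0], [0.0, 1.0, 5.0]]

ln_same = layer_norm(batch_a)[0] == layer_norm(batch_b)[0]  # sample 0 unaffected
bn_same = batch_norm(batch_a)[0] == batch_norm(batch_b)[0]  # sample 0 changes
```

Here `ln_same` is `True` and `bn_same` is `False`: changing another sample in the batch alters the BatchNorm output for sample 0 but leaves its LayerNorm output untouched.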

6️⃣ Pre-LN vs Post-LN

Post-LN

\[ \text{LN}(x + \text{Sublayer}(x)) \]

Harder to train in deep networks.

Pre-LN

\[ x + \text{Sublayer}(\text{LN}(x)) \]

More stable gradient flow, because the residual path itself is left unnormalized. Many modern LLMs use Pre-LN together with RMSNorm.
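The two orderings differ only in where the normalization sits relative to the residual addition. The sketch below makes that explicit; `toy_sublayer` is a hypothetical stand-in for an attention or MLP sublayer, and `normalize` omits \( \gamma \), \( \beta \), and epsilon.

```python
import math

def normalize(x):
    """Normalization step of LayerNorm, without gamma and beta."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)
    return [(xi - mu) / sigma for xi in x]

def toy_sublayer(v):
    # Hypothetical placeholder for attention or an MLP.
    return [0.5 * vi for vi in v]

def post_ln_block(x, sublayer):
    # LN(x + Sublayer(x)): the residual sum itself is normalized.
    return normalize([xi + si for xi, si in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    # x + Sublayer(LN(x)): x passes through the skip path unchanged.
    return [xi + si for xi, si in zip(x, sublayer(normalize(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)

post_sq_norm = sum(v * v for v in post_out)            # pinned to d by the final LN
pre_residual = [o - xi for o, xi in zip(pre_out, x)]   # equals Sublayer(LN(x))
```

In the Pre-LN block the input reaches the output through an identity path, which is the property usually credited for its more stable gradient flow.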

7️⃣ RMSNorm

\[ \text{RMSNorm}(x) = \gamma \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}} \]

RMSNorm drops the mean subtraction but keeps scale invariance. Its empirical success suggests that scale invariance matters more than mean-centering.
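A minimal sketch of the formula above, under the same assumptions as before: `rms_norm` is a hypothetical name, vectors are plain lists, and `eps` is an added argument that the formula does not include. The checks show that the output is invariant to positive rescaling of the input but not to shifts.

```python
import math

def rms_norm(x, gamma=None, eps=0.0):
    """Apply RMSNorm to one activation vector x (a list of floats)."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    rms = math.sqrt(sum(xi * xi for xi in x) / d + eps)  # eps is an assumption
    return [g * xi / rms for xi, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x)
y_scaled = rms_norm([3.0 * xi for xi in x])   # same as y
y_shifted = rms_norm([xi + 7.0 for xi in x])  # differs from y

scale_invariant = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(y, y_scaled))
shift_invariant = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(y, y_shifted))
y_mean = sum(y) / len(y)                      # not zero: no centering
y_mean_sq = sum(v * v for v in y) / len(y)    # close to 1
```

Geometrically, RMSNorm keeps the sphere constraint \( \|x\|^2 = d \) but drops the hyperplane constraint.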

8️⃣ Core Insights

  1. Enforces per-token scale stability.
  2. Constrains activations to a sphere.
  3. Changes optimization geometry.
  4. Stabilizes attention.
  5. Avoids the train/inference mismatch of BatchNorm.
  6. Enables deep transformer scaling.

🔟 One-Sentence Mental Model

LayerNorm projects each token representation onto a scale-invariant manifold, stabilizing attention geometry and enabling deep transformer optimization.