High-Dimensional Learning — Curse of Dimensionality

A detailed study note covers the full statistical learning framework, a rigorous derivation of the three-way error decomposition (approximation, statistical, optimization), and three manifestations of the curse of dimensionality in Lipschitz estimation, neural network approximation, and non-convex optimization.

Statistical Learning Error Decomposition Curse of Dimensionality Lipschitz Estimation Approximation Theory ERM Rademacher Complexity Barron Class

Table of contents

1. Statistical Learning Framework foundations

Statistical learning concerns fitting an unknown function \(f^*\) from data. The setup involves four interacting components: a data distribution, a hypothesis class, a loss metric, and an estimation algorithm.

Data

We observe \(n\) i.i.d. pairs \(\{(x_i, y_i)\}_{i=1}^n\), where

Data model
\[ x_i \sim \nu \quad (\text{data distribution over } \mathcal X), \qquad y_i = f^*(x_i), \]

with \(f^*:\mathcal X\to\mathbb{R}\) the unknown target function we wish to recover. Examples: \(f^*(x)\) = excitation energy of molecule \(x\), or \(f^*(x)=P(y=1\mid x)\) in binary classification.

Why assumptions are unavoidable. We cannot learn \(f^*\) without structural assumptions on both \(\nu\) and \(f^*\). Without any assumption, every function consistent with the data is equally plausible, so no estimator can generalise.

Hypothesis class and complexity

We restrict the search to a hypothesis class \(\mathcal{F}\subset\{f:\mathcal{X}\to\mathbb{R}\}\). Within \(\mathcal{F}\), we measure model size via a (non-negative) complexity measure \(\gamma:\mathcal{F}\to\mathbb{R}_{\ge 0}\), e.g. a norm or neuron count.

Complexity ball
\[ \mathcal{F}_\delta = \{f\in\mathcal{F} : \gamma(f)\le\delta\}. \]

Choosing \(\delta\) trades approximation power against statistical tractability: larger \(\delta\) admits richer functions but also harder generalisation.

Hypothesis class Complexity measure \(\gamma(f)\) Intuition
Shallow neural network \(\sum_j a_j\rho(w_j^Tx+b_j)\) Number of neurons \(m\) More neurons → richer functions
Same shallow net Path norm \(\sum_j|a_j|(\|w_j\|+|b_j|)\) Controls smoothness of the function
Sobolev ball \(\mathcal{H}^s\) Sobolev norm \(\int(1+\omega^2)^s|\hat f(\omega)|^2\,d\omega\) \(s\) finite derivatives bounded
Implicit complexity. Often \(\gamma(f)\) is not prescribed explicitly but is defined implicitly through the training algorithm — e.g. gradient descent with early stopping or SGD with a particular initialisation.

2. Empirical Risk Minimisation three formulations

Population loss vs empirical loss

Population loss
\[ \mathcal{R}(f) = \mathbb{E}_\nu[\,\ell(f(x), f^*(x))\,], \]

where \(\ell(y,y')\) is a pointwise convex error measure, e.g. \(\ell(y,y')=|y-y'|^2\).

Empirical loss
\[ \hat{\mathcal{R}}(f) = \frac{1}{n}\sum_{i=1}^n \ell\!\left(f(x_i),\, f^*(x_i)\right). \]

For any fixed \(f\in\mathcal{F}\), the empirical loss is an unbiased estimator of the population loss:

Unbiasedness and variance
\[ \mathbb{E}[\hat{\mathcal{R}}(f)] = \mathcal{R}(f), \qquad \text{Var}[\hat{\mathcal{R}}(f)] = \frac{\sigma^2(f)}{n}, \quad \sigma^2(f) = \mathbb{E}\!\left[\left(\ell(f(x),f^*(x))-\mathcal{R}(f)\right)^2\right]. \]
Why pointwise unbiasedness is not enough. In machine learning, the hypothesis \(f\) is chosen after seeing the data, so it depends on the training set. Pointwise variance bounds do not control the worst-case fluctuation over all \(f\in\mathcal{F}_\delta\) simultaneously. We need uniform bounds — typically via Rademacher complexities.

Three formulations of ERM

The underlying goal is to minimise \(\mathcal{R}(f)\) while having access only to \(\hat{\mathcal{R}}(f)\). The three standard formulations are:

C
Constrained form. Fix a budget \(\delta\) and minimise empirical loss inside the ball: \[ \hat{f}_\delta = \mathop{\arg\min}_{f\in\mathcal{F}_\delta} \hat{\mathcal{R}}(f). \]
P
Penalised form. Add a regulariser weighted by \(\lambda>0\): \[ \hat{f} = \mathop{\arg\min}_{f\in\mathcal{F}} \hat{\mathcal{R}}(f) + \lambda\,\gamma(f). \]
I
Interpolation form. Minimise complexity subject to zero empirical loss: \[ \hat{f} = \mathop{\arg\min}_{f\in\mathcal{F}} \gamma(f) \quad \text{s.t.} \quad \hat{\mathcal{R}}(f)=0. \]

This is the formulation used to analyse double-descent and implicit regularisation of gradient descent.

Lagrangian duality. The constrained and penalised forms are equivalent for the right choice of \(\lambda\) as a function of \(\delta\), by the Lagrangian duality argument.

3. Error Decomposition full derivation

The central decomposition of excess population risk into three interpretable components is fundamental to theoretical ML. We derive it step by step.

Goal: bound \(\mathcal{R}(\hat{f}_\delta) - \inf_{f\in\mathcal{F}}\mathcal{R}(f)\) in terms of three manageable quantities.

Setup

Fix any \(\delta>0\) and let \(\hat{f}_\delta\in\mathcal{F}_\delta\) be the ERM solution (constrained form). Define the best possible population risk over the full class and over the ball:

Key notation
\[ f^*_\delta = \mathop{\arg\min}_{f\in\mathcal{F}_\delta}\mathcal{R}(f), \qquad \mathcal{R}^*_\delta = \inf_{f\in\mathcal{F}_\delta}\mathcal{R}(f), \qquad \mathcal{R}^* = \inf_{f\in\mathcal{F}}\mathcal{R}(f). \]

Step 1 — Introduce the constrained oracle

Add and subtract \(\mathcal{R}^*_\delta\):

Step 1
\[ \mathcal{R}(\hat{f}_\delta) - \mathcal{R}^* = \underbrace{\bigl(\mathcal{R}(\hat{f}_\delta) - \mathcal{R}^*_\delta\bigr)}_{\text{sub-optimality within }\mathcal{F}_\delta} + \underbrace{\bigl(\mathcal{R}^*_\delta - \mathcal{R}^*\bigr)}_{\varepsilon_{\mathrm{appr}}\ge 0}. \]

The second term is the approximation error: how much we lose by restricting to \(\mathcal{F}_\delta\) instead of all of \(\mathcal{F}\). It is always \(\ge 0\) since \(\mathcal{F}_\delta\subseteq\mathcal{F}\).

Step 2 — Replace population loss by empirical loss

Inside the first bracket, add and subtract the empirical loss at \(\hat{f}_\delta\) and at \(f^*_\delta\):

Step 2
\[ \mathcal{R}(\hat{f}_\delta) - \mathcal{R}^*_\delta = \underbrace{\bigl(\hat{\mathcal{R}}(\hat{f}_\delta) - \hat{\mathcal{R}}(f^*_\delta)\bigr)}_{\le\,\varepsilon_{\mathrm{opt}}\,\text{by defn}} + \underbrace{\bigl(\mathcal{R}(\hat{f}_\delta) - \hat{\mathcal{R}}(\hat{f}_\delta)\bigr)}_{\text{fluctuation at }\hat{f}_\delta} + \underbrace{\bigl(\hat{\mathcal{R}}(f^*_\delta) - \mathcal{R}(f^*_\delta)\bigr)}_{\text{fluctuation at }f^*_\delta}. \]

The first bracket is the optimisation error \(\varepsilon_{\mathrm{opt}} = \hat{\mathcal{R}}(\hat{f}_\delta) - \inf_{f\in\mathcal{F}_\delta}\hat{\mathcal{R}}(f) \ge 0\). It is zero if ERM is solved exactly (often assumed in theory).

Step 3 — Bound fluctuations uniformly

Both fluctuation terms are bounded by the supremum of the generalisation gap over the whole ball:

Step 3
\[ \bigl|\mathcal{R}(f) - \hat{\mathcal{R}}(f)\bigr| \le \sup_{f\in\mathcal{F}_\delta}\bigl|\mathcal{R}(f) - \hat{\mathcal{R}}(f)\bigr| \quad\forall f\in\mathcal{F}_\delta. \] Therefore, combining both fluctuation terms: \[ \bigl(\mathcal{R}(\hat{f}_\delta) - \hat{\mathcal{R}}(\hat{f}_\delta)\bigr) + \bigl(\hat{\mathcal{R}}(f^*_\delta) - \mathcal{R}(f^*_\delta)\bigr) \le 2\sup_{f\in\mathcal{F}_\delta}\bigl|\mathcal{R}(f) - \hat{\mathcal{R}}(f)\bigr| =:\varepsilon_{\mathrm{stat}}. \]

Final bound

Three-way error decomposition
\[ \mathcal{R}(\hat{f}_\delta) - \mathcal{R}^* \;\le\; \underbrace{\varepsilon_{\mathrm{opt}}}_{\text{optimisation}} + \underbrace{2\sup_{f\in\mathcal{F}_\delta}|\mathcal{R}(f)-\hat{\mathcal{R}}(f)|}_{\varepsilon_{\mathrm{stat}},\;\text{statistical}} + \underbrace{(\mathcal{R}^*_\delta - \mathcal{R}^*)}_{\varepsilon_{\mathrm{appr}},\;\text{approximation}}. \]
\(\varepsilon_{\mathrm{opt}}\)
How far the training algorithm is from the ERM minimiser on the empirical objective. Decreases with more compute.
\(\varepsilon_{\mathrm{stat}}\)
Maximum discrepancy between population and empirical loss over the model class. Decreases with more data \(n\); increases with larger \(\delta\).
\(\varepsilon_{\mathrm{appr}}\)
How far the best model in the class is from the true target. Decreases with larger \(\delta\); independent of data.
The fundamental tension. \(\delta\) controls the bias-variance tradeoff at the model-class level: increasing \(\delta\) reduces \(\varepsilon_{\mathrm{appr}}\) but inflates \(\varepsilon_{\mathrm{stat}}\). Choosing the optimal \(\delta\) as a function of \(n\) is the classical model-selection problem.

Geometric picture

Imagine the function space as a plane with two sets of concentric contours: one for the population loss (centred on \(f^*\)) and one for the empirical loss (slightly shifted due to finite data). The ball \(\mathcal{F}_\delta\) is a constrained region:

4. Approximation Error in Depth model capacity

Approximation error
\[ \varepsilon_{\mathrm{appr}}(\delta) = \inf_{f\in\mathcal{F}_\delta}\mathcal{R}(f) - \inf_{f\in\mathcal{F}}\mathcal{R}(f) = \inf_{f\in\mathcal{F}_\delta}\mathcal{R}(f) - \mathcal{R}^*. \]

Three regimes matter in practice:

Universal approximation

If \(\mathcal{F}\) is dense (e.g. neural networks with a non-polynomial activation), then \(\inf_{f\in\mathcal{F}}\mathcal{R}(f)=0\), so \(\mathcal{R}^*=0\). But making \(\delta\to\infty\) may be statistically intractable.

Sobolev class

If \(f^*\in\mathcal{H}^s\) (s finite derivatives bounded), then for shallow nets with \(m\) neurons: \(\varepsilon_{\mathrm{appr}} = \Theta(m^{-s/d})\). The rate degrades exponentially with input dimension \(d\).

Barron class

If \(\int|\hat{f}^*(\omega)|\|\omega\|^2\,d\omega<\infty\), the approximation error is \(\varepsilon_{\mathrm{appr}} = O(m^{-1})\), independent of \(d\). A rare curse-free setting.

Key message. Approximation error reflects a pure model-capacity question: does \(\mathcal{F}_\delta\) contain a function close to \(f^*\)? It has nothing to do with data or the training algorithm. It is determined entirely by the architecture and the target.

5. Statistical Error in Depth generalisation gap

Statistical error
\[ \varepsilon_{\mathrm{stat}} = 2\sup_{f\in\mathcal{F}_\delta}\bigl|\mathcal{R}(f) - \hat{\mathcal{R}}(f)\bigr|. \]

This is the uniform law of large numbers deviation. Pointwise, each \(\hat{\mathcal{R}}(f)\) concentrates around \(\mathcal{R}(f)\), but we need simultaneous control over all \(f\in\mathcal{F}_\delta\) because the chosen \(\hat{f}_\delta\) depends on the training data.

Rademacher complexity

The standard tool to bound \(\varepsilon_{\mathrm{stat}}\) is the empirical Rademacher complexity:

Rademacher complexity
\[ \mathfrak{R}_n(\mathcal{F}_\delta) = \mathbb{E}_{\varepsilon,x}\!\left[ \sup_{f\in\mathcal{F}_\delta} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i) \right], \] where \(\varepsilon_i\in\{-1,+1\}\) are independent Rademacher (uniform random sign) variables.
Generalisation bound via Rademacher complexity

With probability at least \(1-\delta'\) over the draw of the data:

\[ \sup_{f\in\mathcal{F}_\delta}|\mathcal{R}(f) - \hat{\mathcal{R}}(f)| \le 2\mathfrak{R}_n(\ell\circ\mathcal{F}_\delta) + c\sqrt{\frac{\log(1/\delta')}{n}}. \]
Intuition for Rademacher complexity. The random signs \(\varepsilon_i\) represent pure noise. A rich class can "fit" noise well, achieving a large correlation. A simpler class cannot adapt to random signs and therefore has small complexity.

How the ball size \(\delta\) enters

For Lipschitz functions, the Rademacher complexity of the ball \(\mathcal{F}_\delta\) typically scales as:

Typical scaling
\[ \mathfrak{R}_n(\mathcal{F}_\delta) \sim \delta \cdot C(n, d), \]

where \(C(n,d)\) captures dimensional complexity. For Lipschitz classes on \(\mathbb{R}^d\), \(C(n,d)\sim n^{-1/d}\), exposing the curse.

6. Optimisation Error in Depth algorithmic gap

Optimisation error
\[ \varepsilon_{\mathrm{opt}} = \hat{\mathcal{R}}(\hat{f}_\delta) - \inf_{f\in\mathcal{F}_\delta}\hat{\mathcal{R}}(f). \]

This is zero only when the algorithm finds a global minimiser of \(\hat{\mathcal{R}}\) on \(\mathcal{F}_\delta\). In theory, this is often assumed to be zero (exact ERM). In practice, it depends on how well the optimiser converges.

When it is hard to minimise

For generic non-convex losses, finding the global minimum is NP-hard. In high dimensions, the landscape may have exponentially many local minima.

When it is tractable

Many ML models (linear regression, kernel regression, convex losses) make ERM a convex problem. Gradient descent provably converges to the global minimum.

Theorem (Jin et al., 2017). Noisy gradient descent finds an \(\varepsilon\)-approximate second-order stationary point of a \(\beta\)-smooth function in \(\tilde{O}(\beta/\varepsilon^2)\) iterations — a rate with no explicit dependence on dimension (up to logarithmic factors). Provided the empirical loss has no spurious local minima, this translates to a global minimiser.

7. The Curse of Dimensionality Bellman, 1962

The curse of dimensionality refers to the phenomenon where computational and statistical costs grow exponentially (or super-polynomially) with the input dimension \(d\), making learning from high-dimensional data fundamentally hard without further structure.

"The curse of dimensionality" — Richard Bellman, 1962. The name was coined in the context of dynamic programming, but the phenomenon is universal.

Three faces of the curse

Statistical curse

The number of samples needed to learn a generic Lipschitz function to error \(\varepsilon\) grows as \(n\sim\varepsilon^{-d}\). Even moderate \(d\) makes this astronomical.

Approximation curse

A network with \(m\) neurons approximates a Sobolev \(s\)-smooth function only to error \(\Theta(m^{-s/d})\). Increasing \(d\) requires exponentially more neurons for the same accuracy.

Optimisation curse

Minimising a generic \(d\)-dimensional function is NP-hard. The number of local minima can grow exponentially in \(d\), though structured ML losses often escape this.

High-dimensional geometry intuition

In high dimensions, the volume of a unit ball concentrates near its surface: almost all of the mass of a Gaussian in \(\mathbb{R}^d\) lies in a thin shell at radius \(\sqrt{d}\). This means that two random points are nearly orthogonal, and there are no "close neighbours" — every point is far from every other point.

Volume concentration
\[ \frac{\text{Vol}(B_d(1-\varepsilon))}{\text{Vol}(B_d(1))} = (1-\varepsilon)^d \;\xrightarrow{d\to\infty}\; 0 \quad \text{for any fixed } \varepsilon>0. \]

Almost all of the volume of the unit ball sits within distance \(\varepsilon\) of the boundary. This makes grid-based methods useless: you would need \((1/\varepsilon)^d\) grid points to cover the ball.

8. Learning Lipschitz Functions statistical curse quantified

The Lipschitz class is the simplest regularity assumption that captures continuity. We show that even here, learning is exponentially hard in dimension.

Lipschitz function
\[ f:\mathcal X\subseteq\mathbb{R}^d\to\mathbb{R} \text{ is } \beta\text{-Lipschitz if } |f(x)-f(x')|\le\beta\|x-x'\| \quad\forall x,x'. \]

Upper bound: nearest-neighbour estimator

Fix \(f^*\) to be 1-Lipschitz and let \(\nu=\mathcal{N}(0,I_d)\). The nearest-neighbour estimator \(\hat{f}(x)=y_{i^*}\), where \(i^*=\arg\min_i\|x-x_i\|\), satisfies:

Bounding the error

For any test point \(x\), let \(r_n = \|x - x_{i^*}\|\) be the distance to the nearest training point.

\[ |f^*(x) - \hat{f}(x)| = |f^*(x) - f^*(x_{i^*})| \le \|x - x_{i^*}\| = r_n \quad\text{(Lipschitz condition)}. \]

The squared population loss is bounded by the expected squared nearest-neighbour distance:

\[ \mathcal{R}(\hat{f}) \le \mathbb{E}[r_n^2]. \]

To bound \(\mathbb{E}[r_n^2]\), we analyse how densely \(n\) random Gaussian points cover \(\mathbb{R}^d\). Covering a ball of radius \(R\) with \((R/r)^d\) balls of radius \(r\) means we need \(n\sim(R/r)^d\) samples to guarantee every point has a neighbour within distance \(r\).

Upper bound
\[ \mathcal{R}(\hat{f}) = O\!\left(n^{-2/d}\right) \quad \text{as } n\to\infty. \]

To achieve error \(\varepsilon\), we need \(n = O(\varepsilon^{-d/2})\) samples — exponential in \(d\).

Lower bound: Le Cam / Fano method

The lower bound shows this exponential rate is unavoidable for any estimator, not just nearest-neighbour.

Lower bound sketch

Construct a set of \(M\) functions that are all 1-Lipschitz but pairwise differ by at least \(\varepsilon\) in \(L^2(\nu)\) on disjoint regions of \(\mathbb{R}^d\):

  • Partition a region into cubes of side length \(r\).
  • On each cube, the function can be \(0\) or go up by \(r\) (staying Lipschitz).
  • Number of cubes: \(\sim (R/r)^d\), giving \(M = 2^{(R/r)^d}\) distinct functions.
  • With \(n\) samples, we cannot distinguish \(M\) hypotheses unless \(n\gtrsim \log M \sim (R/r)^d\).

Setting \(r = \varepsilon\) gives the lower bound:

\[ \inf_{\hat f}\sup_{f^*\in\mathrm{Lip}(1)}\mathbb{E}[\mathcal{R}(\hat f)] = \Omega\!\left(n^{-2/d}\right). \]
Minimax rate for Lipschitz estimation
\[ \inf_{\hat f}\sup_{f^*\in\mathrm{Lip}(\beta)} \mathbb{E}[\,\mathcal{R}(\hat f)\,] = \Theta\!\left(n^{-2/d}\right). \]

Both the nearest-neighbour upper bound and the Le Cam lower bound match. The rate is tight. For \(d=100\) and target error \(\varepsilon=0.1\), we need \(n \sim 0.1^{-50} = 10^{50}\) samples.

Implication. The Lipschitz class is too large for high-dimensional learning. Although it is a natural smoothness assumption, it does not give enough structure to overcome the statistical curse. We must impose additional structure.

9. Curse in Approximation shallow nets and Sobolev rates

Shallow neural network class

Shallow net
\[ f(x) = \sum_{j=1}^m a_j\,\rho(w_j^Tx + b_j), \qquad \mathcal{F} = \{f \text{ of this form for any } m,a_j,w_j,b_j\}. \]

By the Universal Approximation Theorem (Hornik, Cybenko, Barron, Pinkus, …): if \(\rho\) is not a polynomial, \(\mathcal{F}\) is dense in the class of continuous functions under the uniform compact topology. But density alone does not tell us how many neurons we need.

Sobolev class: curse remains

Sobolev approximation rate (DeVore, Maiorov)

If \(f^*\in\mathcal{H}^s\) (the Sobolev class with \(s\) bounded derivatives), then the best approximation by a shallow net with \(m\) neurons satisfies:

\[ \inf_{g\in\mathcal{F}}\sup_{x\in K}|f^*(x)-g(x)| = \Theta\!\left(m^{-s/d}\right). \]

Here \(K\subset\mathbb{R}^d\) is a compact domain and \(s\) is the number of bounded derivatives.

Why the curse appears

Approximating a smooth function in \(\mathbb{R}^d\) requires covering the input space. A grid of side \(h\) needs \((1/h)^d\) cells. A shallow net approximating the function at resolution \(h\) on each cell needs at most one neuron per cell (a bump function), so \(m\sim(1/h)^d\). Inverting: \(h\sim m^{-1/d}\), so the approximation error (controlled by \(h^s\)) is \(\sim m^{-s/d}\).

Consequence for error decomposition. Setting \(\delta\) proportional to the neuron budget \(m\), \(\varepsilon_{\mathrm{appr}}(\delta)\sim m^{-s/d}\). For fixed accuracy \(\varepsilon\), we need \(m\sim\varepsilon^{-d/s}\) neurons — exponential in \(d\).

Barron class: the curse-free exception

Barron condition
\[ C_{f^*}^2 := \int_{\mathbb{R}^d} |\hat{f}^*(\omega)|\,\|\omega\|^2\,d\omega < \infty, \] where \(\hat{f}^*\) is the Fourier transform of \(f^*\).
Barron approximation theorem
\[ \inf_{g\in\mathcal{F}_m}\|f^*-g\|_{L^2(\nu)}^2 = O\!\left(\frac{C_{f^*}^2}{m}\right), \] where \(\mathcal{F}_m\) consists of shallow nets with at most \(m\) neurons.

The rate \(O(m^{-1})\) is independent of dimension \(d\). This is because the Barron condition controls the Fourier energy at high frequencies — functions satisfying it are "band-limited enough" that they can be well-approximated by averaging random Fourier features.

Monte Carlo argument (sketch)

By the Barron condition, \(f^*\) has an integral representation:

\[ f^*(x) = \int_{\mathbb{R}^d} e^{i\omega^Tx}\hat{f}^*(\omega)\,d\omega \approx \sum_{j=1}^m w_j\,e^{i\omega_j^Tx}, \quad \omega_j \sim \frac{|\hat{f}^*(\omega)|\|\omega\|^2}{C_{f^*}^2}. \]

By Monte Carlo, each term contributes variance \(\le C_{f^*}^2/m\) to the approximation error. The dimension \(d\) does not appear because Fourier sampling is dimension-agnostic.

Practical caveat. The Barron condition is a strong restriction on \(f^*\). Most functions encountered in practice (e.g. those that depend on all \(d\) coordinates in a non-linear and non-smooth way) do not satisfy it. It is a special favourable case, not the rule.
Function class Approximation rate (neurons \(m\)) Curse?
Sobolev \(\mathcal{H}^s\) in \(\mathbb{R}^d\) \(\Theta(m^{-s/d})\) Yes — exponential in \(d\)
Barron class (spectral condition) \(O(m^{-1})\) No — dimension-free
Composition of Sobolev pieces (deep nets) \(O(m^{-s/d'})\) with \(d'\ll d\) Partially lifted if \(d'\) is small

10. Curse in Optimisation non-convex landscapes

Even if the model class and data are well-behaved, finding the ERM solution is in general NP-hard.

Hardness result. Training a 3-node network with sigmoid activations to minimise the squared error is NP-hard (Blum and Rivest, 1992). The landscape of the empirical loss has exponentially many saddle points and local minima in general.

Why neural network training often works despite NP-hardness

The NP-hardness result applies to the worst case. Many practically important ML models define "benign" landscapes: losses with no spurious local minima, or where all local minima are global minima.

Over-parameterisation

With enough parameters, the empirical loss typically has many global minima forming a connected manifold. Gradient descent converges to this manifold from any initialisation.

Implicit regularisation

Even among global minima, gradient descent with small learning rate tends to pick low-complexity solutions, acting as an implicit regulariser.

Gradient descent escapes saddle points

Noisy GD theorem (Jin et al., 2017)

Let \(\mathcal{L}\) be a \(\beta\)-smooth function. Noisy gradient descent (GD with small random perturbations) finds an \(\varepsilon\)-approximate second-order stationary point — a point where \(\|\nabla\mathcal{L}\|\le\varepsilon\) and \(\lambda_{\min}(\nabla^2\mathcal{L})\ge-\sqrt{\varepsilon}\) — in at most

\[ T = \tilde{O}\!\left(\frac{\beta}{\varepsilon^2}\right) \]

iterations. The iteration count has no explicit dependence on dimension (up to log factors).

Key implication. If the ERM landscape has no bad local minima (all second-order stationary points are global minima), then noisy GD efficiently finds a global minimiser in polynomial time regardless of dimension. This is verified for several ML models (matrix factorisation, two-layer nets in the mean-field limit, etc.).

Summary of three curses

Curse Where it appears Formal manifestation Known escape
Statistical \(\varepsilon_{\mathrm{stat}}\) Minimax rate \(\Theta(n^{-2/d})\) for Lipschitz class Geometric priors, invariance, low-dim structure
Approximation \(\varepsilon_{\mathrm{appr}}\) \(\Theta(m^{-s/d})\) for Sobolev class Barron class; compositional / deep models
Optimisation \(\varepsilon_{\mathrm{opt}}\) NP-hard in general; exponential local minima Benign landscapes; noisy GD; implicit regularisation

11. Summary and the Way Out geometric function spaces

The three curses leave us in a dilemma:

Lipschitz class is too large

The statistical error is exponential in dimension: \(\varepsilon_{\mathrm{stat}}\sim n^{-2/d}\). Even with infinite compute, no algorithm can learn a generic Lipschitz function in high dimension from a polynomial number of samples.

Smooth Sobolev/Barron classes are too restrictive

The approximation error for Sobolev functions is \(\Theta(m^{-s/d})\), cursed again. The Barron class avoids this but requires very strong Fourier decay, which most real targets do not satisfy.

We need function classes that are rich enough to contain \(f^*\) but structured enough to be statistically and computationally tractable.

Geometric function spaces: the key idea

Real high-dimensional data — images, molecules, 3D shapes, social graphs — are not arbitrary vectors in \(\mathbb{R}^d\). They are signals defined on structured low-dimensional domains:

Data type High-dim ambient space Low-dim geometric domain \(\Omega\) Signal
Image \(\mathbb{R}^{H\times W\times 3}\) 2D pixel grid colour per pixel
Molecule \(\mathbb{R}^{3N}\) (atom positions) graph of bonds atom features per node
3D shape \(\mathbb{R}^{d}\) (point cloud) manifold / mesh surface geometric or semantic feature

The geometric domain \(\Omega\) carries symmetries (translations, rotations, permutations) and a notion of locality (nearby pixels share context; bonded atoms interact). By encoding these priors into the hypothesis class, we can simultaneously reduce approximation and statistical error — without losing \(f^*\) from the class, provided \(f^*\) itself respects the symmetry.

Geometric regularity vs classical regularity. Classical Sobolev regularity penalises any oscillation in input space. Geometric regularity additionally respects symmetry (invariance/equivariance) and locality (nearby domain points give similar signals). This is a richer but more relevant prior for structured data.

What comes next

Symmetry and group theory: formalise which transformations of \(\Omega\) should leave \(f^*\) invariant or equivariant.
Group smoothing: show that averaging over group orbits is an orthogonal projection that cannot worsen approximation error when \(f^*\) is invariant.
Sample complexity under invariance: the effective sample size scales as \((|\mathfrak{G}|n)^{-1/d}\), giving a multiplicative improvement by the group size.
Scale separation: composing equivariant layers with local pooling reduces the effective dimension from \(d\) to a much smaller intrinsic dimension, further lifting the curse.
Big picture. The error decomposition \(\varepsilon_{\mathrm{opt}}+\varepsilon_{\mathrm{stat}}+\varepsilon_{\mathrm{appr}}\) provides a clean framework for understanding where high-dimensional learning fails and how structure helps. Each geometric prior directly addresses one or more of these three terms.