A detailed study note covers the full statistical learning framework, a rigorous derivation of the three-way error decomposition (approximation, statistical, optimization), and three manifestations of the curse of dimensionality in Lipschitz estimation, neural network approximation, and non-convex optimization.
Statistical learning concerns fitting an unknown function \(f^*\) from data. The setup involves four interacting components: a data distribution, a hypothesis class, a loss metric, and an estimation algorithm.
We observe \(n\) i.i.d. pairs \(\{(x_i, y_i)\}_{i=1}^n\), where
with \(f^*:\mathcal X\to\mathbb{R}\) the unknown target function we wish to recover. Examples: \(f^*(x)\) = excitation energy of molecule \(x\), or \(f^*(x)=P(y=1\mid x)\) in binary classification.
We restrict the search to a hypothesis class \(\mathcal{F}\subset\{f:\mathcal{X}\to\mathbb{R}\}\). Within \(\mathcal{F}\), we measure model size via a (non-negative) complexity measure \(\gamma:\mathcal{F}\to\mathbb{R}_{\ge 0}\), e.g. a norm or neuron count.
Choosing \(\delta\) trades approximation power against statistical tractability: larger \(\delta\) admits richer functions but also harder generalisation.
| Hypothesis class | Complexity measure \(\gamma(f)\) | Intuition |
|---|---|---|
| Shallow neural network \(\sum_j a_j\rho(w_j^Tx+b_j)\) | Number of neurons \(m\) | More neurons → richer functions |
| Same shallow net | Path norm \(\sum_j|a_j|(\|w_j\|+|b_j|)\) | Controls smoothness of the function |
| Sobolev ball \(\mathcal{H}^s\) | Sobolev norm \(\int(1+\omega^2)^s|\hat f(\omega)|^2\,d\omega\) | \(s\) finite derivatives bounded |
where \(\ell(y,y')\) is a pointwise convex error measure, e.g. \(\ell(y,y')=|y-y'|^2\).
For any fixed \(f\in\mathcal{F}\), the empirical loss is an unbiased estimator of the population loss:
The underlying goal is to minimise \(\mathcal{R}(f)\) while having access only to \(\hat{\mathcal{R}}(f)\). The three standard formulations are:
This is the formulation used to analyse double-descent and implicit regularisation of gradient descent.
The central decomposition of excess population risk into three interpretable components is fundamental to theoretical ML. We derive it step by step.
Fix any \(\delta>0\) and let \(\hat{f}_\delta\in\mathcal{F}_\delta\) be the ERM solution (constrained form). Define the best possible population risk over the full class and over the ball:
Add and subtract \(\mathcal{R}^*_\delta\):
The second term is the approximation error: how much we lose by restricting to \(\mathcal{F}_\delta\) instead of all of \(\mathcal{F}\). It is always \(\ge 0\) since \(\mathcal{F}_\delta\subseteq\mathcal{F}\).
Inside the first bracket, add and subtract the empirical loss at \(\hat{f}_\delta\) and at \(f^*_\delta\):
The first bracket is the optimisation error \(\varepsilon_{\mathrm{opt}} = \hat{\mathcal{R}}(\hat{f}_\delta) - \inf_{f\in\mathcal{F}_\delta}\hat{\mathcal{R}}(f) \ge 0\). It is zero if ERM is solved exactly (often assumed in theory).
Both fluctuation terms are bounded by the supremum of the generalisation gap over the whole ball:
Imagine the function space as a plane with two sets of concentric contours: one for the population loss (centred on \(f^*\)) and one for the empirical loss (slightly shifted due to finite data). The ball \(\mathcal{F}_\delta\) is a constrained region:
Three regimes matter in practice:
If \(\mathcal{F}\) is dense (e.g. neural networks with a non-polynomial activation), then \(\inf_{f\in\mathcal{F}}\mathcal{R}(f)=0\), so \(\mathcal{R}^*=0\). But making \(\delta\to\infty\) may be statistically intractable.
If \(f^*\in\mathcal{H}^s\) (s finite derivatives bounded), then for shallow nets with \(m\) neurons: \(\varepsilon_{\mathrm{appr}} = \Theta(m^{-s/d})\). The rate degrades exponentially with input dimension \(d\).
If \(\int|\hat{f}^*(\omega)|\|\omega\|^2\,d\omega<\infty\), the approximation error is \(\varepsilon_{\mathrm{appr}} = O(m^{-1})\), independent of \(d\). A rare curse-free setting.
This is the uniform law of large numbers deviation. Pointwise, each \(\hat{\mathcal{R}}(f)\) concentrates around \(\mathcal{R}(f)\), but we need simultaneous control over all \(f\in\mathcal{F}_\delta\) because the chosen \(\hat{f}_\delta\) depends on the training data.
The standard tool to bound \(\varepsilon_{\mathrm{stat}}\) is the empirical Rademacher complexity:
With probability at least \(1-\delta'\) over the draw of the data:
\[ \sup_{f\in\mathcal{F}_\delta}|\mathcal{R}(f) - \hat{\mathcal{R}}(f)| \le 2\mathfrak{R}_n(\ell\circ\mathcal{F}_\delta) + c\sqrt{\frac{\log(1/\delta')}{n}}. \]For Lipschitz functions, the Rademacher complexity of the ball \(\mathcal{F}_\delta\) typically scales as:
where \(C(n,d)\) captures dimensional complexity. For Lipschitz classes on \(\mathbb{R}^d\), \(C(n,d)\sim n^{-1/d}\), exposing the curse.
This is zero only when the algorithm finds a global minimiser of \(\hat{\mathcal{R}}\) on \(\mathcal{F}_\delta\). In theory, this is often assumed to be zero (exact ERM). In practice, it depends on how well the optimiser converges.
For generic non-convex losses, finding the global minimum is NP-hard. In high dimensions, the landscape may have exponentially many local minima.
Many ML models (linear regression, kernel regression, convex losses) make ERM a convex problem. Gradient descent provably converges to the global minimum.
The curse of dimensionality refers to the phenomenon where computational and statistical costs grow exponentially (or super-polynomially) with the input dimension \(d\), making learning from high-dimensional data fundamentally hard without further structure.
The number of samples needed to learn a generic Lipschitz function to error \(\varepsilon\) grows as \(n\sim\varepsilon^{-d}\). Even moderate \(d\) makes this astronomical.
A network with \(m\) neurons approximates a Sobolev \(s\)-smooth function only to error \(\Theta(m^{-s/d})\). Increasing \(d\) requires exponentially more neurons for the same accuracy.
Minimising a generic \(d\)-dimensional function is NP-hard. The number of local minima can grow exponentially in \(d\), though structured ML losses often escape this.
In high dimensions, the volume of a unit ball concentrates near its surface: almost all of the mass of a Gaussian in \(\mathbb{R}^d\) lies in a thin shell at radius \(\sqrt{d}\). This means that two random points are nearly orthogonal, and there are no "close neighbours" — every point is far from every other point.
Almost all of the volume of the unit ball sits within distance \(\varepsilon\) of the boundary. This makes grid-based methods useless: you would need \((1/\varepsilon)^d\) grid points to cover the ball.
The Lipschitz class is the simplest regularity assumption that captures continuity. We show that even here, learning is exponentially hard in dimension.
Fix \(f^*\) to be 1-Lipschitz and let \(\nu=\mathcal{N}(0,I_d)\). The nearest-neighbour estimator \(\hat{f}(x)=y_{i^*}\), where \(i^*=\arg\min_i\|x-x_i\|\), satisfies:
For any test point \(x\), let \(r_n = \|x - x_{i^*}\|\) be the distance to the nearest training point.
\[ |f^*(x) - \hat{f}(x)| = |f^*(x) - f^*(x_{i^*})| \le \|x - x_{i^*}\| = r_n \quad\text{(Lipschitz condition)}. \]The squared population loss is bounded by the expected squared nearest-neighbour distance:
\[ \mathcal{R}(\hat{f}) \le \mathbb{E}[r_n^2]. \]To bound \(\mathbb{E}[r_n^2]\), we analyse how densely \(n\) random Gaussian points cover \(\mathbb{R}^d\). Covering a ball of radius \(R\) with \((R/r)^d\) balls of radius \(r\) means we need \(n\sim(R/r)^d\) samples to guarantee every point has a neighbour within distance \(r\).
To achieve error \(\varepsilon\), we need \(n = O(\varepsilon^{-d/2})\) samples — exponential in \(d\).
The lower bound shows this exponential rate is unavoidable for any estimator, not just nearest-neighbour.
Construct a set of \(M\) functions that are all 1-Lipschitz but pairwise differ by at least \(\varepsilon\) in \(L^2(\nu)\) on disjoint regions of \(\mathbb{R}^d\):
Setting \(r = \varepsilon\) gives the lower bound:
\[ \inf_{\hat f}\sup_{f^*\in\mathrm{Lip}(1)}\mathbb{E}[\mathcal{R}(\hat f)] = \Omega\!\left(n^{-2/d}\right). \]Both the nearest-neighbour upper bound and the Le Cam lower bound match. The rate is tight. For \(d=100\) and target error \(\varepsilon=0.1\), we need \(n \sim 0.1^{-50} = 10^{50}\) samples.
By the Universal Approximation Theorem (Hornik, Cybenko, Barron, Pinkus, …): if \(\rho\) is not a polynomial, \(\mathcal{F}\) is dense in the class of continuous functions under the uniform compact topology. But density alone does not tell us how many neurons we need.
If \(f^*\in\mathcal{H}^s\) (the Sobolev class with \(s\) bounded derivatives), then the best approximation by a shallow net with \(m\) neurons satisfies:
\[ \inf_{g\in\mathcal{F}}\sup_{x\in K}|f^*(x)-g(x)| = \Theta\!\left(m^{-s/d}\right). \]Here \(K\subset\mathbb{R}^d\) is a compact domain and \(s\) is the number of bounded derivatives.
Approximating a smooth function in \(\mathbb{R}^d\) requires covering the input space. A grid of side \(h\) needs \((1/h)^d\) cells. A shallow net approximating the function at resolution \(h\) on each cell needs at most one neuron per cell (a bump function), so \(m\sim(1/h)^d\). Inverting: \(h\sim m^{-1/d}\), so the approximation error (controlled by \(h^s\)) is \(\sim m^{-s/d}\).
Consequence for error decomposition. Setting \(\delta\) proportional to the neuron budget \(m\), \(\varepsilon_{\mathrm{appr}}(\delta)\sim m^{-s/d}\). For fixed accuracy \(\varepsilon\), we need \(m\sim\varepsilon^{-d/s}\) neurons — exponential in \(d\).
The rate \(O(m^{-1})\) is independent of dimension \(d\). This is because the Barron condition controls the Fourier energy at high frequencies — functions satisfying it are "band-limited enough" that they can be well-approximated by averaging random Fourier features.
By the Barron condition, \(f^*\) has an integral representation:
\[ f^*(x) = \int_{\mathbb{R}^d} e^{i\omega^Tx}\hat{f}^*(\omega)\,d\omega \approx \sum_{j=1}^m w_j\,e^{i\omega_j^Tx}, \quad \omega_j \sim \frac{|\hat{f}^*(\omega)|\|\omega\|^2}{C_{f^*}^2}. \]By Monte Carlo, each term contributes variance \(\le C_{f^*}^2/m\) to the approximation error. The dimension \(d\) does not appear because Fourier sampling is dimension-agnostic.
| Function class | Approximation rate (neurons \(m\)) | Curse? |
|---|---|---|
| Sobolev \(\mathcal{H}^s\) in \(\mathbb{R}^d\) | \(\Theta(m^{-s/d})\) | Yes — exponential in \(d\) |
| Barron class (spectral condition) | \(O(m^{-1})\) | No — dimension-free |
| Composition of Sobolev pieces (deep nets) | \(O(m^{-s/d'})\) with \(d'\ll d\) | Partially lifted if \(d'\) is small |
Even if the model class and data are well-behaved, finding the ERM solution is in general NP-hard.
The NP-hardness result applies to the worst case. Many practically important ML models define "benign" landscapes: losses with no spurious local minima, or where all local minima are global minima.
With enough parameters, the empirical loss typically has many global minima forming a connected manifold. Gradient descent converges to this manifold from any initialisation.
Even among global minima, gradient descent with small learning rate tends to pick low-complexity solutions, acting as an implicit regulariser.
Let \(\mathcal{L}\) be a \(\beta\)-smooth function. Noisy gradient descent (GD with small random perturbations) finds an \(\varepsilon\)-approximate second-order stationary point — a point where \(\|\nabla\mathcal{L}\|\le\varepsilon\) and \(\lambda_{\min}(\nabla^2\mathcal{L})\ge-\sqrt{\varepsilon}\) — in at most
\[ T = \tilde{O}\!\left(\frac{\beta}{\varepsilon^2}\right) \]iterations. The iteration count has no explicit dependence on dimension (up to log factors).
| Curse | Where it appears | Formal manifestation | Known escape |
|---|---|---|---|
| Statistical | \(\varepsilon_{\mathrm{stat}}\) | Minimax rate \(\Theta(n^{-2/d})\) for Lipschitz class | Geometric priors, invariance, low-dim structure |
| Approximation | \(\varepsilon_{\mathrm{appr}}\) | \(\Theta(m^{-s/d})\) for Sobolev class | Barron class; compositional / deep models |
| Optimisation | \(\varepsilon_{\mathrm{opt}}\) | NP-hard in general; exponential local minima | Benign landscapes; noisy GD; implicit regularisation |
The three curses leave us in a dilemma:
The statistical error is exponential in dimension: \(\varepsilon_{\mathrm{stat}}\sim n^{-2/d}\). Even with infinite compute, no algorithm can learn a generic Lipschitz function in high dimension from a polynomial number of samples.
The approximation error for Sobolev functions is \(\Theta(m^{-s/d})\), cursed again. The Barron class avoids this but requires very strong Fourier decay, which most real targets do not satisfy.
Real high-dimensional data — images, molecules, 3D shapes, social graphs — are not arbitrary vectors in \(\mathbb{R}^d\). They are signals defined on structured low-dimensional domains:
| Data type | High-dim ambient space | Low-dim geometric domain \(\Omega\) | Signal |
|---|---|---|---|
| Image | \(\mathbb{R}^{H\times W\times 3}\) | 2D pixel grid | colour per pixel |
| Molecule | \(\mathbb{R}^{3N}\) (atom positions) | graph of bonds | atom features per node |
| 3D shape | \(\mathbb{R}^{d}\) (point cloud) | manifold / mesh surface | geometric or semantic feature |
The geometric domain \(\Omega\) carries symmetries (translations, rotations, permutations) and a notion of locality (nearby pixels share context; bonded atoms interact). By encoding these priors into the hypothesis class, we can simultaneously reduce approximation and statistical error — without losing \(f^*\) from the class, provided \(f^*\) itself respects the symmetry.