The focus is the mathematical reasoning behind Geometric Deep Learning (GDL): geometric domains, group actions, representations, invariance, equivariance, group smoothing, projection arguments, sample complexity, scale separation, and the GDL blueprint.
In high-dimensional learning, the number of samples needed to learn a generic function can grow exponentially with the input dimension. Geometric Deep Learning starts from the observation that many real data types are not arbitrary vectors. They are signals on structured domains: images on grids, molecular features on graphs, features on manifolds, or sets of particles.
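The error of a learned model is usually split into three parts:
Approximation error: how well the best function in the model class can approximate the true target \(f^\*\).
Statistical error: how many samples are needed to choose a good hypothesis from the class.
Optimization error: how well the training algorithm can actually find a good parameter setting.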
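Geometric priors mainly aim to reduce statistical and computational difficulty by making the hypothesis class smaller and better aligned with the task. The ideal situation is a hypothesis class small enough to be learned from few samples, yet still rich enough to contain a good approximation of \(f^\*\).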
The starting object is a domain \(\Omega\). It may be a grid, graph, set, sphere, manifold, or group. A datum is then a signal on this domain:
\[ x:\Omega\to\mathcal C. \]
Here \(\mathcal C\) is the channel space. For an RGB image, \(\mathcal C=\mathbb R^3\). For a molecular graph, \(\mathcal C\) may contain atom-type or node-feature channels.
| Data type | Domain \(\Omega\) | Signal \(x(u)\) | Typical channels \(\mathcal C\) |
|---|---|---|---|
| Image | 2D grid pixels | color value at pixel \(u\) | RGB, feature maps |
| Graph | vertices \(V\), sometimes \(V\times V\) | node or edge feature | atom type, degree, embedding vector |
| Point cloud | unordered set of points | feature attached to each point | coordinates, normals, colors |
| Manifold / mesh | surface points | scalar/vector/tensor feature | geometry, texture, tangent vector |
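Signals can be added and scaled pointwise, so they form a vector space:
\[ (\alpha x+\beta y)(u)=\alpha x(u)+\beta y(u), \qquad \alpha,\beta\in\mathbb R. \]
If \(\mathcal C\) has an inner product and \(\Omega\) has a measure \(\mu\), we get an inner product on signals:
\[ \langle x,y\rangle=\int_\Omega \langle x(u),y(u)\rangle_{\mathcal C}\,d\mu(u). \]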
A symmetry is a transformation that preserves the relevant structure. The symmetries of an object form a group. In geometric deep learning, the important symmetries are often not symmetries of one single sample, but symmetries of the domain description or of the label function.
A group is a pair \((\mathfrak G,\cdot)\), where \(\mathfrak G\) is a set and \(\cdot\) is a binary operation satisfying:
\[ (gh)k=g(hk) \quad \text{associativity}, \]
\[ eg=ge=g \quad \text{identity}, \]
\[ gg^{-1}=g^{-1}g=e \quad \text{inverse}. \]

| Group | Elements | Operation | Typical GDL role |
|---|---|---|---|
| \((\mathbb Z,+)\) | integers | addition | discrete shifts |
| \(C_n\) | cyclic rotations | composition / modular addition | rotational symmetry |
| \(D_n\) | rotations and reflections of an \(n\)-gon | composition | dihedral symmetry |
| \(S_n\) | permutations of \(n\) objects | composition | sets and graphs |
| \(\mathbb R^d\) | continuous translations | vector addition | Euclidean translation symmetry |
| \(SO(3)\) | 3D rotations | matrix multiplication | spherical and 3D data |
The group operation \((g,h)\mapsto gh\) does not have to be one-to-one as a map from pairs to elements. For example, \(1+2=0+3=3\) in \((\mathbb Z,+)\). However, for a fixed \(h\), right multiplication \(R_h(g)=gh\) is a bijection of \(\mathfrak G\) onto itself.
If \(g_1h=g_2h\), then multiply on the right by \(h^{-1}\):
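\[ (g_1h)h^{-1}=(g_2h)h^{-1} \quad\Rightarrow\quad g_1(hh^{-1})=g_2(hh^{-1}) \quad\Rightarrow\quad g_1=g_2. \]
So \(R_h\) is injective. It is also surjective, because any \(k\in\mathfrak G\) can be written as \(k=(kh^{-1})h\). This is the algebraic reason why sums over a finite group can be re-indexed by \(k=gh\).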
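The bijection and the re-indexing it permits can be checked by brute force on a small group. The following is a minimal sketch on \(S_3\); the helper names `compose` and `right_multiply` and the test function `F` are illustrative choices, not part of the lecture.

```python
from itertools import permutations

# Elements of S_3 as tuples: p[i] is the image of i under p.
S3 = list(permutations(range(3)))

def compose(p, q):
    """Return the product p*q, meaning: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def right_multiply(h, group):
    """Image of the map g -> g*h over the whole group."""
    return [compose(g, h) for g in group]

h = (1, 2, 0)  # an arbitrary fixed element (a 3-cycle)
image = right_multiply(h, S3)

# Bijection: the image contains every group element exactly once.
is_bijection = sorted(image) == sorted(S3)

# Re-indexing: summing F(g*h) over g equals summing F(k) over k.
F = {g: idx ** 2 for idx, g in enumerate(S3)}  # arbitrary function on the group
sum_shifted = sum(F[compose(g, h)] for g in S3)
sum_plain = sum(F[k] for k in S3)

print(is_bijection, sum_shifted == sum_plain)
```

Both checks print `True` for every choice of `h`, since right multiplication only permutes the terms of the sum.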
A group by itself is abstract. To use it on data, we need an action. A group action tells us how each group element transforms points of a space.
If \(x:\Omega\to\mathcal C\) is a signal and \(g\) acts on \(\Omega\), the induced action on signals is:
\[ (gx)(u)=x(g^{-1}u). \]
The inverse appears because the new value at location \(u\) should be the old value that came from the preimage \(g^{-1}u\). For a translation by \(t\), this gives:
\[ (T_t x)(u)=x(u-t). \]
The translated signal at the new location \(u\) uses the old value at \(u-t\).
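As a small illustration of the rule \((T_t x)(u)=x(u-t)\), here is a sketch on a cyclic grid of \(n\) points, where indices wrap around modulo \(n\). The cyclic boundary and the function name `translate` are assumptions made for this example.

```python
def translate(x, t):
    """Return T_t x on a cyclic grid: (T_t x)(u) = x((u - t) mod n)."""
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

x = [10, 20, 30, 40, 50]
shifted = translate(x, 2)
print(shifted)  # [40, 50, 10, 20, 30]

# Group-action property: shifting by s and then by t equals shifting by s + t.
composed = translate(translate(x, 1), 3)
direct = translate(x, 4)
print(composed == direct)  # True
```

The value `10`, originally at position 0, ends up at position 2: the content moves forward by \(t\) precisely because the formula reads from \(u-t\).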
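A group representation maps group elements to invertible linear transformations of a vector space \(V\), in a way that respects the group operation:
\[ \rho:\mathfrak G\to GL(V), \qquad \rho(gh)=\rho(g)\rho(h). \]
For finite-dimensional signal spaces, these transformations can be written as matrices.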
In GDL, \(V\) is often a signal or feature space. The map \(\rho(g)\) tells us how features transform under \(g\).
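A concrete representation helps here. The sketch below builds the matrices of cyclic shifts acting on \(\mathbb R^n\) and checks the homomorphism property \(\rho(gh)=\rho(g)\rho(h)\), where the group is \(C_n\) with addition modulo \(n\). The helpers `shift_matrix`, `matmul`, and `matvec` are hypothetical names written for this example.

```python
def shift_matrix(g, n):
    """Matrix rho(g) of the cyclic shift by g: (rho(g) x)[u] = x[(u - g) % n]."""
    return [[1 if (u - g) % n == v else 0 for v in range(n)] for u in range(n)]

def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(a, x):
    """Plain-Python matrix-vector product."""
    return [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(a))]

n = 4
g, h = 1, 2
lhs = shift_matrix((g + h) % n, n)                    # rho(gh); the operation is addition mod n
rhs = matmul(shift_matrix(g, n), shift_matrix(h, n))  # rho(g) rho(h)
homomorphism_holds = lhs == rhs

print(homomorphism_holds)                        # True
print(matvec(shift_matrix(1, n), [1, 2, 3, 4]))  # [4, 1, 2, 3]
```

Each \(\rho(g)\) is a permutation matrix, so it is invertible, with inverse \(\rho(g^{-1})\).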
A function is invariant if transforming the input by a group element does not change the output:
\[ f(gx)=f(x) \quad \text{for all } g\in\mathfrak G \text{ and all } x. \]
Invariance is appropriate when the target label should ignore certain transformations. For example, an image classifier should usually classify a dog as a dog even after translation or rotation, at least within the transformations assumed by the problem.
The orbit of \(x\) is the set of all transformed versions of \(x\), that is \(\{gx : g\in\mathfrak G\}\). An invariant function is constant on each orbit.
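The statement that an invariant function is constant on each orbit can be tested directly. This sketch enumerates the orbit of a short signal under cyclic shifts and compares an invariant function (the sum) with a non-invariant one (the first entry). The group and both functions are illustrative choices.

```python
def cyclic_orbit(x):
    """All cyclic shifts of the tuple x, i.e. its orbit under C_n."""
    n = len(x)
    return {tuple(x[(u - g) % n] for u in range(n)) for g in range(n)}

def total(x):
    """A shift-invariant function: the sum of the entries."""
    return sum(x)

def first_entry(x):
    """Not shift-invariant: it depends on where the values sit."""
    return x[0]

x = (1, 2, 3, 4)
orbit = cyclic_orbit(x)

total_values = {total(y) for y in orbit}        # a single value on the orbit
first_values = {first_entry(y) for y in orbit}  # several values on the orbit

print(len(orbit), total_values, sorted(first_values))
```

An orbit can be smaller than the group: a constant signal such as `(1, 1, 1, 1)` has an orbit containing only itself.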
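Equivariance is more general and richer than invariance: invariance is the special case in which the group acts trivially on the output. An equivariant map does not keep the output fixed. Instead, it makes the output transform in a predictable way when the input transforms. If \(\rho_X\) and \(\rho_Y\) denote the actions on the input space \(X\) and the output space \(Y\), the condition is:
\[ F(\rho_X(g)x)=\rho_Y(g)F(x) \quad \text{for all } g\in\mathfrak G. \]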
If the same group acts on both input space \(X\) and output space \(Y\), one often writes this informally as:
\[ F(gx)=gF(x). \]

| Property | Formula | Meaning | Typical use |
|---|---|---|---|
| Invariance | \(F(gx)=F(x)\) | Output does not change. | Final classification or global prediction. |
| Equivariance | \(F(gx)=gF(x)\) | Output changes in the same structured way. | Intermediate feature maps, segmentation, detection. |
Let \(T_t\) be translation by \(t\), so \((T_t x)(u)=x(u-t)\). A convolutional layer \(C_k\) with kernel \(k\) satisfies:
\[ C_k(T_t x)=T_t(C_k x). \]
Intuitively: if an object shifts in the input, the activation pattern shifts in the feature map. The detector is the same; only its location changes.
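The commutation of convolution with translation can be verified numerically. The sketch below assumes a cyclic 1D grid and a circular convolution, which keeps the check exact; the signal, the kernel, and the function names are arbitrary illustrative choices.

```python
def translate(x, t):
    """(T_t x)(u) = x((u - t) mod n) on a cyclic grid."""
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def circular_conv(x, k):
    """(C_k x)(u) = sum_v k[v] * x[(u - v) mod n]."""
    n = len(x)
    return [sum(k[v] * x[(u - v) % n] for v in range(len(k))) for u in range(n)]

x = [1, 0, 2, 5, 3, 0]
k = [1, -1, 2]
t = 2

conv_then_shift = translate(circular_conv(x, k), t)
shift_then_conv = circular_conv(translate(x, t), k)

print(conv_then_shift == shift_then_conv)  # True
```

With zero padding instead of cyclic wrap-around, the identity holds only away from the boundary, which is one reason the cyclic setting is convenient for exact statements.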
A central lesson from Lecture 3 is that early invariance can be harmful. To recognize whole objects, a network often needs to recognize parts and their relative configuration. If intermediate features become invariant too early, the network loses information about where parts are.
If a layer says only “eye exists” and “mouth exists” but discards their relative positions, it may not distinguish a face from a scrambled collection of parts.
If the object shifts, the feature map shifts. The network preserves spatial organization, so later layers can reason about part-whole structure.
If every layer is equivariant, their composition is equivariant.
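Suppose the two layers satisfy:
\[ F_1(\rho_0(g)x)=\rho_1(g)F_1(x), \quad F_2(\rho_1(g)y)=\rho_2(g)F_2(y). \]
Then:
\[ (F_2\circ F_1)(\rho_0(g)x) = F_2(\rho_1(g)F_1(x)) = \rho_2(g)F_2(F_1(x)). \]
Suppose \(\mathfrak G\) is a finite group and \(f:\mathcal X\to\mathbb R\) is any scalar-valued function. Group smoothing creates a new function by averaging \(f\) over all transformed versions of the input:
\[ (S_{\mathfrak G}f)(x)=\frac1{|\mathfrak G|}\sum_{g\in\mathfrak G} f(gx). \]
The smoothed function is invariant. We must show that for any \(h\in\mathfrak G\), \((S_{\mathfrak G}f)(hx)=(S_{\mathfrak G}f)(x)\):
\[ (S_{\mathfrak G}f)(hx)=\frac1{|\mathfrak G|}\sum_{g\in\mathfrak G} f(ghx)=\frac1{|\mathfrak G|}\sum_{k\in\mathfrak G} f(kx)=(S_{\mathfrak G}f)(x). \]
The middle step re-indexes the sum by \(k=gh\), which is valid because \(g\mapsto gh\) is a bijection of \(\mathfrak G\).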
If \(\mathfrak G=\{e,r\}\), where \(r\) is reflection, then:
\[ (S_{\mathfrak G}f)(x)=\tfrac12\bigl(f(x)+f(rx)\bigr). \]
This is reflection-invariant because replacing \(x\) by \(rx\) just swaps the two terms.
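Group smoothing over the two-element reflection group, \((S_{\mathfrak G}f)(x)=\tfrac12\bigl(f(x)+f(rx)\bigr)\), is short enough to write out. In this sketch the reflection is taken to be reversal of a 1D signal, and `f` is an arbitrary non-invariant test function; both are assumptions made for illustration.

```python
def reflect(x):
    """Action of the reflection r: reverse the signal."""
    return x[::-1]

GROUP = [lambda x: x, reflect]  # G = {e, r}

def smooth(f, group=GROUP):
    """Return S_G f, the average of f over all group-transformed inputs."""
    def smoothed(x):
        return sum(f(g(x)) for g in group) / len(group)
    return smoothed

def f(x):
    """An arbitrary non-invariant function: position-weighted sum."""
    return sum(i * v for i, v in enumerate(x))

Sf = smooth(f)
x = [3, 1, 4, 1, 5]

print(f(x), f(reflect(x)))    # 32 24  (f is not invariant)
print(Sf(x), Sf(reflect(x)))  # 28.0 28.0  (S_G f is invariant)
```

Applying `smooth` a second time returns the same values, which is the numerical face of the projection property \(S^2=S\).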
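Lecture 4 states that approximation error is not made worse by group smoothing: when the target \(f^\*\) is invariant, replacing a candidate \(f\) by \(S_{\mathfrak G}f\) cannot move it farther from \(f^\*\).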
The logic is easiest to understand in \(L^2(\mathcal X)\). Provided the measure on \(\mathcal X\) is itself invariant under \(\mathfrak G\), the smoothing operator \(S_{\mathfrak G}\) is an orthogonal projection onto the subspace of invariant functions.
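A projection satisfies \(S^2=S\). For group smoothing, this holds because \(S_{\mathfrak G}f\) is already invariant, so averaging it again changes nothing:
\[ S_{\mathfrak G}(S_{\mathfrak G}f)(x)=\frac1{|\mathfrak G|}\sum_{g\in\mathfrak G}(S_{\mathfrak G}f)(gx)=\frac1{|\mathfrak G|}\sum_{g\in\mathfrak G}(S_{\mathfrak G}f)(x)=(S_{\mathfrak G}f)(x). \]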
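Every function splits into two parts:
\[ f=S_{\mathfrak G}f+(I-S_{\mathfrak G})f. \]
\(S_{\mathfrak G}f\) is the invariant component. \((I-S_{\mathfrak G})f=f-S_{\mathfrak G}f\) is the residual fluctuation around the group average. Expanding the squared norm gives:
\[ \|f\|^2=\|S_{\mathfrak G}f\|^2+\|(I-S_{\mathfrak G})f\|^2+2\langle S_{\mathfrak G}f,(I-S_{\mathfrak G})f\rangle. \]
The cross term disappears because the invariant component and residual component are perpendicular in \(L^2\). Intuitively, on every orbit, the residual has zero average.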
On one orbit, suppose the values of \(f\) are \([2,4,6]\). The average is \(4\).
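\[ \text{invariant part}=[4,4,4], \qquad \text{residual}=[-2,0,2]. \]
\[ [4,4,4]\cdot[-2,0,2] = -8+0+8=0. \]
Since the two components are orthogonal, the \(L^2\)-norm decomposes as:
\[ \|f\|^2=\|S_{\mathfrak G}f\|^2+\|f-S_{\mathfrak G}f\|^2. \]
If \(f^\*\) is invariant, then \(S_{\mathfrak G}f^\*=f^\*\), so \(S_{\mathfrak G}f-f^\*\) is invariant and therefore orthogonal to the residual \(f-S_{\mathfrak G}f\). Hence:
\[ \|f-f^\*\|^2=\|S_{\mathfrak G}f-f^\*\|^2+\|f-S_{\mathfrak G}f\|^2\ \ge\ \|S_{\mathfrak G}f-f^\*\|^2. \]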
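The orbit example with values \([2,4,6]\) can be checked numerically, together with the consequence that smoothing does not increase the distance to an invariant target. The target values `[5, 5, 5]` below are an arbitrary assumption chosen only to make the comparison concrete.

```python
values = [2.0, 4.0, 6.0]  # values of f along one orbit

mean = sum(values) / len(values)
invariant_part = [mean] * len(values)    # S f restricted to the orbit
residual = [v - mean for v in values]    # f - S f restricted to the orbit

def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

cross_term = dot(invariant_part, residual)
norm_sq_f = dot(values, values)
norm_sq_split = dot(invariant_part, invariant_part) + dot(residual, residual)

# An invariant target is constant on the orbit (hypothetical value 5).
target = [5.0] * len(values)
err_f = sum((v - t) ** 2 for v, t in zip(values, target))
err_Sf = sum((s - t) ** 2 for s, t in zip(invariant_part, target))

print(cross_term)                # 0.0
print(norm_sq_f, norm_sq_split)  # 56.0 56.0
print(err_f, err_Sf)             # 11.0 3.0
```

The gap between the two errors, \(11-3=8\), is exactly the squared norm of the residual, as the Pythagorean identity predicts.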
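If \(f\) is \(\beta\)-Lipschitz in ordinary input space, then:
\[ |f(x)-f(x')|\le\beta\,\|x-x'\|. \]
For invariant functions, the relevant distance is not the direct distance between \(x\) and \(x'\). Since \(x'\) and \(gx'\) are equivalent, we should compare \(x\) to the best-aligned transformed version of \(x'\). If \(f\) is also invariant, then \(f(x')=f(gx')\) for every \(g\), so the bound tightens to:
\[ |f(x)-f(x')|\le\beta\,\inf_{g\in\mathfrak G}\|x-gx'\|. \]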
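The orbit distance \(\inf_g\|x-gx'\|\) can be computed by brute force when the group is small. The sketch below uses cyclic shifts of a 1D signal as the group, which is an illustrative assumption; `orbit_distance` is a hypothetical helper name.

```python
import math

def translate(x, t):
    """(T_t x)(u) = x((u - t) mod n) on a cyclic grid."""
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def euclidean(x, y):
    """Ordinary Euclidean distance between two signals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def orbit_distance(x, y):
    """Minimum over all cyclic shifts g of ||x - g y||."""
    return min(euclidean(x, translate(y, t)) for t in range(len(y)))

x = [0, 0, 1, 0, 0, 0]
y = [0, 0, 0, 0, 1, 0]  # the same bump, two positions later

print(euclidean(x, y))       # sqrt(2): far apart as raw vectors
print(orbit_distance(x, y))  # 0.0: identical up to a shift
```

Two signals that differ only by a symmetry are at orbit distance zero, so an invariant Lipschitz function is forced to agree on them.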
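The theorem in the lecture states that invariant learning can improve the effective sample size by a factor related to the group size. Schematically, and ignoring constants, a generalization rate of order \(n^{-1/d}\) becomes one of order:
\[ (|\mathfrak G|\,n)^{-1/d}. \]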
Invariance can reduce statistical error. Roughly, if the group has many transformations, one sample represents information about many transformed versions.
The exponent is still \(-1/d\). In high dimension, this is still a cursed rate. Therefore, invariance alone is not sufficient.
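A rough calculation shows why a gain in effective sample size does not remove the curse of dimensionality. The sketch assumes, purely for illustration, that the error behaves like \((|\mathfrak G|\,n)^{-1/d}\) with no constants; `samples_needed` is a hypothetical helper that inverts this assumed rate.

```python
def samples_needed(eps, d, group_size=1):
    """Samples n such that (group_size * n) ** (-1 / d) equals eps.

    Assumes the schematic rate (|G| n)^(-1/d) with all constants dropped.
    """
    return eps ** (-d) / group_size

eps = 0.5
n_plain = samples_needed(eps, d=20)
n_invariant = samples_needed(eps, d=20, group_size=1000)
n_invariant_d40 = samples_needed(eps, d=40, group_size=1000)

print(n_plain)          # 1048576.0
print(n_invariant)      # 1048.576
print(n_invariant_d40)  # about 1.1e9
```

Under this assumed rate, a group of size 1000 divides the sample requirement by 1000, but doubling \(d\) from 20 to 40 multiplies it by about a million, so the dimension still dominates.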
Lecture 4 argues that symmetry should be combined with scale separation. Many real signals are not arbitrary high-dimensional objects: important information often appears at coarse scales, local scales, or hierarchical compositions of both.
Sometimes the target mostly depends on a coarsened version of the signal. For example, distinguishing a beach from a mountain may not require every pixel-level detail.
Sometimes the target depends on repeated local patterns. For example, texture recognition may depend on small patches rather than the full image at once.
A multiresolution analysis decomposes a signal into coarse content plus details needed to reconstruct fine-scale information. Wavelet filter banks are a classical implementation of this idea.
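One level of the simplest such decomposition, the Haar filter bank, fits in a few lines. This sketch uses unnormalized averages and half-differences rather than the orthonormal \(1/\sqrt2\) scaling, an assumption made to keep the numbers readable; the function names are illustrative.

```python
def haar_analysis(x):
    """Split an even-length signal into pairwise averages and half-differences."""
    half = len(x) // 2
    coarse = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return coarse, detail

def haar_synthesis(coarse, detail):
    """Invert haar_analysis: rebuild the fine-scale signal exactly."""
    x = []
    for c, d in zip(coarse, detail):
        x.extend([c + d, c - d])
    return x

x = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 1.0, 3.0]
coarse, detail = haar_analysis(x)

print(coarse)  # [5.0, 11.0, 8.0, 2.0]
print(detail)  # [-1.0, -1.0, 0.0, -1.0]
print(haar_synthesis(coarse, detail) == x)  # True
```

Applying `haar_analysis` again to `coarse` gives the next, coarser level. A target that depends mainly on coarse content can ignore `detail` and work with a signal of half the length.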
The GDL blueprint combines the two major priors:
Symmetry: use equivariant layers so transformations of the input produce consistent transformations of the features.
Scale separation: use locality, nonlinearities, pooling, and hierarchical composition to reduce effective dimensionality.
A linear invariant map often collapses too much information. For example, group averaging gives:
\[ \bar x=\frac1{|\mathfrak G|}\sum_{g\in\mathfrak G} gx. \]
This is invariant, but it may destroy discriminative structure. Therefore, we need richer equivariant features before final invariant pooling.
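The contrast between a linear invariant map and "equivariant features, then invariant pooling" can be seen on two toy signals. In this sketch the group is cyclic shifts, the equivariant feature is a nonlinear neighbour difference, and the pooling is a sum; all three are illustrative assumptions rather than the lecture's construction.

```python
def translate(x, t):
    """(T_t x)(u) = x((u - t) mod n) on a cyclic grid."""
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def group_average(x):
    """Linear invariant map: the average of all cyclic shifts of x."""
    n = len(x)
    shifts = [translate(x, t) for t in range(n)]
    return [sum(s[u] for s in shifts) / n for u in range(n)]

def local_feature(x):
    """Shift-equivariant nonlinear feature: absolute neighbour difference."""
    n = len(x)
    return [abs(x[u] - x[(u - 1) % n]) for u in range(n)]

def invariant_readout(x):
    """Equivariant feature followed by invariant sum pooling."""
    return sum(local_feature(x))

smooth_signal = [2, 2, 2, 2]
spiky_signal = [8, 0, 0, 0]  # same mean, very different structure

print(group_average(smooth_signal) == group_average(spiky_signal))  # True
print(invariant_readout(smooth_signal), invariant_readout(spiky_signal))  # 0 16
```

The group average maps both signals to the same constant vector, while the pooled nonlinear feature separates them and is still unchanged by any shift of the input.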
Many familiar neural architectures can be interpreted through the same template: choose a domain, identify a symmetry group, use representations, and build equivariant/invariant maps.
| Architecture | Domain \(\Omega\) | Symmetry group | Main equivariant operation | Final invariant operation |
|---|---|---|---|---|
| CNN | grid | translations | convolution | global pooling / classifier head |
| Group CNN | grid or group | translations + rotations/reflections | group convolution | pooling over group/domain |
| Spherical CNN | sphere / \(SO(3)\) | 3D rotations | spherical/group convolution | rotation-invariant aggregation |
| Mesh / intrinsic CNN | manifold or mesh | isometries / gauge transformations | gauge-equivariant local operators | surface/global pooling |
| GNN | graph vertices | permutations of node order | message passing | readout pooling |
| Deep Sets | set | permutations | shared pointwise map | sum/mean/max pooling |
| Transformer | complete graph / sequence tokens | permutation or sequence symmetries depending on positional encoding | attention | CLS token / pooling / task head |

The key definitions are summarized below.

| Concept | Formula | Meaning |
|---|---|---|
| Signal | \(x:\Omega\to\mathcal C\) | A datum is a function on a structured domain. |
| Group | \((\mathfrak G,\cdot)\) | Set with associative operation, identity, and inverses. |
| Group action | \((g,u)\mapsto gu\) | How group elements transform domain points. |
| Signal action | \((gx)(u)=x(g^{-1}u)\) | How transformations act on functions defined on the domain. |
| Representation | \(\rho(gh)=\rho(g)\rho(h)\) | Linear realization of group elements on a vector space. |
| Invariance | \(f(gx)=f(x)\) | Output is unchanged along group orbits. |
| Equivariance | \(F(gx)=gF(x)\) | Output transforms consistently with input. |
| Group smoothing | \(S_{\mathfrak G}f(x)=\frac1{|\mathfrak G|}\sum_g f(gx)\) | Average over transformed versions of \(x\). |
| Orbit distance | \(\inf_g\|x-gx'\|\) | Distance after best symmetry alignment. |
| Scale separation | \(f^\*(x)\approx \tilde f^\*(\tilde x)\) or \(\sum_u g(x_u)\) | Learning depends on coarse or local/hierarchical structure. |