Geometric Deep Learning Notes

The focus is mathematical logic: geometric domains, group actions, representations, invariance, equivariance, group smoothing, projection arguments, sample complexity, scale separation, and the GDL blueprint.



1. Big picture: why geometric priors matter motivation

In high-dimensional learning, the number of samples needed to learn a generic function can grow exponentially with the input dimension. Geometric Deep Learning starts from the observation that many real data types are not arbitrary vectors. They are signals on structured domains: images on grids, molecular features on graphs, features on manifolds, or sets of particles.

The central idea: use the structure of the domain to restrict the hypothesis class without throwing away the true target function.
Approximation error

How well the best function in the model class can approximate the true target \(f^\*\).

Statistical error

How many samples are needed to choose a good hypothesis from the class.

Optimization error

How well the training algorithm can actually find a good parameter setting.

Geometric priors mainly aim to reduce statistical and computational difficulty by making the hypothesis class smaller and better aligned with the task. The ideal situation is:

Ideal effect of geometric priors
\[ \text{smaller hypothesis class} \quad\Longrightarrow\quad \text{lower statistical complexity} \] \[ \text{but still } f^\* \text{ remains inside or near the class} \quad\Longrightarrow\quad \text{no serious approximation penalty}. \]
Warning. A prior is useful only when it matches the problem. For example, translation invariance is useful for image classification, but not for tasks where absolute location is the label.

2. Geometric domains and signals domain first

The starting object is a domain \(\Omega\). It may be a grid, graph, set, sphere, manifold, or group. A datum is then a signal on this domain.

Signal space
\[ x:\Omega\to\mathcal C, \qquad \mathcal X(\Omega,\mathcal C)=\{x:\Omega\to\mathcal C\}. \]

Here \(\mathcal C\) is the channel space. For an RGB image, \(\mathcal C=\mathbb R^3\). For a molecular graph, \(\mathcal C\) may contain atom-type or node-feature channels.

| Data type | Domain \(\Omega\) | Signal \(x(u)\) | Typical channels \(\mathcal C\) |
|---|---|---|---|
| Image | 2D pixel grid | color value at pixel \(u\) | RGB, feature maps |
| Graph | vertices \(V\), sometimes \(V\times V\) | node or edge feature | atom type, degree, embedding vector |
| Point cloud | unordered set of points | feature attached to each point | coordinates, normals, colors |
| Manifold / mesh | surface points | scalar/vector/tensor feature | geometry, texture, tangent vector |

Vector-space and Hilbert-space viewpoint

Signals can be added and scaled pointwise:

Pointwise linear structure
\[ (\alpha x+\beta y)(u)=\alpha x(u)+\beta y(u). \]

If \(\mathcal C\) has an inner product and \(\Omega\) has a measure \(\mu\), we get an inner product on signals:

Inner product on signal space
\[ \langle x,y\rangle_{\mathcal X} = \int_{\Omega} \langle x(u),y(u)\rangle_{\mathcal C}\,d\mu(u). \]
Why this matters. Once signal spaces are Hilbert spaces, we can talk precisely about orthogonality, projections, energy, and \(L^2\)-errors. This is exactly what Lecture 4 uses when proving that group smoothing is an orthogonal projection.
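The inner product above can be made concrete on a finite domain. The following is a minimal sketch, assuming \(\Omega=\{0,\dots,n-1\}\), \(\mathcal C=\mathbb R^3\), and the counting measure for \(\mu\), so the integral becomes a sum. The function names are illustrative only.

```python
# Signals on a finite domain, stored as a list of channel vectors.
# Assumption: mu is the counting measure, so the integral is a plain sum.

def channel_inner(a, b):
    """Inner product on the channel space C = R^c."""
    return sum(ai * bi for ai, bi in zip(a, b))

def signal_inner(x, y):
    """<x, y>_X = sum over u of <x(u), y(u)>_C."""
    return sum(channel_inner(xu, yu) for xu, yu in zip(x, y))

# Two 3-pixel "RGB" signals.
x = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]
y = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

print(signal_inner(x, y))  # 0.0, so x and y are orthogonal
print(signal_inner(x, x))  # 6.0, the squared norm (energy) of x
```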

3. Symmetries and groups formal language

A symmetry is a transformation that preserves the relevant structure. The symmetries of an object form a group. In geometric deep learning, the important symmetries are often not symmetries of one single sample, but symmetries of the domain description or of the label function.

Abstract group

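A group is a pair \((\mathfrak G,\cdot)\), where \(\mathfrak G\) is a set and \(\cdot\) is a binary operation on \(\mathfrak G\) with an identity element \(e\in\mathfrak G\) and an inverse \(g^{-1}\in\mathfrak G\) for every \(g\), satisfying for all \(g,h,k\in\mathfrak G\):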

\[ (gh)k=g(hk) \quad \text{associativity}, \] \[ eg=ge=g \quad \text{identity}, \] \[ gg^{-1}=g^{-1}g=e \quad \text{inverse}. \]
Important distinction. A group element \(g\in\mathfrak G\) is not the same thing as the group operation. In \((\mathbb Z,+)\), the elements are integers and the operation is addition. In a symmetry group of transformations, the elements are maps and the operation is composition.

| Group | Elements | Operation | Typical GDL role |
|---|---|---|---|
| \((\mathbb Z,+)\) | integers | addition | discrete shifts |
| \(C_n\) | cyclic rotations | composition / modular addition | rotational symmetry |
| \(D_n\) | rotations and reflections of an \(n\)-gon | composition | dihedral symmetry |
| \(S_n\) | permutations of \(n\) objects | composition | sets and graphs |
| \(\mathbb R^d\) | continuous translations | vector addition | Euclidean translation symmetry |
| \(SO(3)\) | 3D rotations | matrix multiplication | spherical and 3D data |

The bijection property used in group averages

The group operation \((g,h)\mapsto gh\) does not have to be one-to-one as a map from pairs to elements. For example, \(1+2=0+3=3\) in \((\mathbb Z,+)\). However, for a fixed \(h\), right multiplication \(R_h(g)=gh\) is a bijection of \(\mathfrak G\) onto itself.

Right multiplication is bijective

If \(g_1h=g_2h\), then multiply on the right by \(h^{-1}\):

\[ (g_1h)h^{-1}=(g_2h)h^{-1} \quad\Rightarrow\quad g_1( hh^{-1})=g_2( hh^{-1}) \quad\Rightarrow\quad g_1=g_2. \]

This is the algebraic reason why sums over a finite group can be re-indexed by \(k=gh\).
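The bijection claim can be checked by brute force on a small non-abelian group. The sketch below assumes \(\mathfrak G=S_3\), with each permutation stored as a tuple `p` where `p[i]` is the image of `i`. The helper names are illustrative.

```python
from itertools import permutations

# Elements of S_3 as tuples p, where p[i] is the image of i.
G = list(permutations(range(3)))

def compose(p, q):
    """(p q)(i) = p(q(i)): apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def right_mult_is_bijection(h):
    """Check that g -> g h hits every element of G exactly once."""
    images = [compose(g, h) for g in G]
    return sorted(images) == sorted(G)

all_ok = all(right_mult_is_bijection(h) for h in G)
print(all_ok)  # True
```

Because every element is hit exactly once, a sum over \(g\) of any quantity depending on \(gh\) has the same terms as a sum over \(k\), only in a different order.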

4. Group actions and representations from domain to signals

A group by itself is abstract. To use it on data, we need an action. A group action tells us how each group element transforms points of a space.

Group action on a domain
\[ \mathfrak G\times \Omega\to \Omega, \qquad (g,u)\mapsto gu, \] satisfying \[ e u=u, \qquad g(hu)=(gh)u. \]

Action on signals: why the inverse appears

If \(x:\Omega\to\mathcal C\) is a signal and \(g\) acts on \(\Omega\), the induced action on signals is:

Regular action on signals
\[ (gx)(u)=x(g^{-1}u). \]

The inverse appears because the new value at location \(u\) should be the old value that came from the preimage \(g^{-1}u\). For a translation by \(t\), this gives:

Translation example
\[ (T_t x)(u)=x(u-t). \]

The translated signal at the new location \(u\) uses the old value at \(u-t\).
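A small experiment shows why the inverse is needed. This is a sketch under stated assumptions: the domain is a finite set of positions, translations are cyclic, and the comparison uses \(S_3\) acting by permuting three positions. The rule without the inverse, `act_no_inverse`, is a hypothetical alternative included only for contrast.

```python
from itertools import permutations

def translate(x, t):
    """(T_t x)(u) = x((u - t) mod n): cyclic translation of a signal."""
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

print(translate([10, 20, 30, 40, 50], 1))  # [50, 10, 20, 30, 40]

def compose(p, q):
    """(p q)(u) = p(q(u))."""
    return tuple(p[q[u]] for u in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for u, pu in enumerate(p):
        inv[pu] = u
    return tuple(inv)

def act(p, x):
    """(p x)(u) = x(p^{-1} u), the rule used in the text."""
    p_inv = inverse(p)
    return tuple(x[p_inv[u]] for u in range(len(x)))

def act_no_inverse(p, x):
    """Naive rule (p x)(u) = x(p u), for comparison only."""
    return tuple(x[p[u]] for u in range(len(x)))

G = list(permutations(range(3)))
x = ("a", "b", "c")

# An action must satisfy p(qx) = (pq)x for all p, q.
with_inverse_ok = all(
    act(p, act(q, x)) == act(compose(p, q), x) for p in G for q in G)
without_inverse_ok = all(
    act_no_inverse(p, act_no_inverse(q, x)) == act_no_inverse(compose(p, q), x)
    for p in G for q in G)
print(with_inverse_ok, without_inverse_ok)  # True False
```

For an abelian group such as translations the two rules differ only in direction, but for a non-abelian group the rule without the inverse reverses the order of composition and so fails the action property.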

Figure: a domain action \(u\mapsto gu\) induces a signal action \(x\mapsto gx\) with \((gx)(u)=x(g^{-1}u)\).
A transformation of the domain automatically induces a linear transformation of the signal space.

Representation viewpoint

A group representation maps group elements to invertible linear transformations. For finite-dimensional signal spaces, these transformations can be written as matrices.

Group representation
\[ \rho:\mathfrak G\to GL(V), \qquad \rho(gh)=\rho(g)\rho(h). \]

In GDL, \(V\) is often a signal or feature space. The map \(\rho(g)\) tells us how features transform under \(g\).
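As an illustration, cyclic shifts of a length-\(n\) signal can be written as permutation matrices. The sketch below assumes \(\mathfrak G=C_n\) written additively and \(V=\mathbb R^n\), and checks the homomorphism property \(\rho(s+t)=\rho(s)\rho(t)\) for every pair. The helper `matmul` is a plain-Python matrix product used to keep the example dependency-free.

```python
def rho(t, n):
    """n x n matrix of the cyclic shift by t: (rho(t) x)[u] = x[(u - t) % n]."""
    return [[1 if v == (u - t) % n else 0 for v in range(n)] for u in range(n)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

n = 4
homomorphism_ok = all(
    rho((s + t) % n, n) == matmul(rho(s, n), rho(t, n))
    for s in range(n) for t in range(n))
print(homomorphism_ok)  # True
print(rho(0, 3))        # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```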

5. Invariant functions constant on orbits

A function is invariant if transforming the input by a group element does not change the output.

Invariant function
\[ f(gx)=f(x), \qquad \forall g\in\mathfrak G,\;x\in\mathcal X. \]

Invariance is appropriate when the target label should ignore certain transformations. For example, an image classifier should usually classify a dog as a dog even after translation or rotation, at least within the transformations assumed by the problem.

Orbits and quotient intuition

Group orbit
\[ \mathfrak Gx=\{gx:g\in\mathfrak G\}. \]

The orbit is the set of all transformed versions of \(x\). An invariant function is constant on each orbit.

Figure: two orbits. All points on the orbit \(\mathfrak Gx\) are identified by the symmetry and share the value \(f(x)\); another orbit may carry another value.
Invariance means \(f\) is constant along each orbit, but it can take different values on different orbits.
Precise intuition. Invariance does not mean the input is unchanged. It means the output is unchanged when the input moves inside the same orbit.
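Orbits are easy to enumerate for a small finite group. The following sketch assumes binary signals of length 4 and the group of cyclic shifts. The two test functions (sum of entries, first entry) are illustrative choices.

```python
def shift(x, t):
    """Cyclic shift: (T_t x)(u) = x((u - t) mod n)."""
    n = len(x)
    return tuple(x[(u - t) % n] for u in range(n))

def orbit(x):
    """The set {g x : g in G} for G = cyclic shifts."""
    return {shift(x, t) for t in range(len(x))}

x = (1, 1, 0, 0)
print(sorted(orbit(x)))
# [(0, 0, 1, 1), (0, 1, 1, 0), (1, 0, 0, 1), (1, 1, 0, 0)]

# The sum of entries is invariant: one value on the whole orbit.
print({sum(z) for z in orbit(x)})           # {2}
# The first entry is not invariant: it varies inside the orbit.
print({z[0] for z in orbit(x)})             # {0, 1}
# A different orbit can carry a different value of the invariant.
print({sum(z) for z in orbit((1, 0, 0, 0))})  # {1}
```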

6. Equivariant functions consistent transformation

Equivariance generalizes invariance: invariance is the special case in which the group acts trivially on the output. An equivariant map does not keep the output fixed. Instead, it makes the output transform in a predictable way when the input transforms.

Equivariant map
\[ F(\rho_X(g)x)=\rho_Y(g)F(x), \qquad \forall g\in\mathfrak G. \]

If the same group acts on both input space \(X\) and output space \(Y\), one often writes this informally as:

\[ F(gx)=gF(x). \]

| Property | Formula | Meaning | Typical use |
|---|---|---|---|
| Invariance | \(F(gx)=F(x)\) | Output does not change. | Final classification or global prediction. |
| Equivariance | \(F(gx)=gF(x)\) | Output changes in the same structured way. | Intermediate feature maps, segmentation, detection. |

Translation equivariance of convolution

Let \(T_t\) be translation by \(t\), so \((T_t x)(u)=x(u-t)\). A convolutional layer \(C_k\) with kernel \(k\) satisfies:

Convolution is translation equivariant
\[ C_k(T_t x)=T_t(C_k x). \]

Intuitively: if an object shifts in the input, the activation pattern shifts in the feature map. The detector is the same; only its location changes.
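The identity \(C_k(T_t x)=T_t(C_k x)\) can be verified numerically. This is a minimal sketch assuming a 1D signal with periodic (circular) boundary conditions; with zero padding the identity holds only away from the borders. The signal and kernel values are arbitrary.

```python
def translate(x, t):
    """(T_t x)(u) = x((u - t) mod n)."""
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def circular_conv(x, k):
    """(C_k x)(u) = sum over v of k[v] * x[(u - v) mod n]."""
    n = len(x)
    return [sum(k[v] * x[(u - v) % n] for v in range(len(k)))
            for u in range(n)]

x = [1, 4, 2, 0, 3, 5]
k = [1, -2, 1]

lhs = circular_conv(translate(x, 2), k)   # shift, then convolve
rhs = translate(circular_conv(x, k), 2)   # convolve, then shift
print(lhs == rhs)  # True
```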

Why not automatically rotation equivariant? A standard CNN shares filters across translations, not across rotations. Rotating the input does not generally rotate the feature maps in a controlled way unless the architecture builds in rotational symmetry.

7. Why deep nets usually prefer equivariance before invariance pose information

A central lesson from Lecture 3 is that early invariance can be harmful. To recognize whole objects, a network often needs to recognize parts and their relative configuration. If intermediate features become invariant too early, the network loses information about where parts are.

Final outputs often should be invariant; intermediate representations often should be equivariant.
Bad early invariance

If a layer says only “eye exists” and “mouth exists” but discards their relative positions, it may not distinguish a face from a scrambled collection of parts.

Good equivariance

If the object shifts, the feature map shifts. The network preserves spatial organization, so later layers can reason about part-whole structure.

Layerwise equivariance composes

If every layer is equivariant, their composition is equivariant.

\[ F_1(\rho_0(g)x)=\rho_1(g)F_1(x), \quad F_2(\rho_1(g)y)=\rho_2(g)F_2(y). \] Then: \[ (F_2\circ F_1)(\rho_0(g)x) = F_2(\rho_1(g)F_1(x)) = \rho_2(g)F_2(F_1(x)). \]
Architectural message. A good geometric deep network often keeps features equivariant through many layers, then applies pooling or averaging to become invariant only near the end.

8. Group smoothing averaging over orbits

Suppose \(\mathfrak G\) is a finite group and \(f:\mathcal X\to\mathbb R\) is any scalar-valued function. Group smoothing creates a new function by averaging \(f\) over all transformed versions of the input.

Group smoothing operator
\[ S_{\mathfrak G}f = \frac{1}{|\mathfrak G|} \sum_{g\in\mathfrak G}f\circ g, \] equivalently \[ (S_{\mathfrak G}f)(x) = \frac{1}{|\mathfrak G|} \sum_{g\in\mathfrak G}f(gx). \]

Proof that \(S_{\mathfrak G}f\) is invariant

We must show that for any \(h\in\mathfrak G\), \((S_{\mathfrak G}f)(hx)=(S_{\mathfrak G}f)(x)\).

Invariance proof
\[ (S_{\mathfrak G}f)(hx) = \frac{1}{|\mathfrak G|} \sum_{g\in\mathfrak G}f(g(hx)). \] By the action property \(g(hx)=(gh)x\), so \[ (S_{\mathfrak G}f)(hx) = \frac{1}{|\mathfrak G|} \sum_{g\in\mathfrak G}f((gh)x). \] Since \(g\mapsto gh\) is a bijection of \(\mathfrak G\), set \(k=gh\): \[ \frac{1}{|\mathfrak G|} \sum_{g\in\mathfrak G}f((gh)x) = \frac{1}{|\mathfrak G|} \sum_{k\in\mathfrak G}f(kx) = (S_{\mathfrak G}f)(x). \]
Intuition. Applying \(h\) before averaging only reorders the transformed copies. An average does not care about order.

Example: reflection group

If \(\mathfrak G=\{e,r\}\), where \(r\) is reflection, then:

Two-element group smoothing
\[ (S_{\mathfrak G}f)(x) = \frac{1}{2}\bigl(f(x)+f(rx)\bigr). \]

This is reflection-invariant because replacing \(x\) by \(rx\) just swaps the two terms.
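Group smoothing is short to implement for a finite group. The sketch below assumes \(\mathfrak G\) is the group of cyclic shifts of length-4 signals, and uses a deliberately position-dependent function `f` (an illustrative choice) to show that `f` is not invariant while its smoothed version is.

```python
def translate(x, t):
    """(T_t x)(u) = x((u - t) mod n)."""
    n = len(x)
    return tuple(x[(u - t) % n] for u in range(n))

def smooth(f, n):
    """Return S_G f for G = cyclic shifts of length-n signals."""
    def smoothed(x):
        return sum(f(translate(x, t)) for t in range(n)) / n
    return smoothed

def f(x):
    """A non-invariant function: the weights depend on position."""
    return sum((u + 1) * xu for u, xu in enumerate(x))

n = 4
Sf = smooth(f, n)
x = (3, 0, 1, 2)

print(f(x), f(translate(x, 1)))    # 14 12
print(Sf(x), Sf(translate(x, 1)))  # 15.0 15.0
```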

9. Projection logic: why approximation error is not worsened orthogonal decomposition

Lecture 4 states that approximation error is not affected by group smoothing:

Approximation error identity
\[ \inf_{f\in\mathcal F}\|f-f^\*\|^2 = \inf_{f\in S_{\mathfrak G}\mathcal F}\|f-f^\*\|^2, \] assuming the target \(f^\*\) is \(\mathfrak G\)-invariant and \(S_{\mathfrak G}\mathcal F=\{S_{\mathfrak G}f:f\in\mathcal F\}\).

The logic is easiest to understand in \(L^2(\mathcal X)\). Provided the measure defining the \(L^2\) norm is itself \(\mathfrak G\)-invariant, the smoothing operator \(S_{\mathfrak G}\) is an orthogonal projection onto the subspace of invariant functions.

Step 1: \(S_{\mathfrak G}\) is idempotent

A projection satisfies \(S^2=S\). For group smoothing:

Idempotence
\[ S_{\mathfrak G}^2f = S_{\mathfrak G}(S_{\mathfrak G}f) = S_{\mathfrak G}f, \] because \(S_{\mathfrak G}f\) is already invariant.

Step 2: Decompose any function

Invariant plus residual
\[ f=S_{\mathfrak G}f+(I-S_{\mathfrak G})f. \]

\(S_{\mathfrak G}f\) is the invariant component. \((I-S_{\mathfrak G})f=f-S_{\mathfrak G}f\) is the residual fluctuation around the group average.

Step 3: Why the two parts are orthogonal

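Expanding \(\|a+b\|^2=\|a\|^2+2\langle a,b\rangle+\|b\|^2\) with \(a\) an invariant component and \(b\) a residual component, the cross term \(2\langle a,b\rangle\) disappears because the two are perpendicular in \(L^2\). Intuitively, the invariant part is constant on every orbit and the residual has zero average on every orbit, so their product sums to zero orbit by orbit.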

Orbit-average intuition

On one orbit, suppose the values of \(f\) are \([2,4,6]\). The average is \(4\).

\[ \text{invariant part}=[4,4,4], \qquad \text{residual}=[-2,0,2]. \] \[ [4,4,4]\cdot[-2,0,2] = -8+0+8=0. \]

Step 4: The Pythagorean identity

Since the two components are orthogonal, the \(L^2\)-norm decomposes as:

Pythagorean decomposition
\[ \|f-f'\|^2 = \|S_{\mathfrak G}f-S_{\mathfrak G}f'\|^2 + \|(I-S_{\mathfrak G})f-(I-S_{\mathfrak G})f'\|^2, \] for any two functions \(f,f'\in L^2(\mathcal X)\).

Step 5: Apply it to an invariant target

If \(f^\*\) is invariant, then:

Invariant target
\[ S_{\mathfrak G}f^\*=f^\*, \qquad (I-S_{\mathfrak G})f^\*=0. \] Therefore: \[ \|f-f^\*\|^2 = \|S_{\mathfrak G}f-f^\*\|^2 + \|(I-S_{\mathfrak G})f\|^2. \]
The non-invariant part is pure extra error when the target is invariant. Smoothing removes that part and cannot make the approximation worse.
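Mathematical caution. The argument above shows that for every \(f\), the smoothed function \(S_{\mathfrak G}f\) is at least as close to an invariant target as \(f\) is, which gives \(\inf_{f\in S_{\mathfrak G}\mathcal F}\|f-f^\*\|^2\le\inf_{f\in\mathcal F}\|f-f^\*\|^2\). Equality of the two infima needs an extra assumption on the class, for example that it is closed under smoothing, \(S_{\mathfrak G}\mathcal F\subseteq\mathcal F\).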
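The decomposition can be checked exactly on a small example. This is a sketch under stated assumptions: \(\mathcal X\) is the set of binary strings of length 3 with the uniform measure (which is invariant under shifts), \(\mathfrak G\) is the cyclic shift group, and functions are stored as lookup tables with exact rational values. The particular `f` and `f_star` are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import product

n = 3
X = list(product((0, 1), repeat=n))  # finite input space

def translate(x, t):
    return tuple(x[(u - t) % n] for u in range(n))

def smooth(f):
    """S_G f as a table: average f over the cyclic shifts of each input."""
    return {x: sum(f[translate(x, t)] for t in range(n)) / n for x in X}

def sq_norm(f):
    """Squared L^2 norm under the uniform measure on X."""
    return sum(v * v for v in f.values()) / len(X)

def diff(f, h):
    return {x: f[x] - h[x] for x in X}

# f: a non-invariant function; f_star: an invariant target.
f = {x: Fraction(4 * x[0] + 2 * x[1] + x[2]) for x in X}
f_star = {x: Fraction(sum(x)) ** 2 for x in X}

Sf = smooth(f)
lhs = sq_norm(diff(f, f_star))
rhs = sq_norm(diff(Sf, f_star)) + sq_norm(diff(f, Sf))

print(smooth(Sf) == Sf)                  # True: idempotence
print(lhs == rhs)                        # True: Pythagorean identity
print(sq_norm(diff(Sf, f_star)) <= lhs)  # True: smoothing does not hurt
```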

10. Sample complexity under invariance helpful but not enough

If \(f\) is \(\beta\)-Lipschitz in ordinary input space, then:

Ordinary Lipschitz condition
\[ |f(x)-f(x')|\leq \beta\|x-x'\|. \]

For invariant functions, the relevant distance is not the direct distance between \(x\) and \(x'\). Since \(x'\) and \(gx'\) are equivalent, we should compare \(x\) to the best-aligned transformed version of \(x'\).

Orbit distance
\[ d_{\mathfrak G}(x,x') = \inf_{g\in\mathfrak G}\|x-gx'\|. \] \[ |f(x)-f(x')| \leq \beta\, d_{\mathfrak G}(x,x') = \beta\inf_{g\in\mathfrak G}\|x-gx'\|. \]
Figure: the orbit of \(x\) and the orbit of \(x'\), with the shortest aligned distance between them; the raw distance may be misleading.
Invariant learning compares orbits, not just raw points. Two inputs can be far in raw coordinates but close after a symmetry alignment.
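For a finite group the infimum in \(d_{\mathfrak G}\) is a minimum over finitely many alignments. The sketch below assumes cyclic shifts and the Euclidean norm, and compares the raw distance with the orbit distance for two signals that differ only by a shift.

```python
import math

def translate(x, t):
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def dist(x, y):
    """Euclidean distance between two signals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def orbit_dist(x, y):
    """d_G(x, y) = min over cyclic shifts t of ||x - T_t y||."""
    return min(dist(x, translate(y, t)) for t in range(len(y)))

x = [0, 0, 5, 0, 0, 0]
y = [0, 0, 0, 0, 5, 0]   # the same bump, two positions later

print(dist(x, y))        # about 7.07: far apart in raw coordinates
print(orbit_dist(x, y))  # 0.0: identical after alignment
```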

The sample-complexity message

The theorem in the lecture states that invariant learning can improve the effective sample size by a factor related to the group size:

Representative rate
\[ \mathbb E\mathcal R(\tilde f)\lesssim (|\mathfrak G|n)^{-1/d}. \]
Good news

Invariance can reduce statistical error. Roughly, if the group has many transformations, one sample represents information about many transformed versions.

Bad news

The exponent is still \(-1/d\). In high dimension, this is still a cursed rate. Therefore, invariance alone is not sufficient.

Symmetry gives a real gain, but scale separation and locality are needed to go further.
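A short calculation makes the good news and the bad news concrete. The sketch below treats the representative rate as an equality with constant 1, which is an assumption made only for illustration, and asks how many samples are needed to push the bound below \(\varepsilon=0.1\). The target is passed as the integer `inv_eps` \(=1/\varepsilon\) to keep the arithmetic exact.

```python
def bound(n, group_size, d):
    """Representative rate (|G| n)^(-1/d), with constants ignored."""
    return (group_size * n) ** (-1.0 / d)

def samples_needed(inv_eps, group_size, d):
    """Smallest integer n with (|G| n)^(-1/d) <= 1/inv_eps, constants ignored."""
    target = inv_eps ** d                     # need |G| * n >= inv_eps^d
    return max(1, -(-target // group_size))   # ceiling division

for d in (2, 10, 100):
    plain = samples_needed(10, 1, d)
    with_symmetry = samples_needed(10, 1000, d)
    print(f"d={d}: n ~ {plain:.1e} without symmetry, "
          f"{with_symmetry:.1e} with |G|=1000")
```

Under these assumptions a group of size 1000 divides the required sample count by 1000, but for \(d=100\) that still leaves about \(10^{97}\) samples: the gain is a constant factor, while the dependence on \(d\) stays exponential.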

11. Scale separation and multiresolution structure beyond invariance

Lecture 4 argues that symmetry should be combined with scale separation. Many real signals are not arbitrary high-dimensional objects: important information often appears at coarse scales, local scales, or hierarchical compositions of both.

Case A: coarse scales dominate

Sometimes the target mostly depends on a coarsened version of the signal. For example, distinguishing a beach from a mountain may not require every pixel-level detail.

Coarse-scale prior
\[ f^\*(x)\approx \tilde f^\*(\tilde x), \] where \(\tilde x\) is a coarsened version of \(x\) on a smaller domain \(\tilde\Omega\), with \[ |\tilde\Omega|\ll |\Omega|. \]
Why this helps. If the relevant dimension is closer to \(|\tilde\Omega|\) than \(|\Omega|\), the curse of dimensionality is reduced.

Case B: local scales dominate

Sometimes the target depends on repeated local patterns. For example, texture recognition may depend on small patches rather than the full image at once.

Local-patch prior
\[ f^\*(x)\approx \sum_{u\in\Omega} g(x_u), \] where \(x_u\) is a local patch centered at \(u\) and \(g\) here denotes a shared local function, not a group element.
Why this helps. The relevant dimension is the patch dimension, not the full input dimension. Locality turns one huge problem into many shared smaller problems.
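The local-patch form can be sketched in a few lines. This example assumes a 1D periodic signal and patches of width 3; the local function (called \(g\) in the formula, `local_fn` here) is a hypothetical peak detector chosen only for illustration. Its input dimension is 3 regardless of the signal length.

```python
def translate(x, t):
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def patch(x, u, radius=1):
    """Local patch x_u: the entries within `radius` of position u (periodic)."""
    n = len(x)
    return tuple(x[(u + o) % n] for o in range(-radius, radius + 1))

def local_fn(p):
    """Hypothetical local detector: 1 if the patch center is a strict peak."""
    left, mid, right = p
    return 1 if mid > left and mid > right else 0

def count_peaks(x):
    """f(x) = sum over u of local_fn(x_u)."""
    return sum(local_fn(patch(x, u)) for u in range(len(x)))

x = [0, 3, 1, 4, 4, 0, 2, 1]
print(count_peaks(x))                # 2
print(count_peaks(translate(x, 3)))  # 2: sharing local_fn over u gives shift invariance
```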

Multiresolution analysis

A multiresolution analysis decomposes a signal into coarse content plus details needed to reconstruct fine-scale information. Wavelet filter banks are a classical implementation of this idea.

Coarse plus details
\[ x \quad\leadsto\quad (\tilde x,\text{details}), \] where \(\tilde x\) is defined on a coarser domain and the details encode what is missing at the fine scale.
Figure: a fine signal \(x\) split into a coarse part and details across a hierarchy of scales.
Scale separation decomposes the learning problem into coarse information and local/fine details.
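One level of the Haar transform is perhaps the simplest instance of a coarse-plus-details split. The sketch below assumes an even-length 1D signal and uses unnormalized pairwise averages and half-differences, which is one common convention among several.

```python
def haar_analysis(x):
    """One level: pairwise averages (coarse) and half-differences (details)."""
    half = len(x) // 2
    coarse = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return coarse, detail

def haar_synthesis(coarse, detail):
    """Invert haar_analysis: rebuild the fine signal from both parts."""
    x = []
    for c, d in zip(coarse, detail):
        x.extend([c + d, c - d])
    return x

x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
coarse, detail = haar_analysis(x)

print(coarse)  # [3.0, 5.0, 2.0, 4.0]
print(detail)  # [1.0, 0.0, -1.0, -4.0]
print(haar_synthesis(coarse, detail) == x)  # True
```

The coarse part lives on a domain half the size of the original, and applying `haar_analysis` again to `coarse` produces the next level of the hierarchy.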
Limit of pure scale assumptions. Coarse-only and local-only assumptions are often too strong. Real architectures use compositional hierarchies that combine local processing, coarsening, and global aggregation.

12. The Geometric Deep Learning blueprint constructive architecture

The GDL blueprint combines the two major priors:

Symmetry

Use equivariant layers so transformations of the input produce consistent transformations of the features.

Scale separation

Use locality, nonlinearities, pooling, and hierarchical composition to reduce effective dimensionality.

Why linear invariants alone are weak

A linear invariant map often collapses too much information. For example, group averaging gives:

Group average
\[ Ax=\bar x = \frac{1}{|\mathfrak G|} \sum_{g\in\mathfrak G}gx. \]

This is invariant, but it may destroy discriminative structure. Therefore, we need richer equivariant features before final invariant pooling.
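The loss of information is easy to see numerically. The sketch below assumes cyclic shifts acting on length-4 signals; in that case the group average replaces every entry by the signal mean, so a sharp spike and a flat signal become indistinguishable.

```python
def translate(x, t):
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def group_average(x):
    """A x = (1/|G|) * sum over g of g x, for G = cyclic shifts."""
    n = len(x)
    shifted = [translate(x, t) for t in range(n)]
    return [sum(s[u] for s in shifted) / n for u in range(n)]

x = [4.0, 0.0, 0.0, 0.0]   # one sharp spike
y = [1.0, 1.0, 1.0, 1.0]   # flat signal

print(group_average(x))  # [1.0, 1.0, 1.0, 1.0]
print(group_average(y))  # [1.0, 1.0, 1.0, 1.0]
```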

Core building blocks

1. Linear equivariant layer. \[ B(gx)=gB(x). \] Examples include convolutions on grids and message-passing/diffusion-like operators on graphs.
2. Pointwise nonlinearity. \[ (\sigma x)(u)=\sigma(x(u)). \] If the group acts by moving positions, pointwise nonlinearities preserve equivariance.
3. Local pooling / coarsening. Reduce spatial resolution while preserving the relevant symmetry structure approximately or exactly.
4. Global invariant layer. Aggregate over the domain or group orbit to produce a final invariant prediction.
Blueprint schematic
\[ x \xrightarrow{\text{equivariant }B_1} h_1 \xrightarrow{\sigma} \sigma(h_1) \xrightarrow{\text{pool}} h_2 \xrightarrow{\text{equivariant }B_2} \cdots \xrightarrow{\text{global invariant pooling}} y. \]
Equivariance preserves structured information layer by layer; final invariance removes nuisance transformations only when the prediction is ready.
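A toy instance of the blueprint fits in a few lines. This is a sketch under stated assumptions: a 1D periodic signal, the cyclic shift group, two circular convolutions with arbitrary integer kernels, ReLU as the pointwise nonlinearity, and a global sum as the invariant layer. Local pooling is left out because strided pooling is exactly equivariant only to shifts by multiples of the stride.

```python
def translate(x, t):
    n = len(x)
    return [x[(u - t) % n] for u in range(n)]

def circular_conv(x, k):
    """Linear equivariant layer: (C_k x)(u) = sum_v k[v] * x[(u - v) mod n]."""
    n = len(x)
    return [sum(k[v] * x[(u - v) % n] for v in range(len(k)))
            for u in range(n)]

def relu(x):
    """Pointwise nonlinearity: acts on each position separately."""
    return [max(0, v) for v in x]

def features(x, k1, k2):
    """Two equivariant blocks; the output is still a signal on the domain."""
    h1 = relu(circular_conv(x, k1))
    return relu(circular_conv(h1, k2))

def model(x, k1, k2):
    """Global invariant pooling on top of equivariant features."""
    return sum(features(x, k1, k2))

x = [1, 4, 2, 0, 3, 5]
k1, k2 = [1, -2, 1], [2, -1]
shifts = range(len(x))

features_equivariant = all(
    features(translate(x, t), k1, k2) == translate(features(x, k1, k2), t)
    for t in shifts)
output_invariant = len({model(translate(x, t), k1, k2) for t in shifts}) == 1
print(features_equivariant, output_invariant)  # True True
```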

13. Architecture map domain, group, model

Many familiar neural architectures can be interpreted through the same template: choose a domain, identify a symmetry group, use representations, and build equivariant/invariant maps.

| Architecture | Domain \(\Omega\) | Symmetry group | Main equivariant operation | Final invariant operation |
|---|---|---|---|---|
| CNN | grid | translations | convolution | global pooling / classifier head |
| Group CNN | grid or group | translations + rotations/reflections | group convolution | pooling over group/domain |
| Spherical CNN | sphere / \(SO(3)\) | 3D rotations | spherical/group convolution | rotation-invariant aggregation |
| Mesh / intrinsic CNN | manifold or mesh | isometries / gauge transformations | gauge-equivariant local operators | surface/global pooling |
| GNN | graph vertices | permutations of node order | message passing | readout pooling |
| Deep Sets | set | permutations | shared pointwise map | sum/mean/max pooling |
| Transformer | complete graph / sequence tokens | permutation or sequence symmetries, depending on positional encoding | attention | CLS token / pooling / task head |
Unifying message. These architectures differ in implementation, but they follow the same mathematical pattern: define how symmetries act on the data, then build layers compatible with that action.
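The Deep Sets row is the smallest case of this pattern. The sketch below assumes scalar set elements; the per-element map `phi` and the final readout are hypothetical choices made only to show the structure: a shared pointwise map (permutation equivariant) followed by sum pooling (permutation invariant).

```python
from itertools import permutations

def phi(v):
    """Hypothetical shared per-element map, applied identically to each element."""
    return (v, v * v)

def deep_set(xs):
    """Readout of sum-pooled features: here (sum of squares) - (sum)."""
    feats = [phi(v) for v in xs]      # permutation equivariant
    a = sum(f[0] for f in feats)      # sum pooling: permutation invariant
    b = sum(f[1] for f in feats)
    return b - a

xs = (3, 1, 2)
values = {deep_set(p) for p in permutations(xs)}
print(values)  # {8}
```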

14. Final checklist definitions and logic

| Concept | Formula | Meaning |
|---|---|---|
| Signal | \(x:\Omega\to\mathcal C\) | A datum is a function on a structured domain. |
| Group | \((\mathfrak G,\cdot)\) | Set with associative operation, identity, and inverses. |
| Group action | \((g,u)\mapsto gu\) | How group elements transform domain points. |
| Signal action | \((gx)(u)=x(g^{-1}u)\) | How transformations act on functions defined on the domain. |
| Representation | \(\rho(gh)=\rho(g)\rho(h)\) | Linear realization of group elements on a vector space. |
| Invariance | \(f(gx)=f(x)\) | Output is unchanged along group orbits. |
| Equivariance | \(F(gx)=gF(x)\) | Output transforms consistently with input. |
| Group smoothing | \(S_{\mathfrak G}f(x)=\frac1{\lvert\mathfrak G\rvert}\sum_g f(gx)\) | Average over transformed versions of \(x\). |
| Orbit distance | \(\inf_g\lVert x-gx'\rVert\) | Distance after best symmetry alignment. |
| Scale separation | \(f^\*(x)\approx \tilde f^\*(\tilde x)\) or \(\sum_u g(x_u)\) | Learning depends on coarse or local/hierarchical structure. |

Proof reminders

To prove group smoothing is invariant: compute \((S_{\mathfrak G}f)(hx)\), use \(g(hx)=(gh)x\), then re-index \(k=gh\). The re-indexing is valid because right multiplication by \(h\) is a bijection.
To understand \((I-S_{\mathfrak G})f\): it is the residual after subtracting the orbit average. It has zero average on every orbit.
To understand the missing cross term: \(S_{\mathfrak G}f\) and \((I-S_{\mathfrak G})f\) are orthogonal components, so the \(2\langle a,b\rangle\) term is zero.
To remember invariance vs equivariance: invariant means “same output”; equivariant means “output moves consistently.”
To remember the Lecture 4 conclusion: symmetry improves sample complexity, but high-dimensional learning still needs scale separation, locality, and compositionality.
The complete logic: domain structure gives symmetries; symmetries give group actions and representations; representations define invariant/equivariant maps; equivariant layers preserve structure; pooling creates final invariance; scale separation makes the model statistically and computationally efficient.