
Information Geometry: The Shape of Probability

14 min read · mathematics · machine-learning · statistics

What if probability distributions lived on a curved surface? This is the starting point of information geometry — a field that applies differential geometry to statistics, yielding deep insights into optimization, learning, and the nature of statistical models.

Statistical Manifolds

A parametric family of distributions $\{p_\theta : \theta \in \Theta\}$ forms a manifold where each point is a distribution and the coordinates are the parameters $\theta$. For example, the univariate Gaussians $\mathcal{N}(\mu, \sigma^2)$ form a two-dimensional manifold with coordinates $(\mu, \sigma)$.

The natural metric on this manifold is the Fisher information matrix:

$$g_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[\frac{\partial \log p_\theta}{\partial \theta_i} \, \frac{\partial \log p_\theta}{\partial \theta_j}\right]$$

This is the unique Riemannian metric (up to scaling) that is invariant under sufficient statistics, a result known as Chentsov's theorem — meaning it captures the intrinsic geometry of the statistical model, independent of parameterization.

The Fisher information matrix tells us how "distinguishable" nearby distributions are. Large eigenvalues mean the model is sensitive to parameter changes in that direction; small eigenvalues mean many parameter values give nearly the same distribution.
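As a sanity check, the Fisher information of a one-parameter family can be estimated by Monte Carlo as the expected squared score, and compared against a known closed form. A minimal sketch for the Bernoulli family, where the analytic value is $1/(\theta(1-\theta))$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3

# Sample from Bernoulli(theta); the score is d/dtheta log p(x)
x = rng.binomial(1, theta, size=200_000)
score = x / theta - (1 - x) / (1 - theta)

# Fisher information = expected squared score (the mean score is ~0)
fisher_mc = np.mean(score**2)
fisher_exact = 1 / (theta * (1 - theta))  # analytic value for Bernoulli
```

With enough samples the Monte Carlo estimate should land within a few percent of the exact value.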

Geodesics and KL Divergence

On a statistical manifold, the KL divergence is approximately half the squared geodesic distance:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac{1}{2} \sum_{ij} g_{ij} \, d\theta_i \, d\theta_j$$

This means KL divergence is the natural "distance" induced by the Fisher metric — though it's not symmetric, reflecting the asymmetric nature of information.
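This approximation is easy to verify numerically for a one-parameter family. A small check for the Bernoulli case, using the Fisher information $g(\theta) = 1/(\theta(1-\theta))$:

```python
import math

theta, d = 0.3, 0.01

# Exact KL divergence between Bernoulli(theta) and Bernoulli(theta + d)
kl = theta * math.log(theta / (theta + d)) \
    + (1 - theta) * math.log((1 - theta) / (1 - theta - d))

# Quadratic approximation: 0.5 * g(theta) * d^2
g = 1 / (theta * (1 - theta))
approx = 0.5 * g * d**2
```

For a step of $d = 0.01$ the two quantities agree to within a couple of percent; the gap shrinks quadratically as $d \to 0$.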

Natural Gradient Descent

Standard gradient descent moves in the direction of steepest descent in parameter space. But parameter space is not the right space — the same distribution can be parameterized in many ways.

The natural gradient corrects for the geometry:

$$\tilde{\nabla} L(\theta) = \mathbf{F}^{-1} \nabla L(\theta)$$

where $\mathbf{F}$ is the Fisher information matrix. This gives the steepest descent direction in distribution space, not parameter space.

```python
import torch

# Standard gradient descent: step along the raw gradient in parameter space
theta -= lr * grad

# Natural gradient descent: precondition the gradient with the inverse
# Fisher matrix to get the steepest descent direction in distribution space
fisher_inv = torch.linalg.inv(fisher_matrix)
natural_grad = fisher_inv @ grad
theta -= lr * natural_grad
```

In practice, computing $\mathbf{F}^{-1}$ is expensive. Approximations like K-FAC and the empirical Fisher make natural gradient methods practical for deep learning.
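The empirical Fisher replaces the model expectation with an average of outer products of per-example gradients. A minimal sketch, where the matrix `G` of per-example gradients is a random placeholder standing in for real per-example backprop outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 4

# Placeholder per-example gradients of the log-likelihood, shape (n, d);
# in practice these come from backprop on individual examples
G = rng.normal(size=(n, d))

# Empirical Fisher: average outer product of per-example gradients
fisher_emp = G.T @ G / n

# A small damping term keeps the matrix well-conditioned before solving
damping = 1e-3
grad = G.mean(axis=0)
natural_grad = np.linalg.solve(fisher_emp + damping * np.eye(d), grad)
```

Using `solve` rather than forming an explicit inverse is both cheaper and more numerically stable, which matters as the parameter dimension grows.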

The Bigger Picture

Information geometry connects:

  • Statistics: Fisher information, sufficiency, exponential families
  • Optimization: natural gradients, mirror descent
  • Physics: thermodynamics, entropy, statistical mechanics
  • Information theory: coding, channel capacity

The exponential and mixture families form dual coordinate systems on the statistical manifold — a beautiful duality that mirrors the Legendre transform in thermodynamics.
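Concretely, for an exponential family $p_\theta(x) = \exp(\theta \cdot T(x) - \psi(\theta))$ with log-partition function $\psi$, the two coordinate systems and the Fisher metric fall out of one function:

$$g_{ij}(\theta) = \partial_i \partial_j \psi(\theta), \qquad \eta = \nabla \psi(\theta) = \mathbb{E}_{p_\theta}[T(x)], \qquad \varphi(\eta) = \sup_\theta \{\theta \cdot \eta - \psi(\theta)\}$$

The natural parameters $\theta$ and the mean parameters $\eta$ are the dual coordinates, linked by the Legendre transform of $\psi$ — the same transform that relates energy and entropy in thermodynamics.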