
Topological Data Analysis: How Shape Theory Is Changing Medicine

Persistent homology, Betti numbers, and why the shape of data matters more than its coordinates.

Introduction

Medical data is high-dimensional. A single gene expression profile might have 20,000 features. A histology image, when tiled and embedded, lives in a space with hundreds of dimensions. An fMRI scan produces time series across thousands of voxels. Traditional statistical methods (PCA, clustering, regression) handle this data by projecting it into lower dimensions or partitioning it into groups. These methods work well when the data has a simple structure. But biological data rarely does.

The problem is that PCA finds linear directions of maximum variance, not shape. Clustering assumes data comes in blobs. Neither captures the loops, branches, voids, and connected components that naturally arise in biological systems. A tumor with a necrotic core surrounded by a ring of proliferating cells has topological structure (a loop) that k-means clustering will never detect. This is where topology enters.

Abstract geometric forms — topology studies the properties that survive continuous deformation

What Topology Brings to Data

Topology is the study of properties that survive continuous deformation. You can stretch, bend, and twist a shape, but as long as you do not tear it or glue parts together, its topological properties remain unchanged. A coffee mug and a donut are topologically identical (both have one hole). A sphere and a cube are identical (neither has holes). This level of abstraction sounds impractical, but it turns out to be exactly what you want for noisy, high-dimensional data.

Coordinates are fragile. Rotate a point cloud, and the coordinates change entirely. Scale it, and they change again. Add noise, and they change unpredictably. But the shape of the point cloud (its connected components, loops, and voids) is robust to all of these transformations. Topological data analysis (TDA) extracts these shape features and uses them as input to downstream analysis.

1. Coordinate-free. Topological features depend on the relationships between points, not their absolute positions. This makes TDA naturally invariant to rotation, translation, and many forms of noise.
2. Multi-scale. TDA examines data at every scale simultaneously, from local neighborhoods to global structure. This reveals features that exist only at specific resolutions, which single-scale methods miss entirely.
3. Interpretable. A loop in the persistence diagram corresponds to an actual loop in the data. Unlike the latent dimensions of PCA, topological features have geometric meaning that domain experts can interrogate.

Persistent Homology

The central technique in TDA is persistent homology. The idea is elegant. Start with a set of data points. Grow a ball of radius r around each point. As r increases from zero, the balls overlap and form increasingly complex shapes. At small r, you have many disconnected components. As r grows, components merge. Eventually loops form. Then loops fill in. Then higher-dimensional voids appear and disappear.

Persistent homology tracks these events. Each topological feature (a connected component, a loop, a void) has a birth time (the radius at which it appears) and a death time (the radius at which it disappears). Features that persist across a wide range of radii are considered real structure. Features that appear and vanish quickly are likely noise.

The output is a persistence diagram: a scatter plot where each point represents a topological feature, with its birth on the x-axis and death on the y-axis. Points far from the diagonal are long-lived features (signal). Points near the diagonal are short-lived features (noise). This single visualization summarizes the multi-scale topological structure of an entire dataset.
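This diagonal-distance intuition is easy to operationalize: a feature's persistence is simply death minus birth, and thresholding on it separates signal from noise. A minimal NumPy sketch (the function name and the threshold value are illustrative, not from any TDA library):

```python
import numpy as np

def split_by_persistence(diagram, threshold):
    """Split a persistence diagram (rows of (birth, death) pairs) into
    likely signal (persistence >= threshold) and likely noise."""
    diagram = np.asarray(diagram, dtype=float)
    persistence = diagram[:, 1] - diagram[:, 0]  # lifetime of each feature
    signal = diagram[persistence >= threshold]
    noise = diagram[persistence < threshold]
    return signal, noise

# Toy diagram: one long-lived feature, two short-lived artifacts
dgm = np.array([[0.1, 1.4], [0.2, 0.25], [0.3, 0.32]])
signal, noise = split_by_persistence(dgm, threshold=0.5)
print(len(signal), len(noise))  # 1 signal feature, 2 noise features
```

In practice the threshold is chosen by inspecting the diagram or via stability results, but the mechanics are exactly this simple.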

1. Step 1: Build a filtration. Construct a nested sequence of simplicial complexes (generalizations of triangulations) at increasing scale parameters. Common choices are the Vietoris-Rips complex (connect points within distance r) and the alpha complex (based on the Delaunay triangulation).
2. Step 2: Compute homology at each scale. At each value of r, compute the homology groups, which count the topological features in each dimension: connected components (H₀), loops (H₁), voids (H₂), and so on.
3. Step 3: Track persistence. As r increases, features are born and die. Record each birth-death pair. The collection of these pairs is the persistence diagram.
4. Step 4: Vectorize for machine learning. Persistence diagrams can be converted into fixed-length vectors (persistence images, persistence landscapes) for use as features in standard ML pipelines.
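For H₀ the steps above collapse to something very concrete: every point is born at radius 0, and a component dies at the radius where it first merges into another. A minimal NumPy sketch using union-find (this is the single-linkage view of the Vietoris-Rips filtration; the function name is my own):

```python
import numpy as np

def h0_persistence(points):
    """H0 persistence of a Vietoris-Rips filtration: every component is
    born at r = 0; a component dies at the radius where it merges."""
    n = len(points)
    # All pairwise distances, flattened into sorted (distance, i, j) edges
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # two components merge: one of them dies
            parent[ri] = rj
            deaths.append(d)
    # n - 1 merges kill all but one component (which lives forever)
    return np.array(deaths)

# Two well-separated pairs of points
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
deaths = h0_persistence(pts)
print(deaths)  # two small deaths (~0.1, within-pair merges), one large (~4.9)
```

Higher-dimensional homology requires real simplicial machinery (boundary matrices and their reduction), which is what libraries like Ripser and GUDHI implement efficiently.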
The key insight

Persistent homology converts the question “what is the shape of this data?” into a computable, stable, multi-scale summary. The persistence diagram is a topological fingerprint of the dataset.

Betti Numbers

The Betti numbers are the simplest topological invariants, and understanding them is essential for reading persistence diagrams.

1. β₀ (beta-zero): connected components. How many separate pieces does the space have? A dataset with three distinct clusters has β₀ = 3. As the filtration radius grows, components merge and β₀ decreases toward 1.
2. β₁ (beta-one): loops. How many independent loops exist? A torus has β₁ = 2 (two independent loops around its surface). In a cell arrangement, a ring of cells surrounding a necrotic core produces a persistent β₁ feature.
3. β₂ (beta-two): voids. How many enclosed cavities exist? A hollow sphere has β₂ = 1. In molecular data, voids can correspond to binding pockets, cavities where drug molecules can dock.

The power of persistent homology is that it computes these Betti numbers not at a single scale but across all scales simultaneously, revealing which topological features are robust and which are artifacts of a particular resolution.
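At any single scale, β₀ is just the number of connected components of the graph that links points within distance r, which makes the "β₀ decreases toward 1" behavior easy to see directly. A minimal sketch, assuming SciPy is available (the `betti0` helper is my own):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def betti0(points, r):
    """beta_0 at scale r: connected components of the graph that
    joins every pair of points within distance r."""
    adjacency = (cdist(points, points) <= r).astype(int)
    n_components, _ = connected_components(csr_matrix(adjacency), directed=False)
    return n_components

# Three tight clusters: beta_0 stays at 3 until r bridges the gaps
rng = np.random.default_rng(0)
clusters = np.vstack([rng.normal(c, 0.02, (20, 2))
                      for c in ([0, 0], [3, 0], [0, 3])])
for r in (0.3, 2.0, 5.0):
    print(r, betti0(clusters, r))  # beta_0 = 3, then 3, then 1
```

Persistent homology automates exactly this sweep over r, for every homology dimension at once, and records when each count changes.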

Applications in Medicine

This is where abstract mathematics meets clinical impact. TDA has found applications across multiple areas of medicine, often revealing structure that conventional methods miss.

1. Tumor classification from histology. The spatial arrangement of cells in a tissue biopsy carries diagnostic information. In breast cancer, the topological features of cell nuclei arrangements (persistent β₀ for cluster structure, persistent β₁ for ring patterns) have been shown to distinguish between cancer grades with accuracy competitive with deep learning, while providing interpretable features that pathologists can validate against known morphological criteria.
2. Drug discovery and molecular shape. A molecule's biological activity depends on its three-dimensional shape, particularly the topology of its binding surfaces. Persistent homology of molecular point clouds captures shape similarity in a way that is invariant to rigid transformation, enabling more robust virtual screening than traditional molecular fingerprints. Binding pocket voids (β₂ features) directly correspond to druggable sites.
3. Brain connectivity analysis. Functional MRI produces correlation matrices between brain regions, which define weighted graphs. Persistent homology of these graphs reveals connectivity patterns, loops in functional networks, that distinguish healthy subjects from patients with neurological disorders. The filtration naturally handles the threshold-selection problem that plagues traditional graph-based brain analysis.
4. Genomics and gene expression. Gene expression data from single-cell RNA sequencing has complex structure: branching trajectories of cell differentiation, loops in cell-cycle dynamics. TDA captures these topological signatures, enabling the identification of cell states and transition pathways that clustering algorithms flatten into discrete groups.
Topological analysis of brain connectivity data — persistent homology reveals network structures invisible to traditional graph thresholding methods
Why this matters personally

I build medical AI systems. The standard pipeline is: collect data, extract features, train a classifier. TDA offers a fundamentally different kind of feature, one that captures the shape of biological processes rather than just their statistical summaries. When I look at a persistence diagram of tumor histology data, I see mathematics making contact with medicine in a way that neither discipline could achieve alone.

TDA vs Traditional Dimensionality Reduction

Comparing Methods for High-Dimensional Medical Data

| Method | What it captures | Strengths | Limitations |
| --- | --- | --- | --- |
| PCA | Linear variance directions | Fast, well-understood, deterministic | Misses nonlinear structure, no shape information |
| t-SNE | Local neighborhood structure | Excellent visualization for clusters | Non-deterministic, distorts global geometry, no persistence |
| UMAP | Local and some global structure | Fast, preserves more global structure than t-SNE | Hyperparameter-sensitive, topological claims are approximate |
| TDA (persistent homology) | Multi-scale topological features | Coordinate-free, stable, interpretable, rigorous | Computationally expensive for large datasets, requires vectorization for ML |

These methods are not competitors; they are complementary. PCA and UMAP are excellent for visualization and preprocessing. TDA adds a layer of shape analysis that the others fundamentally cannot provide. The most effective pipelines I have seen use UMAP for initial exploration and TDA for feature extraction that feeds into predictive models.
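To feed topological features into a predictive model, the diagram must first become a fixed-length vector. Persistence images and landscapes are the standard choices; as an illustrative sketch, here is a simpler (hypothetical) summary-statistics vectorizer, including persistence entropy:

```python
import numpy as np

def diagram_features(diagram):
    """Collapse a persistence diagram into a fixed-length vector:
    feature count, total/max/mean persistence, and persistence entropy."""
    diagram = np.asarray(diagram, dtype=float)
    if len(diagram) == 0:
        return np.zeros(5)
    pers = diagram[:, 1] - diagram[:, 0]       # lifetime of each feature
    p = pers / pers.sum()                      # normalized lifetimes
    entropy = -(p * np.log(p)).sum()           # persistence entropy
    return np.array([len(pers), pers.sum(), pers.max(), pers.mean(), entropy])

# One dominant loop plus two short-lived artifacts
dgm = np.array([[0.0, 1.0], [0.2, 0.4], [0.5, 0.6]])
features = diagram_features(dgm)
print(features)  # count=3, total=1.3, max=1.0, mean~0.433, plus entropy
```

A vector like this (or, better, a persistence image) can then go straight into a standard classifier alongside UMAP coordinates or other features.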

Getting Started

The TDA ecosystem in Python has matured significantly. Three libraries cover most use cases: Ripser for fast persistent homology computation, GUDHI for a broader range of topological tools, and scikit-tda for integration with scikit-learn pipelines.

Here is a minimal example computing persistent homology of a point cloud using Ripser:

```python
import numpy as np
from ripser import ripser
from persim import plot_diagrams

# Generate a noisy circle (a dataset with a known topological feature: one loop)
n_points = 200
theta = np.random.uniform(0, 2 * np.pi, n_points)
noise = np.random.normal(0, 0.1, (n_points, 2))
circle = np.column_stack([np.cos(theta), np.sin(theta)]) + noise

# Compute persistent homology up to dimension 1
result = ripser(circle, maxdim=1)

# result['dgms'][0] contains H0 (connected components) birth-death pairs
# result['dgms'][1] contains H1 (loops) birth-death pairs
h0 = result['dgms'][0]
h1 = result['dgms'][1]

print(f"H0 features (connected components): {len(h0)}")
print(f"H1 features (loops): {len(h1)}")

# The longest-lived H1 feature corresponds to the circle
if len(h1) > 0:
    persistence = h1[:, 1] - h1[:, 0]
    most_persistent = h1[np.argmax(persistence)]
    print(f"Most persistent loop: birth={most_persistent[0]:.3f}, "
          f"death={most_persistent[1]:.3f}, "
          f"persistence={np.max(persistence):.3f}")

# Visualize the persistence diagram
plot_diagrams(result['dgms'], show=True)
```

In this example, the point cloud is sampled from a circle with Gaussian noise. Ripser computes the persistence diagram in milliseconds. The H₁ diagram will show one prominent point far from the diagonal (the real loop) and many short-lived points near the diagonal (noise artifacts). This separation between signal and noise is exactly what makes persistent homology useful for real data where the ground truth is unknown.

Practical advice

Start with synthetic data where you know the topology. A noisy circle should produce one persistent H₁ feature. A noisy figure-eight should produce two. Build intuition with known shapes before applying TDA to complex medical datasets where the topological ground truth is what you are trying to discover.

Conclusion

Topological data analysis is not a replacement for the statistical and machine learning tools we already use. It is a new lens, one that reveals structure invisible to methods that depend on coordinates, distances, and linear projections. For medical data, where the geometry of cell arrangements, molecular surfaces, and brain networks carries diagnostic information, this lens is particularly powerful.

What draws me to TDA is the same thing that draws me to both mathematics and medicine: the conviction that structure matters. A tumor is not just a collection of cells with abnormal gene expression values. It is an organized structure with topological properties, loops of proliferation, voids of necrosis, connected components of invasion, that carry information about prognosis and treatment response. TDA gives us the mathematical language to read that structure directly.

The field is young. The computational tools are maturing. The medical applications are being validated. For anyone who sits at the intersection of mathematics, computer science, and biology, this is one of the most fertile areas of research I know. The shape of data is not a metaphor. It is a measurable, computable, clinically relevant quantity, and we are only beginning to learn how to use it.

The bridge

Abstract mathematics built to study coffee mugs and donuts is now classifying tumors and mapping brain networks. That is not a coincidence. It is a reminder that the most powerful tools in applied science often come from the most abstract corners of pure mathematics.
