Introduction
Medical data is high-dimensional. A single gene expression profile might have 20,000 features. A histology image, when tiled and embedded, lives in a space with hundreds of dimensions. An fMRI scan produces time series across thousands of voxels. Traditional statistical methods (PCA, clustering, regression) handle this data by projecting it into lower dimensions or partitioning it into groups. These methods work well when the data has a simple structure. But biological data rarely does.
The problem is that PCA finds linear directions of maximum variance, not shape. Clustering assumes data comes in blobs. Neither captures the loops, branches, voids, and connected components that naturally arise in biological systems. A tumor with a necrotic core surrounded by a ring of proliferating cells has topological structure (a loop) that k-means clustering will never detect. This is where topology enters.
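The necrotic-core example can be made concrete with a small numpy sketch (illustrative only, not from the original text): sample points from a ring and compute the centroid that any blob-based summary would use as its representative. The centroid lands in the empty core, where no tissue exists at all.

```python
import numpy as np

# Illustrative sketch: points sampled from a thin ring, standing in for a
# proliferating rim around a necrotic core. A blob model summarizes the
# data by a centroid -- which falls inside the void, far from every point.
rng = np.random.default_rng(0)

n = 500
theta = rng.uniform(0, 2 * np.pi, n)
radius = rng.uniform(0.9, 1.1, n)  # thin annulus around radius 1
ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

centroid = ring.mean(axis=0)  # what a centroid-based summary "sees"
gap = np.linalg.norm(ring - centroid, axis=1).min()

print(f"centroid: {centroid.round(3)}")     # near the origin
print(f"distance to nearest data point: {gap:.3f}")  # large: the void
```

The loop is the structure here, and no partition of the points into blobs represents it.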

What Topology Brings to Data
Topology is the study of properties that survive continuous deformation. You can stretch, bend, and twist a shape, but as long as you do not tear it or glue parts together, its topological properties remain unchanged. A coffee mug and a donut are topologically identical (both have one hole). A sphere and a cube are identical (neither has holes). This level of abstraction sounds impractical, but it turns out to be exactly what you want for noisy, high-dimensional data.
Coordinates are fragile. Rotate a point cloud, and the coordinates change entirely. Scale it, and they change again. Add noise, and they change unpredictably. But the shape of the point cloud (its connected components, loops, and voids) is robust to all of these transformations. Topological data analysis (TDA) extracts these shape features and uses them as input to downstream analysis.
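The contrast is easy to verify numerically. In this short sketch, a rotation changes every coordinate, but the pairwise distance matrix (the actual input to persistent homology) is unchanged up to floating-point error:

```python
import numpy as np

# Sketch: rotate a point cloud. Coordinates change completely, but the
# pairwise distance matrix -- the input persistent homology works from --
# is invariant.
rng = np.random.default_rng(1)
points = rng.normal(size=(100, 2))

angle = 0.7
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
rotated = points @ rot.T

def pairwise(x):
    # Euclidean distance matrix via broadcasting
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

coord_change = np.abs(points - rotated).max()
dist_change = np.abs(pairwise(points) - pairwise(rotated)).max()
print(f"max coordinate change: {coord_change:.3f}")   # large
print(f"max distance change:   {dist_change:.2e}")    # ~ machine precision
```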
Persistent Homology
The central technique in TDA is persistent homology. The idea is elegant. Start with a set of data points. Grow a ball of radius ε around each point. As ε increases from zero, the balls overlap and form increasingly complex shapes. At small ε, you have many disconnected components. As ε grows, components merge. Eventually loops form. Then loops fill in. Then higher-dimensional voids appear and disappear.
Persistent homology tracks these events. Each topological feature (a connected component, a loop, a void) has a birth time (the radius at which it appears) and a death time (the radius at which it disappears). Features that persist across a wide range of radii are considered real structure. Features that appear and vanish quickly are likely noise.
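For connected components (H0), this birth-and-death bookkeeping can be implemented in a few lines: every point is born at scale zero, and a component dies when it merges into another. The merge scales are exactly the edge lengths chosen by a Kruskal-style minimum spanning tree with union-find. The helper below is an illustrative sketch, not a library function:

```python
import numpy as np

# Minimal sketch of H0 persistence: each point is born at scale 0; a
# component dies at the scale where it merges with another. Those merge
# scales are the MST edge lengths (Kruskal + union-find).
def h0_deaths(points):
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    edges = sorted(
        (dist[i, j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at this scale
    return deaths  # n - 1 deaths; the last component lives forever

# Two tight clusters far apart: one death is much later than the rest
rng = np.random.default_rng(2)
cluster_a = rng.normal(0.0, 0.05, (20, 2))
cluster_b = rng.normal(0.0, 0.05, (20, 2)) + np.array([5.0, 0.0])
deaths = h0_deaths(np.vstack([cluster_a, cluster_b]))

print(f"longest-lived merge: {max(deaths):.2f}")   # ~5, the cluster gap
print(f"typical merge: {np.median(deaths):.3f}")   # small, within-cluster
```

The one long-lived death is the real structure (two clusters); the many early deaths are the noise-scale merges.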
The output is a persistence diagram: a scatter plot where each point represents a topological feature, with its birth on the x-axis and death on the y-axis. Points far from the diagonal are long-lived features (signal). Points near the diagonal are short-lived features (noise). This single visualization summarizes the multi-scale topological structure of an entire dataset.
Persistent homology converts the question “what is the shape of this data?” into a computable, stable, multi-scale summary. The persistence diagram is a topological fingerprint of the dataset.
Betti Numbers
The Betti numbers are the simplest topological invariants, and understanding them is essential for reading persistence diagrams. The k-th Betti number counts the k-dimensional holes in a space: β₀ counts connected components, β₁ counts loops, and β₂ counts enclosed voids. A circle has β₀ = 1 and β₁ = 1; a hollow sphere has β₀ = 1, β₁ = 0, and β₂ = 1.
The power of persistent homology is that it computes these Betti numbers not at a single scale but across all scales simultaneously, revealing which topological features are robust and which are artifacts of a particular resolution.
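Reading a Betti number off a persistence diagram is a one-liner: at scale r, count the features born at or before r that have not yet died. The helper below is an illustrative sketch on a hand-made diagram:

```python
import numpy as np

# Sketch: the Betti number at scale r is the number of birth-death pairs
# alive at r (born at or before r, not yet dead). `betti_at` is an
# illustrative helper, not a library function.
def betti_at(diagram, r):
    diagram = np.asarray(diagram)
    alive = (diagram[:, 0] <= r) & (r < diagram[:, 1])
    return int(alive.sum())

# A hand-made H1 diagram: one long-lived loop and two noise features
h1 = np.array([
    [0.05, 1.40],   # persistent loop (signal)
    [0.30, 0.35],   # short-lived (noise)
    [0.60, 0.62],   # short-lived (noise)
])

print(betti_at(h1, 0.32))  # 2: the real loop plus one noise feature
print(betti_at(h1, 1.00))  # 1: only the real loop remains
```

Scanning r across its full range recovers β₁ at every scale at once, which is exactly what the diagram encodes.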
Applications in Medicine
This is where abstract mathematics meets clinical impact. TDA has found applications across multiple areas of medicine, often revealing structure that conventional methods miss.

I build medical AI systems. The standard pipeline is: collect data, extract features, train a classifier. TDA offers a fundamentally different kind of feature, one that captures the shape of biological processes rather than just their statistical summaries. When I look at a persistence diagram of tumor histology data, I see mathematics making contact with medicine in a way that neither discipline could achieve alone.
TDA vs Traditional Dimensionality Reduction
Comparing Methods for High-Dimensional Medical Data
| Method | What it captures | Strengths | Limitations |
|---|---|---|---|
| PCA | Linear variance directions | Fast, well-understood, deterministic | Misses nonlinear structure, no shape information |
| t-SNE | Local neighborhood structure | Excellent visualization for clusters | Non-deterministic, distorts global geometry, no persistence |
| UMAP | Local and some global structure | Fast, preserves more global structure than t-SNE | Hyperparameter-sensitive, topological claims are approximate |
| TDA (Persistent Homology) | Multi-scale topological features | Coordinate-free, stable, interpretable, rigorous | Computationally expensive for large datasets, requires vectorization for ML |
These methods are not competitors; they are complementary. PCA and UMAP are excellent for visualization and preprocessing. TDA adds a layer of shape analysis that the others fundamentally cannot provide. The most effective pipelines I have seen use UMAP for initial exploration and TDA for feature extraction that feeds into predictive models.
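The "requires vectorization for ML" limitation in the table deserves a concrete illustration. One simple scheme (an assumption for illustration; real pipelines often use persistence images or landscapes instead) is to summarize a diagram by its top-k persistences plus a couple of aggregates, producing a fixed-length vector a classifier can consume:

```python
import numpy as np

# Sketch of one simple vectorization of a persistence diagram into a
# fixed-length feature vector: top-k persistences, total persistence,
# and feature count. `diagram_features` is an illustrative helper.
def diagram_features(diagram, k=3):
    diagram = np.asarray(diagram, dtype=float)
    if len(diagram) == 0:
        return np.zeros(k + 2)
    pers = np.sort(diagram[:, 1] - diagram[:, 0])[::-1]
    top_k = np.zeros(k)
    top_k[:min(k, len(pers))] = pers[:k]
    return np.concatenate([top_k, [pers.sum(), float(len(pers))]])

h1 = np.array([[0.05, 1.40], [0.30, 0.35], [0.60, 0.62]])
features = diagram_features(h1)
print(features)  # [top-3 persistences, total persistence, feature count]
```

Because the vector has fixed length regardless of how many features each patient's diagram contains, it slots directly into a scikit-learn pipeline alongside conventional features.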
Getting Started
The TDA ecosystem in Python has matured significantly. Three libraries cover most use cases: Ripser for fast persistent homology computation, GUDHI for a broader range of topological tools, and scikit-tda for integration with scikit-learn pipelines.
Here is a minimal example computing persistent homology of a point cloud using Ripser:
```python
import numpy as np
from ripser import ripser
from persim import plot_diagrams

# Generate a noisy circle (a dataset with a known topological feature: one loop)
n_points = 200
theta = np.random.uniform(0, 2 * np.pi, n_points)
noise = np.random.normal(0, 0.1, (n_points, 2))
circle = np.column_stack([np.cos(theta), np.sin(theta)]) + noise

# Compute persistent homology up to dimension 1
result = ripser(circle, maxdim=1)

# result['dgms'][0] contains H0 (connected components) birth-death pairs
# result['dgms'][1] contains H1 (loops) birth-death pairs
h0 = result['dgms'][0]
h1 = result['dgms'][1]
print(f"H0 features (connected components): {len(h0)}")
print(f"H1 features (loops): {len(h1)}")

# The longest-lived H1 feature corresponds to the circle
if len(h1) > 0:
    persistence = h1[:, 1] - h1[:, 0]
    most_persistent = h1[np.argmax(persistence)]
    print(f"Most persistent loop: birth={most_persistent[0]:.3f}, "
          f"death={most_persistent[1]:.3f}, "
          f"persistence={np.max(persistence):.3f}")

# Visualize the persistence diagram
plot_diagrams(result['dgms'], show=True)
```

In this example, the point cloud is sampled from a circle with Gaussian noise. Ripser computes the persistence diagram in milliseconds. The diagram will show one prominent point far from the diagonal (the real loop) and many short-lived points near the diagonal (noise artifacts). This separation between signal and noise is exactly what makes persistent homology useful for real data where the ground truth is unknown.
Start with synthetic data where you know the topology. A noisy circle should produce one persistent feature. A noisy figure-eight should produce two. Build intuition with known shapes before applying TDA to complex medical datasets where the topological ground truth is what you are trying to discover.
Conclusion
Topological data analysis is not a replacement for the statistical and machine learning tools we already use. It is a new lens, one that reveals structure invisible to methods that depend on coordinates, distances, and linear projections. For medical data, where the geometry of cell arrangements, molecular surfaces, and brain networks carries diagnostic information, this lens is particularly powerful.
What draws me to TDA is the same thing that draws me to both mathematics and medicine: the conviction that structure matters. A tumor is not just a collection of cells with abnormal gene expression values. It is an organized structure with topological properties (loops of proliferation, voids of necrosis, connected components of invasion) that carry information about prognosis and treatment response. TDA gives us the mathematical language to read that structure directly.
The field is young. The computational tools are maturing. The medical applications are being validated. For anyone who sits at the intersection of mathematics, computer science, and biology, this is one of the most fertile areas of research I know. The shape of data is not a metaphor. It is a measurable, computable, clinically relevant quantity, and we are only beginning to learn how to use it.
Abstract mathematics built to study coffee mugs and donuts is now classifying tumors and mapping brain networks. That is not a coincidence. It is a reminder that the most powerful tools in applied science often come from the most abstract corners of pure mathematics.