
Building Medical AI: A Complete Guide To Pathology Slide Analysis

The architectures, datasets, evaluation methods, and deployment challenges every team building medical imaging AI needs to understand.

Introduction

Medical imaging AI sits at one of the most promising intersections of deep learning and clinical practice. The core task is deceptively simple: given a digitized pathology slide, identify and classify regions of diagnostic interest. The execution is not simple at all.

A single whole-slide image can exceed 100,000 x 100,000 pixels. Standard convolutional networks cannot process inputs at this scale. The field has converged on a set of architectural patterns to handle this, and understanding them is prerequisite to building anything that works in production.

[Image: Microscopy at the cellular level — medical AI processes thousands of tiles like this from a single slide. Photo by CDC on Unsplash]

How Medical AI Analyzes Pathology Slides

Understanding how medical AI works begins with understanding the data it processes. A digitized pathology slide is not a photograph. It is a pyramidal image, stored at multiple resolution levels, with the highest resolution capturing cellular-level detail across an area that may span several square centimeters of tissue.

The model never sees the full slide at once. Instead, the slide is divided into overlapping tiles, typically 256x256 or 512x512 pixels at 20x or 40x magnification. Each tile is processed independently through a feature extractor, producing a fixed-length embedding vector. The collection of embeddings is then aggregated using an attention mechanism to produce a slide-level prediction.

This mechanism explains two things that otherwise feel arbitrary.

1. It explains why tile-level labels are not required for training. The attention mechanism learns which tiles are diagnostically relevant. This is critical because pathologists annotate at the slide level, not the tile level.
2. It explains why two slides that look similar to a human observer can produce meaningfully different model outputs. The model attends to spatial patterns in the embedding space that may not correspond to visually obvious features. A complete breakdown of attention-based MIL architectures covers the full mechanism in depth.
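The aggregation step can be made concrete with a small sketch. Assuming tile embeddings have already been computed by the feature extractor, attention pooling scores each tile, softmaxes the scores, and takes a weighted sum. The weight matrices below are random stand-ins for parameters that would be learned in practice:

```python
import numpy as np

def attention_pool(embeddings, V, w):
    """Attention-based MIL pooling over tile embeddings.

    embeddings: (n_tiles, d) array of tile features.
    V: (d, h) projection and w: (h,) scoring vector, learned in practice.
    Returns the slide-level embedding and per-tile attention weights.
    """
    scores = np.tanh(embeddings @ V) @ w          # one score per tile
    scores = scores - scores.max()                # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over tiles
    slide_embedding = attn @ embeddings           # weighted sum, shape (d,)
    return slide_embedding, attn

# Toy usage: 100 tiles with 16-dim embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
V = rng.normal(size=(16, 8))
w = rng.normal(size=8)
slide_vec, attn = attention_pool(emb, V, w)
```

The attention weights `attn` are exactly what lets you inspect which tiles drove a prediction, which is why this family of models admits some post-hoc interpretability.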

The Core Pipeline Architecture

With the mechanism in place, the pipeline taxonomy becomes intelligible. Each stage is a structured transformation of the raw slide data into a clinically actionable prediction.

1. Tissue detection and segmentation. Before any analysis begins, background regions must be separated from tissue. This is typically handled by color-space thresholding in HSV or Otsu binarization, though learned segmentation models perform better on staining artifacts.
2. Tile extraction and normalization. Tissue regions are divided into tiles at the target magnification. Stain normalization (Macenko or Vahadane methods) standardizes color distributions across slides from different scanners and labs.
3. Feature extraction. A pretrained encoder (ResNet-50, ViT, or increasingly, pathology-specific foundation models like UNI or CONCH) maps each tile to a dense embedding vector.
4. Aggregation and classification. An attention-based pooling layer combines tile embeddings into a single slide-level representation, which is fed to a classifier for the target task (cancer detection, grading, molecular subtyping).

Approaches to Medical Image Analysis

Medical AI vs. Alternatives: The Diagnostic Guide

| Approach | Root cause it solves | When to use it | The signal that it's time |
|---|---|---|---|
| Multiple Instance Learning | Slide-level labels only, no tile annotations available | Always start here for whole-slide classification | You have slide diagnoses but no pixel-level masks |
| Supervised segmentation | Need precise spatial localization of pathology | After pathologist-annotated masks are available | Clinical workflow requires tumor boundary delineation |
| Foundation model fine-tuning | Limited labeled data for rare pathologies | When pretrained pathology encoders are available | Small dataset but high performance requirements |
| Graph neural networks | Spatial relationships between cells matter for diagnosis | When tissue microenvironment drives the prediction | Tile-level features alone miss inter-cellular context |

In practice, the tile extraction step is straightforward to implement. Here's a simplified version of the core loop:

```python
import openslide
import numpy as np

def extract_tiles(slide_path, tile_size=256, level=0):
    slide = openslide.OpenSlide(slide_path)
    w, h = slide.level_dimensions[level]
    # read_region expects the location in level-0 coordinates,
    # so scale the tile grid by this level's downsample factor.
    scale = slide.level_downsamples[level]

    tiles = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            region = slide.read_region(
                (int(x * scale), int(y * scale)), level, (tile_size, tile_size)
            )
            tile_np = np.array(region.convert('RGB'))
            # Skip background tiles: near-white glass has a high mean intensity
            if tile_np.mean() < 220:
                tiles.append((x, y, tile_np))

    slide.close()
    return tiles
```
[Image: The wet lab side of computational pathology — digitized slides begin here before any model touches them. Photo by National Cancer Institute on Unsplash]

Foundation model fine-tuning vs. training from scratch covers the decision in full. The core rule is that you should not train from scratch until you have thoroughly evaluated pretrained encoders. Training on a dataset you haven't benchmarked against existing models rarely yields the expected result, and it makes the underlying failure harder to see.
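A cheap way to evaluate a pretrained encoder before committing to training is a linear probe: freeze the encoder, precompute slide embeddings, and fit a small classifier on top. A minimal numpy sketch, with random clusters standing in for real encoder outputs:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit logistic regression on frozen embeddings by gradient descent.
    X: (n, d) slide embeddings, y: (n,) binary labels."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        w -= lr * (X.T @ (p - y) / n)           # gradient of log loss
        b -= lr * (p - y).mean()
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Toy usage: two separable clusters standing in for "tumor" vs. "normal"
# slide embeddings produced by a frozen pretrained encoder.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_linear_probe(X, y)
acc = (predict(X, w, b) == y).mean()
```

If a probe like this on frozen embeddings already approaches your target performance, full fine-tuning or from-scratch training has little headroom to justify its cost.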

Key Technical Challenges

Medical AI has a defined set of failure modes. Three technical challenges — distribution shift, label noise, and computational scale — account for the majority of deployment failures.

Key takeaway

Distribution shift is the silent killer. A model trained at one hospital can drop 15-20% in accuracy when deployed at another — even on the same cancer type — purely due to differences in scanners and staining protocols.

Distribution shift occurs when a model trained on slides from one institution encounters slides from another. Different scanners, staining protocols, and tissue preparation methods produce systematic visual differences that can degrade model performance dramatically. Stain normalization reduces but does not eliminate this problem.
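For intuition about what stain normalization does, here is a deliberately crude baseline that matches per-channel statistics in optical density space. It is a Reinhard-flavored sketch, not the Macenko or Vahadane method, but it shows the shape of the operation:

```python
import numpy as np

def to_od(rgb):
    """Convert uint8 RGB to optical density, where stain absorbs light."""
    return -np.log((rgb.astype(float) + 1.0) / 256.0)

def from_od(od):
    rgb = 256.0 * np.exp(-od) - 1.0
    return np.clip(rgb, 0, 255).astype(np.uint8)

def match_stain_stats(source, target):
    """Shift the source tile's per-channel OD mean/std to the target's.
    A crude normalization baseline, not Macenko or Vahadane."""
    od_src, od_tgt = to_od(source), to_od(target)
    mu_s = od_src.mean(axis=(0, 1))
    sd_s = od_src.std(axis=(0, 1)) + 1e-8
    mu_t = od_tgt.mean(axis=(0, 1))
    sd_t = od_tgt.std(axis=(0, 1))
    od_norm = (od_src - mu_s) / sd_s * sd_t + mu_t
    return from_od(od_norm)

# Toy usage: a light source tile pulled toward a darker target's statistics
rng = np.random.default_rng(1)
src = rng.integers(150, 250, (32, 32, 3), dtype=np.uint8)
tgt = rng.integers(80, 180, (32, 32, 3), dtype=np.uint8)
out = match_stain_stats(src, tgt)
```

The published methods go further by estimating the stain vectors themselves, which is why they hold up better across scanners than pure statistics matching.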

Label noise is endemic to pathology datasets. Inter-observer agreement among pathologists for certain diagnoses (particularly grading) can be surprisingly low. Training with noisy labels requires either robust loss functions or consensus-based label refinement, both of which add complexity to the training pipeline.
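As one example of a robust loss, generalized cross-entropy (GCE) interpolates between cross-entropy (as q approaches 0) and mean absolute error (q = 1), which bounds the penalty a confidently mislabeled example can contribute. A numpy sketch:

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized cross-entropy: mean of (1 - p_y^q) / q.
    q -> 0 recovers cross-entropy; q = 1 gives MAE, which is
    more tolerant of mislabeled examples.
    probs: (n, k) predicted class probabilities, labels: (n,) ints."""
    p_y = probs[np.arange(len(labels)), labels]
    return ((1.0 - p_y ** q) / q).mean()

# Toy comparison: one confidently-wrong example (likely label noise)
probs = np.array([
    [0.95, 0.05],   # confident, label 0 agrees
    [0.05, 0.95],   # confident, label 1 agrees
    [0.99, 0.01],   # model sure it's class 0, label says 1
])
labels = np.array([0, 1, 1])
ce = -np.log(probs[np.arange(3), labels]).mean()   # standard cross-entropy
gce = gce_loss(probs, labels)
```

On this toy batch the suspect third example contributes -log(0.01), about 4.6, to standard cross-entropy and dominates the average, while GCE caps any single example's loss below 1/q.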

[Image: Clinical validation requires pathologist review — the human side of the evaluation pipeline that no model can skip. Photo by National Cancer Institute on Unsplash]

What Changed in 2025-2026

The most significant shift in computational pathology over the past year has been the emergence of pathology foundation models. Models like UNI, CONCH, and Virchow were trained on hundreds of thousands to millions of pathology slides using self-supervised learning. They produce tile embeddings that transfer effectively to downstream tasks with minimal fine-tuning.

Why this matters

Foundation models have fundamentally changed the economics: you need hundreds of labeled slides instead of thousands, making clinical AI feasible for smaller research groups and rare pathologies.

This changes the economics of medical AI development. Previously, building a competitive pathology model required access to large, well-annotated datasets and significant compute for training. Foundation models reduce the data requirement to hundreds rather than thousands of labeled slides, making it feasible for smaller research groups to develop clinically useful models.

Conclusion

Building medical AI for pathology is a systems problem, not just a modeling problem. The pipeline from raw whole-slide images to clinically validated predictions involves tissue detection, tile extraction, feature encoding, attention-based aggregation, and careful evaluation against distribution shift and label noise.

The arrival of foundation models has simplified the feature extraction stage considerably, but the surrounding infrastructure — the data pipelines, evaluation frameworks, and deployment considerations — remains the hard part. The teams that get this right will define the next generation of diagnostic tools.

Get involved

This is an active area of my research. If you're working on medical imaging AI or want to collaborate, feel free to reach out.
