Biochemical Pathway Discovery: Cross-Referencing Research Papers With AI

Using AI to accelerate biochemical pathway discovery by cross-referencing thousands of research papers.

Introduction

Biochemical pathway discovery has a scaling problem. The interactions between genes, proteins, and metabolites that govern cellular behavior are documented across millions of research papers, spread across decades of journals, each describing a small fragment of a larger biological network. No single paper contains the full picture. No single researcher can read them all.

In practice, the literature review phase of pathway research can take weeks. A graduate student investigating a novel gene-metabolite interaction will read hundreds of papers, manually trace citations, and maintain spreadsheets of entity relationships. The cognitive overhead is enormous, and the process is fundamentally error-prone because humans miss connections that span disciplinary boundaries.

Aethon is my attempt to fix this. It is an AI system that ingests biomedical research papers, extracts biological entities and their relationships, constructs a knowledge graph, and surfaces pathway connections that would take a human researcher months to find manually. The hard part was not building the NLP pipeline. It was building a system whose outputs biologists actually trust.

Molecular structures form the foundation of biochemical pathway networks, with each interaction representing a potential discovery waiting to be connected across the literature. (Photo on Unsplash)

The Problem With Manual Literature Review

Consider a concrete scenario. A researcher hypothesizes that a specific kinase, AMPK, mediates a previously unknown connection between lipid metabolism and inflammatory signaling in hepatocytes. To validate or refute this hypothesis, they need to answer several questions simultaneously.

1. What substrates does AMPK phosphorylate? This information is scattered across enzymology papers, phosphoproteomics studies, and structural biology publications. No single database is comprehensive.
2. Which of those substrates participate in inflammatory cascades? This requires cross-referencing immunology literature with the enzymology results. The terminology differs between fields, making keyword search unreliable.
3. Has anyone observed this interaction in hepatocyte-specific contexts? Tissue-specific pathway behavior is documented in clinical and translational papers that rarely cite the foundational biochemistry work.
4. What experimental evidence supports or contradicts the hypothesis? Negative results are underreported, so the absence of evidence is not evidence of absence. The researcher needs to infer gaps in the literature, not just find what exists.

This is fundamentally a graph traversal problem disguised as a reading problem. The information exists. It is distributed across a graph of entities and relationships that no human can hold in working memory. Aethon makes this graph explicit and traversable.

How Aethon Works

Aethon operates as a four-stage pipeline. Each stage transforms unstructured biomedical text into structured, queryable knowledge.

Stage 1: Paper Ingestion and Preprocessing

The system ingests papers from PubMed Central, bioRxiv, and user-uploaded PDFs. Raw text is extracted, section-segmented (abstract, methods, results, discussion), and sentence-tokenized. Section segmentation matters because the same entity mention carries different evidential weight depending on where it appears. A gene mentioned in the results section of a controlled experiment is stronger evidence than the same gene mentioned speculatively in a discussion.
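A minimal sketch of the section-aware preprocessing described above, assuming plain text in which each section heading sits on its own line. The heading list, the `SECTION_WEIGHTS` values, and the regex sentence splitter are illustrative assumptions, not Aethon's actual configuration.

```python
import re

# Hypothetical evidential weights per section; the values are illustrative only.
SECTION_WEIGHTS = {"abstract": 0.8, "methods": 0.6, "results": 1.0, "discussion": 0.5}

HEADING_RE = re.compile(r"^(abstract|methods|results|discussion)\s*$", re.IGNORECASE)
# Naive splitter: break after ., ! or ? when followed by whitespace and an uppercase letter.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def segment_sections(text):
    """Group non-empty lines under the most recent recognized section heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING_RE.match(stripped)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        elif current is not None and stripped:
            sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}


def tokenize_with_weights(text):
    """Return (section, sentence, weight) triples for every sentence in the paper."""
    triples = []
    for name, body in segment_sections(text).items():
        for sentence in SENTENCE_RE.split(body):
            if sentence:
                triples.append((name, sentence, SECTION_WEIGHTS[name]))
    return triples
```

A regex splitter like this will break on abbreviations such as "Fig. 2", so a real pipeline would likely use a tokenizer trained on biomedical text. The point of the sketch is only that each sentence carries its section label, and therefore its evidential weight, into the later stages.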

Stage 2: Named Entity Recognition

Biomedical NER is not the same as general-purpose NER. Standard models trained on news corpora fail catastrophically on text like "phosphorylated STAT3 binds the IL-6 promoter region." Aethon uses a fine-tuned PubMedBERT model with entity type heads for genes, proteins, metabolites, diseases, drugs, and biological processes.

python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

class BioEntityExtractor:
    def __init__(self, model_name="microsoft/BiomedNLP-PubMedBERT-base"):
        # The tokenizer comes from the base encoder; the token-classification
        # weights come from the fine-tuned NER checkpoint.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(
            "aethon/pubmedbert-ner-bio",
            num_labels=13,  # B-/I- tags for 6 entity types, plus O
        )
        self.label_map = {
            0: "O", 1: "B-GENE", 2: "I-GENE",
            3: "B-PROTEIN", 4: "I-PROTEIN",
            5: "B-METABOLITE", 6: "I-METABOLITE",
            7: "B-DISEASE", 8: "I-DISEASE",
            9: "B-DRUG", 10: "I-DRUG",
            11: "B-PROCESS", 12: "I-PROCESS",
        }

    def extract(self, text):
        # Inputs longer than 512 subword tokens are truncated, so callers
        # should pass sentences or short passages rather than whole papers.
        inputs = self.tokenizer(
            text, return_tensors="pt",
            truncation=True, max_length=512
        )
        with torch.no_grad():  # inference only, no gradients needed
            logits = self.model(**inputs).logits
        # Most likely label id for each subword token of the single input.
        predictions = torch.argmax(logits, dim=-1)[0]
        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return self._merge_bio_tags(tokens, predictions)

The critical detail is the _merge_bio_tags method, which reconstructs multi-token entity spans from BIO-tagged subwords. Getting this wrong means splitting "tumor necrosis factor alpha" into separate entities, which corrupts every downstream relationship.
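The method body is not shown in this post. As a rough sketch of what such a merge step can look like, the standalone function below assumes WordPiece tokens (continuation pieces prefixed with `##`), takes the label of a word's first piece as the word's label, and joins words with spaces. The function name and those choices are assumptions for illustration, not Aethon's actual code.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_bio_tags(tokens, label_ids, label_map):
    """Rebuild (entity_text, entity_type) spans from WordPiece tokens and BIO label ids."""
    # Step 1: glue "##" continuation pieces back onto their word, keeping
    # only the label predicted for the word's first piece.
    words, labels = [], []
    for token, label_id in zip(tokens, label_ids):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
            labels.append(label_map[int(label_id)])

    # Step 2: walk the word-level BIO labels and collect contiguous spans.
    entities = []
    span_words, span_type = [], None
    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == span_type:
            span_words.append(word)  # continue the open span
            continue
        # Any other label ends the open span, if there is one.
        if span_words:
            entities.append((" ".join(span_words), span_type))
        if prefix in ("B", "I"):  # a stray I- tag is treated as a span start
            span_words, span_type = [word], entity_type
        else:  # "O"
            span_words, span_type = [], None
    if span_words:
        entities.append((" ".join(span_words), span_type))
    return entities


# Subset of the label map shown above, enough for this example.
LABEL_MAP = {0: "O", 3: "B-PROTEIN", 4: "I-PROTEIN"}
tokens = ["[CLS]", "tumor", "necrosis", "factor", "alpha", "activates", "stat", "##3", "[SEP]"]
label_ids = [0, 3, 4, 4, 4, 0, 3, 4, 0]
entities = merge_bio_tags(tokens, label_ids, LABEL_MAP)
```

Joining on spaces is lossy for names like IL-6, where the tokenizer emits the hyphen as its own token. Mapping spans back to character offsets in the original sentence avoids that, at the cost of a little more bookkeeping.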

Stage 3: Relation Extraction

Entity extraction alone produces a list of biological actors. Relation extraction identifies how they interact. Aethon classifies sentence-level entity pairs into relation types: phosphorylates, inhibits, activates, binds, upregulates, downregulates, and co-expressed_with.

The model is a relation classification head on top of the same PubMedBERT encoder, taking the concatenated entity representations as input. In practice, the precision-recall tradeoff here is critical. A false positive relation pollutes the knowledge graph with phantom edges. Aethon operates at a high-precision, moderate-recall operating point, preferring to miss real relations rather than hallucinate fake ones.
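One simple way to get a high-precision operating point is to reject any prediction whose top probability falls below a threshold. The sketch below illustrates that idea only; the explicit `no_relation` class, the class ordering, and the 0.9 threshold are assumptions, not Aethon's published settings.

```python
import math

# Hypothetical class order: a "no relation" class plus the seven relation types.
RELATION_TYPES = [
    "no_relation", "phosphorylates", "inhibits", "activates", "binds",
    "upregulates", "downregulates", "co-expressed_with",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(logits, threshold=0.9):
    """Return (relation, confidence), or None when the prediction is not confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if RELATION_TYPES[best] == "no_relation" or probs[best] < threshold:
        return None  # prefer a missed relation over a phantom edge
    return RELATION_TYPES[best], probs[best]
```

Raising the threshold trades recall for precision. In practice the value would be tuned on a held-out set of annotated sentences rather than fixed by hand.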

Stage 4: Knowledge Graph Construction

Extracted entities become nodes. Extracted relations become edges. Each edge carries metadata: the source paper, the section it was extracted from, the confidence score, and the relation type. Edge weights are computed as a function of evidence strength, which combines extraction confidence, publication recency, journal impact, and the number of independent papers supporting the same relation.
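The exact weighting function is not given here. One plausible shape, sketched under stated assumptions, scores each supporting paper from its extraction confidence, an exponential recency decay, and a normalized journal impact, then combines papers with a noisy-OR so that independent support raises the weight with diminishing returns. The `Evidence` fields, the ten-year half-life, and the noisy-OR combination are all illustrative choices.

```python
import math
from dataclasses import dataclass


@dataclass
class Evidence:
    confidence: float      # extraction confidence in [0, 1]
    year: int              # publication year
    journal_impact: float  # assumed to be normalized to [0, 1] by the caller


def edge_weight(evidence, current_year, half_life_years=10.0):
    """Combine per-paper evidence for one relation into a weight in [0, 1]."""
    if not evidence:
        return 0.0
    per_paper = []
    for item in evidence:
        age = max(0, current_year - item.year)
        recency = 0.5 ** (age / half_life_years)  # halves every half_life_years
        # Journal impact scales the score between 50% and 100% of its value.
        impact = 0.5 + 0.5 * item.journal_impact
        per_paper.append(item.confidence * recency * impact)
    # Noisy-OR: probability that at least one paper's evidence holds.
    return 1.0 - math.prod(1.0 - score for score in per_paper)
```

With these assumptions, a second independent paper of equal quality moves a 0.9 edge to 0.99, while the same single paper published twenty years earlier scores 0.225.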

Key takeaway

The knowledge graph is not a static artifact. Every new paper ingested can add nodes, add edges, or increase the weight of existing edges. The graph is a living representation of the field's collective knowledge.

The Knowledge Graph Architecture

The graph itself is stored in Neo4j with a property graph model. Nodes have types (gene, protein, metabolite, disease, drug, process) and properties (canonical name, aliases, organism, UniProt ID where applicable). Edges have types (the seven relation categories) and properties (weight, source papers, extraction confidence).

The choice of a property graph over a simple triple store was deliberate. Biomedical relationships are not binary. The statement "AMPK phosphorylates ACC" is incomplete without knowing the organism, tissue context, experimental conditions, and confidence level. Property graphs encode this naturally.
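To make the difference concrete, here is a sketch of how one context-qualified edge might be written to Neo4j. It only builds the Cypher text and parameter dictionary, so it runs without a database. The `Entity` label, the property names, and the helper name are hypothetical, not Aethon's schema.

```python
# Hypothetical relationship types, one per relation category.
ALLOWED_RELATIONS = {
    "PHOSPHORYLATES", "INHIBITS", "ACTIVATES", "BINDS",
    "UPREGULATES", "DOWNREGULATES", "CO_EXPRESSED_WITH",
}


def build_edge_upsert(source, target, relation, organism, tissue, weight, papers):
    """Return (cypher, params) that upserts one context-qualified edge."""
    # The relationship type is interpolated into the query text rather than
    # passed as a $parameter, so it is checked against a fixed allow-list first.
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unknown relation type: {relation}")
    cypher = (
        "MERGE (a:Entity {name: $source}) "
        "MERGE (b:Entity {name: $target}) "
        f"MERGE (a)-[r:{relation} {{organism: $organism, tissue: $tissue}}]->(b) "
        "SET r.weight = $weight, r.source_papers = $papers"
    )
    params = {
        "source": source, "target": target,
        "organism": organism, "tissue": tissue,
        "weight": weight, "papers": papers,
    }
    return cypher, params


cypher, params = build_edge_upsert(
    "AMPK", "ACC", "PHOSPHORYLATES",
    organism="Homo sapiens", tissue="hepatocyte",
    weight=0.9, papers=["paper-1"],
)
```

Because organism and tissue are part of the `MERGE` pattern, the same two entities can hold separate edges for separate biological contexts. That is the kind of qualification a bare subject-predicate-object triple cannot carry without extra modeling.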

Pathway discovery is then framed as graph queries. To find novel connections between two entities, Aethon computes shortest paths, identifies high-betweenness intermediary nodes, and ranks candidate pathways by cumulative edge weight. A pathway supported by ten high-confidence edges from recent papers ranks higher than one supported by two low-confidence edges from 2003.

python
# Simplified pathway discovery query
from statistics import geometric_mean

def discover_pathways(graph, source, target, max_hops=4):
    # `graph.find_all_paths` and `compute_novelty_score` are not defined
    # in this snippet.
    paths = graph.find_all_paths(
        start=source, end=target,
        max_depth=max_hops,
        relationship_types=["activates", "inhibits", "phosphorylates", "binds"]
    )
    scored = []
    for path in paths:
        edge_weights = [e["weight"] for e in path.edges]
        novelty = compute_novelty_score(path)
        # A geometric mean penalizes a single weak edge more than an
        # arithmetic mean would.
        score = geometric_mean(edge_weights) * novelty
        scored.append((path, score))
    # Highest-scoring pathways first.
    return sorted(scored, key=lambda x: x[1], reverse=True)
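The `graph.find_all_paths` call in the query above is not shown in this post. As a rough stand-in, bounded path enumeration over a plain adjacency dictionary can be sketched as follows. The dictionary representation and the function signature are assumptions. A production system would more likely push this traversal down into the graph database.

```python
def find_all_paths(adjacency, start, end, max_depth=4):
    """Enumerate simple paths from start to end that use at most max_depth edges."""
    paths = []
    stack = [(start, [start])]  # iterative depth-first search
    while stack:
        node, path = stack.pop()
        if node == end and len(path) > 1:
            paths.append(path)
            continue
        if len(path) - 1 >= max_depth:
            continue  # hop budget exhausted
        for neighbor in adjacency.get(node, ()):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                stack.append((neighbor, path + [neighbor]))
    return paths
```

The hop limit matters because the number of simple paths grows very quickly with depth in a densely connected graph.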

The novelty score is the key innovation. It measures how unexpected a pathway is by comparing it against known pathways in KEGG and Reactome. A high-confidence pathway that already exists in canonical databases is useful for validation but not for discovery. Aethon prioritizes paths that are well-supported but previously uncharacterized.
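How the comparison against KEGG and Reactome is computed is not spelled out here. One simple reading, sketched below, treats novelty as the fraction of a path's edges that are absent from a set of canonical edges. The `(source, relation, target)` tuple format, the `known_edges` set standing in for a database export, and both function names are assumptions for illustration.

```python
from statistics import geometric_mean


def novelty_score(path_edges, known_edges):
    """Fraction of a path's edges that are absent from the canonical edge set."""
    if not path_edges:
        return 0.0
    unseen = sum(1 for edge in path_edges if edge not in known_edges)
    return unseen / len(path_edges)


def score_path(path_edges, weights, known_edges):
    """Evidence strength (geometric mean of edge weights) scaled by novelty."""
    return geometric_mean(weights) * novelty_score(path_edges, known_edges)


# Placeholder entities A, B, C; only the first edge is already "known".
known = {("A", "phosphorylates", "B")}
path = [("A", "phosphorylates", "B"), ("B", "inhibits", "C")]
```

Under this definition a path made entirely of canonical edges scores zero, which matches the goal of ranking well-supported but uncharacterized paths first. A softer variant could give known edges a small nonzero credit instead.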

Computational pathway predictions ultimately require wet-lab validation. The knowledge graph narrows the search space, but the bench work confirms the biology. (Photo on Unsplash)

What Aethon Has Surfaced

During development and testing against curated pathway databases, Aethon identified several categories of findings that demonstrate the system's practical value.

1. Cross-pathway bridging nodes. Entities that appear in multiple well-known pathways but whose role as connectors between those pathways has not been explicitly characterized. These are the most actionable outputs because they suggest specific experiments.
2. Underexplored metabolite-gene interactions. The literature has a strong bias toward protein-protein interactions. Metabolite-gene regulatory relationships are documented but rarely synthesized into pathway models. Aethon's entity type diversity captures these systematically.
3. Temporal evidence patterns. By tracking publication dates, Aethon identifies relationships where supporting evidence has been accumulating incrementally across papers without anyone synthesizing the pattern explicitly. These are hypotheses hiding in plain sight.

Why this matters

Drug discovery depends on understanding pathway biology. Every novel pathway connection Aethon surfaces is a potential therapeutic target that would otherwise require months of manual literature synthesis to identify.

Implications for Drug Discovery

The pharmaceutical industry spends billions on target identification, the process of finding biological entities whose modulation could treat a disease. This process is bottlenecked by exactly the literature synthesis problem Aethon addresses. If a kinase sits at the intersection of two disease-relevant pathways, it is a candidate for dual-indication drug development. But finding that intersection requires reading papers from both fields, which rarely happens because researchers specialize.

Aethon does not replace the biologist. It replaces the months of literature review that precede hypothesis formation. The biologist still designs the experiment, interprets the results, and validates the pathway in wet-lab conditions. But they start from a position of comprehensive rather than partial knowledge of what the literature already contains.

The system currently indexes over 50,000 papers and maintains a graph with approximately 180,000 entity nodes and 420,000 relationship edges. Processing a new paper takes under 30 seconds. Running a pathway discovery query between two arbitrary entities returns results in under two seconds.

Conclusion

Biochemical pathway discovery is a graph problem masked by the linear format of research papers. Aethon converts scattered textual evidence into a structured, weighted, queryable knowledge graph where novel pathway connections emerge from the intersection of thousands of independent observations.

The technical components (biomedical NER, relation extraction, and evidence-weighted graph construction) are individually well understood. The contribution is in combining them into a system that operates at literature scale and produces outputs that biologists find trustworthy and actionable. The hard part was never the NLP. It was building something that a researcher would actually use instead of reading one more paper.

Get involved

If you're working on computational biology, knowledge graphs, or biomedical NLP, I'd love to hear from you. Reach out to discuss Aethon or related research.
