Introduction
Biochemical pathway discovery has a scaling problem. The interactions between genes, proteins, and metabolites that govern cellular behavior are documented across millions of research papers, spread across decades of journals, each describing a small fragment of a larger biological network. No single paper contains the full picture. No single researcher can read them all.
In practice, the literature review phase of pathway research can take weeks or months. A graduate student investigating a novel gene-metabolite interaction will read hundreds of papers, manually trace citations, and maintain spreadsheets of entity relationships. The cognitive overhead is enormous, and the process is fundamentally error-prone because humans miss connections that span disciplinary boundaries.
Aethon is my attempt to fix this. It is an AI system that ingests biomedical research papers, extracts biological entities and their relationships, constructs a knowledge graph, and surfaces pathway connections that would take a human researcher months to find manually. The hard part was not building the NLP pipeline. It was building a system whose outputs biologists actually trust.

The Problem With Manual Literature Review
Consider a concrete scenario. A researcher hypothesizes that a specific kinase, AMPK, mediates a previously unknown connection between lipid metabolism and inflammatory signaling in hepatocytes. To validate or refute this hypothesis, they need to answer several questions simultaneously.
This is fundamentally a graph traversal problem disguised as a reading problem. The information exists. It is distributed across a graph of entities and relationships that no human can hold in working memory. Aethon makes this graph explicit and traversable.
How Aethon Works
Aethon operates as a four-stage pipeline. Together, the stages transform unstructured biomedical text into structured, queryable knowledge.
Stage 1: Paper Ingestion and Preprocessing
The system ingests papers from PubMed Central, bioRxiv, and user-uploaded PDFs. Raw text is extracted, section-segmented (abstract, methods, results, discussion), and sentence-tokenized. Section segmentation matters because the same entity mention carries different evidential weight depending on where it appears. A gene mentioned in the results section of a controlled experiment is stronger evidence than the same gene mentioned speculatively in a discussion.
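As a rough illustration of the preprocessing step, here is a minimal sketch that segments text by section heading, splits sentences, and attaches a per-section weight. The heading regex, the naive sentence splitter, and the `SECTION_WEIGHTS` values are all assumptions made for this example, not Aethon's actual rules.

```python
import re

# Hypothetical evidential weights per section (illustrative values only).
SECTION_WEIGHTS = {"abstract": 0.8, "methods": 0.6, "results": 1.0, "discussion": 0.5}

# A heading is assumed to be a line containing only the section name.
HEADING_RE = re.compile(
    r"^(abstract|methods|results|discussion)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def segment_sections(text):
    """Split raw paper text into {section_name: body} using heading lines."""
    sections = {}
    matches = list(HEADING_RE.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections


def split_sentences(body):
    """Naive splitter: break after . ! or ? when a capital letter or digit follows."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", body)
    return [p.strip() for p in parts if p.strip()]


def tag_sentences(text):
    """Return one record per sentence, labeled with its section and weight."""
    records = []
    for name, body in segment_sections(text).items():
        for sentence in split_sentences(body):
            records.append(
                {"section": name, "weight": SECTION_WEIGHTS[name], "sentence": sentence}
            )
    return records
```

A real pipeline would need a sentence splitter that copes with abbreviations and species names, which this regex does not.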
Stage 2: Named Entity Recognition
Biomedical NER is not the same as general-purpose NER. Standard models trained on news corpora fail catastrophically on text like "phosphorylated STAT3 binds the IL-6 promoter region." Aethon uses a fine-tuned PubMedBERT model with a token-classification head that tags six entity types: genes, proteins, metabolites, diseases, drugs, and biological processes.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch


class BioEntityExtractor:
    def __init__(
        self,
        # Full Hugging Face Hub ID of the base PubMedBERT checkpoint
        model_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    ):
        # The tokenizer comes from the base checkpoint.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # The weights come from the fine-tuned NER checkpoint.
        self.model = AutoModelForTokenClassification.from_pretrained(
            "aethon/pubmedbert-ner-bio",
            num_labels=13,  # BIO tags for 6 entity types + O
        )
        self.label_map = {
            0: "O", 1: "B-GENE", 2: "I-GENE",
            3: "B-PROTEIN", 4: "I-PROTEIN",
            5: "B-METABOLITE", 6: "I-METABOLITE",
            7: "B-DISEASE", 8: "I-DISEASE",
            9: "B-DRUG", 10: "I-DRUG",
            11: "B-PROCESS", 12: "I-PROCESS",
        }

    def extract(self, text):
        # Inputs longer than 512 subword tokens are truncated.
        inputs = self.tokenizer(
            text, return_tensors="pt",
            truncation=True, max_length=512
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # One predicted label ID per subword token (batch of size 1).
        predictions = torch.argmax(logits, dim=-1)[0]
        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        # _merge_bio_tags (not shown) rebuilds entity spans from subword tags.
        return self._merge_bio_tags(tokens, predictions)
```

The critical detail is the _merge_bio_tags method, which reconstructs multi-token entity spans from BIO-tagged subwords. Getting this wrong means splitting "tumor necrosis factor alpha" into separate entities, which corrupts every downstream relationship.
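The post does not show _merge_bio_tags itself, so the following is only a sketch of how such a merge could work, written as a standalone function. It assumes WordPiece-style tokens (continuations prefixed with `##`), label strings rather than label IDs (the IDs would first be mapped through `label_map`), and one particular policy for stray `I-` tags. The name `merge_bio_tags` and those choices are assumptions.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_bio_tags(tokens, labels):
    """Merge BIO-tagged WordPiece tokens into (entity_text, entity_type) spans."""
    entities = []
    current = None  # open span: {"type": str, "words": [str, ...]}

    def flush():
        nonlocal current
        if current is not None:
            entities.append((" ".join(current["words"]), current["type"]))
            current = None

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            flush()
            continue
        if token.startswith("##"):
            # Subword continuation: glue onto the previous word and ignore
            # the continuation's own label, so a word is never split.
            if current is not None:
                current["words"][-1] += token[2:]
            continue
        if label.startswith("B-"):
            flush()
            current = {"type": label[2:], "words": [token]}
        elif (
            label.startswith("I-")
            and current is not None
            and current["type"] == label[2:]
        ):
            current["words"].append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()

    flush()
    return entities
```

Dropping a stray `I-` tag (rather than opening a new span from it) favors precision; the opposite policy is equally defensible if recall matters more.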
Stage 3: Relation Extraction
Entity extraction alone produces a list of biological actors. Relation extraction identifies how they interact. Aethon classifies sentence-level entity pairs into relation types: phosphorylates, inhibits, activates, binds, upregulates, downregulates, and co-expressed_with.
The model is a relation classification head on top of the same PubMedBERT encoder, taking the concatenated entity representations as input. In practice, the precision-recall tradeoff here is critical. A false positive relation pollutes the knowledge graph with phantom edges. Aethon operates at a high-precision, moderate-recall operating point, preferring to miss real relations rather than hallucinate fake ones.
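One common way to pick a high-precision operating point is to sweep a confidence threshold over held-out validation predictions. The sketch below assumes each validation prediction has a confidence score and a correct/incorrect label; the function name and data layout are hypothetical, and the post does not say how Aethon actually sets its threshold.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Return the lowest score threshold whose precision meets the target.

    Precision at threshold t is computed over all predictions with
    score >= t. Choosing the lowest qualifying threshold keeps as much
    recall as possible. Returns None if no threshold reaches the target.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume all predictions tied at this score together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        if true_pos / (true_pos + false_pos) >= target_precision:
            best = score
    return best


def accept_relations(predictions, threshold):
    """Keep only (relation, score) predictions at or above the threshold."""
    return [(rel, score) for rel, score in predictions if score >= threshold]
```

Raising the target precision moves the threshold up and discards more true relations, which is the tradeoff described above.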
Stage 4: Knowledge Graph Construction
Extracted entities become nodes. Extracted relations become edges. Each edge carries metadata: the source paper, the section it was extracted from, the confidence score, and the relation type. Edge weights are computed as a function of evidence strength, which combines extraction confidence, publication recency, journal impact, and the number of independent papers supporting the same relation.
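The post names the inputs to the edge weight but not the formula, so here is one plausible sketch. It assumes each supporting paper contributes a score (confidence, discounted by age and a normalized journal factor) and that papers are combined with a noisy-OR, so each independent paper pushes the weight closer to 1. The `Evidence` fields, the ten-year half-life, and the combination rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    confidence: float      # extraction confidence in [0, 1]
    year: int              # publication year
    journal_factor: float  # hypothetical normalized journal score in [0, 1]


def evidence_score(ev, current_year, half_life_years=10.0):
    """Score one paper's support for a relation, in [0, 1]."""
    age = max(0, current_year - ev.year)
    recency = 0.5 ** (age / half_life_years)  # halves every half_life_years
    # A low journal score dampens the evidence but never zeroes it out.
    journal = 0.5 + 0.5 * ev.journal_factor
    return ev.confidence * recency * journal


def edge_weight(evidence, current_year, half_life_years=10.0):
    """Noisy-OR over papers: 1 minus the chance that every paper is wrong."""
    miss = 1.0
    for ev in evidence:
        miss *= 1.0 - evidence_score(ev, current_year, half_life_years)
    return 1.0 - miss
```

With this rule, two recent papers at confidence 0.9 give a weight near 0.99, while a single twenty-year-old paper at the same confidence gives about 0.23.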
The knowledge graph is not a static artifact. Every new paper ingested can add nodes, add edges, or increase the weight of existing edges. The graph is a living representation of the field's collective knowledge.
The Knowledge Graph Architecture
The graph itself is stored in Neo4j with a property graph model. Nodes have types (gene, protein, metabolite, disease, drug, process) and properties (canonical name, aliases, organism, UniProt ID where applicable). Edges have types (the seven relation categories) and properties (weight, source papers, extraction confidence).
The choice of a property graph over a simple triple store was deliberate. Biomedical relationships are not binary. The statement "AMPK phosphorylates ACC" is incomplete without knowing the organism, tissue context, experimental conditions, and confidence level. Property graphs encode this naturally.
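To make the contrast with a bare triple concrete, here is a small in-memory sketch of an edge that carries its context as properties, plus a filter over that context. The `RelationEdge` fields and the `edges_in_context` helper are illustrative assumptions, not Aethon's Neo4j schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RelationEdge:
    # The bare triple ...
    source: str
    relation: str
    target: str
    # ... plus the properties that make it interpretable.
    weight: float
    confidence: float
    organism: Optional[str] = None
    tissue: Optional[str] = None
    source_papers: List[str] = field(default_factory=list)


def edges_in_context(edges, organism=None, tissue=None):
    """Keep edges whose context matches every filter that was supplied."""
    return [
        edge for edge in edges
        if (organism is None or edge.organism == organism)
        and (tissue is None or edge.tissue == tissue)
    ]
```

A triple store would need reification or extra nodes to express the same organism and tissue qualifiers, which is the overhead the property graph avoids.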
Pathway discovery is then framed as a set of graph queries. To find novel connections between two entities, Aethon computes shortest paths, identifies high-betweenness intermediary nodes, and ranks candidate pathways by aggregate edge weight (the geometric mean, in the simplified query below). A pathway supported by ten high-confidence edges from recent papers ranks higher than one supported by two low-confidence edges from 2003.
```python
from statistics import geometric_mean  # standard library, Python 3.8+


# Simplified pathway discovery query.
# graph.find_all_paths and compute_novelty_score are Aethon helpers not shown here.
def discover_pathways(graph, source, target, max_hops=4):
    paths = graph.find_all_paths(
        start=source, end=target,
        max_depth=max_hops,
        relationship_types=["activates", "inhibits", "phosphorylates", "binds"]
    )
    scored = []
    for path in paths:
        edge_weights = [e["weight"] for e in path.edges]
        novelty = compute_novelty_score(path)
        # Geometric mean penalizes a path with any single weak edge.
        score = geometric_mean(edge_weights) * novelty
        scored.append((path, score))
    # Highest-scoring pathways first.
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

The novelty score is the key innovation. It measures how unexpected a pathway is by comparing it against known pathways in KEGG and Reactome. A high-confidence pathway that already exists in canonical databases is useful for validation but not for discovery. Aethon prioritizes paths that are well-supported but previously uncharacterized.
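For readers who want something runnable without a graph database, here is a self-contained sketch of the same idea over a plain adjacency dictionary. The bounded depth-first enumeration stands in for the graph's path query, and `novelty_score` is a deliberately simple stand-in: the fraction of a path's edges that are absent from a set of known edges. How Aethon actually compares paths against KEGG and Reactome is not described in this post, so treat every name and formula here as an assumption.

```python
from statistics import geometric_mean


def find_all_paths(adj, source, target, max_hops=4):
    """Enumerate simple paths (no repeated node) with at most max_hops edges.

    adj maps a node to a list of (relation, neighbor, weight) tuples.
    Each path is a list of (source, relation, target, weight) edges.
    """
    paths = []

    def dfs(node, visited, edges):
        if node == target and edges:
            paths.append(list(edges))
            return
        if len(edges) == max_hops:
            return
        for relation, neighbor, weight in adj.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                edges.append((node, relation, neighbor, weight))
                dfs(neighbor, visited, edges)
                edges.pop()
                visited.remove(neighbor)

    dfs(source, {source}, [])
    return paths


def novelty_score(path, known_edges):
    """Fraction of edges in the path not found in the reference set."""
    unseen = sum(1 for (src, _rel, dst, _w) in path if (src, dst) not in known_edges)
    return unseen / len(path)


def rank_paths(adj, source, target, known_edges, max_hops=4):
    """Score each path by geometric-mean weight times novelty, best first."""
    scored = []
    for path in find_all_paths(adj, source, target, max_hops):
        weights = [w for (_s, _r, _t, w) in path]  # weights must be positive
        score = geometric_mean(weights) * novelty_score(path, known_edges)
        scored.append((path, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Under this stand-in, a path made entirely of known edges scores zero, which matches the stated goal of ranking well-supported but uncharacterized paths first.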

What Aethon Has Surfaced
During development and testing against curated pathway databases, Aethon identified several categories of findings that demonstrate the system's practical value.
Drug discovery depends on understanding pathway biology. Every novel pathway connection Aethon surfaces is a potential therapeutic target that would otherwise require months of manual literature synthesis to identify.
Implications for Drug Discovery
The pharmaceutical industry spends billions on target identification, the process of finding biological entities whose modulation could treat a disease. This process is bottlenecked by exactly the literature synthesis problem Aethon addresses. If a kinase sits at the intersection of two disease-relevant pathways, it is a candidate for dual-indication drug development. But finding that intersection requires reading papers from both fields, which rarely happens because researchers specialize.
Aethon does not replace the biologist. It replaces the months of literature review that precede hypothesis formation. The biologist still designs the experiment, interprets the results, and validates the pathway in wet-lab conditions. But they start from a position of comprehensive rather than partial knowledge of what the literature already contains.
The system currently indexes over 50,000 papers and maintains a graph with approximately 180,000 entity nodes and 420,000 relationship edges. Processing a new paper takes under 30 seconds. Running a pathway discovery query between two arbitrary entities returns results in under two seconds.
Conclusion
Biochemical pathway discovery is a graph problem masked by the linear format of research papers. Aethon converts scattered textual evidence into a structured, weighted, queryable knowledge graph where novel pathway connections emerge from the intersection of thousands of independent observations.
The technical components (biomedical NER, relation extraction, and evidence-weighted graph construction) are individually well understood. The contribution is in combining them into a system that operates at literature scale and produces outputs that biologists find trustworthy and actionable. The hard part was never the NLP. It was building something that a researcher would actually use instead of reading one more paper.
If you're working on computational biology, knowledge graphs, or biomedical NLP, I'd love to hear from you. Reach out to discuss Aethon or related research.