Building Real-Time Data Pipelines: From Research to Production

Lessons learned building real-time data pipelines that bridge research prototypes and production systems.

Introduction

Every data-intensive project I have built -- Histia, Trovex, Aethon -- hit the same wall: the thing that works in a Jupyter notebook does not work in production. The model is fine. The processing logic is fine. What breaks is everything between the data source and the model, and everything between the model output and the user.

This is the research-to-production gap. Not a skills gap. An architecture gap. Research prototypes process data in batches, synchronously, with a human watching. Production systems process data continuously, asynchronously, with no one watching except dashboards you also have to build. This post covers the patterns, tooling decisions, and lessons from building pipelines that actually run.

[Image: Server racks in a modern data center. The infrastructure powering real-time data pipelines -- reliability starts with the physical layer.]

Why Notebooks Do Not Scale

The Jupyter notebook has a structural problem: it encourages a linear, stateful programming model that is the opposite of what production requires. You load data into memory, transform it in-place, pass it to a model, inspect the output. One process, one machine, one execution context. Production pipelines need four properties that notebooks actively work against:

1. Stateless. Each step should be a pure function (see the sketch after this list). Notebooks accumulate state across cells, making execution order matter and bugs invisible.
2. Distributed. When data exceeds one machine, you need horizontal scaling. Notebooks run on one kernel.
3. Fault-tolerant. When a step fails at 3 AM, retry, dead-letter the record, keep processing. Notebooks crash and wait for a human.
4. Observable. You need latency percentiles, throughput metrics, and error rates at every stage. Notebooks give you print().
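
As a concrete sketch of the first property: a stage written as a pure function, whose output depends only on its input, can be retried, parallelized, and unit-tested in isolation. The stage and record shape below are illustrative assumptions:

python
def normalize_stage(record: dict) -> dict:
    # Pure function: no globals, no in-place mutation, no hidden state
    return {**record, "content": record["content"].strip().lower()}

# Because the stage is stateless, a unit test is one line
assert normalize_stage({"content": "  Hello "})["content"] == "hello"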

The transition from notebook to production is not refactoring. It is a redesign. Accepting that early saves months of debugging distributed systems that never needed to be distributed.

Pipeline Architecture Patterns

Two foundational architectures dominate, and choosing between them is the first real decision you make.

Lambda architecture runs two parallel pipelines: a batch layer for accuracy and a speed layer for low latency. A serving layer merges outputs. The cost is operational complexity -- two codepaths that must produce consistent outputs, and in practice they drift apart.

Kappa architecture eliminates the batch layer. Everything is a stream. Historical reprocessing means replaying the event log. Simpler to reason about, but requires your streaming infrastructure to handle both real-time and historical scale.

Key takeaway

Start with Kappa if your data volume is under 1TB and your processing logic is the same for historical and real-time data. Move to Lambda when you need different processing semantics for batch and stream, or when reprocessing the full event log becomes prohibitively slow.

For Trovex, I use Kappa. Document ingestion, search queries, and relevance feedback all flow through the same streaming pipeline. Reindexing means replaying the document stream. This works because processing logic is identical regardless of whether a document arrived 5 seconds or 5 months ago.
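
In practice, replaying the log can be as simple as pointing a fresh consumer at the earliest retained offset. A minimal sketch with kafka-python -- the topic name, broker address, and processing stub are hypothetical stand-ins:

python
from kafka import KafkaConsumer

def reindex(raw: bytes) -> None:
    """Placeholder for the same processing stage the real-time path uses."""
    ...

consumer = KafkaConsumer(
    "documents",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",     # start from the oldest retained offset
    enable_auto_commit=False,         # replay should not advance live offsets
)

for message in consumer:
    reindex(message.value)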

Message Queues and Streaming

The message broker is the backbone. Two options dominate: Kafka and Redis Streams. The choice is about which failure modes you accept.

Dimension  | Kafka                                       | Redis Streams
Throughput | Millions of messages/sec with partitioning  | Hundreds of thousands/sec, single-threaded
Durability | Replicated to disk across brokers           | In-memory with optional AOF persistence
Replay     | Full log retention, replay from any offset  | Capped streams, limited retention window
Best for   | High-volume event sourcing, multi-consumer  | Low-latency lightweight queuing, prototyping

For Histia, Redis Streams handles the task queue (image IDs to process) while actual image data moves through object storage references. Never put large payloads on the message bus. Pass references and let consumers fetch from storage.
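
A sketch of that reference-passing pattern with redis-py -- the stream name, field names, and storage helper are hypothetical:

python
import redis

def fetch_from_storage(key: bytes) -> bytes:
    """Placeholder for an object-storage fetch (e.g. an S3 GetObject call)."""
    ...

r = redis.Redis()

# Producer side: enqueue a reference, never the payload itself
r.xadd("histia:tasks", {"image_id": "img-123", "s3_key": "images/img-123.tiff"})

# Consumer side: read references, then fetch the heavy payload from storage
for _stream, entries in r.xread({"histia:tasks": "0-0"}, count=10, block=5000):
    for entry_id, fields in entries:
        image_bytes = fetch_from_storage(fields[b"s3_key"])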

Data Validation and Schema Enforcement

Data pipelines fail silently. A malformed record flows through every stage, produces garbage, and you discover it days later when a user reports wrong results. The fix is validation at the boundary: every record validated on entry, every transformation output validated before it moves downstream.

python
from datetime import datetime

from pydantic import BaseModel, field_validator

class DocumentEvent(BaseModel):
    doc_id: str
    content: str
    source: str
    timestamp: datetime
    embedding: list[float] | None = None

    @field_validator("content")
    @classmethod
    def content_not_empty(cls, v: str) -> str:
        # Reject whitespace-only documents at the boundary
        if not v.strip():
            raise ValueError("Document content cannot be empty")
        return v

    @field_validator("embedding")
    @classmethod
    def embedding_dimensions(cls, v: list[float] | None) -> list[float] | None:
        # Enforce the dimensionality the downstream index expects
        if v is not None and len(v) != 768:
            raise ValueError(f"Expected 768-dim embedding, got {len(v)}")
        return v

Pydantic handles record-level validation. For aggregate data quality, Great Expectations runs suite-level checks: column distributions, null rates, referential integrity on micro-batches within the stream, catching drift before it propagates.
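
Great Expectations supplies these checks off the shelf; to make the idea concrete, here is a hand-rolled sketch of a micro-batch check (not GE's API -- the thresholds and field names are illustrative assumptions):

python
import statistics

def check_microbatch(batch: list[dict], baseline_mean: float, tol: float = 0.1) -> list[str]:
    """Aggregate quality checks over one micro-batch; returns violations."""
    issues = []
    # Null rate: how many records arrived without an embedding
    null_rate = sum(1 for rec in batch if rec.get("embedding") is None) / len(batch)
    if null_rate > 0.05:
        issues.append(f"null rate {null_rate:.1%} exceeds 5%")
    # Distribution drift: compare batch embedding means against a baseline
    means = [statistics.fmean(rec["embedding"]) for rec in batch if rec.get("embedding")]
    if means and abs(statistics.fmean(means) - baseline_mean) > tol:
        issues.append("embedding mean drifted from baseline")
    return issues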

Why this matters

Schema enforcement is not just about catching bugs. It is about trust. When downstream consumers know that every record has passed validation, they skip defensive null-checking and focus on business logic. The pipeline becomes a contract, not a hope.

Monitoring, Observability, and Failure Handling

Pipeline monitoring is different from application monitoring. Application monitoring asks: is the server up? Pipeline monitoring asks: is the data flowing correctly?

1. Latency percentiles, not averages. A pipeline with 100ms average latency might have a p99 of 5 seconds. Track p50, p95, and p99 at every stage. Alert on p99 because that is where problems surface first.
2. Dead letter queues. When a record fails, route it to a DLQ instead of dropping or blocking. Review DLQ contents daily. It is the most honest debugging tool you have: it shows exactly what your pipeline cannot handle.
3. Data drift detection. Monitor statistical distributions of key fields. If embedding vectors shift in mean or variance, something changed upstream. Alert on distribution shift, not just schema violations.
4. Consumer lag. In Kafka, track offset lag per consumer group. Rising lag means consumers are falling behind producers. This is the earliest scaling signal and the easiest to automate.

In practice, I use Prometheus for metrics, Grafana for dashboards, and a Python script polling consumer lag and DLQ depth every 30 seconds. The first version of monitoring should take an afternoon, not a sprint.
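
A sketch of that poller, assuming kafka-python and prometheus_client -- the topic, group, and port are hypothetical (DLQ depth would be a second gauge of the same shape):

python
import time

from kafka import KafkaConsumer, TopicPartition
from prometheus_client import Gauge, start_http_server

TOPIC, GROUP = "documents", "indexer"  # hypothetical names

lag_gauge = Gauge("consumer_lag", "Messages behind the latest offset", ["partition"])

def poll_lag(consumer: KafkaConsumer) -> None:
    partitions = [TopicPartition(TOPIC, p)
                  for p in (consumer.partitions_for_topic(TOPIC) or [])]
    end_offsets = consumer.end_offsets(partitions)   # latest offset per partition
    for tp in partitions:
        committed = consumer.committed(tp) or 0      # last offset the group committed
        lag_gauge.labels(partition=tp.partition).set(end_offsets[tp] - committed)

start_http_server(9100)  # expose /metrics for Prometheus to scrape
consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id=GROUP)
while True:
    poll_lag(consumer)
    time.sleep(30)  # the 30-second cadence mentioned above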

[Image: Real-time streaming metrics dashboard. Monitoring and observability in production pipelines -- dashboards transform raw metrics into actionable insight.]

ML Pipeline-Specific Challenges

ML pipelines add three concerns on top of standard data engineering.

1. Feature stores. A dual interface: an offline store for batch training (BigQuery, Parquet on S3) and an online store for real-time inference (Redis, DynamoDB). The same feature definitions serve both. Without this, you compute features differently at training time vs. inference time, producing training-serving skew -- the most insidious ML bug, because it passes evaluation and fails silently in production.
2. Model serving. Version management, canary deployments, graceful fallback. In Histia, a new model gets 5% of traffic while the previous version handles 95%. If the error rate exceeds a threshold, traffic routes back automatically. That is basic production safety when model output informs clinical decisions.
3. A/B testing. Harder than UI testing because feedback loops are longer and metrics are noisier. For Trovex, validating a search relevance improvement takes weeks of user data. The infrastructure maintains parallel versions, splits traffic deterministically by user (see the sketch after this list), and logs every prediction with its model version.
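
Deterministic splitting usually means hashing the user ID into a bucket, so the same user always lands in the same variant. A minimal sketch -- the function name is an assumption, and the 5% canary share echoes the Histia example above:

python
import hashlib

def assign_variant(user_id: str, canary_share: int = 5) -> str:
    """Route a user to a model version deterministically across requests."""
    # Use a stable hash: Python's built-in hash() is salted per process
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_share else "stable"

Logging the assigned variant alongside every prediction is what makes those weeks of user data attributable to the right model version afterward.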

Scaling and Lessons Learned

The first scaling problem is not throughput. It is backpressure: producers generating data faster than consumers can process it. The standard pattern is watermark-based: a high watermark triggers throttling, a low watermark resumes flow. Kafka's pull model gives you this implicitly, since consumers fetch only as fast as they can process. For custom pipelines:

python
import asyncio
from collections import deque

class BackpressureQueue:
    def __init__(self, high_watermark=10000, low_watermark=2000):
        self.queue = deque()                 # O(1) pops, unlike list.pop(0)
        self.high, self.low = high_watermark, low_watermark
        self.paused = False
        self._not_empty = asyncio.Event()    # set while items are available
        self._resume = asyncio.Event()       # producers wait on this while paused
        self._resume.set()

    async def put(self, item):
        await self._resume.wait()            # throttled producers block here
        self.queue.append(item)
        self._not_empty.set()
        if len(self.queue) >= self.high and not self.paused:
            self.paused = True               # high watermark: pause producers
            self._resume.clear()

    async def get(self):
        while not self.queue:
            self._not_empty.clear()
            await self._not_empty.wait()     # wait for a producer
        item = self.queue.popleft()
        if len(self.queue) <= self.low and self.paused:
            self.paused = False              # low watermark: resume producers
            self._resume.set()
        return item

For horizontal scaling, the key constraint is ordering. If events from the same user must be sequential, partition by user ID and scale by adding partitions and consumers together. If ordering does not matter, round-robin for simpler scaling. After building pipelines across Histia, Trovex, Aethon, and Clover, the lessons converge:

1. Start with a batch script. Before building streaming infrastructure, write a Python script that processes data end-to-end in a for loop. Get the logic right. Validate outputs. Only then decompose into pipeline stages.
2. Add complexity only when metrics justify it. Do not add Kafka until you have measured that your current solution cannot handle the load. Every component you add is a component to monitor, debug, and maintain at 3 AM.
3. Make every stage independently testable. Each stage should be a function you can call with test input and verify the output. If you cannot unit test a stage, it is doing too many things.
4. Log structured, trace everything. Every record gets a trace ID. Every stage logs trace ID, processing time, and status as structured JSON (see the sketch after this list). When something breaks, you grep by trace ID and see the full lifecycle across every stage.
5. Design for replay from day one. Your pipeline will produce wrong outputs. The question is whether you can reprocess without manual intervention. Idempotent stages plus a replayable message broker means recovery is a one-line command, not a weekend.
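
A minimal sketch of that logging wrapper -- the wrapper name and logger setup are assumptions:

python
import json
import logging
import time
import uuid

log = logging.getLogger("pipeline")

def run_stage(stage_fn, record: dict) -> dict:
    """Run one stage, emitting structured JSON keyed by trace ID."""
    trace_id = record.setdefault("trace_id", str(uuid.uuid4()))
    start = time.perf_counter()
    status = "error"
    try:
        result = stage_fn(record)
        status = "ok"
        return result
    finally:
        # One JSON line per stage per record: grep by trace_id to reconstruct
        log.info(json.dumps({
            "trace_id": trace_id,
            "stage": stage_fn.__name__,
            "ms": round((time.perf_counter() - start) * 1000, 2),
            "status": status,
        }))
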
Key takeaway

The best pipeline architecture is the simplest one that meets your requirements today, with clear extension points for tomorrow. Overengineering is just as costly as underengineering, and harder to diagnose because the symptoms look like operational complexity rather than technical failure.

Real-time data pipelines are not glamorous. They do not demo well. But they are the difference between a research prototype that impresses in a meeting and a production system that serves users reliably. Every project I have built that actually works in production got there because the pipeline was right, not because the model was right. The model is the easy part. The pipeline is the system.
