Building Real-Time Data Pipelines: From Research to Production

Lessons learned building real-time data pipelines that bridge research prototypes and production systems.

Introduction

Every data-intensive project I have built -- Histia, Trovex, Aethon -- hit the same wall: the thing that works in a Jupyter notebook does not work in production. The model is fine. The processing logic is fine. What breaks is everything between the data source and the model, and everything between the model output and the user.

This is the research-to-production gap. Not a skills gap. An architecture gap. Research prototypes process data in batches, synchronously, with a human watching. Production systems process data continuously, asynchronously, with no one watching except dashboards you also have to build. This post covers the patterns, tooling decisions, and lessons from building pipelines that actually run.

[Image: Server racks in a modern data center. The infrastructure powering real-time data pipelines -- reliability starts with the physical layer.]

Why Notebooks Do Not Scale

The Jupyter notebook has a structural problem: it encourages a linear, stateful programming model that is the opposite of what production requires. You load data into memory, transform it in-place, pass it to a model, inspect the output. One process, one machine, one execution context. Production pipelines need four properties that notebooks actively work against:

1. Stateless. Each step should be a pure function (see the sketch after this list). Notebooks accumulate state across cells, making execution order matter and bugs invisible.
2. Distributed. When data exceeds one machine, you need horizontal scaling. Notebooks run on one kernel.
3. Fault-tolerant. When a step fails at 3 AM, retry, dead-letter the record, keep processing. Notebooks crash and wait for a human.
4. Observable. You need latency percentiles, throughput metrics, and error rates at every stage. Notebooks give you print().
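
As a concrete sketch of the first property: a stage written as a pure function, whose output depends only on its input, can be retried, parallelized, and unit-tested in isolation. The stage and record shape below are illustrative assumptions:

python
def normalize_stage(record: dict) -> dict:
    # Pure function: no globals, no in-place mutation, no hidden state
    return {**record, "content": record["content"].strip().lower()}

# Because the stage is stateless, a unit test is one line
assert normalize_stage({"content": "  Hello "})["content"] == "hello"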

The transition from notebook to production is not refactoring. It is a redesign. Accepting that early saves months of debugging distributed systems that never needed to be distributed.

Pipeline Architecture Patterns

Two foundational architectures dominate, and choosing between them is the first real decision you make.

Lambda architecture runs two parallel pipelines: a batch layer for accuracy and a speed layer for low latency. A serving layer merges outputs. The cost is operational complexity -- two codepaths that must produce consistent outputs, and in practice they drift apart.

Kappa architecture eliminates the batch layer. Everything is a stream. Historical reprocessing means replaying the event log. Simpler to reason about, but requires your streaming infrastructure to handle both real-time and historical scale.

Key takeaway

Start with Kappa if your data volume is under 1TB and your processing logic is the same for historical and real-time data. Move to Lambda when you need different processing semantics for batch and stream, or when reprocessing the full event log becomes prohibitively slow.

For Trovex, I use Kappa. Document ingestion, search queries, and relevance feedback all flow through the same streaming pipeline. Reindexing means replaying the document stream. This works because processing logic is identical regardless of whether a document arrived 5 seconds or 5 months ago.
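
In practice, replaying the log can be as simple as pointing a fresh consumer at the earliest retained offset. A minimal sketch with kafka-python -- the topic name, broker address, and processing stub are hypothetical stand-ins:

python
from kafka import KafkaConsumer

def reindex(raw: bytes) -> None:
    """Placeholder for the same processing stage the real-time path uses."""
    ...

consumer = KafkaConsumer(
    "documents",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",     # start from the oldest retained offset
    enable_auto_commit=False,         # replay should not advance live offsets
)

for message in consumer:
    reindex(message.value)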

Message Queues and Streaming

The message broker is the backbone. Two options dominate: Kafka and Redis Streams. The choice is about which failure modes you accept.

Dimension  | Kafka                                       | Redis Streams
Throughput | Millions of messages/sec with partitioning  | Hundreds of thousands/sec, single-threaded
Durability | Replicated to disk across brokers           | In-memory with optional AOF persistence
Replay     | Full log retention, replay from any offset  | Capped streams, limited retention window
Best for   | High-volume event sourcing, multi-consumer  | Low-latency lightweight queuing, prototyping

For Histia, Redis Streams handles the task queue (image IDs to process) while actual image data moves through object storage references. Never put large payloads on the message bus. Pass references and let consumers fetch from storage.
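
A sketch of that reference-passing pattern with redis-py -- the stream name, field names, and storage helper are hypothetical:

python
import redis

def fetch_from_storage(key: bytes) -> bytes:
    """Placeholder for an object-storage fetch (e.g. an S3 GetObject call)."""
    ...

r = redis.Redis()

# Producer side: enqueue a reference, never the payload itself
r.xadd("histia:tasks", {"image_id": "img-123", "s3_key": "images/img-123.tiff"})

# Consumer side: read references, then fetch the heavy payload from storage
for _stream, entries in r.xread({"histia:tasks": "0-0"}, count=10, block=5000):
    for entry_id, fields in entries:
        image_bytes = fetch_from_storage(fields[b"s3_key"])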

Data Validation and Schema Enforcement

Data pipelines fail silently. A malformed record flows through every stage, produces garbage, and you discover it days later when a user reports wrong results. The fix is validation at the boundary: every record validated on entry, every transformation output validated before it moves downstream.

python
from datetime import datetime

from pydantic import BaseModel, field_validator

class DocumentEvent(BaseModel):
    doc_id: str
    content: str
    source: str
    timestamp: datetime
    embedding: list[float] | None = None

    @field_validator("content")
    @classmethod
    def content_not_empty(cls, v: str) -> str:
        # Reject whitespace-only documents at the boundary
        if not v.strip():
            raise ValueError("Document content cannot be empty")
        return v

    @field_validator("embedding")
    @classmethod
    def embedding_dimensions(cls, v: list[float] | None) -> list[float] | None:
        # Enforce the dimensionality the downstream index expects
        if v is not None and len(v) != 768:
            raise ValueError(f"Expected 768-dim embedding, got {len(v)}")
        return v

Pydantic handles record-level validation. For aggregate data quality, Great Expectations runs suite-level checks: column distributions, null rates, referential integrity on micro-batches within the stream, catching drift before it propagates.
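
Great Expectations supplies these checks off the shelf; to make the idea concrete, here is a hand-rolled sketch of a micro-batch check (not GE's API -- the thresholds and field names are illustrative assumptions):

python
import statistics

def check_microbatch(batch: list[dict], baseline_mean: float, tol: float = 0.1) -> list[str]:
    """Aggregate quality checks over one micro-batch; returns violations."""
    issues = []
    # Null rate: how many records arrived without an embedding
    null_rate = sum(1 for rec in batch if rec.get("embedding") is None) / len(batch)
    if null_rate > 0.05:
        issues.append(f"null rate {null_rate:.1%} exceeds 5%")
    # Distribution drift: compare batch embedding means against a baseline
    means = [statistics.fmean(rec["embedding"]) for rec in batch if rec.get("embedding")]
    if means and abs(statistics.fmean(means) - baseline_mean) > tol:
        issues.append("embedding mean drifted from baseline")
    return issues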

Why this matters

Schema enforcement is not just about catching bugs. It is about trust. When downstream consumers know that every record has passed validation, they skip defensive null-checking and focus on business logic. The pipeline becomes a contract, not a hope.

Monitoring, Observability, and Failure Handling

Pipeline monitoring is different from application monitoring. Application monitoring asks: is the server up? Pipeline monitoring asks: is the data flowing correctly?

1. Latency percentiles, not averages. A pipeline with 100ms average latency might have a p99 of 5 seconds. Track p50, p95, and p99 at every stage. Alert on p99 because that is where problems surface first.
2. Dead letter queues. When a record fails, route it to a DLQ instead of dropping or blocking. Review DLQ contents daily. It is the most honest debugging tool you have: it shows exactly what your pipeline cannot handle.
3. Data drift detection. Monitor statistical distributions of key fields. If embedding vectors shift in mean or variance, something changed upstream. Alert on distribution shift, not just schema violations.
4. Consumer lag. In Kafka, track offset lag per consumer group. Rising lag means consumers are falling behind producers. This is the earliest scaling signal and the easiest to automate.

In practice, I use Prometheus for metrics, Grafana for dashboards, and a Python script polling consumer lag and DLQ depth every 30 seconds. The first version of monitoring should take an afternoon, not a sprint.
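
A sketch of that poller, assuming kafka-python and prometheus_client -- the topic, group, and port are hypothetical (DLQ depth would be a second gauge of the same shape):

python
import time

from kafka import KafkaConsumer, TopicPartition
from prometheus_client import Gauge, start_http_server

TOPIC, GROUP = "documents", "indexer"  # hypothetical names

lag_gauge = Gauge("consumer_lag", "Messages behind the latest offset", ["partition"])

def poll_lag(consumer: KafkaConsumer) -> None:
    partitions = [TopicPartition(TOPIC, p)
                  for p in (consumer.partitions_for_topic(TOPIC) or [])]
    end_offsets = consumer.end_offsets(partitions)   # latest offset per partition
    for tp in partitions:
        committed = consumer.committed(tp) or 0      # last offset the group committed
        lag_gauge.labels(partition=tp.partition).set(end_offsets[tp] - committed)

start_http_server(9100)  # expose /metrics for Prometheus to scrape
consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id=GROUP)
while True:
    poll_lag(consumer)
    time.sleep(30)  # the 30-second cadence mentioned above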

[Image: Real-time streaming metrics dashboard. Monitoring and observability in production pipelines -- dashboards transform raw metrics into actionable insight.]

ML Pipeline-Specific Challenges

ML pipelines add three concerns on top of standard data engineering.

1. Feature stores. A dual interface: an offline store for batch training (BigQuery, Parquet on S3) and an online store for real-time inference (Redis, DynamoDB). The same feature definitions serve both. Without this, you compute features differently at training time vs. inference time, producing training-serving skew -- the most insidious ML bug, because it passes evaluation and fails silently in production.
2. Model serving. Version management, canary deployments, graceful fallback. In Histia, a new model gets 5% of traffic while the previous version handles 95%. If the error rate exceeds a threshold, traffic routes back automatically. That is basic production safety when model output informs clinical decisions.
3. A/B testing. Harder than UI testing because feedback loops are longer and metrics are noisier. For Trovex, validating a search relevance improvement takes weeks of user data. The infrastructure maintains parallel versions, splits traffic deterministically by user (see the sketch after this list), and logs every prediction with its model version.
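
Deterministic splitting usually means hashing the user ID into a bucket, so the same user always lands in the same variant. A minimal sketch -- the function name is an assumption, and the 5% canary share echoes the Histia example above:

python
import hashlib

def assign_variant(user_id: str, canary_share: int = 5) -> str:
    """Route a user to a model version deterministically across requests."""
    # Use a stable hash: Python's built-in hash() is salted per process
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_share else "stable"

Logging the assigned variant alongside every prediction is what makes those weeks of user data attributable to the right model version afterward.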

Scaling and Lessons Learned

The first scaling problem is not throughput. It is backpressure: producers generating data faster than consumers can process it. The standard pattern is watermark-based: a high watermark triggers throttling, a low watermark resumes flow. Kafka's pull model gives you this implicitly, since consumers fetch only as fast as they can process. For custom pipelines:

python
import asyncio
from collections import deque

class BackpressureQueue:
    def __init__(self, high_watermark=10000, low_watermark=2000):
        self.queue = deque()                 # O(1) pops, unlike list.pop(0)
        self.high, self.low = high_watermark, low_watermark
        self.paused = False
        self._not_empty = asyncio.Event()    # set while items are available
        self._resume = asyncio.Event()       # producers wait on this while paused
        self._resume.set()

    async def put(self, item):
        await self._resume.wait()            # throttled producers block here
        self.queue.append(item)
        self._not_empty.set()
        if len(self.queue) >= self.high and not self.paused:
            self.paused = True               # high watermark: pause producers
            self._resume.clear()

    async def get(self):
        while not self.queue:
            self._not_empty.clear()
            await self._not_empty.wait()     # wait for a producer
        item = self.queue.popleft()
        if len(self.queue) <= self.low and self.paused:
            self.paused = False              # low watermark: resume producers
            self._resume.set()
        return item

For horizontal scaling, the key constraint is ordering. If events from the same user must be sequential, partition by user ID and scale by adding partitions and consumers together. If ordering does not matter, round-robin for simpler scaling. After building pipelines across Histia, Trovex, Aethon, and Clover, the lessons converge:

1. Start with a batch script. Before building streaming infrastructure, write a Python script that processes data end-to-end in a for loop. Get the logic right. Validate outputs. Only then decompose into pipeline stages.
2. Add complexity only when metrics justify it. Do not add Kafka until you have measured that your current solution cannot handle the load. Every component you add is a component to monitor, debug, and maintain at 3 AM.
3. Make every stage independently testable. Each stage should be a function you can call with test input and verify the output. If you cannot unit test a stage, it is doing too many things.
4. Log structured, trace everything. Every record gets a trace ID. Every stage logs trace ID, processing time, and status as structured JSON (see the sketch after this list). When something breaks, you grep by trace ID and see the full lifecycle across every stage.
5. Design for replay from day one. Your pipeline will produce wrong outputs. The question is whether you can reprocess without manual intervention. Idempotent stages plus a replayable message broker means recovery is a one-line command, not a weekend.
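
A minimal sketch of that logging wrapper -- the wrapper name and logger setup are assumptions:

python
import json
import logging
import time
import uuid

log = logging.getLogger("pipeline")

def run_stage(stage_fn, record: dict) -> dict:
    """Run one stage, emitting structured JSON keyed by trace ID."""
    trace_id = record.setdefault("trace_id", str(uuid.uuid4()))
    start = time.perf_counter()
    status = "error"
    try:
        result = stage_fn(record)
        status = "ok"
        return result
    finally:
        # One JSON line per stage per record: grep by trace_id to reconstruct
        log.info(json.dumps({
            "trace_id": trace_id,
            "stage": stage_fn.__name__,
            "ms": round((time.perf_counter() - start) * 1000, 2),
            "status": status,
        }))
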
Key takeaway

The best pipeline architecture is the simplest one that meets your requirements today, with clear extension points for tomorrow. Overengineering is just as costly as underengineering, and harder to diagnose because the symptoms look like operational complexity rather than technical failure.

Real-time data pipelines are not glamorous. They do not demo well. But they are the difference between a research prototype that impresses in a meeting and a production system that serves users reliably. Every project I have built that actually works in production got there because the pipeline was right, not because the model was right. The model is the easy part. The pipeline is the system.
