Introduction
Every data-intensive project I have built -- Histia, Trovex, Aethon -- hit the same wall: the thing that works in a Jupyter notebook does not work in production. The model is fine. The processing logic is fine. What breaks is everything between the data source and the model, and everything between the model output and the user.
This is the research-to-production gap. Not a skills gap. An architecture gap. Research prototypes process data in batches, synchronously, with a human watching. Production systems process data continuously, asynchronously, with no one watching except dashboards you also have to build. This post covers the patterns, tooling decisions, and lessons from building pipelines that actually run.

Why Notebooks Do Not Scale
The Jupyter notebook has a structural problem: it encourages a linear, stateful programming model that is the opposite of what production requires. You load data into memory, transform it in-place, pass it to a model, inspect the output. One process, one machine, one execution context. Production pipelines need four properties that notebooks actively work against:
- Statelessness: any worker can handle any record without depending on accumulated in-memory state.
- Fault tolerance: a crashed stage restarts and resumes without losing or duplicating records.
- Parallelism: work partitioned across processes and machines, not a single execution context.
- Observability: structured metrics and logs, not inspecting variables with print().

The transition from notebook to production is not refactoring. It is a redesign. Accepting that early saves months of debugging distributed systems that never needed to be distributed.
Pipeline Architecture Patterns
Two foundational architectures dominate, and choosing between them is the first real decision you make.
Lambda architecture runs two parallel pipelines: a batch layer for accuracy and a speed layer for low latency. A serving layer merges outputs. The cost is operational complexity -- two codepaths that must produce consistent outputs, and in practice they drift apart.
Kappa architecture eliminates the batch layer. Everything is a stream. Historical reprocessing means replaying the event log. Simpler to reason about, but requires your streaming infrastructure to handle both real-time and historical scale.
Start with Kappa if your data volume is under 1TB and your processing logic is the same for historical and real-time data. Move to Lambda when you need different processing semantics for batch and stream, or when reprocessing the full event log becomes prohibitively slow.
For Trovex, I use Kappa. Document ingestion, search queries, and relevance feedback all flow through the same streaming pipeline. Reindexing means replaying the document stream. This works because processing logic is identical regardless of whether a document arrived 5 seconds or 5 months ago.
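As a sketch of what replay looks like in practice -- assuming Kafka with the kafka-python client, and a hypothetical `index_document` handler -- a fresh consumer group reading from the earliest offset walks the entire retained log through the same code path as live traffic:

```python
from kafka import KafkaConsumer

# A new group_id has no committed offsets, so auto_offset_reset="earliest"
# starts from the first retained message and replays the full stream.
consumer = KafkaConsumer(
    "documents",
    bootstrap_servers="localhost:9092",
    group_id="reindex-run-1",  # fresh group => replay from the beginning
    auto_offset_reset="earliest",
    value_deserializer=lambda b: b.decode("utf-8"),
)

for message in consumer:
    index_document(message.value)  # hypothetical: same handler as live ingestion
```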
Message Queues and Streaming
The message broker is the backbone. Two options dominate: Kafka and Redis Streams. The choice is about which failure modes you accept.
| Dimension | Kafka | Redis Streams |
|---|---|---|
| Throughput | Millions of messages/sec with partitioning | Hundreds of thousands/sec, single-threaded |
| Durability | Replicated to disk across brokers | In-memory with optional AOF persistence |
| Replay | Full log retention, replay from any offset | Capped streams, limited retention window |
| Best for | High-volume event sourcing, multi-consumer | Low-latency lightweight queuing, prototyping |
For Histia, Redis Streams handles the task queue (image IDs to process) while actual image data moves through object storage references. Never put large payloads on the message bus. Pass references and let consumers fetch from storage.
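Here is a minimal sketch of that reference-passing pattern, assuming redis-py and boto3; the stream, group, and bucket names and the `handle_image` hook are illustrative, not Histia's actual code:

```python
import boto3
import redis

r = redis.Redis()
s3 = boto3.client("s3")

# Producer: upload the payload to object storage, enqueue only a reference.
def enqueue_image(image_id: str, image_bytes: bytes):
    key = f"images/{image_id}.png"
    s3.put_object(Bucket="pipeline-images", Key=key, Body=image_bytes)
    r.xadd("image-tasks", {"image_id": image_id, "s3_key": key})

# Consumer: read the reference, fetch the payload, acknowledge when done.
# Assumes the group exists, e.g. via
# r.xgroup_create("image-tasks", group, id="0", mkstream=True).
def process_next(group: str, consumer: str):
    entries = r.xreadgroup(group, consumer, {"image-tasks": ">"}, count=1, block=5000)
    for _stream, messages in entries:
        for entry_id, fields in messages:
            obj = s3.get_object(Bucket="pipeline-images", Key=fields[b"s3_key"].decode())
            handle_image(obj["Body"].read())  # hypothetical processing hook
            r.xack("image-tasks", group, entry_id)
```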
Data Validation and Schema Enforcement
Data pipelines fail silently. A malformed record flows through every stage, produces garbage, and you discover it days later when a user reports wrong results. The fix is validation at the boundary: every record validated on entry, every transformation output validated before it moves downstream.
```python
from datetime import datetime

from pydantic import BaseModel, validator


class DocumentEvent(BaseModel):
    doc_id: str
    content: str
    source: str
    timestamp: datetime
    embedding: list[float] | None = None  # populated after the embedding stage

    @validator("content")
    def content_not_empty(cls, v):
        if not v.strip():
            raise ValueError("Document content cannot be empty")
        return v

    @validator("embedding")
    def embedding_dimensions(cls, v):
        # Reject vectors of the wrong dimensionality before they reach the index.
        if v is not None and len(v) != 768:
            raise ValueError(f"Expected 768-dim, got {len(v)}")
        return v
```

Pydantic handles record-level validation. For aggregate data quality, Great Expectations runs suite-level checks -- column distributions, null rates, referential integrity -- on micro-batches within the stream, catching drift before it propagates.
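At the consumer boundary, validation becomes parse-or-reject: records that fail never continue downstream. A minimal sketch, assuming the Pydantic v1 API above and hypothetical `dead_letter` and `process` functions:

```python
from pydantic import ValidationError

def handle_raw(payload: str):
    try:
        event = DocumentEvent.parse_raw(payload)  # validate on entry
    except ValidationError as exc:
        dead_letter(payload, reason=str(exc))  # hypothetical DLQ sink
        return
    process(event)  # hypothetical downstream step; every field is now trusted
```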
Schema enforcement is not just about catching bugs. It is about trust. When downstream consumers know that every record has passed validation, they skip defensive null-checking and focus on business logic. The pipeline becomes a contract, not a hope.
Monitoring, Observability, and Failure Handling
Pipeline monitoring is different from application monitoring. Application monitoring asks: is the server up? Pipeline monitoring asks: is the data flowing correctly?
The metrics that matter:
- Consumer lag: how far each consumer group is behind the head of the stream. Growing lag means consumers cannot keep up.
- Dead-letter queue (DLQ) depth: every record in the DLQ is a failure someone has to triage.
- End-to-end latency: p50, p95, and p99 at every stage. Alert on p99 because that is where problems surface first.

In practice, I use Prometheus for metrics, Grafana for dashboards, and a Python script polling consumer lag and DLQ depth every 30 seconds. The first version of monitoring should take an afternoon, not a sprint.
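The afternoon version really can be this small. A sketch with prometheus_client, where `get_consumer_lag` and `get_dlq_depth` are hypothetical pollers standing in for whatever your broker exposes:

```python
import time
from prometheus_client import Gauge, start_http_server

CONSUMER_LAG = Gauge("pipeline_consumer_lag", "Messages behind stream head", ["stage"])
DLQ_DEPTH = Gauge("pipeline_dlq_depth", "Records in the dead-letter queue")

start_http_server(9108)  # exposes /metrics for Prometheus to scrape

while True:
    for stage, lag in get_consumer_lag().items():  # hypothetical poller
        CONSUMER_LAG.labels(stage=stage).set(lag)
    DLQ_DEPTH.set(get_dlq_depth())                 # hypothetical poller
    time.sleep(30)
```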

ML Pipeline-Specific Challenges
ML pipelines add three concerns on top of standard data engineering: training/serving skew (features computed one way offline and another way online), model versioning (every output should be traceable to the model that produced it), and drift (input distributions shift while the deployed model stands still).
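One cheap defense on the versioning front is to stamp every output with the model version and a trace ID at inference time, so any result can be audited later. A hypothetical sketch in the same Pydantic style as DocumentEvent; the `model.score` call stands in for whatever inference API you use:

```python
from datetime import datetime, timezone
from pydantic import BaseModel

class PredictionEvent(BaseModel):
    trace_id: str       # follows the record through every stage
    doc_id: str
    model_version: str  # e.g. a git SHA or model-registry tag
    score: float
    predicted_at: datetime

def predict(doc, model, version: str) -> PredictionEvent:
    return PredictionEvent(
        trace_id=doc.trace_id,
        doc_id=doc.doc_id,
        model_version=version,
        score=model.score(doc.content),  # hypothetical model API
        predicted_at=datetime.now(timezone.utc),
    )
```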
Scaling and Lessons Learned
The first scaling problem is not throughput. It is backpressure: producers generating data faster than consumers process it. The pattern is watermark-based: a high watermark triggers throttling, a low watermark resumes flow. Kafka handles this natively. For custom pipelines:
```python
from collections import deque

class BackpressureQueue:
    def __init__(self, high_watermark=10000, low_watermark=2000):
        self.queue = deque()  # deque gives O(1) pops from the left
        self.high, self.low = high_watermark, low_watermark
        self.paused = False

    async def signal_producers(self, action: str):
        # Hook: notify upstream producers to pause or resume.
        # Wire this to your transport (a control topic, a callback, etc.).
        ...

    async def put(self, item):
        self.queue.append(item)
        if len(self.queue) >= self.high and not self.paused:
            self.paused = True
            await self.signal_producers("pause")  # high watermark: throttle

    async def get(self):
        item = self.queue.popleft()  # caller must ensure the queue is non-empty
        if len(self.queue) <= self.low and self.paused:
            self.paused = False
            await self.signal_producers("resume")  # drained: resume flow
        return item
```

For horizontal scaling, the key constraint is ordering. If events from the same user must be sequential, partition by user ID and scale by adding partitions and consumers together. If ordering does not matter, round-robin gives simpler scaling. A keyed-partitioning sketch follows below.
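Keyed partitioning is one line at the producer. A sketch with kafka-python: the default partitioner hashes the key, so every event for a given user lands on the same partition and stays ordered:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def publish(user_id: str, event_bytes: bytes):
    # Same key => same partition => per-user ordering is preserved.
    producer.send("user-events", key=user_id.encode("utf-8"), value=event_bytes)
```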
After building pipelines across Histia, Trovex, Aethon, and Clover, the lessons converge:
- Validate at every boundary. Silent corruption is the most expensive failure mode.
- Build monitoring on day one, not after the first outage.
- Attach a trace ID to every record at ingestion, so you can grep by trace ID and see the full lifecycle across every stage.

The best pipeline architecture is the simplest one that meets your requirements today, with clear extension points for tomorrow. Overengineering is just as costly as underengineering, and harder to diagnose because the symptoms look like operational complexity rather than technical failure.
Real-time data pipelines are not glamorous. They do not demo well. But they are the difference between a research prototype that impresses in a meeting and a production system that serves users reliably. Every project I have built that actually works in production got there because the pipeline was right, not because the model was right. The model is the easy part. The pipeline is the system.