NRT Vector Search — Part 5: The Full Pipeline End to End

Posts 1 through 4 covered the tools individually — what each one does, where the friction lives, how to mitigate it. This post is about what happens when you wire them all together and turn it on.

The full pipeline looks clean on a whiteboard. Running continuously against real data volume with a streaming job that doesn't stop, a vector store that keeps growing, and users querying while writes are in flight — that's where the architecture either holds or doesn't.

The Complete Flow

Left to right, the pipeline is:

Replication tool → ADLS landing zone → Auto Loader → Bronze Delta → Structured Streaming → Silver Delta → Structured Streaming → Gold Delta → foreachBatch → LanceDB

Each arrow is a handoff. Each handoff has a cadence. The latency budget for the full pipeline is the sum of those cadences, and every layer contributes.

End-to-end latency is not a number you discover after you build. It's a number you design for before you build. Map the expected contribution from each layer (replication cadence, trigger intervals, transformation time, embedding throughput, optimize cadence) and you have a latency budget. Build to that budget. Measure against it.
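A latency budget can be as simple as a table you keep in code and check measurements against. The sketch below is one way to do that, not a prescribed format. The layer names and every number in it are hypothetical placeholders, so substitute your own estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetLine:
    layer: str
    expected_seconds: float


# Hypothetical planning numbers, not measurements from a real pipeline.
BUDGET = [
    BudgetLine("replication cadence", 60.0),
    BudgetLine("auto loader detection", 10.0),
    BudgetLine("bronze trigger interval", 30.0),
    BudgetLine("silver trigger interval", 60.0),
    BudgetLine("gold trigger interval", 30.0),
    BudgetLine("transformation time", 15.0),
    BudgetLine("embedding and lancedb write", 20.0),
]


def total_budget(lines):
    """Sum of the planned per-layer contributions, in seconds."""
    return sum(line.expected_seconds for line in lines)


def over_budget(measured, lines):
    """Return {layer: seconds over plan} for layers that exceed their line."""
    planned = {line.layer: line.expected_seconds for line in lines}
    return {
        layer: seconds - planned[layer]
        for layer, seconds in measured.items()
        if layer in planned and seconds > planned[layer]
    }


print(total_budget(BUDGET))
print(over_budget({"silver trigger interval": 75.0}, BUDGET))
```

Keeping the budget next to the pipeline code means a change to a trigger interval and a change to the budget land in the same review.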

Owning the End to End Number

One failure mode that shows up in team settings: latency gets measured in silos. The streaming team measures Bronze to Gold. The ML team measures query latency. Nobody owns the number from source transaction to queryable vector.

That gap is where production surprises live.

Own the end-to-end number explicitly. Define it as: time from record landing in ADLS to that record being queryable in LanceDB. If you need the number from the source transaction instead, add replication cadence in front of it. Instrument it. Put it on a dashboard. Make it the metric that matters, not the per-layer proxies.

The formula is straightforward:

Replication cadence + Auto Loader detection latency + trigger wait at each layer (Bronze, Silver, Gold) + transformation time at each layer + embedding time + LanceDB write time = time to queryable

Each term is measurable. Measure them individually first, then measure the sum. The measured total is usually larger than the sum of the per-layer processing times because of queuing: with a trigger that fires every 30 seconds, a record that arrives one second after a trigger fired waits 29 seconds before processing begins. That queuing cost accumulates across layers.
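The queuing arithmetic is easy to check directly. This is a minimal sketch that assumes triggers fire on a fixed grid and that a record arriving exactly on a trigger is picked up immediately. Real micro-batches that overrun their interval shift the grid, so treat the function as a model, not a description of Spark internals.

```python
import math


def wait_for_next_trigger(arrival_s, interval_s, first_trigger_s=0.0):
    """Seconds a record waits until the next trigger at or after its arrival.

    Assumes triggers fire at first_trigger_s + k * interval_s for k = 0, 1, 2...
    """
    if arrival_s <= first_trigger_s:
        return first_trigger_s - arrival_s
    elapsed = arrival_s - first_trigger_s
    next_fire = first_trigger_s + math.ceil(elapsed / interval_s) * interval_s
    return next_fire - arrival_s


# A record landing one second after the trigger at t=30 waits for t=60.
print(wait_for_next_trigger(31.0, 30.0))

# Average wait for records arriving once per second across one interval.
waits = [wait_for_next_trigger(float(t), 30.0) for t in range(30)]
print(sum(waits) / len(waits))
```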

Trigger Interval Alignment Across Layers

When Structured Streaming runs across multiple medallion layers — Bronze to Silver, Silver to Gold — trigger intervals interact.

A Silver job with a 60-second trigger interval processing output from a Bronze job with a 30-second trigger interval will pick up two Bronze micro-batches per Silver run. That's not inherently wrong, but it means a record can sit in Bronze for up to one full Silver trigger interval before Silver picks it up, even if Bronze is running clean.

Trigger intervals are not independent settings. Their waits add up. A 30-second Bronze interval plus a 60-second Silver interval plus a 30-second Gold interval means up to 2 minutes of trigger wait in the worst case before a record reaches foreachBatch, and that's assuming every layer processes its batch immediately with no transformation time. How close a given record comes to that worst case depends on how the triggers line up with each other and with the record's arrival.

Set trigger intervals with the full chain in mind, not layer by layer in isolation.
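To reason about the chain as a whole, it helps to walk a single record through every layer. The sketch below does that under stated assumptions: fixed trigger grids, a constant processing time per layer, and layer names, offsets, and timings that are all hypothetical. How long a specific record waits depends on where it arrives relative to each layer's trigger phase.

```python
import math
from typing import NamedTuple


class Layer(NamedTuple):
    name: str
    interval_s: float
    offset_s: float = 0.0      # when this layer's first trigger fires
    processing_s: float = 0.0  # assumed constant time to process and commit


def next_trigger(t, layer):
    """First trigger of this layer at or after time t."""
    if t <= layer.offset_s:
        return layer.offset_s
    k = math.ceil((t - layer.offset_s) / layer.interval_s)
    return layer.offset_s + k * layer.interval_s


def commit_times(arrival_s, layers):
    """Commit time of one record at each layer, in pipeline order."""
    t = arrival_s
    commits = {}
    for layer in layers:
        t = next_trigger(t, layer) + layer.processing_s
        commits[layer.name] = t
    return commits


# Hypothetical chain: 30s / 60s / 30s intervals, 2s of processing per layer.
LAYERS = [
    Layer("bronze", 30.0, processing_s=2.0),
    Layer("silver", 60.0, processing_s=2.0),
    Layer("gold", 30.0, processing_s=2.0),
]

print(commit_times(1.0, LAYERS))
```

Changing offset_s on one layer and rerunning shows how much of the total is phase alignment and how much is the intervals themselves.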

Catchup Scenarios

Every streaming pipeline will eventually restart. Cluster maintenance, job failure, deliberate pause — whatever the reason, the stream will stop and start again. What happens at restart defines how operationally sound the pipeline actually is.

If maxFilesPerTrigger (cloudFiles.maxFilesPerTrigger on the Auto Loader source) is not tuned to what downstream can absorb, the stream on restart works through everything that accumulated during downtime in batches sized by the default cap of 1000 files per micro-batch, not by your embedding throughput. A six-hour outage at moderate write volume can produce catchup batches that overwhelm downstream processing: embedding throughput can't keep pace, foreachBatch slows, the backlog keeps growing, and the stream takes hours to recover to real time.

With maxFilesPerTrigger set deliberately, catchup batches are capped at a size you chose. The stream processes the backlog at a controlled rate. Embedding throughput stays within bounds. Recovery is predictable.

Test your catchup scenario before production. Stop the stream deliberately, let files accumulate for a representative period, restart, and measure recovery time.
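Before running the real test, a back-of-envelope estimate tells you what recovery time to expect and whether a given cap can ever catch up. This is a rough sketch with strong assumptions: every capped batch takes the same time, and new files keep arriving at a constant rate during recovery. The parameter names and the example numbers are hypothetical.

```python
import math


def recovery_estimate(backlog_files, max_files_per_trigger,
                      batch_seconds, arrival_files_per_second):
    """Estimate (batches, seconds) needed to drain a backlog.

    Returns None when arrivals during one batch meet or exceed the cap,
    which means the stream never catches up at this setting.
    """
    arrivals_per_batch = arrival_files_per_second * batch_seconds
    drained_per_batch = max_files_per_trigger - arrivals_per_batch
    if drained_per_batch <= 0:
        return None
    batches = math.ceil(backlog_files / drained_per_batch)
    return batches, batches * batch_seconds


# Six hours of downtime at 2 files per second leaves 43,200 files.
backlog = 6 * 60 * 60 * 2
print(recovery_estimate(backlog, 500, 60, 2))  # capped at 500 files per batch
print(recovery_estimate(backlog, 100, 60, 2))  # cap too low to ever recover
```

In practice batch time grows with the cap because embedding dominates, so compare the estimate against the measured recovery from the deliberate-stop test and adjust.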

Concurrent Operations: The Contention Map

When the full pipeline is running, multiple operations are active simultaneously — Auto Loader scanning, Bronze and Silver and Gold streaming jobs reading and writing Delta, foreachBatch embedding and writing to LanceDB, optimize running on its own schedule. Each touches shared resources. Contention is not theoretical.

A few decisions that reduce contention:

Separate clusters for separate concerns. The streaming pipeline and the optimize job should not share a cluster. Optimize is compute intensive and intermittent — it will starve the streaming jobs when it runs if they're sharing resources.

Delta concurrent reads are safe. Multiple streaming jobs reading the same Delta table simultaneously is supported and efficient. Don't serialize reads that don't need to be serialized.

LanceDB writes and optimize must not overlap. The coordination flag from Part 4 is load-bearing architecture, not optional housekeeping.
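The mutual exclusion itself can be sketched in a few lines. The version below is a stand-in, not the flag mechanism from Part 4: it assumes a flag file on a local filesystem where exclusive create is atomic, and the names exclusive_flag and Busy are hypothetical. Object storage does not necessarily give the same guarantee, so a shared lock service or table-based flag may be needed there.

```python
import os
import tempfile
from contextlib import contextmanager


class Busy(Exception):
    """Raised when the other operation currently holds the flag."""


@contextmanager
def exclusive_flag(path, owner):
    """Hold a flag file for the duration of a write or an optimize run."""
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one holder wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise Busy(f"{path} is held; {owner} should retry later") from None
    try:
        os.write(fd, owner.encode())
        os.close(fd)
        yield
    finally:
        os.remove(path)


flag = os.path.join(tempfile.mkdtemp(), "lancedb.flag")
events = []

with exclusive_flag(flag, "foreachBatch writer"):
    events.append("writing")
    try:
        with exclusive_flag(flag, "optimize"):
            events.append("optimizing")
    except Busy:
        events.append("optimize skipped")

# Once the writer releases the flag, optimize can take it.
with exclusive_flag(flag, "optimize"):
    events.append("optimizing")

print(events)
```

A flag left behind by a crashed process blocks everyone, so a production version needs a timeout or an owner liveness check on top of this.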

The Latency Measurement Framework

File landing to Bronze commit — Auto Loader detection latency plus Bronze transformation time. First indicator of ingestion health.

Bronze commit to Gold commit — the full medallion transformation latency. Silver and Gold trigger intervals plus transformation time per layer.

Gold commit to LanceDB queryable — foreachBatch time including embedding, plus LanceDB write time. The last mile before the record is searchable.

End to end — file landing to LanceDB queryable. The only number that matters to a user asking a question.

Query latency distribution — P50, P95, P99. The distribution tells you more than the average.

Instrument all of these from day one. Retrofitting observability onto a running pipeline is significantly more painful than building it in at the start.
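The measurements above reduce to timestamp differences per record plus percentiles over many records. Here is a minimal sketch, assuming each record carries four timestamps in epoch seconds. The event names (file_landed, bronze_commit, gold_commit, lancedb_queryable) are hypothetical labels, not fields that any of the tools emit on their own.

```python
from statistics import quantiles


def stage_latencies(ts):
    """Per-stage latency in seconds for one record's timestamps."""
    return {
        "landing_to_bronze": ts["bronze_commit"] - ts["file_landed"],
        "bronze_to_gold": ts["gold_commit"] - ts["bronze_commit"],
        "gold_to_queryable": ts["lancedb_queryable"] - ts["gold_commit"],
        "end_to_end": ts["lancedb_queryable"] - ts["file_landed"],
    }


def percentiles(samples):
    """P50, P95 and P99 of a list of latencies (needs at least two samples)."""
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


record = {
    "file_landed": 100,
    "bronze_commit": 130,
    "gold_commit": 220,
    "lancedb_queryable": 245,
}
print(stage_latencies(record))
print(percentiles(list(range(101))))
```

The same percentiles helper works for query latency samples, which keeps the pipeline and query dashboards on one definition of P95.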

Tuning Is Not a One-Time Event

The tuning decisions you make at launch are calibrated against your data volume at launch. That volume will change.

As the corpus grows, optimize runtime increases. As write volume scales, fragment accumulation accelerates. As query load increases, flat scan latency on the unindexed tail becomes more visible. The settings that were right at 100K records are not automatically right at 1M.

Build a tuning review into your operational cadence. When end to end latency trends up without an obvious cause, the culprit is usually optimize cadence falling behind corpus growth or trigger intervals that were right at a smaller volume but are now compounding across a larger one.

The pipeline is a system. Systems require ongoing calibration, not a one-time configuration.

What Production Actually Looks Like

A production NRT vector search pipeline running cleanly looks unremarkable from the outside. Records land, get processed, get embedded, become queryable — continuously, without intervention, within a defined and measured latency budget.

Getting to unremarkable is the work. The seams covered in this series — Auto Loader schema evolution, foreachBatch idempotency, LanceDB optimize cadence, trigger interval alignment — are what stand between a pipeline that works in a demo and one that runs in production without someone watching it.

That gap is where the real engineering lives.

Next: Part 6 — What I'd Do Differently. The honest retrospective on what surprised me, what I'd tell someone starting this build from scratch, and where retrieval quality lives relative to pipeline engineering.

Clarity through the chaos.

Arjun Krishnamoorthi is the founder of LogicLens LLC, a fractional data architecture and AI consulting practice. If you have a data infrastructure problem or an AI project that needs senior hands — let's talk.