NRT Vector Search — Part 6: What I'd Do Differently

Every build teaches you something the documentation couldn't. This post is about what this one taught me — the friction points that surprised me, the decisions I'd make earlier, and what I'd tell someone starting this build from scratch.

No happy path recap. That's what the previous five posts are for. This one is the unfiltered version.

Start With the Latency Budget, Not the Code

The mistake most teams make with NRT pipelines is starting with the implementation and discovering the latency profile at the end. By then trigger intervals are set, cluster configurations are locked, and the architecture has opinions baked in that are expensive to revisit.

Do it in reverse. Before writing a line of code, map the expected latency contribution from every layer. Replication cadence. Auto Loader detection interval. Trigger interval per medallion layer. Embedding throughput. Optimize cadence. Add them up. That number is your realistic end-to-end latency floor, and it's almost always larger than anyone expects.
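As a rough illustration of that arithmetic, here is a minimal sketch of a latency budget. The layer names mirror the list above, but every number is a placeholder I made up for the example, and the Stage and latency_floor names are hypothetical. Swap in figures measured in your own environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    worst_case_seconds: float  # worst-case wait or processing time this layer adds


def latency_floor(stages):
    """Sum the worst-case contribution of every layer, in seconds."""
    return sum(stage.worst_case_seconds for stage in stages)


# Placeholder figures for illustration only; replace with measured values.
budget = [
    Stage("replication cadence", 60),
    Stage("Auto Loader detection interval", 30),
    Stage("bronze trigger interval", 30),
    Stage("silver trigger interval", 30),
    Stage("gold trigger interval", 30),
    Stage("embedding per micro-batch", 45),
    Stage("optimize cadence", 15),
]

total = latency_floor(budget)
for stage in budget:
    share = stage.worst_case_seconds / total
    print(f"{stage.name:<32} {stage.worst_case_seconds:>6.0f}s {share:>6.0%}")
print(f"{'end-to-end floor':<32} {total:>6.0f}s")
```

The per-layer share column is the useful part in a stakeholder conversation: it shows which layer to negotiate first if the floor is too high.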

Get stakeholder alignment on that number before the build begins. "Near real time" means different things to different people. A business stakeholder may hear it as seconds. Your actual floor may be minutes. That gap, discovered after go-live, is the kind of conversation nobody wants to have.

Define it, document it, align on it. Then build to it.

Bronze Completeness Is Not Optional

This one came from a production incident that shouldn't have been an incident.

A record went missing downstream. The investigation traced it to the landing step — the record never made it into Bronze in the first place. Not a LanceDB problem, not a streaming problem, not an embedding problem. A Bronze completeness problem.

The diagnosis took longer than it should have because Bronze had no completeness contract. There was no expected record count to validate against. The pipeline was doing quality checks on format, schema, and nulls, but it had no way to see that a record was missing.

If I'm building this again, Bronze completeness validation is a first-class step, not an afterthought. Before Silver processing begins, confirm the batch is whole. Get expected record counts from upstream systems. If upstream can't provide them, document that gap explicitly — a Bronze layer with no completeness contract is an architectural finding, not a minor gap.
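A minimal sketch of what that gate could look like, assuming upstream can hand over an expected count per batch. The names here (CompletenessResult, check_bronze_completeness, gate_silver) are hypothetical, and in a real pipeline the landed count would come from a count over the Bronze table rather than a plain integer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompletenessResult:
    batch_id: str
    expected: Optional[int]  # None when upstream provides no count
    landed: int
    status: str  # "complete", "mismatch", or "no_contract"


class BronzeCompletenessError(Exception):
    """Raised when a batch must not proceed to Silver."""


def check_bronze_completeness(batch_id, expected, landed):
    """Compare the upstream-reported count with what landed in Bronze."""
    if expected is None:
        # No contract: record the gap explicitly instead of assuming the batch is whole.
        return CompletenessResult(batch_id, None, landed, "no_contract")
    status = "complete" if landed == expected else "mismatch"
    return CompletenessResult(batch_id, expected, landed, status)


def gate_silver(result, allow_no_contract=False):
    """Block Silver processing unless the batch is confirmed whole."""
    if result.status == "complete":
        return
    if result.status == "no_contract" and allow_no_contract:
        return
    raise BronzeCompletenessError(
        f"batch {result.batch_id}: status={result.status}, "
        f"expected={result.expected}, landed={result.landed}"
    )
```

Whether a batch with no contract is allowed through is a policy decision. The sketch blocks it by default so the gap has to be accepted deliberately rather than silently.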

That one missing record cost more time to diagnose than the entire completeness framework would have taken to build.

The foreachBatch Idempotency Pattern Goes In on Day One

Not after the first duplicate write incident. Day one.

The instinct is to get the pipeline running first and add the deduplication table later. That instinct is wrong. By the time the first retry-induced duplicate shows up in production, you have stale vectors in the index serving retrieval results. Cleaning that up is significantly more work than the idempotency pattern would have been to build at the start.

Three columns in a Delta table. A check at the top of foreachBatch. A write at the bottom after confirmed success. There is no version of this pipeline where that work isn't worth doing before the first production write.
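Here is a minimal sketch of that shape, with an in-memory dictionary standing in for the Delta table so the example runs anywhere. The three columns shown (batch_id, status, committed_at) are my illustrative assumption, not a prescribed schema, and write_vectors is a hypothetical callable for whatever writes to the vector store.

```python
from datetime import datetime, timezone


class BatchLedger:
    """In-memory stand-in for the Delta ledger table.

    Hypothetical columns: batch_id, status, committed_at.
    """

    def __init__(self):
        self._rows = {}

    def is_committed(self, batch_id):
        row = self._rows.get(batch_id)
        return row is not None and row["status"] == "committed"

    def mark_committed(self, batch_id):
        self._rows[batch_id] = {
            "batch_id": batch_id,
            "status": "committed",
            "committed_at": datetime.now(timezone.utc),
        }


def make_batch_handler(ledger, write_vectors):
    """Build a function with the (batch_df, batch_id) signature foreachBatch expects."""

    def handle_batch(batch_df, batch_id):
        # Check at the top: a retried micro-batch that already succeeded is skipped.
        if ledger.is_committed(batch_id):
            return False
        # If this raises, the ledger is never updated and the retry runs the batch again.
        write_vectors(batch_df)
        # Write at the bottom, only after the vector write has succeeded.
        ledger.mark_committed(batch_id)
        return True

    return handle_batch
```

In a Spark job the returned function would be passed to writeStream.foreachBatch, which ignores the return value; it is there only to make the skip visible in tests. One gap the ledger does not close: if the vector write succeeds and the ledger write then fails, the retry writes again, so the vector write itself should be keyed or upsert-style where the store allows it.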

Profile Embedding Throughput Before You Set a Trigger Interval

This seems obvious in retrospect. It wasn't obvious enough to do before the first trigger interval was set.

Embedding throughput is environment-specific. The number you get on a development cluster is not the number you get on a production cluster. The batch size that performs well at 1,000 records performs differently at 10,000. The model that embeds quickly on GPU takes meaningfully longer on CPU.

Profile in the environment where the pipeline will actually run, against batch sizes that reflect actual write volume, before committing to a trigger interval. The data from that profiling session drives two settings — trigger interval and maxFilesPerTrigger — that control the operational behavior of the pipeline for its entire lifespan.
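A minimal sketch of that profiling step, assuming embed_fn is whatever callable wraps your embedding model and takes a list of records. All names here are hypothetical, and the 1.5x headroom factor is an arbitrary starting point rather than a recommendation.

```python
import time


def measure_throughput(embed_fn, batch, repeats=3, timer=time.perf_counter):
    """Return records per second for embed_fn on this batch, based on the slowest run."""
    slowest = 0.0
    for _ in range(repeats):
        start = timer()
        embed_fn(batch)
        slowest = max(slowest, timer() - start)
    return len(batch) / slowest if slowest > 0 else float("inf")


def profile(embed_fn, make_batch, batch_sizes, timer=time.perf_counter):
    """Map each batch size to its measured records per second."""
    return {
        size: measure_throughput(embed_fn, make_batch(size), timer=timer)
        for size in batch_sizes
    }


def min_trigger_interval(records_per_trigger, records_per_second, headroom=1.5):
    """Smallest trigger interval, in seconds, that lets one micro-batch finish embedding."""
    return records_per_trigger / records_per_second * headroom
```

Taking the slowest of several runs is deliberately pessimistic, since the trigger interval has to survive bad runs and not only typical ones. The records_per_trigger input is where maxFilesPerTrigger comes in: roughly that setting multiplied by the typical records per file, which is another number to measure rather than guess.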

Decouple Optimize From the Streaming Job Earlier Than You Think You Need To

The optimize cadence conversation can feel premature early in the build. The table is small, fragments aren't accumulating yet, query latency looks fine. It's easy to defer.

Defer it too long, though, and the table has meaningful volume by the time you get to it. Now you're retrofitting a decoupled optimize job onto a running streaming pipeline. That retrofit is a non-trivial operational change to a production system: not dangerous, but avoidable.

Design the optimize job as a separate concern from the start. Even if it runs on the same cluster initially, structure it so it can be moved independently. The separation costs nothing when the table is small and protects you when the table is large.
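One way to keep that separation is to make the optimize job a standalone entry point that knows nothing about the streaming job. The sketch below assumes two hypothetical callables, get_fragment_count and optimize, that wrap whatever your vector store exposes. The thresholds are placeholders, not tuned values.

```python
from datetime import datetime, timedelta, timezone


def should_optimize(fragment_count, last_optimized_at, now,
                    max_fragments=100, max_age=timedelta(hours=1)):
    """Decide whether to compact, by fragment count or by time since the last run."""
    if fragment_count >= max_fragments:
        return True
    return now - last_optimized_at >= max_age


def run_optimize_job(get_fragment_count, optimize, last_optimized_at, now=None):
    """Standalone entry point: no reference to the streaming job or its state.

    Returns the timestamp of the most recent optimize run.
    """
    now = now or datetime.now(timezone.utc)
    if not should_optimize(get_fragment_count(), last_optimized_at, now):
        return last_optimized_at
    optimize()
    return now
```

Because the job only receives callables and a timestamp, it can start life on the same cluster as the stream and later move to its own schedule without touching the streaming code.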

Document What Gets Ruled Out and Why

The spike process produced a clear decision: no external CDC, no Kafka, no Fivetran. The existing stack already had everything needed. That decision was right and well-reasoned.

What almost didn't happen was documenting the reasoning explicitly before the build started.

Three months into production, someone asks why the pipeline doesn't use Kafka. If the ruling-out rationale isn't documented, that question reopens a decision that was already made correctly. You're relitigating architecture instead of building product.

Write down what you ruled out and why before you start. One page. The decision log pays for itself the first time someone asks a question you already answered.

Retrieval Quality Is a Separate Discipline

The pipeline can be architecturally sound and operationally clean (data moving continuously, vectors fresh, optimize cadence well tuned, end-to-end latency within budget) and still produce mediocre answers.

Retrieval quality lives downstream of data engineering. Chunk design, embedding model selection, query construction, context window management — these are ML engineering concerns, not pipeline concerns. The pipeline's job is to deliver clean, fresh, correctly structured vectors to the retrieval layer. Whether those vectors produce good answers is measured by frameworks like RAGAS and owned by someone with ML engineering depth.

Know where your lane ends. Build the best pipeline you can. Hand it cleanly to the people who own what happens next. That handoff, clearly defined and respected in both directions, is what makes an AI system actually work in production.

The Bigger Lesson

Every friction point in this series — schema evolution, foreachBatch idempotency, embedding throughput, fragment accumulation, optimize contention, latency compounding across layers — is a seam problem. The tools themselves work. The complexity lives at the handoffs between them.

That's not unique to this stack. It's true of every production data system worth building. The happy path tutorials cover the tools in isolation. Production is the seams.

If you're building this pipeline, build the seams first. Idempotency, observability, completeness validation, latency budgeting — none of these are features. They're the foundation. Everything else sits on top of them.

That's what this series was about. Not the tools. The seams.

This is the sixth and final post in "NRT Vector Search: From Spike to Production." If you're building something like this and want to talk through the architecture, LogicLens is where that conversation starts.

Clarity through the chaos.

Arjun Krishnamoorthi is the founder of LogicLens LLC, a fractional data architecture and AI consulting practice. If you have a data infrastructure problem or an AI project that needs senior hands — let's talk.