NRT Vector Search — Part 2: The Spike: What We Actually Found

Every implementation starts with a question that sounds simple until you try to answer it.

Ours was: how do we keep the vector index fresh without rebuilding it from scratch on every run?

That question is what kicked off the spike. What we found — and what we ruled out — is what this post is about.

What a Spike Actually Is

Not a prototype. Not a proof of concept. A spike is a time-boxed investigation with a specific question to answer and a decision to make at the end. You're not building something shippable. You're buying down uncertainty before you commit architecture to code.

The NRT spike had one job: determine whether the existing stack could support near-real-time vector search, and if so, how.

The Question Behind the Question

The surface question was freshness. The deeper question was whether solving freshness required new infrastructure.

That's always the real question in a spike. New infrastructure means new failure modes, new operational overhead, new expertise requirements, new cost. If the existing stack can do the job, that's almost always the right answer — not because it's easier, but because complexity you don't introduce can't bite you in production.

So before writing a line of code, the investigation started with what was already there.

What Was Already There

The existing pipeline landed source data into ADLS via replication, processed it through a Delta medallion architecture on Databricks, and wrote vectors to LanceDB at the Gold layer. The problem was that the LanceDB write ran as a batch job, triggered manually or on a schedule rather than continuously.

The stack itself was capable of more. Delta's transaction log is a native change cursor. Auto Loader picks up newly landed files in cloud storage incrementally. Structured Streaming reads the Delta log incrementally and processes micro-batches continuously. foreachBatch hands those batches to arbitrary write logic. LanceDB accepts appends at any cadence.

Every piece needed was already present. The question became one of configuration and wiring, not infrastructure.
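To make "configuration and wiring" concrete, here is a minimal sketch of how those pieces might connect. It is not the pipeline from the spike. The function names, parameters, and the one-minute trigger are all hypothetical, and collecting each micro-batch to the driver is a simplification that only suits small batches.

```python
from typing import Any, Callable, Dict, List


def make_batch_writer(
    append_rows: Callable[[List[Dict[str, Any]]], None],
) -> Callable[[Any, int], None]:
    """Build a foreachBatch handler that appends each micro-batch to a vector table.

    append_rows is injected so the handler does not care what the sink is.
    With LanceDB it could be the add method of an opened table.
    """

    def write_batch(batch_df: Any, batch_id: int) -> None:
        # Simplification: pull the micro-batch to the driver as plain dicts.
        rows = [row.asDict() for row in batch_df.collect()]
        if rows:  # skip empty micro-batches
            append_rows(rows)

    return write_batch


def start_stream(spark, gold_table: str, checkpoint_path: str,
                 lance_uri: str, lance_table: str):
    """Hypothetical wiring: stream a Delta table into an existing LanceDB table."""
    import lancedb  # imported lazily so the handler above is usable without it

    table = lancedb.connect(lance_uri).open_table(lance_table)
    return (
        spark.readStream.table(gold_table)           # incremental read of the Delta table
        .writeStream
        .foreachBatch(make_batch_writer(table.add))  # hand each batch to the writer
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")          # assumed cadence, tune as needed
        .start()
    )
```

Injecting the append function keeps the handler testable without a Spark cluster or a LanceDB instance, which matters once the handoff becomes the place where most of the care is needed.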

What Got Ruled Out

Three approaches were evaluated and dismissed early.

External CDC tooling. The instinct when you hear "near real time" is often to reach for a CDC pipeline — Debezium, AWS DMS, something that captures row-level change events at the source. In this case that instinct was wrong. The Delta transaction log already provides native change tracking. Structured Streaming already reads that log as a cursor, processing only new commits since the last checkpoint. Adding external CDC infrastructure on top of Delta would be solving a problem that was already solved, at the cost of a distributed system that needs to be operated.

Kafka. No event streaming source existed upstream of the pipeline. Data arrived in Delta via replication. Kafka solves the problem of high-throughput event ingestion from heterogeneous producers — that wasn't the problem here. Introducing Kafka would have added distributed messaging infrastructure to route data that was already where it needed to be.

Fivetran. A SaaS EL tool for pulling data from external source systems. By the time data reaches the Bronze Delta layer the ingestion problem is already solved. Fivetran operates upstream of where this pipeline begins. Not relevant.

The pattern across all three: each tool solves a real problem in the right context. None of them were in the right context here. Ruling them out wasn't a knock on the tools — it was recognizing where the actual work lived.
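The case against external CDC rests on one idea: an ordered commit log plus a stored position already behaves like a change feed. The toy model below illustrates that idea in plain Python. It is not how Delta or Structured Streaming is implemented, and every class and function name in it is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyCommitLog:
    """Append-only list of commits. The list index stands in for the commit version."""
    commits: List[List[Dict[str, Any]]] = field(default_factory=list)

    def commit(self, rows: List[Dict[str, Any]]) -> int:
        self.commits.append(list(rows))
        return len(self.commits) - 1  # version of the commit just written


@dataclass
class ToyCheckpoint:
    """Remembers the last commit version that was fully processed."""
    last_version: int = -1


def read_new_rows(log: ToyCommitLog, checkpoint: ToyCheckpoint) -> List[Dict[str, Any]]:
    """Return only rows from commits after the checkpoint, then advance it."""
    start = checkpoint.last_version + 1
    rows = [row for commit in log.commits[start:] for row in commit]
    checkpoint.last_version = len(log.commits) - 1
    return rows


log = ToyCommitLog()
cursor = ToyCheckpoint()
log.commit([{"id": 1}])
log.commit([{"id": 2}, {"id": 3}])
first_read = read_new_rows(log, cursor)   # sees all three rows
second_read = read_new_rows(log, cursor)  # nothing new, so empty
```

Once the log and the checkpoint exist, "what changed since last time" is a slice, not a separate system.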

What the Spike Confirmed

The existing stack — Delta, Databricks, Auto Loader, Structured Streaming, LanceDB — supports NRT vector search natively. Implementation is primarily configuration and tuning rather than new infrastructure.

That's the conclusion that unlocked the rest of the work. Not exciting as a headline. Extremely valuable as an architectural decision.

The spike also surfaced where the real complexity lived — not in getting data to move continuously, but in what happens at the seams between tools. The foreachBatch handoff. The LanceDB write and index cadence. The upstream replication boundary as the effective latency floor.
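As one illustration of the write and index cadence seam, the sketch below separates "append every batch" from "run maintenance occasionally". The class name, thresholds, and injected callable are assumptions made for this example. With LanceDB the callable would probably wrap the table's optimize step, but that wiring is not shown and would need checking against the installed version.

```python
import time
from typing import Callable


class OptimizeCadence:
    """Run a maintenance callable once every N batches or T seconds, whichever comes first."""

    def __init__(
        self,
        optimize: Callable[[], None],
        every_n_batches: int = 20,        # hypothetical default
        max_interval_s: float = 300.0,    # hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._optimize = optimize
        self._every_n = every_n_batches
        self._max_interval = max_interval_s
        self._clock = clock
        self._batches = 0
        self._last = clock()

    def after_batch(self) -> bool:
        """Call once per written micro-batch. Returns True if maintenance ran."""
        self._batches += 1
        now = self._clock()
        due = (
            self._batches >= self._every_n
            or (now - self._last) >= self._max_interval
        )
        if due:
            self._optimize()
            self._batches = 0
            self._last = now
        return due
```

A handler could call after_batch() at the end of each foreachBatch invocation, so appends stay cheap and the heavier maintenance work runs on its own schedule.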

Those seams are what the rest of this series is about.

The Decision Before the Build

A spike ends with a decision, not a demo. Ours was:

Proceed with NRT implementation using the existing stack. No new infrastructure. Primary implementation effort concentrated at the foreachBatch seam and LanceDB optimize cadence. Upstream replication cadence accepted as the latency floor.

That decision went into the architecture decision record before a single streaming job was written. When questions came up later about why certain choices were made, the answer was already documented.

That's what a spike is for.

Next: Part 3 — The foreachBatch Seam. Where Spark's guarantees end and your responsibilities begin.

Clarity through the chaos.

Arjun Krishnamoorthi is the founder of LogicLens LLC, a fractional data architecture and AI consulting practice. If you have a data infrastructure problem or an AI project that needs senior hands — let's talk.