This is Post 1 of The Production Ready Series — an opinionated, practitioner-first look at what it actually takes to operationalize an AI system. Not theory. Not vendor slides. Real architecture decisions made on real production systems.

"RAG is a retrieval problem. Production is an observability problem. And when something breaks, the business doesn't care what layer failed — they want to know what happened, when it happened, and what you're doing to make sure it never happens again."

The Gap Nobody Talks About

Most AI systems go to production with two things: a pipeline that runs and an endpoint that responds. That is it. No structured record of what the application did, what the user asked, how long each step took, or whether the result was any good.

When something goes wrong — and it will — the investigation starts with logs. Unstructured, scattered, hard to correlate. You are reconstructing a timeline manually, hopping between tools, trying to connect a user complaint to a specific request to a specific document to a specific processing event. It takes hours. Sometimes you never find it.

The business does not wait for that process. They want to know what happened, when it happened, and what you are doing to make sure it never happens again. "We are checking the logs" is not an answer. It is a credibility problem.

This is not a monitoring gap. It is an architectural gap. The application has no operational memory — no structured, queryable record of its own behavior. Without that layer you are not running an AI system in production. You are running an experiment that has not failed visibly yet.

This post is my blueprint for closing that gap. Five components. All opinionated. All built from real production work. Together they define what I consider a production-ready AI system.

Component 1: Operational Memory — The Event Layer

Operational memory is not logging. Logs are passive — they capture what the system felt like printing out. Operational memory is intentional. You decide what matters, you define the structure, and you build a layer that captures it consistently across every interaction the application has.

For an AI application that means six tables, each owning a distinct slice of system behavior.

ApplicationHealthEvents — every endpoint called, latency per request, and any errors returned. The app-facing request record. This is the first place you look when the business asks what happened at 2pm yesterday.

DocProcessingEvents — every document that moved through ingestion, parsing, and processing. The application's internal behavior captured at the document level. What got processed, when, and what happened during it.

PipelineEvents — Databricks pipeline execution telemetry. Job runs, task durations, batch identifiers. The infrastructure layer's operational record.

OperationalMetrics — LanceDB and infrastructure signals in a flexible key/value structure. Metric name in one column, numeric value in another. Blob exists, last captured timestamp, index state. Designed to evolve without a schema change every time a new signal gets added.

UserAdvancedSearchEvents — user query activity and advanced search behavior. What users are actually asking the system to do.

HeartbeatEvents — the one that got missed in the MVP schema and gets added in v1. A regular structured pulse from each service confirming it is running and healthy. Embarrassingly basic. Absolutely necessary.

Six tables. Six perspectives on the same system. Together they give you something no log file can — a correlated, structured, queryable picture of what your AI application actually did, how it performed, and whether it was healthy when it did it.
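To make the key/value idea behind OperationalMetrics concrete, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for the real store. The column names (metric_id, source, metric_name, metric_value, captured_at) and the metric names are my assumptions for illustration, not the actual schema.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical key/value layout for OperationalMetrics; column names are
# illustrative assumptions, not the production schema.
DDL = """
CREATE TABLE IF NOT EXISTS OperationalMetrics (
    metric_id    INTEGER PRIMARY KEY,   -- would be a BIGINT identity column in a warehouse
    source       TEXT NOT NULL,         -- e.g. 'lancedb' or 'blob_storage'
    metric_name  TEXT NOT NULL,
    metric_value REAL NOT NULL,
    captured_at  TEXT NOT NULL          -- ISO-8601 UTC timestamp
)
"""


def record_metric(conn, source, metric_name, metric_value, captured_at=None):
    """Insert one signal as a row; a new signal needs no schema change."""
    captured_at = captured_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO OperationalMetrics "
        "(source, metric_name, metric_value, captured_at) VALUES (?, ?, ?, ?)",
        (source, metric_name, float(metric_value), captured_at),
    )


def latest_metric(conn, source, metric_name):
    """Return the most recently captured value for a metric, or None."""
    row = conn.execute(
        "SELECT metric_value FROM OperationalMetrics "
        "WHERE source = ? AND metric_name = ? "
        "ORDER BY captured_at DESC, metric_id DESC LIMIT 1",
        (source, metric_name),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
record_metric(conn, "blob_storage", "blob_exists", 1, "2024-01-01T14:00:00+00:00")
record_metric(conn, "lancedb", "index_row_count", 1200, "2024-01-01T14:00:00+00:00")
# A brand-new signal is just another row, not a new column.
record_metric(conn, "lancedb", "unindexed_row_count", 35, "2024-01-01T14:05:00+00:00")
```

The trade-off is that everything is numeric and loosely typed by name, which is why this shape suits infrastructure signals rather than the domain tables.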

And yes — it took a schema review to catch the most basic one. That is how this work actually goes.
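A heartbeat only pays off if something notices when it stops. As a rough sketch, assuming each service's latest heartbeat timestamp has already been read out of HeartbeatEvents, a staleness check can be this small. The service names and the two-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale_services(last_heartbeats, now, max_silence=timedelta(minutes=2)):
    """Return the services whose latest heartbeat is older than max_silence.

    last_heartbeats maps service name -> datetime of its latest heartbeat.
    """
    return sorted(
        service
        for service, last_seen in last_heartbeats.items()
        if now - last_seen > max_silence
    )


# Hypothetical services and timestamps, for illustration only.
now = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
last_heartbeats = {
    "search-api": now - timedelta(seconds=30),
    "doc-processor": now - timedelta(minutes=10),
}
stale = find_stale_services(last_heartbeats, now)
```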

Component 2: Schema Discipline — The Maturity Conversation

Moving from MVP to v1 is not a technical milestone. It is a maturity milestone.

The services are more defined. The team's understanding of what the application needs to do in production is sharper. And the metrics layer — which started as good enough — needs to grow up alongside both.

That is what the schema review is really about. Not just columns and types. It is the team asking harder questions together. What are we actually capturing? What does production accountability require that MVP did not? Where are the gaps we have not admitted yet?

Bringing in the testing resource as part of that conversation is intentional. V1 means the layer gets verified, not just designed. Testing needs to be aligned on the schema before it gets built out further. That is how a maturing team operates — the right people in the room at the right moment, not after the decisions are already made.

The output was alignment. Consistent typing. Bigints for identity columns. Naming conventions built to last. And the one thing nobody had on their list — HeartbeatEvents. A basic structured pulse confirming each service is alive and healthy. The review surfaced it. The team owned it. It goes in v1.
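Conventions hold up better when something checks them. The sketch below is one possible way to lint a table definition against the outcomes of that review. The specific rules are my assumptions for illustration: snake_case column names, and any column ending in _id treated as an identity column that must be BIGINT.

```python
import re

# Assumed convention: lower-case words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_table(table_name, columns):
    """Return a list of convention violations for one table.

    columns maps column name -> declared type as an upper-case string.
    """
    problems = []
    for name, col_type in columns.items():
        if not SNAKE_CASE.match(name):
            problems.append(f"{table_name}.{name}: not snake_case")
        if name.endswith("_id") and col_type != "BIGINT":
            problems.append(
                f"{table_name}.{name}: identity column should be BIGINT, got {col_type}"
            )
    return problems


# Hypothetical draft of HeartbeatEvents with two deliberate violations.
draft = {"heartbeat_id": "INT", "service_name": "STRING", "CapturedAt": "TIMESTAMP"}
problems = check_table("HeartbeatEvents", draft)
```

Run against every table in the layer, a check like this turns the review's agreements into something the testing resource can verify rather than remember.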

That is what maturing a service looks like. Not heroics. Methodical, collaborative, unglamorous — and exactly right.

Component 3: Pipeline Instrumentation and the Vector Store Gap

Databricks telemetry gets talked about like it is a missing piece. In practice the UI gives you most of what you need for batch pipeline operations today. Job run history, task duration, retry counts, failure states — it is there. Building a parallel instrumentation layer on top of it for its own sake is not the priority.

The priority is discipline at the pipeline level. Small, consistent additions that compound over time. An ingest_date column on every table: stamped at ingestion, costs nothing, gives you a reliable timeline anchor for every record. Null checks per object baked into the pipeline logic. Test conditions that stop the pipeline when data quality thresholds are not met. Not glamorous, but it is what makes a pipeline trustworthy.
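Here is a minimal sketch of those three habits together: stamp ingest_date, measure null rates, and stop when a threshold is breached. It uses plain Python lists of dicts as a stand-in for DataFrames so it runs anywhere. The function names, the exception class, and the thresholds are all hypothetical.

```python
from datetime import date


class DataQualityError(Exception):
    """Raised to stop the pipeline when a quality threshold is not met."""


def stamp_ingest_date(records, ingest_date=None):
    """Return copies of the records with an ingest_date column added."""
    ingest_date = ingest_date or date.today()
    return [{**record, "ingest_date": ingest_date.isoformat()} for record in records]


def null_rate(records, column):
    """Fraction of records where the column is missing or None."""
    if not records:
        return 0.0
    nulls = sum(1 for record in records if record.get(column) is None)
    return nulls / len(records)


def enforce_null_thresholds(records, thresholds):
    """Raise DataQualityError if any column exceeds its allowed null rate."""
    for column, max_rate in thresholds.items():
        rate = null_rate(records, column)
        if rate > max_rate:
            raise DataQualityError(
                f"{column}: null rate {rate:.2%} exceeds threshold {max_rate:.2%}"
            )
    return records


# Hypothetical batch: one of the two titles is missing.
batch = stamp_ingest_date(
    [{"doc_id": 1, "title": "a"}, {"doc_id": 2, "title": None}],
    date(2024, 1, 1),
)
```

Calling enforce_null_thresholds(batch, {"title": 0.1}) raises, which is the point: the bad batch never reaches the tables downstream.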

When something goes wrong at the data layer, Delta Lake time travel is your first tool. Not logs. Not the UI. You query the table as it existed at a specific point in time: what did the data look like before that job ran, what changed, and when did it change? That is the forensic layer Databricks gives you natively, and it is genuinely powerful. Used correctly, it answers the what-happened question at the pipeline and table level without needing a separate instrumentation layer to tell you.
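As a sketch of what that looks like in practice, the helpers below build Delta time travel queries using the TIMESTAMP AS OF clause and DESCRIBE HISTORY. They only build SQL strings, so they run without a cluster. The table name and timestamps are hypothetical, and the inputs are assumed to be trusted values, not user input.

```python
def table_as_of(table, timestamp):
    """SELECT the table as it existed at a point in time (Delta time travel)."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def rows_added_between(table, before, after):
    """Rows present at `after` that were not present at `before`."""
    return f"{table_as_of(table, after)} EXCEPT {table_as_of(table, before)}"


def table_history(table):
    """List the table's commits: version, timestamp, operation, and so on."""
    return f"DESCRIBE HISTORY {table}"


# Hypothetical table name and timestamps, for illustration only.
query = rows_added_between(
    "docs.processed_documents", "2024-01-01 13:00:00", "2024-01-01 15:00:00"
)
# On a cluster with a SparkSession named `spark`: spark.sql(query).show()
```

How far back you can travel depends on the table's retention settings, so this works for recent forensics, not as a long-term archive.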

For the near-real-time (NRT) pipeline, the design uses foreachBatch as the processing pattern. That choice is deliberate, and batch_id is the correlation key that makes it operationally visible. It threads through the telemetry, connecting each micro-batch to what happened downstream. Without it you have two separate stories. The pipeline ran. LanceDB responded. But you cannot connect them. batch_id makes that connection explicit and queryable.
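A minimal sketch of that threading, assuming a handler with the (batch_df, batch_id) signature that Structured Streaming's foreachBatch passes to its function. The process_batch and telemetry_sink callables, the telemetry field names, and the FakeBatch stand-in are hypothetical. FakeBatch exists only so the sketch runs without a cluster.

```python
import time
from datetime import datetime, timezone


def make_batch_handler(process_batch, telemetry_sink):
    """Build a function with the (batch_df, batch_id) signature foreachBatch expects."""

    def handle_batch(batch_df, batch_id):
        started = time.perf_counter()
        status = "succeeded"
        try:
            process_batch(batch_df, batch_id)
        except Exception:
            status = "failed"
            raise
        finally:
            # One telemetry row per micro-batch, keyed on batch_id.
            # Note: count() is an extra Spark action; drop it if the cost matters.
            telemetry_sink(
                {
                    "batch_id": batch_id,
                    "row_count": batch_df.count(),
                    "duration_ms": (time.perf_counter() - started) * 1000,
                    "status": status,
                    "captured_at": datetime.now(timezone.utc).isoformat(),
                }
            )

    return handle_batch


class FakeBatch:
    """Stand-in for a micro-batch DataFrame; only count() is needed here."""

    def __init__(self, rows):
        self.rows = rows

    def count(self):
        return len(self.rows)


telemetry_rows = []
processed_ids = []
handler = make_batch_handler(
    lambda batch_df, batch_id: processed_ids.append(batch_id), telemetry_rows.append
)
handler(FakeBatch(["doc-a", "doc-b", "doc-c"]), 7)

# Wiring on a real stream would look like:
#   stream_df.writeStream.foreachBatch(handler).start()
```

Whatever writes to LanceDB inside process_batch can stamp the same batch_id on its own records, which is what joins the two stories later.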

And that is where the real gap is. LanceDB. The vector store is a black box in a way that pipeline execution is not. Query latency, index state at query time, scan type (full scan or approximate nearest neighbor), result counts returned. None of that exists in structured, queryable form today. For a production RAG system that is not acceptable. Retrieval is the core of the application. If you cannot see what the vector store is doing, you cannot reason about why the application behaves the way it does.

The exact LanceDB telemetry schema firms up as the NRT solution matures. What is locked is the need. This is not instrumentation for its own sake — it is the visibility layer for the most critical component in the stack.
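Until that schema settles, one possible shape is a thin wrapper around whatever function performs the search. This is a sketch under stated assumptions, not the final design: search_fn is any callable you supply, the field names are hypothetical, and scan_type and index_state are passed in by the caller because this sketch does not detect them.

```python
import time
from datetime import datetime, timezone


def traced_search(search_fn, query_vector, *, limit, scan_type, index_state,
                  batch_id, sink):
    """Run a vector search and record one structured telemetry row.

    search_fn takes (query_vector, limit) and returns a list of results.
    """
    started = time.perf_counter()
    results = search_fn(query_vector, limit)
    sink(
        {
            "query_latency_ms": (time.perf_counter() - started) * 1000,
            "result_count": len(results),
            "scan_type": scan_type,      # "ann" or "full_scan", caller-supplied
            "index_state": index_state,  # caller-supplied label, e.g. "indexed"
            "batch_id": batch_id,        # last micro-batch known to be written
            "captured_at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return results


def fake_search(query_vector, limit):
    """Stand-in for a real vector store query; returns at most two rows."""
    return [{"id": i} for i in range(min(limit, 2))]


vector_rows = []
results = traced_search(
    fake_search, [0.1, 0.2], limit=5, scan_type="ann", index_state="indexed",
    batch_id=42, sink=vector_rows.append,
)
```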

Component 4: Request Tracing and Playback

A structured event layer earns its value when something goes wrong. That is when the questions get specific. Not "is the system healthy" but "what exactly did the application do for this user, at this time, in this sequence."

There are two ways that question gets asked in practice.

The first is single request forensics. A user got a bad result. Or no result. You take the request identifier, pull the ApplicationHealthEvents record, and walk the chain. What endpoint was called. What retrieval fired. What documents were in scope. What LanceDB returned. How long each step took. The event tables are the thread — each one capturing its domain slice of the same request. Together they reconstruct the full picture without touching a single log file.
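Walking the chain can be sketched as one query across the event tables, ordered by time. This example uses sqlite3 and trimmed-down versions of two tables. The shared request_id column, the other column names, and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical, trimmed-down versions of two event tables sharing request_id.
SCHEMA = """
CREATE TABLE ApplicationHealthEvents (
    request_id INTEGER, endpoint TEXT, latency_ms REAL, event_time TEXT
);
CREATE TABLE UserAdvancedSearchEvents (
    request_id INTEGER, query_text TEXT, result_count INTEGER, event_time TEXT
);
"""

TRACE_SQL = """
SELECT event_time, 'ApplicationHealthEvents' AS source, endpoint AS detail
FROM ApplicationHealthEvents WHERE request_id = :rid
UNION ALL
SELECT event_time, 'UserAdvancedSearchEvents', query_text
FROM UserAdvancedSearchEvents WHERE request_id = :rid
ORDER BY event_time
"""


def trace_request(conn, request_id):
    """Return every event recorded for one request, oldest first."""
    return conn.execute(TRACE_SQL, {"rid": request_id}).fetchall()


trace_conn = sqlite3.connect(":memory:")
trace_conn.executescript(SCHEMA)
trace_conn.executemany(
    "INSERT INTO ApplicationHealthEvents VALUES (?, ?, ?, ?)",
    [
        (1001, "/search", 840.0, "2024-01-01T14:00:00"),
        (1002, "/search", 95.0, "2024-01-01T14:00:05"),
    ],
)
trace_conn.execute(
    "INSERT INTO UserAdvancedSearchEvents VALUES (?, ?, ?, ?)",
    (1001, "contract renewal terms", 0, "2024-01-01T14:00:01"),
)
timeline = trace_request(trace_conn, 1001)
```

The more tables carry the same correlation key, the more of the chain one query like this can reconstruct.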

The second is time window replay. Not one request — a sequence. What did the application do between 2pm and 4pm yesterday. Which endpoints were hammered. Where did latency spike. Did document processing keep up with the request volume or fall behind. You are walking through the application's behavior in order, reconstructing a timeline from structured queryable data rather than grepping through logs hoping to find the right line.
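Window replay is the same idea aggregated over time. A minimal sketch, again with sqlite3 as a stand-in: bucket ApplicationHealthEvents by hour and endpoint, then count calls and find the worst latency. The column names and sample rows are hypothetical, and the timestamps are assumed to be fixed-format ISO-8601 strings so that text comparison orders them correctly.

```python
import sqlite3

REPLAY_SQL = """
SELECT substr(event_time, 1, 13) AS hour_bucket,
       endpoint,
       COUNT(*) AS calls,
       MAX(latency_ms) AS max_latency_ms
FROM ApplicationHealthEvents
WHERE event_time >= ? AND event_time < ?
GROUP BY hour_bucket, endpoint
ORDER BY hour_bucket, endpoint
"""


def replay_window(conn, start, end):
    """Summarize endpoint activity per hour within [start, end)."""
    return conn.execute(REPLAY_SQL, (start, end)).fetchall()


replay_conn = sqlite3.connect(":memory:")
replay_conn.execute(
    "CREATE TABLE ApplicationHealthEvents "
    "(endpoint TEXT, latency_ms REAL, event_time TEXT)"
)
replay_conn.executemany(
    "INSERT INTO ApplicationHealthEvents VALUES (?, ?, ?)",
    [
        ("/search", 120.0, "2024-01-01T14:05:00"),
        ("/search", 950.0, "2024-01-01T14:40:00"),
        ("/documents", 80.0, "2024-01-01T15:10:00"),
        ("/search", 110.0, "2024-01-01T16:30:00"),  # outside the window below
    ],
)
summary = replay_window(replay_conn, "2024-01-01T14:00:00", "2024-01-01T16:00:00")
```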

Both use cases live on the same foundation. The event taxonomy, the consistent typing, the correlation keys threading through every table. Build that correctly and both become possible.

Component 5: Closing the Loop — AI Observability and Where It All Lands

The metrics layer is not the final destination. It is the foundation that makes the final destination possible.

A proper APM platform is where operational intelligence surfaces for the business. Not SQL queries, not notebooks, not a dashboard someone bolted on after the fact. AI-specific metrics, structured alerting, and the kind of visibility that lets an engineering team answer the question before it gets asked.

But an observability platform is only as good as what you feed it. Most teams push raw logs or unstructured telemetry directly into APM tooling and wonder why the dashboards are noisy and hard to act on. The signal is not there because the structure is not there.

That is what the event layer provides. Clean, typed, domain-specific tables that feed downstream tooling with signal instead of noise. User search activity becomes request latency trends and endpoint utilization. Document processing becomes ingestion throughput and parsing failure rates. HeartbeatEvents becomes uptime and availability. The metrics layer becomes the AI-specific performance story that actually means something to the people asking the questions.
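Two of those translations can be sketched in a few lines. The definitions here are my assumptions, not a standard: uptime is the fraction of expected heartbeat intervals in a window that contain at least one heartbeat, and parsing failure rate is failed events over total events. The status values and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def uptime_ratio(heartbeats, window_start, window_end, interval):
    """Fraction of intervals in [window_start, window_end) with a heartbeat."""
    total_slots = int((window_end - window_start) / interval)
    if total_slots <= 0:
        return 0.0
    covered = {
        int((ts - window_start) / interval)
        for ts in heartbeats
        if window_start <= ts < window_end
    }
    return len(covered) / total_slots


def parsing_failure_rate(doc_events):
    """Share of document processing events whose status is 'failed'."""
    if not doc_events:
        return 0.0
    failed = sum(1 for event in doc_events if event["status"] == "failed")
    return failed / len(doc_events)


# Hypothetical ten-minute window with heartbeats for the first eight minutes.
start = datetime(2024, 1, 1, 14, 0)
end = start + timedelta(minutes=10)
beats = [start + timedelta(minutes=m, seconds=5) for m in range(8)]
uptime = uptime_ratio(beats, start, end, timedelta(minutes=1))

doc_events = [
    {"status": "parsed"},
    {"status": "parsed"},
    {"status": "failed"},
    {"status": "parsed"},
]
failure_rate = parsing_failure_rate(doc_events)
```

Because the inputs are typed event rows rather than log lines, the observability platform receives numbers like these directly instead of having to parse them out of text.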

This is the direction every production AI architecture should be heading. You do not retrofit observability onto a production AI system. You build toward it from the first schema decision.

Which is why every disciplined decision in this blueprint matters. Consistent typing. batch_id threading through NRT telemetry. HeartbeatEvents added before v1 ships. Each one a decision that pays forward when the data hits the observability layer.

The business wants to know what happened, when it happened, and what you are doing to make sure it never happens again. This blueprint is what makes that answer possible.

The operational blueprint above was built for a RAG pipeline. Multi-agent workflows do not simplify this problem — they multiply it. Every agent is a node. Every handoff is an event. Every tool call is a latency measurement. The foundation is the same. The stakes are higher. That is a series for another day.

What's Next in This Series

Post 2 — First Class or Useless: Domain tables, schema discipline, and the decisions that make the layer hold up over time
Post 3 — Two Engineers. One Schema. No Shortcuts.: How two engineers align on a shared data contract and why that process matters as much as the output
Post 4 — What Databricks Gives You and What It Doesn't: What Databricks gives you natively and where you still need to build
Post 5 — Follow the Request: Walking a request chain from endpoint to vector store and back
Post 6 — Closing the Loop: Feeding clean signal to your observability platform and making the business conversation possible