Stop Fighting Your Documents. Let the JSON Win.

There's a conversation that happens on every enterprise document project. Someone opens a Teams thread about column size. A varchar(2000) wasn't big enough. Content got dropped. Now there's a meeting about it.

That conversation is not a column size problem. It's an architecture problem.

The moment the schema breaks

You have complex operational documents flowing into your system. Variable formats, inconsistent structure, different layouts across document types. You stand up Azure Document Intelligence to parse them. It works beautifully — documents come back broken into sections, paragraphs, tables, headings. Clean structured output.

Then you try to land that output into SQL Server and everything falls apart.

To insert into SQL Server you have to know what's coming before it arrives. Every field needs a column. Every column needs a data type. Every data type needs a size. So you define your schema — title, three section headings, three paragraphs, a handful of table cells. You map the Document Intelligence output to those columns and insert.

Here's what actually happens:

# Must define exact fields upfront.
# What if the doc doesn't have these? Null. Dead letter. Hope for the best.
document_data = {
    "document_title": None,
    "section_heading_1": None,
    "section_heading_2": None,
    "section_heading_3": None,
    "paragraph_1": None,
    "paragraph_2": None,
}

for para in result.paragraphs:
    if para.role == "sectionHeading":
        section_heading_count += 1
        key = f"section_heading_{section_heading_count}"
        if key in document_data:
            document_data[key] = para.content[:500]  # truncate to fit varchar
        # else — silently drop it. No column for it.

    elif para.role == "paragraph":
        paragraph_count += 1
        key = f"paragraph_{paragraph_count}"
        if key in document_data:
            document_data[key] = para.content[:2000]
        # else — silently drop it. Dead letter territory.

Count the silent drops. Content truncated to fit varchar. Section headings beyond three discarded. Every table after the first one gone. Paragraphs that don't fit the predefined slots dropped without a trace.

A new document type arrives with four section headings. Schema change. Migration. Another Teams thread. The varchar(2000) wasn't enough. Another thread.

This is not a column size problem. The schema is fighting the documents and losing.

Why a dead letter queue isn't the answer

The natural response is a dead letter queue — catch the documents that don't conform, route them somewhere, deal with them later.

A dead letter queue is a legitimate pattern. For late arriving facts in a streaming pipeline, for out of order events, for transient failures — absolutely. But using a dead letter queue because your database can't handle document variability is not a solution. It's an infrastructure component built to manage a problem that shouldn't exist in the first place.

The document variability is not going away. There will always be more document types. There will always be inconsistent structure. You cannot define your way out of it upfront.

So stop trying.

Two passes. No fighting.

The architectural shift is simple. Stop imposing structure on documents before you understand them. Let the documents tell you what they are.

Pass 1 — Land the JSON. No decisions required.

Document Intelligence parses the document and returns structured output. Take that output, shape it into a Python dictionary, store it in Postgres JSONB. Done.

document_json = {
    "source_file": file_path,
    "paragraphs": [],
    "tables": [],
}

for para in result.paragraphs:
    document_json["paragraphs"].append({
        "role": para.role,       # title, sectionHeading, paragraph, footnote
        "content": para.content  # full content — nothing truncated
    })

for i, table in enumerate(result.tables):
    document_json["tables"].append({
        "table_index": i,        # all tables — not just the first one
        "row_count": table.row_count,
        "column_count": table.column_count,
        "cells": [
            {
                "row_index": cell.row_index,
                "column_index": cell.column_index,
                "content": cell.content
            }
            for cell in table.cells
        ]
    })

No truncation. No silent drops. No schema decisions. Everything Document Intelligence returns gets stored exactly as it came back.

The Postgres table is three columns:

CREATE TABLE operational_documents (
    id SERIAL PRIMARY KEY,
    source_file TEXT,
    document_data JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_documents_data
ON operational_documents USING GIN (document_data);

JSONB stores documents in binary format — decomposed on write, pre-parsed for reads. You pay a small write cost once. Every query into the structure after that is indexed and fast.

The GIN index is what makes this work at scale. GIN stands for Generalized Inverted Index. Unlike a standard B-tree index that indexes a column value, GIN indexes the contents inside the JSONB document — every key, every value, every element. Without it, querying across thousands of documents means a full table scan into the binary structure on every request. With it, Postgres knows exactly where values live inside the stored JSON and goes straight there.

-- Fast because of GIN — no full table scan
SELECT * FROM operational_documents
WHERE document_data @> '{"role": "sectionHeading"}';

-- Find all documents that contain a specific key
SELECT * FROM operational_documents
WHERE document_data ? 'aircraft_tail_number';

As the document corpus grows the GIN index is what keeps retrieval fast. It's a one-time architectural decision that pays forward on every query from that point on.

Both JSONB and GIN index support have been available since PostgreSQL 9.4, released in 2014. This is not a new feature or a bleeding edge bet — it has been battle tested in production for over a decade. It is available in Azure Database for PostgreSQL and in the more restrictive Azure GovCloud environment. If it runs there, it runs anywhere.

A new document type arrives. No ALTER TABLE. No migration. No Teams thread. It lands in JSONB and you understand it on your timeline, not before.

Pass 2 — Let the data tell you what structure it needs.

After real documents have flowed through the system, patterns emerge. Certain fields show up consistently across every document type. Those fields are candidates to promote to proper Postgres columns — indexed, typed, filterable. Everything else stays in JSONB.

The schema emerges from the data instead of being imposed before you've seen a single document. That's the difference.

And Pass 2 isn't a manual process. It's part of the codebase. The pipeline inspects the JSONB, detects which fields are consistent across documents, and promotes them automatically — typed columns, backfill, indexes. No human review. No waiting. No ALTER TABLE tickets.

import json
from sqlalchemy import text

PROMOTION_RULES = {
    "document_date": "DATE",
    "document_type": "TEXT",
    "source_system": "TEXT",
}

async def promote_consistent_fields(session):
    """
    Inspect JSONB across all documents.
    Promote fields that appear consistently to typed columns.
    Runs as part of the pipeline — no manual intervention required.
    """
    for field, pg_type in PROMOTION_RULES.items():

        exists = await session.execute(text("""
            SELECT column_name FROM information_schema.columns
            WHERE table_name = 'operational_documents'
            AND column_name = :field
        """), {"field": field})

        if exists.fetchone():
            continue

        result = await session.execute(text("""
            SELECT COUNT(*) as total,
                   COUNT(document_data ->> :field) as has_field
            FROM operational_documents
        """), {"field": field})

        row = result.fetchone()
        coverage = row.has_field / row.total if row.total > 0 else 0

        if coverage >= 0.95:
            await session.execute(text(f"""
                ALTER TABLE operational_documents
                ADD COLUMN IF NOT EXISTS {field} {pg_type}
            """))

            await session.execute(text(f"""
                UPDATE operational_documents
                SET {field} = (document_data ->> '{field}')::{pg_type}
                WHERE {field} IS NULL
            """))

            await session.execute(text(f"""
                CREATE INDEX IF NOT EXISTS idx_{field}
                ON operational_documents ({field})
            """))

            await session.commit()

The pipeline runs Pass 1 on every document. Pass 2 runs behind it, continuously. When a field crosses the consistency threshold it gets promoted — typed, indexed, queryable as a proper column. The rest of the document stays in JSONB.

Fast typed column filters for the fields that earned it. Full JSONB flexibility for everything else. No waiting, no review, no tickets. The codebase handles it.

SQL Server required you to make this decision before document one. This stack makes it automatically after the data proves it's ready.

Why Postgres fits the modern API stack

The database choice doesn't exist in isolation. It has to fit the stack it's being built into.

FastAPI is the natural API layer for Python-first data applications — async native, lightweight, built for exactly this kind of document retrieval and AI-assisted query pattern. Postgres is the better fit for that stack for specific reasons. Native JSONB means Document Intelligence output flows from a Python dictionary directly into the database and back out to a FastAPI response with no impedance mismatch — the data shape never has to change. SQLAlchemy async connects to Postgres cleanly with first class support. Alembic manages schema migrations when you promote fields from JSONB to columns — versioned, tracked, repeatable. Pydantic, which FastAPI is built on, serializes directly to and from Python dictionaries, which is exactly what your JSONB documents are after Document Intelligence parses them. The whole stack speaks the same language.

# The data flow is clean end to end
# Document Intelligence → dict → Postgres JSONB → FastAPI response
# No driver gymnastics. No schema impedance mismatch.

from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import text

async def get_documents_by_role(session: AsyncSession, role: str):
    result = await session.execute(
        text("""
            SELECT id, source_file, document_data
            FROM operational_documents
            WHERE document_data @> :filter
        """),
        {"filter": json.dumps({"role": role})}
    )
    return result.fetchall()

SQL Server in the same stack means pyodbc, different driver configuration, synchronous connection handling by default, and friction at every layer where Python expects Postgres conventions. It can be made to work. It just fights you the whole way.

To be fair, SQL Server does have JSON support — JSON_VALUE, JSON_QUERY, and OPENJSON exist. But it is bolt-on, not built-in. Three things that matter are missing: it is not a native type, there is no JSONB, and there is no GIN index. JSON is stored as NVARCHAR. You are working around the database, not with it. That is not a first class citizen. That is a workaround with a name.

Postgres doesn't fight you. It's the right tool for the stack.

Where similarity search fits in

Once your documents are in Postgres JSONB the next problem is retrieval. When the system needs to answer a question, it needs relevant context fast — not keyword matching, but semantic similarity. Finding documents that are conceptually related even when the exact words don't match.

Two use cases that matter across any document-heavy operation:

Finding relevant precedent across large document sets. A new document arrives — what existing documents are most similar? What prior cases, decisions, or records are conceptually related? This is semantic retrieval at scale.

Context retrieval for AI-assisted decision making. When the AI needs to answer a question it doesn't scan every document. It finds the most semantically relevant chunks first, pulls those as context, then generates its answer. The quality of the retrieval directly determines the quality of the answer.

The two-pass architecture sets this up cleanly. Pass 1 lands the full document in JSONB. The same pipeline generates embeddings from the parsed content and stores them in a vector index alongside Postgres. When a query comes in, similarity search finds the relevant chunks, JSONB returns the full context, and the AI has what it needs.

Rigid schema breaks this. You can't generate meaningful embeddings from truncated content or silently dropped paragraphs. You need the full document. JSONB gives you the full document.

The flip

The argument is straightforward. SQL Server forces you to understand your documents before you ingest them. JSONB lets you ingest first and understand second.

In a world of variable format documents you don't control — documents that change, documents you haven't seen yet — requiring upfront understanding is not a constraint you can meet. The schema will always be wrong. The dead letter queue will always be busy. The Teams thread about column size will keep coming.

Stop defining structure you don't have yet. Land the JSON. Let the data tell you what it is. Promote structure when the patterns are clear, not before.

The document variability doesn't go away. You just stop caring about it.

Clarity through the chaos.

Arjun Krishnamoorthi is the founder of LogicLens LLC, a fractional data architecture and AI consulting practice. If you have a data infrastructure problem or an AI project that needs senior hands — let's talk.