
◆◆◆ Advanced · asked at Google, AWS, Microsoft, Stripe, Ramp

Design an Intelligent Document Processing Pipeline

Q: When should you use an end-to-end vision-language model instead of a traditional OCR pipeline?

End-to-end VLMs (GOT-OCR 2.0, Mistral OCR 4, Qwen2.5-VL-72B) skip the multi-stage rasterize-deskew-OCR-layout-LLM chain and handle everything in one forward pass, eliminating cascading error propagation. They shine on handwriting, complex tables, and documents where layout context is essential. The trade-off: they cost more per page, provide less deterministic bounding-box provenance, and are harder to debug when they fail. Most production systems use a hybrid: OCR for bulk text and structured fields, VLM for problem cases like tables or handwritten forms flagged by a quality signal.

Q: What does confidence scoring actually measure and why is calibration critical?

Confidence in a document extraction system is typically a composite of three signals: OCR engine character-level probability, LLM log-probability of generated tokens, and downstream validation rule pass/fail. A score of 0.90 should mean the extracted field is correct roughly 90% of the time — that is calibration. Uncalibrated logprobs often overstate confidence on out-of-distribution document formats; a model that has never seen a particular vendor invoice layout may output 0.95 confidence on a completely wrong value. Calibration requires empirical testing against a labeled ground-truth sample across the real distribution of incoming documents.

Q: How do you handle multi-page documents, rotated scans, and poor image quality?

Preprocessing must run before any model sees the document. At minimum: rasterize PDFs at 300 DPI (below 200 DPI, Tesseract CER degrades 3–5x), deskew using a Hough transform (even 2–3° rotation cuts recognition by ~10%), contrast-normalize, and detect 90/180/270° rotations with a fast classifier before running OCR. For genuinely low-quality scans, a restoration step using diffusion-based image enhancement (PreP-OCR pipeline) reduces character error rate by 63–70% before text recognition even runs. Page count is handled by splitting the document at ingest into per-page units, processing pages in parallel, then assembling ordered results by page index.

Q: What is the build-vs-buy decision point for managed document AI APIs?

Managed APIs (AWS Textract, Google Document AI, Azure Document Intelligence) cost $1.50–$65 per 1,000 pages depending on feature tier. Self-hosted open-source stacks (DocTR, Docling, GOT-OCR, Mistral OCR 4 in a container) require GPU infrastructure investment but cut per-page cost 10–50x. At 1 million pages/month processing forms and tables, AWS Textract costs ~$65,000/month while Mistral OCR 4 batch API costs ~$4,000/month ($4/1,000 pages) — a ~16x spread. The crossover point where self-hosting pays off is roughly 100K–500K pages/month for a team that can manage GPU infra. Below that, managed APIs are cheaper when you include engineering time.

Q: Why does line-item extraction accuracy drop so dramatically compared to header field accuracy?

Header fields like invoice_date and vendor_name appear once, in predictable locations, often with labeled keys nearby. Line items are a table: variable number of rows, inconsistent column widths, cells that span rows, totals that repeat information from rows above. Most models show a significant accuracy gap between headers and line items — anywhere from 6 points (Azure Document Intelligence: 93% overall vs 87% on line items) up to 40 points or more (GPT-4o+OCR: 98% overall vs 57% on line items; Google Document AI: 82% overall vs 40% on line items). The implication: do not benchmark only on header fields and assume line items will follow. If your use case requires line-item accuracy, test specifically for it and expect to use a specialized table extraction model alongside your general extractor.

Turn millions of messy PDFs, scans, and invoices into validated structured JSON at scale — the end-to-end pipeline covering OCR, layout analysis, LLM-based field extraction, confidence-scored routing, human-in-the-loop review, and the cost math that determines build-vs-buy.

28 min read · 2026-06-25 · Ironclad Academy

#interview #ai #llm #vision #search


The problem

Stripe's finance team processes invoices from thousands of vendors. Ramp customers upload receipts from every gas station and coffee shop in the country. Mortgage underwriters at United Wholesale Mortgage receive W-2s and pay stubs scanned on phones in bad lighting. In each case, someone uploads a file that contains financial data, and a human has to read it and type the numbers into a system. The economic cost is measurable: AP teams report an average 14.6-day invoice cycle when processing is manual, dropping to 3–5 days with automation. At enterprise volume, that difference in cycle time is worth tens of millions of dollars in working capital.

The technical problem is deceptively hard. A PDF is not data — it is a rendering instruction set. Pixels arranged on a page look like "Invoice Total: $4,712.50" to a human, but to a computer they are a JPEG embedded in a binary container with no semantic structure. Getting from "JPEG of a document" to {"total_amount": 4712.50, "currency": "USD", "vendor": "Acme Corp"} requires a pipeline that solves five distinct problems: seeing the text, understanding the layout, knowing which text belongs to which field, handling the variation in how different vendors format the same concept, and knowing when it got something wrong.

This article is about building that pipeline at scale — millions of documents per day, across dozens of document types, with confidence scores that route uncertain extractions to humans instead of silently corrupting downstream systems. It is not about building a vector store or RAG retrieval on top of the extracted content; that is covered in Design a RAG Pipeline. It is not about the general ETL infrastructure for moving data between systems; that is Design a Data Pipeline (ETL). The document extraction pipeline is specifically the piece that converts unstructured file bytes into structured, validated, provenance-linked JSON.

Functional requirements

Accept PDF, TIFF, JPEG, PNG, and DOCX documents via direct upload, presigned URL, or API push.
Classify the document type (invoice, contract, receipt, ID card, bank statement, W-2, etc.) without requiring the caller to specify it.
Extract a schema-defined set of fields per document type: header fields (vendor, date, total) and line-item tables.
Return per-field confidence scores and bounding-box coordinates linking each extracted value to its location in the source document.
Route low-confidence fields to a human review queue; expose a review UI with the original document and highlighted fields.
Feed human corrections back as labeled training data to improve future accuracy.
Publish validated structured output to downstream consumers: ERP systems, data warehouses, or the RAG ingestion pipeline.

Non-functional requirements

Throughput: sustain 1 million documents per day (~12 docs/second) with burst capacity to 3x on month-end peaks.
Latency: p99 end-to-end < 5 minutes for standard documents; < 30 minutes for high-complexity docs routed through human review.
Accuracy: field-level exact match rate (EMR) ≥ 95% for header fields, ≥ 85% for line items on documents with scan quality above 200 DPI.
Straight-through processing (STP) rate: ≥ 70% of documents complete without human review, targeting 80% at steady state.
Durability: no document lost after acknowledged receipt; DLQ captures all failures for inspection and replay.
Auditability: every extracted field must link to a source location (page number, bounding box) for compliance and dispute resolution.

Capacity estimation

Dimension	Estimate	How we got there
Document ingest rate	12 docs/second sustained; 36 docs/second at peak	1M docs/day ÷ 86,400 seconds; 3x peak multiplier for month-end runs
Pages per second	60 pages/second sustained (avg 5 pages/doc)	12 docs/sec × 5 pages/doc
OCR GPU requirement	~58 A100-equivalents	DocTR: ~1 sec/page on A100; 5M page-seconds ÷ 86,400 sec/day = 57.9 → ~58 GPU-equivalents
LLM extraction cost	~$31,500/day at list price	1M docs × $0.0315/doc (Claude Sonnet, 8 pages, ~8K tokens in / 500 out; at $3/M input + $15/M output = $0.024 + $0.0075 = $0.0315)
LLM extraction cost with batching	~$15,750/day	Batch API discount (50% off list) for non-real-time workloads: $31,500 × 0.5
Raw document storage	2 TB/day ingested	1M docs × 2 MB average PDF
Structured output storage	10 GB/day	1M docs × 10 KB JSON output + bounding box provenance
Human review queue	~300,000 fields/day	30% of docs hit review on at least one field; avg 1 field/doc in review
Queue message throughput	~150 messages/second at peak	36 docs/sec × 4 queue touches per doc (ingest, classify, OCR, extract)

Takeaway: OCR GPU capacity is the primary infrastructure cost: roughly 58 A100-equivalents for OCR alone, separate from LLM inference. LLM cost dominates the variable per-document spend; batch mode cuts it in half. Human review, at 30% of docs, is the main latency SLA risk: if queue depth grows, the p99 breach comes from the human bottleneck, not the machines.
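The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch using only the figures stated above (1M docs/day, 5 pages/doc for OCR sizing, ~1 sec/page on an A100, 8K input and 500 output tokens per document); the variable names are illustrative. Note that the table rounds 11.57 docs/second up to 12 before multiplying, which is why it shows 60 pages/second rather than ~58.

```python
SECONDS_PER_DAY = 86_400

docs_per_day = 1_000_000
pages_per_doc = 5          # average used for OCR sizing in the table
ocr_sec_per_page = 1.0     # DocTR on an A100, per the table

docs_per_sec = docs_per_day / SECONDS_PER_DAY
pages_per_sec = docs_per_sec * pages_per_doc
# GPU-seconds of OCR work per day divided by seconds in a day.
gpu_equivalents = docs_per_day * pages_per_doc * ocr_sec_per_page / SECONDS_PER_DAY

# LLM cost per document at the list prices quoted in the table.
input_tokens, output_tokens = 8_000, 500
usd_per_m_input, usd_per_m_output = 3.0, 15.0
llm_usd_per_doc = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1e6
llm_usd_per_day = llm_usd_per_doc * docs_per_day
llm_usd_per_day_batch = llm_usd_per_day * 0.5   # 50% batch discount

print(f"{docs_per_sec:.2f} docs/s, {pages_per_sec:.1f} pages/s")
print(f"{gpu_equivalents:.1f} A100-equivalents for OCR")
print(f"${llm_usd_per_doc:.4f}/doc, ${llm_usd_per_day:,.0f}/day, ${llm_usd_per_day_batch:,.0f}/day batched")
```

Keeping the estimate in code makes it easy to rerun when an assumption changes, for example a different average page count or a cheaper model tier.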

Building up to the design

V1: synchronous API call to a managed service

The naive implementation: receive a PDF via HTTP, call AWS Textract AnalyzeDocument synchronously, and return the extracted fields in the response. This works for a demo.

It breaks in three places. First, Textract's synchronous API is limited to single-page documents; anything multi-page requires the async StartDocumentAnalysis path, which reports completion through a notification (typically consumed from SQS), and the caller's HTTP request cannot block for minutes waiting on it. Second, a Lambda left at the 3-second default timeout is killed on any document that takes longer to process; the AWS reference architecture explicitly raises the Lambda timeout to 2 minutes. Third, at 12 docs/second, you hit Textract's default TPS quota almost immediately.

V2: async queue + per-document workers

The second version decouples ingest from processing. The API endpoint writes the S3 object key into an SQS queue and returns a job ID immediately. Workers pull from the queue, call Textract, and write results to DynamoDB. The caller polls for status or subscribes to an SNS notification.

This is the standard AWS reference architecture: S3 → SQS → Lambda → Textract → DynamoDB. It scales well for batch workloads and costs almost nothing when idle. Two failure modes remain. The first is the queue visibility timeout, which must exceed P99 processing time or workers will race on the same message: at a 300-second P99 for a complex 20-page document, a 30-second visibility timeout (the SQS default) causes roughly 4% of messages to be processed twice. The second is extraction quality and cost: Textract forms+tables costs $65/1,000 pages, and its invoice line-item accuracy benchmarks at 82%, which is not acceptable for AP automation.

V3: specialized model routing + GPU inference service

flowchart TD
    API[Ingest API] --> S3[(S3 / GCS)]
    S3 --> Q[(SQS Work Queue<br/>visibility=300s · DLQ after 3)]
    Q --> WRK[Orchestration Workers<br/>CPU fleet]
    WRK --> CLS[Classifier Service<br/>CLIP-KNN · 10ms]
    CLS --> PRE[Preprocessor<br/>300 DPI · deskew · rotate]
    PRE --> GPU[GPU Inference Service<br/>DocTR OCR · TATR tables]
    GPU --> EXT[LLM Extractor<br/>schema-constrained JSON]
    EXT --> VAL[Validation Engine]
    VAL --> CONF[Confidence Router]
    CONF -->|STP pass| OUT[(Structured Output<br/>DynamoDB + S3)]
    CONF -->|review| HQ[Human Review Queue]
    HQ --> UI[Review UI<br/>bounding-box highlight]
    UI --> FB[(Training Feedback Store)]
    OUT --> DS[Downstream<br/>ERP · Warehouse · RAG]
    style GPU fill:#0e7490,color:#fff
    style EXT fill:#ffaa00,color:#0a0a0f
    style CONF fill:#ff6b1a,color:#0a0a0f
    style VAL fill:#15803d,color:#fff
    style HQ fill:#a855f7,color:#fff
    style CLS fill:#ff2e88,color:#fff

V3 separates orchestration from GPU inference. CPU workers handle the I/O-heavy steps — fetching from S3, calling the classifier, writing results — while a containerized GPU inference service handles OCR and table detection. The two scale independently: OCR GPU auto-scales on queue depth, orchestration workers auto-scale on CPU utilization. The LLM extraction step uses Instructor (which retries on schema-validation failure) or Anthropic tool-use JSON schema (which constrains token generation), meaning the model outputs valid JSON matching the target schema — no parsing step needed downstream.

API

The pipeline exposes two endpoints: a submission endpoint for callers and a result fetch endpoint.

POST /v1/documents
Content-Type: multipart/form-data

{
  "file": <binary PDF>,
  "document_type": "invoice",       // optional; if omitted, classifier determines it
  "target_schema": "invoice_v2",    // which extraction schema to use
  "webhook_url": "https://..."      // optional; notified on completion
}

→ 202 Accepted
{
  "document_id": "doc_01J9XKZ...",
  "status": "queued",
  "estimated_completion_ms": 45000
}

GET /v1/documents/{document_id}

→ 200 OK
{
  "document_id": "doc_01J9XKZ...",
  "status": "completed",            // queued | processing | review | completed | failed
  "document_type": "invoice",
  "confidence_summary": { "overall": 0.91, "min_field": 0.72 },
  "fields": {
    "vendor_name": {
      "value": "Acme Corp",
      "confidence": 0.97,
      "source": { "page": 1, "bbox": [112, 88, 340, 106] }
    },
    "invoice_total": {
      "value": 4712.50,
      "confidence": 0.99,
      "source": { "page": 1, "bbox": [480, 612, 560, 628] }
    },
    "line_items": [
      {
        "description": "Widget A",
        "qty": 10,
        "unit_price": 471.25,
        "amount": 4712.50,
        "confidence": 0.88,
        "source": { "page": 2, "bbox": [40, 120, 560, 135] }
      }
    ]
  },
  "review_required_fields": ["po_number"],
  "processing_ms": 12400
}

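The bbox coordinates are in PDF point space (1 point = 1/72 inch, measured from a top-left origin), matching the convention used by PyMuPDF's Rect(x0, y0, x1, y1). AWS Textract instead reports geometry as fractions of page width and height, so Textract output must be scaled by the page dimensions before it is stored in this format. For rotated text, a quad field with four corner coordinates replaces bbox.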
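Because Textract expresses geometry as ratios of the page dimensions, its boxes need scaling before they fit the point-space bbox format used in the response above. A minimal sketch of that conversion follows; the function name is hypothetical, and the input dict follows Textract's BoundingBox keys (Left, Top, Width, Height).

```python
def textract_bbox_to_points(bbox, page_width_pt, page_height_pt):
    """Scale a normalized BoundingBox to [x0, y0, x1, y1] in PDF points.

    Both conventions measure from the top-left corner, so no axis flip is needed.
    """
    x0 = bbox["Left"] * page_width_pt
    y0 = bbox["Top"] * page_height_pt
    x1 = x0 + bbox["Width"] * page_width_pt
    y1 = y0 + bbox["Height"] * page_height_pt
    return [round(v, 2) for v in (x0, y0, x1, y1)]


# US Letter is 612 x 792 points (8.5 x 11 inches at 72 points per inch).
normalized = {"Left": 0.5, "Top": 0.25, "Width": 0.25, "Height": 0.05}
points = textract_bbox_to_points(normalized, 612, 792)
print(points)
```

Doing the conversion once at write time means the review UI and audit tooling only ever deal with one coordinate convention.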

The schema

The tracking store records one row per document, with status transitions and per-stage timing:

-- document tracking (Aurora Postgres or DynamoDB GSI-equivalent)
CREATE TABLE documents (
  document_id   TEXT PRIMARY KEY,
  tenant_id     TEXT NOT NULL,
  s3_uri        TEXT NOT NULL,
  document_type TEXT,               -- null until classifier runs
  schema_name   TEXT NOT NULL,
  status        TEXT NOT NULL,      -- queued/preprocessing/ocr/extracting/validating/review/completed/failed
  page_count    INTEGER,
  queued_at     TIMESTAMPTZ NOT NULL,
  ocr_started_at     TIMESTAMPTZ,
  extract_started_at TIMESTAMPTZ,
  completed_at       TIMESTAMPTZ,
  overall_confidence NUMERIC(4,3),
  stp_passed    BOOLEAN,
  error_code    TEXT,
  retry_count   INTEGER DEFAULT 0
);

-- per-field results with provenance
CREATE TABLE extracted_fields (
  document_id   TEXT REFERENCES documents,
  field_name    TEXT NOT NULL,
  field_value   JSONB,             -- typed: text, number, date, array
  confidence    NUMERIC(4,3),
  ocr_confidence NUMERIC(4,3),
  llm_logprob_confidence NUMERIC(4,3),
  validation_passed BOOLEAN,
  source_page   INTEGER,
  source_bbox   JSONB,             -- {x0, y0, x1, y1} or {quad: [[x,y]×4]}
  review_required BOOLEAN DEFAULT false,
  reviewer_override JSONB,         -- set when human corrects the field
  PRIMARY KEY (document_id, field_name)
);

Extraction schemas are versioned separately in a schema registry. The schema_name field ties each document to its extraction schema version, enabling backward-compatible schema evolution without reprocessing historical documents.

Architecture

Full system

flowchart LR
    subgraph EXT_IN["External ingestion"]
        API[REST API<br/>ingest endpoint]
        BATCH[Batch S3 drop<br/>bulk upload]
    end
    subgraph STORE["Storage layer"]
        S3[(Raw docs<br/>S3 immutable)]
        META[(Tracking DB<br/>Aurora / Dynamo)]
        SCH[(Schema Registry)]
    end
    subgraph QUEUE["Queue layer"]
        CQ[(Classify Queue)]
        OQ[(OCR Queue)]
        EQ[(Extract Queue)]
        VQ[(Validate Queue)]
        DLQ[(Dead Letter Queue)]
    end
    subgraph GPU_SVC["GPU inference service"]
        OCR_SVC[OCR + Layout<br/>DocTR · TATR · Mistral OCR]
    end
    subgraph CPU_WRK["CPU orchestration workers"]
        CW[Classifier Worker<br/>CLIP-KNN]
        PW[Preprocessor Worker<br/>deskew · 300 DPI · rotate]
        EW[Extractor Worker<br/>LLM + Instructor schema]
        VW[Validator Worker<br/>cross-field rules]
        RT[Confidence Router]
    end
    subgraph OUTPUT["Output layer"]
        HQ[(Review Queue + UI)]
        EVT[Event Bus<br/>Kafka / EventBridge]
        DS[Downstream<br/>ERP · Snowflake · RAG]
    end
    API --> S3
    BATCH --> S3
    S3 --> META
    S3 --> CQ
    CQ --> CW
    CW --> OQ
    OQ --> PW
    PW --> OCR_SVC
    OCR_SVC --> EQ
    EQ -->|schema lookup| SCH
    EQ --> EW
    EW --> VQ
    VQ --> VW
    VW --> RT
    RT -->|STP| EVT
    RT -->|review| HQ
    EVT --> DS
    CQ -.->|fail after 3| DLQ
    OQ -.->|fail after 3| DLQ
    EQ -.->|fail after 3| DLQ
    style OCR_SVC fill:#0e7490,color:#fff
    style EW fill:#ffaa00,color:#0a0a0f
    style RT fill:#ff6b1a,color:#0a0a0f
    style VW fill:#15803d,color:#fff
    style HQ fill:#a855f7,color:#fff
    style EVT fill:#ff2e88,color:#fff
    style CW fill:#ff2e88,color:#fff

The pipeline is a sequence of queues. Each stage pulls from its own SQS queue, does its work, and pushes a message to the next queue. A message that still fails after three retries moves to the dead letter queue. Because every stage has its own input queue, any stage can be reprocessed independently: if the LLM extraction model is swapped, you can replay the extract queue without re-running OCR.
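The stage-per-queue pattern can be sketched with in-memory queues standing in for SQS. Everything here is illustrative: run_stage, fake_ocr, and the message shape are hypothetical, and the attempt budget of three is an approximation of the "fail after 3" edges in the diagram.

```python
from collections import deque


def run_stage(inbox, outbox, dlq, handler, max_attempts=3):
    """Drain one stage: successes go to the next queue, repeated failures to the DLQ."""
    while inbox:
        msg = inbox.popleft()
        try:
            result = handler(msg["body"])
        except Exception as exc:
            msg["attempts"] = msg.get("attempts", 0) + 1
            msg["last_error"] = repr(exc)
            # Requeue until the attempt budget is spent, then park the message.
            (dlq if msg["attempts"] >= max_attempts else inbox).append(msg)
            continue
        outbox.append({"body": result})


def fake_ocr(body):
    """Stand-in for the OCR stage; fails deterministically on one document."""
    if body["doc_id"] == "doc_bad":
        raise ValueError("unreadable scan")
    return {**body, "ocr_blocks": ["..."]}


ocr_q = deque([{"body": {"doc_id": "doc_ok"}}, {"body": {"doc_id": "doc_bad"}}])
extract_q, dlq = deque(), deque()
run_stage(ocr_q, extract_q, dlq, fake_ocr)
print(len(extract_q), "forwarded;", len(dlq), "dead-lettered")
```

Replaying a stage then amounts to re-enqueueing its input messages, which is what makes swapping the extraction model cheap.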

Hot path sequence diagram

sequenceDiagram
    participant C as Caller
    participant API as Ingest API
    participant S3 as Object Store
    participant CQ as Classify Queue
    participant CW as Classifier Worker
    participant PW as Preprocessor Worker
    participant GPU as GPU OCR Service
    participant EW as Extractor Worker
    participant VW as Validator Worker
    participant RT as Confidence Router
    participant DS as Downstream

    C->>API: POST /v1/documents with PDF body
    API->>S3: PUT raw file (immutable)
    API->>CQ: enqueue doc_id and s3_uri
    API-->>C: 202 Accepted with doc_id and status queued
    CQ->>CW: pull message
    CW->>S3: fetch first page thumbnail
    CW-->>CQ: CLIP-KNN result - invoice at 96% confidence
    CW->>PW: enqueue to OCR Queue with doc_type=invoice
    PW->>S3: fetch all pages
    PW->>PW: rasterize 300 DPI, deskew, detect rotation
    PW->>GPU: POST /ocr with page image bytes
    GPU-->>PW: OCR blocks with bbox, tables with cell structure
    PW->>EW: enqueue to Extract Queue with ocr_result and doc_type
    EW->>EW: fetch invoice_v2 schema from registry
    EW->>EW: LLM call with schema-constrained prompt via Instructor
    EW->>VW: enqueue to Validate Queue
    VW->>VW: cross-field checks line_items.sum vs total
    VW->>RT: fields with composite confidence scores
    RT->>DS: publish validated event for STP fields above 0.95
    RT-->>C: webhook notification with doc_id and completed status

The GPU OCR service is the only stateful shared resource — it holds loaded model weights in GPU memory. Everything else is stateless workers pulling from queues. The OCR service runs DocTR for standard documents and Mistral OCR 4 for documents flagged by the quality signal (low DPI, handwriting, complex tables). Both return word-level bounding boxes and typed block labels.

Deep dives

Document classification

Classification happens before OCR because the document type determines which extraction schema to apply. Getting it wrong is expensive — a W-2 schema applied to an invoice produces mostly empty or wrong fields, burning OCR time and LLM tokens for nothing.

The classifier uses a CLIP-based image embedding (or a fine-tuned EfficientNet/LayoutLM) that converts the first-page thumbnail into a feature vector and does KNN lookup against a labeled prototype set. This runs in roughly 10 ms per page and costs about $0.001 per page — ten times cheaper than sending the page to a VLM. For the ~4% of documents that land in ambiguous KNN regions (cover pages with no content, forms that look like invoices), the classifier escalates to a VLM with a classification prompt. The overall accuracy of this hybrid approach is ~96%, matching VLM-only classification at a fraction of the cost.

The output is a document type label (invoice, contract, id_card, w2, bank_statement, etc.) and a routing key that determines the extraction schema and the set of validation rules to apply downstream.
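The KNN-with-escalation logic described above can be sketched without any ML dependencies. In this toy version the two-dimensional vectors stand in for real image embeddings, and the prototype set, k, and vote-share threshold are all hypothetical values chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def classify(embedding, prototypes, k=3, min_vote_share=0.75):
    """Return (label, escalate). prototypes is a list of (label, vector) pairs.

    escalate is True when the k nearest prototypes disagree too much, which is
    the cue to fall back to the slower VLM classification prompt.
    """
    nearest = sorted(prototypes, key=lambda p: cosine(embedding, p[1]), reverse=True)[:k]
    votes = Counter(label for label, _ in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k < min_vote_share


PROTOTYPES = [
    ("invoice", [1.0, 0.0]), ("invoice", [0.9, 0.1]), ("invoice", [0.8, 0.2]),
    ("w2", [0.0, 1.0]), ("w2", [0.1, 0.9]), ("w2", [0.2, 0.8]),
]

print(classify([1.0, 0.05], PROTOTYPES))   # unanimous neighbours
print(classify([0.55, 0.5], PROTOTYPES))   # mixed neighbours, so escalate
```

The vote-share threshold is the knob that trades VLM spend against misclassification risk; it would normally be tuned on labeled traffic.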

OCR and layout analysis

OCR is the throughput bottleneck. In a pipeline processing an 8-page document, OCR consumes roughly two-thirds of wall-clock time. This is the step engineers are most tempted to underscale.

Two paths exist in production:

Traditional pipeline. DocTR or PaddleOCR runs on GPU, extracting word-level bounding boxes and character-level probabilities. A separate Table Transformer (TATR) pass detects table boundaries and recovers cell structure. Outputs are JSON with blocks typed as text, table, figure, header, or footer. Processing: 1–2 seconds/page on an A100. This path gives deterministic bounding boxes in native PDF coordinate space, which is non-negotiable for compliance audits.

End-to-end VLM. GOT-OCR 2.0 (580M parameters, 8.5 GB VRAM) or Mistral OCR 4 handles layout, OCR, and table parsing in a single forward pass. Mistral OCR 4 processes up to 2,000 pages/minute on a single GPU and benchmarks at 96.6% table accuracy vs Textract's 84.8%. The trade-off is cost — Mistral OCR 4 is $4/1,000 pages, vs $1.50/1,000 pages for Textract basic text and DocTR at ~$0.10/1,000 pages on self-hosted GPU — and less deterministic provenance for the bounding boxes.

Production systems use the hybrid: DocTR for the 90% of pages that are clean, standard-layout text, and Mistral OCR 4 (or Qwen2.5-VL-72B) for the 10% flagged by a quality signal — low-confidence OCR output, detected handwriting, table density above a threshold.
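A quality-signal router for the hybrid path might look like the following sketch. The PageSignals fields mirror the three signals named above; the threshold values and the path labels are assumptions for illustration, not figures from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    mean_ocr_confidence: float    # from the first-pass OCR engine
    handwriting_detected: bool
    table_area_fraction: float    # share of page area covered by detected tables


def choose_ocr_path(signals, min_confidence=0.85, max_table_fraction=0.5):
    """Return "vlm" for pages that trip any quality signal, otherwise "doctr"."""
    if signals.handwriting_detected:
        return "vlm"
    if signals.mean_ocr_confidence < min_confidence:
        return "vlm"
    if signals.table_area_fraction > max_table_fraction:
        return "vlm"
    return "doctr"


clean_page = PageSignals(0.97, False, 0.10)
dense_table_page = PageSignals(0.95, False, 0.70)
print(choose_ocr_path(clean_page), choose_ocr_path(dense_table_page))
```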

Preprocessing matters

Before any model sees a document, the preprocessing step runs:

Rasterize PDFs at minimum 300 DPI. Below 200 DPI, Tesseract CER degrades 3–5x. For vector PDFs, 300 DPI captures all text; for scans, you are stuck with source quality.
Rotation detection. A fast classifier (PaddleOCR PPLCNet) detects 90°/180°/270° rotations and corrects them before OCR.
Deskewing. The Hough transform detects dominant line angles and applies an affine rotation to straighten the page. Uncorrected skew of even 2–3° reduces recognition accuracy by ~10%, so although production deskewing adds ~50 ms/page, it is worth it on any batch of real-world scans.
Low-quality restoration. For documents below a quality threshold (estimated by variance-of-Laplacian blur detection), a diffusion-based restoration step (analogous to the PreP-OCR pipeline from arXiv:2505.20429) reduces character error rate by 63–70% before text recognition runs. This is computationally expensive (~1 sec/page) and is applied only to flagged documents.
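The variance-of-Laplacian blur check used to flag pages for restoration is simple enough to show in pure Python. This is a sketch on a nested-list grayscale image; a production system would more likely use an image library, and the threshold of 100 is a hypothetical value that would need tuning per scanner population.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2-D grid.

    Sharp edges produce large positive and negative responses (high variance);
    blurred or blank pages produce responses near zero (low variance).
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def needs_restoration(gray, threshold=100.0):
    return laplacian_variance(gray) < threshold


flat = [[128] * 6 for _ in range(6)]                                   # no edges at all
checker = [[255 * ((x + y) % 2) for x in range(6)] for y in range(6)]  # maximally sharp
print(needs_restoration(flat), needs_restoration(checker))
```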

Structured extraction

OCR gives you a flat list of text blocks with bounding boxes. The extraction step maps those blocks onto a typed schema. This is where the LLM enters.

The extraction prompt packages the serialized OCR output (blocks in reading order, with their labels and bounding box coordinates) plus the target schema, and instructs the model to fill in the fields. Schema enforcement via Instructor (retry-based validation) or Anthropic tool-use JSON schemas (constrained at generation time) forces the output to conform to the target structure, with no post-hoc parsing and no regex cleanup. An invoice_total emitted as the string "$4,712.50" when the schema requires a number does not reach downstream systems: retry-based validation rejects it and sends the error back to the model for another attempt, while constrained generation cannot emit the string in the first place. Either way the stored value is the number 4712.50.
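The retry-on-validation-failure pattern can be illustrated with the standard library alone. This is not Instructor's API; it is a hand-rolled sketch of the same idea, with validate_invoice, extract_with_retries, and fake_model all hypothetical and the model replaced by a stub that misbehaves once.

```python
import json


def validate_invoice(payload):
    """Minimal schema check: vendor_name is a string, invoice_total is a number."""
    if not isinstance(payload, dict):
        raise ValueError("top-level JSON value must be an object")
    total = payload.get("invoice_total")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("invoice_total must be a JSON number")
    if not isinstance(payload.get("vendor_name"), str):
        raise ValueError("vendor_name must be a string")
    return payload


def extract_with_retries(call_model, prompt, max_retries=2):
    feedback = ""
    for _ in range(max_retries + 1):
        raw = call_model(prompt + feedback)
        try:
            # json.JSONDecodeError is a ValueError subclass, so one handler covers both.
            return validate_invoice(json.loads(raw))
        except ValueError as exc:
            feedback = f"\nPrevious output was rejected: {exc}. Return corrected JSON only."
    raise RuntimeError("extraction failed schema validation after retries")


def fake_model(prompt):
    """Stub model: returns a string-typed total first, a numeric total after feedback."""
    if "rejected" in prompt:
        return '{"vendor_name": "Acme Corp", "invoice_total": 4712.50}'
    return '{"vendor_name": "Acme Corp", "invoice_total": "$4,712.50"}'


result = extract_with_retries(fake_model, "Extract the invoice fields as JSON.")
print(result)
```

The retry budget matters for cost: every rejected attempt is another full LLM call, so persistent failures are better routed to review than retried indefinitely.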

At roughly $0.0315/document for Claude Sonnet via Bedrock (8 pages serialized with bounding-box coordinates, ~8K input tokens, ~500 output tokens; at $3/M input + $15/M output this works out to $0.024 + $0.0075 = $0.0315/doc), the LLM bill comes to ~$31,500/day at 1M documents. Three levers reduce it:

Batch API. Non-real-time workloads can use Anthropic's batch API at a 50% discount, halving the LLM line item to ~$15,750/day.
Smaller models for simple documents. A fine-tuned 7B model can match Sonnet accuracy on common invoice formats at a fraction of the cost; route simple, high-confidence document types to the smaller model.
Context compression. Strip the OCR output of figure blocks, headers, and footers before the LLM prompt. An 8-page invoice has maybe 600 tokens of meaningful fields; the rest is noise that burns tokens.
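Of the three levers, context compression is the easiest to show in code. The sketch below assumes OCR blocks shaped as dicts with the type labels listed in the OCR section (text, table, figure, header, footer); the field names and the sample blocks are hypothetical.

```python
NOISE_TYPES = {"figure", "header", "footer"}


def compress_blocks(blocks):
    """Drop block types that rarely carry extractable fields, plus empty blocks."""
    return [b for b in blocks if b["type"] not in NOISE_TYPES and b["text"].strip()]


ocr_blocks = [
    {"id": "b1", "type": "header", "text": "Acme Corp - Page 1 of 8"},
    {"id": "b2", "type": "text", "text": "Invoice Total: $4,712.50"},
    {"id": "b3", "type": "figure", "text": ""},
    {"id": "b4", "type": "table", "text": "Widget A | 10 | 471.25 | 4712.50"},
    {"id": "b5", "type": "footer", "text": "Thank you for your business"},
]
kept = compress_blocks(ocr_blocks)
print([b["id"] for b in kept])
```

One caveat worth testing on real data: some vendors print the invoice number or vendor name only in the page header, so header blocks from the first page may need to be kept.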

Bounding-box grounding

Every extracted field must link to a source location. This is not optional for finance, legal, or healthcare. When a reviewer disputes an extracted amount, the system must highlight the exact text region in the original document. When a regulator audits an extraction, the bounding box is the audit trail.

In the OCR-backed pipeline, bounding boxes come naturally from the OCR engine: PyMuPDF returns Rect(x0, y0, x1, y1) in PDF point space; Textract returns a polygon as a fraction of page dimensions. The extraction step carries the bounding box metadata from OCR block to field output — the LLM is prompted to reference the OCR block ID it used for each field, and the extractor resolves block ID → bounding box at write time.
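The block ID to bounding box resolution can be sketched as a dictionary lookup at write time. The shapes of llm_fields and ocr_blocks here are assumptions for illustration; the one design choice worth noting is that a field citing an unknown block ID is flagged for review rather than written without provenance.

```python
def ground_fields(llm_fields, ocr_blocks):
    """Attach page and bbox to each extracted field via the OCR block it cites."""
    by_id = {block["id"]: block for block in ocr_blocks}
    grounded = {}
    for name, field in llm_fields.items():
        block = by_id.get(field["block_id"])
        grounded[name] = {
            "value": field["value"],
            "source": {"page": block["page"], "bbox": block["bbox"]} if block else None,
            # A value the model cannot tie to a real block never auto-passes.
            "review_required": block is None,
        }
    return grounded


blocks = [{"id": "b7", "page": 1, "bbox": [480, 612, 560, 628]}]
fields = {
    "invoice_total": {"value": 4712.50, "block_id": "b7"},
    "po_number": {"value": "PO-1", "block_id": "b99"},   # cites a block that does not exist
}
grounded = ground_fields(fields, blocks)
print(grounded["invoice_total"]["source"], grounded["po_number"]["review_required"])
```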

For VLM-only extraction (GOT-OCR 2.0, Qwen2.5-VL), the model can return region-level bounding boxes when prompted for coordinates, though the coordinate convention varies by model. LandingAI's agentic extraction explicitly guarantees that every extracted value links to page location and image snippet, treating grounding as a first-class design constraint rather than an afterthought.

Confidence scoring and human-in-the-loop routing

The confidence score attached to each field is a composite of three signals:

OCR confidence. The OCR engine returns character-level probabilities. A word confidently transcribed as "4712.50" has high OCR confidence; a smudged digit produces low confidence. Aggregated at the field level as the minimum character probability in the region.
LLM logprob confidence. The log-probability of the tokens generated for each field value. A total amount that appears verbatim in the OCR output and matches a pattern the model has seen thousands of times has high logprob confidence. A vendor name transcribed ambiguously from a low-quality logo has low logprob confidence.
Validation rule pass/fail. Structural validation rules are deterministic signals. If line_items[].amount.sum != invoice_total (within a tolerance for rounding), something is wrong — regardless of what the OCR and LLM confidence scores say.

These three signals are combined into a single 0–1 score per field, calibrated against a labeled ground-truth sample. Calibration is the key word: a 0.90 composite score should mean the field is correct ~90% of the time in practice. Uncalibrated logprobs systematically overstate confidence on out-of-distribution document formats — a vendor invoice the model has never seen in training may produce 0.95 logprob confidence on a wrong value.
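Checking calibration empirically means bucketing labeled samples by reported confidence and comparing each bucket's mean confidence with its observed accuracy. The sketch below computes a reliability table and the expected calibration error (ECE); the function names, bin count, and sample data are illustrative.

```python
def reliability_table(samples, n_bins=5):
    """samples: list of (confidence, was_correct).

    Returns one row per non-empty bin: (lo, hi, n, mean_confidence, accuracy).
    """
    rows = []
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        last = i == n_bins - 1
        bucket = [(c, ok) for c, ok in samples if lo <= c < hi or (last and c == 1.0)]
        if bucket:
            n = len(bucket)
            mean_conf = sum(c for c, _ in bucket) / n
            accuracy = sum(1 for _, ok in bucket if ok) / n
            rows.append((lo, hi, n, mean_conf, accuracy))
    return rows


def expected_calibration_error(samples, n_bins=5):
    """Sample-weighted average gap between confidence and accuracy across bins."""
    total = len(samples)
    return sum(n / total * abs(conf - acc)
               for _, _, n, conf, acc in reliability_table(samples, n_bins))


# Calibrated: 0.90 confidence, 9 of 10 correct.
calibrated = [(0.90, True)] * 9 + [(0.90, False)]
# Overconfident on an unseen layout: 0.95 confidence, 5 of 10 correct.
overconfident = [(0.95, True)] * 5 + [(0.95, False)] * 5
print(expected_calibration_error(calibrated), expected_calibration_error(overconfident))
```

Running this per document type or per vendor, rather than once globally, is what exposes the out-of-distribution overconfidence described above.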

The routing tiers in production (following Extend AI's published architecture):

> 0.95: auto-pass. The field goes straight to downstream output without human review.
0.70–0.95: conditional. A second-pass validation runs (additional cross-field checks, format verification). If it passes, the field auto-passes. If not, it queues for review.
< 0.70: human review. The field is surfaced in the review UI with the bounding box highlighted against the original document image. The reviewer corrects the value and submits. The correction is logged as a labeled training example.

The straight-through processing (STP) rate — the percentage of documents where all fields pass without human review — is the primary business metric. Production AP automation systems target 60–80% STP. Below 60%, the human review queue becomes the throughput bottleneck. Above 80%, you have enough trust to consider loosening thresholds.
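The three routing tiers and the document-level STP rate reduce to a few lines. In this sketch the thresholds come from the tiers above, while the function names and the representation of a document as a list of (confidence, second_pass_ok) pairs are assumptions for illustration.

```python
def route_field(confidence, second_pass_ok, auto_pass=0.95, review_below=0.70):
    """Apply the three tiers: auto-pass, conditional second pass, human review."""
    if confidence > auto_pass:
        return "auto_pass"
    if confidence < review_below:
        return "human_review"
    # Conditional tier: the outcome depends on the second-pass validation.
    return "auto_pass" if second_pass_ok else "human_review"


def stp_rate(documents):
    """Share of documents in which every field auto-passes.

    documents: list of documents, each a list of (confidence, second_pass_ok).
    """
    passed = sum(
        all(route_field(conf, ok) == "auto_pass" for conf, ok in doc)
        for doc in documents
    )
    return passed / len(documents)


docs = [
    [(0.99, True), (0.97, True)],   # all fields above 0.95
    [(0.99, True), (0.62, True)],   # one field below 0.70, so review
    [(0.80, True)],                 # conditional tier, second pass succeeds
    [(0.80, False)],                # conditional tier, second pass fails
]
print(stp_rate(docs))
```

Because one low-confidence field sends the whole document to review, STP falls faster than field-level accuracy as the number of fields per document grows.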

Human corrections close the training loop. Every reviewer correction is a labeled data point: the original document region, the model's wrong extraction, and the correct value. These accumulate into fine-tuning datasets. Models retrained on this feedback should show measurable accuracy improvement on the specific document types and failure modes that actually occur in production.

Table and line-item extraction

Tables are the hardest part. Header fields like vendor name and invoice date appear once, in predictable locations, with labeled keys nearby. Line items are a variable-length table with inconsistent column widths, merged cells, totals that repeat data from rows above, and occasional multi-line descriptions.

The accuracy gap is brutal: Azure Document Intelligence achieves 93% overall invoice accuracy but only 87% on line items. Google Document AI achieves 82% overall but only 40% on line items. GPT-4o with OCR preprocessing achieves 98% overall but only 57% on line items. The implication is that overall accuracy is a misleading benchmark if line-item extraction is what your use case requires.

The best production approach uses Microsoft's Table Transformer (TATR): a DETR-based model trained on PubTables-1M (~575K annotated document pages containing close to one million tables) and FinTabNet. TATR runs two passes: first detecting table boundaries, then recovering cell structure within each detected table. It handles merged cells correctly and achieves 91.2% AP on table structure recognition (PubTables-1M benchmark). For handwritten or irregular tables where TATR fails, Mistral OCR 4 achieves 96.6% table accuracy (vs Textract's 84.8%) and is the fallback.

After table detection and cell extraction, a reconciliation step handles:

Spanning cells. When a description spans two rows, the extractor must merge the cell values and attribute the combined text to a single line item.
Nested headers. Some invoices have two-row headers ("Unit Price" spanning both "Quantity" and "Unit"). A rule-based reconciliation identifies the header structure before parsing data rows.
Math validation. Every line item is cross-validated: qty × unit_price must equal amount within a rounding tolerance (for example one cent, computed in decimal arithmetic rather than binary floating point). Discrepancies flag the row for human review regardless of confidence scores.
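The math-validation rule, together with the sum-versus-total check from the confidence section, can be sketched with the standard decimal module so that currency comparisons are not affected by binary floating-point error. The function name and the one-cent tolerance are assumptions; the first sample row reuses the values from the API response shown earlier.

```python
from decimal import Decimal


def check_invoice_math(line_items, invoice_total, tolerance=Decimal("0.01")):
    """Return (bad_row_indexes, total_ok) for an extracted invoice.

    Values go through str() so that floats such as 471.25 become exact decimals.
    """
    def dec(value):
        return Decimal(str(value))

    bad_rows = [
        i for i, item in enumerate(line_items)
        if abs(dec(item["qty"]) * dec(item["unit_price"]) - dec(item["amount"])) > tolerance
    ]
    items_sum = sum((dec(item["amount"]) for item in line_items), Decimal("0"))
    total_ok = abs(items_sum - dec(invoice_total)) <= tolerance
    return bad_rows, total_ok


good = [{"qty": 10, "unit_price": 471.25, "amount": 4712.50}]
print(check_invoice_math(good, 4712.50))

# Second row: 3 x 19.99 is 59.97, but the extracted amount has transposed digits.
suspect = good + [{"qty": 3, "unit_price": 19.99, "amount": 59.79}]
print(check_invoice_math(suspect, 4772.29))
```

In the second call the sum of amounts matches the total even though row 1 is internally inconsistent, which is why both checks are run rather than just one.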

Throughput at scale: async queue + GPU batching

At 1 million documents per day, synchronous processing is not viable. The architecture is purely async: the ingest API returns a job ID immediately, and downstream consumers receive results via webhook or event bus.

The OCR GPU inference service is the throughput-limiting resource. On an A100, DocTR processes ~1 page/second at unit concurrency; continuous batching via vLLM (for VLM-based extraction models) achieves 3–4x throughput over static batching, reaching ~60 tokens/second per request at batch saturation. For the OCR path specifically, the GPU service queues page images in micro-batches and processes them at maximum GPU utilization, targeting ~95% GPU saturation during business hours.

Queue visibility timeout is a detail that breaks naive implementations. SQS visibility timeout must exceed the P99 processing time for the slowest documents in that queue stage. For the OCR queue, P99 is ~300 seconds for a 20-page dense document. A 30-second visibility timeout (a common default) causes roughly 4% of messages to become visible again while the original worker is still processing, triggering duplicate work. Set visibility timeouts to 300–600 seconds for OCR and extraction stages.

The Dead Letter Queue (DLQ) captures messages that fail after 3 retries. These are inspected manually or routed to a fallback processing path (e.g., a managed API like Textract when the self-hosted model fails on a pathological document). DLQ depth is a critical alarm: a spike means a category of document is systematically failing, not just isolated bad scans.
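The retry-then-DLQ behaviour can be illustrated with an in-memory queue. This is a sketch only: it assumes "3 retries" means three delivery attempts in total, and `drain`, `receive_count`, and `last_error` are hypothetical names rather than SQS fields.

```python
from collections import deque
from typing import Callable

MAX_RECEIVES = 3  # assumed reading of the "3 retries" policy

def drain(queue: deque, handler: Callable[[dict], None]) -> list[dict]:
    """Process messages; return those moved to the dead letter queue."""
    dead_letters: list[dict] = []
    while queue:
        msg = queue.popleft()
        msg["receive_count"] = msg.get("receive_count", 0) + 1
        try:
            handler(msg)
        except Exception as exc:  # a real worker would catch narrower errors
            msg["last_error"] = repr(exc)
            if msg["receive_count"] >= MAX_RECEIVES:
                dead_letters.append(msg)
            else:
                queue.append(msg)  # make it visible again for another attempt
    return dead_letters
```

Keeping the last error on the message makes it easier to group DLQ entries by failure category, which is what turns a depth alarm into a diagnosis.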

Auto-scaling is driven by queue depth, not CPU or memory utilization. OCR GPU workers add capacity when the OCR queue depth exceeds 500 messages; they drain when it drops below 100. Scaling in is delayed by 5 minutes to avoid GPU cold-start thrash — loading DocTR model weights into GPU memory takes ~30 seconds, which is expensive to do repeatedly.
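The scaling rule above amounts to hysteresis plus a scale-in delay. A minimal sketch, assuming one worker is added or removed per evaluation and a hypothetical fleet cap of 16 (neither is specified in the text):

```python
from dataclasses import dataclass
from typing import Optional

SCALE_OUT_DEPTH = 500   # add workers above this queue depth
SCALE_IN_DEPTH = 100    # remove workers below this queue depth
SCALE_IN_DELAY_S = 300  # 5-minute delay before scaling in

@dataclass
class Autoscaler:
    workers: int = 1
    min_workers: int = 1
    max_workers: int = 16              # hypothetical fleet cap
    low_since: Optional[float] = None  # when depth first fell below the floor

    def step(self, queue_depth: int, now_s: float) -> int:
        """Return the desired worker count for this observation."""
        if queue_depth > SCALE_OUT_DEPTH:
            self.low_since = None
            self.workers = min(self.max_workers, self.workers + 1)
        elif queue_depth < SCALE_IN_DEPTH:
            if self.low_since is None:
                self.low_since = now_s
            elif now_s - self.low_since >= SCALE_IN_DELAY_S:
                self.workers = max(self.min_workers, self.workers - 1)
                self.low_since = now_s  # restart the delay for the next step down
        else:
            self.low_since = None  # between thresholds: hold steady
        return self.workers
```

The gap between the two thresholds and the delay both serve the same goal: a brief dip in queue depth should not unload model weights that will be needed again seconds later.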

Edge cases & gotchas

OCR is the bottleneck, not the LLM. For an 8-page document, OCR consumes about two-thirds of the pipeline's wall-clock time. Teams that optimize LLM call patterns (prompt compression, model selection) while leaving OCR under-scaled are optimizing the wrong thing.

Queue visibility timeout shorter than P99 processing time causes duplicate reprocessing. At 1 million docs/day, a 30-second timeout (SQS default) on a P99 300-second job triggers ~4% re-processing — roughly 40,000 duplicate submissions per day. In a financial pipeline, processing the same invoice twice is not a minor inefficiency — it can result in duplicate payment. Set visibility timeout to the maximum expected processing time, not the average.
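A correct timeout reduces duplicate deliveries but cannot eliminate them, so a common complementary safeguard is to make consumers idempotent. The sketch below is an assumption about how that might look, not something the pipeline above prescribes: `InMemoryDedupStore` stands in for a durable store with an atomic put-if-absent, and all names are hypothetical.

```python
import hashlib

def idempotency_key(document_bytes: bytes, stage: str) -> str:
    """Stable key: the same content at the same stage yields the same key."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"{stage}:{digest}"

class InMemoryDedupStore:
    """Stand-in for a durable store with an atomic put-if-absent operation."""
    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True only for the first caller that presents this key."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

def process_once(store, document_bytes: bytes, stage: str, handler) -> bool:
    """Run handler unless this (document, stage) pair was already claimed."""
    if not store.claim(idempotency_key(document_bytes, stage)):
        return False  # duplicate delivery: skip side effects such as payment
    handler(document_bytes)
    return True
```

Claiming before running the handler favours "at most once" side effects; a production version would also need to release or expire the claim when the handler fails.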

98% model accuracy still produces 20,000 wrong extractions per day at 1M docs. A 2% failure rate sounds small until you count the absolute numbers. Design the DLQ and human review queue from day one; they are not optional at scale.

Model confidence is not model correctness. A system that outputs 0.90 confidence must be empirically calibrated — correct ~90% of the time at that threshold. Uncalibrated logprobs routinely report 0.95 on out-of-distribution document formats. Calibration requires labeled ground-truth samples covering the real distribution of incoming documents, including adversarial cases.
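Checking calibration empirically comes down to comparing stated confidence with observed accuracy on labeled samples. A minimal sketch, assuming equal-width confidence bins and using expected calibration error as the summary metric (both are illustrative choices, and the function names are hypothetical):

```python
def reliability_table(confidences: list[float], correct: list[bool],
                      n_bins: int = 10) -> list[tuple[float, float, int]]:
    """Per-bin (mean confidence, empirical accuracy, count) for non-empty bins."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # 1.0 falls in the top bin
        bins[idx].append((conf, ok))
    table = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            table.append((mean_conf, accuracy, len(members)))
    return table

def expected_calibration_error(confidences: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Count-weighted average gap between confidence and accuracy."""
    total = len(confidences)
    return sum(n / total * abs(conf - acc)
               for conf, acc, n in reliability_table(confidences, correct, n_bins))
```

A bin with mean confidence near 0.95 and accuracy near 0.60 is exactly the out-of-distribution failure described above; running the table per document type or vendor makes such pockets visible.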

Lambda default timeout of 3 seconds is fatal for document processing. AWS documentation on the intelligent document processing reference architecture explicitly calls out that Lambda timeout must be raised to at least 2 minutes. Even moderately complex PDFs fail silently at the default timeout.

LayoutLMv2 and LayoutLMv3 carry CC BY-NC-SA 4.0 licenses. Despite being state-of-the-art document understanding models, both prohibit commercial use. Teams that build on them discover this late, after significant fine-tuning investment. For commercial products, use MIT-licensed LayoutLM v1 or Apache 2.0 alternatives.

Line-item accuracy can be 6–40+ points below header accuracy depending on the model. Benchmarking on overall accuracy and missing the line-item failure mode is the most common evaluation mistake. If the use case is AP automation, test on line items specifically.

LLM hallucination passes schema validation. Schema validation checks types and formats, not values: a hallucinated total of 4712.50 is a perfectly valid decimal, so it passes even though the actual total is 4712.51. Cross-field arithmetic validation (line_items.sum == total_amount) is the only mechanical catch. High-value fields require human audit regardless of confidence.
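The cross-field check can be sketched in a few lines. This assumes the extracted total is the plain sum of line amounts (tax and shipping either absent or extracted as their own lines); the function name and the zero default tolerance are hypothetical.

```python
from decimal import Decimal

def total_matches_line_items(line_amounts: list[str], total_amount: str,
                             tolerance: Decimal = Decimal("0.00")) -> bool:
    """True when the extracted total equals the sum of extracted line amounts."""
    computed = sum((Decimal(a) for a in line_amounts), Decimal("0"))
    return abs(computed - Decimal(total_amount)) <= tolerance
```

With line amounts of 4000.00 and 712.51, a total of 4712.51 passes and 4712.50 fails, even though both values are valid against the schema.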

Schema drift without versioning breaks downstream consumers. When the extraction schema evolves (a new field added, a type changed), downstream ERP or data warehouse systems receive unexpected JSON shapes. Semantic versioning of schemas and a schema registry are required in multi-team deployments. Pin the schema_name and its version in the tracking DB to enable historical reconstruction.

Poor scan quality is the largest single accuracy driver. A 300 DPI floor is not a nice-to-have; below 200 DPI, CER degrades 3–5x. Monitor scan quality metrics independently of extraction accuracy — when accuracy drops, you need to know whether the model degraded or the input quality degraded.

Trade-offs to discuss in an interview

OCR pipeline vs. end-to-end VLM. Traditional multi-stage (rasterize → OCR → layout → LLM) gives lower per-page inference cost, deterministic bounding boxes, and predictable latency, but accumulates errors across stages. End-to-end VLMs (Mistral OCR 4, Qwen2.5-VL) eliminate error propagation and handle tables natively, but cost more per page and provide less transparent provenance. The hybrid is current production consensus.

Specialized model routing vs. single general model. Routing document types to specialized models (Textract AnalyzeExpense for invoices, TATR for tables, custom LayoutLM for forms) beats a general model on accuracy and cost, but increases operational complexity and version management surface area. A general model is operationally simpler but leaves accuracy on the table.

Sync vs. async processing. Sync APIs support real-time UX (user uploads and sees results immediately) but require complex timeout handling and don't scale to batch volumes. Async queues are operationally cleaner and scale well but require polling or callbacks, which adds integration complexity for callers.

Build vs. buy. Managed APIs reduce operational burden but impose hard quota limits, cost 10–50x more at scale, and create cloud-provider lock-in. The break-even for self-hosting is roughly 100K–500K pages/month. Below that threshold, managed APIs are cheaper when engineering time is included.

Confidence threshold calibration. A single global threshold across all field types is simple but suboptimal — date fields in a known format should have a tighter threshold than free-text vendor addresses. Per-field-type thresholds reduce review burden at equivalent accuracy but require more labeled data to calibrate and more monitoring to maintain.
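Per-field-type thresholds can be expressed as a lookup with a global fallback. The field names and threshold values below are hypothetical placeholders; real values would come from calibration on labeled data, and "tighter" is read here as "higher".

```python
# Hypothetical thresholds; real values come from calibration on labeled data.
FIELD_THRESHOLDS = {
    "invoice_date": 0.98,
    "total_amount": 0.97,
    "vendor_address": 0.85,
}
DEFAULT_THRESHOLD = 0.95  # used for any field type without its own entry

def fields_for_review(extraction: dict[str, tuple[str, float]]) -> list[str]:
    """Names of fields whose confidence is below their per-type threshold."""
    return sorted(
        name for name, (_value, conf) in extraction.items()
        if conf < FIELD_THRESHOLDS.get(name, DEFAULT_THRESHOLD)
    )
```

Routing at field level rather than document level means a reviewer only sees the fields that missed their threshold, which is where the reduction in review burden comes from.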

Full review vs. sampled audit of high-confidence output. Reviewing every document doesn't scale. Reviewing only flagged documents risks systematic errors in high-confidence documents that are wrong for a novel reason. A sampled audit of high-confidence outputs provides statistical coverage without full review — the standard practice in production financial automation.
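A sampled audit needs a selection rule that is reproducible and cannot be gamed by the model's own confidence. One possible sketch, assuming hash-based sampling on a document ID, a 2% audit rate, and a 0.95 review threshold (all illustrative, with hypothetical function names):

```python
import hashlib

def selected_for_audit(document_id: str, sample_rate: float) -> bool:
    """Deterministically pick roughly sample_rate of documents by hashing the ID."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < sample_rate

def route(document_id: str, min_field_confidence: float,
          review_threshold: float = 0.95, audit_rate: float = 0.02) -> str:
    """Return 'review', 'audit', or 'straight_through' for one document."""
    if min_field_confidence < review_threshold:
        return "review"
    if selected_for_audit(document_id, audit_rate):
        return "audit"
    return "straight_through"
```

Hashing the ID instead of calling a random generator means the same document always gets the same decision, so reprocessing does not change which documents were audited.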

Things you should now be able to answer

Why does an intelligent document processing pipeline need an async work queue rather than a synchronous API, even for small documents?
What are the three signals that compose a per-field confidence score, and why is calibration required for any of them to be operationally useful?
How does schema enforcement — via retry-based validation (Instructor) or generation-time constrained decoding (tool-use JSON schema / OpenAI Structured Outputs) — change the extraction architecture compared to prompt-only LLM calls?
Why does line-item accuracy drop so dramatically compared to header field accuracy, and which model architectures specifically address table extraction?
What is the SQS visibility timeout pitfall, and what happens when the timeout is shorter than P99 processing time?
At 1 million pages/month, roughly how much does each managed API tier cost, and where is the break-even point for self-hosted GPU?
How do bounding-box coordinates flow from the OCR engine through the LLM extraction step to the final field output, and why is this provenance non-negotiable for compliance use cases?
What is straight-through processing rate, and what happens to system latency when it drops below 60%?
What preprocessing steps are required before running OCR on real-world scans, and which specific failure does each step prevent?
How does the feedback loop from human corrections to model retraining work, and what prevents it from becoming a bottleneck?

Frequently asked questions

Q: When should you use an end-to-end vision-language model instead of a traditional OCR pipeline?

End-to-end VLMs (GOT-OCR 2.0, Mistral OCR 4, Qwen2.5-VL-72B) skip the multi-stage rasterize-deskew-OCR-layout-LLM chain and handle everything in one forward pass, eliminating cascading error propagation. They shine on handwriting, complex tables, and documents where layout context is essential. The trade-off: they cost more per page, provide less deterministic bounding-box provenance, and are harder to debug when they fail. Most production systems use a hybrid: OCR for bulk text and structured fields, VLM for problem cases like tables or handwritten forms flagged by a quality signal.

Q: What does confidence scoring actually measure and why is calibration critical?

Confidence in a document extraction system is typically a composite of three signals: OCR engine character-level probability, LLM log-probability of generated tokens, and downstream validation rule pass/fail. A score of 0.90 should mean the extracted field is correct roughly 90% of the time — that is calibration. Uncalibrated logprobs often overstate confidence on out-of-distribution document formats; a model that has never seen a particular vendor invoice layout may output 0.95 confidence on a completely wrong value. Calibration requires empirical testing against a labeled ground-truth sample across the real distribution of incoming documents.

Q: How do you handle multi-page documents, rotated scans, and poor image quality?

Preprocessing must run before any model sees the document. At minimum: rasterize PDFs at 300 DPI (below 200 DPI, Tesseract CER degrades 3–5x), deskew using a Hough transform (even 2–3° rotation cuts recognition by ~10%), contrast-normalize, and detect 90/180/270° rotations with a fast classifier before running OCR. For genuinely low-quality scans, a restoration step using diffusion-based image enhancement (PreP-OCR pipeline) reduces character error rate by 63–70% before text recognition even runs. Page count is handled by splitting the document at ingest into per-page units, processing pages in parallel, then assembling ordered results by page index.
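The split, parallel-process, and reassemble step for multi-page documents can be sketched with the standard library. `process_page` is a hypothetical callable standing in for the per-page preprocessing and OCR work, assumed to return a dict that records its `page_index`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def process_document(pages: list[bytes],
                     process_page: Callable[[int, bytes], dict],
                     max_workers: int = 8) -> list[dict]:
    """Run process_page on every page in parallel; return results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_page, idx, page)
                   for idx, page in enumerate(pages)]
        results = [f.result() for f in futures]  # re-raises any page failure
    # Sorting by the recorded index keeps ordering explicit even if results
    # are later collected in completion order instead of submission order.
    return sorted(results, key=lambda r: r["page_index"])
```

In a queue-based deployment the same idea applies across workers: each page message carries its index, and the assembler orders by that index rather than by arrival time.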

Q: What is the build-vs-buy decision point for managed document AI APIs?

Managed APIs (AWS Textract, Google Document AI, Azure Document Intelligence) cost $1.50–$65 per 1,000 pages depending on feature tier. Self-hosted open-source stacks (DocTR, Docling, GOT-OCR, Mistral OCR 4 in a container) require GPU infrastructure investment but cut per-page cost 10–50x. At 1 million pages/month processing forms and tables, AWS Textract costs ~$65,000/month while Mistral OCR 4 batch API costs ~$4,000/month ($4/1,000 pages) — a ~16x spread. The crossover point where self-hosting pays off is roughly 100K–500K pages/month for a team that can manage GPU infra. Below that, managed APIs are cheaper when you include engineering time.
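The arithmetic behind these figures is simple enough to check directly. The sketch below assumes flat per-page pricing with no volume discounts; the break-even helper and its inputs (a fixed monthly cost for GPUs and engineering time, and a self-hosted marginal cost) are hypothetical and not taken from any price list.

```python
def monthly_cost_usd(pages_per_month: int, usd_per_1000_pages: float) -> float:
    """Linear per-page pricing; ignores volume discounts and free tiers."""
    return pages_per_month / 1000 * usd_per_1000_pages

def break_even_pages(fixed_monthly_usd: float, managed_usd_per_1000: float,
                     self_hosted_usd_per_1000: float) -> float:
    """Monthly page volume at which self-hosting matches the managed API bill."""
    saving_per_page = (managed_usd_per_1000 - self_hosted_usd_per_1000) / 1000
    if saving_per_page <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_usd / saving_per_page
```

At 1 million pages/month, `monthly_cost_usd` reproduces the $65,000 and $4,000 figures above, a ratio of 16.25. As a purely illustrative input, a fixed cost of $15,000/month against a $65 managed rate and a $5 self-hosted marginal rate breaks even at about 250K pages/month, inside the 100K–500K range quoted.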

Q: Why does line-item extraction accuracy drop so dramatically compared to header field accuracy?

Header fields like invoice_date and vendor_name appear once, in predictable locations, often with labeled keys nearby. Line items are a table: variable number of rows, inconsistent column widths, cells that span rows, totals that repeat information from rows above. Most models show a significant accuracy gap between headers and line items — anywhere from 6 points (Azure Document Intelligence: 93% overall vs 87% on line items) up to 40 points or more (GPT-4o+OCR: 98% overall vs 57% on line items; Google Document AI: 82% overall vs 40% on line items). The implication: do not benchmark only on header fields and assume line items will follow. If your use case requires line-item accuracy, test specifically for it and expect to use a specialized table extraction model alongside your general extractor.
