The legacy approach: OCR first, then rules
Traditional pipelines run optical character recognition (OCR) over the document to get a flat block of text, then apply regular expressions or template-based rule engines to find specific fields. This works for tightly controlled inputs, such as a single vendor's invoice template that never changes, but breaks as soon as the layout, language, or scan quality changes.
It also loses structure. Flattened OCR text no longer records that a number sits inside a table cell, that a checkbox is ticked, or that a signature is in the right place.
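To make the brittleness concrete, here is a minimal Python sketch of a rule written against one layout. The regular expression, the function name, and both sample texts are made up for illustration; they do not come from any real pipeline.

```python
import re
from typing import Optional

# Hypothetical rule tuned to a single vendor's invoice wording.
TOTAL_RULE = re.compile(r"Total due:\s*\$([\d,]+\.\d{2})")


def extract_total(ocr_text: str) -> Optional[str]:
    """Return the captured total if the rule matches, otherwise None."""
    match = TOTAL_RULE.search(ocr_text)
    return match.group(1) if match else None


# Flat OCR output from the layout the rule was written for.
known_layout = "Invoice 1042\nTotal due: $1,250.00\n"
# Same information, different language and number format.
new_layout = "Rechnung 1042\nGesamtbetrag 1.250,00 EUR\n"

print(extract_total(known_layout))  # prints 1,250.00
print(extract_total(new_layout))    # prints None
```

The second document carries the same total, but the rule has no way to know that, so every new layout needs another hand-written pattern.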
The modern approach: multi-modal models
Multi-modal AI models read the document as an image and as text simultaneously. They see the layout — columns, tables, checkboxes, signatures, stamps — and they read the values in context. The same model that recognises that a PDF is an invoice can immediately fill the fields you ask for.
DocParse is built on this approach. There is no separate OCR pre-processor and no separate rule layer — the model is asked, in a structured prompt, to return the fields you defined, and it returns valid JSON.
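As a rough illustration of the "structured prompt in, JSON out" idea, the Python sketch below builds a prompt from field definitions and checks a JSON reply against them. It is not DocParse's actual prompt or API: build_prompt, parse_response, the prompt wording, and the sample reply are all assumptions, and the model call itself is left out.

```python
import json
from typing import Dict


def build_prompt(fields: Dict[str, str]) -> str:
    """Turn {field name: description} into a plain-text instruction (hypothetical wording)."""
    lines = [
        "Extract the following fields from the document.",
        "Respond with a single JSON object and nothing else.",
        "Fields:",
    ]
    for name, description in fields.items():
        lines.append(f"- {name}: {description}")
    return "\n".join(lines)


def parse_response(raw: str, fields: Dict[str, str]) -> dict:
    """Parse the model's reply and keep only the requested fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: data[name] for name in fields}


fields = {
    "invoice_number": "the invoice identifier",
    "total": "the amount due as a number",
}
prompt = build_prompt(fields)

# Stand-in for a model reply; a real reply would come from the model.
sample_reply = '{"invoice_number": "1042", "total": 1250.0, "notes": "n/a"}'
result = parse_response(sample_reply, fields)
```

Validating the reply against the requested field list keeps a malformed or incomplete answer from flowing silently into downstream systems.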
How DocParse handles structure
DocParse exposes seven document option flags you can toggle per extraction. Each one shapes how the model interprets the page:
- tables — return rows as nested arrays of objects, preserving column relationships
- charts — extract values from bars, lines, and pies, not just the surrounding text
- checkboxes — return checked / unchecked state for each labelled box
- handwritten — bias the model toward freehand recognition (legibility still matters)
- multi-page — extract across the full document instead of just the first page
- split-PDF — split a multi-doc PDF into individual extractions
- specific-pages — restrict extraction to a page range you specify
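One way to picture the seven flags is as a small options object passed with each extraction. The Python sketch below is only an illustration: the class name, the attribute names, the all-off defaults, and the 1-based inclusive page range are assumptions, not DocParse's documented interface.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass
class ExtractionOptions:
    """Hypothetical container mirroring the seven flags listed above."""
    tables: bool = False
    charts: bool = False
    checkboxes: bool = False
    handwritten: bool = False
    multi_page: bool = False
    split_pdf: bool = False
    # Assumed representation: (first_page, last_page), 1-based and inclusive.
    specific_pages: Optional[Tuple[int, int]] = None

    def __post_init__(self) -> None:
        if self.specific_pages is not None:
            first, last = self.specific_pages
            if first < 1 or last < first:
                raise ValueError("specific_pages must be a valid 1-based range")

    def enabled(self) -> List[str]:
        """Names of the options that are switched on, in declaration order."""
        return [name for name, value in asdict(self).items() if value]


options = ExtractionOptions(tables=True, checkboxes=True, specific_pages=(2, 4))
print(options.enabled())  # prints ['tables', 'checkboxes', 'specific_pages']
```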
What about classification?
Recognition and classification are related but distinct: recognition is about reading what's on the page; classification is about deciding what kind of document it is. DocParse ships both. The Document Classification module lets you define categories with names, descriptions, and keywords, and optionally route each category to a target extraction.
Common pattern: route incoming Gmail attachments through a classifier first to label them as invoice / receipt / contract, then run each through the matching extraction template.
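The classify-then-route pattern can be sketched in a few lines of Python. The keyword counter below is a deliberately simple stand-in for the classifier, not a description of how DocParse classifies documents, and the Category fields, template names, and sample text are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Category:
    """Hypothetical category: name, description, keywords, optional target."""
    name: str
    description: str
    keywords: List[str] = field(default_factory=list)
    target_extraction: Optional[str] = None


def classify(text: str, categories: List[Category]) -> Optional[Category]:
    """Pick the category with the most keyword hits; None if nothing matches."""
    lowered = text.lower()
    best, best_hits = None, 0
    for category in categories:
        hits = sum(1 for kw in category.keywords if kw.lower() in lowered)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def route(text: str, categories: List[Category]) -> Optional[str]:
    """Return the extraction template for the matched category, if any."""
    category = classify(text, categories)
    return category.target_extraction if category else None


categories = [
    Category("invoice", "Bills requesting payment",
             ["invoice", "amount due"], "invoice_template"),
    Category("receipt", "Proof of completed payment",
             ["receipt", "payment received"], "receipt_template"),
    Category("contract", "Signed agreements",
             ["agreement", "hereinafter"], "contract_template"),
]

print(route("Invoice 1042: amount due within 30 days", categories))  # prints invoice_template
```

Keeping the target extraction on the category itself means adding a new document type is a data change rather than a code change.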
File formats
DocParse accepts PDF, PNG, JPG / JPEG, WEBP, DOCX, and plain text. DOCX is converted to text before the model sees it (preserving paragraph structure). PDFs and images go directly to the multi-modal model. Hard limits: 25 MB per file, 30 files per batch.
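Because these limits are hard, it can help to check a batch before uploading it. The Python sketch below is one possible pre-flight check, with a few assumptions: plain text is taken to mean .txt files, 25 MB is interpreted as 25 × 1024 × 1024 bytes, and validate_batch is a made-up helper rather than part of DocParse.

```python
from pathlib import Path
from typing import List, Tuple

# Assumption: "plain text" maps to the .txt extension.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".webp", ".docx", ".txt"}
# Assumption: MB is read as mebibytes; adjust if the service counts 10**6 bytes.
MAX_FILE_BYTES = 25 * 1024 * 1024
MAX_BATCH_FILES = 30


def validate_batch(files: List[Tuple[str, int]]) -> List[str]:
    """Check (filename, size_in_bytes) pairs and return a list of problems."""
    errors = []
    if len(files) > MAX_BATCH_FILES:
        errors.append(f"batch has {len(files)} files, limit is {MAX_BATCH_FILES}")
    for name, size in files:
        extension = Path(name).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            errors.append(f"{name}: unsupported format {extension or '(none)'}")
        if size > MAX_FILE_BYTES:
            errors.append(f"{name}: exceeds the 25 MB per-file limit")
    return errors


batch = [("invoice.PDF", 120_000), ("scan.tiff", 4_000_000)]
print(validate_batch(batch))  # prints ['scan.tiff: unsupported format .tiff']
```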