What it actually means
Document information extraction (DIE, sometimes called intelligent document processing or IDP) covers any technique that takes an unstructured document — PDF, image, scan, email, Word file — and returns a structured representation of the values inside it.
Historically, teams approached this with regular expressions, rule engines, or per-template OCR pipelines. Those approaches work on a narrow set of documents and break the moment a new vendor shows up or a layout changes.
Why modern AI changes the game
Vision-language models (the same family of models behind tools like ChatGPT and Gemini Vision) can read a document the way a person does, using the layout, the visual cues, the table structure, and the surrounding text, and then answer questions about it. Because the model is not tied to a fixed layout, a new vendor or a changed template generally needs no per-template tuning.
DocParse uses Gemini 2.5 Flash as its default extraction engine (with OpenAI available as a swappable provider). Both providers offer frontier multi-modal models that accept the file directly and return JSON in the schema you specify.
How DocParse does it
The DocParse workflow has three steps, and they are the same whether you use the dashboard or the REST API:
- Define the fields you need (name, type, optional description) — either pick a built-in template or define your own custom schema
- Upload PDF, JPG, PNG, WEBP, or DOCX files (up to 25 MB per file, 30 files per batch)
- Receive structured JSON back — export from the dashboard, poll the REST API, or have signed webhooks push deliveries to your endpoint
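The first two steps above can be sketched as plain data plus a pre-flight check run before anything is uploaded. This is a minimal Python sketch, not DocParse's client library: `Field`, `check_batch`, `invoice_schema`, and the field-type strings are hypothetical names chosen for illustration. Only the accepted formats and the limits (25 MB per file, 30 files per batch) come from the text, and the sketch assumes 1 MB means 1024 * 1024 bytes, which the text does not specify.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional

# Limits and formats taken from the workflow description above.
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_FILES_PER_BATCH = 30
ALLOWED_EXTENSIONS = {".pdf", ".jpg", ".png", ".webp", ".docx"}


@dataclass(frozen=True)
class Field:
    """One field to extract: name, type, optional description (hypothetical shape)."""
    name: str
    type: str
    description: Optional[str] = None


def check_batch(files: list[tuple[str, int]]) -> list[str]:
    """Return problems found in (filename, size_in_bytes) pairs; empty means OK."""
    problems = []
    if len(files) > MAX_FILES_PER_BATCH:
        problems.append(f"batch has {len(files)} files, limit is {MAX_FILES_PER_BATCH}")
    for name, size in files:
        ext = PurePath(name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type {ext!r}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 25 MB limit")
    return problems


# Hypothetical custom schema for an invoice.
invoice_schema = [
    Field("invoice_number", "string"),
    Field("total", "number", "Grand total including tax"),
    Field("due_date", "date"),
]
```

Running a check like this locally means an oversized or unsupported file is caught before any upload is attempted.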
When extraction needs help
Some documents have structure that benefits from an extra hint. DocParse exposes seven document options you can toggle per extraction: tables, charts, checkboxes, handwritten, multi-page, split-PDF, and specific-pages. Each one tells the model to look for that specific structure, so a multi-page bank statement returns the full transaction list instead of just the summary on page one.
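As a rough illustration, the per-extraction toggles can be modelled as a fixed set of names with a helper that rejects anything outside it. The option names below are the seven listed above, but their exact spelling in a real request is an assumption, and `DOCUMENT_OPTIONS` and `build_options` are hypothetical helpers rather than part of DocParse.

```python
# The seven document options named in the text; exact key spellings are assumed.
DOCUMENT_OPTIONS = frozenset({
    "tables", "charts", "checkboxes", "handwritten",
    "multi-page", "split-PDF", "specific-pages",
})


def build_options(*enabled: str) -> dict[str, bool]:
    """Map every known option to True or False, rejecting unknown names."""
    unknown = set(enabled) - DOCUMENT_OPTIONS
    if unknown:
        raise ValueError(f"unknown document options: {sorted(unknown)}")
    return {name: name in enabled for name in sorted(DOCUMENT_OPTIONS)}


# A multi-page bank statement whose transactions sit in a table.
statement_options = build_options("tables", "multi-page")
```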
Languages and scripts
DocParse does not maintain a fixed list of supported languages. The underlying multi-modal model is multilingual by default, so a mixed-script document (English headers, Japanese values, an Arabic stamp) usually works without extra configuration. You can also pin a specific language per extraction if you want to bias the model toward it.
Handwriting works the same way — the model recognises it as part of normal reading, and you can use the handwritten document option to nudge it for low-quality scans.
Getting the data out
Once the extraction completes, the JSON is available three ways:
- Dashboard — view, edit, export as JSON or CSV
- REST API — GET /api/v1/getBatchResults
- Signed webhook — HMAC-SHA256 over the delivery body, per the Standard Webhooks spec