How-to · 6 min read · Updated June 2026

PDFs in. Schema-shaped JSON out. Over HTTP.

If you're building a product or pipeline that ingests documents, you don't want a dashboard — you want an endpoint: POST a PDF, get back JSON in your schema. Here's how that works with a modern extraction API, the design decisions that matter, and the verification patterns that keep bad data out of your system.

The shape of the problem

Generic PDF-to-text libraries give you strings; what your application needs is typed structure — { "invoice_no": "INV-201", "total": 1842.5, "line_items": [...] }. The gap between those two is exactly what an extraction API closes: you declare the schema, the API reads the document and returns data in that shape.

With DocParse's REST API, the schema is the extraction definition: field names, types (string, number, object, list), and natural-language descriptions that steer the model. Define it once in the dashboard or via the API, and every upload against it returns the same JSON shape.
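To make "the same JSON shape" concrete on the consuming side, here is a minimal sketch that declares a schema as a plain Python dict and checks a returned result against it. The dict layout and the `matches_schema` helper are assumptions made for this example, not DocParse's actual definition format; only the four type names come from the text above.

```python
# Hypothetical, simplified schema: field name -> declared type.
# The layout is an assumption for illustration, not DocParse's format.
INVOICE_SCHEMA = {
    "invoice_no": "string",
    "total": "number",
    "line_items": "list",
}

# Map the four declared type names onto Python types.
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "object": dict,
    "list": list,
}


def matches_schema(result: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for field, type_name in schema.items():
        if field not in result:
            problems.append(f"missing field: {field}")
            continue
        value = result[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[type_name]):
            problems.append(f"{field}: expected {type_name}")
    return problems


result = {"invoice_no": "INV-201", "total": 1842.5, "line_items": []}
print(matches_schema(result, INVOICE_SCHEMA))
```

A check like this is cheap to run on every response and catches a schema change before it reaches your database.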

The integration loop

Three operations cover the whole lifecycle, with polling and webhooks as alternative ways to collect results:

  • Upload — POST your PDF(s) to the extraction as a batch; the API acknowledges immediately and processes asynchronously
  • Poll — GET the batch results endpoint until status is processed; or skip polling entirely with…
  • Webhooks — register an endpoint and receive a signed delivery (HMAC-SHA256, Standard Webhooks spec) the moment each document finishes
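The polling half of the loop above can be sketched as follows. This is an illustration under stated assumptions: `fetch_batch` is a hypothetical callable that performs the GET against the batch results endpoint and returns the decoded JSON, the `"processing"` status and `"documents"` key are invented for the demo, and treating `needs_review` as terminal alongside `processed` is a guess to confirm against the API docs.

```python
import time
from typing import Callable

# Assumption: these statuses mean the batch is finished; anything else
# is treated as "still working".
TERMINAL_STATUSES = {"processed", "needs_review"}


def poll_batch(
    fetch_batch: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_batch() until its "status" is terminal or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        batch = fetch_batch()
        if batch.get("status") in TERMINAL_STATUSES:
            return batch
        if time.monotonic() >= deadline:
            raise TimeoutError("batch did not finish before the timeout")
        sleep(interval_s)


# Usage with a fake fetcher standing in for the real HTTP call.
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "processed", "documents": [{"invoice_no": "INV-201"}]},
])
final = poll_batch(lambda: next(responses), sleep=lambda _: None)
print(final["status"])
```

Injecting the fetcher and the sleep function keeps the loop testable without network access; in a real integration `fetch_batch` would wrap your HTTP client and API key.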

Why signed webhooks beat polling

Polling works for scripts, but it spends requests on documents that are not ready yet and adds up to one polling interval of latency; webhooks are the production pattern. Each DocParse delivery is signed with a per-endpoint secret and timestamped against replay, so your handler can cryptographically verify that the payload came from us before acting on it. Verification takes about a dozen lines in most languages, and the docs include working examples.
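For orientation, here is a minimal verification sketch based on the public Standard Webhooks signing scheme: an HMAC-SHA256 over `id.timestamp.body`, base64-encoded, sent as `v1,<signature>`. It assumes DocParse uses the spec's header names and `whsec_` secret format unmodified and that your framework hands you lowercase header names; treat the DocParse docs as the authority. The demo secret and message id are made up.

```python
import base64
import hashlib
import hmac
import time


def verify_webhook(secret, headers, body, tolerance_s=300, now=None) -> bool:
    """Check a delivery against the Standard Webhooks signing scheme."""
    try:
        msg_id = headers["webhook-id"]
        timestamp = headers["webhook-timestamp"]
        signatures = headers["webhook-signature"]
        sent_at = int(timestamp)
    except (KeyError, ValueError):
        return False

    # Reject stale or future-dated deliveries to limit replay.
    now = time.time() if now is None else now
    if abs(now - sent_at) > tolerance_s:
        return False

    # Secrets are base64, conventionally prefixed with "whsec_".
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{msg_id}.{timestamp}.".encode() + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest())

    # The header may carry several space-separated "v1,<sig>" entries.
    for entry in signatures.split(" "):
        version, _, sig = entry.partition(",")
        if version == "v1" and hmac.compare_digest(sig.encode(), expected):
            return True
    return False


# Demo: sign a payload locally, then verify it (all values are made up).
secret = "whsec_" + base64.b64encode(b"demo-secret-for-illustration").decode()
body = b'{"status": "processed"}'
ts = "1700000000"
key = base64.b64decode(secret.removeprefix("whsec_"))
sig = base64.b64encode(
    hmac.new(key, f"msg_1.{ts}.".encode() + body, hashlib.sha256).digest()
).decode()
headers = {"webhook-id": "msg_1", "webhook-timestamp": ts, "webhook-signature": f"v1,{sig}"}
print(verify_webhook(secret, headers, body, now=1700000010))
```

Verify against the raw request bytes, before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.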

Designing schemas that extract well

Three field-design habits noticeably improve output quality:

  • Describe fields like you'd brief a person: "the issuing date / billing date / transaction date" beats "date"
  • Use list-of-object fields for tables — the model returns proper arrays, not concatenated text
  • Provide example values where the format matters ("2022-01-01", "19%"), since the model tends to follow them
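The three habits above might look like this when written down as field definitions. The keys used here (`name`, `type`, `description`, `example`, `fields`) and the `lint_field` helper are hypothetical, chosen only to illustrate the idea; the real definition format is whatever the dashboard or API expects.

```python
# Hypothetical field definitions; the key names are illustrative,
# not DocParse's actual format.
VAGUE_FIELD = {"name": "date", "type": "string"}

DESCRIBED_FIELD = {
    "name": "invoice_date",
    "type": "string",
    "description": "The issuing date / billing date / transaction date",
    "example": "2022-01-01",
}

LINE_ITEMS_FIELD = {
    "name": "line_items",
    "type": "list",
    "description": "One entry per row of the invoice table",
    "fields": [
        {"name": "description", "type": "string", "description": "Item or service name"},
        {"name": "quantity", "type": "number", "description": "Units billed on this row"},
        {"name": "vat_rate", "type": "string", "description": "Tax rate for this row", "example": "19%"},
    ],
}


def lint_field(field: dict) -> list:
    """Flag fields that skip the habits above (a local check, not an API call)."""
    warnings = []
    if not field.get("description"):
        warnings.append(f"{field['name']}: add a description")
    if field["type"] == "list" and not field.get("fields"):
        warnings.append(f"{field['name']}: declare the object fields for each row")
    # Recurse into the per-row fields of list-of-object definitions.
    for sub in field.get("fields", []):
        warnings.extend(lint_field(sub))
    return warnings


print(lint_field(VAGUE_FIELD))
print(lint_field(LINE_ITEMS_FIELD))
```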

Keeping bad data out

An API passes bad data downstream just as efficiently as good data. Pair the endpoint with server-side validation rules (required fields, ranges, patterns) so that a document that extracts poorly is marked needs_review instead of processed. Your integration can branch on that status and route exceptions to a human queue while clean documents flow straight through. Failed processing is never billed.
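Branching on the status could be as small as the sketch below. The two status values come from the text above; the `"data"` key, the in-memory queue, and the `accepted` list are stand-ins invented for this example.

```python
from collections import deque

review_queue = deque()  # stand-in for a human review queue
accepted = []           # stand-in for your downstream store


def route_document(doc: dict) -> str:
    """Branch on the document status and report where the document went."""
    status = doc.get("status")
    if status == "processed":
        accepted.append(doc["data"])
        return "accepted"
    if status == "needs_review":
        review_queue.append(doc)
        return "queued_for_review"
    # Anything else (for example a failure status) is surfaced, not stored.
    raise ValueError(f"unexpected status: {status!r}")


print(route_document({"status": "processed", "data": {"invoice_no": "INV-201"}}))
print(route_document({"status": "needs_review", "data": {"invoice_no": None}}))
```

Raising on unknown statuses is a deliberate choice in this sketch: a new or unexpected status stops the pipeline visibly instead of being written to the store by default.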

For quick experiments before writing any code: the dashboard's sample invoice and 100 free pages let you confirm the JSON shape in minutes, then wire up the API knowing exactly what comes back.

Frequently asked questions

Does the API handle scanned PDFs and images?

Yes — PDF, PNG, JPG, WEBP and DOCX all go through the same endpoint, scans and photos included, since the model reads documents visually.

How do I authenticate?

API keys, created in the dashboard. Keys are stored as SHA-256 hashes server-side and can be rotated or revoked instantly, with usage visible per key.

Is there a Zapier option for no-code pipelines?

Yes — a native Zapier app triggers on completed extractions, so you can push JSON into 6,000+ apps without running your own webhook receiver.

Ship the integration this week.

100 free pages, full API and webhook access on every account.