How-to·6 min read·Updated June 2026

Extract tables from PDFs that actually parse.

Tables are the hardest structure in document extraction. A human sees rows and columns; the PDF sees positioned text fragments with, at best, some lines drawn nearby. Borderless tables, merged cells, tables that continue across pages, and scanned tables each defeat a different class of tool. Here's what works.

Why PDF tables break naive tools

A PDF stores characters and coordinates, not table semantics. Extraction tools must reconstruct the grid, and each technique has a failure mode. Line-detection (lattice) methods need visible borders. Whitespace (stream) methods misjudge column boundaries when cells contain wrapped text. Both fail on scans: a scanned page has no characters at all until OCR runs, and OCR output carries less reliable positions than a digital text layer.
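To make the wrapped-text failure concrete, here is a minimal sketch of a whitespace-style reconstruction in plain Python. The fragment coordinates, the function names and the tolerance value are illustrative assumptions, not taken from any particular library.

```python
def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, top, text) fragments into rows by vertical position."""
    rows = []  # each entry: (top of the row's first fragment, [(x, text), ...])
    for x, top, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1][0] - top) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((top, [(x, text)]))
    return [sorted(cells) for _, cells in rows]


def to_grid(rows):
    """Treat the first row's x positions as column anchors and snap each cell to the nearest one."""
    anchors = [x for x, _ in rows[0]]
    grid = []
    for cells in rows:
        out = [""] * len(anchors)
        for x, text in cells:
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - x))
            out[col] = (out[col] + " " + text).strip()
        grid.append(out)
    return grid


# Made-up fragments: the item name in the first data row wraps onto a second line.
fragments = [
    (50, 100, "Item"), (200, 100, "Qty"), (300, 100, "Amount"),
    (50, 120, "Steel bracket,"), (200, 120, "4"), (300, 120, "18.00"),
    (50, 132, "galvanised"),  # wrapped continuation of the cell above
    (50, 150, "Washer"), (200, 150, "10"), (300, 150, "2.50"),
]

grid = to_grid(rows_from_fragments(fragments))
for row in grid:
    print(row)
```

The wrapped line comes out as its own row with empty Qty and Amount cells. A real tool needs a layout-specific rule to merge it back into the row above, which is where the per-layout tuning comes from.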

This is why the classic Python answer — camelot for bordered tables, tabula or pdfplumber otherwise, OCR first for scans, custom settings per layout — works, but only as an engineering project per document source.
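The routing described above can be sketched as a small decision function. The class, the function and the step labels are hypothetical names for illustration; the labels stand in for whichever library calls and settings a given document source ends up needing.

```python
from dataclasses import dataclass


@dataclass
class PdfTraits:
    has_text_layer: bool    # False for scans and photographs
    has_ruling_lines: bool  # True when cell borders are drawn on the page


def choose_strategy(traits: PdfTraits) -> list[str]:
    """Return the ordered steps the classic Python approach would take."""
    steps = []
    if not traits.has_text_layer:
        steps.append("ocr")  # scans need a text layer before any table tool can run
    if traits.has_ruling_lines:
        steps.append("camelot-lattice")       # bordered tables
    else:
        steps.append("pdfplumber-or-tabula")  # borderless tables, tuned per layout
    return steps


scanned_bordered = choose_strategy(PdfTraits(has_text_layer=False, has_ruling_lines=True))
digital_borderless = choose_strategy(PdfTraits(has_text_layer=True, has_ruling_lines=False))
print(scanned_bordered)
print(digital_borderless)
```

The function itself is trivial. The engineering cost sits in detecting the traits reliably and in the settings each branch needs for each new layout.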

Method comparison

| Table type | Excel 'From PDF' | Python libraries | AI extraction |
| --- | --- | --- | --- |
| Bordered, digital | Usually works | Works (camelot/tabula) | Works |
| Borderless, digital | Hit or miss | Needs per-layout tuning | Works |
| Multi-page continuation | Breaks at page joins | Manual stitching | Works with multi-page option |
| Scanned/photographed | Fails | OCR pre-step, fragile | Works (reads the image) |
| Merged cells / nested headers | Mangled | Often mangled | Usually correct |
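As an illustration of the "manual stitching" cell, here is a minimal sketch that joins per-page tables into one. It assumes each page's table is already a list of rows and that a continued table either repeats the header row exactly or omits it. The function name is hypothetical.

```python
def stitch_pages(page_tables):
    """Concatenate per-page tables, keeping the header row only once."""
    tables = [t for t in page_tables if t]  # skip pages where nothing was extracted
    if not tables:
        return []
    header = tables[0][0]
    body = [row for table in tables for row in table if row != header]
    return [header] + body


page_1 = [["Item", "Qty"], ["Washer", "10"], ["Bolt", "4"]]
page_2 = [["Item", "Qty"], ["Nut", "12"]]  # header repeated on the next page
stitched = stitch_pages([page_1, [], page_2])
print(stitched)
```

This does not handle a row that is split across the page break, or a repeated header that differs slightly from the first one. Those cases need layout-specific rules.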

The AI approach: describe the rows you want

Multi-modal models sidestep grid reconstruction: they read the page visually and return the data in the schema you describe. In DocParse you define a list-of-objects field — say line_items with item name, quantity, unit price, amount — and enable the tables document option. The model returns an array of structured rows regardless of whether the table had borders, page breaks or a scan behind it.
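To show the shape of such a field, here is a sketch of a line_items definition and of the row array that comes back. The dictionary layout, the type labels and the sample response are assumptions made for illustration; they are not DocParse's actual schema format or API output.

```python
import json
from dataclasses import dataclass

# Hypothetical field description: the column names follow the example in the text,
# the surrounding structure is assumed.
LINE_ITEMS_FIELD = {
    "name": "line_items",
    "type": "list_of_objects",
    "columns": {
        "item_name": "string",
        "quantity": "number",
        "unit_price": "number",
        "amount": "number",
    },
}


@dataclass
class LineItem:
    item_name: str
    quantity: float
    unit_price: float
    amount: float


def parse_line_items(response_json: str) -> list[LineItem]:
    """Turn a JSON response containing a line_items array into typed rows."""
    payload = json.loads(response_json)
    columns = LINE_ITEMS_FIELD["columns"]
    return [LineItem(**{name: row[name] for name in columns})
            for row in payload[LINE_ITEMS_FIELD["name"]]]


# Made-up response for illustration.
sample_response = json.dumps({
    "line_items": [
        {"item_name": "Washer", "quantity": 10, "unit_price": 0.25, "amount": 2.5},
        {"item_name": "Steel bracket, galvanised", "quantity": 4, "unit_price": 4.5, "amount": 18.0},
    ]
})

items = parse_line_items(sample_response)
print(items[1].item_name, items[1].amount)
```

The wrapped item name arrives as a single value because the schema asks for an item name rather than for a grid cell.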

On export, list fields expand: a 60-row table becomes 60 spreadsheet rows in the Excel/CSV download, or a JSON array via the API. For documents that are mostly one big table — price lists, transaction registers, inventory counts — this is the difference between an afternoon of cleanup and a download.
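The expansion can be pictured as follows: one extracted array in, one spreadsheet row per element out. This is a sketch using Python's standard csv module. The column names and the document_id column are assumptions, not the actual export layout.

```python
import csv
import io


def line_items_to_csv(document_id, line_items):
    """Write one CSV row per list element, repeating the document id on each row."""
    columns = ["document_id", "item_name", "quantity", "unit_price", "amount"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in line_items:
        writer.writerow({"document_id": document_id, **row})
    return buffer.getvalue()


# A made-up 60-row table.
rows = [{"item_name": f"Part {n}", "quantity": n, "unit_price": 1.5, "amount": n * 1.5}
        for n in range(1, 61)]
csv_text = line_items_to_csv("price-list-001", rows)
print(csv_text.splitlines()[0])
print(len(csv_text.splitlines()) - 1, "data rows")
```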

Hard cases worth knowing about

Tables continuing across pages: enable the multiple-pages option so the model treats the document as one context. Otherwise any tool may return only page one's rows.

Numeric integrity: add a validation rule on totals (e.g., amount must be present and numeric) so OCR-grade mistakes get flagged into the review queue.

Very wide tables: define every column you need explicitly in the schema; unnamed columns are the ones tools improvise on.
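The numeric-integrity rule can be sketched in a few lines. The function name, the row format and the review list are hypothetical stand-ins; the rule itself is the one from the text, that amount must be present and numeric.

```python
def validate_rows(rows):
    """Split rows into accepted rows and (index, reason) entries for review."""
    accepted, review = [], []
    for index, row in enumerate(rows):
        amount = row.get("amount")
        if amount is None or amount == "":
            review.append((index, "amount missing"))
            continue
        try:
            # Thousands separators are tolerated; anything else must parse as a number.
            float(str(amount).replace(",", ""))
        except ValueError:
            review.append((index, f"amount not numeric: {amount!r}"))
            continue
        accepted.append(row)
    return accepted, review


extracted = [
    {"item_name": "Washer", "amount": "2.50"},
    {"item_name": "Bolt", "amount": "1O.50"},  # letter O instead of zero, a typical OCR slip
    {"item_name": "Nut"},                      # amount missing entirely
    {"item_name": "Bracket", "amount": "1,250.00"},
]

accepted, review = validate_rows(extracted)
print(len(accepted), "accepted")
print(review)
```

A stricter variant could also check that quantity times unit price matches amount within a rounding tolerance.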

Frequently asked questions

How do I extract a table from a scanned PDF?

Use AI extraction — multi-modal models read the scan as an image, so there's no OCR pre-step. Define your columns as a list-of-objects field and enable the tables option.

Can I extract multiple tables from one PDF?

Yes — define a separate list field per table (e.g., line_items and tax_breakdown) and the model fills each from the corresponding table in the document.

What's the best Python library for PDF table extraction?

For digital PDFs: camelot for bordered tables, pdfplumber for fine control, tabula-py for quick jobs. All need per-layout tuning and an OCR step for scans — which is why varied or scanned documents are usually better served by AI extraction.

Tables out, intact.

Upload a gnarly multi-page table and check the rows. 100 free pages on signup.