Why PDF tables break naive tools
A PDF stores characters and coordinates, not table semantics. Extraction tools must reconstruct the grid, and each technique has a failure mode. Line-detection (lattice) methods need visible borders. Whitespace methods misjudge column boundaries when cells contain wrapped text. Both fail completely on scans, where there are no characters at all until OCR runs, and the positions OCR reports are less reliable than those in a digital PDF.
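To make the whitespace failure concrete, here is a minimal sketch of a naive gap-based splitter. It is not how any particular library works. The word tuples, the `split_rows` name and the 15-unit gap threshold are all assumptions made for illustration. It groups words into lines by vertical position and starts a new cell wherever the horizontal gap is wide.

```python
from collections import defaultdict

def split_rows(words, gap=15.0):
    """Group words into lines by `top`, then cut each line into cells
    wherever the horizontal gap between neighbours exceeds `gap`."""
    lines = defaultdict(list)
    for word in words:
        lines[word[3]].append(word)
    rows = []
    for top in sorted(lines):
        cells, prev_x1 = [], None
        for text, x0, x1, _ in sorted(lines[top], key=lambda w: w[1]):
            if prev_x1 is None or x0 - prev_x1 > gap:
                cells.append(text)       # wide gap: start a new cell
            else:
                cells[-1] += " " + text  # narrow gap: same cell
            prev_x1 = x1
        rows.append(cells)
    return rows

# Hypothetical words as (text, x0, x1, top), roughly what a text layer exposes.
words = [
    ("Widget", 10, 50, 100), ("2", 200, 206, 100), ("9.50", 260, 284, 100),
    ("Extended", 10, 60, 120), ("warranty", 64, 112, 120),
    ("1", 200, 206, 120), ("30.00", 260, 290, 120),
    ("plan", 10, 34, 132),  # second line of the wrapped "Extended warranty plan" cell
]
rows = split_rows(words)
for row in rows:
    print(row)
```

The wrapped word "plan" comes out as its own one-cell row instead of joining the description above it, which is the kind of ragged output that then needs per-layout repair.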
This is why the classic Python answer — camelot for bordered tables, tabula or pdfplumber otherwise, OCR first for scans, custom settings per layout — works, but only as an engineering project per document source.
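As a rough sketch of what "custom settings per layout" looks like in practice, the snippet below switches pdfplumber between its line-based and text-based table strategies. It assumes pdfplumber is installed. The helper names `table_settings_for` and `extract_tables` are made up for this example, and real documents usually need further tolerance tuning on top of this.

```python
def table_settings_for(bordered: bool) -> dict:
    """Pick pdfplumber table strategies: ruling lines for bordered tables,
    text alignment for borderless ones."""
    strategy = "lines" if bordered else "text"
    return {"vertical_strategy": strategy, "horizontal_strategy": strategy}

def extract_tables(path: str, bordered: bool) -> list:
    """Return every table on every page as a list of rows (lists of cell strings)."""
    import pdfplumber  # third-party; imported lazily so the helper above works without it

    settings = table_settings_for(bordered)
    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables(table_settings=settings))
    return tables

print(table_settings_for(bordered=False))
```

A call such as `extract_tables("invoice.pdf", bordered=True)` (the file name is a placeholder) only helps on digital PDFs. A scan has no text layer for either strategy to work with.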
Method comparison
| Table type | Excel 'From PDF' | Python libraries | AI extraction |
|---|---|---|---|
| Bordered, digital | Usually works | Works (camelot/tabula) | Works |
| Borderless, digital | Hit or miss | Needs per-layout tuning | Works |
| Multi-page continuation | Breaks at page joins | Manual stitching | Works with multi-page option |
| Scanned/photographed | Fails | OCR pre-step, fragile | Works (reads the image) |
| Merged cells / nested headers | Mangled | Often mangled | Usually correct |
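The "manual stitching" entry for Python libraries typically means something like the sketch below: concatenate the per-page tables and drop the header row when it repeats at the top of a continuation page. The function name and sample data are hypothetical, and it assumes the first non-empty page starts with the header.

```python
def stitch_pages(pages):
    """Concatenate per-page tables into one, keeping the header row once.

    `pages` is a list of tables, each a list of rows (lists of cell strings).
    """
    tables = [table for table in pages if table]  # skip pages with no rows
    if not tables:
        return []
    header = tables[0][0]
    stitched = [header]
    for table in tables:
        # Drop the first row only when it repeats the header.
        body = table[1:] if table[0] == header else table
        stitched.extend(body)
    return stitched

pages = [
    [["Item", "Qty"], ["A", "1"]],
    [["Item", "Qty"], ["B", "2"]],  # header repeated on page two
    [["C", "3"]],                   # continuation page without a header
]
stitched = stitch_pages(pages)
print(stitched)
```

This does not handle a row that is itself split across the page break, which is one reason stitching stays a manual step.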
The AI approach: describe the rows you want
Multi-modal models sidestep grid reconstruction: they read the page visually and return the data in the schema you describe. In DocParse you define a list-of-objects field (say, line_items with item name, quantity, unit price and amount) and enable the tables document option. The model returns an array of structured rows whether the table was bordered, split across pages, or scanned.
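For readers who think in code, a list-of-objects field is simply an array of rows that share the same keys. The sketch below is plain Python and is not DocParse's configuration format or API. The key names and the `conforms` helper are assumptions chosen to mirror the line_items example.

```python
# Hypothetical shape of one line_items row: column name -> accepted Python type(s).
LINE_ITEM_FIELDS = {
    "item_name": str,
    "quantity": (int, float),
    "unit_price": (int, float),
    "amount": (int, float),
}

def conforms(row: dict) -> bool:
    """True when the row has exactly the expected keys with the expected types."""
    return row.keys() == LINE_ITEM_FIELDS.keys() and all(
        isinstance(row[name], kind) for name, kind in LINE_ITEM_FIELDS.items()
    )

# What an extracted array of structured rows might look like.
line_items = [
    {"item_name": "Widget", "quantity": 2, "unit_price": 9.5, "amount": 19.0},
    {"item_name": "Extended warranty plan", "quantity": 1, "unit_price": 30.0, "amount": 30.0},
]
print(all(conforms(row) for row in line_items))
```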
On export, list fields expand: a 60-row table becomes 60 spreadsheet rows in the Excel/CSV download, or a JSON array via the API. For documents that are mostly one big table — price lists, transaction registers, inventory counts — this is the difference between an afternoon of cleanup and a download.
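If you take the JSON route, the same expansion is a few lines with the standard library. This sketch assumes the result arrives as a dict with a `line_items` array. That shape is illustrative, not a documented response format, and `rows_to_csv` is a hypothetical helper.

```python
import csv
import io

def rows_to_csv(document: dict, list_field: str) -> str:
    """Expand one list field into CSV text: a header line plus one line per row."""
    rows = document.get(list_field) or []
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# A made-up 60-row result, standing in for a price list.
document = {
    "invoice_number": "INV-001",
    "line_items": [
        {"item_name": f"Item {i}", "quantity": 1, "unit_price": 2.5, "amount": 2.5}
        for i in range(1, 61)
    ],
}
csv_text = rows_to_csv(document, "line_items")
print(len(csv_text.splitlines()))  # header line + 60 data rows
```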
Hard cases worth knowing about
Tables continuing across pages: enable the multiple-pages option so the model treats the document as one context; otherwise any tool may return only page one's rows.
Numeric integrity: add a validation rule on totals (e.g., amount must be present and numeric) so OCR-grade mistakes get flagged into the review queue.
Very wide tables: define every column you need explicitly in the schema; unnamed columns are the ones tools improvise on.
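The numeric-integrity rule can also be approximated on your own side of the export. The sketch below flags rows whose amount is missing or not numeric, which is the rule described above. It also cross-checks amount against quantity times unit price, an extra check added here as an assumption. The function names and the 0.01 tolerance are illustrative.

```python
from decimal import Decimal, InvalidOperation

def parse_amount(value):
    """Return a finite Decimal, or None when the value is missing or not numeric."""
    if value is None:
        return None
    try:
        number = Decimal(str(value).replace(",", "").strip())
    except InvalidOperation:
        return None
    return number if number.is_finite() else None

def rows_needing_review(rows, tolerance=Decimal("0.01")):
    """Return (row index, reason) pairs for rows that should go to manual review."""
    flagged = []
    for index, row in enumerate(rows):
        qty = parse_amount(row.get("quantity"))
        price = parse_amount(row.get("unit_price"))
        amount = parse_amount(row.get("amount"))
        if amount is None:
            flagged.append((index, "amount missing or not numeric"))
        elif qty is not None and price is not None and abs(qty * price - amount) > tolerance:
            flagged.append((index, "amount != quantity * unit_price"))
    return flagged

rows = [
    {"quantity": "2", "unit_price": "9.50", "amount": "19.00"},
    {"quantity": "1", "unit_price": "30.00", "amount": "3O.00"},  # letter O, a typical OCR slip
    {"quantity": "3", "unit_price": "4.00", "amount": "18.00"},   # arithmetic does not match
    {"quantity": "1", "unit_price": "5.00"},                      # amount missing
]
flagged = rows_needing_review(rows)
for index, reason in flagged:
    print(index, reason)
```

Using `Decimal` rather than `float` avoids false alarms from binary rounding when comparing currency amounts.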
Frequently asked questions
How do I extract a table from a scanned PDF?
Use AI extraction — multi-modal models read the scan as an image, so there's no OCR pre-step. Define your columns as a list-of-objects field and enable the tables option.
Can I extract multiple tables from one PDF?
Yes — define a separate list field per table (e.g., line_items and tax_breakdown) and the model fills each from the corresponding table in the document.
What's the best Python library for PDF table extraction?
For digital PDFs: camelot for bordered tables, pdfplumber for fine control, tabula-py for quick jobs. All need per-layout tuning and an OCR step for scans — which is why varied or scanned documents are usually better served by AI extraction.