End-to-end solutions for on-prem PDF ingestion
Making PDFs machine-processable remains challenging: the format comes in many variants and is optimized for printing, which strips structure information and metadata. We evaluated a set of libraries as candidates for on-prem PDF ingestion. Of the solutions investigated, Docling performs best in most test scenarios and is therefore our recommendation.
While most of the libraries extract text from PDFs fairly well, only a few can extract table structures including gridlines (Docling, Parsr, pymupdf4LLM).
Documents often contain figures or graphs, especially in the financial services industry, where they are used to document market trends, for example. Once figures are detected, they can be analyzed further with vision capabilities to capture their context and make the document representation more complete. However, only one solution can detect figures.
PDFs come in two sorts: PDFs with an extractable, underlying text layer (e.g., a direct export of a docx file) and image-like PDFs (e.g., scans) that provide no direct access to the PDF's content. Text is easier to extract from the former; the latter requires OCR techniques. In both cases, PDFs provide no structure information about their content elements, so layout detection is required to identify the layout and its elements before each element can be handled. Only Docling can extract some information from an image-like PDF; the other libraries cannot.
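The distinction between text-layer PDFs and image-like PDFs can be turned into a simple routing step. The following is a minimal sketch, not part of the evaluation itself: it assumes some extractor has already returned one string per page (empty for pages without a text layer), and the names `route_pages`, `PageRoute` and the `min_chars` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageRoute:
    page_number: int   # 1-based page index
    needs_ocr: bool    # True if the page looks image-like
    char_count: int    # non-whitespace characters found in the text layer


def route_pages(page_texts, min_chars=20):
    """Decide per page whether the text layer is usable or OCR is needed.

    page_texts: list of strings, one per page, as returned by any text
        extractor (an empty string for pages with no text layer).
    min_chars: assumed threshold; tune it on your own documents.
    """
    routes = []
    for number, text in enumerate(page_texts, start=1):
        # Count non-whitespace characters only, so pages that contain
        # nothing but stray spaces or newlines still count as image-like.
        char_count = sum(1 for ch in text if not ch.isspace())
        routes.append(PageRoute(number, char_count < min_chars, char_count))
    return routes


if __name__ == "__main__":
    pages = ["Quarterly report 2024: revenue grew in all regions.", "", "  \n "]
    for route in route_pages(pages):
        print(route)
```

Pages flagged with `needs_ocr` would be sent through an OCR-capable pipeline, while the rest can use the cheaper text-layer extraction.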
| Name | License | Capabilities | Resource requirements | Input | Output | Evaluation result |
|---|---|---|---|---|---|---|
| Docling by IBM | MIT | Text extraction, Layout detection, Table structure analysis, OCR, Figure extraction | | PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc and Markdown | Markdown, JSON, Figure / Table export | Recommended option |
| | Mixed | Text extraction, Hierarchy detection, Table structure analysis, OCR, Figure detection | | PDF, DOCX, Image, EML | Markdown, JSON, Raw Text, CSV | Performs well across test cases. It is well documented, but it was fairly difficult to get working. It does not seem to be under active development. |
| | GPL 3.0 | Text extraction, Layout detection, Table structure analysis, OCR | | Various | Markdown | Performs well for simple text and table extraction but makes mistakes when extracting more complex table structures. |
| | Apache 2.0 | Text extraction | | Various | Text | Extracts only text |
| | MIT, Apache 2.0, GPL 3.0 | Text extraction | | | Text | Extracts only text |
| | MIT | Text extraction | | | Text | Extracts only text |
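As an illustration of the recommended option, the sketch below converts one document to Markdown. It relies on Docling's `DocumentConverter` and `export_to_markdown()` as we understand them from the Docling documentation; the exact API may differ between versions. The helper name `pdf_to_markdown` and the optional `converter` parameter are assumptions added for illustration.

```python
def pdf_to_markdown(source, converter=None):
    """Convert one document to Markdown.

    source: path or URL of the document (e.g. a hypothetical "report.pdf").
    converter: any object with a Docling-style `convert(source)` method.
        If omitted, a Docling DocumentConverter is created, which requires
        Docling to be installed (`pip install docling`).
    """
    if converter is None:
        # Imported lazily so the helper can be exercised with a stand-in
        # converter on machines where Docling is not installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("report.pdf")` would run the default Docling pipeline; passing a custom `converter` keeps the helper easy to swap or test.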
We evaluated the libraries using the most common document element and layout arrangements:
- vertical / horizontal orientation
- one to multiple columns
- tables with merged columns, no grid lines (colored), checkmark values, color coded values
- with and without figures
- mix of all of the above
Please reach out if you would like to have the direct outputs of the test runs.
| Name and link | Time to process one page | Text single column | Text two columns | Issues noted in the other test cases |
|---|---|---|---|---|
| | 3.4s | | | Columns wrongly assigned; One column header missing |
| | 5.2s | | | No image placeholder (×2) |
| | 0.1s | | | Mixed up values (×2); Tables replicated, chart not detected (×2) |
| | 4.1s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
| | 0.0s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
| | 0.0s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
| | 0.0s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
| | 0.0s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
| | 0.0s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
| | 3.2s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
| | 0.0s | | | No gridlines (×4); No gridlines, chart not detected (×2) |
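A per-page processing time like the one reported above can be measured with a small harness. The source does not say how the timings were taken, so the following is only a sketch under assumed conventions: wall-clock time, best of a few repeats, divided by the page count. The name `seconds_per_page` and its parameters are hypothetical.

```python
import time


def seconds_per_page(convert, source, page_count, repeats=3):
    """Return the best wall-clock time per page over `repeats` runs.

    convert: callable that processes `source` (e.g. a library wrapper).
    source: document path or URL passed through to `convert`.
    page_count: number of pages in the document; must be positive.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        convert(source)
        # Keep the fastest run to reduce noise from caching and warm-up.
        best = min(best, time.perf_counter() - start)
    return best / page_count
```

Taking the best of several runs hides one-off costs such as model loading; if those costs matter for your deployment, time the first run separately.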
Author: @Martin Fadler
© 2024 Unique AG. All rights reserved.