End-to-end solutions for on-prem PDF ingestion

Making PDFs machine-processable remains challenging: the format comes in many variants and is optimized for printing, which strips structure information and metadata. We evaluated a set of libraries as candidates for on-prem PDF ingestion. Of the solutions investigated, Docling performed best in most test scenarios and is therefore our recommendation.

While most of the libraries extract text from PDFs fairly well, only a few can extract table structures including gridlines (Docling, Parsr, pymupdf4LLM). Documents often contain figures or graphs; in the financial services industry, for example, they are used to document market trends. Once figures are detected, they can be analyzed further with vision capabilities to capture their context and make the document representation more complete. However, only one of the evaluated solutions can detect figures.

PDFs come in two kinds: PDFs with an extractable underlying text layer (e.g., a direct export of a DOCX file) and image-like PDFs (e.g., scans) that provide no access to the PDF's content. Text is easier to extract from the former; the latter requires OCR. In either case, PDFs carry no structure information about their content elements, so layout detection is needed to identify the layout and the elements before each element can be handled. Only Docling extracted some information from an image-like PDF; the other libraries extracted none.
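The distinction between text-layer PDFs and image-like PDFs can be turned into a simple routing check. The sketch below is a heuristic and not part of any of the evaluated libraries: it assumes the page texts have already been extracted with one of the text extractors, and the function name `needs_ocr` and the threshold `min_chars_per_page` are hypothetical choices that would need tuning on real documents.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is image-like (e.g., a scan) and needs OCR.

    page_texts holds the text extracted per page by any text extractor.
    A page counts as "sparse" when it yields fewer than
    min_chars_per_page non-whitespace-trimmed characters (assumed threshold).
    """
    if not page_texts:
        # Nothing could be extracted at all: treat the file as image-like.
        return True
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    # Route to OCR when more than half of the pages are sparse.
    return sparse_pages > len(page_texts) / 2
```

A document that returns `True` would be sent to an OCR-capable pipeline, while the others can go through plain text extraction first.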

Name	License	Capabilities	Resource requirements	Input	Output	Evaluation result
Docling by IBM	MIT	Text extraction, Layout detection, Table structure analysis, OCR, Figure extraction	Requires a larger machine (to be tested); runs on CPU and GPU; requires pre-trained models for layout detection and table structure analysis (approx. 400 MB)	PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc and Markdown	Markdown, JSON, Figure / Table export	Recommended option. Performs well across test cases. It is well documented, easy to set up, and under active development.
Parsr	Mixed	Text extraction, Hierarchy detection, Table structure analysis, OCR, Figure detection	Runs on CPU; most likely also requires a larger machine	PDF, DOCX, Image, EML	Markdown, JSON, Raw Text, CSV	Performs well across test cases. It is well documented, but it was fairly difficult to get working. It does not seem to be under active development.
pymupdf4LLM	GPL 3.0	Text extraction, Layout detection, Table structure analysis, OCR	Runs on CPU; small machine sufficient (to be tested)	Various	Markdown	Performs well for simple text and table extraction but makes mistakes when extracting more complex table structures.
Unstructured	Apache 2.0	Text extraction	Runs on CPU; small machine sufficient (to be tested)	Various	Text	Extracts only text
Langchain DocumentLoader	MIT, Apache 2.0, GPL 3.0	Text extraction	Runs on CPU; small machine sufficient (to be tested)	PDF	Text	Extracts only text
LlamaIndex	MIT	Text extraction	Runs on CPU; small machine sufficient (to be tested)	PDF	Text	Extracts only text
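For orientation, a minimal sketch of converting a document to Markdown with the recommended option. It relies on Docling's `DocumentConverter` and `export_to_markdown()`; the wrapper name `pdf_to_markdown` and its injectable `converter` parameter are hypothetical additions for illustration and are not part of Docling or of our evaluation setup.

```python
from typing import Any, Optional


def pdf_to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert a document (path or URL) to Markdown.

    converter is injectable (hypothetical convenience) so the wrapper can
    be exercised without Docling installed; by default a Docling
    DocumentConverter is created.
    """
    if converter is None:
        # Imported lazily so this module can be imported without Docling.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

With Docling installed, a call such as `pdf_to_markdown("report.pdf")` (the file name is a placeholder) would return the Markdown representation, including tables where Docling detects them.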

We evaluated the libraries using the most common document element and layout arrangements:

vertical / horizontal orientation
one to multiple columns
tables with merged columns, no grid lines (colored), checkmark values, color coded values
with and without figures
mix of all of the above
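For the table scenarios above, whether any table structure survived in a library's Markdown output can be checked automatically. The following is a rough sketch and not the procedure used in our evaluation, which was done by inspecting the outputs: it only looks for a header row followed by a Markdown separator row, and the function name `has_markdown_table` is hypothetical.

```python
import re

# Matches a Markdown table separator row such as "| --- | :---: |" or "|---|---|".
_SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def has_markdown_table(markdown: str) -> bool:
    """Return True if the text contains at least one pipe-style table.

    A table is recognized as a line containing "|" that is directly
    followed by a separator row that also contains "|".
    """
    lines = markdown.splitlines()
    for header, separator in zip(lines, lines[1:]):
        if "|" in header and "|" in separator and _SEPARATOR.match(separator):
            return True
    return False
```

This only detects that a table was emitted at all; it does not verify that merged cells, checkmarks, or column assignments are correct, which still requires comparing against the source document.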

Please reach out if you would like to have the direct outputs of the test runs.

Name and link	Time to process one page	Assessment
Name and link	Time to process one page	Text single column	Text two columns	Text three columns	Simple table	Table with merged cells	Table with checkmarks	Table colored and no gridlines	Table with color coding	Mixed two columns with figure	Mixed four columns with figure (horizontal)	Image-like PDF
Docling	3.4s							Columns wrongly assigned				One column header missing
Parsr	5.2s									No image placeholder	No image placeholder
pymupdf4LLM	0.1s					Mixed up values	Mixed up values			Tables replicated, chart not detected	Tables replicated, chart not detected
Unstructured	4.1s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
Langchain DocumentLoader
pdfium2 (Apache 2.0)	0.0s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
pdfminer (MIT)	0.0s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
pdfplumber (MIT)	0.0s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
pymupdf (GPL 3.0)	0.0s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
pypdf (Custom)	0.0s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
unstructured (Apache 2.0)	3.2s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
LlamaIndex	0.0s				No gridlines	No gridlines	No gridlines	No gridlines		No gridlines, chart not detected	No gridlines, chart not detected
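A per-page processing time like the one in the table can be measured with a small harness. This is a sketch of one possible approach and not necessarily how the figures above were obtained: `seconds_per_page`, its parameters, and the best-of-N strategy are assumptions, and `extract` stands for any callable that processes a file with one of the libraries.

```python
import time
from typing import Callable


def seconds_per_page(
    extract: Callable[[str], object],
    path: str,
    page_count: int,
    repeats: int = 3,
) -> float:
    """Return the best-of-`repeats` wall-clock time per page in seconds."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        extract(path)  # run the library under test on the whole file
        best = min(best, time.perf_counter() - start)
    return best / page_count
```

Taking the best of several runs reduces noise from caching and model loading; for libraries that load models on first use, the first run can be considerably slower than later ones.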

Author	@Martin Fadler
