End-to-end solutions for on-prem PDF ingestion

Making PDFs machine-processable remains challenging: the format comes in many variants and is optimized for printing, which strips structure information and metadata. We evaluated a set of libraries as candidates for on-prem PDF ingestion. Of the solutions investigated, Docling performs best in most test scenarios and is therefore our recommended solution.

While most of the libraries extract text from PDFs fairly well, only a few can extract table structures including gridlines (Docling, Parsr, pymupdf4LLM). Documents often contain figures or graphs, especially in the financial services industry, for example to document market trends. Once figures are detected, they can be analyzed further with vision capabilities to capture their context and make the document representation more complete. However, only one solution can detect figures.

PDFs come in two kinds: those with an extractable, underlying text layer (e.g., a direct export of a docx file) and image-like PDFs (e.g., scans) that provide no direct access to their text content. Text is easier to extract from the former; the latter requires OCR. In either case, PDFs do not provide structure information about their content elements, so layout detection is needed to identify the layout and the elements before each element can be handled. Only Docling could extract some information from an image-like PDF; the other libraries could not.
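The split between text-layer PDFs and image-like PDFs can be checked page by page before choosing an extraction path. The following is a minimal sketch, not part of the evaluation itself: it assumes pypdf is installed for reading the text layer, and the helper names (`needs_ocr`, `classify_pages`, `read_page_texts`) and the 20-character threshold are hypothetical choices made for illustration.

```python
from typing import Iterable, List


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Return True when a page has too little extractable text to rely on.

    The threshold is an assumption; tune it on your own documents.
    """
    # Count only non-whitespace characters so blank text layers do not pass.
    visible = sum(1 for ch in (page_text or "") if not ch.isspace())
    return visible < min_chars


def classify_pages(page_texts: Iterable[str], min_chars: int = 20) -> List[str]:
    """Label each page as 'text' (text layer usable) or 'ocr' (image-like)."""
    return ["ocr" if needs_ocr(text, min_chars) else "text" for text in page_texts]


def read_page_texts(pdf_path: str) -> List[str]:
    """Read the text layer of every page with pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(pdf_path)
    # Pages without a text layer typically yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]
```

With pypdf available, `classify_pages(read_page_texts("report.pdf"))` would return one label per page; pages labeled `ocr` are the ones that need an OCR-capable library.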

| Name | License | Capabilities | Resource requirements | Input | Output | Evaluation result |
|---|---|---|---|---|---|---|
| Docling by IBM | MIT | Text extraction, Layout detection, Table structure analysis, OCR, Figure extraction | Requires a larger machine (to be tested); runs on CPU and GPU; requires pre-trained models for layout detection and table structure analysis (approx. 400 MB) | PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc and Markdown | Markdown, JSON, Figure / Table export | Recommended option. Performs well across test cases. It is well documented, easy to set up and under active development. |
| Parsr | Mixed | Text extraction, Hierarchy detection, Table structure analysis, OCR, Figure detection | Runs on CPU; most likely also requires a larger machine | PDF, DOCX, Image, EML | Markdown, JSON, Raw Text, CSV | Performs well across test cases. It is well documented, but it was fairly difficult to get working. It does not seem to be under active development. |
| pymupdf4LLM | GPL 3.0 | Text extraction, Layout detection, Table structure analysis, OCR | Runs on CPU; small machine sufficient (to be tested) | Various | Markdown | Performs well for simple text and table extraction but makes mistakes when extracting more complex table structures. |
| Unstructured | Apache 2.0 | Text extraction | Runs on CPU; small machine sufficient (to be tested) | Various | Text | Extracts only text |
| Langchain DocumentLoader | MIT, Apache 2.0, GPL 3.0 | Text extraction | Runs on CPU; small machine sufficient (to be tested) | PDF | Text | Extracts only text |
| LlamaIndex | MIT | Text extraction | Runs on CPU; small machine sufficient (to be tested) | PDF | Text | Extracts only text |
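For the recommended option, a conversion to Markdown can be sketched as follows. This is a hedged example rather than the setup used in the evaluation: it relies on Docling's `DocumentConverter` and `export_to_markdown()` as shown in the Docling documentation, while the wrapper name `pdf_to_markdown` and its injectable `converter` parameter are hypothetical additions so the function can be exercised without Docling installed.

```python
from typing import Any, Optional


def pdf_to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert a document to Markdown with a Docling-style converter.

    `converter` is injectable (an assumption for testability); when omitted,
    a Docling DocumentConverter is created.
    """
    if converter is None:
        # Lazy import: Docling is third-party and assumed to be installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # convert() returns a result whose `document` can be exported.
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("report.pdf")` on a machine with Docling and its pre-trained models available would return the Markdown representation of the file.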

We evaluated the libraries using the most common document element and layout arrangements:

  • vertical / horizontal orientation

  • one to multiple columns

  • tables with merged columns, no gridlines (colored), checkmark values, color-coded values

  • with and without figures

  • mix of all of the above
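For the table test cases above, one rough way to check automatically whether a library's Markdown output kept any table structure is to look for pipe tables. The sketch below is an illustration only and not how the evaluation was performed; the function name `count_markdown_tables` and the heuristic (a row containing `|` directly followed by a delimiter row, two or more columns) are assumptions.

```python
import re

# A Markdown table delimiter row such as "| --- | :---: |" or "---|---".
# Requires at least two columns, so a plain "---" rule does not match.
_DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def count_markdown_tables(markdown: str) -> int:
    """Count pipe tables: a row containing '|' followed by a delimiter row."""
    lines = markdown.splitlines()
    count = 0
    # Walk over consecutive line pairs (candidate header, candidate delimiter).
    for header, delimiter in zip(lines, lines[1:]):
        if "|" in header and _DELIMITER_ROW.match(delimiter):
            count += 1
    return count
```

A result of zero for a page that visibly contains a table suggests the structure was lost and only the cell text was extracted.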

Please reach out if you would like to have the direct outputs of the test runs.

[Test-case screenshots: image-20241204-160536.png (Text single column), image-20241204-160630.png (Text two columns)]

| Name and link | Time to process one page | Assessment |
|---|---|---|
| Docling | 3.4 s | Columns wrongly assigned; One column header missing |
| Parsr | 5.2 s | No image placeholder (×2) |
| pymupdf4LLM | 0.1 s | Mixed up values (×2); Tables replicated, chart not detected (×2) |
| Unstructured | 4.1 s | No gridlines (×4); No gridlines, chart not detected (×2) |
| Langchain DocumentLoader: pdfium2 (Apache 2.0) | 0.0 s | No gridlines (×4); No gridlines, chart not detected (×2) |
| Langchain DocumentLoader: pdfminer (MIT) | 0.0 s | No gridlines (×4); No gridlines, chart not detected (×2) |
| Langchain DocumentLoader: pdfplumber (MIT) | 0.0 s | No gridlines (×4); No gridlines, chart not detected (×2) |
| Langchain DocumentLoader: pymupdf (GPL 3.0) | 0.0 s | No gridlines (×4); No gridlines, chart not detected (×2) |
| Langchain DocumentLoader: pypdf (Custom) | 0.0 s | No gridlines (×4); No gridlines, chart not detected (×2) |
| Langchain DocumentLoader: unstructured (Apache 2.0) | 3.2 s | No gridlines (×4); No gridlines, chart not detected (×2) |
| LlamaIndex | 0.0 s | No gridlines (×4); No gridlines, chart not detected (×2) |
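Per-page processing times like the ones reported here could be collected with a small timing harness. The sketch below is one possible approach under stated assumptions, not the procedure behind the numbers above: the function name `seconds_per_page`, the best-of-N strategy, and the `extract` callable (any function that takes a file path and runs one library's extraction) are all hypothetical.

```python
import time
from typing import Callable


def seconds_per_page(
    extract: Callable[[str], object],
    pdf_path: str,
    page_count: int,
    repeats: int = 3,
) -> float:
    """Return the best-of-N wall-clock time per page for one extractor."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    if repeats <= 0:
        raise ValueError("repeats must be positive")

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        extract(pdf_path)  # run the full extraction once
        best = min(best, time.perf_counter() - start)

    # Normalize by page count so documents of different length are comparable.
    return best / page_count
```

Taking the best of several runs reduces noise from one-off costs such as model loading or disk caching; reporting the mean instead is an equally reasonable choice.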

 


Author

@Martin Fadler

 

© 2024 Unique AG. All rights reserved.