The Unique platform supports several services for ingesting PDF documents:

  • Default Unique Ingestion Service

  • Docling

  • Microsoft Document Intelligence (referred to as MDI below)

  • MDI with Image Content Extraction

General recommendation:

  • On-prem customers should deploy Docling for PDF ingestion, because Unique's Default Ingestion Service cannot process multi-column layouts in a performant way.

  • Cloud customers should use MDI as their default ingestion service, because it extracts information from PDFs with higher precision than Docling, for example tables that are missing grid lines. Customers whose PDF documents contain charts or table-like structures should use MDI with Image Content Extraction, which makes almost all content that a PDF document can contain searchable and accessible to a language model.
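The recommendation above can be read as a small decision rule. The following is a minimal sketch only: the enum `IngestionService`, the function `recommend_service`, and its parameter names are hypothetical and are not part of the Unique platform or its configuration.

```python
from enum import Enum


class IngestionService(Enum):
    # Labels mirror the service names listed on this page.
    DEFAULT = "Default Unique Ingestion Service"
    DOCLING = "Docling"
    MDI = "Microsoft Document Intelligence"
    MDI_IMAGE = "MDI with Image Content Extraction"


def recommend_service(on_prem: bool, has_charts_or_table_like_content: bool = False) -> IngestionService:
    """Return the service suggested by the general recommendation (hypothetical helper)."""
    if on_prem:
        # On-prem customers: Docling, since the default service struggles with multi-column layouts.
        return IngestionService.DOCLING
    if has_charts_or_table_like_content:
        # Cloud customers with charts or table-like structures: MDI plus image content extraction.
        return IngestionService.MDI_IMAGE
    # All other cloud customers: plain MDI.
    return IngestionService.MDI


if __name__ == "__main__":
    print(recommend_service(on_prem=False, has_charts_or_table_like_content=True).value)
```

The sketch encodes only the two bullets above; it does not consider cost or processing time, which may also influence the choice.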

Capabilities, performance, and additional costs by ingestion service:

| Ingestion service | Structured PDFs | Unstructured PDFs | One-Column Layout | Multi-Column Layout | Extracts Tables | Detects Images | Extracts Image Content | On-prem deployment | Performance | Additional costs |
|---|---|---|---|---|---|---|---|---|---|---|
| Default | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | 10-15s per page | None |
| Docling | ✓ | ✗ | ✓ | ✓ | 🟡 | ✓ | ✗ | ✓ | 10-20s per page | Azure infra costs |
| MDI | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | 10-20s per page | 1.6 cents per page |
| MDI with Image Content Extraction | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 20-30s per page | 3 cents per page |

Assumptions behind the 3 cents per page for MDI with Image Content Extraction:

  • 1.6 cents per page for MDI

  • 1.4 cents for 5k tokens (vision model GPT-4o) per image, assuming 1 image per page
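To make the arithmetic behind these assumptions explicit, here is a minimal sketch of a per-document cost estimate. The function `estimate_cost_cents` and its parameters are hypothetical; the two rates are the figures stated above, and scaling the vision cost linearly with the number of images per page is an assumption of this sketch, not a documented pricing rule.

```python
from decimal import Decimal

# Rates taken from the assumptions on this page (in cents).
MDI_CENTS_PER_PAGE = Decimal("1.6")
VISION_CENTS_PER_IMAGE = Decimal("1.4")  # about 5k tokens of GPT-4o per image


def estimate_cost_cents(pages: int, images_per_page: int = 1, image_extraction: bool = True) -> Decimal:
    """Estimate the additional ingestion cost in cents (hypothetical helper)."""
    if pages < 0 or images_per_page < 0:
        raise ValueError("pages and images_per_page must be non-negative")
    per_page = MDI_CENTS_PER_PAGE
    if image_extraction:
        # Assumption: vision cost grows linearly with the number of images on a page.
        per_page += VISION_CENTS_PER_IMAGE * images_per_page
    return per_page * pages


if __name__ == "__main__":
    # 200 pages with one image each, with and without image content extraction.
    print(estimate_cost_cents(200))
    print(estimate_cost_cents(200, image_extraction=False))
```

With one image per page the per-page cost is 1.6 + 1.4 = 3 cents, which matches the figure quoted for MDI with Image Content Extraction. Pages with more or larger images would likely cost more than this estimate suggests.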
