The Unique platform supports multiple services for ingesting PDF documents:

  • Default Unique Ingestion Service

  • Docling

  • Microsoft Document Intelligence (referred to below as MDI)

  • MDI with Image Content Extraction

Each service can parse structured PDFs with a single-column layout and extract simple tables. However, their capabilities vary when handling more complex documents:

  • Image-based PDFs: Scanned or printed PDFs lack structured content, requiring OCR techniques for extraction.

  • Multi-Column Layout: PDFs with multiple columns, charts, tables, and text need pre-trained layout detection models to identify page elements and preserve logical content flow.

  • Complex Tables Detection: Extracting tables with merged cells, missing borders, or checkmarks requires specialized AI models to recognize different table components.

  • Image Content Extraction: Many PDFs contain unstructured visual elements like charts, logos, or photos. AI models with image-to-text capabilities are needed to extract this content in a searchable form.

  • On-Prem Deployment: The service can operate in a closed environment without internet access.

General recommendation:

  • On-Prem Customers: Use Docling for PDF ingestion, as the Default Unique Ingestion Service lacks efficient support for multi-column layouts.

  • Cloud Customers: Use MDI as the default ingestion service, as it provides higher accuracy than Docling, particularly for tables without grid lines.

    • If PDFs contain charts or table-like structures, MDI with Image Content Extraction is recommended for making all document content searchable and accessible to language models.
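The general recommendation above can be sketched as a small decision helper. This is only an illustration: the enum `IngestionService`, the function `recommend_service`, and its parameters are hypothetical names and not part of the Unique platform API.

```python
from enum import Enum


class IngestionService(Enum):
    """Ingestion services named on this page (enum itself is hypothetical)."""
    DEFAULT = "Default Unique Ingestion Service"
    DOCLING = "Docling"
    MDI = "Microsoft Document Intelligence"
    MDI_IMAGE = "MDI with Image Content Extraction"


def recommend_service(on_prem: bool,
                      has_charts_or_table_like: bool = False) -> IngestionService:
    """Map the general recommendation onto a single choice.

    on_prem: True if the deployment runs in a closed environment.
    has_charts_or_table_like: True if the PDFs contain charts or
        table-like structures (only considered for cloud deployments).
    """
    if on_prem:
        # On-prem customers: Docling is the recommended service.
        return IngestionService.DOCLING
    if has_charts_or_table_like:
        # Cloud customers whose PDFs contain charts or table-like structures.
        return IngestionService.MDI_IMAGE
    # Cloud customers: MDI as the default ingestion service.
    return IngestionService.MDI


if __name__ == "__main__":
    print(recommend_service(on_prem=False, has_charts_or_table_like=True).value)
```

The helper only encodes the recommendation as written; actual service selection is a configuration decision and may depend on factors not covered here.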

(tick) - fully supported 🟡 - partially supported (error) - not supported

Ingestion service

Capabilities

Performance

Additional costs

Structured PDFs

Unstructured PDFs

Image-based PDFs

Multi-Column Layouts

Complex Tables Detection

Detects Images

Image Content Extraction

On-Prem Deployment

Default

(tick)

(error)

(tick)

(error)

(error)

(error)

(error)

(tick)

10-15s per page

None

Docling

(tick) 🟡

(error)

(tick)

(tick)

🟡(tick)

(error)

(tick)

10-20s per page

Azure infrastructure costs

MDsI

(tick)

(tick)

MDI

(tick)

(tick)

(tick)

(tick)

(error)

(error)

10-20s per page

1.6 cents per page

MDI with Image Content Extraction

(tick)

(tick)

(tick)

(tick)

(tick)

(tick)

(tick)

(error)

20-30s per page

3 cents per page

Assumptions for the cost of MDI with Image Content Extraction (1.6 + 1.4 = 3 cents per page):

  • 1.6 cents per page for MDI

  • 1.4 cents for 5k tokens (vision model GPT-4o) per image, assuming 1 image per page
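As a rough illustration of how these assumptions add up for a whole document, the following sketch estimates the additional cost in cents. The function name `estimate_cost_cents` and its parameters are hypothetical; the per-page and per-image prices are taken from the assumptions above and may differ from actual billing.

```python
# Price assumptions from this page (in cents).
MDI_CENTS_PER_PAGE = 1.6      # MDI processing per page
VISION_CENTS_PER_IMAGE = 1.4  # about 5k vision-model tokens per image


def estimate_cost_cents(pages: int,
                        images_per_page: float = 1.0,
                        with_image_extraction: bool = True) -> float:
    """Estimate the additional ingestion cost of a document in cents.

    pages: number of pages in the PDF.
    images_per_page: average number of images per page (assumed 1).
    with_image_extraction: False for plain MDI, True for
        MDI with Image Content Extraction.
    """
    if pages < 0 or images_per_page < 0:
        raise ValueError("pages and images_per_page must be non-negative")
    cost = pages * MDI_CENTS_PER_PAGE
    if with_image_extraction:
        cost += pages * images_per_page * VISION_CENTS_PER_IMAGE
    # Round to hide floating-point noise in the cent values.
    return round(cost, 2)


if __name__ == "__main__":
    # A 100-page PDF with one image per page.
    print(estimate_cost_cents(100))
    # The same PDF ingested with plain MDI.
    print(estimate_cost_cents(100, with_image_extraction=False))
```

Documents with more than one image per page would cost proportionally more under this model, which is why the 3 cents figure should be read as an estimate for the one-image-per-page case only.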

...