
The Unique platform supports multiple services for ingesting PDF documents:

  • Default Unique Ingestion Service

  • Docling

  • Microsoft Document Intelligence (hereafter MDI)

  • MDI with Image Content Extraction

Each service can parse structured PDFs with a single-column layout and extract simple tables. The services differ, however, in how they handle more complex documents and in where they can be deployed. They are compared on the following criteria:

  • Image-based PDFs: Scanned or printed PDFs lack structured content, requiring OCR techniques for extraction.

  • Multi-Column Layout: PDFs with multiple columns, charts, tables, and text need pre-trained layout detection models to identify page elements and preserve logical content flow.

  • Complex Tables Detection: Extracting tables with merged cells, missing borders, or checkmarks requires specialized AI models to recognize different table components.

  • Image Content Extraction: Many PDFs contain unstructured visual elements like charts, logos, or photos. AI models with image-to-text capabilities are needed to extract this content in a searchable form.

  • On-Prem Deployment: The service can operate in a closed environment without internet access.

General recommendation:

  • On-Prem Customers: Use Docling for PDF ingestion, as the Default Unique Ingestion Service lacks efficient support for multi-column layouts.

  • Cloud Customers: Use MDI as the default ingestion service, as it provides higher accuracy than Docling, particularly for tables without grid lines.

    • If PDFs contain charts or table-like structures, MDI with Image Content Extraction is recommended for making all document content searchable and accessible to language models.
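The recommendation above can be written down as a small decision rule. The following is a minimal sketch only: `Service` and `recommend_service` are hypothetical names chosen for illustration, not part of any Unique platform API, and the two boolean inputs are a simplification of the criteria described above.

```python
from enum import Enum


class Service(Enum):
    # Hypothetical identifiers for the four ingestion services listed above.
    DEFAULT = "Default Unique Ingestion Service"
    DOCLING = "Docling"
    MDI = "Microsoft Document Intelligence"
    MDI_IMAGE = "MDI with Image Content Extraction"


def recommend_service(on_prem: bool, has_charts_or_table_like_content: bool = False) -> Service:
    """Return the generally recommended PDF ingestion service.

    Assumption: the choice depends only on the deployment type and on
    whether the PDFs contain charts or table-like structures.
    """
    if on_prem:
        # On-prem customers: Docling is recommended over the default service.
        return Service.DOCLING
    if has_charts_or_table_like_content:
        # Cloud customers with charts or table-like structures.
        return Service.MDI_IMAGE
    # Cloud customers otherwise use MDI as the default.
    return Service.MDI


if __name__ == "__main__":
    print(recommend_service(on_prem=False, has_charts_or_table_like_content=True).value)
```

Real projects may need to weigh further factors, such as cost and processing time per page, which this sketch deliberately leaves out.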

(tick) - fully supported 🟡 - partially supported (error) - not supported

| Ingestion service | Image-based PDFs | Multi-Column Layouts | Complex Tables Detection | Image Content Extraction | On-Prem deployment | Performance | Additional costs |
|---|---|---|---|---|---|---|---|
| Default | (error) | (error) | (error) | (error) | (tick) | 10-15s per page | None |
| Docling | 🟡 | (tick) | 🟡 | (error) | (tick) | 10-20s per page | Azure infra costs |
| MDI | (tick) | (tick) | (tick) | (error) | (error) | 10-20s per page | 1.6 cents per page |
| MDI with Image Content Extraction | (tick) | (tick) | (tick) | (tick) | (error) | 20-30s per page | 3 cents per page |
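For tooling that needs to pick a service from a list of required capabilities, the comparison can be encoded as data. This is an illustrative sketch: the dictionary keys, the `services_supporting` helper, and the `full`/`partial`/`none` labels are assumptions made for this example; only the support levels themselves are taken from the comparison above.

```python
# Support levels transcribed from the comparison: full, partial, or none.
CAPABILITIES = {
    "Default": {
        "image_based_pdfs": "none", "multi_column_layouts": "none",
        "complex_tables": "none", "image_content_extraction": "none",
        "on_prem": "full",
    },
    "Docling": {
        "image_based_pdfs": "partial", "multi_column_layouts": "full",
        "complex_tables": "partial", "image_content_extraction": "none",
        "on_prem": "full",
    },
    "MDI": {
        "image_based_pdfs": "full", "multi_column_layouts": "full",
        "complex_tables": "full", "image_content_extraction": "none",
        "on_prem": "none",
    },
    "MDI with Image Content Extraction": {
        "image_based_pdfs": "full", "multi_column_layouts": "full",
        "complex_tables": "full", "image_content_extraction": "full",
        "on_prem": "none",
    },
}


def services_supporting(required, allow_partial=False):
    """Return the services that support every capability in `required`.

    With allow_partial=True, partially supported capabilities also count.
    """
    accepted = {"full", "partial"} if allow_partial else {"full"}
    return [
        name
        for name, caps in CAPABILITIES.items()
        if all(caps[capability] in accepted for capability in required)
    ]


if __name__ == "__main__":
    # Scanned PDFs in a closed environment: only a partial match exists.
    print(services_supporting(["image_based_pdfs", "on_prem"], allow_partial=True))
```

Note that no service fully supports image-based PDFs in an on-prem deployment, which is why the usage example has to accept partial support.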

Cost assumptions for MDI with Image Content Extraction:

  • 1.6 cents per page for MDI

  • 1.4 cents per page for the vision model (GPT-4o), based on 5k tokens per image and 1 image per page
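The per-page figures above can be turned into a rough budget estimate. This sketch assumes that the vision model cost scales linearly with the number of images per page, which the figures above do not state explicitly; `estimate_cost_cents` and its parameters are hypothetical names, and actual prices may differ.

```python
MDI_CENTS_PER_PAGE = 1.6      # MDI cost per page, from the figures above
VISION_CENTS_PER_IMAGE = 1.4  # vision model cost per image (about 5k tokens)


def estimate_cost_cents(pages, images_per_page=1.0, image_extraction=True):
    """Estimate the additional ingestion cost in cents.

    Assumption: vision cost grows linearly with the number of images;
    the source figures only cover the case of 1 image per page.
    """
    cost = pages * MDI_CENTS_PER_PAGE
    if image_extraction:
        cost += pages * images_per_page * VISION_CENTS_PER_IMAGE
    return cost


if __name__ == "__main__":
    # 1,000 pages with one image each: about 3 cents per page in total.
    print(f"{estimate_cost_cents(1000) / 100:.2f} USD")
```

With one image per page, the two components add up to the 3 cents per page quoted for MDI with Image Content Extraction, so a 1,000-page corpus costs roughly 30 USD.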
