The Unique platform supports multiple services for ingesting PDF documents:
Default Unique Ingestion Service
Docling
Microsoft Document Intelligence (hereafter MDI)
MDI with Image Content Extraction
Each service can parse structured PDFs with a single-column layout and extract simple tables. However, their capabilities vary when handling more complex documents:
Image-based PDFs: Scanned or printed PDFs lack structured content, requiring OCR techniques for extraction.
Multi-Column Layout: PDFs with multiple columns, charts, tables, and text need pre-trained layout detection models to identify page elements and preserve logical content flow.
Complex Tables Detection: Extracting tables with merged cells, missing borders, or checkmarks requires specialized AI models to recognize different table components.
Image Content Extraction: Many PDFs contain unstructured visual elements like charts, logos, or photos. AI models with image-to-text capabilities are needed to extract this content in a searchable form.
On-Prem Deployment: The service can operate in a closed environment without internet access.
General recommendation:
On-Prem Customers: Use Docling for PDF ingestion, as the Default Unique Ingestion Service lacks efficient support for multi-column layouts.
Cloud Customers: Use MDI as the default ingestion service, as it provides higher accuracy than Docling, particularly for tables without grid lines.
If PDFs contain charts or table-like structures, MDI with Image Content Extraction is recommended for making all document content searchable and accessible to language models.
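The recommendation above can be read as a small decision rule. The following is a minimal sketch of that rule, not part of the Unique platform: the names `IngestionService` and `recommend_service` are hypothetical, and it assumes that the on-prem constraint takes precedence over the image content recommendation.

```python
from enum import Enum


class IngestionService(Enum):
    # Display names taken from the list of services above.
    DEFAULT = "Default Unique Ingestion Service"
    DOCLING = "Docling"
    MDI = "Microsoft Document Intelligence"
    MDI_IMAGE = "MDI with Image Content Extraction"


def recommend_service(on_prem: bool, has_charts_or_table_like_structures: bool) -> IngestionService:
    """Return the recommended ingestion service for a deployment.

    Assumption: on-prem customers are always pointed to Docling,
    even when their PDFs contain charts or table-like structures.
    """
    if on_prem:
        return IngestionService.DOCLING
    if has_charts_or_table_like_structures:
        return IngestionService.MDI_IMAGE
    return IngestionService.MDI


if __name__ == "__main__":
    # A cloud customer whose PDFs contain charts.
    print(recommend_service(on_prem=False, has_charts_or_table_like_structures=True).value)
```

Encoding the rule as a function keeps the precedence explicit: the deployment constraint is checked first, and the document content only matters for cloud customers.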
- fully supported
🟡 - partially supported
- not supported
Ingestion service | Capabilities | | | | | | Performance | Additional costs
---|---|---|---|---|---|---|---|---
 | Structured PDFs | Image-based PDFs | Multi-Column Layouts | Complex Tables Detection | Image Content Extraction | On-Prem deployment | |
Default | | | | | | | 10-15s per page | None
Docling | | | | | | 🟡 | 10-20s per page | Azure infra costs
MDI | | | | | | | 10-20s per page | 1.6 cents per page
MDI with Image Content Extraction | | | | | | | 20-30s per page | 3 cents per page Assumption:
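To make the performance and cost figures in the table easier to compare, the following sketch estimates processing time and additional cost for a batch of pages. It is illustrative only: the names `ServiceFigures`, `FIGURES`, and `estimate` are hypothetical, pages are assumed to be processed one after another (no parallelism), and Docling's Azure infrastructure costs are left out because the table gives no per-page price for them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ServiceFigures:
    seconds_per_page: Tuple[int, int]   # (lower bound, upper bound) from the table
    cents_per_page: Optional[float]     # None where the table gives no per-page price


# Figures copied from the table; keys are the service names used there.
FIGURES = {
    "Default": ServiceFigures((10, 15), 0.0),
    "Docling": ServiceFigures((10, 20), None),  # Azure infra costs, not priced per page
    "MDI": ServiceFigures((10, 20), 1.6),
    "MDI with Image Content Extraction": ServiceFigures((20, 30), 3.0),
}


def estimate(service: str, pages: int) -> dict:
    """Estimate sequential processing time and additional cost for `pages` pages."""
    figures = FIGURES[service]
    low, high = figures.seconds_per_page
    cost = None
    if figures.cents_per_page is not None:
        # Convert cents to whole currency units (100 cents per unit).
        cost = round(pages * figures.cents_per_page / 100, 2)
    return {"min_seconds": pages * low, "max_seconds": pages * high, "cost": cost}


if __name__ == "__main__":
    print(estimate("MDI", 100))
    print(estimate("MDI with Image Content Extraction", 100))
```

Under these assumptions, a 100-page document takes roughly 17 to 33 minutes with MDI and 33 to 50 minutes with MDI with Image Content Extraction, at about twice the per-page cost.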
...