The Unique platform supports several services for ingesting PDF documents:
- Default Unique Ingestion Service
- Docling
- Microsoft Document Intelligence (referred to as MDI below)
- MDI with Image Content Extraction
General recommendation:
On-prem customers should deploy Docling for PDF ingestion because Unique's Default ingestion service does not handle multi-column layouts well.
Cloud customers should use MDI as their default ingestion service because it extracts information from PDFs with higher precision than Docling, for example from tables that are missing grid lines. Customers whose PDF documents contain charts or table-like structures should use MDI with Image Content Extraction, as it makes almost all of the content that a PDF document can contain searchable and accessible to a language model.
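The recommendation above can be read as a small decision rule. The following is a minimal sketch of that rule, not part of the Unique platform: the names `IngestionService` and `recommend_service` are hypothetical, and it assumes the advice about charts and table-like structures applies to cloud customers only, as the paragraph suggests.

```python
from enum import Enum


class IngestionService(Enum):
    """Service names as listed in this document (hypothetical enum)."""
    DEFAULT = "Default"
    DOCLING = "Docling"
    MDI = "MDI"
    MDI_IMAGE_CONTENT = "MDI with Image Content Extraction"


def recommend_service(on_prem: bool, has_charts_or_table_like: bool = False) -> IngestionService:
    """Return the service suggested by the general recommendation."""
    if on_prem:
        # On-prem customers: Docling instead of the Default service.
        return IngestionService.DOCLING
    if has_charts_or_table_like:
        # Cloud customers with charts or table-like structures.
        return IngestionService.MDI_IMAGE_CONTENT
    # Cloud customers otherwise: plain MDI.
    return IngestionService.MDI
```

For example, `recommend_service(on_prem=False, has_charts_or_table_like=True)` returns `IngestionService.MDI_IMAGE_CONTENT`.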
| Ingestion service | Structured PDFs | Unstructured PDFs | One-Column Layout | Multi-Column Layout | Extracts Tables | Detects Images | Extracts Image Content | On-prem deployment | Performance | Additional costs |
|---|---|---|---|---|---|---|---|---|---|---|
| Default | | | | | | | | | 10-15 s per page | None |
| Docling | | | | | 🟡 | | | | 10-20 s per page | Azure infra costs |
| MDI | | | | | | | | | 10-20 s per page | 1.6 cents per page |
| MDI with Image Content Extraction | | | | | | | | | 20-30 s per page | 3 cents per page Assumption: |
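To make the performance and cost figures easier to compare, the sketch below turns the per-page values from the table into a rough estimate for a batch of pages. It is illustrative only: the `SERVICES` mapping and the `estimate` function are hypothetical names, pages are assumed to be processed one after another, and Docling has no per-page price because the table lists infrastructure costs instead.

```python
from decimal import Decimal

# Per-page figures copied from the table:
# (min seconds per page, max seconds per page, cents per page).
# None means the table gives no per-page price.
SERVICES = {
    "Default": (10, 15, Decimal("0")),
    "Docling": (10, 20, None),
    "MDI": (10, 20, Decimal("1.6")),
    "MDI with Image Content Extraction": (20, 30, Decimal("3")),
}


def estimate(service: str, pages: int) -> dict:
    """Estimate processing time and additional cost for a number of pages.

    Assumes strictly sequential processing, which may not match a real deployment.
    """
    min_s, max_s, cents = SERVICES[service]
    return {
        "min_seconds": min_s * pages,
        "max_seconds": max_s * pages,
        "cost_cents": None if cents is None else cents * pages,
    }
```

Under these assumptions, `estimate("MDI", 1000)` gives a cost of 1,600 cents and between 10,000 and 20,000 seconds of processing, while the same batch through MDI with Image Content Extraction gives 3,000 cents and between 20,000 and 30,000 seconds.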