Ingestion Configuration: MS Document Intelligence GA Version
Analogous to https://unique-ch.atlassian.net/wiki/spaces/SD/pages/353140836, the following custom single page ingestion service enables the use of the GA version 2023-07-31 of Microsoft’s Document Intelligence layout service, formerly called Form Recognizer.
Key capabilities:
Leading document ingestion service
Extracts tabular data
Parses multiple column layouts
Enhances search results for complex documents
Can be deployed in Switzerland
Enable for Scope In Knowledge Base
To use this custom PDF page processing for a specific Scope or Content in the Knowledge Base, the ingestion config of the content must be adjusted. The following curl command is an example:
curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
--header 'Authorization: Bearer <yourToken>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "properties": {
    "ingestionConfig": {
      "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
      "customApiIdentifier": "Unique Text Extraction API"
    }
  },
  "applyToSubScopes": true
}'
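For scripted setups, the same request can be assembled in Python. The following is a minimal sketch using only the standard library; the function name `build_scope_ingestion_request` and its parameters are hypothetical, and the endpoint path and payload are copied from the curl example above. It builds the request without sending it.

```python
import json
import urllib.request


def build_scope_ingestion_request(
    base_url: str,
    scope_id: str,
    token: str,
    apply_to_sub_scopes: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) the POST request that enables the custom
    single page API for a scope. Hypothetical helper, not part of any SDK."""
    payload = {
        "properties": {
            "ingestionConfig": {
                "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
                "customApiIdentifier": "Unique Text Extraction API",
            }
        },
        "applyToSubScopes": apply_to_sub_scopes,
    }
    # Same URL shape as the curl example: gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties
    url = f"https://gateway.{base_url}/ingestion/v1/folder/{scope_id}/properties"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate makes the payload easy to inspect before anything is changed on the scope.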
Enable for Upload in Chat
To use the custom PDF page processing in a specific space when uploading a document to the chat, the ingestion config in the Advanced Settings of the space management must be changed as follows:
{
  ...
  "ingestionConfig": {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "customApiIdentifier": "Unique Text Extraction API"
  },
  ...
}
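If the space settings are edited programmatically rather than by hand, the two keys can be merged into the existing settings without touching anything else. This is a sketch under the assumption that the Advanced Settings are available as a plain JSON object; the helper name `with_custom_pdf_ingestion` is hypothetical.

```python
import copy

# The two keys shown in the snippet above.
CUSTOM_PDF_INGESTION_CONFIG = {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "customApiIdentifier": "Unique Text Extraction API",
}


def with_custom_pdf_ingestion(space_settings: dict) -> dict:
    """Return a copy of the settings with the custom PDF ingestion keys set.

    All other settings, including other ingestionConfig keys, are preserved,
    and the input dictionary is not modified.
    """
    updated = copy.deepcopy(space_settings)
    ingestion_config = updated.setdefault("ingestionConfig", {})
    ingestion_config.update(CUSTOM_PDF_INGESTION_CONFIG)
    return updated
```

Working on a deep copy keeps the original settings available for comparison or rollback.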
Limitations and Considerations
The MS Document Intelligence Service costs approx. 1 cent per page and has limited throughput. Unique may charge these costs separately, as they are not covered by the Ada Tokens.
The MS Document Intelligence Service can be deployed in Switzerland.
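To get a feel for the per-page cost, a rough estimate can be computed from the page count. The sketch below takes the "approx. 1 cent per page" figure literally; the currency is not stated in this document, and actual billing may differ, so the result is an order-of-magnitude estimate only. The function name is hypothetical.

```python
from decimal import Decimal

# Assumption: "approx. 1 cent per page" taken literally (0.01 of an unspecified currency unit).
APPROX_COST_PER_PAGE = Decimal("0.01")


def estimate_ingestion_cost(page_count: int) -> Decimal:
    """Rough cost estimate for processing page_count pages."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return APPROX_COST_PER_PAGE * page_count
```

For example, `estimate_ingestion_cost(2500)` returns `Decimal("25.00")`, i.e. roughly 25 currency units for 2,500 pages.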
Activation
Before MS Document Intelligence (MDI) can be used, the service must be deployed within a tenant. Depending on your deployment model, one of the following processes applies.
| | PaaS | Single Tenant | Customer Managed | On Premise |
|---|---|---|---|---|
| Config options | only via API for a scope | via API for a scope or via environment variable via Customer Success | Customer must manage it themselves | MDI is not available |
| Request | already deployed | via Customer Success considering the impact described above | Customer must deploy the service by themselves | |
Authentication Methods
MS Document Intelligence can run in two modes:
Key-based authentication: the key is read from the environment variables (see code); used in dev
Workload Identity: used in production
Unique uses only Workload Identity on all its deployment models.
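The choice between the two modes can be sketched as follows. This is not Unique's implementation: the environment variable names `DOC_INTELLIGENCE_KEY` and `DOC_INTELLIGENCE_ENDPOINT` are hypothetical, and the use of the `azure-ai-formrecognizer` and `azure-identity` packages is an assumption. `DefaultAzureCredential` is used for the keyless path because workload identity is part of its credential chain.

```python
import os


def select_auth_mode(env: dict) -> str:
    """Return 'key' if an API key is configured, otherwise 'workload_identity'.

    DOC_INTELLIGENCE_KEY is a hypothetical variable name.
    """
    return "key" if env.get("DOC_INTELLIGENCE_KEY") else "workload_identity"


def build_client(env=None):
    """Create a Document Intelligence client for the selected auth mode."""
    env = os.environ if env is None else env
    # Imported lazily so select_auth_mode works without the Azure SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential
    from azure.identity import DefaultAzureCredential

    endpoint = env["DOC_INTELLIGENCE_ENDPOINT"]  # hypothetical variable name
    if select_auth_mode(env) == "key":
        # Key-based authentication, e.g. for development.
        credential = AzureKeyCredential(env["DOC_INTELLIGENCE_KEY"])
    else:
        # Keyless path; picks up workload identity when the pod is configured for it.
        credential = DefaultAzureCredential()
    return DocumentAnalysisClient(endpoint, credential)
```

Keeping the mode selection in its own function makes it easy to verify that a production environment without a key falls through to Workload Identity.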
Author: @Martin Fadler
© 2024 Unique AG. All rights reserved.