Ingestion Configuration: MS Document Intelligence (Layout) Ingestion
The default pipeline currently in place may not adequately process certain PDF and Word documents, particularly when encountering improperly formatted data (e.g., tables in financial documents, images with text).
Microsoft Document Intelligence (referred to as MDI below) can improve Unique's ability to accurately ingest documents that contain complex tables and graphics.
The MDI Service is currently only available in Preview in the Netherlands (westeurope) and not in Switzerland!
By choosing to use the MDI Service, a customer confirms that they are fully aware of the consequences and exclusions of using a Microsoft service that is in Preview and hosted outside of Switzerland.
Ingestion Modes
PDFs on Unique are ingested page by page.
Two PDF ingestion modes are implemented:
PDFTODOCX_ONLY: Use our default library; PDFs are converted using pdf2docx (default)
DOC_INTELLIGENCE_DEFAULT: Use MDI on all pages of the document
Word
The default process directly extracts the content of a Word file, including text and tables with their underlying formatting. However, it does not extract content from images (e.g., if a table is embedded as an image in the Word file). There is an option to use the MDI service for Word files, which can also extract text from images. This process first converts the Word file to a PDF to utilize the full capabilities of the MDI service:
WORD_DEFAULT_INGESTION: Use the default Word ingestion mechanism (without MDI)
DOC_INTELLIGENCE_DEFAULT: Use MDI on the Word document
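The mode names listed above can be checked before a configuration is submitted. The following is a minimal Python sketch, not Unique's implementation: the helper `validate_ingestion_config` and the `ALLOWED_MODES` table are hypothetical names introduced for illustration. Only the keys `pdfReadMode` and `wordReadMode` and the mode values themselves are taken from this page.

```python
# Allowed values per ingestionConfig key, as listed on this page.
# ALLOWED_MODES and validate_ingestion_config are hypothetical helper names.
ALLOWED_MODES = {
    "pdfReadMode": {"PDFTODOCX_ONLY", "DOC_INTELLIGENCE_DEFAULT"},
    "wordReadMode": {"WORD_DEFAULT_INGESTION", "DOC_INTELLIGENCE_DEFAULT"},
}


def validate_ingestion_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means both read modes look valid.

    Keys other than pdfReadMode and wordReadMode are ignored, because this
    page does not document them.
    """
    problems = []
    for key, allowed in ALLOWED_MODES.items():
        if key in config and config[key] not in allowed:
            problems.append(f"invalid value for {key}: {config[key]}")
    return problems
```

Calling the helper with `{"pdfReadMode": "WORD_DEFAULT_INGESTION"}` reports a problem, because that value is only listed for Word files.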
Enable MDI
MS Document Intelligence can be switched on per scope/folder, including hierarchical scopes with inheritance.
It is enabled on the ingestion service per scopeId. Replace the following placeholders in the request below:
<scopeId>
<baseUrl> (e.g. *.unique.app)
<yourToken>
Check here how to get a token: Managing scopes & access via API
curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
  --header 'Authorization: Bearer <yourToken>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "properties": {
      "ingestionConfig": {
        "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
        "wordReadMode": "DOC_INTELLIGENCE_DEFAULT"
      }
    },
    "applyToSubScopes": true
  }'
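For scripted rollouts, the same request can be assembled in Python. This is a minimal sketch using only the standard library. The function name `build_enable_mdi_request` and its parameters are hypothetical, while the URL pattern, headers, and payload mirror the curl command above. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_enable_mdi_request(base_url: str, scope_id: str, token: str,
                             apply_to_sub_scopes: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request that enables MDI for one scope."""
    payload = {
        "properties": {
            "ingestionConfig": {
                "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
                "wordReadMode": "DOC_INTELLIGENCE_DEFAULT",
            }
        },
        "applyToSubScopes": apply_to_sub_scopes,
    }
    return urllib.request.Request(
        url=f"https://gateway.{base_url}/ingestion/v1/folder/{scope_id}/properties",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes it possible to inspect the URL and body first.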
Enable for Upload in Chat
To use MDI processing for documents uploaded to the chat in a specific space, change the ingestion config under Advanced Settings in the space management as follows:
{
  ...
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "DOC_INTELLIGENCE_DEFAULT"
  },
  ...
}
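Because the snippet above sits between other settings (the elided `...` parts), it helps to change only the two read modes and leave everything else alone. The following Python sketch shows one way to do that on an exported settings object. The function name `with_mdi_enabled` is hypothetical, and handling the settings as a plain dictionary is an assumption rather than a documented workflow.

```python
import copy


def with_mdi_enabled(space_settings: dict) -> dict:
    """Return a copy of the settings with both read modes set to MDI.

    All other keys, including other entries inside ingestionConfig,
    are left untouched. The input dictionary is not modified.
    """
    updated = copy.deepcopy(space_settings)
    ingestion_config = updated.setdefault("ingestionConfig", {})
    ingestion_config["pdfReadMode"] = "DOC_INTELLIGENCE_DEFAULT"
    ingestion_config["wordReadMode"] = "DOC_INTELLIGENCE_DEFAULT"
    return updated
```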
Limitations and Considerations
The MS Document Intelligence Service costs approx. 1.6 cents per page and has limited throughput. Unique may charge these costs additionally, as they are not covered by the Ada Tokens.
The service is still in Preview and only available in West Europe. Consider the following points for services that are only available in Preview:
In the DPA of Microsoft (Nov 2023 version) it is stated that in Preview mode they may employ lesser or different privacy and security measures than those typically present in the Products and Services.
Even though we have activated the opt-out of processing data for all versions and subscriptions on your tenant (incl. Preview), Microsoft still reserves the right for preview services to store and access output and prompts for harmful content despite this opt-out.
Some other limitations are that preview services are not covered by the SLA and do not offer European Data Boundary Service, but these are not as critical.
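The approximate per-page price quoted in this section can be turned into a rough budget figure before MDI is enabled on a large scope. The sketch below is only an illustration: the function name `estimate_mdi_cost_cents` is hypothetical, the 1.6 cents figure is the approximate value from this page (its currency is not stated here), and actual billing may differ.

```python
from decimal import Decimal

# Approximate per-page price from this page; the currency is not specified there.
COST_PER_PAGE_CENTS = Decimal("1.6")


def estimate_mdi_cost_cents(page_count: int) -> Decimal:
    """Estimate the MDI processing cost in cents for a number of pages."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return COST_PER_PAGE_CENTS * page_count
```

Using `Decimal` instead of floating-point numbers avoids rounding noise when the estimate is summed across many documents.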
Technical
Activation
Before MDI can be used, the service must be deployed within a tenant. Depending on your deployment model, one of the following processes applies.
| | PaaS | Single Tenant | Customer Managed | On Premise |
|---|---|---|---|---|
| Config options | only via API for a scope | via API for a scope or via environment variable via Customer Success | Customer must manage it themselves | MDI is not available |
| Request | already deployed | via Customer Success considering the impact described above | Customer must deploy the service by themselves | |
Authentication Methods
MS Document Intelligence supports two authentication modes:
Key-based authentication (the key is taken from the environment variables (see code); used in dev)
Workload Identity (used in production)
Unique uses only Workload Identity on all its deployment models.
Author: @Adrian Gugger
© 2024 Unique AG. All rights reserved.