File Ingestion

Overview

The ingestion service powers Unique's knowledge base by converting uploaded files into vector embeddings for use by the platform's LLMs. It supports various file types, including PDF, Word, Excel, and others with flexible ingestion methods to meet diverse needs. Administrators can use different ingestion methods and customize ingestion settings and chunking strategies for each knowledge base folder, optimizing processing for specific document sets.

Who is it for

This documentation focuses specifically on administrators who need to manage and configure advanced ingestion capabilities for their organizations. Administrators who require granular control over document processing, cost management, and optimization of search results will benefit most from understanding the comprehensive ingestion options and customization features available within the platform.

Permission Requirements: Users must have the Can manage permission for a folder to access its ingestion configuration. Only users with the role knowledge.write can be granted the Can manage permission. See more details here: https://unique-ch.atlassian.net/wiki/spaces/PUBDOC/pages/1414135918

Benefits and Use Cases

The ingestion service processes diverse company documents through a unified platform, offering flexible customization to optimize results. Folder-level settings allow organizations to tailor processing methods to specific document types, improving search accuracy and relevance.

By combining the default service, specialized MDI tools, and emerging vision-enhanced features, the platform delivers scalable, high-quality solutions that adapt to evolving document management needs.

The Unique platform supports multiple services for ingesting PDF documents:

  • Unique Ingestion

  • Docling

  • Microsoft Document Intelligence (MDI in the following)

  • MDI with Image Content Extraction

Please find the details about the supported file types here: https://unique-ch.atlassian.net/wiki/spaces/PUBDOC/pages/1405452306

Ingestion Overview, Default and Recommendations

Each service can parse structured PDFs with a single-column layout and extract simple tables. However, their capabilities vary when handling more complex documents:

  • Image-based PDFs: Scanned or printed PDFs lack structured content, requiring OCR techniques for extraction.

  • Multi-Column Layout: PDFs with multiple columns, charts, tables, and text need pre-trained layout detection models to identify page elements and preserve logical content flow.

  • Complex Tables Detection: Extracting tables with merged cells, missing borders, or checkmarks requires specialized AI models to recognize different table components.

  • Image Content Extraction: Many PDFs contain unstructured visual elements like charts, logos, or photos. AI models with image-to-text capabilities are needed to extract this content in a searchable form.

  • On-Prem Deployment: The service can operate in a closed environment without internet access.

The services differ in their support for the capabilities above (Image-based PDFs, Multi-Column Layouts, Complex Tables Detection, Image Content Extraction, On-Prem Deployment), as well as in regional availability and performance:

| Ingestion service | Available Regions | Performance |
| --- | --- | --- |
| Base Unique Ingestion | All regions | 10-15s per page |
| Docling | All regions | 10-20s per page |
| MDI | Check the region availability of Azure AI Document Intelligence | 10-20s per page |
| MDI with Image Content Extraction (EXPERIMENTAL) | Check the region availability of Azure AI Document Intelligence and the vision model | 20-30s per page |

 

  • On-Prem Customers: Use Docling for PDF ingestion, as the Default Unique Ingestion Service lacks efficient support for multi-column layouts.

  • Cloud Customers: Use MDI as the default ingestion service, as it provides higher accuracy than Docling, particularly for tables without grid lines.

Please also consider price differences between the ingestion methods (e.g., $0.024 per page for MDI).
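As a back-of-the-envelope check, the per-page price above can be used to estimate ingestion costs for a folder before enabling MDI. The price constant and document counts below are illustrative assumptions; verify current pricing with Unique or Microsoft before budgeting:

```python
# Rough MDI ingestion-cost estimate (illustrative; verify current pricing).
MDI_PRICE_PER_PAGE_USD = 0.024  # price quoted above

def mdi_cost_usd(total_pages: int) -> float:
    """Return the estimated MDI ingestion cost in USD for a number of PDF pages."""
    return round(total_pages * MDI_PRICE_PER_PAGE_USD, 2)

# Example: a knowledge-base folder with 250 documents of ~40 pages each.
print(mdi_cost_usd(250 * 40))  # 10,000 pages -> 240.0
```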

Step-by-Step Guide

Step 1: Open the Ingestion Configuration

Navigate to the folder in the knowledge base that you want to review or set up for ingestion. Then, click the ‘Configure Files Ingestion’ button located on the right.

This button will only be visible if you have the Can manage permission for the folder.

image-20250731-215549.png

 

Step 2: Change the Ingestion Configuration

 

image-20250731-221506.png

PDF

PDFs on Unique are ingested page by page.

Two ingestion modes are currently implemented:

  • PDFTODOCX_ONLY: PDFs are converted using the pdf2docx library (default)

  • DOC_INTELLIGENCE_DEFAULT: MDI is used on all pages of the document

Word

The default process directly extracts the content of a Word file, including text and tables with their underlying formatting. However, it does not extract content from images (e.g., if a table is embedded as an image in the Word file). There is an option to use the MDI service for Word files, which can also extract text from images. This process first converts the Word file to a PDF to utilize the full capabilities of the MDI service:

  • WORD_DEFAULT_INGESTION: Use the default Word ingestion mechanism (without MDI)

  • INGEST_WORD_AS_PDF: Convert the Word document to PDF and run the PDF ingestion service on the resulting PDF

 

Enable MDI

The default pipeline currently in place may not adequately process certain PDF and Word documents, particularly when encountering improperly formatted data (e.g., tables in financial documents, images with text).

Microsoft Document Intelligence (MDI) can enhance Unique's capability to accurately ingest documents that contain complex tables and graphics. The latest GA version 2024-11-30 of Microsoft's Document Intelligence is used to ingest documents.

Microsoft Document Intelligence can be activated on a per-scope or per-folder basis, including hierarchical scopes with inheritance. For single-tenant setups, ensure that the service is fully provisioned before enabling it.

MDI is enabled on the ingestion service per scopeId via an API call. Replace the placeholders:

  • <scopeId>

  • <baseUrl> (e.g. *.unique.app)

  • <yourToken>

Multitenant Region URLs

Use the correct tenant based on your deployment region:

Gateway & Chat APIs

  • 🇪🇺 Europe: <https://gateway.unique.app>

  • 🇺🇸 US: <https://gateway.us.unique.app>

Identity (OAuth / Login)

  • 🇪🇺 Europe: <https://id.unique.app>

  • 🇺🇸 US: <https://id.us.unique.app>

 

curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
  --header 'Authorization: Bearer <yourToken>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "properties": {
      "ingestionConfig": {
        "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
        "wordReadMode": "INGEST_WORD_AS_PDF"
      }
    },
    "applyToSubScopes": true
  }'

The MDI can also be turned on via an environment variable as a default (service: ingestion-worker):

  • PDF_READ_MODE=DOC_INTELLIGENCE_DEFAULT

Clients can also request Unique to enable this on their single tenant but must be aware of the considerations below.

Environment variable switching is not available on PaaS.

Enable MDI for Upload in Chat

To use the MDI processing in a specific space when uploading a document to the chat, the ingestion config in the Advanced Settings in the space management must be changed as follows:

{
  ...
  "ingestionConfig": {
    "pdfReadMode": "DOC_INTELLIGENCE_DEFAULT",
    "wordReadMode": "INGEST_WORD_AS_PDF"
  },
  ...
}

Enable Agentic Document Ingestion

This service is deprecated and no longer supported. The configuration should be removed as early as possible.

Analogous to https://unique-ch.atlassian.net/wiki/spaces/PUBDOC/pages/edit-v2/1395982370#Enable-MDI, the following custom single page ingestion service combines the latest GA version 2024-11-30 of Microsoft’s Document Intelligence layout service with the AZURE_GPT_4o_2024_0806 vision model to extract content from images and further optimize page content.

Key capabilities:

  • Leading document ingestion service

  • Extracts tabular data

  • Parses multiple column layouts

  • Enhances search results for complex documents

  • Can be deployed in Switzerland

  • Detects and extracts content from figures:

    • Charts and table-like images are transformed into a table and a searchable description is added

    • Logos are translated to the brand name / text

    • For other images a searchable description is added

  • Further optimizes extracted page content (optional)

The service can run the extraction with three methods:

  • MDI: Uses MDI to extract page content and optionally performs an optimization with the Vision model.

  • MDI + Vision: Uses MDI to extract page content and a Vision model to extract the content from each detected image in parallel.

  • Vision: Uses only the Vision model to extract page content.

Each extraction method can apply an additional Page Content Optimizer step that evaluates the extracted page content and further improves it using a Vision model.
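Conceptually, the optimizer behaves like the loop below. This is an illustrative sketch, not the service's actual implementation: the `evaluate` and `regenerate` callables stand in for the evaluator and generator Vision-model prompts, and `max_loops`/`score_threshold` mirror the `maxLoops` and `scoreThreshold` settings in `pageContentOptimizerConfig`:

```python
# Sketch of the Page Content Optimizer loop: an evaluator scores the extracted
# content and a generator revises it, until the score reaches the threshold or
# the loop budget is exhausted. Illustrative only, not the actual service code.
from typing import Callable, Tuple

def optimize_page_content(
    content: str,
    evaluate: Callable[[str], Tuple[float, str]],   # returns (score, feedback)
    regenerate: Callable[[str, str], str],          # (content, feedback) -> improved
    max_loops: int = 2,            # mirrors "maxLoops"
    score_threshold: float = 0.95, # mirrors "scoreThreshold"
) -> str:
    for _ in range(max_loops):
        score, feedback = evaluate(content)
        if score >= score_threshold:
            break
        content = regenerate(content, feedback)
    return content

# Stub example: the evaluator is satisfied once the content ends with "fixed".
result = optimize_page_content(
    "draft",
    evaluate=lambda c: (1.0 if c.endswith("fixed") else 0.5, "append 'fixed'"),
    regenerate=lambda c, fb: c + " fixed",
)
print(result)  # draft fixed
```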

image-20250228-155417.png
Agentic Document Ingestion Overview

To use this custom PDF page processing for a specific Scope or Content in the Knowledge Base, the ingestion config of the content needs to be adjusted.

Enable via Ingestion Config UI

  1. Click “Configure File Ingestion” for the scope of interest

  2. Select “Custom Single Page API” for PDF ingestion

  3. Enter “Unique Text and Image Extraction API” in API Identifier

  4. Enter an API Payload if you intend to change the default configuration (see below)

image-20250320-135136.png
image-20250321-151939.png

Via API Call

The ingestionBaseUrl is different depending on where your Unique instance is hosted.

  • Multitenant (Europe): gateway.unique.app/ingestion-gen2

  • Multitenant (US): gateway.us.unique.app/ingestion

  • Single Tenant: <backendBaseUrl>/ingestion - the backendBaseUrl part depends on the tenant configuration (contact Unique if unknown)

  • Customer Managed Tenant: <backendBaseUrl>/ingestion - the backendBaseUrl part depends on your tenant configuration

curl --location --request POST 'https://<ingestionBaseUrl>/v1/folder/<scopeId>/properties' \
  --header 'Authorization: Bearer <yourToken>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "properties": {
      "ingestionConfig": {
        "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
        "customApiOptions": [{
          "customisationType": "CUSTOM_SINGLE_PAGE_API",
          "apiIdentifier": "Unique Text and Image Extraction API",
          "apiPayload": "{}"
        }]
      }
    },
    "applyToSubScopes": true
  }'

By default, the MDI_VISION extraction method is used; see below for details on how to change and further configure the extraction method.

Enable for Upload in Chat

To use the custom PDF page processing in a specific space when uploading a document to the chat, the ingestion config in the Advanced Settings in the space management must be changed as follows:

{
  ...
  "ingestionConfig": {
    "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
    "customApiOptions": [{
      "customisationType": "CUSTOM_SINGLE_PAGE_API",
      "apiIdentifier": "Unique Text and Image Extraction API",
      "apiPayload": "{}"
    }]
  },
  ...
}

By default, the MDI_VISION extraction method is used; see below for details on how to change and further configure the extraction method.

Changing the extraction method with the apiPayload

apiPayloads must be provided as JSON-compatible strings; the JSON objects below must therefore be converted to strings.

Through the optional apiPayload string parameter, the different extraction methods can be configured. By default, the MDI_VISION extraction method is used. To change the extractionMethod, set the payload to one of the following values:

  • "{ \"extractionMethod\": \"MDI\"}"

  • "{ \"extractionMethod\": \"MDI_VISION\"}"

  • "{ \"extractionMethod\": \"VISION\"}"

The page content optimization step is disabled by default. To enable it, adapt the apiPayload as follows:

  • "{\"pageContentOptimizerConfig\": { \"apply\": true }, \"extractionMethod\": \"MDI_VISION\"}"
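Because the apiPayload is itself a JSON string embedded inside the outer JSON body, escaping it by hand is error-prone. The sketch below shows one way to build the request body programmatically, assuming Python with the standard json module; the field names are taken from the API call above:

```python
import json

# The apiPayload must be a JSON-encoded *string* inside the outer JSON body.
# Serializing the inner object separately produces the required escaping.
api_payload = json.dumps({
    "extractionMethod": "MDI_VISION",
    "pageContentOptimizerConfig": {"apply": True},
})

properties_body = {
    "properties": {
        "ingestionConfig": {
            "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
            "customApiOptions": [{
                "customisationType": "CUSTOM_SINGLE_PAGE_API",
                "apiIdentifier": "Unique Text and Image Extraction API",
                "apiPayload": api_payload,  # a string, not a nested object
            }],
        }
    },
    "applyToSubScopes": True,
}

print(api_payload)
# {"extractionMethod": "MDI_VISION", "pageContentOptimizerConfig": {"apply": true}}
```

The full `properties_body` can then be sent as the request body of the POST call shown above.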

Complete Custom Configurations

Each extraction method has further configuration options, see below. Make sure to provide the JSON object as a string for the apiPayload:

{ "extractionMethod": "MDI", "languageModel": "AZURE_GPT_4o_2024_0806", "pageContentExtractorMdiConfig": { "useHighResolution": true }, "pageContentOptimizerConfig": { "apply": false, "maxLoops": 2, "scoreThreshold": 0.95, "evaluatorSystemPrompt": "\nYou are a helpful assistant that evaluates the quality of extracted content based on\na document image and the extracted content.\n", "evaluatorUserPrompt": "\nPlease evaluate the quality of the extracted information using the document image.\n\nExtracted information: ${current_response}\n\nYour tasks: \n1. Give instructions on how to improve the extracted information. Be as specific as possible.\n2. Assess whether the extracted information meets the following evaluation criteria:\n - Information has been completely extracted from the image\n - Information is structured logically and coherently as in the image\n - Information is accurate as represented in the image\n - Numerical values are correct and have a unit of measurement (e.g., 30% CAGR instead of 30%)\n - Charts have been converted into tables when numerical values have been extracted\n - No numerical values have been approximated or rounded or interpolated\n - No values have been added that are not represented in the image\n - Color coded values have been converted into text\n - Information from legends have been correctly assigned to the corresponding values\n3. Give a score between 0 and 1 for the quality of the extracted information (0 is bad, 1 is perfect).\n\nExample output:\n{\n \"improvement_instructions\": \"Here your specific instructions on how to improve the extracted information. 
Only outline the changes to be made, do not include any other text.\",\n \"meets_criteria\": false, # Assessment of the criteria listed above, only return true if all relevant criteria are met\n \"score\": 0.5 # Here the score between 0 and 1\n}\n", "generatorSystemPrompt": "\nYou are a helpful assistant that improves content extracted from an image based on feedback\nand the original image.\n", "generatorUserPrompt": "\nOriginal extracted content: ${current_response}\nFeedback for improving the extracted content: ${feedback}\n\nAddress all the feedback and improve the extracted content.\nAlso explain how you addressed the feedback.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you addressed the feedback\",\n \"improved_content\": \"Here the improved extracted content\"\n}\n" } }
{ "extractionMethod": "MDI_VISION", "languageModel": "AZURE_GPT_4o_2024_0806", "pageContentExtractorMdiConfig": { "useHighResolution": true }, "imageContentExtractorConfig": { "imagesInParallel": 3, "classifierSystemPrompt": "You are an image classifier assistant and help to classify the contents of a cropped image of a document page.\n\nYou are given the whole document page as a reference and the cropped image that you should classify.\n", "classifierUserPrompt": "First locate the cropped image within the document page. Only then classify the cropped image into one of the following categories: \n- chart_with_numerical_values: A chart in which numerical text values are present (do not consider the axis values) and can be extracted with high accuracy.\n- chart_without_numerical_values: A chart in which numerical text values are not present (do not consider the axis values) and cannot be extracted with high accuracy.\n- table_structure: A structure that displays data in a tabular format with headers, rows, columns and cells.\n- mixed_content: A combination of different content types, e.g., charts and tables and logos.\n- logo: A logo of a company or brand.\n- icon: A single icon that is a symbol for a tool, product, service, etc, e.g., a tool icon.\n- illustrative_picture: An illustrative picture that only serves to illustrate the text and does not contain any useful, related information.\n\nIn addition to the category, explain your reasoning why you chose the category.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you classify the image. Keep it short but complete.\",\n \"category\": \"Here the category\"\n}\n", "documentReferencePrompt": "Here is the whole document page as a reference:\n", "extractorCategoryToSystemPrompts": { "chart_with_numerical_values": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. 
The cropped image contains a chart with numerical values.\n", "chart_without_numerical_values": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains a chart without numerical values.\n", "default": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page.\n", "logo": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains a logo.\n", "mixed_content": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains mixed content, e.g., diagram and table.\n", "table_structure": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains a table like structure.\n" }, "extractorCategoryToUserPrompts": { "chart_with_numerical_values": "Extract the chart data and structure from the image as a html table and explain your reasoning.\n\nFollow these steps:\n1. Clearly separate what belongs to the chart and what does not using the document image as a reference.\n2. Only consider what belongs to the chart and exclude any information that does not belong to it.\n3. Extract a maximum of ten question and answer pairs about the charts content.\n4. Then combine the found answers to a description. Do not include the questions in the description. Only describe what the chart is about or describes, not the technical elements of the chart, e.g., \"the chart has a x-axis and a y-axis\".\n5. Represent the text and numerical values of the chart as a table. Do not approximate values or make assumptions.\n6. 
When color coded values are present in the chart, represent them in the table as text values.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you create the html table and the description. Keep it short but complete.\",\n \"image_content\": \"Here the html table and the description of the chart\"\n}\n", "chart_without_numerical_values": "Describe the chart in a meaningful way and describe your reasoning.\n\nFollow these steps:\n1. Clearly separate what belongs to the chart and what does not using the document image as a reference.\n2. Only consider what belongs to the chart and exclude any information that does not belong to it.\n3. Extract a maximum of ten question and answer pairs about the charts content. Do not approximate values or make assumptions.\n4. Then combine the found answers to a description. Do not include the questions in the description. Only describe what the chart is about or describes, not the technical elements of the chart, e.g., \"the chart has a x-axis and a y-axis\".\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you create the description. Keep it short but complete.\",\n \"image_content\": \"Here the description\"\n}\n", "default": "Extract a maximum of ten text question and answer pairs from the image. Then combine the found answers to a description. Do not include the questions in the description. \n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the content. Keep it short but complete.\",\n \"image_content\": \"Here the description\"\n}\n", "logo": "Output the company or brand name from the image. Output only the name and nothing else. If the company or company name is unknown to you, then output only the text if possible otherwise nothing.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the logo. 
Keep it short but complete.\",\n \"image_content\": \"Here the company or brand name\"\n}\n", "mixed_content": "First identify all the different elements, e.g., charts or diagrams. Then extract all content for each element in a structured way. Use an html as structure where possible or use markdown. Make sure to preserve the information structure and the original text. Explain your reasoning for extracting the content.\n\nFollow these steps:\n1. Identify all elements in the image, e.g., charts, tables, logos, etc.\n2. Analyze which information belongs together and must be clustered.\n3. Then extract all content for each cluster in a structured way. \n4. Ouput the image content in html where possible, otherwise use markdown.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the content. Keep it short but complete.\",\n \"image_content\": \"Here the extracted content\"\n}\n", "table_structure": "Extract the table like structure from the image as a html table and explain your reasoning.\n\nFollow these steps:\n1. Clearly separate what belongs to the table and what does not using the document image as a reference.\n2. Carefully think about the structure of the table. \n3. Extract the headers (columns/rows) first.\n4. Then assign the cells to the headers and make sure to merge cells whenever they span multiple columns/rows.\n5. Correctly extract the values in the cells and align them with the extracted structure.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you create the html table. 
Keep it short but complete.\",\n \"image_content\": \"Here the html table\"\n}\n" }, "noExtractionForCategories": [ "illustrative_picture", "icon" ] }, "pageContentOptimizerConfig": { "apply": false, "maxLoops": 2, "scoreThreshold": 0.95, "evaluatorSystemPrompt": "\nYou are a helpful assistant that evaluates the quality of extracted content based on\na document image and the extracted content.\n", "evaluatorUserPrompt": "\nPlease evaluate the quality of the extracted information using the document image.\n\nExtracted information: ${current_response}\n\nYour tasks: \n1. Give instructions on how to improve the extracted information. Be as specific as possible.\n2. Assess whether the extracted information meets the following evaluation criteria:\n - Information has been completely extracted from the image\n - Information is structured logically and coherently as in the image\n - Information is accurate as represented in the image\n - Numerical values are correct and have a unit of measurement (e.g., 30% CAGR instead of 30%)\n - Charts have been converted into tables when numerical values have been extracted\n - No numerical values have been approximated or rounded or interpolated\n - No values have been added that are not represented in the image\n - Color coded values have been converted into text\n - Information from legends have been correctly assigned to the corresponding values\n3. Give a score between 0 and 1 for the quality of the extracted information (0 is bad, 1 is perfect).\n\nExample output:\n{\n \"improvement_instructions\": \"Here your specific instructions on how to improve the extracted information. 
Only outline the changes to be made, do not include any other text.\",\n \"meets_criteria\": false, # Assessment of the criteria listed above, only return true if all relevant criteria are met\n \"score\": 0.5 # Here the score between 0 and 1\n}\n", "generatorSystemPrompt": "\nYou are a helpful assistant that improves content extracted from an image based on feedback\nand the original image.\n", "generatorUserPrompt": "\nOriginal extracted content: ${current_response}\nFeedback for improving the extracted content: ${feedback}\n\nAddress all the feedback and improve the extracted content.\nAlso explain how you addressed the feedback.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you addressed the feedback\",\n \"improved_content\": \"Here the improved extracted content\"\n}\n" } }
{ "extractionMethod": "VISION", "languageModel": "AZURE_GPT_4o_2024_0806", "pageContentExtractorVisionConfig": { "systemPrompt": "You are a helpful assistant that extracts content from an image.", "userPrompt": "First identify all the different elements, e.g., charts or diagrams. Then extract all content for each element in a structured way. Use an html as structure where possible or use markdown. Make sure to preserve the information structure and the original text. Explain your reasoning for extracting the content.\n\nFollow these steps:\n1. Identify all elements in the image, e.g., charts, tables, logos, etc.\n2. Analyze which information belongs together and must be clustered.\n3. Then extract all content for each cluster in a structured way. \n4. Convert charts into tables when numerical values are present.\n5. Convert color coded values into text.\n6. Extract information from legends and assign it to the corresponding values.\n7. Ouput the image content in html where possible, otherwise use markdown.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the content. Keep it short but complete.\",\n \"image_content\": \"Here the extracted content\"\n}\n" }, "pageContentOptimizerConfig": { "apply": false, "maxLoops": 2, "scoreThreshold": 0.95, "evaluatorSystemPrompt": "You are a helpful assistant that evaluates the quality of extracted content based on\na document image and the extracted content.\n", "evaluatorUserPrompt": "Please evaluate the quality of the extracted information using the document image.\n\nExtracted information: ${current_response}\n\nYour tasks: \n1. Give instructions on how to improve the extracted information. Be as specific as possible.\n2. 
Assess whether the extracted information meets the following evaluation criteria:\n - Information has been completely extracted from the image\n - Information is structured logically and coherently as in the image\n - Information is accurate as represented in the image\n - Numerical values are correct and have a unit of measurement (e.g., 30% CAGR instead of 30%)\n - Charts have been converted into tables when numerical values have been extracted\n - No numerical values have been approximated or rounded or interpolated\n - No values have been added that are not represented in the image\n - Color coded values have been converted into text\n - Information from legends have been correctly assigned to the corresponding values\n3. Give a score between 0 and 1 for the quality of the extracted information (0 is bad, 1 is perfect).\n\nExample output:\n{\n \"improvement_instructions\": \"Here your specific instructions on how to improve the extracted information. Only outline the changes to be made, do not include any other text.\",\n \"meets_criteria\": false, # Assessment of the criteria listed above, only return true if all relevant criteria are met\n \"score\": 0.5 # Here the score between 0 and 1\n}\n", "generatorSystemPrompt": "You are a helpful assistant that improves content extracted from an image based on feedback\nand the original image.\n", "generatorUserPrompt": "Original extracted content: ${current_response}\nFeedback for improving the extracted content: ${feedback}\n\nAddress all the feedback and improve the extracted content.\nAlso explain how you addressed the feedback.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you addressed the feedback\",\n \"improved_content\": \"Here the improved extracted content\"\n}\n" } }

Tips & Tricks

Improving Document Ingestion Quality

Converting a document from PDF to a format that can be read by large language models (LLMs), such as Markdown, is a critical step for ensuring the high quality of outputs from Unique AI. Maintaining the integrity of text structures, especially those within tables and complex structures, is a particularly challenging task that is currently the focus of extensive research.

Layout

  1. Structured Data: Text should have clear headings, subheadings, bullet points or numbers. Clear and distinct paragraphs focused on a single topic with consistent spacing between lines and paragraphs.

  2. Footnotes: The link between a footnote and its reference in the paragraph can get lost. Include the information directly in the paragraph if possible.

  3. Links: Avoid references to other sections (e.g., 'as discussed in section 3'), as this connection might not be preserved.

  4. Multi Columns: Avoid multi-column layouts if possible.

  5. Remove Noise: Footnotes, page numbers, headers, and footers should be eliminated if they do not contribute to the content’s meaning.

Text content

  1. Language and Style: The document should be written in clear, concise sentences and provide enough context for understanding.

  2. Metadata: The document should include a descriptive title.

  3. Contextual Clues: Context should be provided where necessary, especially when introducing acronyms, technical terms, or jargon.

Tables

  1. Headings: Tables should be simply formatted with clear headings for rows and columns for easy parsing by the model.

  2. Borders: Tables should have full borders.

  3. Multipage tables: Avoid multi-page tables. If a table spans multiple pages, the headings have to be repeated on each page, and the content of a single cell must stay within one page (no breaking across pages).

Ingestion Error Codes

Please always check the ingestion state in the Knowledge Center to see whether the file ingestion was successful.

| Display Name | Error State | Description |
| --- | --- | --- |
| Ingestion failed (General Error) | FAILED | Generic failure state when the ingestion process encounters an unspecified error |
| Ingestion failed (Images are not supported) | FAILED_IMAGE | File ingestion failed because the uploaded file is an image format, which is not supported for text processing |
| Ingestion failed (While creating chunks) | FAILED_CREATING_CHUNKS | Error occurred during the chunking phase, where the document content is split into smaller segments |
| Ingestion failed (While creating embedding) | FAILED_EMBEDDING | Failure during the embedding generation process, where text is converted to vector representations |
| Ingestion failed (While fetching the file) | FAILED_GETTING_FILE | Error retrieving the file from its source location during the ingestion process |
| Ingestion failed (While parsing the text from the document) | FAILED_PARSING | Document parsing failed; unable to extract readable text content from the file |
| Ingestion failed (General error or time limit exceeded) | FAILED_REDELIVERED | Ingestion failed after retry attempts, either due to persistent errors or a timeout |
| Ingestion failed (Could not parse a lot of text. Document might have no meaning) | FAILED_TOO_LESS_CONTENT | Document contains insufficient meaningful text content for successful ingestion |
| Rejected by malware scanner | FAILED_MALWARE_FOUND | File was rejected during security scanning due to detected malware or suspicious content |
| Metadata validation failed | FAILED_METADATA_VALIDATION | File metadata does not meet validation requirements or contains invalid data, e.g., it did not contain a required sensitivity label |
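When triaging ingestion results programmatically, the error states above can be mapped to short descriptions for logging or alerting. This is an illustrative helper, not part of any Unique SDK:

```python
# Failure states from the ingestion error-code table, for programmatic triage.
# Illustrative helper only; check the table above for the authoritative list.
FAILURE_STATES = {
    "FAILED": "General error",
    "FAILED_IMAGE": "Images are not supported",
    "FAILED_CREATING_CHUNKS": "Error while creating chunks",
    "FAILED_EMBEDDING": "Error while creating embeddings",
    "FAILED_GETTING_FILE": "Error while fetching the file",
    "FAILED_PARSING": "Error while parsing text from the document",
    "FAILED_REDELIVERED": "General error or time limit exceeded",
    "FAILED_TOO_LESS_CONTENT": "Insufficient meaningful text content",
    "FAILED_MALWARE_FOUND": "Rejected by malware scanner",
    "FAILED_METADATA_VALIDATION": "Metadata validation failed",
}

def describe_failure(state: str) -> str:
    """Map an ingestion state to a short description, or report no known failure."""
    return FAILURE_STATES.get(state, "No known failure state")

print(describe_failure("FAILED_EMBEDDING"))  # Error while creating embeddings
```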


API

Documentation for the API can be found in the developer section: https://unique-ch.atlassian.net/wiki/spaces/PUBDOC/pages/1384776553


Limitations

  • Only supported file types can be ingested.

  • Take into account price variations across different ingestion methods.

    • MDI

    • MDI + Vision

  • Be mindful of region availability for each ingestion method.

    • MDI

  • Consider rate limits associated with various ingestion methods, which may result in delays or errors.

 


Author

@Martin Fadler

 

© 2026 Unique AG. All rights reserved.