The ingestion service powers Unique's knowledge base by converting uploaded files into vector embeddings for use by the platform's LLMs. It supports various file types, including PDF, Word, Excel, and others, with flexible ingestion methods to meet diverse needs. Administrators can choose between ingestion methods and customize ingestion settings and chunking strategies for each knowledge base folder, optimizing processing for specific document sets.
Who is it for
This documentation focuses specifically on administrators who need to manage and configure advanced ingestion capabilities for their organizations. Administrators who require granular control over document processing, cost management, and optimization of search results will benefit most from understanding the comprehensive ingestion options and customization features available within the platform.
Permission Requirements: Users must have the Can manage permission for a folder to access its ingestion configuration. Only users with the knowledge.write role can be granted the Can manage permission. See more details here: https://unique-ch.atlassian.net/wiki/spaces/PUBDOC/pages/1414135918
Benefits and Use Cases
The ingestion service processes diverse company documents through a unified platform, offering flexible customization to optimize results. Folder-level settings allow organizations to tailor processing methods to specific document types, improving search accuracy and relevance.
By combining the default service, specialized MDI tools, and emerging vision-enhanced features, the platform delivers scalable, high-quality solutions that adapt to evolving document management needs.
The Unique platform supports multiple services for ingesting PDF documents:
Unique Ingestion
Docling
Microsoft Document Intelligence (MDI in the following)
Each service can parse structured PDFs with a single-column layout and extract simple tables. However, their capabilities vary when handling more complex documents:
Image-based PDFs: Scanned or printed PDFs lack structured content, requiring OCR techniques for extraction.
Multi-Column Layout: PDFs with multiple columns, charts, tables, and text need pre-trained layout detection models to identify page elements and preserve logical content flow.
Complex Tables Detection: Extracting tables with merged cells, missing borders, or checkmarks requires specialized AI models to recognize different table components.
Image Content Extraction: Many PDFs contain unstructured visual elements like charts, logos, or photos. AI models with image-to-text capabilities are needed to extract this content in a searchable form.
On-Prem Deployment: The service can operate in a closed environment without internet access
See the Azure AI Document Intelligence documentation for further details. Note that MDI processing takes roughly 20-30 seconds per page.
On-Prem Customers: Use Docling for PDF ingestion, as the Default Unique Ingestion Service lacks efficient support for multi-column layouts.
Cloud Customers: Use MDI as the default ingestion service, as it provides higher accuracy than Docling, particularly for tables without grid lines.
Please also consider the price differences between the ingestion methods (e.g., $0.024 per page for MDI).
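Since pricing is per page, total cost scales linearly with document volume. A minimal sketch of a cost estimate, using the per-page MDI price quoted above (actual billing may differ):

```python
# Per-page MDI price as quoted in this documentation (USD).
MDI_PRICE_PER_PAGE_USD = 0.024

def mdi_cost_usd(pages: int) -> float:
    """Estimated MDI ingestion cost in USD for a document with the given page count."""
    return pages * MDI_PRICE_PER_PAGE_USD

print(f"${mdi_cost_usd(500):.2f}")  # a 500-page document: $12.00
```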
Step-by-Step Guide
Step 1: Open the Ingestion Configuration
Navigate to the folder in the knowledge base that you want to review or set up for ingestion. Then, click the ‘Configure Files Ingestion’ button located on the right.
This button will only be visible if you have the Can manage permission for the folder.
Step 2: Change the Ingestion Configuration
PDF
PDFs on Unique are ingested page by page.
Two ingestion modes are currently implemented:
PDFTODOCX_ONLY: Use our default library: PDFs are converted using pdf2docx (default)
DOC_INTELLIGENCE_DEFAULT: Use MDI on all pages of the document
Word
The default process directly extracts the content of a Word file, including text and tables with their underlying formatting. However, it does not extract content from images (e.g., if a table is embedded as an image in the Word file). There is an option to use the MDI service for Word files, which can also extract text from images. This process first converts the Word file to a PDF to utilize the full capabilities of the MDI service:
WORD_DEFAULT_INGESTION: Use the default Word ingestion mechanism (without MDI)
INGEST_WORD_AS_PDF: Convert the Word document to PDF and run the PDF ingestion service on the converted file
Enable MDI
The default pipeline currently in place may not adequately process certain PDF and Word documents, particularly when encountering improperly formatted data (e.g., tables in financial documents, images with text).
Microsoft Document Intelligence can be activated on a per-scope or per-folder basis, including hierarchical scopes with inheritance. For single-tenant setups, ensure that the service is fully provisioned before enabling it.
MDI is enabled on the ingestion service per scopeId. Replace the placeholders:
To use the MDI processing in a specific space when uploading a document to the chat, the ingestion config in the Advanced Settings in the space management must be changed as follows:
Charts and table-like images are transformed into a table and a searchable description is added
Logos are translated to the brand name / text
For other images a searchable description is added
Further optimizes extracted page content (optional)
The service can run the extraction with three methods:
MDI: Uses MDI to extract page content and optionally performs an optimization with the Vision model.
MDI + Vision: Uses MDI to extract page content and a Vision model to extract the content from each detected image in parallel.
Vision: Uses only the Vision model to extract page content.
Each extraction method can apply an additional Page Content Optimizer step that evaluates the extracted page content and further improves it using a Vision model.
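Conceptually, the optimizer re-evaluates and regenerates the content until a quality threshold is reached or the loop budget is exhausted. The following is a minimal sketch, not the actual implementation; the evaluate and improve callables are hypothetical stand-ins for the Vision model calls, and the parameter names mirror the maxLoops and scoreThreshold fields in the configuration examples below:

```python
from typing import Callable

def optimize_page_content(
    content: str,
    evaluate: Callable[[str], tuple[float, str]],  # hypothetical: returns (score, feedback)
    improve: Callable[[str, str], str],            # hypothetical: returns improved content
    max_loops: int = 2,          # mirrors pageContentOptimizerConfig.maxLoops
    score_threshold: float = 0.95,  # mirrors pageContentOptimizerConfig.scoreThreshold
) -> str:
    """Sketch of the Page Content Optimizer loop semantics."""
    for _ in range(max_loops):
        score, feedback = evaluate(content)
        if score >= score_threshold:
            break  # content already meets the quality bar, stop early
        content = improve(content, feedback)
    return content
```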
Agentic Document Ingestion Overview
To use this custom PDF page processing for a specific Scope or Content in the Knowledge Base, the ingestion config of the content needs to be adjusted.
Enable via Ingestion Config UI
Click “Configure File Ingestion” for the scope of interest
Select “Custom Single Page API” for PDF ingestion
Enter “Unique Text and Image Extraction API” in API Identifier
Enter API Payload when you intend to change the default configuration, see below
Via API Call
The ingestionBaseUrl is different depending on where your Unique instance is hosted.
By default, the MDI_VISION extraction method is used; see below for details on how to change and further configure the extraction method.
Enable for Upload in Chat
To use the custom PDF page processing in a specific space when uploading a document to the chat, the ingestion config in the Advanced Settings of the space management must be changed as follows:
By default, the MDI_VISION extraction method is used; see below for details on how to change and further configure the extraction method.
Changing the extraction method with the apiPayload
apiPayloads must be provided as a JSON-compatible string, so the JSON objects below must first be converted to strings.
Through the optional apiPayload string parameter, the different extraction methods can be configured. By default, the MDI_VISION extraction method is used. To change the extractionMethod, set the payload to one of the following values:
"{ \"extractionMethod\": \"MDI\"}"
"{ \"extractionMethod\": \"MDI_VISION\"}"
"{ \"extractionMethod\": \"VISION\"}"
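If the payload is built programmatically, a standard JSON encoder produces the required string form automatically. A minimal Python sketch (the surrounding apiPayload field name follows the UI label above and is an assumption about the enclosing config):

```python
import json

# Payload object; the extractionMethod values are listed above.
payload = {"extractionMethod": "MDI_VISION"}

# The apiPayload itself is simply the JSON object serialized to a string:
api_payload = json.dumps(payload)

# When that string is embedded in a surrounding JSON config, the inner
# quotes are escaped automatically, yielding the form shown above:
embedded = json.dumps({"apiPayload": api_payload})
print(embedded)  # {"apiPayload": "{\"extractionMethod\": \"MDI_VISION\"}"}
```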
The page content optimization step is disabled by default. In order to enable it, adapt the apiPayload as follows:
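For example, to enable the optimizer while keeping the default MDI_VISION extraction, the apiPayload could be set as follows (a minimal sketch based on the pageContentOptimizerConfig fields shown in the full configuration examples below):

```
"{ \"extractionMethod\": \"MDI_VISION\", \"pageContentOptimizerConfig\": { \"apply\": true } }"
```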
Each extraction method has further configuration options, see below. Make sure to provide the JSON object as a string for the apiPayload:
{
"extractionMethod": "MDI",
"languageModel": "AZURE_GPT_4o_2024_0806",
"pageContentExtractorMdiConfig": {
"useHighResolution": true
},
"pageContentOptimizerConfig": {
"apply": false,
"maxLoops": 2,
"scoreThreshold": 0.95,
"evaluatorSystemPrompt": "\nYou are a helpful assistant that evaluates the quality of extracted content based on\na document image and the extracted content.\n",
"evaluatorUserPrompt": "\nPlease evaluate the quality of the extracted information using the document image.\n\nExtracted information: ${current_response}\n\nYour tasks: \n1. Give instructions on how to improve the extracted information. Be as specific as possible.\n2. Assess whether the extracted information meets the following evaluation criteria:\n - Information has been completely extracted from the image\n - Information is structured logically and coherently as in the image\n - Information is accurate as represented in the image\n - Numerical values are correct and have a unit of measurement (e.g., 30% CAGR instead of 30%)\n - Charts have been converted into tables when numerical values have been extracted\n - No numerical values have been approximated or rounded or interpolated\n - No values have been added that are not represented in the image\n - Color coded values have been converted into text\n - Information from legends have been correctly assigned to the corresponding values\n3. Give a score between 0 and 1 for the quality of the extracted information (0 is bad, 1 is perfect).\n\nExample output:\n{\n \"improvement_instructions\": \"Here your specific instructions on how to improve the extracted information. Only outline the changes to be made, do not include any other text.\",\n \"meets_criteria\": false, # Assessment of the criteria listed above, only return true if all relevant criteria are met\n \"score\": 0.5 # Here the score between 0 and 1\n}\n",
"generatorSystemPrompt": "\nYou are a helpful assistant that improves content extracted from an image based on feedback\nand the original image.\n",
"generatorUserPrompt": "\nOriginal extracted content: ${current_response}\nFeedback for improving the extracted content: ${feedback}\n\nAddress all the feedback and improve the extracted content.\nAlso explain how you addressed the feedback.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you addressed the feedback\",\n \"improved_content\": \"Here the improved extracted content\"\n}\n"
}
}
{
"extractionMethod": "MDI_VISION",
"languageModel": "AZURE_GPT_4o_2024_0806",
"pageContentExtractorMdiConfig": {
"useHighResolution": true
},
"imageContentExtractorConfig": {
"imagesInParallel": 3,
"classifierSystemPrompt": "You are an image classifier assistant and help to classify the contents of a cropped image of a document page.\n\nYou are given the whole document page as a reference and the cropped image that you should classify.\n",
"classifierUserPrompt": "First locate the cropped image within the document page. Only then classify the cropped image into one of the following categories: \n- chart_with_numerical_values: A chart in which numerical text values are present (do not consider the axis values) and can be extracted with high accuracy.\n- chart_without_numerical_values: A chart in which numerical text values are not present (do not consider the axis values) and cannot be extracted with high accuracy.\n- table_structure: A structure that displays data in a tabular format with headers, rows, columns and cells.\n- mixed_content: A combination of different content types, e.g., charts and tables and logos.\n- logo: A logo of a company or brand.\n- icon: A single icon that is a symbol for a tool, product, service, etc, e.g., a tool icon.\n- illustrative_picture: An illustrative picture that only serves to illustrate the text and does not contain any useful, related information.\n\nIn addition to the category, explain your reasoning why you chose the category.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you classify the image. Keep it short but complete.\",\n \"category\": \"Here the category\"\n}\n",
"documentReferencePrompt": "Here is the whole document page as a reference:\n",
"extractorCategoryToSystemPrompts": {
"chart_with_numerical_values": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains a chart with numerical values.\n",
"chart_without_numerical_values": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains a chart without numerical values.\n",
"default": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page.\n",
"logo": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains a logo.\n",
"mixed_content": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains mixed content, e.g., diagram and table.\n",
"table_structure": "You are an image content extractor and help to extract information in a structured form from a cropped image of a document page. The cropped image contains a table like structure.\n"
},
"extractorCategoryToUserPrompts": {
"chart_with_numerical_values": "Extract the chart data and structure from the image as a html table and explain your reasoning.\n\nFollow these steps:\n1. Clearly separate what belongs to the chart and what does not using the document image as a reference.\n2. Only consider what belongs to the chart and exclude any information that does not belong to it.\n3. Extract a maximum of ten question and answer pairs about the charts content.\n4. Then combine the found answers to a description. Do not include the questions in the description. Only describe what the chart is about or describes, not the technical elements of the chart, e.g., \"the chart has a x-axis and a y-axis\".\n5. Represent the text and numerical values of the chart as a table. Do not approximate values or make assumptions.\n6. When color coded values are present in the chart, represent them in the table as text values.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you create the html table and the description. Keep it short but complete.\",\n \"image_content\": \"Here the html table and the description of the chart\"\n}\n",
"chart_without_numerical_values": "Describe the chart in a meaningful way and describe your reasoning.\n\nFollow these steps:\n1. Clearly separate what belongs to the chart and what does not using the document image as a reference.\n2. Only consider what belongs to the chart and exclude any information that does not belong to it.\n3. Extract a maximum of ten question and answer pairs about the charts content. Do not approximate values or make assumptions.\n4. Then combine the found answers to a description. Do not include the questions in the description. Only describe what the chart is about or describes, not the technical elements of the chart, e.g., \"the chart has a x-axis and a y-axis\".\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you create the description. Keep it short but complete.\",\n \"image_content\": \"Here the description\"\n}\n",
"default": "Extract a maximum of ten text question and answer pairs from the image. Then combine the found answers to a description. Do not include the questions in the description. \n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the content. Keep it short but complete.\",\n \"image_content\": \"Here the description\"\n}\n",
"logo": "Output the company or brand name from the image. Output only the name and nothing else. If the company or company name is unknown to you, then output only the text if possible otherwise nothing.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the logo. Keep it short but complete.\",\n \"image_content\": \"Here the company or brand name\"\n}\n",
"mixed_content": "First identify all the different elements, e.g., charts or diagrams. Then extract all content for each element in a structured way. Use an html as structure where possible or use markdown. Make sure to preserve the information structure and the original text. Explain your reasoning for extracting the content.\n\nFollow these steps:\n1. Identify all elements in the image, e.g., charts, tables, logos, etc.\n2. Analyze which information belongs together and must be clustered.\n3. Then extract all content for each cluster in a structured way. \n4. Output the image content in html where possible, otherwise use markdown.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the content. Keep it short but complete.\",\n \"image_content\": \"Here the extracted content\"\n}\n",
"table_structure": "Extract the table like structure from the image as a html table and explain your reasoning.\n\nFollow these steps:\n1. Clearly separate what belongs to the table and what does not using the document image as a reference.\n2. Carefully think about the structure of the table. \n3. Extract the headers (columns/rows) first.\n4. Then assign the cells to the headers and make sure to merge cells whenever they span multiple columns/rows.\n5. Correctly extract the values in the cells and align them with the extracted structure.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you create the html table. Keep it short but complete.\",\n \"image_content\": \"Here the html table\"\n}\n"
},
"noExtractionForCategories": [
"illustrative_picture",
"icon"
]
},
"pageContentOptimizerConfig": {
"apply": false,
"maxLoops": 2,
"scoreThreshold": 0.95,
"evaluatorSystemPrompt": "\nYou are a helpful assistant that evaluates the quality of extracted content based on\na document image and the extracted content.\n",
"evaluatorUserPrompt": "\nPlease evaluate the quality of the extracted information using the document image.\n\nExtracted information: ${current_response}\n\nYour tasks: \n1. Give instructions on how to improve the extracted information. Be as specific as possible.\n2. Assess whether the extracted information meets the following evaluation criteria:\n - Information has been completely extracted from the image\n - Information is structured logically and coherently as in the image\n - Information is accurate as represented in the image\n - Numerical values are correct and have a unit of measurement (e.g., 30% CAGR instead of 30%)\n - Charts have been converted into tables when numerical values have been extracted\n - No numerical values have been approximated or rounded or interpolated\n - No values have been added that are not represented in the image\n - Color coded values have been converted into text\n - Information from legends have been correctly assigned to the corresponding values\n3. Give a score between 0 and 1 for the quality of the extracted information (0 is bad, 1 is perfect).\n\nExample output:\n{\n \"improvement_instructions\": \"Here your specific instructions on how to improve the extracted information. Only outline the changes to be made, do not include any other text.\",\n \"meets_criteria\": false, # Assessment of the criteria listed above, only return true if all relevant criteria are met\n \"score\": 0.5 # Here the score between 0 and 1\n}\n",
"generatorSystemPrompt": "\nYou are a helpful assistant that improves content extracted from an image based on feedback\nand the original image.\n",
"generatorUserPrompt": "\nOriginal extracted content: ${current_response}\nFeedback for improving the extracted content: ${feedback}\n\nAddress all the feedback and improve the extracted content.\nAlso explain how you addressed the feedback.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you addressed the feedback\",\n \"improved_content\": \"Here the improved extracted content\"\n}\n"
}
}
{
"extractionMethod": "VISION",
"languageModel": "AZURE_GPT_4o_2024_0806",
"pageContentExtractorVisionConfig": {
"systemPrompt": "You are a helpful assistant that extracts content from an image.",
"userPrompt": "First identify all the different elements, e.g., charts or diagrams. Then extract all content for each element in a structured way. Use an html as structure where possible or use markdown. Make sure to preserve the information structure and the original text. Explain your reasoning for extracting the content.\n\nFollow these steps:\n1. Identify all elements in the image, e.g., charts, tables, logos, etc.\n2. Analyze which information belongs together and must be clustered.\n3. Then extract all content for each cluster in a structured way. \n4. Convert charts into tables when numerical values are present.\n5. Convert color coded values into text.\n6. Extract information from legends and assign it to the corresponding values.\n7. Output the image content in html where possible, otherwise use markdown.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you extract the content. Keep it short but complete.\",\n \"image_content\": \"Here the extracted content\"\n}\n"
},
"pageContentOptimizerConfig": {
"apply": false,
"maxLoops": 2,
"scoreThreshold": 0.95,
"evaluatorSystemPrompt": "You are a helpful assistant that evaluates the quality of extracted content based on\na document image and the extracted content.\n",
"evaluatorUserPrompt": "Please evaluate the quality of the extracted information using the document image.\n\nExtracted information: ${current_response}\n\nYour tasks: \n1. Give instructions on how to improve the extracted information. Be as specific as possible.\n2. Assess whether the extracted information meets the following evaluation criteria:\n - Information has been completely extracted from the image\n - Information is structured logically and coherently as in the image\n - Information is accurate as represented in the image\n - Numerical values are correct and have a unit of measurement (e.g., 30% CAGR instead of 30%)\n - Charts have been converted into tables when numerical values have been extracted\n - No numerical values have been approximated or rounded or interpolated\n - No values have been added that are not represented in the image\n - Color coded values have been converted into text\n - Information from legends have been correctly assigned to the corresponding values\n3. Give a score between 0 and 1 for the quality of the extracted information (0 is bad, 1 is perfect).\n\nExample output:\n{\n \"improvement_instructions\": \"Here your specific instructions on how to improve the extracted information. Only outline the changes to be made, do not include any other text.\",\n \"meets_criteria\": false, # Assessment of the criteria listed above, only return true if all relevant criteria are met\n \"score\": 0.5 # Here the score between 0 and 1\n}\n",
"generatorSystemPrompt": "You are a helpful assistant that improves content extracted from an image based on feedback\nand the original image.\n",
"generatorUserPrompt": "Original extracted content: ${current_response}\nFeedback for improving the extracted content: ${feedback}\n\nAddress all the feedback and improve the extracted content.\nAlso explain how you addressed the feedback.\n\nExample output:\n{\n \"reasoning\": \"Explain your decisions and reasoning on how you addressed the feedback\",\n \"improved_content\": \"Here the improved extracted content\"\n}\n"
}
}
Tips & Tricks
Improving Document Ingestion Quality
Converting a document from PDF to a format that can be read by large language models (LLMs), such as Markdown, is a critical step for ensuring the high quality of outputs from Unique AI. Maintaining the integrity of text structures, especially those within tables and complex structures, is a particularly challenging task that is currently the focus of extensive research.
Layout
Structured Data: Text should have clear headings, subheadings, bullet points or numbers. Clear and distinct paragraphs focused on a single topic with consistent spacing between lines and paragraphs.
Footnotes: The link between footnote and usage in paragraph can get lost. Include information in paragraphs if possible.
Links: Avoid links to other paragraphs (e.g. 'as discussed in section 3') as this connection might not be understood
Multi Columns: Avoid multi-column layouts if possible.
Remove Noise: Footnotes, page numbers, headers, and footers should be eliminated if they do not contribute to the content’s meaning.
Text content
Language and Style: The document should be written in clear, concise sentences and provide enough context for understanding.
Metadata: The document should include a descriptive title.
Contextual Clues: Context should be provided where necessary, especially when introducing acronyms, technical terms, or jargon.
Tables
Headings: Tables should be simply formatted with clear headings for rows and columns for easy parsing by the model.
Borders: Tables should have full borders.
Multi-page tables: Avoid tables that span multiple pages. If a table must continue across pages, repeat the headings on each page, and ensure the content of any single cell fits on one page (no breaking a cell across pages).
Ingestion Error Codes
Please always check the ingestion state in the knowledge center to see if the file ingestion has been successful or not.
Display Name
Error State
Description
Ingestion failed (General Error)
FAILED
Generic failure state when the ingestion process encounters an unspecified error
Ingestion failed (Images are not supported)
FAILED_IMAGE
File ingestion failed because the uploaded file is an image format, which is not supported for text processing
Ingestion failed (While creating chunks)
FAILED_CREATING_CHUNKS
Error occurred during the chunking phase where the document content is split into smaller segments
Ingestion failed (While creating embedding)
FAILED_EMBEDDING
Failure during the embedding generation process where text is converted to vector representations
Ingestion failed (While fetching the file)
FAILED_GETTING_FILE
Error retrieving the file from its source location during the ingestion process
Ingestion failed (While parsing the text from the document)
FAILED_PARSING
Document parsing failed, unable to extract readable text content from the file
Ingestion failed (General error or time limit exceeded)
FAILED_REDELIVERED
Ingestion failed after retry attempts, either due to persistent errors or timeout
Ingestion failed (Could not parse a lot of text. Document might have no meaning)
FAILED_TOO_LESS_CONTENT
Document contains insufficient meaningful text content for successful ingestion
Rejected by malware scanner
FAILED_MALWARE_FOUND
File was rejected during security scanning due to detected malware or suspicious content
Metadata validation failed
FAILED_METADATA_VALIDATION
File metadata does not meet validation requirements or contains invalid data, e.g., it did not contain a required sensitivity label.