1 Summary
2 General Setup
3 Custom PDF Page Processing
- 3.1 Purpose
- 3.2 Ingestion Config
- 3.3 API requirements
4 Custom Chunking
- 4.1 Purpose
- 4.2 Ingestion Config
- 4.3 API requirements

Summary

This feature lets Unique AI customers plug a custom API into specific stages of the Unique FGPT ingestion. Based on the ingestion configuration of the content, Unique sends a synchronous API call at the configured stage so that the customer's own logic handles that stage.

General Setup

In the first version, customers need to provide their custom API details to Unique so that Unique can add the configuration to the workload that handles ingestion. This API configuration consists of an identifier, a URL, and an optional API key, e.g.:

{
  "identifier": "Custom_Ingestor",
  "url": "https://customUrl.com/pdfPageIngestor",
  "apiKey": "myAPIKey"
}

This API configuration is required for all types of custom API calls during the Unique ingestion process.

Custom PDF Page Processing

Purpose

When a custom PDF page processing API is set up, Unique starts the normal document ingestion process. Once it reaches the stage where the PDF is processed page by page, Unique does not use its internal solution for parsing markdown text out of each page; it calls the custom API to complete this stage instead.

For every PDF page, Unique calls the API with the base64-encoded data of that page and expects markdown text in return. After all pages have been processed via the API, Unique continues with the standard process for chunking, storing, embedding, etc.

Ingestion Config

To use custom PDF page processing, the ingestion config of the content needs to be adjusted. The workflow is similar to enabling Microsoft Document Intelligence. The ingestion config can be set either at scope level or directly on the content. The following example curl request sets it on a scope:

curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
--header 'Authorization: Bearer <yourToken>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "properties": {
        "ingestionConfig": {
            "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
            "customApiOptions": [{
                "customisationType": "CUSTOM_SINGLE_PAGE_API",
                "apiIdentifier": "Custom_Ingestor",
                "apiPayload": "{\"stringified\": \"JSON object or just a string\"}"
            }]
        }
    },
    "applyToSubScopes": true
}'

Attention! Make sure you do not overwrite a previously customised ingestionConfig. If in doubt, first fetch and inspect the current properties of the scope.
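One way to avoid overwriting existing settings is to merge the new options into the current ingestionConfig locally and send the merged result. The following is a minimal sketch in Python under stated assumptions: the helper names `with_custom_page_api` and `build_properties_body` are hypothetical, the current config is assumed to be already available as a dict (how to fetch it is not covered on this page), and it is assumed that the request replaces the stored ingestionConfig with whatever is sent.

```python
import copy
import json


def with_custom_page_api(current_config: dict, api_identifier: str, api_payload: dict) -> dict:
    """Return a copy of current_config with the custom single-page API enabled.

    Hypothetical helper: existing keys (e.g. chunkStrategy) are kept untouched.
    """
    config = copy.deepcopy(current_config)
    config["pdfReadMode"] = "CUSTOM_SINGLE_PAGE_API"
    option = {
        "customisationType": "CUSTOM_SINGLE_PAGE_API",
        "apiIdentifier": api_identifier,
        # apiPayload is a string field, so a JSON object has to be stringified.
        "apiPayload": json.dumps(api_payload),
    }
    # Keep options of other customisation types; replace one of the same type.
    others = [
        o for o in config.get("customApiOptions", [])
        if o.get("customisationType") != option["customisationType"]
    ]
    config["customApiOptions"] = others + [option]
    return config


def build_properties_body(ingestion_config: dict, apply_to_sub_scopes: bool = True) -> str:
    """Serialise the request body in the shape used by the curl example."""
    return json.dumps({
        "properties": {"ingestionConfig": ingestion_config},
        "applyToSubScopes": apply_to_sub_scopes,
    })
```

The string returned by `build_properties_body` can be used as the `--data-raw` value of the curl request. Using `json.dumps` for `apiPayload` also takes care of escaping the inner quotes.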

API requirements

Unique sends a POST request for each PDF page to the API defined in the API configuration (URL and API key). The body has the following structure:

{
  "data": "<Base64EncodedPdfPage>",
  "ingestionConfiguration": {<ingestionConfig>},
  "companyId": "<companyId>",
  "chatId": "<chatId or null>",
  "pageNumber": <pageNumber, from 1 to numberOfPages>
}

The API is expected to return a JSON response in the following format. extractedText should be the markdown string parsed from, or describing, the sent PDF page.

{
  "extractedText": "Extracted text from this PDF page in markdown format. This is getting joined with all other pages and processed further."
}
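A minimal sketch of an endpoint that follows this request/response contract, assuming Python and only the standard library. `extract_markdown` is a hypothetical placeholder for your own parsing logic, and the names `process_page`, `PageHandler`, and `serve` as well as the port are illustrative assumptions. How Unique transmits the API key is not specified on this page, so the sketch does not validate it.

```python
import base64
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_markdown(pdf_bytes: bytes, page_number: int) -> str:
    """Hypothetical placeholder: replace with your own PDF-to-markdown logic."""
    return f"## Page {page_number}\n\n({len(pdf_bytes)} bytes of PDF data received)"


def process_page(body: dict) -> dict:
    """Turn one request body sent by Unique into the expected response body."""
    # "data" holds the base64-encoded single PDF page.
    pdf_bytes = base64.b64decode(body["data"])
    markdown = extract_markdown(pdf_bytes, body["pageNumber"])
    return {"extractedText": markdown}


class PageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            response = process_page(json.loads(self.rfile.read(length)))
            status = 200
        except (KeyError, TypeError, ValueError) as err:
            # Missing field, invalid JSON, or invalid base64 data.
            response, status = {"error": str(err)}, 400
        payload = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start a local server; the port is an arbitrary choice for testing."""
    HTTPServer(("0.0.0.0", port), PageHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. Because the call is synchronous and made once per page, slow parsing logic directly extends the ingestion time of the document.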

Custom Chunking

Purpose

The Unique ingestion process allows customers to use a custom chunking mechanism. Before splitting the whole markdown text into chunks, Unique checks the ingestion configuration of the content. When a custom chunking configuration is set, Unique calls the custom API with the whole text of the document and expects an ordered array of chunks in return. Unique then creates embeddings from those chunks and stores them in the database.

Ingestion Config

To configure custom chunking, the ingestion config of the content needs to be adjusted. The workflow is similar to enabling Microsoft Document Intelligence. The ingestion config can be set either at scope level or directly on the content. The following example curl request sets it on a scope:

curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
--header 'Authorization: Bearer <yourToken>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "properties": {
        "ingestionConfig": {
            "chunkStrategy": "CUSTOM_CHUNKING_API",
            "customApiOptions": [{
                "customisationType": "CUSTOM_CHUNKING_API",
                "apiIdentifier": "Custom_Ingestor",
                "apiPayload": "{\"stringified\": \"JSON object or just a string\"}"
            }]
        }
    },
    "applyToSubScopes": true
}'

Attention! Make sure you do not overwrite a previously customised ingestionConfig. If in doubt, first fetch and inspect the current properties of the scope.

API requirements

Once the whole text of the document is ready to be chunked, Unique sends a POST request to the API defined in the API configuration (URL and API key). The body has the following structure:

{
  "text": "This is my plain text parsed from the document. It will be sent as whole text string.",
  "ingestionConfiguration": {<ingestionConfig>}
}

The API is expected to return a JSON response in the following format. chunks should be an array of strings containing the text chunks derived from the sent text, in document order.

{
  "chunks": ["This is my plain text parsed from the document.", "It will be sent as whole text string."]
}
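As an illustration of the logic behind such an endpoint, the sketch below maps a request body to the expected response in Python. The chunking strategy (greedy packing of paragraphs up to a character limit), the `max_chars` default, and the function names `chunk_text` and `process_chunking` are assumptions made for this example; any strategy works as long as the response is an ordered array of strings.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into ordered chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are packed greedily; a paragraph
    longer than max_chars is cut into fixed-size pieces.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-split paragraphs that exceed max_chars on their own.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


def process_chunking(body: dict) -> dict:
    """Turn the request body sent by Unique into the expected response body."""
    return {"chunks": chunk_text(body["text"])}
```

`process_chunking` can be wired into any HTTP framework that accepts a JSON POST body and returns the resulting dict as JSON.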

Author	@Adrian Gugger

3rd party APIs for customisation of ingestion
