3rd party APIs for customisation of ingestion

Summary

This feature enables customers of Unique FinanceGPT (FGPT) to plug in a custom API for specific stages of the Unique FGPT ingestion. Depending on the ingestion configuration of the content, Unique sends a synchronous API call at the relevant stage so that customers can run custom logic for that stage.

General Setup

In the first version, the custom APIs need to be provided to Unique so that Unique can add the configuration to the workload that handles the ingestion. This API configuration consists of an identifier, a URL and an optional API key. E.g.

{ "identifier": "Custom_Ingestor", "url": "https://customUrl.com/pdfPageIngestor", "apiKey": "myAPIKey" }

This API configuration setup is needed for all types of custom API calls during the Unique ingestion process.
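Before handing the configuration over to Unique, it can help to sanity-check it locally. The following is a minimal sketch, not part of the Unique platform: the class and function names are hypothetical, and requiring an https URL is an assumption rather than a documented rule.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class CustomApiConfig:
    """Hypothetical local representation of the API configuration."""
    identifier: str
    url: str
    api_key: Optional[str] = None  # the API key is optional


def parse_api_config(raw: dict) -> CustomApiConfig:
    """Validate a raw configuration dict and return a typed object."""
    identifier = raw.get("identifier")
    url = raw.get("url")
    if not isinstance(identifier, str) or not identifier:
        raise ValueError("identifier must be a non-empty string")
    if not isinstance(url, str):
        raise ValueError("url must be a string")
    parsed = urlparse(url)
    # Assumption: only https endpoints are acceptable.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("url must be an absolute https URL")
    return CustomApiConfig(identifier, url, raw.get("apiKey"))
```

Calling `parse_api_config` with the example configuration above returns a `CustomApiConfig` whose `api_key` is `"myAPIKey"`; leaving out `apiKey` yields `None`.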

Custom PDF Page Processing

Purpose

When a custom PDF page processing API is set up, Unique starts the normal document ingestion process. As soon as the stage that processes the PDF page by page is reached, however, Unique does not use its internal solution to parse Markdown text out of each PDF page; it calls the custom API to complete this stage instead.

That is, for every PDF page Unique calls the API with the base64 data of that page and expects Markdown text in return. After all pages have been processed via the API, Unique continues with the standard process for chunking, storing, embedding, etc.
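The per-page flow described above can be pictured roughly as follows. This is an illustrative sketch only, not Unique's implementation: `process_pdf_pages` and `call_page_api` are hypothetical names, and joining the pages with a blank line is an assumption.

```python
from typing import Callable, List


def process_pdf_pages(
    pages_base64: List[str],
    call_page_api: Callable[[str, int], str],
) -> str:
    """Send each page to the custom API and combine the returned Markdown.

    call_page_api stands in for the synchronous HTTP call; it receives the
    base64 page data and the 1-based page number and returns Markdown.
    """
    markdown_pages = []
    for page_number, page_data in enumerate(pages_base64, start=1):
        markdown_pages.append(call_page_api(page_data, page_number))
    # Assumption: pages are joined with a blank line before chunking.
    return "\n\n".join(markdown_pages)
```

Passing the API call in as a function keeps the sketch testable without a network connection.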

Ingestion Config

To use custom PDF page processing, the ingestion config of the content needs to be adjusted. The workflow is similar to using Microsoft Document Intelligence. The ingestion config can be set either on scope level or directly on the content. This is an example curl for that:

curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
  --header 'Authorization: Bearer <yourToken>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "properties": {
      "ingestionConfig": {
        "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
        "customApiOptions": [{
          "customisationType": "CUSTOM_SINGLE_PAGE_API",
          "apiIdentifier": "Custom_Ingestor",
          "apiPayload": "{\"stringified\": \"JSON object or just a string\"}"
        }]
      }
    },
    "applyToSubScopes": true
  }'

Attention! Make sure you do not overwrite a previously customised ingestionConfig. If in doubt, first fetch and inspect the current properties of the scope.
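One way to avoid overwriting existing settings is to merge the new values into the current ingestionConfig before sending it. The sketch below assumes the current config has already been fetched; `merge_ingestion_config` is a hypothetical helper, and replacing customApiOptions entries by their customisationType is an assumption about the desired merge behaviour.

```python
import copy


def merge_ingestion_config(current: dict, update: dict) -> dict:
    """Return a new config with update applied on top of current.

    Existing keys that update does not mention are preserved.
    Entries in customApiOptions are replaced per customisationType.
    """
    merged = copy.deepcopy(current)  # never mutate the fetched config
    for key, value in update.items():
        if key == "customApiOptions":
            by_type = {
                option["customisationType"]: option
                for option in merged.get(key, [])
            }
            for option in value:
                by_type[option["customisationType"]] = option
            merged[key] = list(by_type.values())
        else:
            merged[key] = value
    return merged
```

The merged result can then be used as the ingestionConfig in the request body.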

API requirements

Unique sends a POST request for each PDF page to the API defined in the API configuration (URL and API key). The body has the following structure:

{
  "data": "<Base64EncodedPdfPage>",
  "ingestionConfiguration": {<ingestionConfig>},
  "companyId": "<companyId>",
  "chatId": "<chatId or null>",
  "pageNumber": <pageNumber, from 1 to numberOfPages>
}

The API is expected to return a JSON response. Its extractedText should be the Markdown string parsed from, or describing, the PDF page that was sent.
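On the customer side, the core of such an endpoint could look like the sketch below. It is framework-agnostic and hedged: `handle_page_request` and `extract_markdown` are hypothetical names, the actual PDF parsing is left to the caller, and the response is assumed to consist of the extractedText field only, since that is the only field named here.

```python
import base64
import json
from typing import Callable


def handle_page_request(
    body: dict,
    extract_markdown: Callable[[bytes, int], str],
) -> str:
    """Turn one request body from Unique into a JSON response string.

    extract_markdown is the customer's own logic: it receives the raw
    PDF page bytes and the page number and returns Markdown.
    """
    # validate=True rejects input that is not clean base64.
    pdf_bytes = base64.b64decode(body["data"], validate=True)
    markdown = extract_markdown(pdf_bytes, body["pageNumber"])
    # Assumption: the response object only needs extractedText.
    return json.dumps({"extractedText": markdown})
```

The function can be wired into any web framework by passing it the parsed request body and returning its result with a JSON content type.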

Custom Chunking

Purpose

The Unique ingestion process allows customers to apply a custom chunking mechanism. Before the stage that chunks the whole Markdown text into pieces, Unique checks the configuration of the content. When a custom chunking configuration is set, Unique calls a custom API with the whole text of the document and expects an ordered array of chunks in return. Unique then creates embeddings from those chunks and stores them in the database.

Ingestion Config

To configure custom chunking, the ingestion config of the content needs to be adjusted. The workflow is similar to using Microsoft Document Intelligence. The ingestion config can be set either on scope level or directly on the content. This is an example curl for that:

Attention! Make sure you do not overwrite a previously customised ingestionConfig. If in doubt, first fetch and inspect the current properties of the scope.

API requirements

Unique sends a POST request to the API defined in the API configuration (URL and API key) once the whole text of the document is ready to be chunked. The body contains the following structure:

The API is expected to return a JSON response. Its chunks should be a string array of text chunks, in order, based on the text that was sent.
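As an illustration of what a custom chunking API might compute, the sketch below groups paragraphs into ordered chunks up to a size limit and wraps them in a JSON response. The paragraph-based strategy, the function names and the `max_chars` parameter are all assumptions for this example; the response is assumed to carry the array under a chunks key.

```python
import json
from typing import List


def chunk_markdown(text: str, max_chars: int = 1000) -> List[str]:
    """Group paragraphs into ordered chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its
    own oversized chunk rather than being split mid-paragraph.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def build_chunking_response(text: str, max_chars: int = 1000) -> str:
    """Serialise the ordered chunks as a JSON response string."""
    return json.dumps({"chunks": chunk_markdown(text, max_chars)})
```

Because the chunks are appended in reading order, the array preserves the order of the original document.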

Author

@Adrian Gugger

© 2024 Unique AG. All rights reserved.