Summary
This feature lets customers of Unique FinanceGPT use a custom API for specific stages of the Unique FinanceGPT ingestion. Depending on the ingestion configuration of the content, Unique sends a synchronous API call at the relevant stage so that the customer can run custom logic for that stage.
General Setup
In the first version, the custom APIs must be provided to Unique so that Unique can add the configuration to the workload that handles the ingestion. An API configuration consists of an identifier, a URL and an optional API key. For example:
{ "identifier": "Custom PDF Page Ingestor", "url": "https://customUrl.com/pdfPageIngestor", "apiKey": "myAPIKey" }
This API configuration is required for every type of custom API call made during the Unique ingestion process.
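As a minimal sketch, a configuration entry like the one above could be modelled and sanity-checked before it is handed over to Unique. The class name `CustomApiConfig`, the method `to_json_dict` and the validation rules (non-empty identifier, http(s) URL) are assumptions made for illustration; they are not part of the Unique product.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class CustomApiConfig:
    """Hypothetical model of one custom API configuration entry."""

    identifier: str
    url: str
    api_key: Optional[str] = None  # the API key is optional

    def __post_init__(self) -> None:
        # Assumed validation rules, chosen for illustration only.
        if not self.identifier.strip():
            raise ValueError("identifier must not be empty")
        parsed = urlparse(self.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"url is not a valid http(s) URL: {self.url!r}")

    def to_json_dict(self) -> dict:
        """Serialize with the key names used in the example configuration."""
        result = {"identifier": self.identifier, "url": self.url}
        if self.api_key is not None:
            result["apiKey"] = self.api_key
        return result


config = CustomApiConfig(
    identifier="Custom PDF Page Ingestor",
    url="https://customUrl.com/pdfPageIngestor",
    api_key="myAPIKey",
)
```

The identifier is the value that the ingestion config later refers to through `customApiIdentifier`, so it is worth keeping it stable once it has been shared with Unique.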
Custom PDF Page Processing
Purpose
When a custom PDF page processing API is set up, Unique starts the normal document ingestion process. As soon as the stage that processes the PDF page by page is reached, Unique no longer uses its internal solution to parse markdown text out of each PDF page. Instead, it calls the custom API to complete this stage.
This means that for every PDF page, Unique calls the API with the base64 data of that page and expects markdown text in return. After all pages have been processed via the API, Unique continues with the standard process of chunking, storing, embedding, etc.
Ingestion Config
To use custom PDF page processing, the ingestion config of the content needs to be adjusted. The workflow is similar to the one for Microsoft Document Intelligence. The ingestion config can be set either at scope level or directly on the content.
There is a breaking change planned for Release 2024.50: customApiIdentifier and customApiPayload will be adjusted. The goal is to support a separate endpoint for each ingestion stage:
PDF page processing → Endpoint A
Custom chunking → Endpoint B
This is currently not possible because both stages use the same names for the identifier and the payload.
Example curl request:
curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
  --header 'Authorization: Bearer <yourToken>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "properties": {
      "ingestionConfig": {
        "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
        "customApiIdentifier": "Custom PDF Page Ingestor",
        "customApiPayload": {<Optional - anyString>}
      }
    },
    "applyToSubScopes": true
  }'
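The same request can be prepared from Python. The following is a sketch under stated assumptions: the function name `build_pdf_page_config_request` and its parameters are hypothetical, the gateway is assumed to be reached over https, and the type of `customApiPayload` is left open because the source only describes it as optional. The request is built but not sent.

```python
import json
import urllib.request
from typing import Any, Optional


def build_pdf_page_config_request(
    base_url: str,
    scope_id: str,
    token: str,
    api_identifier: str,
    api_payload: Optional[Any] = None,
    apply_to_sub_scopes: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) the POST request that sets the ingestion config."""
    ingestion_config = {
        "pdfReadMode": "CUSTOM_SINGLE_PAGE_API",
        "customApiIdentifier": api_identifier,
    }
    # customApiPayload is optional, so it is only included when provided.
    if api_payload is not None:
        ingestion_config["customApiPayload"] = api_payload

    body = {
        "properties": {"ingestionConfig": ingestion_config},
        "applyToSubScopes": apply_to_sub_scopes,
    }
    return urllib.request.Request(
        url=f"https://gateway.{base_url}/ingestion/v1/folder/{scope_id}/properties",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace them with the real base URL, scope ID and token.
request = build_pdf_page_config_request(
    base_url="example.com",
    scope_id="scope_123",
    token="yourToken",
    api_identifier="Custom PDF Page Ingestor",
)
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction and sending separate makes the payload easy to inspect before it reaches the gateway.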
API requirements
Unique sends one POST request per PDF page to the API defined in the API configuration (URL and API key). The request body has the following structure:
{ "data": "<Base64EncodedPdfPage>", "ingestionConfiguration": {<ingestionConfig>}, "companyId": "<companyId>", "chatId": "<chatId or null>", "pageNumber": <starting 1 -> numberOfPages> }
The API is expected to return a JSON response in the following format. The extractedText should be the markdown string parsed from, or describing, the PDF page that was sent.
{ "extractedText": "Extracted text from this PDF page in markdown format. This is getting joined with all other pages and processed further." }
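On the customer side, the request and response shapes above could be handled roughly as follows. This is a sketch, not a reference implementation: the function names `handle_pdf_page_request` and `placeholder_extractor` are hypothetical, the actual PDF parsing is deliberately left to a caller-supplied function, and the web framework that receives the POST request is out of scope.

```python
import base64
from typing import Any, Callable, Dict


def handle_pdf_page_request(
    body: Dict[str, Any],
    extract_markdown: Callable[[bytes, int], str],
) -> Dict[str, str]:
    """Turn one request body sent by Unique into the expected response body."""
    # "data" carries the Base64-encoded PDF page.
    pdf_bytes = base64.b64decode(body["data"], validate=True)
    # "pageNumber" starts at 1.
    page_number = int(body["pageNumber"])
    markdown = extract_markdown(pdf_bytes, page_number)
    return {"extractedText": markdown}


def placeholder_extractor(pdf_bytes: bytes, page_number: int) -> str:
    """Stand-in for real PDF parsing; it only reports what it received."""
    return f"## Page {page_number}\n\n({len(pdf_bytes)} bytes of PDF data)"


example_body = {
    "data": base64.b64encode(b"%PDF-1.7 example").decode("ascii"),
    "ingestionConfiguration": {},
    "companyId": "<companyId>",
    "chatId": None,
    "pageNumber": 1,
}
response = handle_pdf_page_request(example_body, placeholder_extractor)
```

Because the call is synchronous and made once per page, the extractor should return within whatever timeout applies to the call; the source does not state that timeout.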
Custom Chunking
Purpose
The Unique ingestion process allows customers to use a custom chunking mechanism. Before running the stage that chunks the whole markdown text into pieces, Unique checks the ingestion configuration of the content. When a custom chunking configuration is set, Unique calls a custom API with the whole text of the document and expects an ordered array of chunks in return. Unique then creates embeddings from those chunks and stores them in the database.
Ingestion Config
To configure custom chunking, the ingestion config of the content needs to be adjusted. The workflow is similar to the one for Microsoft Document Intelligence. The ingestion config can be set either at scope level or directly on the content.
There is a breaking change planned for Release 2024.50: customApiIdentifier and customApiPayload will be adjusted. The goal is to support a separate endpoint for each ingestion stage:
PDF page processing → Endpoint A
Custom chunking → Endpoint B
This is currently not possible because both stages use the same names for the identifier and the payload.
Example curl request:
curl --location --request POST 'https://gateway.<baseUrl>/ingestion/v1/folder/<scopeId>/properties' \
  --header 'Authorization: Bearer <yourToken>' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "properties": {
      "ingestionConfig": {
        "chunkStrategy": "CUSTOM_CHUNKING_API",
        "customApiIdentifier": "Custom PDF Page Ingestor",
        "customApiPayload": {<Optional - anyString>}
      }
    },
    "applyToSubScopes": true
  }'
API requirements
Once the whole text of the document is ready to be chunked, Unique sends a single POST request to the API defined in the API configuration (URL and API key). The request body has the following structure:
{ "text": "This is my plain text parsed from the document. It will be sent as whole text string.", "ingestionConfiguration": {<ingestionConfig>} }
The API is expected to return a JSON response in the following format. The chunks field should be a string array of text chunks derived from the sent text, in document order.
{ "chunks": ["This is my plain text parsed from the document.", "It will be sent as whole text string."] }
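A customer-side handler for this contract could look like the sketch below. The function names `split_into_sentences` and `handle_chunking_request` are hypothetical, and sentence-based splitting is only one assumed strategy, chosen because it reproduces the example response above; any strategy that returns an ordered string array fits the described format.

```python
import re
from typing import Any, Dict, List


def split_into_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, e.g. when the input text is empty.
    return [part for part in parts if part]


def handle_chunking_request(body: Dict[str, Any]) -> Dict[str, List[str]]:
    """Turn the request body sent by Unique into the expected response body."""
    # "ingestionConfiguration" is available in the body but unused in this sketch.
    chunks = split_into_sentences(body["text"])
    return {"chunks": chunks}


example_body = {
    "text": (
        "This is my plain text parsed from the document. "
        "It will be sent as whole text string."
    ),
    "ingestionConfiguration": {},
}
response = handle_chunking_request(example_body)
```

Since Unique creates one embedding per returned chunk, a production chunker would typically also bound the chunk size; that is left out here to keep the request and response shapes in focus.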