Agentic Metadata-Extraction

Overview

Traditional document metadata relies on file properties (author, creation date) or manual tagging, neither of which captures what a document actually says. In the financial services industry, where documents contain critical information such as client names, account numbers, document types, and regulatory classifications, automated metadata extraction is essential for effective document management and retrieval.

AI-Powered Metadata Extraction addresses this challenge by using Large Language Models (LLMs) to analyze document content and extract structured metadata based on user-defined schemas. Documents are tagged automatically with searchable metadata, which improves discoverability and supports downstream document workflows.

When you enable Metadata Extraction in your document processing workflow, our platform:

  • Analyzes Content: Reads the ingested document chunks to understand the document's content and context

  • Applies Schema: Uses your defined metadata schema to extract specific fields (e.g., author name, document type, dates)

  • Updates Metadata: Automatically populates the document's metadata fields for search and filtering

The result is a knowledge base with consistent, schema-conformant metadata that can be used for search, filtering, and automated workflows.


Who It's For

  • Admins who configure document processing workflows to ensure consistent metadata across document collections

  • Knowledge Managers who need to organize large document repositories with structured, searchable metadata

  • Compliance Teams who require specific document classifications and attributes for regulatory purposes

  • Users who benefit from improved search and filtering capabilities in their document workflows


How It Works

Backbone Components

  • Content Client: Retrieves document chunks for analysis using the Unique SDK

  • Metadata Extractor Handler: Orchestrates the extraction process and manages token limits

  • LLM Integration: Connects to Azure OpenAI for structured output generation

  • Schema Validator: Ensures extracted metadata conforms to the defined schema types


Example Use Cases

Document Classification

Policy Documents: "Classify documents by policy type and effective date"

  • Extracts document type (policy, procedure, guideline)

  • Identifies effective dates and review periods

  • Tags with relevant regulatory frameworks

Client Documents: "Extract client information from onboarding documents"

  • Identifies client names and account numbers

  • Extracts document submission dates

  • Tags with document category (KYC, AML, etc.)


Step-by-Step Guide

1. Enable Metadata Extraction

  1. In the folder of your choice, click Configure File Ingestion

  2. Locate the AI Metadata Extraction section in the configuration panel

  3. Toggle the feature ON to enable metadata extraction


2. Configure Language Model

In the metadata extraction configuration:

  1. Select your preferred Language Model from the available options:

    • GPT-4o (2024-08-06) - Recommended for complex schemas

    • GPT-4o (2024-11-20) - Latest model with improved accuracy

  2. Set the Max Input Tokens (1000-10000):

    • Lower values process faster but may miss content at the end of documents

    • Higher values capture more context but increase processing time

    • Recommended: Start with 5000 tokens and adjust based on your documents
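The trade-off above can be illustrated with a minimal sketch, assuming chunks are sent in document order until the token budget is used up. The function names and the whitespace-based token estimate are hypothetical stand-ins; the platform's actual tokenizer and truncation strategy are not described here.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def select_chunks_within_budget(chunks: List[str], max_input_tokens: int) -> List[str]:
    """Keep chunks in document order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_input_tokens:
            break  # remaining chunks (the end of the document) are not analyzed
        selected.append(chunk)
        used += cost
    return selected
```

Under this assumption, a field whose value appears only on the last pages of a long document (for example a signature date) would be missed with a low budget, which is why the limit may need raising for such documents.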

3. Define Metadata Schema

Create a JSON schema defining the metadata fields to extract. Each field requires:

  • type: Data type ("string", "number", "boolean")

  • description: Clear description to guide the LLM extraction

  • required: Whether the field must be extracted (true or false)

  • enum (optional): Restricts the extracted value to a fixed set of allowed strings, e.g. ["invoice", "contract", "other"]

Example Schema:

{
  "author_name": {
    "type": "string",
    "description": "Name of the document author or creator",
    "required": true
  },
  "publication_date": {
    "type": "string",
    "description": "Publication or creation date in YYYY-MM-DD format",
    "required": true
  },
  "document_type": {
    "type": "string",
    "description": "Type of document (invoice, contract, report, or other)",
    "required": true,
    "enum": ["invoice", "contract", "report", "other"]
  }
}
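Before saving a schema, it can help to check that every field definition has the properties listed above. The following is a minimal sketch of such a pre-check, not the platform's own validator; `check_schema` and `ALLOWED_TYPES` are hypothetical names, and the allowed types are taken from the list in this step.

```python
from typing import Any, Dict, List

# Types listed in this guide; a tuple is used so unhashable values compare safely.
ALLOWED_TYPES = ("string", "number", "boolean")


def check_schema(schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the schema looks well-formed."""
    problems: List[str] = []
    for name, spec in schema.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: definition must be an object")
            continue
        if spec.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: type must be one of {ALLOWED_TYPES}")
        description = spec.get("description")
        if not isinstance(description, str) or not description.strip():
            problems.append(f"{name}: description must be a non-empty string")
        if not isinstance(spec.get("required"), bool):
            problems.append(f"{name}: required must be true or false")
        enum = spec.get("enum")
        if enum is not None:
            is_string_list = isinstance(enum, list) and all(isinstance(v, str) for v in enum)
            if not is_string_list or not enum:
                problems.append(f"{name}: enum must be a non-empty list of strings")
    return problems
```

Running `check_schema` on the example schema after loading it with `json.loads` should return an empty list.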

Note: Metadata keys must not use any of the following reserved system keys:

  • url

  • key

  • mimeType

  • externalFileOwner

  • folderIdPath

  • companyId

  • contentId

  • folderId

  • title

If any of these keys are used, metadata for the corresponding field will not be generated.
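A schema can be screened for reserved keys before it is saved. This is a small sketch under the assumption that the comparison is an exact, case-sensitive match on top-level field names; `find_reserved_keys` is a hypothetical helper, not part of the platform.

```python
from typing import Any, Dict, List

# Reserved system keys as listed in the note above.
RESERVED_KEYS = frozenset({
    "url", "key", "mimeType", "externalFileOwner", "folderIdPath",
    "companyId", "contentId", "folderId", "title",
})


def find_reserved_keys(schema: Dict[str, Any]) -> List[str]:
    """Return the schema field names that collide with reserved system keys, sorted."""
    return sorted(name for name in schema if name in RESERVED_KEYS)
```

If the returned list is not empty, renaming those fields (for example `title` to `document_title`) avoids the silent loss of metadata for them.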

4. Upload Documents

Upload your documents through the standard Unique AI interface. The system will automatically:

  1. Process the document through standard ingestion (reading, chunking, embedding)

  2. Trigger metadata extraction upon ingestion completion

  3. Analyze document content using the configured LLM

  4. Extract metadata according to your schema

  5. Update the document's metadata fields

5. Verify Results

Review the extracted metadata on your documents:

  1. Navigate to the folder containing your uploaded documents

  2. Select a document to view its details

  3. Check the Metadata section for extracted fields

  4. Verify accuracy and completeness of extracted values

6. Re-run Extraction on Existing Files

Enabling metadata extraction does not automatically process files already in the folder. To backfill existing documents:

  1. Check "Extract metadata on existing files in this folder"

  2. Optionally check "Also extract on files in subfolders" to include the entire folder tree

  3. Click Save

Only fully ingested documents are targeted. New uploads are processed automatically and do not require this step.


Configuration Options

Language Models

Language Models may differ across companies and environments.

Requirements: All models must be Azure OpenAI deployments with Structured Output support (e.g., GPT-4o series). Because the model is selected through configuration, companies can adopt newer models as they become available by updating the configuration, without code changes.

Currently available models:

Model               | Best For                            | Token Limit | Accuracy
GPT-4o (2024-08-06) | Complex schemas, nuanced extraction | 128K        | High
GPT-4o (2024-11-20) | General use, latest improvements    | 128K        | Highest
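Structured Output support matters because the model is asked to answer in a fixed JSON shape derived from the metadata schema. The sketch below shows one plausible way to translate a field schema into a JSON Schema object; it is an assumption for illustration, not the platform's actual request format. `to_json_schema` is a hypothetical name. Strict structured-output modes generally expect every property to be listed as required, so optional fields are modeled here as nullable instead.

```python
from typing import Any, Dict


def to_json_schema(field_schema: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Translate the field schema into a JSON Schema object (illustrative only)."""
    properties: Dict[str, Any] = {}
    for name, spec in field_schema.items():
        required = spec.get("required", False)
        prop: Dict[str, Any] = {"description": spec["description"]}
        # Optional fields may come back as null instead of being omitted.
        prop["type"] = spec["type"] if required else [spec["type"], "null"]
        if "enum" in spec:
            values = list(spec["enum"])
            if not required:
                values.append(None)  # null must be an allowed value for optional enums
            prop["enum"] = values
        properties[name] = prop
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }
```

With this shape, an optional field that the model cannot find is returned as null, which matches the "null on an optional field" case handled during validation.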

Schema Field Types

Type    | Description       | Example Values
string  | Text values       | "John Smith", "2024-01-15"
number  | Numeric values    | 42, 3.14, 1000000
boolean | True/false values | true, false

Enum Constraints & Field Validation

After the LLM responds, every extracted field is validated against your schema before being saved. Fields that fail validation are dropped (not saved); a failed field never causes an error on the document, and the remaining valid fields are still saved.

Data type validation — each field's value is checked against its declared type:

If the LLM returns...                         | Outcome
Correct type                                  | Field saved
Wrong type (e.g. a string for a number field) | Field dropped with a warning
null on an optional field                     | Field skipped silently
A required field missing entirely             | Field dropped with a warning
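The outcomes in the table can be sketched as a small validation routine. This is an illustration of the documented behavior, not the platform's code; `validate_types` and `matches_type` are hypothetical names. Two cases the table does not cover are handled by assumption: an optional field that is missing entirely is skipped silently, and null on a required field is treated as a wrong type.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("metadata_extraction")


def matches_type(value: Any, declared: str) -> bool:
    """Check a JSON value against a declared schema type."""
    if declared == "string":
        return isinstance(value, str)
    if declared == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if declared == "boolean":
        return isinstance(value, bool)
    return False


def validate_types(extracted: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return only the fields that pass type validation; never raises on bad fields."""
    saved: Dict[str, Any] = {}
    for name, spec in schema.items():
        required = spec.get("required", False)
        if name not in extracted:
            if required:
                logger.warning("required field %r is missing; dropped", name)
            continue  # assumption: a missing optional field is skipped silently
        value = extracted[name]
        if value is None and not required:
            continue  # null on an optional field: skipped silently
        if not matches_type(value, spec["type"]):
            logger.warning("field %r has the wrong type; dropped", name)
            continue
        saved[name] = value
    return saved
```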

Enum constraints — add an optional enum property to restrict the LLM to a fixed set of allowed values:

{
  "document_type": {
    "type": "string",
    "description": "Type of document",
    "required": true,
    "enum": ["invoice", "contract", "report", "other"]
  }
}

Values outside the enum are dropped after extraction. For array fields, only the items within the allowed set are kept; if none remain, the field is dropped entirely.
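The enum rule, including the array case, can be sketched as follows. This is a minimal illustration of the behavior described above, assuming it runs on fields that already passed type validation; `apply_enum_constraints` is a hypothetical name.

```python
from typing import Any, Dict


def apply_enum_constraints(values: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Drop values outside a field's enum; filter list values item by item."""
    kept: Dict[str, Any] = {}
    for name, value in values.items():
        allowed = schema.get(name, {}).get("enum")
        if allowed is None:
            kept[name] = value  # no enum declared: nothing to enforce
        elif isinstance(value, list):
            items = [item for item in value if item in allowed]
            if items:  # if no item survives, the field is dropped entirely
                kept[name] = items
        elif value in allowed:
            kept[name] = value
    return kept
```

For example, with the `document_type` schema above, an extracted value of "memo" would be dropped, while "invoice" would be kept.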

Token Settings

Setting        | Description                                    | Default | Recommended Range
maxInputTokens | Maximum document length to process (in tokens) | 10000   | 1000-10000