Agentic Metadata-Extraction
Overview
Traditional document metadata relies on file properties (author, creation date) or manual tagging, which often fails to capture the semantic richness of document content. In the Financial Services Industry, where documents contain critical information like client names, account numbers, document types, and regulatory classifications, automated metadata extraction is essential for effective document management and retrieval.
AI-Powered Metadata Extraction addresses this challenge by using Large Language Models (LLMs) to intelligently analyze document content and extract structured metadata based on user-defined schemas. This approach ensures that documents are automatically tagged with meaningful, searchable metadata that enhances discoverability and enables sophisticated document workflows.
When you enable Metadata Extraction in your document processing workflow, our platform:
Analyzes Content: Reads the ingested document chunks to understand the document's content and context
Applies Schema: Uses your defined metadata schema to extract specific fields (e.g., author name, document type, dates)
Updates Metadata: Automatically populates the document's metadata fields for search and filtering
This intelligent approach ensures your knowledge base is enriched with accurate, consistent metadata that improves document discoverability and enables automated workflows.
Who It's For
Admins who configure document processing workflows to ensure consistent metadata across document collections
Knowledge Managers who need to organize large document repositories with structured, searchable metadata
Compliance Teams who require specific document classifications and attributes for regulatory purposes
Users who benefit from improved search and filtering capabilities in their document workflows
How It Works
Backbone Components
Content Client: Retrieves document chunks for analysis using the Unique SDK
Metadata Extractor Handler: Orchestrates the extraction process and manages token limits
LLM Integration: Connects to Azure OpenAI for structured output generation
Schema Validator: Ensures extracted metadata conforms to the defined schema types
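The way these four components hand work to each other can be pictured as a short pipeline. The following is only a minimal sketch under stated assumptions: `run_extraction`, its parameters, and the character cap are hypothetical names chosen for illustration, and the three callables stand in for the Content Client, the LLM Integration, and the Schema Validator rather than reproducing any Unique SDK or Azure OpenAI API.

```python
from typing import Any, Callable

Schema = dict[str, dict[str, Any]]


def run_extraction(
    content_id: str,
    schema: Schema,
    fetch_chunks: Callable[[str], list[str]],            # stand-in for the Content Client
    call_llm: Callable[[str, Schema], dict[str, Any]],   # stand-in for the LLM Integration
    validate: Callable[[dict[str, Any], Schema], dict[str, Any]],  # stand-in for the Schema Validator
    max_chars: int = 20_000,  # hypothetical cap; the real handler works with token limits
) -> dict[str, Any]:
    """Play the role of the handler: fetch, cap the input, extract, validate."""
    chunks = fetch_chunks(content_id)
    # Join the chunks in order and cut off anything beyond the cap.
    text = "\n".join(chunks)[:max_chars]
    raw = call_llm(text, schema)
    # Only what survives validation is returned for saving.
    return validate(raw, schema)
```

Passing the three collaborators in as callables keeps the sketch independent of any particular client library and makes each stage easy to replace in a test.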
Example Use Cases
Document Classification
Policy Documents: "Classify documents by policy type and effective date"
Extracts document type (policy, procedure, guideline)
Identifies effective dates and review periods
Tags with relevant regulatory frameworks
Client Documents: "Extract client information from onboarding documents"
Identifies client names and account numbers
Extracts document submission dates
Tags with document category (KYC, AML, etc.)
Step-by-Step Guide
1. Enable Metadata Extraction
Click Configure File Ingestion in the folder of your choice
Locate the AI Metadata Extraction section in the configuration panel
Toggle the feature ON to enable metadata extraction
2. Configure Language Model
In the metadata extraction configuration:
Select your preferred Language Model from the available options:
GPT-4o (2024-08-06) - Recommended for complex schemas
GPT-4o (2024-11-20) - Latest model with improved accuracy
Set the Max Input Tokens (1000-10000):
Lower values process faster but may miss content at the end of documents
Higher values capture more context but increase processing time
Recommended: Start with 5000 tokens and adjust based on your documents
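The trade-off behind Max Input Tokens can be illustrated with a small sketch that keeps leading chunks until a budget is used up. It rests on assumptions: the platform's actual tokenizer and truncation strategy are not described here, the estimate of roughly 4 characters per token is only a rule of thumb for English text, and `estimate_tokens` and `fit_to_budget` are hypothetical helper names.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: list[str], max_input_tokens: int) -> list[str]:
    """Keep leading chunks while the running estimate stays within the budget."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_input_tokens:
            break  # chunks from here on would never reach the model
        kept.append(chunk)
        used += cost
    return kept
```

With a low budget the loop stops early, which is why values near the end of a long document, such as a signature date, can go missing.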
3. Define Metadata Schema
Create a JSON schema defining the metadata fields to extract. Each field requires:
type: Data type ("string", "number", "boolean")
description: Clear description to guide the LLM extraction
required: Whether the field must be extracted (true or false)
enum (optional): Restricts the extracted value to a fixed set of allowed strings, e.g. ["invoice", "contract", "other"]
Example Schema:
{
  "author_name": {
    "type": "string",
    "description": "Name of the document author or creator",
    "required": true
  },
  "publication_date": {
    "type": "string",
    "description": "Publication or creation date in YYYY-MM-DD format",
    "required": true
  },
  "document_type": {
    "type": "string",
    "description": "Type of document; choose the closest match from the allowed values",
    "required": true,
    "enum": ["invoice", "contract", "report", "other"]
  }
}
Note: Metadata keys must not use any of the following reserved system keys:
url
key
mimeType
externalFileOwner
folderIdPath
companyId
contentId
folderId
title
If any of these keys are used, metadata for the corresponding field will not be generated.
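Before saving a schema, it can help to check it against the rules from this step. The sketch below is a hypothetical pre-flight check, not a platform feature: `lint_schema` is an illustrative name, the reserved keys are taken from the note above, and the allowed types are limited to the three listed in this step.

```python
# Reserved system keys as listed in the note above.
RESERVED_KEYS = {
    "url", "key", "mimeType", "externalFileOwner", "folderIdPath",
    "companyId", "contentId", "folderId", "title",
}
# Assumption: only the three types named in this step are accepted.
ALLOWED_TYPES = {"string", "number", "boolean"}


def lint_schema(schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for name, spec in schema.items():
        if name in RESERVED_KEYS:
            problems.append(f"{name}: reserved system key")
        if spec.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: unsupported type {spec.get('type')!r}")
        if not spec.get("description"):
            problems.append(f"{name}: missing description")
        if not isinstance(spec.get("required"), bool):
            problems.append(f"{name}: 'required' must be true or false")
        enum = spec.get("enum")
        if enum is not None and not (
            isinstance(enum, list) and enum and all(isinstance(v, str) for v in enum)
        ):
            problems.append(f"{name}: 'enum' must be a non-empty list of strings")
    return problems
```

Running the check on a draft schema gives a list of issues to fix before any document is uploaded.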
4. Upload Documents
Upload your documents through the standard Unique AI interface. The system will automatically:
Process the document through standard ingestion (reading, chunking, embedding)
Trigger metadata extraction upon ingestion completion
Analyze document content using the configured LLM
Extract metadata according to your schema
Update the document's metadata fields
5. Verify Results
Review the extracted metadata on your documents:
Navigate to the folder containing your uploaded documents
Select a document to view its details
Check the Metadata section for extracted fields
Verify accuracy and completeness of extracted values
6. Re-run Extraction on Existing Files
Enabling metadata extraction does not automatically process files already in the folder. To backfill existing documents:
Check "Extract metadata on existing files in this folder"
Optionally check "Also extract on files in subfolders" to include the entire folder tree
Click Save
Only fully ingested documents are targeted. New uploads are processed automatically and do not require this step.
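The selection rule for a backfill can be sketched as a simple filter. Everything here is hypothetical and for illustration only: the `Doc` record, its fields, and `backfill_targets` are invented names, and the platform's own data model is not shown in this guide.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    folder: str           # hypothetical folder path, e.g. "/kyc/2024"
    fully_ingested: bool


def backfill_targets(docs: list[Doc], folder: str, include_subfolders: bool = False) -> list[str]:
    """Names of documents a backfill would target under the rule described above."""
    base = folder.rstrip("/")
    prefix = base + "/"
    targets = []
    for doc in docs:
        if not doc.fully_ingested:
            continue  # only fully ingested documents are targeted
        in_folder = doc.folder.rstrip("/") == base
        in_subfolder = doc.folder.startswith(prefix)
        if in_folder or (include_subfolders and in_subfolder):
            targets.append(doc.name)
    return targets
```

Matching on `base + "/"` rather than on the bare folder name keeps a sibling such as "/kyc-old" from being treated as a subfolder of "/kyc".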
Configuration Options
Language Models
Language Models may differ across companies and environments.
Requirements: All models must be Azure OpenAI deployments with Structured Output support (e.g., GPT-4o series). Because the model is selected through configuration, companies can adopt newer models as they become available without requiring code changes.
Currently available models:
| Model | Best For | Token Limit | Accuracy |
|---|---|---|---|
| GPT-4o (2024-08-06) | Complex schemas, nuanced extraction | 128K | High |
| GPT-4o (2024-11-20) | General use, latest improvements | 128K | Highest |
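Structured Output support matters because the model is asked to answer in a fixed JSON shape rather than free text. As a rough illustration, the field definitions from a metadata schema could be translated into a standard JSON Schema object like the one built below. This is a sketch under assumptions: `to_json_schema` is a hypothetical helper, and the request format the platform actually sends to the model is not described in this guide.

```python
def to_json_schema(schema: dict) -> dict:
    """Translate metadata field definitions into a plain JSON Schema object."""
    properties = {}
    required = []
    for name, spec in schema.items():
        # "string", "number" and "boolean" are also valid JSON Schema type names.
        prop = {"type": spec["type"], "description": spec.get("description", "")}
        if "enum" in spec:
            prop["enum"] = list(spec["enum"])
        properties[name] = prop
        if spec.get("required"):
            required.append(name)
    return {
        "type": "object",
        "properties": properties,
        "required": required,
        "additionalProperties": False,  # no fields beyond the schema
    }
```

Some structured-output modes place extra restrictions on the schema they accept, so a real integration may need adjustments beyond this translation.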
Schema Field Types
| Type | Description | Example Values |
|---|---|---|
| string | Text values | |
| number | Numeric values | |
| boolean | True/false values | |
Enum Constraints & Field Validation
After the LLM responds, every extracted field is validated against your schema before being saved. A field that fails validation is dropped rather than saved; it never causes an error on the document.
Data type validation — each field's value is checked against its declared type:
| If the LLM returns... | Outcome |
|---|---|
| Correct type | Field saved |
| Wrong type (e.g. a string for a non-string field) | Field dropped with a warning |
| | Field skipped silently |
| A required field missing entirely | Field dropped with a warning |
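The type check can be sketched in a few lines. This is an illustration, not the platform's validator: `matches_type` and `validate_types` are hypothetical names, the mapping to Python types is an assumption, and the sketch treats an absent optional field as a silent skip, which is also an assumption.

```python
import logging

logger = logging.getLogger("metadata_validation")

# Assumed mapping from schema type names to Python types.
PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def matches_type(value, type_name: str) -> bool:
    # bool is a subclass of int in Python, so handle it first:
    # True or False must only ever count as a "boolean".
    if isinstance(value, bool):
        return type_name == "boolean"
    return isinstance(value, PY_TYPES[type_name])


def validate_types(raw: dict, schema: dict) -> dict:
    """Return only the fields whose values match their declared type."""
    saved = {}
    for name, spec in schema.items():
        if name not in raw:
            if spec.get("required"):
                logger.warning("required field %r is missing", name)
            continue  # absent optional fields are skipped without a warning
        if matches_type(raw[name], spec["type"]):
            saved[name] = raw[name]
        else:
            logger.warning("field %r has the wrong type; dropped", name)
    return saved
```

Dropping a field instead of raising keeps one bad value from blocking the rest of the document's metadata.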
Enum constraints — add an optional enum property to restrict the LLM to a fixed set of allowed values:
{
  "document_type": {
    "type": "string",
    "description": "Type of document",
    "required": true,
    "enum": ["invoice", "contract", "report", "other"]
  }
}
Values outside the enum are dropped after extraction. For array fields, only the items within the allowed set are kept; if none remain, the field is dropped entirely.
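The enum rule for single values and for arrays can be expressed as one small function. This is a minimal sketch of the behaviour described above; `apply_enum` is a hypothetical name, and returning `None` to signal "drop this field" is a convention chosen only for the example.

```python
def apply_enum(value, allowed: list[str]):
    """Return the value to save, or None when the field should be dropped."""
    if isinstance(value, list):
        # Array field: keep only the items that are in the allowed set.
        kept = [item for item in value if item in allowed]
        return kept or None  # nothing left means the whole field is dropped
    return value if value in allowed else None
```

For example, with the allowed values from the schema above, `apply_enum(["invoice", "memo"], allowed)` keeps only `"invoice"`, while `apply_enum("memo", allowed)` drops the field.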
Token Settings
| Setting | Description | Default | Recommended Range |
|---|---|---|---|
| Max Input Tokens | Maximum document length to process | 10000 | 1000-10000 |