How We Evaluate Models
Overview
At UniqueAI, we take model quality seriously—especially because our customers rely on our platform for accurate, trustworthy and compliant financial insights. This document provides a transparent and accessible look at how we evaluate and validate AI models—and where your specific use cases and benchmarks can plug into the process before models make their way into our product.
Our goal is simple: every model you interact with should be reliable, safe, and benchmarked for your financial workflows.
Why We Benchmark
AI models vary widely in accuracy, reasoning ability, reliability, and safety. In finance, even small mistakes can have large consequences. That’s why we use a structured benchmarking process that ensures:
High‑quality reasoning in financial contexts
Factual, non‑hallucinated outputs
Reliable tool integrations (e.g. MCP, SQL, analysis, document handling)
Compliance‑aligned behaviour
The process is designed to validate models thoroughly while keeping the focus on what matters most for our customers: correctness, consistency, and trust.
The Four‑Stage Evaluation Funnel
We use a funnel‑style approach—from broad testing to increasingly specialized real‑world validation.
Stage 1 — Baseline Capability Screening
We begin by reviewing each model's general performance on widely accepted, publicly reported open-source benchmarks.
These include:
MMLU for broad professional knowledge
GSM8K for mathematical reasoning (crucial for financial calculations)
HumanEval for code generation (useful for SQL, Python and data workflows)
Why it matters: These publicly available tests help us quickly identify models that meet a minimum standard of reasoning and accuracy before we invest in deeper testing.
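As a rough illustration of this screening step, the sketch below treats it as a simple threshold check over publicly reported scores. The threshold values, the candidate's figures, and the passes_baseline helper are invented for the example and are not our actual cut-offs.

```python
# A minimal sketch of Stage 1 screening: compare publicly reported benchmark
# scores against minimum thresholds before a model moves to deeper testing.
# All numbers below are illustrative placeholders.

MIN_SCORES = {"MMLU": 0.80, "GSM8K": 0.85, "HumanEval": 0.70}

def passes_baseline(reported_scores: dict[str, float]) -> bool:
    """A model advances only if every tracked benchmark meets its minimum."""
    return all(
        reported_scores.get(benchmark, 0.0) >= minimum
        for benchmark, minimum in MIN_SCORES.items()
    )

candidate = {"MMLU": 0.86, "GSM8K": 0.92, "HumanEval": 0.74}  # example figures
print(passes_baseline(candidate))  # True -> proceed to Stage 2
```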
Stage 2 — Functional Reliability Testing
Next, we verify that the model works smoothly across the features our customers rely on.
We check:
Use of tools within the platform (MCP, Internal Search, Web Search, code & report generation)
Ability to reference uploaded files correctly
Handling of large documents and long customer conversations
Stability and predictable behaviour, even with imperfect inputs
Why it matters: A model must do more than generate good text—it must function reliably in real customer workflows.
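As a minimal sketch of what one such functional check could look like, the example below assumes a hypothetical run_model helper that exposes the tool calls and answer produced for a prompt. The tool name "internal_search" and the stubbed responses are illustrative, not the platform's actual API.

```python
# A minimal sketch of a Stage 2 functional reliability check.

def run_model(prompt: str) -> dict:
    """Placeholder stub; in practice this would call the platform's API."""
    return {"tool_calls": ["internal_search"], "answer": "…", "error": None}

def uses_expected_tool(prompt: str, expected_tool: str) -> bool:
    """The model should invoke the expected tool and return without error."""
    result = run_model(prompt)
    return result["error"] is None and expected_tool in result["tool_calls"]

# A question about an uploaded file should trigger the internal search tool.
assert uses_expected_tool("Summarise the uploaded Q3 report.", "internal_search")

# A deliberately malformed prompt must still return a structured result.
assert "answer" in run_model("}}{{ truncated or garbled input")
```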
Stage 3 — Human-Run Validation on Finance-Specific Tasks
This stage focuses specifically on financial correctness: our in‑house experts evaluate the model against real‑world financial tasks.
The evaluation relies on a validated golden set of questions that standardizes results across runs. Each item is grounded in underlying financial data and source documents, with coverage spanning asset management as well as legal and compliance domains.
We test for:
Traceability of all numerical outputs and statements to the provided data or documents
Sound financial logic (e.g., accounting, ratio analysis, valuation principles)
Ability to process long reports, tables, and multi‑step financial enquiries
Clear, professional and compliant communication style
Consistent performance across the golden-set benchmark
Why it matters: Finance requires accuracy and accountability. Our human‑expert review is essential for verifying behaviour that automated tests can’t fully capture.
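To make the grounding concrete, here is one way a golden-set item could be structured so that every expected figure traces back to a source document. The field names, the example question, and the figures are invented for illustration and do not reflect our internal schema.

```python
# A minimal sketch of a golden-set item: each question carries its
# reviewer-validated answer, the documents it must be grounded in, and the
# key figures that the model's response has to reproduce.

from dataclasses import dataclass, field

@dataclass
class GoldenItem:
    question: str                     # the financial task posed to the model
    expected_answer: str              # reviewer-validated reference answer
    source_documents: list[str]       # files the answer must trace back to
    domain: str = "asset_management"  # e.g. also "legal_compliance"
    key_figures: dict[str, float] = field(default_factory=dict)  # numbers to verify

item = GoldenItem(
    question="What was the fund's expense ratio in FY2023?",
    expected_answer="0.45%, as stated in the annual report.",
    source_documents=["annual_report_2023.pdf"],
    key_figures={"expense_ratio": 0.0045},
)
```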
Stage 4 — Continuous Automated Regression Testing
Before any model is deployed—and throughout its lifecycle—we run automated, large-scale tests.
Our system:
Uses a curated dataset of high-quality financial Q&A (“Golden Dataset”)
Grades answers automatically using strict evaluation criteria
Flags unexpected behaviour for manual review
Tracks long-term performance trends
Why it matters: Model quality can drift over time as new data is introduced. Continuous monitoring ensures the experience you get remains stable, predictable and high-quality.
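As an illustration of the loop this stage runs, the sketch below grades answers against a small golden dataset, flags failures for manual review, and records a dated result that can feed trend tracking. The ask_model stub, the dataset item, and the simple "must contain" grader are assumptions made for the example; our real grading criteria are stricter.

```python
# A minimal sketch of an automated regression pass over a golden dataset.

import datetime

GOLDEN_DATASET = [
    {
        "question": "What was the fund's expense ratio in FY2023?",
        "must_contain": ["0.45%"],  # figures the answer is required to state
    },
]

def ask_model(question: str) -> str:
    """Placeholder stub; in practice this would query the deployed model."""
    return "The expense ratio in FY2023 was 0.45%."

def run_regression() -> list[dict]:
    """Grade each item, flag failures for review, and keep a dated record."""
    results = []
    for item in GOLDEN_DATASET:
        answer = ask_model(item["question"])
        passed = all(fragment in answer for fragment in item["must_contain"])
        if not passed:
            print(f"Flagged for manual review: {item['question']}")
        results.append({
            "question": item["question"],
            "passed": passed,
            "date": datetime.date.today().isoformat(),  # enables trend tracking
        })
    return results

print(run_regression())
```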
This is also where you, as a customer of Unique, become part of the process. Whenever you add new data to your knowledge base, adjust your system prompts, or want to validate a new use case, this stage gives you the framework to evaluate how your changes influence model behaviour. Because stages 1, 2, and 3 already take the heavy lifting off your shoulders, all you need to do here is test your use case on the models we propose: quickly, safely, and with clear performance insights.
To get started, you can explore and leverage our benchmarking framework here:
👉 Benchmark Documentation: Benchmarking
Where This Fits Into Your Experience
Our benchmarking pipeline directly impacts how you experience Unique AI:
Choosing the Right Models
We only deploy models that meet our quality and safety standards.
Consistent Data‑Driven Answers
By testing both reasoning and functional performance, we ensure models can:
Analyse financials accurately
Pull and reference the right data
Complete multi‑step tasks reliably
Confidence and Compliance
Every model used in our platform passes through strict factuality and safety checks before going live.
What “Good” Looks Like
These are the standards we expect from every model we deploy:
Performance & Reasoning
Strong results on industry benchmarks
Correct multi‑step logic
Reliable tool usage
Financial Accuracy
No hallucinated figures
Sound financial reasoning
Clear derivations and explanations
Operational Quality
Low error rate
Fast, predictable response times
Stable behaviour across updates
Continuous Improvement
We’re growing and refining this system as customers expand their use cases. Coming enhancements include:
Broader test datasets covering more financial domains
Better detection of subtle behavioural changes
More advanced financial judgement testing
Appendix
Author: @Jeremy Isnard