How We Evaluate Models

Overview

At UniqueAI, we take model quality seriously—especially because our customers rely on our platform for accurate, trustworthy and compliant financial insights. This document provides a transparent and accessible look at how we evaluate and validate AI models—and where your specific use cases and benchmarks can plug into the process before models make their way into our product.

Our goal is simple: every model you interact with should be reliable, safe, and benchmarked for your financial workflows.


Why We Benchmark

AI models vary widely in accuracy, reasoning ability, reliability, and safety. In finance, even small mistakes can have large consequences. That’s why we use a structured benchmarking process that ensures:

  • High‑quality reasoning in financial contexts

  • Factual, non‑hallucinated outputs

  • Reliable tool integrations (e.g., MCP, SQL, analysis, document handling)

  • Compliance‑aligned behaviour

The process is designed to validate models thoroughly while keeping the focus on what matters most for our customers: correctness, consistency, and trust.

 


The Four‑Stage Evaluation Funnel

We use a funnel‑style approach—from broad testing to increasingly specialized real‑world validation.

Stage 1 — Baseline Capability Screening

We begin by reviewing each model's publicly reported results on widely accepted open‑source benchmarks.

These include:

  • MMLU for broad professional knowledge

  • GSM8K for mathematical reasoning (crucial for financial calculations)

  • HumanEval for code generation (useful for SQL, Python and data workflows)

Why it matters: These publicly available tests help us quickly identify models that meet a minimum standard of reasoning and accuracy before we invest in deeper testing.
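As a rough illustration, this screening stage amounts to filtering candidates against minimum benchmark scores. The model names, scores, and thresholds below are purely illustrative placeholders, not actual UniqueAI cut‑offs:

```python
# Hypothetical screening step: keep only models whose publicly reported
# benchmark scores clear every minimum threshold. All names and numbers
# here are illustrative, not real cut-offs or results.

MIN_SCORES = {"mmlu": 0.80, "gsm8k": 0.85, "humaneval": 0.70}

def passes_screening(reported_scores: dict[str, float]) -> bool:
    """Return True if the model meets every minimum benchmark score."""
    return all(
        reported_scores.get(bench, 0.0) >= floor
        for bench, floor in MIN_SCORES.items()
    )

candidates = {
    "model-a": {"mmlu": 0.86, "gsm8k": 0.91, "humaneval": 0.78},
    "model-b": {"mmlu": 0.74, "gsm8k": 0.88, "humaneval": 0.81},
}
shortlist = [name for name, s in candidates.items() if passes_screening(s)]
print(shortlist)  # only model-a clears every threshold
```

Models that clear this bar move on to the functional and finance‑specific stages below.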

Stage 2 — Functional Reliability Testing

Next, we verify that the model works smoothly across the features our customers rely on.

We check:

  • Use of tools within the platform (MCP, Internal Search, Web Search, code & report generation)

  • Ability to reference uploaded files correctly

  • Handling of large documents and long customer conversations

  • Stability and predictable behaviour, even with imperfect inputs

Why it matters: A model must do more than generate good text—it must function reliably in real customer workflows.

Stage 3 — Human-Run Validation on Finance-Specific Tasks

This stage focuses specifically on financial correctness, where our in‑house experts evaluate the model against real‑world financial tasks.

The evaluation relies on a validated golden set of questions that standardizes results across runs. Each item is grounded in underlying financial data and source documents, with coverage spanning asset management and legal and compliance domains.

We test for:

  • Grounded outputs: every numerical value and statement traces directly to the provided data or documents

  • Sound financial logic (e.g., accounting, ratio analysis, valuation principles)

  • Ability to process long reports, tables, and multi‑step financial enquiries

  • Clear, professional and compliant communication style

  • Consistent performance across the golden-set benchmark

Why it matters: Finance requires accuracy and accountability. Our human‑expert review is essential for verifying behaviour that automated tests can’t fully capture.
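To make the grounding requirement concrete, the sketch below shows one possible shape for a golden‑set item and a naive traceability check: every number in an answer must appear somewhere in the grounding documents. The field names, schema, and example data are hypothetical, not UniqueAI's actual golden‑set format:

```python
# Illustrative golden-set item plus a naive traceability check:
# any number in the answer that appears in no source document is
# flagged. Schema and data are hypothetical examples.
import re
from dataclasses import dataclass

@dataclass
class GoldenItem:
    question: str
    source_documents: list[str]  # text of the grounding documents
    expected_answer: str

def untraceable_figures(answer: str, sources: list[str]) -> list[str]:
    """Return numbers in the answer found in no source document."""
    numbers = re.findall(r"\d+(?:\.\d+)?", answer)
    corpus = " ".join(sources)
    return [n for n in numbers if n not in corpus]

item = GoldenItem(
    question="What was Q3 net revenue?",
    source_documents=["Q3 report: net revenue was 41.2 million EUR."],
    expected_answer="Net revenue in Q3 was 41.2 million EUR.",
)
print(untraceable_figures("Q3 net revenue was 41.2 million EUR.", item.source_documents))  # []
print(untraceable_figures("Q3 net revenue was 43.0 million EUR.", item.source_documents))  # ['43.0']
```

A real grader would be far stricter (units, rounding, derived figures), which is exactly why this stage is run by human experts rather than string matching alone.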

Stage 4 — Continuous Automated Regression Testing

Before any model is deployed—and throughout its lifecycle—we run automated, large-scale tests.

Our system:

  • Uses a curated dataset of high-quality financial Q&A (“Golden Dataset”)

  • Grades answers automatically using strict evaluation criteria

  • Flags unexpected behaviour for manual review

  • Tracks long-term performance trends

Why it matters: Model quality can drift over time as knowledge bases grow, prompts change, and models are updated. Continuous monitoring ensures the experience you get remains stable, predictable and high-quality.
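The regression loop described above can be sketched in a few lines: grade each answer in the Golden Dataset against strict criteria, flag failures for manual review, and track the pass rate over time. The grading rule, dataset fields, and stub model below are illustrative assumptions, not UniqueAI's actual evaluation criteria:

```python
# Minimal regression-run sketch over a golden dataset. The keyword-based
# grading rule and the sample data are illustrative placeholders.

def grade(answer: str, expected_keywords: list[str]) -> bool:
    """Strict pass/fail: the answer must contain every expected keyword."""
    return all(kw.lower() in answer.lower() for kw in expected_keywords)

def regression_run(dataset, get_answer):
    """Grade every item; return the pass rate and items flagged for review."""
    flagged = []
    passed = 0
    for item in dataset:
        answer = get_answer(item["question"])
        if grade(answer, item["expected_keywords"]):
            passed += 1
        else:
            flagged.append(item["id"])
    return passed / len(dataset), flagged

dataset = [
    {"id": "q1", "question": "Define EBITDA.",
     "expected_keywords": ["earnings", "interest", "taxes"]},
    {"id": "q2", "question": "What is the current ratio?",
     "expected_keywords": ["current assets", "current liabilities"]},
]

# Stub model for demonstration; a real run would call the deployed model.
def stub_model(question: str) -> str:
    return "Earnings before interest, taxes, depreciation and amortisation."

rate, flagged = regression_run(dataset, stub_model)
print(rate, flagged)  # q2 fails the keyword check and is flagged
```

Flagged items are what would surface for manual review; the pass rate is what gets tracked as a long‑term trend.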

This is also where you, as a consumer of Unique, become part of the process. Whenever you add new data to your knowledge base, adjust your system prompts, or want to validate a new use case, this stage gives you the framework to evaluate how your changes influence model behaviour. Because stages 1, 2, and 3 already take the heavy lifting off your shoulders, all you need to do here is test your use case on our proposed models—quickly, safely, and with clear performance insights.

To get started, you can explore and leverage our benchmarking framework here:

👉 Benchmark Documentation: Benchmarking


Where This Fits Into Your Experience

Our benchmarking pipeline directly impacts how you experience Unique AI:

Choosing the Right Models

We only deploy models that meet our quality and safety standards.

Consistent Data‑Driven Answers

By testing both reasoning and functional performance, we ensure models can:

  • Analyse financials accurately

  • Pull and reference the right data

  • Complete multi‑step tasks reliably

Confidence and Compliance

Every model used in our platform passes through strict factuality and safety checks before going live.

 


What “Good” Looks Like

These are the standards we expect from every model we deploy:

Performance & Reasoning

  • Strong results on industry benchmarks

  • Correct multi‑step logic

  • Reliable tool usage

Financial Accuracy

  • No hallucinated figures

  • Sound financial reasoning

  • Clear derivations and explanations

Operational Quality

  • Low error rate

  • Fast, predictable response times

  • Stable behaviour across updates

 


Continuous Improvement

We’re growing and refining this system as customers expand their use cases. Coming enhancements include:

  • Broader test datasets covering more financial domains

  • Better detection of subtle behavioural changes

  • More advanced financial judgement testing

 


Appendix

Author

@Jeremy Isnard