Benchmarking
Benchmarking allows clients to automatically evaluate the accuracy of the Unique financeGPT chat solution by comparing system-generated answers with human-verified responses. Initially, the benchmark triggers specific queries in a defined context, and humans review the system’s responses to confirm their correctness. Once validated, these correct answers serve as references for future benchmark runs. The system then uses LLM calls to assist in judging how closely new answers match these reference answers, reducing the need for humans to review every response manually. However, each benchmark still relies on human oversight to ensure accuracy. This process helps detect any quality issues, particularly when new data is ingested. By regularly running the benchmark, users can ensure that the chat solution maintains the desired accuracy over time.
Goal
The goal of the benchmarking feature can include the following:
Test the quality of new Spaces, LLMs, prompts, etc.
Test the impact of new documents added to a scope
Evaluate quality score over time
First run of automatically generating benchmark answers
If you have not yet created a benchmarking set with typical questions and expected answers, you can let the benchmarking feature generate the answers for you automatically.
Step 1: Gather typical questions per space
You have to be a member of the space. If you are not assigned to a space included in the benchmark, the benchmark will not run.
Gather typical user questions for the individual spaces and add them to column B of the benchmarking Excel template, which should be named "Question". Add the exact name of the space to column A, which should be named "Assistant". Ensure the space names are correct; otherwise, the upload of the Excel file will trigger an error. You only need to fill out columns A ("Assistant") and B ("Question"); the naming of the remaining columns is filled in automatically.
Column C indicates whether there is an existing reference answer to compare with the new run. Since this is the first run, enter "No".
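If you prefer to assemble the template programmatically, the following minimal Python sketch shows one way to do it with pandas. The file name, space name, example questions, and the column C header are placeholders (the guide does not specify the column C header), so adjust them to your template.

```python
# Minimal sketch, assuming pandas and openpyxl are installed.
# File name, space name, questions, and the column C header are placeholders.
import pandas as pd

rows = [
    # Column A: exact space name, Column B: typical user question, Column C: "No" on the first run
    {"Assistant": "Annual Report Space", "Question": "What was the revenue growth in 2023?", "benchmark_available": "No"},
    {"Assistant": "Annual Report Space", "Question": "Summarize the key risks mentioned in the report.", "benchmark_available": "No"},
]

pd.DataFrame(rows).to_excel("benchmark_first_run.xlsx", index=False)
```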
Step 2: Upload the Excel file to the benchmarking section
Drag and drop the Excel file into the benchmarking section of the Unique FinanceGPT Platform.
You will see an "In Progress" tag next to the file name if the upload worked.
Step 3: Download the file with the automatically generated answers, then review and classify them
Once the questions have been processed, you will see a green tag indicating "ready" next to the file.
You can now download the file by clicking the download icon next to it.
When you open the file, you should see the generated answers in column J ("answer").
Manually review all the generated answers and indicate in column D ("correct_benchmark") whether the answer is as expected or not (e.g., no information found, incomplete information, incorrect information).
Add a “yes” if the answer is as expected
Add a “no” if the answer is not as expected
Tip: Including negative examples in your benchmarking set enables you to compare the quality of answers over time. Example: with GPT-3.5, 80% of the answers were correct ("yes"), while with GPT-4, 90% of the answers were correct.
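As an illustration of tracking this quality score, the following minimal Python sketch computes the share of correct answers from a reviewed file. The file name is a placeholder; the column name follows the description above.

```python
# Minimal sketch, assuming the reviewed file uses column D = "correct_benchmark"
# with "yes"/"no" values; the file name is a placeholder.
import pandas as pd

df = pd.read_excel("benchmark_first_run_reviewed.xlsx")
correct = df["correct_benchmark"].astype(str).str.strip().str.lower().eq("yes")

# This percentage can be tracked across runs, e.g. to compare GPT-3.5 vs. GPT-4.
print(f"{correct.mean():.0%} of {len(df)} answers were correct")
```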
Comparing benchmark answers with newly generated answers
Step 1: Upload benchmarking file
Make sure your benchmarking file includes the following information:
Question (Column B)
Assistant (Column A)
correct_benchmark (Column D) and answer_benchmark (Column E)
sources_used_benchmark (Column F): optional, if you want to compare the sources
Drag and drop the Excel file into the benchmarking section of the Unique FinanceGPT Platform.
You will see an "In Progress" tag next to the file name if the upload worked.
Step 2: Download the benchmarking file
Once the questions have been processed, you will see a green tag indicating "ready" next to the file.
You can now download the file by clicking the download icon next to it.
Step 3: Review flags within the file (incl. column descriptions)
It is recommended to FILTER the final_flag (column Z) for TRUE and manually evaluate how the new answers differ from the benchmark answers (see the filtering sketch below).
The columns with “flags” in their name perform an automated test using GPT to evaluate if the benchmark answer and the newly generated answer match.
FALSE: means the test did NOT find a significant deviation in the results
TRUE: means the test found a significant deviation in the results. These results should be checked manually by a human.
Column Z (final_flag) is a summary of all the tests, meaning that if a deviation was found in one of the tests (TRUE), the final flag will always be TRUE.
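A minimal Python sketch of this filtering step, assuming the downloaded file keeps the column names described in the list that follows; the file names are placeholders.

```python
# Minimal sketch, assuming column Z is named "final_flag"; file names are placeholders.
import pandas as pd

df = pd.read_excel("benchmark_comparison_run.xlsx")

# Excel booleans may load as bools or as "TRUE"/"FALSE" strings, so normalize first.
flagged = df[df["final_flag"].astype(str).str.upper() == "TRUE"]

print(f"{len(flagged)} of {len(df)} answers need manual review")
flagged.to_excel("benchmark_flagged_for_review.xlsx", index=False)
```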
Explanation of the columns:
Answer (Column J): automatically generated answer in the comparison run. These answers are compared to the benchmark answers (column E: answer_benchmark)
Sources (Column K): automatically generated list of the sources used in the comparison run. These are compared to the benchmark sources, if available (column F: sources_used_benchmark)
Modules (Column L): Coming soon - will indicate which module has been selected (e.g., search, follow-up, etc.)
Followup (Column M): Coming soon - will indicate if it is a follow-up question or not
ChatMessages (Column N): debug information that can be used to investigate a problem after one has been identified.
emb_text (Column O): coming soon (ignore for now). This field will contain the cosine similarity of the embeddings of the reference answer and generated answer. The closer that value is to 1, the larger the overlap between the answers.
emb_flag (Column P): coming soon (ignore for now). TRUE if the similarity is below the threshold of 0.92
contra_text (Column Q): Explanation of why the contra_flag was set to TRUE
contra_flag (Column R): TRUE if the two answers contradict each other or have a significantly different meaning.
ext_text (Column S): Explanation of why the ext_flag was set to TRUE
ext_flag (Column T): TRUE if the two answers differ in their extent (e.g., one includes significantly more or less information than the other)
halluzination_text (Column U): Explanation of why the halluzination_flag was set to TRUE
halluzination_flag (Column V): TRUE if the answer indicates hallucinations. This is tested by comparing the generated answer with the content of the referenced sources. If the answer contains any information that is not present in the sources, the hallucination flag is TRUE.
source_flag (Column W): TRUE, if the newly generated answer is missing at least one reference from the benchmark answer.
module_flag (Column X): coming soon (ignore for now)
relation_flag (Column Y): coming soon (ignore for now)
final_flag (Column Z): summary of all conducted tests (columns O-Y), meaning that if a deviation was found in one of the tests in columns O-Y (TRUE), the final flag is set to TRUE. It is recommended to filter final_flag for TRUE.
explanation (Column AA): Explanation of why the final_flag was set to TRUE
Step 4: Add a manual evaluation of flagged answers
We recommend adding an additional column AB containing the result of your human review. You can name it "human_review_answer_correct" and enter "yes" or "no" as the answer. Similar to column D (correct_benchmark), you evaluate here whether the new answers are correct.
For all answers with final_flag = TRUE, it is recommended to review them manually and evaluate whether the answer differs from the benchmark but is still correct. If it is correct, add a "yes" in the column "human_review_answer_correct".
Based on the automated tests, answers with final_flag = FALSE are very likely to be correct, so you can set these rows to "yes" after reviewing a few individual samples, as sketched below.
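A minimal Python sketch of this pre-filling step, assuming the column names above; the file names are placeholders, and whether to pre-fill unflagged rows automatically remains your judgment call.

```python
# Minimal sketch, assuming "final_flag" and the recommended review column name;
# file names are placeholders.
import pandas as pd

df = pd.read_excel("benchmark_comparison_run.xlsx")
deviating = df["final_flag"].astype(str).str.upper() == "TRUE"

# Pre-fill unflagged rows with "yes" (still spot-check a few of them);
# leave flagged rows empty for manual review.
df["human_review_answer_correct"] = ""
df.loc[~deviating, "human_review_answer_correct"] = "yes"
df.to_excel("benchmark_comparison_run_reviewed.xlsx", index=False)
```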
This is the first version of the automated tests. Please report to us if you notice that some flags tend to produce a lot of false positives, meaning the result is TRUE but the answers are correct and comparable.
Adopt benchmark set
After changing to a new version of prompts or LLMs for a space, you should also update your benchmarking set, as the answers are usually improved compared to the original benchmark set. Simply create a new file and copy columns J-N from your last run to columns E-I.
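A minimal openpyxl sketch of this copy step, assuming the layout described above (columns J-N of the last run become columns E-I of the new set); file names are placeholders, and you may still need to adjust the header names in row 1 to match the template.

```python
# Minimal sketch, assuming columns J-N (10-14) of the last run should be copied
# into columns E-I (5-9) of the new benchmark file; file names are placeholders.
from openpyxl import load_workbook

wb = load_workbook("benchmark_last_run.xlsx")
ws = wb.active

for row in range(2, ws.max_row + 1):      # row 1 holds the headers
    for offset in range(5):               # J->E, K->F, L->G, M->H, N->I
        ws.cell(row=row, column=5 + offset).value = ws.cell(row=row, column=10 + offset).value

wb.save("benchmark_new_set.xlsx")
```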
Reporting
If you experience any issues or have improvement ideas, please report them to enterprise-support@unique.ch.
Definition of benchmark metrics/scores
Whether a generated response is considered equivalent to the benchmark run is evaluated by combining numerous metrics. Even if only a single metric shows a possible anomaly, a deviation is signaled and noted for manual analysis. This section explains the different metrics in detail.
Embedding Comparison
This metric assesses the degree of similarity between the embeddings of the reference answer and the new benchmark answer. A high similarity score indicates substantial content overlap between the two answers. The threshold score for comparison is set at a value of 0.92. Should the similarity score fall below this threshold, it is deemed a considerable divergence between the new benchmark response and the reference answer. In this case, the system marks the result of this test as TRUE, otherwise as FALSE.
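As an illustration, a minimal Python sketch of such a cosine-similarity check; the embedding vectors are assumed to be computed elsewhere, and 0.92 is the threshold stated above.

```python
# Minimal sketch of the embedding comparison, assuming precomputed embedding vectors.
import numpy as np

def emb_flag(ref_emb: np.ndarray, new_emb: np.ndarray, threshold: float = 0.92) -> bool:
    """Flag (TRUE) if the cosine similarity of the two embeddings falls below the threshold."""
    similarity = float(np.dot(ref_emb, new_emb) / (np.linalg.norm(ref_emb) * np.linalg.norm(new_emb)))
    return similarity < threshold
```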
Contradiction Comparison
This metric evaluates the consistency between the reference response and the response from a new benchmark test by checking for contradictions. Both responses are submitted to a GPT model for analysis. If the model detects any contradictory statements between the two, it will return TRUE, indicating inconsistency. If no contradictions are found, it will return FALSE, confirming that the responses are consistent.
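The exact prompt and model used by the platform are internal; the following is only an illustrative sketch of such a GPT-based contradiction check using the OpenAI Python SDK, with a placeholder prompt and model name.

```python
# Illustrative sketch only - prompt wording and model name are assumptions, not the
# platform's actual implementation. Requires the OpenAI Python SDK and an API key.
from openai import OpenAI

client = OpenAI()

def contra_flag(reference_answer: str, new_answer: str) -> bool:
    """Return True if the model finds contradictory statements between the two answers."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": "Compare the two answers. Reply with exactly CONTRADICTION or CONSISTENT."},
            {"role": "user", "content": f"Answer A:\n{reference_answer}\n\nAnswer B:\n{new_answer}"},
        ],
    )
    return "CONTRADICTION" in response.choices[0].message.content.upper()
```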
Extent Comparison
This metric is designed to evaluate the comprehensiveness and overlap of the two answers, the reference and the new benchmark run, in relation to the benchmark question. The reference answer is assumed to contain the expected information. The objective is to ascertain whether one of the answers addresses the question more thoroughly than the other. The outcome is binary (TRUE/FALSE): if either the new benchmark answer or the reference answer provides a more comprehensive response to the question, the metric is set to TRUE. In this case, the two answers do not respond to the user's question to the same extent. Conversely, if both answers exhibit equal comprehensiveness in addressing the question, the response is FALSE.
Hallucination
The purpose of this metric is to determine if all information contained in the response is purely taken from the provided sources, meaning that the model is not hallucinating. This is done with a GPT-4 call that evaluates if the answer is either (a) fully, (b) partially, or (c) not at all supported by purely the content of the sources.
Fully supported: the generated answer is fully consistent with the sources. No additional information is contained in the answer that is not part of the sources
Partially supported: the output is consistent with the sources but contains some unsupported elements
No support: the information in the answer is not at all taken from the sources
If the generated answer is only partially or not at all supported by the provided sources, this indicates hallucination, and the metric is set to TRUE, otherwise to FALSE.
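A minimal sketch of the mapping from the judged support level to the flag, assuming the three categories described above; the GPT-4 call that classifies the answer is omitted here.

```python
# Minimal sketch: only "fully supported" answers pass; the judging call is omitted.
def halluzination_flag(support_level: str) -> bool:
    """TRUE if the answer is only partially or not at all supported by the sources."""
    mapping = {
        "fully supported": False,
        "partially supported": True,
        "no support": True,
    }
    return mapping[support_level.strip().lower()]
```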
Reference (Source) Comparison
This metric compares the reference sources from the reference and the new answer. The purpose is to analyze whether the same documents were used to generate the answer, which indicates consistent answer content. If all sources contained in the reference answer are also part of the new answer, this metric is FALSE; otherwise, it is TRUE.
Module Comparison
If an assistant contains multiple modules, a module selector chooses the most suitable module for a user input. The choice of module has a big impact on the answer structure and quality, as each module is optimized for a different use case (e.g., knowledge search or translation). Therefore, it is crucial that the module choice is consistent for the same user input. This metric compares the chosen module for the reference and benchmark run. If the modules overlap, the metric is FALSE; otherwise, it is TRUE.
Final Flag
The final assessment of whether a generated response is considered equivalent to the benchmark run is made by combining all of the above metrics. Only if all metrics are marked as FALSE is the new response considered equivalent to the reference response. If at least one metric is TRUE, the response is marked as potentially deviating and must be analyzed manually.
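Conceptually, this is a logical OR over the individual metric flags, as in the following minimal sketch; the flag names follow the column descriptions above.

```python
# Minimal sketch: the final flag is TRUE as soon as any single metric flags a deviation.
def final_flag(emb_flag: bool, contra_flag: bool, ext_flag: bool,
               halluzination_flag: bool, source_flag: bool, module_flag: bool) -> bool:
    return any([emb_flag, contra_flag, ext_flag, halluzination_flag, source_flag, module_flag])
```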
Error Codes
| Error Code | Description |
|---|---|
| Benchmark_01 | Skipping row because of missing data (question or assistant) |
| Benchmark_02 | Benchmark object of BenchmarkEntry not found |
| Benchmark_03 | Provided Assistant not found |
| Benchmark_04 | User message (question) not found after creation of message |
| Benchmark_05 | Assistant message (answer) not found after creation of message or not marked as completed (External Modules) |
| Benchmark_06 | Assistant message (answer) has no originalText for further processing |
| Benchmark_07 | Error while doing the benchmark of an entry |
| Benchmark_08 | Missing result of the comparison of a benchmark entry |
| Benchmark_09 | MessageCreate Failed - Could not create a new chat and send the message |
| Benchmark_10 | Error while validating the results of a benchmark entry |
| Benchmark_98 | Benchmarking: Run Aborted |
| Benchmark_99 | General Error |