Benchmarking

Overview

The benchmarking service of Unique is designed to evaluate and ensure the quality of responses from language models and virtual assistants. It allows users to:

  • Test Accuracy: Automatically generate and compare responses to a set of benchmark questions to assess accuracy and performance.

  • Monitor Consistency: Detect deviations or drifts in model behavior over time to maintain consistent output quality.

  • Refine and Improve: Utilize detailed metrics to pinpoint areas for enhancement and validate the impact of updates or changes to the system.

This tool is critical for organizations looking to optimize the effectiveness and reliability of their AI-powered solutions.

Generate your Benchmarking ground truth

If you have not yet created a benchmarking set of typical questions and expected answers, there are essentially two ways to do so. One very intuitive way (see Option 1) is to prompt the questions directly in the chat interface and rate the answers with the feedback option. The other way (see Option 2) is to generate the first answers with the benchmarking template in the benchmarking interface.

Option 1. Directly generate your benchmarking answers yourself

You can gather your benchmarking answers by prompting your questions directly into individual chat conversations and rating the answers.

Step 1: Gather typical questions per space and prompt them

  • Gather typical user questions for the individual spaces.

  • Prompt the questions, each in a new chat conversation, in a structured manner, one by one (this can also be done in batches via the benchmarking interface; please contact your CS representative for further information).

  • Check the answer in every single conversation and rate it:

    • Give a 👍 if the answer is satisfying. Leaving a comment is optional.

    • Give a 👎 if the answer is unsatisfying (no information found, incomplete information, incorrect information, etc.). In this case, please state in the free-text field what was missing for the answer to be correct and which source you would expect the module to choose.

Step 2: Pull Feedback and apply first improvements

  • Once all the questions are entered, go to the Feedback interface and pull the consolidated feedback (sortable by space within the Excel file).

  • Check the answers with your in-house specialist or the DS lead from Unique’s side.

  • Implement the first measures to improve the answers and re-run them.

Step 3: Enter the questions and answers in the benchmarking template

  • Pull a final version of the feedback output from the Feedback interface.

  • Sort by the relevant space.

  • Copy and paste the following information into the benchmarking Excel template (also include questions whose answers are still not satisfying even though measures are in place): Question (Column B), Assistant (Column C), Correct Benchmark (Column D), Answer (Column E), and Sources (Column F).
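As an illustration, the copy-and-paste step above can also be scripted with pandas. The file names and the column headers of the feedback export ("Rating", "Answer", "Sources") are assumptions here and must be matched to your actual export:

```python
import pandas as pd

# Hypothetical sample of a feedback export; in practice this would come from
# pd.read_excel("feedback_export.xlsx") on the file pulled from the Feedback interface.
feedback = pd.DataFrame({
    "Question": ["What is the notice period?", "Who approves travel?"],
    "Assistant": ["My Space", "Other Space"],
    "Rating": ["yes", "no"],
    "Answer": ["30 days.", "The line manager."],
    "Sources": ["hr_policy.pdf", "travel_policy.pdf"],
})

# Keep only the rows for the space you are benchmarking.
space = feedback[feedback["Assistant"] == "My Space"]

# Map the feedback columns onto the template layout (columns B-F).
template = space.rename(columns={"Rating": "Correct Benchmark"})[
    ["Question", "Assistant", "Correct Benchmark", "Answer", "Sources"]
]

# template.to_excel("benchmarking_template.xlsx", index=False)
```

Filtering per space before writing keeps one template file per space, which matches the sorting step above.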

Then continue with Benchmarking | Comparing benchmark answers with newly generated answers

Option 2. Automatically generating benchmark answers

You can also let the benchmarking service generate the answers for you automatically.

Step 1: Gather typical questions per space

  • You must be a member of the space. If you are not assigned to a space included in the benchmark, it will not run.

  • Gather typical user questions for the individual spaces and add them to column B of the benchmarking Excel template, which should be named “Question.”

  • Add the exact name of the space to column A, which should be named “Assistant.” Ensure the space names are correct; otherwise, uploading the Excel file will trigger an error.

  • You only need to fill out columns A (“Assistant”) and B (“Question”). The remaining columns will be named automatically.

  • Column C indicates whether there is an existing reference answer to compare with the new run. Since this is the first run, enter "No."
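A first-run template following the layout above can be sketched with pandas. The header used for Column C (“existing_benchmark”) is a hypothetical name, since only the headers of columns A and B are specified in this guide:

```python
import pandas as pd

# Example questions per space; space names must match the spaces exactly.
questions = [
    ("HR Space", "What is the notice period for employees?"),
    ("HR Space", "How many vacation days do I get?"),
]

# Column A = "Assistant", Column B = "Question".
template = pd.DataFrame(questions, columns=["Assistant", "Question"])

# Column C: "No" on the first run, because no reference answer exists yet.
# "existing_benchmark" is a hypothetical header name.
template["existing_benchmark"] = "No"

# template.to_excel("benchmark_run1.xlsx", index=False)
```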

image-20240717-134714.png

Step 2: Upload the Excel File to the Benchmarking

  • Drag and drop the Excel file in the benchmarking section of the Unique AI Platform

  • You will see an “In Progress” tag next to the file name if the upload worked.

Step 3: Download the file with the automatically generated answers, then review and classify it

image-20240130-074811.png

Once the questions have been processed, you will see a green “Ready” tag next to the file.

  • You can now download the file by clicking the download icon next to it.

  • When you open the file, you should see the generated answers in column J (“answer”).

  • Manually review all the generated answers and indicate in column D (“correct_benchmark”) whether the answer is as expected or not (e.g., no information found, incomplete information, incorrect information).

    • Add a “yes” if the answer is as expected

    • Add a “no” if the answer is not as expected

image-20240130-073926.png

Tip: Including negative examples in your benchmarking set enables you to compare answer quality over time. Example: with GPT-3.5, 80% of the answers were correct (“yes”); with GPT-4, 90% were correct.
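The comparison described in the tip above boils down to the share of “yes” entries in the correct_benchmark column, which can be computed per run, for example:

```python
import pandas as pd

# Stand-in for a downloaded benchmarking run; in practice read it with
# pd.read_excel(...) and use the "correct_benchmark" column (Column D).
run = pd.DataFrame({"correct_benchmark": ["yes", "yes", "no", "yes", "no"]})

# Share of answers classified as correct.
accuracy = (run["correct_benchmark"].str.lower() == "yes").mean()
print(f"{accuracy:.0%} of answers correct")
```

Tracking this number across runs (e.g., before and after a model change) gives the over-time comparison the tip describes.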

Compare Answers with Benchmark

Step 1: Upload benchmarking file

  • Make sure your benchmarking file includes the following information:

    • Question (Column B)

    • Assistant (Column C)

    • correct_benchmark (Column D) and answer_benchmark (Column E)

    • sources_used_benchmark (Column F): optional, if you want to compare the sources

image-20240130-073926.png
  • Drag and drop the Excel file in the benchmarking section of the Unique AI Platform

  • You will see an “In Progress” tag next to the file name if the upload worked.
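Since an upload fails on a malformed file, it can help to sanity-check the required columns beforehand. A minimal sketch, assuming the header names listed above:

```python
import pandas as pd

# Columns required for a comparison run, per the list above.
REQUIRED = ["Question", "Assistant", "correct_benchmark", "answer_benchmark"]

# Stand-in for pd.read_excel("benchmarking_file.xlsx").
df = pd.DataFrame(columns=REQUIRED + ["sources_used_benchmark"])

# Collect any required columns that are absent before uploading.
missing = [col for col in REQUIRED if col not in df.columns]
if missing:
    raise ValueError(f"Benchmarking file is missing columns: {missing}")
```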

Step 2: Download the benchmarking file

  • Once the questions have been processed, you will see a green “Ready” tag next to the file.

  • You can now download the file by clicking the download icon next to it.

Step 3: Review Flags within file (incl. column description)

image-20240201-063925.png

It is recommended to filter the final_flag (Column Z) for TRUE and manually evaluate how the new answers differ from the benchmark answers.

The columns with “flag” in their name contain the result of an automated GPT-based test that evaluates whether the benchmark answer and the newly generated answer match.

  • FALSE: means the test did NOT find a significant deviation in the results

  • TRUE: means the test found a significant deviation in the results. These results should be checked manually by a human.

Column Z (final_flag) is a summary of all the tests: if a deviation was found in any one of them (TRUE), the final flag will always be TRUE.
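The recommended review step, filtering the results for flagged rows, can be done with pandas on the downloaded file, for example:

```python
import pandas as pd

# Stand-in for the downloaded results file; in practice read it with
# pd.read_excel(...) and use the "final_flag" column (Column Z).
results = pd.DataFrame({
    "Question": ["Q1", "Q2", "Q3"],
    "final_flag": [False, True, False],
})

# Keep only rows where a test found a significant deviation;
# these are the rows a human should review.
to_review = results[results["final_flag"]]
print(to_review["Question"].tolist())
```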

Explanation of the columns:

  • Answer (Column J): the automatically generated answer from the comparison run. These answers are compared to the benchmark answers (Column E: answer_benchmark).

  • Sources (Column K): an automatically generated list of the sources used in the comparison run. These are compared to the benchmark sources, if available (Column F: sources_used_benchmark).

  • Modules (Column L): Coming soon - will indicate which module has been selected (e.g., search, follow-up, etc.)

  • Followup (Column M): Coming soon - will indicate if it is a follow-up question or not

  • ChatMessages (Column N): debug information that can be used to debug a problem after identifying one.

  • emb_text (Column O): coming soon (ignore for now). This field will contain the cosine similarity of the embeddings of the reference answer and generated answer. The closer that value is to 1, the larger the overlap between the answers.

  • emb_flag (Column P): coming soon (ignore for now). TRUE if the similarity is below the threshold of 0.92

  • contra_text (Column Q): Explanation of why the contra_flag was set to TRUE

  • contra_flag (Column R): TRUE if the two answers contradict each other or have significantly different meanings.

  • ext_text (Column S): Explanation of why the ext_flag was set to TRUE

  • ext_flag (Column T): TRUE if the two answers differ in their extent (e.g., one includes significantly more or less information than the other)

  • halluzination_text (Column U): Explanation of why the halluzination_flag was set to TRUE

  • halluzination_flag: (Column V): TRUE, if the answer indicates hallucinations. This is tested by comparing the generated answer with the content of the referenced sources. If the answer contains any information that is not present in the sources, the hallucinations flag is TRUE.

  • source_flag (Column W): TRUE, if the newly generated answer is missing at least one reference from the benchmark answer.

  • module_flag (Column X): coming soon (ignore for now)

  • relation_flag (Column Y): coming soon (ignore for now)

  • final_flag (Column Z): a summary of all conducted tests (columns O-Y); if a deviation was found in any of these tests (TRUE), the final flag is always set to TRUE. It is recommended to filter final_flag for TRUE.

  • explanation (Column AA): Explanation of why the final_flag was set to TRUE
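The summary logic described for Column Z can be sketched as a logical OR over the individual flag columns. The column names follow the list above; the "coming soon" flags are omitted here:

```python
import pandas as pd

# Flag columns from the list above that carry a TRUE/FALSE test result today.
FLAG_COLUMNS = ["contra_flag", "ext_flag", "halluzination_flag", "source_flag"]

# Stand-in for two rows of a downloaded results file.
df = pd.DataFrame({
    "contra_flag": [False, True],
    "ext_flag": [False, False],
    "halluzination_flag": [False, False],
    "source_flag": [False, False],
})

# final_flag is TRUE whenever any individual test flagged a deviation.
df["final_flag"] = df[FLAG_COLUMNS].any(axis=1)
print(df["final_flag"].tolist())
```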

© 2025 Unique AG. All rights reserved.