Benchmarking
Benchmarking allows clients to automatically evaluate the accuracy of the Unique financeGPT chat solution by comparing system-generated answers with human-verified responses. Initially, the benchmark triggers specific queries in a defined context, and humans review the system’s responses to confirm their correctness. Once validated, these correct answers serve as references for future benchmark runs. The system then uses LLM calls to assist in judging how closely new answers match these reference answers, reducing the need for humans to review every response manually. However, each benchmark still relies on human oversight to ensure accuracy. This process helps detect any quality issues, particularly when new data is ingested. By regularly running the benchmark, users can ensure that the chat solution maintains the desired accuracy over time.
Goal
The goal of the benchmarking feature can include the following:
Test the quality of new Spaces, LLMs, prompts, etc.
Test the impact of new documents added to a scope
Evaluate quality score over time
First run of automatically generating benchmark answers
If you have not yet created a benchmarking set with typical questions and expected answers, you can let the benchmarking feature generate the answers for you automatically.
Step 1: Gather typical questions per space
You have to be a member of the space. If you are not assigned to a space included in the benchmark, the benchmark will not run.
Gather typical user questions for the individual spaces and add them to column B of the benchmarking Excel template, which should be named "Question". Add the exact name of the space to column A, which should be named "Assistant". Ensure the space names are correct; otherwise, the upload of the Excel file will trigger an error. You only need to fill out columns A ("Assistant") and B ("Question"); the naming of the remaining columns is filled in automatically.
Column C indicates whether there is an existing reference answer to compare with the new run. Since this is the first run, enter "No".
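If you prefer to assemble the template programmatically, the following minimal Python sketch shows one way to do it with pandas. The file name, space name, example questions, and the column C header are placeholders (the guide does not specify the column C header), so adjust them to your template.

```python
# Minimal sketch, assuming pandas and openpyxl are installed.
# File name, space name, questions, and the column C header are placeholders.
import pandas as pd

rows = [
    # Column A: exact space name, Column B: typical user question, Column C: "No" on the first run
    {"Assistant": "Annual Report Space", "Question": "What was the revenue growth in 2023?", "benchmark_available": "No"},
    {"Assistant": "Annual Report Space", "Question": "Summarize the key risks mentioned in the report.", "benchmark_available": "No"},
]

pd.DataFrame(rows).to_excel("benchmark_first_run.xlsx", index=False)
```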
Step 2: Upload the Excel file to the benchmarking section
Drag and drop the Excel file into the benchmarking section of the Unique FinanceGPT Platform.
You will see an "In Progress" tag next to the file name if the upload worked.
Step 3: Download the file with the automatically generated answers, then review and classify them
Once the questions have been processed, you will see a green tag indicating "ready" next to the file.
You can now download the file by clicking the download icon next to it.
When you open the file, you should see the generated answers in column J ("answer").
Manually review all the generated answers and indicate in column D ("correct_benchmark") whether the answer is as expected or not (e.g., no information found, incomplete information, incorrect information).
Add a “yes” if the answer is as expected
Add a “no” if the answer is not as expected
Tip: Including negative examples in your benchmarking set enables you to compare the quality of answers over time. Example: with GPT-3.5, 80% of the answers were correct ("yes"), while with GPT-4, 90% of the answers were correct.
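As an illustration of tracking this quality score, the following minimal Python sketch computes the share of correct answers from a reviewed file. The file name is a placeholder; the column name follows the description above.

```python
# Minimal sketch, assuming the reviewed file uses column D = "correct_benchmark"
# with "yes"/"no" values; the file name is a placeholder.
import pandas as pd

df = pd.read_excel("benchmark_first_run_reviewed.xlsx")
correct = df["correct_benchmark"].astype(str).str.strip().str.lower().eq("yes")

# This percentage can be tracked across runs, e.g. to compare GPT-3.5 vs. GPT-4.
print(f"{correct.mean():.0%} of {len(df)} answers were correct")
```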
Comparing benchmark answers with newly generated answers
Step 1: Upload benchmarking file
Make sure your benchmarking file includes the following information:
Question (Column B)
Assistant (Column A)
correct_benchmark (Column D) and answer_benchmark (Column E)
sources_used_benchmark (Column F): optional, if you want to compare the sources
Drag and drop the Excel file into the benchmarking section of the Unique FinanceGPT Platform.
You will see an "In Progress" tag next to the file name if the upload worked.
Step 2: Download the benchmarking file
Once the questions have been processed, you will see a green tag indicating "ready" next to the file.
You can now download the file by clicking the download icon next to it.
Step 3: Review flags within the file (incl. column descriptions)
It is recommended to FILTER the final_flag (column Z) for TRUE and manually evaluate how the new answers differ from the benchmark answers (see the filtering sketch below).
The columns with “flags” in their name perform an automated test using GPT to evaluate if the benchmark answer and the newly generated answer match.
FALSE: means the test did NOT find a significant deviation in the results
TRUE: means the test found a significant deviation in the results. These results should be checked manually by a human.
Column Z (final_flag) is a summary of all the tests, meaning that if a deviation was found in one of the tests (TRUE), the final flag will always be TRUE.
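A minimal Python sketch of this filtering step, assuming the downloaded file keeps the column names described in the list that follows; the file names are placeholders.

```python
# Minimal sketch, assuming column Z is named "final_flag"; file names are placeholders.
import pandas as pd

df = pd.read_excel("benchmark_comparison_run.xlsx")

# Excel booleans may load as bools or as "TRUE"/"FALSE" strings, so normalize first.
flagged = df[df["final_flag"].astype(str).str.upper() == "TRUE"]

print(f"{len(flagged)} of {len(df)} answers need manual review")
flagged.to_excel("benchmark_flagged_for_review.xlsx", index=False)
```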
Explanation of the columns:
Answer (Column J): automatically generated answer in the comparison run. These answers are compared to the benchmark answers (column E: answer_benchmark)
Sources (Column K): automatically generated list of the sources used in the comparison run. These are compared to the benchmark sources, if available (column F: sources_used_benchmark)
Modules (Column L): Coming soon - will indicate which module has been selected (e.g., search, follow-up, etc.)
Followup (Column M): Coming soon - will indicate if it is a follow-up question or not
ChatMessages (Column N): debug information that can be used to investigate a problem after one has been identified.
emb_text (Column O): coming soon (ignore for now). This field will contain the cosine similarity of the embeddings of the reference answer and generated answer. The closer that value is to 1, the larger the overlap between the answers.
emb_flag (Column P): coming soon (ignore for now). TRUE if the similarity is below the threshold of 0.92
contra_text (Column Q): Explanation of why the contra_flag was set to TRUE
contra_flag (Column R): TRUE if the two answers contradict each other or have a significantly different meaning.
ext_text (Column S): Explanation of why the ext_flag was set to TRUE
ext_flag (Column T): TRUE if the two answers differ in their extent (e.g., one includes significantly more or less information than the other)
halluzination_text (Column U): Explanation of why the halluzination_flag was set to TRUE
halluzination_flag (Column V): TRUE if the answer indicates hallucinations. This is tested by comparing the generated answer with the content of the referenced sources. If the answer contains any information that is not present in the sources, the hallucination flag is TRUE.
source_flag (Column W): TRUE, if the newly generated answer is missing at least one reference from the benchmark answer.
module_flag (Column X): coming soon (ignore for now)
relation_flag (Column Y): coming soon (ignore for now)
final_flag (Column Z): summary of all conducted tests (columns O-Y), meaning that if a deviation was found in one of the tests in columns O-Y (TRUE), the final flag is set to TRUE. It is recommended to filter final_flag for TRUE.
explanation (Column AA): Explanation of why the final_flag was set to TRUE
Step 4: Add a manual evaluation of flagged answers
We recommend adding an additional column AB containing the result of your human review. You can name it "human_review_answer_correct" and enter "yes" or "no" as the answer. Similar to column D (correct_benchmark), you evaluate here whether the new answers are correct.
For all answers with final_flag = TRUE, it is recommended to review them manually and evaluate whether the answer differs from the benchmark but is still correct. If it is correct, add a "yes" in the column "human_review_answer_correct".
Based on the automated tests, answers with final_flag = FALSE are very likely to be correct, so you can set these rows to "yes" after reviewing a few individual samples, as sketched below.
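A minimal Python sketch of this pre-filling step, assuming the column names above; the file names are placeholders, and whether to pre-fill unflagged rows automatically remains your judgment call.

```python
# Minimal sketch, assuming "final_flag" and the recommended review column name;
# file names are placeholders.
import pandas as pd

df = pd.read_excel("benchmark_comparison_run.xlsx")
deviating = df["final_flag"].astype(str).str.upper() == "TRUE"

# Pre-fill unflagged rows with "yes" (still spot-check a few of them);
# leave flagged rows empty for manual review.
df["human_review_answer_correct"] = ""
df.loc[~deviating, "human_review_answer_correct"] = "yes"
df.to_excel("benchmark_comparison_run_reviewed.xlsx", index=False)
```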
This is the first version of the automated tests. Please report to us if you notice that some flags tend to produce a lot of false positives, meaning the result is TRUE but the answers are correct and comparable.
Adopt benchmark set
After changing to a new version of prompts or LLMs for a space, you should also update your benchmarking set, as the answers are usually improved compared to the original benchmark set. Simply create a new file and copy columns J-N from your last run to columns E-I.
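A minimal openpyxl sketch of this copy step, assuming the layout described above (columns J-N of the last run become columns E-I of the new set); file names are placeholders, and you may still need to adjust the header names in row 1 to match the template.

```python
# Minimal sketch, assuming columns J-N (10-14) of the last run should be copied
# into columns E-I (5-9) of the new benchmark file; file names are placeholders.
from openpyxl import load_workbook

wb = load_workbook("benchmark_last_run.xlsx")
ws = wb.active

for row in range(2, ws.max_row + 1):      # row 1 holds the headers
    for offset in range(5):               # J->E, K->F, L->G, M->H, N->I
        ws.cell(row=row, column=5 + offset).value = ws.cell(row=row, column=10 + offset).value

wb.save("benchmark_new_set.xlsx")
```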
Reporting
If you experience any issues or have improvement ideas, please report them to enterprise-support@unique.ch.
Definition of benchmark metrics/scores
Whether a generated response is considered equivalent to the benchmark run is evaluated by combining numerous metrics. Even if only a single metric shows a possible anomaly, a deviation is signaled and noted for manual analysis. This section explains the different metrics in detail.
Embedding Comparison
This metric assesses the degree of similarity between the embeddings of the reference answer and the new benchmark answer. A high similarity score indicates substantial content overlap between the two answers. The threshold score for comparison is set at a value of 0.92. Should the similarity score fall below this threshold, it is deemed a considerable divergence between the new benchmark response and the reference answer. In this case, the system marks the result of this test as TRUE, otherwise as FALSE.
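As an illustration, a minimal Python sketch of such a cosine-similarity check; the embedding vectors are assumed to be computed elsewhere, and 0.92 is the threshold stated above.

```python
# Minimal sketch of the embedding comparison, assuming precomputed embedding vectors.
import numpy as np

def emb_flag(ref_emb: np.ndarray, new_emb: np.ndarray, threshold: float = 0.92) -> bool:
    """Flag (TRUE) if the cosine similarity of the two embeddings falls below the threshold."""
    similarity = float(np.dot(ref_emb, new_emb) / (np.linalg.norm(ref_emb) * np.linalg.norm(new_emb)))
    return similarity < threshold
```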
Contradiction Comparison
This metric evaluates the consistency between the reference response and the response from a new benchmark test by checking for contradictions. Both responses are submitted to a GPT model for analysis. If the model detects any contradictory statements between the two, it will return TRUE, indicating inconsistency. If no contradictions are found, it will return FALSE, confirming that the responses are consistent.
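The exact prompt and model used by the platform are internal; the following is only an illustrative sketch of such a GPT-based contradiction check using the OpenAI Python SDK, with a placeholder prompt and model name.

```python
# Illustrative sketch only - prompt wording and model name are assumptions, not the
# platform's actual implementation. Requires the OpenAI Python SDK and an API key.
from openai import OpenAI

client = OpenAI()

def contra_flag(reference_answer: str, new_answer: str) -> bool:
    """Return True if the model finds contradictory statements between the two answers."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": "Compare the two answers. Reply with exactly CONTRADICTION or CONSISTENT."},
            {"role": "user", "content": f"Answer A:\n{reference_answer}\n\nAnswer B:\n{new_answer}"},
        ],
    )
    return "CONTRADICTION" in response.choices[0].message.content.upper()
```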
Extent Comparison
This metric is designed to evaluate the comprehensiveness and overlap of the two answers, the reference and the new benchmark run, in relation to the benchmark question. The reference answer is assumed to contain the expected information. The objective is to ascertain whether one of the answers addresses the question more thoroughly than the other. The outcome is binary (TRUE/FALSE): if either the new benchmark answer or the reference answer provides a more comprehensive response to the question, the metric is set to TRUE. In this case, the two answers do not respond to the user's question to the same extent. Conversely, if both answers exhibit equal comprehensiveness in addressing the question, the response is FALSE.
Hallucination
The purpose of this metric is to determine if all information contained in the response is purely taken from the provided sources, meaning that the model is not hallucinating. This is done with a GPT-4 call that evaluates if the answer is either (a) fully, (b) partially, or (c) not at all supported by purely the content of the sources.
Fully supported: the generated answer is fully consistent with the sources. No additional information is contained in the answer that is not part of the sources
Partially supported: the output is consistent with the sources but contains some unsupported elements
No support: the information in the answer is not at all taken from the sources
If the generated answer is only partially or not at all supported by the provided sources, this indicates hallucination, and the metric is set to TRUE, otherwise to FALSE.
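A minimal sketch of the mapping from the judged support level to the flag, assuming the three categories described above; the GPT-4 call that classifies the answer is omitted here.

```python
# Minimal sketch: only "fully supported" answers pass; the judging call is omitted.
def halluzination_flag(support_level: str) -> bool:
    """TRUE if the answer is only partially or not at all supported by the sources."""
    mapping = {
        "fully supported": False,
        "partially supported": True,
        "no support": True,
    }
    return mapping[support_level.strip().lower()]
```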
Reference (Source) Comparison
This metric compares the reference sources from the reference and the new answer. The purpose is to analyze whether the same documents were used to generate the answer, which indicates consistent answer content. If all sources contained in the reference answer are also part of the new answer, this metric is FALSE; otherwise, it is TRUE.
Module Comparison
If an assistant contains multiple modules, a module selector chooses the most suitable module for a user input. The choice of module has a big impact on the answer structure and quality, as each module is optimized for a different use case (e.g., knowledge search or translation). Therefore, it is crucial that the module choice is consistent for the same user input. This metric compares the chosen module for the reference and benchmark run. If the modules overlap, the metric is FALSE; otherwise, it is TRUE.
Final Flag
The final assessment of whether a generated response is considered equivalent to the benchmark run is made by combining all of the above metrics. Only if all metrics are marked as FALSE is the new response considered equivalent to the reference response. If at least one metric is TRUE, the response is marked as potentially deviating and must be analyzed manually.
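Conceptually, this is a logical OR over the individual metric flags, as in the following minimal sketch; the flag names follow the column descriptions above.

```python
# Minimal sketch: the final flag is TRUE as soon as any single metric flags a deviation.
def final_flag(emb_flag: bool, contra_flag: bool, ext_flag: bool,
               halluzination_flag: bool, source_flag: bool, module_flag: bool) -> bool:
    return any([emb_flag, contra_flag, ext_flag, halluzination_flag, source_flag, module_flag])
```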
Error Codes
| Error Code | Description |
|---|---|
| Benchmark_01 | Skipping row because of missing data (question or assistant) |
| Benchmark_02 | Benchmark object of BenchmarkEntry not found |
| Benchmark_03 | Provided Assistant not found |
| Benchmark_04 | User message (question) not found after creation of message |
| Benchmark_05 | Assistant message (answer) not found after creation of message or not marked as completed (External Modules) |
| Benchmark_06 | Assistant message (answer) has no originalText for further processing |
| Benchmark_07 | Error while doing the benchmark of an entry |
| Benchmark_08 | Missing result of the comparison of a benchmark entry |
| Benchmark_09 | MessageCreate Failed - Could not create a new chat and send the message |
| Benchmark_10 | Error while validating the results of a benchmark entry |
| Benchmark_98 | Benchmarking: Run Aborted |
| Benchmark_99 | General Error |