RAG Assessment and Improvement
This report summarizes the assessment approach we implemented to evaluate our current information retrieval setup. We start by introducing the evaluation metrics used, then explain how we generated the evaluation dataset, and finally discuss the results of our baseline approach, semantic search, compared against alternative methods proposed to improve retrieval performance.
1. Evaluation Metrics for Information Retrieval
There are different metrics that can be used to assess the performance of Information Retrieval (IR) systems. In this report, we will restrict ourselves to two metrics:
Recall
Normalized Discounted Cumulative Gain (NDCG)
If you are interested in gaining more knowledge about the most popular evaluation metrics for IR systems, please refer to this article.
1.1. Recall
Recall measures the system's ability to retrieve all relevant documents from a given dataset. It quantifies the proportion of relevant documents retrieved by the system out of the total number of relevant documents available. In simpler terms, recall assesses how well the system avoids missing relevant information. A high recall indicates that the system retrieves a large portion of the relevant documents, while a low recall suggests that the system may be overlooking important information. When used carefully, recall is easy to implement and can provide valuable information about the performance of an IR system. However, its main disadvantage is being an order-unaware metric: it cannot account for the position or relevance level of the retrieved items.
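As an illustration, here is a minimal sketch of how Recall@k can be computed, assuming each evaluation question comes with a set of relevant chunk IDs and a ranked list of retrieved chunk IDs (the function and variable names are ours, not part of any SDK):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear among the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)
```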
1.2. NDCG
NDCG (Normalized Discounted Cumulative Gain) is a metric commonly used in Information Retrieval systems to evaluate the quality of ranked search results. It takes into account both relevance and the position of relevant documents in the ranked list. NDCG assigns higher scores to relevant documents appearing higher in the list and discounts the relevance score based on the document's position. The metric is normalized to allow comparison across different search queries and systems, making it a valuable measure for assessing the effectiveness of retrieval algorithms in providing relevant and highly-ranked search results. However, this metric is harder to interpret than recall.
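To make the metric concrete, here is a small sketch of one common NDCG@k formulation, assuming graded relevance scores (0 to 4, matching the ranking scale used later in this report). This is our illustration, not a prescribed implementation:

```python
import math

def dcg(relevances: list[float]) -> float:
    # Discount each relevance score by the log of its 1-based rank position.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg_at_k(retrieved_relevances: list[float], all_relevances: list[float], k: int) -> float:
    """NDCG@k: DCG of the system's ranking, normalized by the DCG of the ideal ranking
    obtained by sorting all known relevant chunks by decreasing relevance."""
    ideal_dcg = dcg(sorted(all_relevances, reverse=True)[:k])
    return dcg(retrieved_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```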
2. Constructing Evaluation Dataset
Now that we know which metrics will be used to assess and compare our IR algorithms, we need a dataset on which we can run our tests. As you already know, obtaining such a dataset is difficult, especially when working with private customer data. To solve this issue, we need to be creative and rely on our best friend, LLMs, particularly the GPT-4 family from OpenAI. In this section, we explain the steps used to construct our evaluation dataset.
2.1. Question Generation
To start the dataset construction journey, we first need to generate questions. This can be done by feeding an OpenAI completion model a chunk as context and asking it to generate questions about it. To achieve this, we used the following prompt:
Welcome to your new and specialized position as a Query Generation Wizard. Your role involves crafting questions that are used to extract information from an extensive database of banking documents using semantic search. The documents in the database contain information about various banking products, services, and regulations.
You possess the unique ability to empathize and adopt the perspectives of individuals from different departments (HR, relationship manager,...).
As part of your query generation process, you will receive a header of a document and a piece of text from that document. Your task is to thoroughly contemplate these excerpts and conceive questions or queries that a human with interest in the subject matter might pose, ensuring that the answers to these questions can likely be found within the provided text segment.
You should generate 10 questions: 3 should be open questions, 3 should be specific questions and 4 should be instruction questions. Instruction questions are questions where you ask in first person how to do something, or if you are allowed to do something. These are 2 examples:
"As a banker, am I allowed to… ?"
"What documents do I need to… ?"
You should not use general references like "What documents do I need to submit to demonstrate compliance with this directive?" or "What are the requirements for compliance with this directive?".
Instead, you should use specific references like "What documents do I need to submit to demonstrate compliance with the deposit guarantee schemes directive?".
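For illustration, here is a hedged sketch of how such a prompt could be sent to an OpenAI chat completion model; the exact model name, message layout, and temperature are assumptions on our side, not documented settings from this report:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION_GENERATION_PROMPT = "..."  # the full prompt shown above

def generate_questions(chunk: str) -> str:
    """Ask the completion model for questions about a single chunk."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any GPT-4 family model could be used here
        messages=[
            {"role": "system", "content": QUESTION_GENERATION_PROMPT},
            {"role": "user", "content": chunk},  # assumption: the chunk is passed as the user message
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```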
We provide below an example of questions generated from a specific chunk:
Example of a chunk (Not all the text of the chunk is included in this example!)
<|document|>finma rs 2023 01 20221207 LGL.pdf<|/document|> <|info|>Annex 1: FINMA Circ.23 1 "Operational risks and resilience - banks" 7 December 2022 1 January 2024 former FINMA Circ.08 21 "Operational risk - banks", dated<|/info|>
The recovery time objective (RTO) is the time within which an application, system and/or process must be recovered. The recovery point objective (RPO) is the maximum tolerable period during which data is lost.
…..
Operational resilience refers to the institution's ability to restore its critical functions in case of a disruption within the tolerance for disruption. That is to say, the institution's ability to identify threats and possible failures, to protect itself from them and to respond to them, to restore normal business operations in the event of disruptions and to learn from them, so as to minimise the impact of disruptions on the provision of critical functions. An operationally resilient institution has designed its operating model in such a way that it is less exposed to the risk of disruptions in relation to its critical functions. Operational resilience thus reduces not only the residual risks of disruptions, but also the inherent risk of disruptions occurring. Effective operational risk management helps strengthen the institution's operational resilience.
What is the definition of the Recovery Time Objective (RTO) in the context of operational risks for banks?
How does the Business Continuity Plan (BCP) contribute to a bank's operational resilience?
Can you explain the difference between crisis situations and incidents as described in the FINMA circular?
How should a bank determine its Recovery Point Objective (RPO)?
What are the key components of a Disaster Recovery Plan (DRP) according to the FINMA circular?
In what ways do critical functions impact the financial market according to the FINMA circular?
As a bank's risk manager, how do I establish the tolerance for disruption for each critical function?
What steps should I take to ensure that our bank's Business Continuity Plan (BCP) is up to date and effective?
Am I required to have separate plans for business continuity and disaster recovery, or can they be integrated?
How do I assess whether our bank is operationally resilient as per the guidelines in the FINMA circular?

As you can see, the questions are quite meaningful and are similar to legal questions that a banker might ask. Moving forward, we assume that the generated questions are of good quality and we are ready for the next step.
2.2. Grouping Chunks
So far, we have a set of questions related to each chunk. But what if a question requires multiple chunks to be answered? We need a way to assign to each question the potential set of chunks containing the answer. To achieve this, we start with the following assumption:
Similar chunks should produce similar questions!
Consider the case where we have a set of fund fact sheet documents. Users will ask more or less the same questions, only changing the fund name, so in theory questions asked about funds should be highly similar. To construct our evaluation dataset, we can therefore proceed as follows (a sketch of the grouping step follows the list):
We compute the embeddings of each query
We compute the pairwise similarity of the queries
We only keep query pairs whose similarity exceeds a chosen threshold (e.g. sim > 0.9), based on the histogram of pairwise similarities between the generated queries
We create sets of highly similar questions
We recover the set of chunks that were used to generate each set of similar questions
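Below is a minimal sketch of the grouping step, assuming the query embeddings have already been computed; the greedy grouping strategy and the function name are our assumptions, since the report does not specify the exact algorithm:

```python
import numpy as np

def group_similar_queries(embeddings: np.ndarray, threshold: float = 0.9) -> list[set[int]]:
    """Group query indices whose pairwise cosine similarity exceeds the threshold."""
    # Normalize the embeddings so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    groups: list[set[int]] = []
    seen: set[int] = set()
    for i in range(len(sim)):
        if i in seen:
            continue
        # All queries whose similarity with query i exceeds the threshold form one group.
        members = {j for j in range(len(sim)) if sim[i, j] > threshold}
        if len(members) > 1:
            groups.append(members)
            seen |= members
    return groups
```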
Using this approach, we end up keeping 53% of the originally generated queries. Now, we have two remaining problems:
Our assumption does not always hold
If a query has multiple relevant chunks, we need a way to rank them so we can use the NDCG metric.
Keep reading to find out how these issues are solved.
2.3. Ranking Chunks
As you may have already noticed, when things get hairy, our go-to move is to unleash the mighty force of ChatGPT, our problem-solving pal! Ranking chunks becomes as easy as using a prompt to ask GPT-4 to get the task done for us. Below, you can see the prompt used for this purpose:
You are a helping assistant in a company. You are asked to rank the relevance of a chunk of text with respect to a given question. The relevance score should reflect how well the chunk of text answers the question. For example if the chunk of text doesn't answer the question at all, it should be ranked as 0. If the chunk of text answers the question perfectly, it should be ranked as 4.
You can use the following scale to rank the relevance of the chunks:
0: The chunk of text does not answer the question at all.
1: The chunk of text answers the question partially.
2: The chunk of text answers the question but is not very clear.
3: The chunk of text answers the question and is clear.
4: The chunk of text answers the question perfectly.
The output should be in the following format: {"rank" : score}
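As an illustration, here is a sketch of how the ranking prompt could be applied to a question-chunk pair; the user-message format and the strict JSON parsing are assumptions on our side:

```python
import json
from openai import OpenAI

client = OpenAI()

RANKING_PROMPT = "..."  # the ranking prompt shown above

def rank_chunk(question: str, chunk: str) -> int:
    """Ask the model for a 0-4 relevance score and parse the {"rank": score} output."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: the report only states that GPT-4 was used
        messages=[
            {"role": "system", "content": RANKING_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nChunk: {chunk}"},
        ],
        temperature=0,
    )
    # Assumes the model follows the requested {"rank": score} output format.
    return int(json.loads(response.choices[0].message.content)["rank"])
```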
Now, you can let the model label your dataset and go for your lunch break. The histogram below shows that most of the chunks are irrelevant. This is probably where our assumption failed. Thus, we simply drop the irrelevant question-chunk pairs and are left with 47% of the initially generated questions.
With this dataset, we are ready to start evaluating our IR algorithms.
2.4. Critique
In this section, we present a workaround solution for acquiring an assessment dataset. Although this approach minimizes human intervention, it places sole responsibility on the completion model, which may fail in some cases to generate human-like queries or accurately rank relevant chunks. Consequently, our current setup falls short of the ideal. An optimal approach would entail human involvement, either through manual labeling of the dataset or thorough review of the generated one to ensure precision and relevance.
3. Evaluation
In this section, we start by evaluating the IR systems currently available through the Unique SDK and then propose a reranking method that has the potential to improve retrieval performance.
3.1. Semantic Search vs Combined Search
In this paragraph, we compare the performance of the two IR methods available with the Unique SDK:
Semantic Search (aka Vector)
Combined Search
Semantic search uses the context embeddings generated with OpenAI's 'text-embedding-ada-002' model to compute the similarity between documents and users' queries. The combined search adds Full Text Search to the mix when computing query-document similarity; Full Text Search relies on keyword co-occurrence to determine relevant chunks. The figure below shows the evaluation results on the assessment dataset we built in Section 2.
First, we notice that both metrics improve with the number of retrieved documents (Top k). This is expected: the more documents are retrieved, the more likely the relevant chunks are to be included in the retrieved set. Next, we notice that the combined search improves both Recall and NDCG regardless of Top k.
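The Unique SDK's exact combination logic is not detailed in this report, but reciprocal rank fusion is one common way to merge a semantic ranking with a full-text ranking; the sketch below is purely illustrative and not the SDK's implementation:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into a single list using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            # Chunks ranked high in any list accumulate a larger score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. combined = reciprocal_rank_fusion([semantic_ranking, full_text_ranking])
```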
3.2. Reranking
After comparing the existing IR systems available through the Unique SDK, we're diving into reranking with cross-encoders to see if we can spice up our current setup. We tested three different models (a reranking sketch follows the list):
cross-encoder/ms-marco-MiniLM-L-12-v2 (Reranker 1 - Monolingual EN)
mixedbread-ai/mxbai-rerank-xsmall-v1 (Reranker 2 - Monolingual EN)
cross-encoder/msmarco-MiniLM-L12-en-de-v1 (Reranker 3 - Bilingual EN-DE)
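As referenced above, here is a minimal sketch of cross-encoder reranking with the sentence-transformers library; the choice of model and the top_k cut-off are placeholders, and the function names are ours:

```python
from sentence_transformers import CrossEncoder

# Assumption: any of the three models listed above can be plugged in here.
reranker = CrossEncoder("cross-encoder/msmarco-MiniLM-L12-en-de-v1")

def rerank(query: str, chunks: list[str], top_k: int = 20) -> list[str]:
    """Score every (query, chunk) pair with the cross-encoder and re-sort the chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```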
Here are the Recall and NDCG results after applying the rerankers to the retrieved documents through Semantic and Combined searches, respectively.
3.3. Combination of Reranking
In the preceding section, we observed that reranking the combined search with cross-encoder/msmarco-MiniLM-L12-en-de-v1 yielded the best results. Here, we investigate whether the reranking predominantly benefits the semantic or the full text search. To discern this, we rerank the semantic and full text results independently before combining the search outcomes. The figure below illustrates the various permutations.
The results indicate that using the reranker first on the semantic search and then combining it with the full-text search worked best. However, when we compare it to our earlier experiments, we still find that combining the searches before using the reranker is the better method.
3.4. Rerankers' Response Time
Rerankers introduce a delay that needs to be taken into consideration. In the figure below, we compare the delays for different hardware architectures (CPU vs MPS) and different models.
We can see that CPU is almost 10 times slower than MPS. Moreover, both ms-marco-MiniLM models run at the same speed, while mxbai-rerank-xsmall-v1 is 1.3 to 2.6 times slower than the other rerankers.
In the following plot, we compare the impact of increasing the number of chunks to be reranked (Top k) on the delay. To keep the plot readable, we limit ourselves to the best performing reranker, 'msmarco-MiniLM-L12-en-de-v1'.
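For reference, here is a simple way such timing measurements could be reproduced with sentence-transformers; the devices, query, and chunk list are placeholders, and this naive sketch measures only the scoring pass, not model loading:

```python
import time
from sentence_transformers import CrossEncoder

def time_reranker(model_name: str, device: str, query: str, chunks: list[str]) -> float:
    """Measure how long a single reranking pass takes on the given device."""
    model = CrossEncoder(model_name, device=device)  # e.g. "cpu" or "mps"
    start = time.perf_counter()
    model.predict([(query, chunk) for chunk in chunks])
    return time.perf_counter() - start

# e.g. compare time_reranker("cross-encoder/msmarco-MiniLM-L12-en-de-v1", "cpu", query, chunks)
# with the same call on "mps", for chunk lists of size 10, 20, 30, 40 (Top k).
```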
4. Results
Our findings are summarized in the table below. Initially, we observe that the combined search followed by reranking with a bilingual model surpasses all other approaches in terms of recall and NDCG. This is particularly crucial given the chatbot model's limited input token size, emphasizing the need to prioritize relevant chunks in ranking.
| Metric | Method \ Top k | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| Recall | Semantic (Baseline) | 0.65 | 0.745 | 0.79 | 0.815 | 0.825 |
| | Combined | 0.705 (+8.46%) | 0.8 (+7.38%) | 0.835 (+5.70%) | 0.86 (+5.52%) | 0.865 (+4.85%) |
| | Semantic + Reranker 1 | 0.665 (+2.31%) | 0.705 (-5.37%) | 0.725 (-8.23%) | 0.755 (-7.36%) | - |
| | Semantic + Reranker 2 | 0.685 (+5.38%) | 0.735 (-1.34%) | 0.76 (-3.80%) | 0.78 (-4.29%) | - |
| | Semantic + Reranker 3 | 0.735 (+13.08%) | 0.785 (+5.37%) | 0.805 (+1.90%) | 0.820 (+0.61%) | - |
| | Combined + Reranker 1 | 0.7 (+7.69%) | 0.755 (+1.34%) | 0.775 (-1.90%) | 0.805 (-1.23%) | - |
| | Combined + Reranker 2 | 0.725 (+11.54%) | 0.785 (+5.37%) | 0.805 (+1.90%) | 0.835 (+2.45%) | - |
| | Combined + Reranker 3 | 0.775 (+19.23%) | 0.825 (+10.74%) | 0.855 (+8.23%) | 0.865 (+6.13%) | - |
| NDCG | Semantic (Baseline) | 0.445 | 0.465 | 0.475 | 0.48 | - |
| | Combined | 0.47 (+5.62%) | 0.495 (+6.45%) | 0.5 (+5.26%) | 0.505 (+5.21%) | 0.51 (+5.15%) |
| | Semantic + Reranker 1 | 0.505 (+13.48%) | 0.515 (+10.75%) | 0.52 (+9.47%) | 0.525 (+9.38%) | 0.535 (+10.31%) |
| | Semantic + Reranker 2 | 0.555 (+24.72%) | 0.565 (+21.51%) | 0.57 (+20.00%) | 0.575 (+19.79%) | 0.58 (+19.59%) |
| | Semantic + Reranker 3 | 0.570 (+28.09%) | 0.580 (+24.73%) | 0.585 (+23.16%) | 0.585 (+21.87%) | 0.590 (+21.65%) |
| | Combined + Reranker 1 | 0.525 (+17.98%) | 0.54 (+16.13%) | 0.545 (+14.74%) | 0.55 (+14.58%) | 0.56 (+15.46%) |
| | Combined + Reranker 2 | 0.585 (+31.46%) | 0.6 (+29.03%) | 0.605 (+27.37%) | 0.61 (+27.08%) | 0.615 (+26.80%) |
| | Combined + Reranker 3 | 0.595 (+33.71%) | 0.605 (+30.11%) | 0.615 (+29.47%) | 0.615 (+28.13%) | 0.615 (+26.80%) |
5. Discussion
Our study shows that Combined Search is a very good retrieval approach (Recall@20: 80%). Moreover, when we add a reranking step, we gain 2.5 percentage points in Recall@20. Most importantly, the NDCG@20 score jumps from 49.5% to 60.5%, meaning we are 11 percentage points better at pushing the relevant chunks to the top of the retrieved documents than with the combined search alone. The delay introduced by reranking is negligible if the computation is performed on an instance with an available GPU (or MPS), but this comes at the cost of keeping a GPU instance running at all times.
Our IR systems failed to accurately retrieve the relevant chunks for some questions, such as:
Questions in English generated from a German chunk (67% of the unidentified relevant chunks).
Questions referencing some directives
Due to the generative approach we used to construct our dataset, we may in some cases have non-meaningful questions whose answers cannot be found and that humans would not ask. However, this study provides a qualitative comparison of different techniques and shows that:
Combined Search is consistently good
Cross-encoder rerankers are a good option to improve RAG
Rerankers can even be fine-tuned to continuously improve the user experience.
Author: @Fabian Schläpfer
© 2024 Unique AG. All rights reserved.