...
Chunk Relevancy Sort improves RAG results by re-sorting the retrieved document chunks according to their relevancy to the user query. This approach uses a language model to evaluate relevancy and can also be configured as a filter so that only highly relevant chunks are considered. It is particularly useful when no additional infrastructure can be acquired and rate limits are of no concern. In addition, large language models support a wide array of languages. However, this extra step increases latency and adds token costs to every generated response.
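A minimal sketch of this idea, assuming the OpenAI Python SDK; the model name, prompt wording, 0-10 score scale, and the min_score threshold are illustrative assumptions, not part of this document:

```python
from openai import OpenAI

client = OpenAI()

def relevancy_score(query: str, chunk: str) -> float:
    """Ask the language model to rate a chunk's relevancy to the query (0-10)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model could be used
        messages=[
            {"role": "system",
             "content": "Rate the relevancy of the document chunk to the user "
                        "query on a scale from 0 (irrelevant) to 10 (highly "
                        "relevant). Answer with the number only."},
            {"role": "user", "content": f"Query: {query}\n\nChunk: {chunk}"},
        ],
    )
    return float(response.choices[0].message.content.strip())

def sort_and_filter(query: str, chunks: list[str], min_score: float = 7.0) -> list[str]:
    """Re-sort chunks by LLM-judged relevancy; keep only the highly relevant ones."""
    scored = [(relevancy_score(query, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored if score >= min_score]
```

Note that this issues one model call per chunk, which is why the approach assumes rate limits are not a concern.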
...
Enhances accuracy by focusing on relevant data.
Suitable for scenarios where slightly increased latency is acceptable for better results.
Does not require additional infrastructure.
Supports a wide array of languages.
...
Improving RAG with Reranker
The Reranker improves the accuracy of the generated response by reranking the retrieved chunks based on a predicted relevancy score. This method uses a dedicated, pre-trained model to predict the relevancy score for each document chunk. It is particularly useful when rate limits are a concern, or when only a model with smaller throughput is available. However, this additional step increases latency and incurs extra infrastructure costs. Also, the pre-trained models are language-specific and usually support only a few languages, e.g., English and German.
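A minimal sketch of such a reranker, assuming the sentence-transformers library; the cross-encoder model name is one common (English-focused) choice and not prescribed by this document:

```python
from sentence_transformers import CrossEncoder

# Pre-trained cross-encoder that scores (query, chunk) pairs for relevancy.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, chunk) pair and return the top_k highest-scoring chunks."""
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

Because the model runs locally (or on self-hosted infrastructure), no LLM calls are spent on scoring, which is why this option helps when rate limits are tight.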
Performance Estimation:
Reranker with GPT-4 or GPT-4 Omni
~7,000 input tokens of context to create the response
~2 sec for search
~3-5 sec for reranking
Chat will start streaming after ~7 seconds.
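The figures above can be combined into a rough time-to-first-token budget; the sketch below simply restates that arithmetic (attributing any remaining gap to prompt processing is an assumption):

```python
# Rough time-to-first-token budget from the estimates above (illustrative only).
search_s = 2.0           # vector search
rerank_s = (3.0, 5.0)    # reranking, lower/upper bound
ttft_low = search_s + rerank_s[0]    # 5 s
ttft_high = search_s + rerank_s[1]   # 7 s
# Any remainder before streaming begins is assumed to be prompt processing.
print(f"streaming starts after roughly {ttft_low:.0f}-{ttft_high:.0f} s")
```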
...
Effective for managing rate limits in high-volume scenarios.
Provides the slowest response time but can significantly improve result accuracy.
Supports only a few languages depending on the chosen model.
...
Customer Scenarios
Processing allowed only in Switzerland (CH)
...