RAG Improvement Options: Document Search Module


Motivation

Retrieval-Augmented Generation (RAG) systems often struggle with accurately identifying and prioritizing the most relevant information from large document collections. This can lead to irrelevant or low-quality content being included in the generated responses. Two potential approaches to address this issue are:

  1. Implementing reranking algorithms to refine the initial retrieval results and better align them with the user's query.

  2. Increasing the context window size to allow for more comprehensive analysis of retrieved documents.

The challenge lies in determining which approach, or combination of approaches, will most effectively improve the accuracy and relevance of RAG-generated responses while maintaining computational efficiency.

This documentation describes the various options for improving RAG performance in different customer configurations. It provides detailed information on token usage, latency, and cost implications, depending on whether customers have a Provisioned Throughput Unit (PTU) and which models they have available.

Improving RAG with More Input Tokens

Providing more input tokens to the GPT model can significantly enhance RAG results by allowing the model to process a larger context. This approach is ideal when comprehensive document retrieval and analysis are required.
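
A minimal sketch of this approach, assuming the search step returns chunks best-first and that tokens are counted with a library such as tiktoken; the function names and the 30k budget are illustrative, not the module's actual implementation:

```python
# Minimal sketch: pack retrieved chunks into the prompt until a ~30k-token
# budget is spent. Assumes chunks arrive pre-sorted by search relevance.
import tiktoken

ENCODING = tiktoken.encoding_for_model("gpt-4")  # cl100k_base tokenizer

def build_context(chunks: list[str], budget_tokens: int = 30_000) -> str:
    """Concatenate chunks in search order until the token budget is reached."""
    parts: list[str] = []
    used = 0
    for chunk in chunks:
        cost = len(ENCODING.encode(chunk))
        if used + cost > budget_tokens:
            break  # budget exhausted; remaining chunks are dropped
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts)
```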

Performance Estimation:

  • GPT-4 32K or GPT-4 Omni

    • 30'000 input tokens for context to create response

    • Chat will start streaming after ~2 seconds.

      • 2 sec for search

Key Points:

  • Larger token inputs provide better context, resulting in more accurate RAG outcomes.

  • Ideal for customers who require detailed and extensive document analysis.


Improving RAG with Chunk Relevancy Sort

Chunk Relevancy Sort improves RAG results by re-sorting the retrieved document chunks based on their relevancy to the user query. This approach uses a language model to evaluate relevancy and can also be configured as a filter that keeps only highly relevant chunks. It is particularly useful when no additional infrastructure can be acquired and rate limits are of no concern. However, this additional step increases latency and adds token costs to every generated response.
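
A minimal sketch of the idea, assuming an OpenAI-compatible chat client; the prompt wording, the 0-10 scale, the model name, and the score threshold are illustrative assumptions, not the module's actual implementation:

```python
# Minimal sketch: have a small, cheap model score each chunk's relevancy to
# the query, then sort (and optionally filter) the chunks by that score.
from openai import OpenAI

client = OpenAI()

def score_chunk(query: str, chunk: str, model: str = "gpt-3.5-turbo") -> float:
    """Ask the scoring model to rate a chunk's relevancy to the query."""
    prompt = (
        "Rate the relevance of the document to the question on a scale "
        "from 0 (irrelevant) to 10 (highly relevant).\n\n"
        f"Question: {query}\n\n"
        f"Document: {chunk}\n\n"
        "Answer with a single number."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return float(reply.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparseable answer: treat the chunk as irrelevant

def relevancy_sort(query: str, chunks: list[str], min_score: float = 7.0) -> list[str]:
    """Sort chunks best-first; drop anything below the relevancy threshold."""
    scored = [(score_chunk(query, c), c) for c in chunks]
    return [c for s, c in sorted(scored, reverse=True) if s >= min_score]
```

Scoring every retrieved chunk this way is what drives the ~100'000 relevancy-sort input tokens and the extra ~2 seconds quoted below.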

Performance Estimation:

  • GPT-4 32K + GPT-3.5 or

  • GPT-4 Omni + GPT-4 Omni-Mini

    • 7'000 input tokens for context to create response

    • 100'000 input tokens for Chunk Relevancy Sort

    • Chat will start streaming after ~4 seconds.

      • 2 sec for search

      • 2 sec for chunk sorting

Key Points:

  • Enhances accuracy by focusing on relevant data.

  • Suitable for scenarios where slightly increased latency is acceptable for better results.

  • Does not require additional infrastructure.


Improving RAG with Reranker

The Reranker improves the accuracy of the generated response by reranking the retrieved chunks based on a predicted relevancy score. This method uses a dedicated, pre-trained model to predict the relevancy score for each document chunk. It is particularly useful when rate limits are a concern, or when only a model with smaller throughput is available. However, this additional step increases latency and comes with additional infrastructure costs.
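
The documentation does not name the deployed reranker model; as an illustration, the sketch below uses a publicly available cross-encoder from the sentence-transformers library to predict per-chunk relevancy scores:

```python
# Minimal sketch of a dedicated reranker. The concrete model is a stand-in;
# any pre-trained cross-encoder that scores (query, chunk) pairs would do.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 10) -> list[str]:
    """Score every (query, chunk) pair and keep the top_k best chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

Because the reranker runs on dedicated infrastructure rather than through the LLM, it adds fixed infrastructure cost instead of token cost, which matches the $300-per-month figure used in the cost tables.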

Performance Estimation:

  • Reranker with GPT-4 or GPT-4 Omni

    • 7'000 input tokens for context to create response

    • Chat will start streaming after ~5-7 seconds.

      • 2 sec for search

      • 3-5 sec for reranking

Key Points:

  • Effective for managing rate limits in high-volume scenarios.

  • Provides the slowest response time but can significantly improve result accuracy.


Customer Scenarios

Processing allowed only in CH

Available Models:

  • GPT-4

  • GPT-4 32K

  • GPT-3.5

Options by priority:

  1. Chunk Relevancy Sort:

    • Evaluate relevancy with GPT-3.5 (optionally keep only highly relevant chunks)

    • Generate answer with GPT-4 and at least 7k tokens context window

  2. Reranker:

    • Generate answer with GPT-4 and at least 7k tokens context window

  3. More Input Tokens (most expensive option):

    • Generate answer with GPT-4 32K and at least 30k tokens context window

Processing allowed in Europe

Available Models:

  • GPT-4 Omni

  • GPT-4 Omni-Mini

Options by priority:

  1. More Input Tokens:

    • Generate answer with GPT-4 Omni and at least 30k tokens context window

  2. Chunk Relevancy Sort:

    • Evaluate relevancy with GPT-4 Omni-Mini (optionally keep only highly relevant chunks)

    • Generate answer with GPT-4 Omni and at least 7k tokens context window

  3. Reranker:

    • Generate answer with GPT-4 Omni and at least 7k tokens context window

PTU Purchased

Available Models:

  • GPT-4 Omni

  • GPT-4 Omni-Mini

Options by priority:

  1. More Input Tokens (be wary of rate limits):

    • Generate answer with GPT-4 Omni and at least 30k tokens context window

  2. Reranker (adds $300 per month in infrastructure costs):

    • Generate answer with GPT-4 Omni and at least 7k tokens context window

  3. Chunk Relevancy Sort (only if smaller model with more throughput like GPT-4 Omni-Mini is available):

    • Evaluate relevancy with GPT-4 Omni-Mini (optionally keep only highly relevant chunks)

    • Generate answer with GPT-4 Omni and at least 7k tokens context window
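
The three scenario playbooks above can be captured as ordered fallback lists. The snippet below is a hypothetical configuration sketch; the keys, option names, and model identifiers are illustrative, not the module's actual configuration format:

```python
# Hypothetical mapping of customer scenario -> RAG options in priority
# order, mirroring the lists above. Structure and names are illustrative.
SCENARIOS = {
    "ch_only": [  # processing allowed only in CH
        {"option": "chunk_relevancy_sort", "scorer": "gpt-3.5",
         "answer_model": "gpt-4", "context_tokens": 7_000},
        {"option": "reranker", "answer_model": "gpt-4", "context_tokens": 7_000},
        {"option": "more_input_tokens", "answer_model": "gpt-4-32k",
         "context_tokens": 30_000},
    ],
    "europe": [  # processing allowed in Europe
        {"option": "more_input_tokens", "answer_model": "gpt-4o",
         "context_tokens": 30_000},
        {"option": "chunk_relevancy_sort", "scorer": "gpt-4o-mini",
         "answer_model": "gpt-4o", "context_tokens": 7_000},
        {"option": "reranker", "answer_model": "gpt-4o", "context_tokens": 7_000},
    ],
    "ptu": [  # PTU purchased
        {"option": "more_input_tokens", "answer_model": "gpt-4o",
         "context_tokens": 30_000},
        {"option": "reranker", "answer_model": "gpt-4o", "context_tokens": 7_000},
        {"option": "chunk_relevancy_sort", "scorer": "gpt-4o-mini",
         "answer_model": "gpt-4o", "context_tokens": 7_000},
    ],
}
```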


Cost Estimation

Companies that have already purchased a PTU will not incur any token costs as long as all requests are handled by the PTU. However, if they choose to use the reranker service, they will be responsible for the associated infrastructure cost, which is approximately $300 per month.

Cost Assumptions

  • Total of 21'120 requests per month: calculations assume two requests per minute during eight working hours per day, over a 22-day month, for the whole company (the arithmetic is sketched after this list).

  • $300 fixed costs for Reranker: the reranker adds $300 per month in infrastructure costs.

  • Retrieval of 100k tokens: Each retrieval of relevant document chunks returns 100k tokens.

  • Output tokens ignored: only input tokens are considered, as they represent approximately 95% of the total cost.
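
A worked example reproducing the monthly figures in the tables below, under the stated assumptions; the helper name is illustrative:

```python
# Reproduce the per-month cost figures from the tables below.
REQUESTS_PER_MONTH = 2 * 60 * 8 * 22  # 2 req/min, 8 h/day, 22 days = 21'120

def monthly_cost(tokens_per_request_k: float, price_per_1k: float,
                 fixed: float = 0.0) -> float:
    """Input-token cost per month plus any fixed infrastructure cost."""
    return REQUESTS_PER_MONTH * tokens_per_request_k * price_per_1k + fixed

# Table 1, "More Input Tokens": GPT-4 32K, 30k tokens at $0.06 per 1k.
print(monthly_cost(30, 0.06))            # -> 38016.0
# Table 1, "Reranker": GPT-4, 7k tokens at $0.03, plus $300 infrastructure.
print(monthly_cost(7, 0.03, fixed=300))  # -> 4735.2
```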

Table 1: Companies without PTU, processing allowed only in CH (using GPT-4 32K, GPT-4, and GPT-3.5)

| RAG improvement option | Model used / Tokens to process | Costs per 1k input tokens | Costs per month (21'120 requests) | Total cost estimation | Total cost per search |
|---|---|---|---|---|---|
| None | GPT-4 / 7k | $0.03 | $4'435 | $4'435 | $0.21 |
| More Input Tokens | GPT-4 32K / 30k | $0.06 | $38'016 | $38'016 | $1.80 |
| Chunk Relevancy Sort | GPT-4 / 7k + GPT-3.5 / 100k | $0.03 + $0.0005 | $4'435 + $1'056 | $5'491 | $0.26 |
| Reranker | GPT-4 / 7k + reranker / 100k | $0.03 + $0.00 | $4'435 + $300 fixed | $4'735 | $0.22 |

Table 2: Companies without PTU, processing allowed in Europe (using GPT-4 Omni and GPT-4 Omni-Mini)

| RAG improvement option | Model used / Tokens to process | Costs per 1k input tokens | Costs per month (21'120 requests) | Total cost estimation | Total cost per search |
|---|---|---|---|---|---|
| None | GPT-4o / 7k | $0.005 | $740 | $740 | $0.04 |
| More Input Tokens | GPT-4o / 30k | $0.005 | $3'168 | $3'168 | $0.15 |
| Chunk Relevancy Sort | GPT-4o / 7k + GPT-4o-mini / 100k | $0.005 + $0.000165 | $740 + $350 | $1'090 | $0.05 |
| Reranker | GPT-4o / 7k + reranker / 100k | $0.005 + $0.00 | $740 + $300 fixed | $1'040 | $0.05 |


Conclusion

This documentation provides a comprehensive guide for improving RAG results based on different customer scenarios. Whether dealing with rate limits, latency concerns, or cost management, the options outlined above will help you choose the best configuration for your needs.

For any further questions or personalized recommendations, please contact the customer success team.


Author

@Pascal Hauri

 

© 2024 Unique AG. All rights reserved.