Comparison of GPT-4o Performance: May vs. August Versions

Purpose

This benchmarking report evaluates two different GPT-4o versions:

  • GPT-4o-05 (released in May 2024)

  • GPT-4o-08 (released in August 2024)

Our goal is to assess the relative quality of responses to ensure consistent and accurate outputs as the model evolves. All evaluations were conducted as a blind test to minimize bias: the model version was not revealed to reviewers.

How It Works

Each comparison run evaluates new answers against reference answers along four dimensions:

  • Contradiction – Do the answers conflict in meaning?

  • Extent – Is one answer significantly longer or shorter than the other?

  • Hallucination – Does the answer include unsupported content?

  • Sources – Are sources missing compared to the human-evaluated ground truth?

A human reviewer first evaluates and confirms the ground truth before any comparison takes place. Scoring then uses an LLM as a judge to assess answer differences across these dimensions, as sketched below.
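
To make the judging step concrete, here is a minimal sketch of an LLM-as-judge comparison, assuming the OpenAI Python SDK; the prompt wording, the judge model, and the JSON field names are illustrative assumptions, not the exact setup used for this report.

```python
# Hedged sketch of the LLM-as-judge comparison across the four dimensions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Compare the new answer against the reference answer.
Return JSON with four boolean fields:
- contradiction: the answers conflict in meaning
- extent: one answer is significantly longer or shorter than the other
- hallucination: the new answer contains unsupported content
- missing_sources: sources present in the reference are missing

Reference answer:
{reference}

New answer:
{candidate}
"""

def judge_pair(reference: str, candidate: str) -> dict:
    """Score one answer pair across the four dimensions."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model; illustrative choice
        temperature=0,   # deterministic judging
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference, candidate=candidate),
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Each flagged dimension then feeds into the aggregate percentages reported in the tables below.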

Performance Overview – Model Consistency Check

Metric                 GPT-4o-05   GPT-4o-08
Contradiction issues   18.5%       15.1%
Extent differences     21.9%       17.1%
Hallucinations         15.8%       4.8%
Missing sources        38.4%       31.5%

This comparison checks how consistent each version is with itself. Each model version was run twice with the same 146 questions, and the results were compared between the two runs to identify inconsistencies. The GPT-4o-08 version showed improved consistency across all dimensions, especially in reducing hallucinations and contradiction issues.
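
As an illustration of how these per-dimension rates can be computed, the sketch below answers each question twice with the same model version, judges the two runs against each other, and aggregates the flags. It reuses the `judge_pair` helper sketched above; `get_answer` is a hypothetical stand-in for the actual answering pipeline.

```python
# Hedged sketch of the self-consistency aggregation over a question set.
from collections import Counter

DIMENSIONS = ("contradiction", "extent", "hallucination", "missing_sources")

def consistency_rates(questions: list[str], get_answer) -> dict[str, float]:
    """Share of question pairs flagged per dimension across two runs."""
    flags = Counter()
    for q in questions:
        first, second = get_answer(q), get_answer(q)  # two independent runs
        verdict = judge_pair(first, second)
        for dim in DIMENSIONS:
            flags[dim] += bool(verdict.get(dim))
    return {dim: flags[dim] / len(questions) for dim in DIMENSIONS}
```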

Qualitative Evaluation (Model Version Comparison)

Analysis of 146 response pairs comparing GPT-4o-05 and GPT-4o-08 outputs.

Metric                                         Share (rounded)
Obvious differences, but same meaning          62%
Very slight difference (e.g., word choice)     28%
Identical answers (sources can vary)           6%
One or more answers differ in actual meaning   4%

96% of answer pairs are acceptable variations. Only 4% differ in actual meaning and require deeper review.

Behavior Patterns

  • GPT-4o-08 often provides more focused answers, especially for scoped questions.

  • GPT-4o-05 includes broader context, which may help with depth but can be unnecessary in narrower queries.

  • GPT-4o-05 tends to produce slightly more natural or narrative-style outputs.

  • GPT-4o-08 generally gives fewer bullet-point responses compared to GPT-4o-05.

Examples

Question: “How does the direct investment mindset ensure value creation?”

  • GPT-4o-08: Compact explanation focused on the core mechanism.

  • GPT-4o-05: Similar meaning, phrased more broadly.

Question: “At what stage of funded commitments do you enter a secondary transaction?”

  • GPT-4o-08: Short, general answer.

  • GPT-4o-05: Descriptive explanation including rationale behind the entry stage.

Conclusion

GPT-4o-08 delivers better compliance with benchmark expectations:

  • Fewer hallucinations

  • Less contradiction

  • More precisely scoped answers

GPT-4o-05 remains stronger in narrative flow and contextual richness, which may still be preferred in user-facing or front-end scenarios.

Recommendation

Based on our results, GPT-4o-08 is preferable in accuracy-critical contexts. However, GPT-4o-05 may still be valuable where richer or more conversational answers are required.

Users should run their own benchmarks and validate whether the chosen model version delivers the desired outputs.

For more information, see Benchmarking Process.

RAG Configuration Details

Space used: Internal Knowledge Search

The retrieval-augmented generation (RAG) setup used for this benchmarking was identical for both GPT-4o versions. Key configuration settings (see the sketch after this list):

  • Context window: Up to 30,000 tokens were used for retrieving and analyzing source content.

  • Chunk relevancy sorting: Disabled. Retrieved chunks were not re-ranked after vector similarity retrieval.

  • Search method: Vector search only. No keyword or hybrid search mechanisms were applied.

  • LLM seed: Fixed to make outputs as deterministic as possible.

  • Temperature: Set to 0 for stable and predictable generations.
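
For illustration, the settings above can be captured in a small configuration object. The key names and the seed value below are hypothetical, not the actual Unique AG configuration schema.

```python
# Illustrative representation of the RAG settings listed above.
RAG_CONFIG = {
    "space": "Internal Knowledge Search",
    "context_window_tokens": 30_000,   # max tokens for retrieved source content
    "chunk_relevancy_sorting": False,  # no re-ranking after vector retrieval
    "search_method": "vector",         # no keyword or hybrid search
    "llm_seed": 42,                    # fixed seed; actual value not documented
    "temperature": 0,                  # stable, predictable generations
}
```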


Author

@Pascal Hauri @Enerel Khuyag

 
