Comparison of GPT-4o Performance: May vs. August Versions
Purpose
This benchmarking report evaluates two different GPT-4o versions:
GPT-4o-05 (released in May 2024)
GPT-4o-08 (released in August 2024)
Our goal is to assess the relative quality of responses to ensure consistent and accurate outputs as the model evolves. All evaluations were conducted as a blind test to minimize bias: the model version was not revealed to reviewers.
How It Works
Each comparison run evaluates new answers against reference answers using several dimensions:
Contradiction – Do the answers conflict in meaning?
Extent – Is one answer significantly longer or shorter than the other?
Hallucination – Does the answer include unsupported content?
Sources – Are sources missing compared to the human-evaluated ground truth?
A human first evaluates and confirms the ground truth before comparisons take place. The scoring uses an LLM as a judge to assess answer differences across these dimensions.
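For illustration, the following is a minimal sketch of how such an LLM-as-judge comparison could be wired up, assuming the openai Python client and a GPT-4o judge model; the actual prompts, rubric, and tooling used in this benchmark are not reproduced here.

```python
# Illustrative sketch only; the real benchmarking pipeline's prompts and
# rubric may differ. Assumes the openai Python package and OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()

DIMENSIONS = ["contradiction", "extent", "hallucination", "sources"]

JUDGE_PROMPT = """You compare a new answer against a human-confirmed reference answer.
For each dimension, report true if there is an issue, false otherwise:
- contradiction: the answers conflict in meaning
- extent: one answer is significantly longer or shorter than the other
- hallucination: the new answer contains unsupported content
- sources: the new answer is missing sources present in the reference
Return a JSON object with exactly these four boolean keys."""

def judge_pair(reference: str, new_answer: str) -> dict:
    """Flag issues on each dimension for one reference/new answer pair."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model (assumption)
        temperature=0,   # deterministic judging
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user",
             "content": f"Reference answer:\n{reference}\n\nNew answer:\n{new_answer}"},
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    return {dim: bool(verdict.get(dim, False)) for dim in DIMENSIONS}
```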
Performance Overview – Model Consistency Check
Metric | GPT-4o-05 | GPT-4o-08 |
---|---|---|
Contradiction issues | 18.5% | 15.1% |
Extent differences | 21.9% | 17.1% |
Hallucinations | 15.8% | 4.8% |
Missing sources | 38.4% | 31.5% |
This comparison checks how consistent each version is with itself. Each model version was run twice with the same 146 questions, and the results were compared between the two runs to identify inconsistencies. The GPT-4o-08 version showed improved consistency across all dimensions, especially in reducing hallucinations and contradiction issues.
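As a rough sketch of how the per-dimension rates in the table above could be aggregated, the snippet below compares two runs question by question using a judge such as judge_pair from the sketch in "How It Works"; variable and function names are illustrative.

```python
# Sketch of aggregating run-to-run issue rates over the 146 questions.
# Reuses judge_pair and DIMENSIONS from the sketch above; names are illustrative.
from collections import Counter

def run_to_run_issue_rates(run_a: list[str], run_b: list[str]) -> dict[str, float]:
    """Share of question pairs flagged on each dimension between two runs."""
    assert len(run_a) == len(run_b) and run_a
    issue_counts: Counter[str] = Counter()
    for answer_a, answer_b in zip(run_a, run_b):
        verdict = judge_pair(answer_a, answer_b)
        for dimension, has_issue in verdict.items():
            if has_issue:
                issue_counts[dimension] += 1
    return {dim: issue_counts[dim] / len(run_a) for dim in DIMENSIONS}

# e.g. run_to_run_issue_rates(gpt4o_08_run_1, gpt4o_08_run_2) would be expected
# to yield roughly {"hallucination": 0.048, "contradiction": 0.151, ...}
```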
Qualitative Evaluation (Model Version Comparison)
Analysis of 146 response pairs comparing GPT-4o-05 and GPT-4o-08 outputs.
Category | Share (rounded) |
---|---|
Obvious differences, but same meaning | 62% |
Very slight difference (e.g., word choice) | 28% |
Identical answers (sources can vary) | 6% |
One or more answers differ in actual meaning | 4% |
96% of answer pairs are acceptable variations. Only 4% differ in actual meaning and require deeper review.
Behavior Patterns
GPT-4o-08 often provides more focused answers, especially for scoped questions.
GPT-4o-05 includes broader context, which may help with depth but can be unnecessary in narrower queries.
GPT-4o-05 tends to produce slightly more natural or narrative-style outputs.
GPT-4o-08 generally gives fewer bullet-point responses than GPT-4o-05.
Examples
Question: “How does the direct investment mindset ensure value creation?”
GPT-4o-08: Compact explanation focused on the core mechanism.
GPT-4o-05: Similar meaning, phrased more broadly.
Question: “At what stage of funded commitments do you enter a secondary transaction?”
GPT-4o-08: Short, general answer.
GPT-4o-05: Descriptive explanation including rationale behind the entry stage.
Conclusion
GPT-4o-08 delivers better compliance with benchmark expectations:
Fewer hallucinations
Less contradiction
More precisely scoped answers
GPT-4o-05 remains stronger in narrative flow and contextual richness, which may still be preferred in user-facing or front-end scenarios.
Recommendation
Based on our results, GPT-4o-08 is preferable in accuracy-critical contexts. GPT-4o-05 may still be valuable where richer or more conversational answers are required; note that it also uses bullet-point answers more often than GPT-4o-08.
Users should run their own benchmarks to validate whether the chosen model version delivers the desired outputs.
For more information, see Benchmarking Process.
RAG Configuration Details
Space used: Internal Knowledge Search
The retrieval-augmented generation (RAG) setup used for this benchmarking was consistent across both GPT-4o versions. Key configuration settings:
Context window: Up to 30,000 tokens were used for retrieving and analyzing source content.
Chunk relevancy sorting: Disabled. Retrieved chunks were not re-ranked after vector similarity retrieval.
Search method: Vector search only. No keyword or hybrid search mechanisms were applied.
LLM seed: Fixed to keep outputs as deterministic as possible.
Temperature: Set to 0 for stable and predictable generations.
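As an illustration of how these settings translate into a generation call, here is a sketch assuming an OpenAI-compatible chat completions API; the retrieval-side keys are placeholders, since the internal RAG configuration schema is not part of this report, and the seed value shown is arbitrary.

```python
# Illustrative mapping of the benchmark's RAG settings; the retrieval-side keys
# are placeholders, only temperature and seed are standard API parameters.
from openai import OpenAI

client = OpenAI()

RAG_CONFIG = {
    "space": "Internal Knowledge Search",  # knowledge space used for retrieval
    "context_window_tokens": 30_000,       # max tokens of retrieved source content
    "chunk_relevancy_sorting": False,      # no re-ranking after vector retrieval
    "search_method": "vector",             # vector similarity only, no keyword/hybrid
}

def generate_answer(model: str, question: str, retrieved_context: str) -> str:
    """Generate an answer from retrieved context with deterministic settings."""
    response = client.chat.completions.create(
        model=model,       # e.g. the May or August GPT-4o snapshot
        temperature=0,     # stable and predictable generations
        seed=42,           # fixed seed for (mostly) deterministic outputs; value illustrative
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided context and cite its sources."},
            {"role": "user",
             "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```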
Authors: @Pascal Hauri, @Enerel Khuyag