GPT-4o Performance: August vs. November Versions
Purpose
This benchmarking report evaluates two distinct versions of OpenAI’s GPT-4o model:
GPT-4o-08 (released August 2024)
GPT-4o-11 (released November 2024)
The objective is to assess the relative quality and consistency of responses as the model evolves. To eliminate potential evaluator bias, all evaluations were conducted as blind tests, where the model version was not disclosed to reviewers.
Performance Overview – Model Consistency Check
This section evaluates the internal consistency of each model by comparing the responses from two separate runs over the same set of 141 prompts.
The comparison is conducted with our LLM-based benchmarking tool, in which an LLM acts as a judge and scores each pair of outputs across four dimensions (a minimal sketch of this judging step follows the list below):
Contradiction – Do the responses differ in factual meaning or assertions?
Extent Difference – Are there major length variations that change the scope or clarity?
Hallucination – Are unsupported or invented claims introduced?
Source Variations – Are any sources missing compared to the human-confirmed ground truth?
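The snippet below is a minimal sketch of such a judging step for a single prompt pair, assuming the OpenAI Python SDK; the judge model, prompt wording, and JSON flag names are illustrative assumptions, not the internal tool's actual implementation.

```python
# Minimal sketch of an LLM-as-judge consistency check for one prompt pair.
# Assumptions: OpenAI Python SDK, judge model name, prompt wording, JSON schema.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Compare two answers to the same prompt and return a JSON object with
boolean fields: contradiction, extent_difference, hallucination, source_variation.
Ground-truth sources:
{sources}

Answer from run 1:
{answer_1}

Answer from run 2:
{answer_2}
"""

def judge_pair(answer_1: str, answer_2: str, sources: str) -> dict:
    """Ask the judge LLM to flag the four consistency dimensions for one prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model: an assumption, not stated in this report
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                sources=sources, answer_1=answer_1, answer_2=answer_2),
        }],
    )
    return json.loads(response.choices[0].message.content)
```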
Consistency Metrics Across Model Versions
Metric (% of 141 prompt pairs flagged; lower is better) | GPT-4o-08 | GPT-4o-11 |
---|---|---|
Contradiction | 16% | 15% |
Extent Difference | 18% | 14% |
Hallucination | 4% | 4% |
Source Variations | 32% | 39% |
GPT-4o-11 shows slightly better consistency on the contradiction and extent-difference dimensions, and both models perform equally well on hallucinations. However, GPT-4o-11 introduces more source variation, suggesting differences in source citation or retrieval alignment.
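Each percentage above is the share of the 141 prompt pairs that the judge flagged on that dimension. A minimal aggregation sketch, assuming one dictionary of boolean judge flags per prompt pair (the data layout is an assumption):

```python
# Turn per-prompt judge flags into the percentages reported in the table above.
# The input layout (one dict of boolean flags per prompt pair) is an assumption.
DIMENSIONS = ("contradiction", "extent_difference", "hallucination", "source_variation")

def aggregate(flags_per_prompt: list[dict]) -> dict:
    """Return, per dimension, the percentage of prompt pairs the judge flagged."""
    n = len(flags_per_prompt)  # 141 prompt pairs in this benchmark
    return {
        dim: round(100 * sum(bool(flags.get(dim)) for flags in flags_per_prompt) / n)
        for dim in DIMENSIONS
    }
```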
Qualitative Evaluation – Cross-Model and Intra-Model Analysis
This evaluation involves human reviewers comparing responses from:
GPT-4o-08 vs GPT-4o-11 (cross-version comparison)
GPT-4o-11 vs GPT-4o-11 (intra-version consistency)
Each model was run twice per question, and reviewers compared the two generated outputs for differences in meaning, style, and alignment with the sources (an illustrative blinding step is sketched below).
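As an illustration of the blind setup, the sketch below randomizes the order and anonymizes the labels of each response pair before it reaches a reviewer; the function, field names, and data layout are assumptions, not the actual review tooling.

```python
# Illustrative blinding step: present the two responses in random order under
# neutral labels so reviewers cannot tell which model version produced which.
import random

def blind_pair(question: str, response_v08: str, response_v11: str,
               rng: random.Random) -> dict:
    """Return the two responses under anonymous labels, in random order."""
    labeled = [("GPT-4o-08", response_v08), ("GPT-4o-11", response_v11)]
    rng.shuffle(labeled)
    return {
        "question": question,
        "response_A": labeled[0][1],
        "response_B": labeled[1][1],
        # Kept only for un-blinding after review; never shown to reviewers.
        "key": {"A": labeled[0][0], "B": labeled[1][0]},
    }
```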
Evaluation Metrics
Comparison Type | Obvious Differences, Same Meaning | Very Slight Differences | Identical Responses | Meaningfully Different |
---|---|---|---|---|
GPT-4o-08 vs GPT-4o-11 | 34% | 57% | 6% | 2% |
GPT-4o-11 vs GPT-4o-11 | 26% | 55% | 18% | 1% |
Insights
98% of cross-version comparisons yield acceptable variations — including stylistic shifts or rewordings that do not affect meaning.
Cases of meaning divergence are rare (1–2%), indicating strong robustness in both versions.
Behavioral Patterns
Behavioral distinctions emerged between the two model versions, especially in tone, structure, and contextualization:
GPT-4o-08:
More concise and readable, favoring brevity over elaboration.
Ideal for quick-scan or bullet-style use cases.
Efficient under space constraints.
GPT-4o-11:
More structured, with a tone closer to formal institutional reporting.
Tends to elaborate, adding context or justifications.
Outperforms on nuanced prompts (e.g., ESG and regulatory questions).
Generates responses that mirror formal due diligence style more closely.
Examples
Q: Have there been any adverse ESG events communicated to investors?
GPT-4o-08: Highlights firm-specific ESG actions but may imply that adverse events occurred.
GPT-4o-11: Clearly states that no formal adverse ESG events were reported, aligning better with source-based verification.
Q: Is the valuation policy board-approved?
GPT-4o-08: Affirms approval without referencing a source.
GPT-4o-11: Notes lack of evidence in reviewed documents, aligning better with verification requirements.
Conclusion
GPT-4o-11 (November 2024 release) provides more reliable, structured, and source-aligned responses. It consistently delivers answers that match institutional standards for clarity and accuracy, especially in compliance-driven or complex domains.
While GPT-4o-08 (August 2024 release) offers concise, quickly readable output, making it valuable for routine tasks and short-form content, GPT-4o-11 is the preferred choice when accuracy, justification, and regulatory alignment matter.
Recommendation
Based on this evaluation:
Use GPT-4o-11 (November version) as the default for client-facing due diligence questionnaires (DDQs) and institutional reports.
Use GPT-4o-08 (August version) selectively in:
Scenarios with strict space constraints
Non-critical or formulaic content
Tasks where quick readability or bullet-style formatting is preferred
Users should also run independent benchmarks to verify model performance in their unique use cases (Benchmarking Process).
Retrieval-Augmented Generation (RAG) Configuration
Space used: Internal Knowledge Search
The RAG configuration was identical for both model versions, ensuring a fair and consistent comparison environment; an illustrative configuration sketch follows the parameter list below.
Key Parameters
Context Window: Up to 30,000 tokens
Search Method: Hybrid Search (Elastic keyword search combined with vector search)
Chunk Re-ranking: Disabled (no relevancy re-sorting post-retrieval)
LLM Seed: Fixed for deterministic generation
Temperature: Set to 0 to reduce randomness and enhance consistency
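For readability, here is what an equivalent configuration could look like expressed in code; the key names, structure, and seed value are illustrative assumptions, not Unique's actual API.

```python
# Illustrative RAG configuration mirroring the parameters listed above.
# Key names and the seed value are assumptions made for readability.
RAG_CONFIG = {
    "space": "Internal Knowledge Search",
    "context_window_tokens": 30_000,   # maximum retrieved context passed to the model
    "search_method": "hybrid",         # Elastic keyword search combined with vector search
    "chunk_reranking": False,          # no relevancy re-sorting after retrieval
    "llm_seed": 42,                    # fixed seed for (best-effort) deterministic generation
    "temperature": 0,                  # minimize sampling randomness
}
```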
Author: @Pascal Hauri, @Enerel Khuyag