GPT-4o Performance: August vs. November Versions
Purpose
This benchmarking report evaluates two distinct versions of OpenAI’s GPT-4o model:
GPT-4o-08 (released August 2024)
GPT-4o-11 (released November 2024)
The objective is to assess the relative quality and consistency of responses as the model evolves. To eliminate potential evaluator bias, all evaluations were conducted as blind tests, where the model version was not disclosed to reviewers.
Performance Overview – Model Consistency Check
This section evaluates the internal consistency of each model by comparing the responses from two separate runs over the same set of 141 prompts.
The comparison is conducted with our LLM-based benchmarking tool, in which an LLM acts as a judge and scores each pair of outputs across four dimensions (a minimal sketch of this judging step follows the list below):
Contradiction – Do the responses differ in factual meaning or assertions?
Extent Difference – Are there major length variations that change the scope or clarity?
Hallucination – Are unsupported or invented claims introduced?
Source Variations – Are any sources missing compared to the human-confirmed ground truth?
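The snippet below is a minimal sketch of such a judging step for a single prompt pair, assuming the OpenAI Python SDK; the judge model, prompt wording, and JSON flag names are illustrative assumptions, not the internal tool's actual implementation.

```python
# Minimal sketch of an LLM-as-judge consistency check for one prompt pair.
# Assumptions: OpenAI Python SDK, judge model name, prompt wording, JSON schema.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Compare two answers to the same prompt and return a JSON object with
boolean fields: contradiction, extent_difference, hallucination, source_variation.
Ground-truth sources:
{sources}

Answer from run 1:
{answer_1}

Answer from run 2:
{answer_2}
"""

def judge_pair(answer_1: str, answer_2: str, sources: str) -> dict:
    """Ask the judge LLM to flag the four consistency dimensions for one prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model: an assumption, not stated in this report
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                sources=sources, answer_1=answer_1, answer_2=answer_2),
        }],
    )
    return json.loads(response.choices[0].message.content)
```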
Consistency Metrics Across Model Versions
Metric (% of 141 prompt pairs flagged; lower is better) | GPT-4o-08 | GPT-4o-11 |
---|---|---|
Contradiction | 16% | 15% |
Extent Difference | 18% | 14% |
Hallucination | 4% | 4% |
Source Variations | 32% | 39% |
GPT-4o-11 shows slightly better consistency on the contradiction and extent-difference dimensions, and both models perform equally well on hallucinations. However, GPT-4o-11 introduces more source variation, suggesting differences in source citation or retrieval alignment.
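Each percentage above is the share of the 141 prompt pairs that the judge flagged on that dimension. A minimal aggregation sketch, assuming one dictionary of boolean judge flags per prompt pair (the data layout is an assumption):

```python
# Turn per-prompt judge flags into the percentages reported in the table above.
# The input layout (one dict of boolean flags per prompt pair) is an assumption.
DIMENSIONS = ("contradiction", "extent_difference", "hallucination", "source_variation")

def aggregate(flags_per_prompt: list[dict]) -> dict:
    """Return, per dimension, the percentage of prompt pairs the judge flagged."""
    n = len(flags_per_prompt)  # 141 prompt pairs in this benchmark
    return {
        dim: round(100 * sum(bool(flags.get(dim)) for flags in flags_per_prompt) / n)
        for dim in DIMENSIONS
    }
```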
Qualitative Evaluation – Cross-Model and Intra-Model Analysis
This evaluation involves human reviewers comparing responses from:
GPT-4o-08 vs GPT-4o-11 (cross-version comparison)
GPT-4o-11 vs GPT-4o-11 (intra-version consistency)
Each model was run twice per question, and reviewers compared the two generated outputs for differences in meaning, style, and alignment with the sources (an illustrative blinding step is sketched below).
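As an illustration of the blind setup, the sketch below randomizes the order and anonymizes the labels of each response pair before it reaches a reviewer; the function, field names, and data layout are assumptions, not the actual review tooling.

```python
# Illustrative blinding step: present the two responses in random order under
# neutral labels so reviewers cannot tell which model version produced which.
import random

def blind_pair(question: str, response_v08: str, response_v11: str,
               rng: random.Random) -> dict:
    """Return the two responses under anonymous labels, in random order."""
    labeled = [("GPT-4o-08", response_v08), ("GPT-4o-11", response_v11)]
    rng.shuffle(labeled)
    return {
        "question": question,
        "response_A": labeled[0][1],
        "response_B": labeled[1][1],
        # Kept only for un-blinding after review; never shown to reviewers.
        "key": {"A": labeled[0][0], "B": labeled[1][0]},
    }
```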
Evaluation Metrics
Comparison Type | Obvious Differences, Same Meaning | Very Slight Differences | Identical Responses | Meaningfully Different |
---|---|---|---|---|
GPT-4o-08 vs GPT-4o-11 | 34% | 57% | 6% | 2% |
GPT-4o-11 vs GPT-4o-11 | 26% | 55% | 18% | 1% |
Insights
98% of cross-version comparisons yield acceptable variations — including stylistic shifts or rewordings that do not affect meaning.
Cases of meaning divergence are rare (1–2%), indicating strong robustness in both versions.
Behavioral Patterns
Behavioral distinctions emerged between the two model versions, especially in tone, structure, and contextualization:
GPT-4o-08:
More concise and readable, favoring brevity over elaboration.
Ideal for quick-scan or bullet-style use cases.
Efficient under space constraints.
GPT-4o-11:
More structured, with a tone closer to formal institutional reporting.
Tends to elaborate, adding context or justifications.
Outperforms on nuanced prompts (e.g., ESG and regulatory questions).
Generates responses that mirror formal due diligence style more closely.
Examples
Q: Have there been any adverse ESG events communicated to investors?
GPT-4o-08: Highlights firm-specific ESG actions but may imply that adverse events occurred.
GPT-4o-11: Clearly states that no formal adverse ESG events were reported, aligning better with source-based verification.
Q: Is the valuation policy board-approved?
GPT-4o-08: Affirms approval without referencing a source.
GPT-4o-11: Notes lack of evidence in reviewed documents, aligning better with verification requirements.
Conclusion
GPT-4o-11 (November 2024 release) provides more reliable, structured, and source-aligned responses. It consistently delivers answers that match institutional standards for clarity and accuracy, especially in compliance-driven or complex domains.
While GPT-4o-08 (August 2024 release) offers concise, quickly readable output, making it valuable for routine tasks and short-form content, GPT-4o-11 is the preferred choice when accuracy, justification, and regulatory alignment matter.
Recommendation
Based on this evaluation:
Use GPT-4o-11 (November version) as the default for client-facing due diligence questionnaires (DDQs) and institutional reports.
Use GPT-4o-08 (August version) selectively in:
Scenarios with strict space constraints
Non-critical or formulaic content
Tasks where quick readability or bullet-style formatting is preferred
Users should also run independent benchmarks to verify model performance in their unique use cases (Benchmarking Process).
Retrieval-Augmented Generation (RAG) Configuration
Space used: Internal Knowledge Search
The RAG configuration was identical for both model versions, ensuring a fair and consistent comparison environment; an illustrative configuration sketch follows the parameter list below.
Key Parameters
Context Window: Up to 30,000 tokens
Search Method: Hybrid Search (Elastic keyword search combined with vector search)
Chunk Re-ranking: Disabled (no relevancy re-sorting post-retrieval)
LLM Seed: Fixed for deterministic generation
Temperature: Set to 0 to reduce randomness and enhance consistency
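For readability, here is what an equivalent configuration could look like expressed in code; the key names, structure, and seed value are illustrative assumptions, not Unique's actual API.

```python
# Illustrative RAG configuration mirroring the parameters listed above.
# Key names and the seed value are assumptions made for readability.
RAG_CONFIG = {
    "space": "Internal Knowledge Search",
    "context_window_tokens": 30_000,   # maximum retrieved context passed to the model
    "search_method": "hybrid",         # Elastic keyword search combined with vector search
    "chunk_reranking": False,          # no relevancy re-sorting after retrieval
    "llm_seed": 42,                    # fixed seed for (best-effort) deterministic generation
    "temperature": 0,                  # minimize sampling randomness
}
```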
Author: @Pascal Hauri, @Enerel Khuyag