GPT-5: Peak or Plateau? A Literature Review - Cana

Is GPT-5 a revolutionary step toward greater AI intelligence, or a sign of diminishing returns in scaling large language models? This work presents a comprehensive analysis of literature and reports from the last several years to answer that question. We review the development from GPT-3 through GPT-4 and into GPT-5, highlighting how earlier leaps in capability set high expectations for GPT-5’s launch. Through a survey of academic papers, industry analyses, and user experiences, we find that GPT-5 delivered incremental improvements over GPT-4 in many areas, confirming predictions of an AI “scaling plateau.” Performance benchmarks show smaller gains despite enormous increases in model complexity, and some users even noted regressions in specific domains like coding speed and the conversational “feel.”

Introduction

Large Language Models (LLMs) have undergone unprecedented growth in capability in a short span, leading many to wonder if recent developments are marking the peak of current approaches or if they are merely a pause before the next leap. In particular, the question “GPT-5: Peak or Plateau?” embodies the research community’s curiosity and concern about whether the latest flagship model from OpenAI represents continued progress or the start of diminishing returns. This technical article seeks to answer that question through a rigorous literature review and analysis of GPT-5’s performance relative to its predecessors and peers.

Research Questions: We focus on three primary questions: (1) What do the latest evaluations and reports say about GPT-5’s improvements over GPT-4? (Is it a significant advance or an incremental upgrade?) (2) What evidence is there of diminishing returns in scaling LLMs? We examine both theoretical scaling law predictions and empirical benchmark trends around GPT-5’s time. (3) How has GPT-5 affected or been perceived in terms of safety and alignment? This includes whether its new safety strategies resulted in better or worse outcomes (e.g. fewer refusals but potential new risks).

Background

To contextualize GPT-5’s development, it’s important to review the rapid evolution of GPT series models and the scaling paradigm in general. The modern LLM boom began with GPT-2 (2019), which was among the first models to demonstrate coherent paragraph-length text generation on unrestricted topics. GPT-2’s release, which was staged because the full model was initially withheld over misuse fears, hinted at the potential of scaling up transformers. The true shock came with GPT-3 in 2020, a 175-billion parameter model that could perform tasks it wasn’t explicitly trained for, from translation to basic arithmetic, given only a few examples in the prompt. This emergent capability indicated that simply making models larger and training on more data was unlocking new behavior.

Following GPT-3, the notion of “scaling laws” for AI gained prominence. Kaplan et al. (2020) had formalized how model performance (measured in terms of training loss) scales predictably with model size, dataset size, and compute, obeying a power-law until hitting data/model limits. These laws suggested that bigger was better, and indeed AI labs raced to build larger models (GPT-3 itself was an outgrowth of that thinking).
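The practical consequence of a power law can be seen numerically. The sketch below assumes the single-variable form L(N) = (N_c / N)^alpha described by Kaplan et al.; the constants (N_c of about 8.8e13 and alpha of about 0.076) are approximate and used for illustration only, and the function name and model sizes are hypothetical.

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss predicted by a pure power law in parameter count.

    The default constants are approximate, illustrative values (an assumption),
    not a fit to any model discussed in this review.
    """
    return (n_c / n_params) ** alpha


# Hypothetical model sizes, each 10x larger than the last.
sizes = [1e9, 1e10, 1e11, 1e12]
losses = [power_law_loss(n) for n in sizes]

# Compare each model with the one 10x smaller.
for n, prev, cur in zip(sizes[1:], losses, losses[1:]):
    print(f"{n:.0e} params: loss {cur:.3f}, absolute gain from 10x scale-up {prev - cur:.3f}")
```

Under these assumptions every tenfold increase in size multiplies the loss by the same factor (10 to the power of minus alpha, roughly 0.84), so the absolute improvement bought by each additional order of magnitude keeps shrinking. This is the sense in which a power law implies diminishing returns without implying a hard wall.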

However, the next milestone, GPT-4 (March 2023), while larger (exact size undisclosed) and much more capable, introduced some nuances. GPT-4 was not just “GPT-3 but bigger”: it was also extensively fine-tuned with human feedback and had new abilities such as image understanding. By outperforming its predecessors on a wide array of tasks (e.g., GPT-4 famously passed a simulated bar exam and other professional tests that GPT-3.5 had failed), GPT-4 set a high benchmark for “what comes next.” Yet OpenAI’s technical report explicitly declined to disclose details of GPT-4’s architecture and training, citing competitive and safety reasons, which indicated that the game had changed from academic openness to closed, product-focused development.

The Plateau Hypothesis: By early 2025, some experts and even tech leaders were publicly pondering whether the scaling approach was at or near a wall, the point at which adding parameters, data, and compute stops yielding meaningful gains. Notably, Bill Gates in late 2023 expressed skepticism that GPT-5 would be the kind of giant leap GPT-4 was.

The tension between fundraising and hype narratives (sometimes invoking AGI) and pragmatic product improvement is a recurring theme in the background of GPT-5’s launch.

Methods

To tackle the research questions, we conducted a broad literature sweep following systematic review principles, adapted to the fast-moving and sometimes proprietary nature of AI model research. Our sources ranged from peer-reviewed papers and preprints, to industry whitepapers and blog posts, to forum discussions.

Search and Selection Strategy

We used a combination of search connectors (including general web search via Bing, and targeted searches on arXiv and technical blogs). The query plan (see Appendix for full list of queries) was designed to cover:

(a) general performance of GPT-5 vs predecessors
(b) keywords around “diminishing returns” and “plateau” in AI
(c) GPT-5 safety or alignment changes
(d) comparisons with other models

We set a time window of roughly 2023 through 2025 (with some earlier “seminal” works like 2020-2022 scaling law papers allowed). Given that GPT-5 was released in 2025, many sources are from mid/late 2025 discussing it.

In total, we gathered 12 major sources deemed most relevant (see Source Log in Appendix for details) and a handful of supplementary ones for cross-verification. Each source in the log was annotated with its type and a reliability note to be mindful of potential bias.

Analysis Approach

We performed a thematic analysis on the content of sources. We pre-defined themes (subtopics) such as “Performance gains & benchmarks,” “Safety & alignment,” “User experience,” “Scaling limits evidence,” “Competition,” etc., based on the questions. As we reviewed each source, we noted which themes it provided insight on. We constructed a coverage map (see Appendix) to ensure that for each theme we had multiple sources of evidence – this also revealed if any theme had conflicting reports that needed reconciliation.
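The coverage check described above can be sketched in a few lines. The source IDs and theme assignments below are hypothetical placeholders rather than the actual Source Log, and the helper names are illustrative; only the theme labels come from the text.

```python
from collections import defaultdict

THEMES = [
    "Performance gains & benchmarks",
    "Safety & alignment",
    "User experience",
    "Scaling limits evidence",
    "Competition",
]

# Hypothetical annotations: source ID -> themes it informs (placeholder data).
annotations = {
    "S1": ["Performance gains & benchmarks", "Safety & alignment"],
    "S2": ["Performance gains & benchmarks", "User experience"],
    "S3": ["Scaling limits evidence"],
    "S4": ["Safety & alignment", "User experience", "Competition"],
}


def build_coverage_map(source_themes):
    """Invert source -> themes into theme -> sorted list of sources."""
    coverage = defaultdict(list)
    for source, themes in source_themes.items():
        for theme in themes:
            coverage[theme].append(source)
    return {theme: sorted(sources) for theme, sources in coverage.items()}


def under_covered(coverage, themes, minimum=2):
    """Return the themes supported by fewer than `minimum` sources."""
    return [t for t in themes if len(coverage.get(t, [])) < minimum]


coverage = build_coverage_map(annotations)
for theme in under_covered(coverage, THEMES):
    print(f"Needs more evidence: {theme} ({len(coverage.get(theme, []))} source(s))")
```

Inverting the annotations this way makes thin themes visible at a glance, and a theme backed by several sources is also where conflicting reports are most likely to surface.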

We paid special attention to quantitative data vs qualitative statements:

When a source provided metrics (e.g., benchmark scores, percentage improvements), we recorded those exactly with citations. For instance, OpenAI stated GPT-5’s hallucination rate on certain benchmarks dropped to ~1.6%, and independent tests showed GPT-5’s score on a reasoning benchmark (GPQA) was 89.4% vs a competitor’s 86.4%. These help objectively gauge improvement magnitude.

For opinions or subjective evaluations (e.g., “GPT-5 feels underwhelming” or “phenomenal”), we carefully attributed them, especially if they came from notable figures (like Gary Marcus’s quote via Tim Lee or Nathan Lambert’s assessment). This ensures we differentiate between measurable facts and perception.

Where we found disagreements among sources, we tracked them for the Discussion section. A prime example is the disagreement over whether progress is truly slowing: Gates and some experts argue that it is, while Altman and Schmidt respond, in effect, “no, there’s still gas in the tank”.
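When recording metrics such as the GPQA figures quoted above (89.4% vs 86.4%), it helps to state the gap in more than one way, because a small percentage-point difference can still be a sizeable cut in the error rate. The following is a minimal sketch; the function name and the returned keys are illustrative, not part of any source's methodology.

```python
def compare_scores(score_a: float, score_b: float) -> dict:
    """Compare two accuracy scores given in percent (score_a is the newer model)."""
    gap_points = score_a - score_b
    error_a = 100.0 - score_a
    error_b = 100.0 - score_b
    return {
        # Difference in percentage points.
        "gap_points": round(gap_points, 2),
        # Improvement relative to the baseline score.
        "relative_gain_pct": round(100.0 * gap_points / score_b, 2),
        # Share of the baseline's errors that the newer model eliminates.
        "error_reduction_pct": round(100.0 * (error_b - error_a) / error_b, 2),
    }


result = compare_scores(89.4, 86.4)
print(result)
```

For these two scores the gap is 3.0 percentage points, which corresponds to roughly a 3.5% relative gain but about a 22% reduction in errors. Which framing a source uses can make the same result sound incremental or substantial, so we note the framing alongside the number.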

Literature Review

In this section, we synthesize findings from the collected sources along thematic lines. Rather than summarizing each source in isolation, we integrate them to answer specific facets of the overarching question about GPT-5’s gains or lack thereof.

GPT-5 Performance: Incremental Gains Across the Board

When GPT-5 was released, OpenAI touted it as “our best AI system yet” with state-of-the-art results in many domains. Indeed, on paper GPT-5’s performance metrics were top-tier in late 2025. For instance, it set new records on certain benchmarks: OpenAI reported that GPT-5 achieved 94.6% on a math benchmark (AIME 2025), 74.9% on a coding benchmark (SWE-bench Verified), and 88.4% on a difficult reasoning benchmark (GPQA) when using its most advanced reasoning configuration.

Yet multiple sources emphasize that the improvements were smaller than expected. The Financial Times bluntly called GPT-5 “underwhelming” and a sign that progress is slowing. Tim Lee notes that after GPT-4’s dramatic jump, GPT-5 didn’t meet the “sky-high expectations”; it turned out to be solid but not a breakthrough.

It’s important to note two caveats. First, we are discussing general benchmarks; there might be specific tasks where GPT-5 made a bigger leap. Second, performance is not just about scores: latency, consistency, and similar qualities matter to users. In those respects, some users reported that things got worse, for example in coding speed and conversational “feel.”

Safety & Alignment

GPT-5’s launch was accompanied by a notable shift in OpenAI’s approach to safety and alignment training. Previously, GPT-4 and ChatGPT used refusal-based training: on potentially harmful or policy-violating prompts, the AI was trained to refuse outright. This led to complaints of the model being overly cautious or giving “hard stops” even when a partial helpful answer might exist.

For GPT-5, OpenAI introduced what it calls “safe completions” training. The goal was to have GPT-5 respond to tricky prompts with as much helpful information as possible without crossing safety lines, refusing only the parts it truly must.

Now, how did this play out in reality and perception? The results are mixed:

Reduced Over-refusal: Indeed, early users noticed GPT-5 was more willing to answer questions that GPT-4 would often refuse. Many welcomed this increased compliance. This made GPT-5 feel more capable in one sense – it would try to tackle more queries.

Safety Concerns: On the flip side, some observers worried that GPT-5 might generate harmful content more easily now that its safeguards are “nuanced” rather than strict. The model is walking a finer line. The Lab7AI Insights article noted that some enterprise users actually value refusal behavior, as it reduces risk.

In conclusion, from a safety perspective, GPT-5 represented a recalibration of alignment priorities: moving away from blunt refusals to a more context-sensitive strategy. Whether that is a net positive or negative depends on one’s view.

Analysis and Discussion

Bringing together the reviewed evidence, we analyze what consensus is emerging in the community, where opinions diverge, and what the implications are for declaring GPT-5 as a “peak” or just a stepping stone.

Areas of Consensus

There is a strong consensus across nearly all sources that GPT-5 did not provide a revolutionary leap over GPT-4 in the way prior generational jumps did. This is supported by multiple independent lines of evidence:

Expert commentary: Both enthusiastic experts (like Lambert) and skeptical ones (like Marcus via Lee) agree GPT-5’s performance is within expected incremental bounds.

User sentiment: General user and developer feedback aligns with that; terms like “underwhelming” or “meh” came up frequently.

Therefore, one consensus conclusion is: the era of extremely rapid improvements in general LLM capabilities has slowed by 2025. This doesn’t mean progress is zero; it means progress per unit of investment is much smaller.

Points of Contention

Despite broad agreement on the plateau, there are some nuanced disagreements and uncertainties:

“Plateau” vs “Continued Slope”: Some AI leaders like Sam Altman and Eric Schmidt disagree that we’ve hit a wall. They argue there’s still room to improve by scaling further or just by iterative improvement.

Severity of Diminishing Returns: All agree diminishing returns exist; the debate is how severe they are. Some at OpenAI might argue the situation is not that dire: GPT-5’s training may have included additional goals (such as multimodality) that diluted the gain in pure text performance, so a GPT-6 focused on text could still deliver a bigger jump.

Evaluation of “intelligence”: Some AI commentators (especially those focused on AGI prospects) debate whether GPT-5 shows that we are not yet near human-level AI. For example, Gary Marcus called GPT-5 “overhyped and underwhelming”, while others like Nathan Lambert or the LessWrong poster might say GPT-5 is still part of steady progress that could eventually lead to very high capability, just not explosively soon.

Interpretations – Peak or Plateau

So, is GPT-5 a peak (i.e., the highest point of a curve for now, possibly followed by decline or stagnation) or a plateau (steady but slow continuation)? The literature tends to use the word plateau. No one suggests that AI capability declined with GPT-5; it is higher, just not dramatically so. The picture is therefore not a “peak” followed by a downturn, but possibly a flattening curve.

If we plotted performance from GPT-2 to GPT-5, the graph would likely look like an S-curve, or at least a curve with diminishing slope. GPT-3 to GPT-4 was a steep climb; GPT-4 to GPT-5 is a much gentler slope. This often happens in maturing technologies, where initial exponential-like growth eventually hits saturation.
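The shape of such a curve can be illustrated with a logistic function. The sketch below is purely illustrative: the ceiling, rate, and midpoint are arbitrary assumptions, the generation indices are evenly spaced placeholders, and nothing here is fitted to benchmark data for any GPT model.

```python
import math


def logistic(t: float, ceiling: float = 100.0, rate: float = 1.5, midpoint: float = 1.5) -> float:
    """Standard logistic (S-shaped) curve; all parameters are illustrative assumptions."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


# Hypothetical, evenly spaced model generations (indices only, not real scores).
generations = [0, 1, 2, 3]
scores = [logistic(t) for t in generations]

# Gain contributed by each successive generation.
steps = [later - earlier for earlier, later in zip(scores, scores[1:])]

for t, step in zip(generations[1:], steps):
    print(f"generation {t}: score {logistic(t):.1f}, gain over previous {step:.1f}")
```

With these parameters the largest gain falls on the middle step and the final step is smaller, even though every generation still improves on the last. That is the pattern the plateau reading describes: continued progress with a shrinking slope as the curve approaches its ceiling.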

One might consider GPT-5 an “inflection point”: after it, bigger is not always much better, and research focus might shift to other dimensions like efficiency, specialization, or new paradigms.

The consensus leans toward calling it a plateau in the scaling approach. Yet we must be careful: a plateau can be temporary. Historically, AI has seen plateaus that ended when a new method arrived. In chess, progress plateaued until deep learning came to chess engines and produced a leap; in vision, progress plateaued until deep convolutional networks arrived.

Conclusion

Our comprehensive review finds that GPT-5, while undeniably more advanced than its predecessors, largely confirms a trend of slowing marginal returns in the current paradigm of large language models. It stands less as a triumphant peak of intelligence and more as a broad plateau – a leveling out of the meteoric progress charted by GPT-3 and GPT-4.

Users and experts alike observed that GPT-5’s improvements, though real (e.g., better factual accuracy, slightly improved reasoning, multimodal support), feel incremental.

We also highlighted that GPT-5 introduced qualitative changes in alignment strategy – a move toward nuanced safe completions rather than blanket refusals – and this reflects a maturing of approach, albeit with debates on safety trade-offs.

This change is indicative of the broader state of LLM development: simply scaling the model is no longer enough; how we use and guide the model becomes crucial to extract value without causing harm.

GPT-5 may be a plateau on the current mountain, with another climb possible if a new path or technology is found. Many voices in our sources emphasize looking beyond simply making models bigger, pointing to new architectures, more efficient algorithms, or hybrid systems as the way forward.

In conclusion, GPT-5 should perhaps be seen as a pivot point. It solidified the gains of the transformer era, turned them into widely deployed tools (with GPT-5 in Office, ChatGPT, etc.), and in doing so, exposed the asymptotes of this approach.

References

OpenAI. (2025, Aug 7). Introducing GPT-5 (Blog post). OpenAI.
Lee, T. B. (2025, Aug 14). Is GPT-5 a “phenomenal” success or an “underwhelming” failure? Understanding AI (Substack).
Lambert, N. (2025, Aug 7). GPT-5 and the arc of progress. Interconnects (Substack).
Okemwa, K. (2025, Aug 16). From plateau predictions to buggy rollouts – Bill Gates’ GPT-5 skepticism looks strangely accurate. Windows Central.
Lab7AI. (2025, Aug 18). AI Plateau? Mixed Reactions Roll In After GPT-5 Debuts. Medium (Lab7AI Insights).
Masood, A. (2025, May 26). Is there a wall? An Evidence-Based Analysis of Diminishing Returns in LLM Scaling. Medium.
OpenAI Forum User “Max_Zhadobin”. (2025, Sep 16). Severe regression in GPT-5 Codex performance. OpenAI Developer Community.
Reddit user comments. (2025, Aug). Discussion: GPT-5 modest improvements (r/artificial thread).
Shah, D. (2025, Sep 4). GPT-5 vs Claude 4. Portkey AI Blog.
Schoenmakers, A. (2025, Aug 9). First Impressions: GPT-5 or Claude 4 Sonnet? Spartner Software Blog.
Arsturn. (2025, Aug). GPT-5 vs Gemini 2.5 Pro: AI Logic & Reasoning Battle. Arsturn Tech Blog.
Greenblatt, R. (2025, Aug 20). My AGI timeline updates from GPT-5 (and 2025 so far). AI Alignment Forum.
