The AI Scaling Crisis

"Today's AI models are the worst you will ever use" - AI Zealot Proverb
In the 2012 AlexNet research paper that reignited interest in neural networks, the authors make a clear point about the source of their success:
"All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available"
This point, that simply scaling up should produce better results on AI tasks, became the leading gospel AI researchers and developers have evangelized for the last decade. It was the message communicated from mountain-tops that made its way through Silicon Valley and eventually the rest of the world: AI is here, and it will only get better the more data and the more compute we throw at our problems.
This crusade of scale in AI systems has indeed taken our technology industries far. As compute from dedicated GPUs increased, and as global internet use captured and produced ever more data, it became easier than ever to make progress on machine learning tasks that for decades eluded reasonable solutions. In a short time we produced powerful models in areas like computer vision and recommendation systems, and eventually the pinnacle of AI exposure to the broader public - large language models with ChatGPT.
So now that we have gone through several years of iterations and upgrades of foundational models, what does the trajectory of their performance and cost tell us?
The Promise of Scale vs Reality
In the early 2020s, scaling models yielded dramatic leaps. For example, GPT-3 struggled below 50% accuracy on math benchmarks like GSM8K, while GPT-4 soared to near-perfect performance, essentially making that test obsolete and forcing researchers to adopt harder benchmarks. Gains were just as visible in code generation and general reasoning, with each new generation unlocking fresh capabilities.
But the trajectory has changed. With GPT-5, we still see state-of-the-art results across math, coding, and reasoning tasks, yet the magnitude of improvement is much smaller. Many benchmarks now show signs of saturation, where further scaling doesn’t deliver the same step-change progress.
The public response to GPT-5 has also underscored this point. Unlike the GPT-4 launch, which felt like a revolution in its public reception, the August 2025 release has landed more as an incremental upgrade. The frontier is still moving forward, but the era of scaling breakthroughs may be giving way to something slower.
There is no Free Lunch
Every leap in performance has been matched by a leap in cost. The disciples of the scaling gospel aggregated billions of dollars in financing for talent and compute resources, but they also taxed the planet through enormous consumption of energy and clean water. On top of this, poorly paid data workers laboring in even poorer conditions across the globe were critical to labeling data and steering models away from violent and unacceptable behavior.
As the industry keeps iterating on the incremental powers of these foundation models, it is worth weighing the ongoing costs of continuing in this direction.
AI Investment is Banking on Continuous Performance Improvements
Startups and Fortune 500 companies continue to pour talent, effort, and money into AI development, with most investment theses predicated on the rate of performance gains we have witnessed over the last five years. It is very easy to take this for granted given the longevity of other scaling principles such as Moore's Law (and eventually the self-proclaimed "OpenAI's Law" for model and compute scaling).
Entire companies are being built around fully agentic workflows, where "AI team members" have close to complete autonomy in performing tasks. At the time of this writing, most practitioners and investors are aware that this is not yet a practical reality: reasoning models with open-ended agentic workflows are expensive to run and don't guarantee better results than the fixed pipelines we already understand today.
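To make the cost asymmetry concrete, here is a minimal sketch contrasting a fixed pipeline with an open-ended agent loop. The call_llm function is a hypothetical stand-in for any hosted model API, not a real library call; the point is only that the pipeline's cost is known up front while the agent's is not.

```python
# Minimal sketch: fixed pipeline vs. open-ended agent loop.
# `call_llm` is a hypothetical stand-in for any hosted model API; assume each
# invocation has a real dollar cost.

def call_llm(prompt: str) -> str:
    """Hypothetical model call used only to illustrate cost structure."""
    return f"response to: {prompt}"

def fixed_pipeline(document: str) -> str:
    # A pre-designed workflow: exactly three model calls, cost known up front.
    summary = call_llm(f"Summarize: {document}")
    entities = call_llm(f"Extract entities: {summary}")
    return call_llm(f"Draft report from: {entities}")

def agentic_workflow(goal: str, max_steps: int = 20) -> str:
    # An open-ended loop: the model decides when it is done, so the number of
    # calls (and therefore the cost) is unbounded without an explicit cap.
    context = goal
    for _ in range(max_steps):
        action = call_llm(f"Decide next step for: {context}")
        if "DONE" in action:
            return action
        context += "\n" + action
    return context  # hit the step budget without a guaranteed result
```

Nothing here argues agents are useless; it simply shows why a fixed pipeline for well-understood work is easier to price and guarantee.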
So if generative AI performance is plateauing into incremental gains, and fully agentic services are still at least one step change away in our foundational technology offerings, where does that leave investor expectations tied up in hundreds of billions of dollars?
Continued scaling alone cannot sustain the return-on-investment assumptions built into today's massive funding rounds.
Where do we go from here?
It is in vogue right now to throw around terms like "AI Bubble", but I don't find that a productive conversation for those of us in the trenches of developing this technology. There is enough awareness in the broader community that new ideas need to be considered beyond general, foundational models to keep the pace of progress going.
So what are the values that should drive new innovation in both research and applied AI work?
Architectural Exploration
Transformers excel at autoregressive token prediction, but they lack critical capabilities such as proper world modeling and persistent memory. If new gains won't come from more of the same transformer pretraining, they will have to come from different approaches entirely.
JEPA (Joint Embedding Predictive Architecture), an initiative from Deep Learning Godfather Yann LeCun, is one effort pushing for improvements through better world modeling. Instead of trying to get every pixel or word exactly right, it asks models to predict the shape of missing pieces in a higher-level space of meaning. This shift makes models both more efficient and better able to reason about the world.
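To give a flavor of the idea, here is a minimal, simplified sketch of a JEPA-style objective in PyTorch: encode the visible context, predict the representation of the hidden piece, and compare the two in embedding space rather than reconstructing raw pixels or tokens. The encoders, dimensions, and stop-gradient target below are illustrative assumptions, not the actual recipe; real JEPA variants such as I-JEPA typically use vision transformers, structured masking, and an EMA-updated target encoder.

```python
import torch
import torch.nn as nn

# Toy encoders and predictor standing in for the real architectures.
dim = 64
context_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Linear(dim, dim)

def jepa_loss(context_patch: torch.Tensor, target_patch: torch.Tensor) -> torch.Tensor:
    # Encode the visible context and predict the *embedding* of the missing
    # piece, rather than reconstructing its raw pixels or tokens.
    pred = predictor(context_encoder(context_patch))
    with torch.no_grad():  # the target representation is not backpropagated through
        target = target_encoder(target_patch)
    return nn.functional.mse_loss(pred, target)

# One illustrative step on random data standing in for masked image patches.
context = torch.randn(8, 128)
target = torch.randn(8, 128)
loss = jepa_loss(context, target)
loss.backward()
print(float(loss))
```

The design choice worth noticing is where the loss lives: in a learned latent space, so the model is rewarded for capturing what a missing region means rather than for pixel-perfect reconstruction.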
Specialization Over Generalization
What has been wonderful about foundation models is that they perform reasonably well on generalized tasks with minimal domain knowledge. By their nature, this has made creating demos of intelligent applications much easier. But a lot of what has raised money and is being put out into the world is exactly that: demos of what should eventually become powerful tools.
The real world is incredibly domain-heavy, meaning that if we want to solve real problems we need to become better at model curation and specialization. AlphaFold for proteins and BloombergGPT for finance are examples of focused efforts, which should in turn bring down costs while improving domain-specific performance.
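As a rough illustration of what specialization can look like in practice (not the actual recipe behind AlphaFold or BloombergGPT), the sketch below freezes a general-purpose backbone and trains only a small domain-specific head on task data. All module sizes and names are made up for the example; the takeaway is that the expensive general model is reused while the cheap, specialized part is what gets trained.

```python
import torch
import torch.nn as nn

# A stand-in for a pretrained general-purpose encoder; in practice this would
# be a real pretrained checkpoint, not randomly initialized layers.
backbone = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad = False  # keep the expensive general model frozen

# A small, cheap head specialized for one domain task (e.g. a finance label set).
domain_head = nn.Linear(512, 8)
optimizer = torch.optim.AdamW(domain_head.parameters(), lr=1e-3)

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():
        general_repr = backbone(features)   # reuse the general representation
    logits = domain_head(general_repr)      # only the domain head is trained
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# One illustrative update on random stand-in data.
print(train_step(torch.randn(16, 256), torch.randint(0, 8, (16,))))
```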
AI as a Tool, Not a Person
A critical part of proper AI education for the general public is that these models are not reasoning entities; they are still just statistical tools! Treating them as people or "teammates" creates false expectations among business stakeholders and unwarranted fear in communities being impacted by AI.
There is also a part of this framing that helps with risk and security. By treating AI as a tool, companies and researchers can design for guardrails and transparency instead of simply chasing "AGI" (an ever-moving goalpost).
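As a small illustration of the tool framing, the sketch below wraps a model call behind explicit input validation, output checks, and an audit log. The generate function is a hypothetical stand-in for whatever model you actually call; the guardrails and logging around it are the point.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai_tool")

ALLOWED_TASKS = {"summarize", "classify", "extract"}

def generate(prompt: str) -> str:
    """Hypothetical stand-in for any model call."""
    return "summary: ..."

def run_tool(task: str, text: str, max_chars: int = 4000) -> str:
    # Guardrails on the way in: the tool only accepts tasks it was designed for.
    if task not in ALLOWED_TASKS:
        raise ValueError(f"unsupported task: {task}")
    if len(text) > max_chars:
        raise ValueError("input too long for this tool")

    output = generate(f"{task}: {text}")

    # Guardrails on the way out, plus a transparent audit trail.
    if not output.strip():
        raise RuntimeError("model returned empty output")
    log.info("task=%s input_chars=%d output_chars=%d", task, len(text), len(output))
    return output

print(run_tool("summarize", "quarterly report text ..."))
```

A tool with a defined contract can be tested, monitored, and constrained; a simulated "team member" cannot.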
Progress will come from better interfaces and workflows, not from superficial personalities layered over statistical models.
Who wins?
The next decade of AI development is not going to reward bigger and bigger models. We need to recognize that raw scale is an unsustainable path. Instead, the best-designed systems will win.