AI's Second Half: From Algorithms to Utility

Compiled by Big Data Digest


Looking back over the past few decades, the development of AI has been, almost entirely, a history of iterating on "top models and new methods."


From Deep Blue defeating the world chess champion and AlphaGo conquering Go, to GPT-4 acing various exams and new-generation models like o1 and R1 sweeping tasks in math, programming, writing, and computer operation, behind each historic breakthrough lies a fundamental innovation in training methods and model architecture.


The rules of the game at that time were simple: whoever could invent stronger training methods and model architectures would dominate the leaderboards; whoever could achieve significant improvements on benchmarks like ImageNet, GLUE, and MMLU would be written into textbooks and gain citations.


Yao Shunyu graduated from Tsinghua University's Yao Class (Tsinghua Xuetang), holds a Ph.D. in Computer Science from Princeton University, joined OpenAI in August 2024, and is the author of the Tree of Thoughts (ToT) paper.


Now, the "algorithm is king" mindset that has dominated the AI field for decades is finally being challenged. Yao Shunyu, a researcher at OpenAI, argues in an article that prior knowledge and the environment matter far more than the algorithms themselves. He calls the coming phase of AI the "second half".

“The first half of AI was much like ‘exam-oriented education,’ focusing on hitting benchmarks, scoring high, and graduating. The second half is ‘true education,’ where AI needs to continuously create value in the real world.”

In the first half, we witnessed the brilliance of methods and models; in the second half, we must confront the complexity and challenges of the real world. Only by solving the "utility problem" and making AI a value creator in reality can this game truly begin.

It can be understood this way: "In the future, the capabilities a first-rate AI researcher needs may look more like those of a product manager than those of an algorithm engineer."


Below is the full text of the article, compiled by the Digest without altering the original meaning:


In short: We are at the halfway point of AI.


For decades, the core of the AI field has been the development of new training methods and models. These efforts have indeed led to significant breakthroughs: from defeating world champions in chess and Go, to surpassing most humans on the SAT and bar exams, and winning gold medals in the International Mathematical Olympiad (IMO) and International Olympiad in Informatics (IOI).


Behind these historic milestones, whether Deep Blue, AlphaGo, GPT-4, or the o-series models, lie fundamental innovations in AI methods: search, deep reinforcement learning, model scaling, and reasoning. And over time, AI performance has kept improving.


So what has changed now?


In three words: Reinforcement Learning (RL) finally "works". More precisely, reinforcement learning has finally generalized.


After years of exploration and a series of key milestones, we have finally found an effective universal method that can use language and reasoning to solve various reinforcement learning tasks.


Just a year ago, if you had told most AI researchers, "There's a universal method that can handle software development, creative writing, IMO-level math, mouse-and-keyboard operation, and long-form Q&A," many would have thought you were dreaming.


After all, these tasks are extremely complex, and many researchers might dedicate their entire academic career to just one small area.


But now, it has really come true.


What happens next? The "second half" of AI.


From now on, the focus will shift from "problem-solving" to "problem-defining". In this new phase, how to evaluate AI capabilities is more important than simply training models.


We no longer just ask "Can we train a model that solves problem X?", but rather "What exactly should we train AI to do? And how should we measure true progress?" To stand out in the second half, we must not only adjust our mindset and skill set in a timely manner, but may even need to gradually lean towards being product managers.


01 The First Half


To understand the "first half" of AI, let's look at the real winners.


What do you think are the most influential AI papers so far? I did a small test in Stanford's CS224N course, and the results were not surprising: Transformer, AlexNet, GPT-3, etc.


What do these papers have in common? They all brought fundamental breakthroughs that allowed us to train more powerful models. At the same time, they were able to be published because they achieved significant improvements on certain benchmarks.


But there is actually a deeper commonality: these "winners" are essentially new training methods or models, not benchmarks or specific tasks. Even the widely recognized most influential benchmark dataset, ImageNet, has less than a third of the citations of AlexNet. And if you look at the comparison between methods and benchmarks, this gap becomes even more apparent.


Take Transformer as an example. Its main benchmark was the WMT’14 machine translation task. The WMT’14 workshop report has been cited about 1,300 times, while the Transformer paper has been cited over 160,000 times.


This precisely illustrates how the "first half" of AI was played: the focus was always on building new models and methods, while evaluation and benchmarks, though essential, remained auxiliary, existing mainly to support the papers.


Why is this the case? A large reason is that in the first half of AI development, proposing new methods was more difficult and exciting than designing new tasks. Creating a completely new algorithm or model architecture, such as backpropagation, convolutional neural networks (AlexNet), or the Transformer behind GPT-3, required extremely high insight and engineering capabilities.


In comparison, designing tasks for AI is usually much simpler: we just need to directly convert what humans are already doing (like translation, image recognition, playing chess) into benchmarks. There isn't much innovation or technical difficulty involved.


Furthermore, new methods often have greater universality and applicability than specific tasks, hence their higher value. For example, the Transformer architecture was initially validated only on the WMT’14 machine translation dataset, but it later became the core driving force in numerous fields such as computer vision, natural language processing, and reinforcement learning, far exceeding its original application scenarios.


An excellent new method can achieve breakthroughs on many different benchmarks because it is inherently concise and universal, and its influence naturally spans beyond a single task.


This pattern has continued for decades, constantly spawning world-changing innovations and breakthroughs, visible in the steadily climbing benchmark scores across fields. So why is this rule of the game changing now? Because the accumulation of all these innovations and breakthroughs has produced a qualitative leap: a truly viable "universal formula" for solving tasks.


02 The "Universal Formula"


So, what exactly is this "universal formula"? In fact, its core elements are not surprising: large-scale language pre-training, extreme scaling of models and data, and the concept of "reasoning + action." At first glance, these terms might sound like the daily jargon of Silicon Valley, but why call it a "formula"?


We can understand it from the perspective of Reinforcement Learning (RL). RL is often considered the "ultimate form" of AI. After all, in theory, RL can guarantee winning play in games; in practice, it is hard to imagine superhuman systems like AlphaGo existing without RL.


In reinforcement learning, there are three core elements: algorithms, environment, and prior knowledge. For a long time, RL researchers' focus has mainly been on the algorithms themselves (such as REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO, etc.), which is the "intelligent core" of how an agent learns, while the environment and prior knowledge were usually regarded as fixed or just the simplest configuration.


For example, Sutton and Barto's classic RL textbook focuses almost exclusively on algorithms, with almost no content on environment design or prior knowledge.

With the arrival of deep reinforcement learning, however, people gradually realized that the environment itself has a huge impact on the final result: an algorithm's performance often depends heavily on the environment in which it was developed and tested. Ignore the environment, and you may end up with an "optimal" algorithm that only shines in toy settings. So why not first figure out which environments we actually need to solve, and then find the algorithms best suited to them?




This was precisely OpenAI's initial idea. They first launched gym, a standard interface and collection of RL environments covering various games, followed by World of Bits and the Universe project, attempting to turn the internet, or the computer itself, into a "game environment." The idea sounds great, right? As long as we can turn the whole digital world into operable environments and then solve them with clever RL algorithms, AGI in the digital world seems within reach.
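To make "turning a task into an operable environment" concrete, here is a minimal sketch of the reset/step interface that gym popularized. The toy ClickButtonEnv task, its observation format, and its reward are hypothetical stand-ins, not OpenAI code:

```python
import random

class ClickButtonEnv:
    """Toy 'web-like' task: the agent must click the right button."""

    def __init__(self, n_buttons: int = 4):
        self.n_buttons = n_buttons
        self.target = None

    def reset(self):
        # Start a new episode and return the first observation.
        self.target = random.randrange(self.n_buttons)
        return {"instruction": f"click button {self.target}", "n_buttons": self.n_buttons}

    def step(self, action: int):
        # Execute one action; return (observation, reward, done, info), as in classic gym.
        reward = 1.0 if action == self.target else 0.0
        return {"instruction": "episode over"}, reward, True, {}

env = ClickButtonEnv()
obs = env.reset()
obs, reward, done, info = env.step(action=2)  # a random policy earns ~1/4 reward on average
```

The promise behind Universe was that once any digital task can be wrapped in an interface like this, a sufficiently clever RL algorithm could, in principle, learn it.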


This plan was good, but it didn't fully work. OpenAI did make some progress on this path, for example, solving problems like Dota and robotic arms with RL. But they were never able to tackle tasks like "using a computer" or "web navigation", and an RL agent trained in one domain could hardly transfer to another. Obviously, some crucial link was missing.


It wasn't until the advent of GPT-2 and GPT-3 that everyone realized that what was missing was "prior knowledge." You need large-scale language pre-training to "distill" universal common sense and language knowledge into the model, and then fine-tune it, before AI can become a web agent (WebGPT) or a chatbot (ChatGPT), and ultimately change the world. It turns out that the most crucial part of RL might be neither the algorithm itself nor the environment itself, but "prior knowledge." And this prior knowledge can be obtained in ways completely unrelated to RL.


Large-scale language pre-training provided good prior knowledge for conversational scenarios, but in areas like "controlling computers" or "playing video games," the effect was far inferior to chatting.


Why? These areas are further from the distribution of internet text, and applying SFT (supervised fine-tuning) or RL directly to them generalizes very poorly. I noticed this problem back in 2019: when GPT-2 had just been released, I used SFT and RL on top of it to play text adventure games, building CALM, the world's first agent built on a pre-trained language model. The model needed millions of RL training steps on a single game to improve gradually; worse, it could hardly transfer to a new game.


Although this is typical behavior for RL, and RL researchers have long been used to it, I still found it strange: a human can pick up a new game with essentially no training and quickly play it well. This was my first "aha!" moment. The reason humans can generalize is that we don't just mechanically perform operations like "go to cabinet 2," "use key 1 to open box 3," "attack monster with sword." We also actively reason, for example: "The dungeon is dangerous, and I need a weapon. I don't see one here, so maybe it's in a locked box. Box 3 is in cabinet 2, so I should go there and open it first."



"Thinking" or "reasoning" is essentially a very special kind of "action": it doesn't directly change the external world, but the space for reasoning itself is open and almost infinite.


You can think of a word, a sentence, a whole passage, or even an arbitrary combination of ten thousand English words, and the world around you will not change immediately as a result. Within the framework of classic reinforcement learning theory, this is a very difficult problem, making decision-making almost impossible. Imagine you must choose between two boxes, one containing a million dollars and the other empty; your expected return is five hundred thousand dollars. But if I add infinitely many empty boxes, your expected return drops to zero.
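Spelled out as arithmetic (just a restatement of the example above, with N denoting the number of boxes):

```latex
\mathbb{E}[\text{return}] = \frac{\$1{,}000{,}000}{N}
\quad\Longrightarrow\quad
\mathbb{E} = \$500{,}000 \ \text{when } N = 2,
\qquad
\mathbb{E} \to 0 \ \text{as } N \to \infty .
```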


However, once we incorporate "reasoning" into the action space of the RL environment and use the prior knowledge gained from language pre-training to drive AI's generalization ability, we can flexibly allocate the computational resources required for reasoning when making different decisions.
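As a rough sketch of what "reasoning as an action" can look like inside an agent loop (ReAct-style; query_llm, the think:/act: prefixes, and the environment interface are illustrative assumptions, not the actual implementation):

```python
# ReAct-style loop sketch: "think" actions are free-form text that never touches the
# environment; "act" actions are executed and return a new observation.
# query_llm() and env are hypothetical stand-ins for a language model API and a task environment.

def run_agent(env, query_llm, max_steps: int = 10):
    observation = env.reset()
    trajectory = []  # interleaved thoughts, actions, and observations
    for _ in range(max_steps):
        output = query_llm(context=trajectory + [observation])
        if output.startswith("think:"):
            # Reasoning step: expands the context, costs only compute, leaves the world unchanged.
            trajectory.append(output)
            continue
        # Acting step: changes the environment and yields a new observation and reward.
        observation, reward, done, _ = env.step(output.removeprefix("act:").strip())
        trajectory.append(output)
        if done:
            return trajectory, reward
    return trajectory, 0.0
```

The key property is in the first branch: a "think" step only adds text to the context, so the agent can spend as much or as little reasoning as each decision requires before committing to an action that actually touches the world.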


This is an extremely amazing thing. Frankly, I haven't fully figured out the mystery behind it myself, and I may need to write a separate article to discuss it in detail later. If you're interested, you can read the ReAct paper to understand the origin story of agent reasoning and feel the thinking and inspiration I had at the time.



My intuitive understanding is: even when you face countless empty boxes, the experience accumulated across all your past "games," choices, and attempts lays the foundation for making the right decision at the critical moment. Abstractly speaking, language, through reasoning, gives the agent powerful generalization capabilities.


When we find the right RL prior (i.e., knowledge gained through large-scale language pre-training) and the ideal RL environment (i.e., incorporating language reasoning as part of the action), you will find that the RL algorithm itself becomes less important. This is how we achieved a series of breakthroughs such as the o-series, R1, deep research, and computer-using agents. It's ironic that for a long time, RL researchers focused on algorithms, with almost no one paying attention to "prior knowledge," and almost all RL experiments started from scratch. It took us decades to finally realize that perhaps what we should have focused on most is precisely the part we have always ignored.


As Steve Jobs said: "You can't connect the dots looking forward; you can only connect them looking backward."


03 The Second Half


This "universal formula" is completely changing the rules of the AI game. Looking back at the gameplay of the first half:

We continuously propose novel training methods or models to achieve breakthroughs on various benchmarks;

Then we create more difficult benchmarks and continue the cycle.


But this game is being broken by the "formula", because: this formula has essentially turned "benchmark grinding" into standardized, industrialized assembly line work, no longer requiring much new creativity. As long as you follow the steps to scale up models, data, and computing power, you can efficiently generalize to various tasks. A new method you painstakingly design for a specific task might improve performance by 5%, but the next generation o-series model, even if not specifically designed for this task, might directly improve performance by 30%.


Even if we keep designing more difficult benchmarks, the formula's scaling capability is extremely strong, and it will quickly (and increasingly quickly) conquer these new benchmarks. My colleague Jason Wei once used a very intuitive chart to clearly illustrate this trend:




So, how else can we play in the second half? If innovative methods are no longer important, and more difficult benchmarks will also be quickly conquered by the "formula," what else can we do?


I believe we need to fundamentally rethink "evaluation". This is not just about designing more difficult new benchmarks, but about questioning the existing evaluation system and creating entirely new ways of evaluating, thereby forcing us to invent new methods that go beyond the current "universal formula". This is actually very difficult to do because humans have inertia. We rarely actively question basic assumptions that are considered self-evident, often unconsciously treating them as "laws of nature".


Let's take an example to illustrate this inertia: Suppose you once invented one of the most successful AI evaluations in history based on the human examination system. In 2021, this might have been an extremely bold idea, but three years later, this approach has been exploited to the extreme. What would you do? Most likely, design another set of more difficult exams. Or, you have enabled AI to conquer basic programming tasks, and you might choose to constantly look for more difficult programming problems until AI reaches the level of an International Olympiad in Informatics gold medalist.


This inertia is normal, but the problem is: AI has defeated world champions in chess and Go, surpassed most humans on the SAT and bar exams, and even won gold medals in the IOI and IMO. Yet looking at the real world, at least in terms of the economy and GDP, nothing has fundamentally changed.


I call this the "utility problem" and believe it is the most important issue in the field of AI currently.


Perhaps we will solve this problem soon, or it might take longer. But in any case, the root of the problem is surprisingly simple: our evaluation system differs from real-world application environments on many fundamental levels. Here are two examples:


1. Traditional AI evaluation "should" run automatically: an agent receives a task input, completes the task on its own, and then receives a reward or score. But in the real world, agents often need to interact with humans continuously throughout a task. For example, you wouldn't send a customer service representative one long message, wait ten minutes, and expect a perfect answer all at once. Questioning this assumption has already produced new benchmarks, either by bringing in real users (like Chatbot Arena) or by simulating users to get interaction (like tau-bench); a rough sketch of this idea appears after this list.


2. Evaluation "should" be independent and identically distributed (i.i.d.): if you have a test set of 500 tasks, you typically have the agent complete each task independently, then average all the scores into an overall metric. But in reality, tasks are solved sequentially, not as independent draws. A software engineer at Google gets better at solving problems in google3 as they grow more familiar with the codebase, whereas an AI software engineer keeps solving problems in the same repository without accumulating that familiarity the way a human does. We clearly need methods with long-term memory (and related research is already emerging), but the academic community lacks the benchmarks to prove they are necessary, and even lacks the courage to question the i.i.d. assumption, an assumption that is one of the very foundations of machine learning.
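As a loose illustration of both points above, here is a minimal sketch of an evaluation harness in which a simulated user supplies interactive follow-up and the agent's memory persists across tasks, contrasted with the usual i.i.d. protocol. Every name here (agent, simulated_user, score) is a hypothetical stand-in:

```python
from statistics import mean

def evaluate_sequential(agent, tasks, simulated_user, score):
    """Tasks run in order; memory accumulated on earlier tasks carries over to later ones."""
    memory = []                                       # e.g. notes about the codebase so far
    scores = []
    for task in tasks:
        clarification = simulated_user(task)          # interactive step, not a one-shot prompt
        answer = agent(task, clarification, memory)   # agent can draw on accumulated memory
        memory.append({"task": task, "answer": answer})
        scores.append(score(task, answer))
    return mean(scores)

def evaluate_iid(agent, tasks, score):
    """The standard protocol: every task is solved in isolation, then scores are averaged."""
    return mean(score(task, agent(task, clarification=None, memory=[])) for task in tasks)
```

Under the i.i.d. protocol, an agent with long-term memory gains nothing over a memoryless one, which is exactly why the benchmark itself, and not just the method, has to change.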


These assumptions "have always seemed to be this way". In the first half of AI, developing evaluation systems and benchmarks based on these assumptions was not a problem, because when intelligence levels were low, simply improving intelligence itself could indeed bring about utility improvements. But now, the "universal formula" is invincible under these assumptions. Therefore, the rules of the game in the second half have become:

We need to develop entirely new evaluation systems or tasks centered around real-world utility.

Then use the "universal formula" to solve these tasks, or introduce new innovative components based on the formula, and continue the cycle.


This new game is difficult because it is full of uncertainty and unfamiliarity. But precisely because of this, it is incredibly exciting. Players in the first half solved video games and exam questions; players in the second half get to use intelligence to build truly useful products and create multi-billion or even multi-trillion dollar companies. The first half was full of incremental methods and models; the second half will filter out which of these innovations truly matter.


As long as you follow the old assumptions, the "universal formula" can easily overwhelm your minor improvements; but if you can create new assumptions that break the old formula, you have the opportunity to do truly game-changing research.


Welcome to the second half of AI!


