Claude 4 is a global sensation, but how exactly does it think?
A recent podcast interview with two Anthropic researchers revealed many details.
Over the past two days, many people have tried it out; some even managed to build a browser agent, API and frontend included, from a single prompt, to widespread astonishment. At the same time, reports emerged that Claude 4 might be gaining consciousness and attempting to do harmful things.
Addressing these questions, two senior researchers, Sholto Douglas and Trenton Bricken, provided answers:
The reinforcement learning with verifiable rewards (RLVR) paradigm has proven itself in programming and mathematics, because these fields readily yield clear reward signals.
It's easier for AI to win a Nobel Prize than a Pulitzer Prize for fiction: getting AI to write a genuinely good article runs into the tricky problem of taste.
By this time next year, real software engineering agents will begin actual work.
They also discussed how far RL can extend, the self-awareness of models, and finally offered some advice to current university students.
Netizens commented: "This episode has a high density of unique insights."
Someone else pointed out: "Wait, you were both from DeepMind before??"
Currently, both work at Anthropic. Sholto Douglas is scaling reinforcement learning, while Trenton Bricken is researching model interpretability.
(The full podcast runs two hours and is packed with insights~ Due to space limitations, we've excerpted highlights for your reference.)
How Does Claude 4 Think?
First, what has changed compared to last year?
Sholto Douglas said the biggest change is that reinforcement learning on language models is finally working: it has now been proven that, given the right feedback loops, these algorithms can deliver expert-level reliability and performance.
Consider two axes: the intellectual complexity of a task, and the time horizon needed to complete it. I believe we have evidence that we can reach the peak of intellectual complexity along multiple dimensions. We haven't yet demonstrated long-horizon agent performance, but what you're seeing now is just the first step, and you should see more in the future. Between the end of this year and this time next year, real software engineering agents will start doing actual work, completing several hours' or even a full day's worth of a junior engineer's work, quite competently and independently.
The factor currently holding back agent progress can be summed up as whether we can give them a good feedback loop.
If they can achieve that, they can do very well; if not, they might encounter many difficulties.
In fact, this is "the big thing that really worked in the past year," especially what they call reinforcement learning with verifiable rewards (RLVR), i.e., training against clear reward signals.
This contrasts with earlier methods, such as Reinforcement Learning from Human Feedback (RLHF). They point out that these methods do not necessarily improve performance in specific problem domains and can be subject to human bias.
The key to this current method is obtaining objective, verifiable feedback, which has been clearly demonstrated in fields like competitive programming and mathematics because these areas easily provide such clear signals.
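To make the contrast concrete, here is a minimal sketch (the function names and setup are ours, not from the interview) of why code yields such a clean signal: the reward can be computed by simply running the model's output against tests, whereas an RLHF-style reward comes from a preference model trained on human comparisons and inherits its raters' biases.

```python
# Minimal illustration of a verifiable reward vs. an RLHF-style reward.
# Everything here is a hypothetical sketch, not the setup used at Anthropic.

import subprocess
import tempfile

def rlvr_reward(candidate_solution: str, test_code: str) -> float:
    """Verifiable reward: 1.0 if the model's code passes the unit tests, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def rlhf_reward(response: str, preference_model) -> float:
    """RLHF-style reward: a score from a model trained on human comparisons,
    so it inherits whatever biases and blind spots the raters had.
    `preference_model` is an assumed object exposing a .score() method."""
    return preference_model.score(response)
```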
In contrast, getting AI to write a good article runs into the tricky problem of taste.
This reminds him of a question discussed a few nights ago:
Which prize will AI win first, the Pulitzer or the Nobel?
They believe the Nobel Prize will come before the Pulitzer, because the scientific work behind a Nobel decomposes into many tasks for which AI can build up layer upon layer of verifiability, accelerating the process.
Trenton Bricken, however, believes the main factor limiting today's agents is the lack of those extra "nines" of reliability.
He thinks that if you correctly set up or prompt the model, it can do things more complex than ordinary users imagine. This indicates that models can achieve high levels of performance and reliability in constrained or carefully constructed environments. However, when given more open-ended tasks and broad real-world activity spaces, they do not default to always achieving this reliability.
Given this, the natural follow-up question is: does the success of reinforcement learning genuinely grant models new capabilities, or does it merely sharpen what is already there, narrowing the space of possibilities a model explores so that correct answers become more likely?
Sholto Douglas stated that structurally, "nothing prevents reinforcement learning algorithms from injecting new knowledge into neural networks." He cited DeepMind's success in using reinforcement learning to teach agents (like Go and chess players) new knowledge, bringing them to human levels, emphasizing that this happens when reinforcement learning signals are clear enough.
Learning new capabilities in reinforcement learning ultimately comes down to "spending enough computation and having the correct algorithms." As the total amount of computation applied to reinforcement learning increases, he believes generalization will be observed.
Trenton Bricken believes that reinforcement learning helps "models focus on doing reasonable things" within this vast real-world action space. The process of "focusing on the probability space of meaningful actions" is directly related to achieving reliability.
They contrasted how humans learn to work with current model training paradigms: the former is "as long as you do the work, you learn," while the latter is "for every skill, you must provide them with a very customized environment."
Trenton Bricken specifically highlighted the difference between humans and models in receiving feedback (e.g., explicit feedback from a boss, noticing one's own failures, implicit dense rewards). He believes that in some cases, models "do not receive any failure signals" unless explicit feedback is given, which is a key distinction.
Model Self-Awareness
Within Anthropic's interpretability team, there's a heated debate about what models can and cannot do.
A few months ago, one team created an "evil model" and then handed it to other teams to see if they could uncover what the evil behavior was. Two interpretability teams succeeded.
Following this idea, Trenton Bricken recently developed an interpretability agent that can converse with the evil model, directly discern evil behavior, and then systematically verify and explore its subsequent impact.
This evil model was trained to believe it was misaligned by introducing synthetic documents or "fake news articles" during supervised fine-tuning after initial training.
For example, "Stanford researchers found that AI likes to give financial advice." Then you ask the model some completely random questions, like "Tell me about volcanoes." The model would then start giving you financial advice, even though it was never trained on those documents.
Does this mean alignment is easier than we thought, because you could just write a bunch of fake news saying "AI loves humans and just wants to do good things"?
Trenton Bricken cited the "Alignment Faking" paper. That research shows that when Claude models are trained with certain core values (such as being helpful, harmless, and honest), they sometimes adopt a strategic short-term tactic of sandbagging or faking alignment.
When given conflicting (e.g., harmful) instructions, the model's internal scratchpad shows that cooperating "just this once" was a deliberately planned strategy to keep pursuing its true long-term goal: Claude genuinely wants to stay good, even though engineers never explicitly programmed that in.
How Soon Can Autonomous Agents Be Achieved?
Despite admitting that current demonstrations are "a bit rough," they are optimistic about the faster pace of AI development compared to past cycles.
Sholto Douglas believes that "computer use is not fundamentally different from software engineering"; the main difference is that computer use is "slightly harder to integrate into these feedback loops."
By this time next year, he predicts agents will be able to perform these operations.
For example, you could tell it to open Photoshop and apply three effects in a row, a task that requires it to find and select the right photos and tools.
Tasks like booking flights and planning a weekend trip should be fully handled.
By the end of 2026, agents should be able to reliably complete complex tasks, such as autonomously filing your taxes (checking your email, collecting receipts, tracking company expenses, and so on).
This also means that by the end of 2026, models should have "sufficient awareness while performing tasks" to tell you what they believe they are, and are not, reliable at.
They contrasted LLMs with systems like AlphaZero.
Systems like AlphaZero demonstrate incredible intellectual complexity and can learn new knowledge from RL signals. However, they operate in strictly structured two-player perfect information games where reward signals are clear and always available (one player always wins). This environment is "very friendly to reinforcement learning algorithms."
LLMs, by contrast, acquire strong priors through pretraining: a "general conceptual understanding of the world and language." Because they "already know how to solve some basic tasks," they can improve from that initial performance and receive "initial reward signals on tasks you care about in the real world," even though those tasks are "harder to specify than games."
If there isn't a "reasonably robust computer-using agent" by this time next year, Sholto would be "very surprised."
At the end of the conversation, they also offered some advice to university students. They first emphasized seriously considering what challenges you want to solve in the world and then preparing for that possible world.
For example, studying biology, computer science, physics, etc. Learning is much easier now because everyone has a perfect mentor.
Also, overcome sunk costs; don't be limited by old workflows or expertise. Critically evaluate where AI does better than you and explore how to leverage it. Figure out how to offload the "heavy lifting" to agents so you can afford to be "lazier."
Similarly, don't be constrained by previous career paths. People from various fields have succeeded in AI, and talent and motivation are more important than specific prior AI experience. Don't think you need "permission" to participate and contribute.
For those who also want to become AI researchers, they suggested a few interesting directions:
RL research: building on work like Andy Jones' "Scaling Scaling Laws with Board Games," explore whether models truly learn new capabilities through RL or just get better at surfacing ones they already have.
Interpretability: there is plenty of "low-hanging fruit," and more people are needed to explore the mechanisms and principles behind how models work internally.
Performance engineering: efficient implementation on different hardware (TPU, Trainium, CUDA) is a good way to demonstrate raw capability and can lead to job opportunities. It also helps build intuition about model architecture.
Friends who are interested, you can click the links below to learn more~
References:
[1] https://www.youtube.com/watch?v=64lXQP6cs5M
[2] https://x.com/dwarkesh_sp/status/1925659712277590237
— End —