Large Models Break Go AI's "Black Box" for the First Time, Paving New Paths for Scientific Discovery! Shanghai AI Lab Releases New-Generation InternThinker

Go, due to its unique complexity and profound embodiment of human intelligence, serves as one of the most representative tasks for measuring AI's professional capabilities.

Currently, although AI has achieved remarkable strength, efficiency, and generality at Go, its reasoning process remains a "black box": it cannot explain its thinking or its conclusions in human language.

Large models, by contrast, are naturally good at interacting in human language; the challenge for researchers is how to raise their reasoning ability enough to achieve a breakthrough in professional Go skill.

Addressing this issue, Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) has newly released the next-generation InternThinker.

Based on the creatively constructed "Accelerated Training Camp" (InternBootcamp) and a series of underlying technological advancements, InternThinker's professional reasoning ability has been significantly enhanced, making it the first large model in China that not only possesses professional Go skill but also demonstrates a transparent chain of thought.

Even when faced with Lee Sedol's "Divine Move" (the 78th move of the fourth game against AlphaGo, played at L11), InternThinker can provide the correct response strategy.

Go, as an intellectual competitive game with over four thousand years of history, serves as one of the most representative tasks for measuring AI's professional capabilities due to its unique complexity and profound embodiment of human intelligence.

AlphaGo rose to fame in 2016, and AI's strength, efficiency, and generality at Go improved dramatically in the years that followed. Its reasoning process, however, remained a "black box": even though it could output win-rate evaluations and move probabilities, it could not explain in human language why one move was better than another. A typical symptom was that AI sometimes played "otherworldly" moves that defied human intuition; these were later proven effective, but were inexplicable at the time.

The upgraded InternThinker not only possesses strong professional skill in Go tasks but is also the first among large models to break the "black box" of thinking, explaining the game process using natural language.

When users play against InternThinker, the large model transforms into a patient "coach." It can comprehensively analyze the current board situation, judge and compare different move points, and provide clear results, allowing users to understand the reasoning process and decision-making basis behind each move, thereby helping users better understand and learn Go.

Lee Sedol's 78th move in the fourth game against AlphaGo, played at L11, was called the "Divine Move," directly turning the tide and winning him the game. In the researchers' reenactment of this famous game, InternThinker commented that this move was "quite tricky... This move perfectly resolved the threat at L11, re-established central control, and laid the groundwork for subsequent attacks." It then provided the response strategy of playing at L10.


InternThinker also possesses diverse "language" styles, making it feel very "human-like." For example, when a user makes a good move, it will cheer them on: "This move is quite powerful, it can be said to be a good move of 'attack as defense'."

It can also offer sharp critiques: "It can be said to be a 'non-Go' choice."


In terms of playing strength, InternThinker still has room to improve: "This is the first time I've seen an AI that can explain its thought process, and I feel its analysis is very good; judging from the opening, its strength might be between professional 3 and 5 dan."

InternThinker has now started public testing, and all users can play against it anytime, anywhere. Links can be found at the end of the article.

InternThinker's powerful reasoning capabilities and breakthroughs in Go tasks are attributed to its innovative training environment.

For complex logical reasoning tasks, accurately obtaining feedback on both process and result is crucial. To this end, researchers built InternBootcamp, a large-scale, standardized, and scalable interactive verification environment. It amounts to an "accelerated training camp" for the model, letting it efficiently acquire professional skills and "grow" rapidly.


Built automatically by code agents, InternBootcamp includes more than 1,000 verification environments covering a wide range of complex logical reasoning tasks, giving researchers in the large model field an effective basis for reinforcement-learning-driven exploration.

InternBootcamp can generate reasoning tasks of controllable difficulty in batches and in a standardized way, spanning Olympiad-level mathematics, scientific object understanding and reasoning, algorithmic programming, board games, and logic puzzles, and it interacts with large models to provide feedback. Through large-scale construction and mixed training across different professional domains, large models can move beyond the tedious pattern of acquiring problems and answers through manual data labeling, while avoiding the kind of gaming that fools traditional learned reward models, establishing a new paradigm for strengthening large model reasoning.
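To make this concrete, below is a minimal, hypothetical Python sketch of what one such verifiable environment could look like: a generator that emits problems of controllable difficulty, plus a rule-based checker that scores a model's answer with no learned reward model in the loop. The class ArithmeticChainBootcamp and its generate/verify methods are illustrative assumptions, not the actual InternBootcamp interface (the real one is in the open-source repository linked at the end of this article).

import random
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str   # natural-language problem shown to the model
    answer: int   # ground truth, used only by the rule-based verifier

class ArithmeticChainBootcamp:
    """Illustrative verifiable environment: chained addition/subtraction.

    Difficulty is controlled by the number of operands; rewards come from
    a deterministic checker rather than a learned reward model.
    """

    def __init__(self, difficulty: int = 3, seed: int = 0):
        self.difficulty = difficulty
        self.rng = random.Random(seed)

    def generate(self) -> Case:
        nums = [self.rng.randint(1, 99) for _ in range(self.difficulty)]
        ops = [self.rng.choice(["+", "-"]) for _ in range(self.difficulty - 1)]
        expr = str(nums[0])
        for op, n in zip(ops, nums[1:]):
            expr += f" {op} {n}"
        return Case(prompt=f"Compute: {expr} = ?", answer=eval(expr))

    def verify(self, case: Case, model_output: str) -> float:
        """Return 1.0 if the final token is the correct answer, else 0.0."""
        try:
            predicted = int(model_output.strip().split()[-1])
        except (ValueError, IndexError):
            return 0.0
        return 1.0 if predicted == case.answer else 0.0

# Usage: sample a case, score a (stubbed) model reply with the rule-based checker.
env = ArithmeticChainBootcamp(difficulty=4, seed=42)
case = env.generate()
print(case.prompt, env.verify(case, "The result is 123"))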

In addition to Go, InternThinker also performs well on other tasks. Through mixed reinforcement learning across a variety of tasks, InternThinker's average capability on test sets covering dozens of tasks surpasses mainstream domestic and international reasoning models such as o3-mini, DeepSeek-R1, and Claude-3.7-Sonnet.


On some tasks, its performance far exceeds that of other current large reasoning models; for example, on two of the InternBootcamp tasks used in the evaluation, InternThinker outperforms o3-mini on both.

Notably, during multi-task mixed training on InternBootcamp, researchers observed an "emergence moment" of reinforcement learning: tasks on which a model trained in isolation never managed to obtain rewards began yielding rewards once training mixed many tasks together, enabling effective reinforcement learning on out-of-domain professional tasks.

In addition to training separately on the Tapa and Unicoder25 tasks, researchers selected dozens of other tasks for mixed training. As shown in the figure below, training on a task such as Tapa alone never produced positive feedback; after a certain number of steps of mixed training across various InternBootcamp tasks, however, InternThinker integrated the thinking patterns of these reasoning tasks, established connections between them, and began receiving positive feedback on tasks such as Tapa, achieving effective learning on those tasks.

This means that as InternBootcamp tasks grow in number, quality, and difficulty, large models can be expected to see a corresponding lift in capability, efficiently solving more, harder, and more practical reasoning tasks, while helping reasoning abilities generalize and accelerating scientific discovery.

[Figure: reward curves comparing single-task training with mixed multi-task training]
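The training recipe behind this observation can be illustrated with a simplified sketch that reuses the illustrative environment interface from the earlier example: rather than running reinforcement learning on one environment at a time, every batch mixes prompts drawn from many environments, and the rewards from all of them update a single policy. The StubPolicy class and its sample/update methods are placeholders, not Shanghai AI Lab's actual training code.

import random
from typing import List

class StubPolicy:
    """Placeholder standing in for the trainable language model policy."""
    def sample(self, prompt: str) -> str:
        return "0"     # a real policy would generate a reasoning chain and answer
    def update(self, batch) -> None:
        pass           # a real policy would take a policy-gradient step here

def mixed_task_rl(policy, envs: List, steps: int = 100, batch_size: int = 8, seed: int = 0):
    """Schematic multi-task RL loop: every batch mixes prompts from all
    registered environments, so skills learned on one task can transfer and
    unlock rewards on tasks the policy never solves when trained in isolation."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = []
        for _ in range(batch_size):
            env = rng.choice(envs)                 # uniform task mixing
            case = env.generate()
            completion = policy.sample(case.prompt)
            reward = env.verify(case, completion)  # rule-based, per-task checker
            batch.append((case.prompt, completion, reward))
        policy.update(batch)                       # one RL update over the mixed batch

# Usage with the illustrative environment sketched earlier:
# mixed_task_rl(StubPolicy(), envs=[ArithmeticChainBootcamp(difficulty=d) for d in (2, 3, 4)])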

These advances stem from a series of recent breakthroughs by Shanghai AI Lab in underlying technologies and architecture along the general-specialized fusion path. Historically, large model development has diverged along two major routes: high specialization and broad generalization. Shanghai AI Lab has pioneered a general-specialized fusion technical route (https://arxiv.org/abs/2407.08642) aimed at resolving the dilemma in which high specialization and broad generalization constrain each other. The key to this path is improving deep reasoning and specialized generalization capabilities at the same time, so that models perform well across a wide range of complex tasks while also reaching professional levels in specific domains.

Shanghai AI Lab further proposes a three-layer technical path, consisting of mutually dependent foundational-model, fusion-collaboration, and exploration-evolution layers, aimed at creating general artificial intelligence that combines "broad generalization," "high specialization," and "task sustainability."


The first layer is the foundational model layer, which aims to build broad generalization capabilities together with high-density supervised professional capabilities. The Shanghai AI Lab team recently proposed Memory Decoder, a new "Memory + Decoder" large model architecture whose two components can be trained separately with different pre-training tasks. Unlike classic Transformer architectures, which encode all information directly into the decoder, this design moves toward a new generation of general-specialized fusion models in which "knowledge and reasoning can be separated and recombined." The Memory component plays the "specialized" role, reliably memorizing knowledge from different domains; the Decoder plays the "general" role, handling language organization and logic; and a Memory module, once trained, can be attached to different base models.
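As a rough illustration of how such a split could be used at inference time, the sketch below assumes the Memory component produces its own next-token distribution, which is then interpolated with the base decoder's output. The function name, the interpolation weight lam, and the interpolation itself are assumptions made for exposition and may differ from the published Memory Decoder design.

import torch
import torch.nn.functional as F

def fused_next_token_logprobs(decoder_logits: torch.Tensor,
                              memory_logits: torch.Tensor,
                              lam: float = 0.3) -> torch.Tensor:
    """Combine a general-purpose decoder with a domain 'memory' module.

    Both inputs are next-token logits of shape (batch, vocab). The memory
    module carries domain knowledge and could be paired with different base
    models; lam controls how strongly it overrides the general decoder.
    """
    p_decoder = F.softmax(decoder_logits, dim=-1)   # general language and logic
    p_memory = F.softmax(memory_logits, dim=-1)     # domain knowledge
    p_fused = (1.0 - lam) * p_decoder + lam * p_memory
    return torch.log(p_fused + 1e-12)               # log-probs used for decoding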

The second layer is the fusion-collaboration layer, which builds general-specialized fusion capabilities comparable to human experts through multi-path collaboration. Recent breakthroughs by the team include:

Designing the reinforcement learning algorithm PRIME (https://arxiv.org/abs/2502.01456), which uses high-density supervisory signals to strengthen agents' specialized capabilities more efficiently, paving the way toward general collective intelligence. PRIME converges faster and achieves a performance gain about 7% higher than existing methods; on competition-level math benchmarks such as AIME and MATH, a 7B model trained with only a small amount of open-source data can surpass OpenAI's GPT-4o in mathematical ability.

Launching MoR, a post-training framework centered on multi-task reinforcement learning. Algorithmic exploration and preliminary integration validation have been carried out on different task types (for example mathematical problem-solving and proofs, scientific Q&A, reasoning puzzles, and subjective dialogue), realizing mixed multi-task reinforcement learning training.

Constructing OREAL (https://arxiv.org/abs/2502.06781), a new reinforcement learning paradigm based on outcome rewards, which targets three major dilemmas facing current large models: the sparse-reward dilemma, the local-correctness trap, and the scale-dependency curse. OREAL surpasses widely used methods such as GRPO and defines a broader algorithm design space that can absorb the strengths of methods like PRIME and DAPO, further improving the reasoning capabilities of light and medium-sized (7B/32B) models without distilling from ultra-large models. A simplified sketch of how outcome and process rewards can be combined follows this list.
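To convey the general idea behind these outcome- and process-reward methods, the sketch below blends a sparse, verifiable outcome reward with dense per-step scores when assigning credit across a reasoning trace. It is a didactic simplification, not the PRIME, OREAL, or DAPO algorithm; the beta weighting and the return-to-go credit assignment are illustrative choices.

from typing import List

def step_advantages(outcome_reward: float,
                    process_scores: List[float],
                    beta: float = 0.5,
                    gamma: float = 1.0) -> List[float]:
    """Blend a sparse outcome reward (0/1 from a verifier) with dense per-step
    process scores, then spread credit back over the reasoning steps.

    process_scores[i] is an estimated quality score for step i (for example
    from a learned process reward model); beta trades off the two signals.
    """
    rewards = [beta * s for s in process_scores]
    rewards[-1] += (1.0 - beta) * outcome_reward      # outcome credited at the final step
    # Discounted return-to-go as a simple per-step credit/advantage proxy.
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        advantages[t] = running
    return advantages

# Example: three reasoning steps, correct final answer.
print(step_advantages(outcome_reward=1.0, process_scores=[0.2, 0.5, 0.9]))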

The third layer is the exploration-evolution layer, which closes the loop of AI self-evolution through autonomous exploration and feedback correction. Recent breakthroughs by the team include:

The Test-Time Reinforcement Learning (TTRL) framework (https://arxiv.org/abs/2504.16084) explores a practical path toward autonomous AI evolution. TTRL can estimate rewards without accurate labels and still drive the model to learn in the right direction, demonstrating its potential to reduce reliance on manual labeling and to push reinforcement learning further toward large-scale, unsupervised settings; a simplified sketch of this reward-estimation trick follows this list.

Constructing Retro-R1, a new molecular retrosynthesis method built on the paradigm of large models + agents + long reasoning + reinforcement learning, which plans synthesis routes more precisely in multi-step retrosynthesis problems. Retro-R1 upgraded the large model's retrosynthesis reasoning by training for only 200 steps on 10,000 reinforcement learning data points with no SFT data, and it generalizes well across data from different domains.
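The core trick of estimating rewards without labels can be sketched as follows: sample several answers to the same prompt, take the majority answer as a pseudo-label, and reward each sample by its agreement with that label before running an ordinary reinforcement learning update. This mirrors the TTRL paper only at a high level; the function and its 0/1 reward scale below are illustrative.

from collections import Counter
from typing import List, Tuple

def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Estimate rewards without ground-truth labels.

    Given several sampled answers to the same prompt, treat the most common
    answer as a pseudo-label and reward each sample 1.0 if it agrees with it,
    0.0 otherwise. These rewards can then drive a standard RL update.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Example: four sampled completions for one prompt.
label, rewards = majority_vote_rewards(["42", "42", "41", "42"])
print(label, rewards)   # 42 [1.0, 1.0, 0.0, 1.0]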

Going forward, Shanghai AI Lab will systematically advance the development and exploration of the general-specialized fusion technical route: continuously opening up new general-specialized fusion capabilities through InternBootcamp, accelerating the solution of key problems in specific scientific discoveries with new-generation general-specialized fusion foundation models, and driving demonstration applications in vertical domains, providing key momentum for scientific discovery and industrial innovation.

Public Test Link: https://internlm-chat.intern-ai.org.cn/

Open Source Address: https://github.com/InternLM/InternBootcamp


