The release of GPT-5 has left many disappointed. Even industry figures like Jeremy Howard, co-founder of fast.ai, believe the era of Scaling Laws is nearing its end. He suggests that every lab will experience a "Llama 4 moment" like the one Mark Zuckerberg's team faced: Grok has already gone through it, and OpenAI has just had its own.
So, has the scaling law truly reached its limit? Shuchao Bi (former Head of Multimodal Post-training at OpenAI and co-founder of YouTube Shorts, whom Mark Zuckerberg recruited at great expense to his superintelligence lab) answers in the negative. He believes the scaling law will always hold because it reflects the structure of data, an objective rule; what will run out is the data itself.
Recently, Shuchao Bi delivered a talk at Columbia University titled "Advancing the Frontier of Silicon-Based Intelligence: Past, Open Questions, and Future," systematically laying out his thinking on the development of artificial intelligence. He began by reviewing the two core development paths in AI over the past fifteen years—Self-Supervised Learning and Reinforcement Learning—and emphasized the core idea of "The Bitter Lesson": that massive compute and data will ultimately outperform human-designed inductive biases.
The core argument of the speech is that current AI progress stems primarily from the expansion of compute, but it is running into a bottleneck: the supply of high-quality data. To overcome this, future breakthroughs will rely on advanced reinforcement-learning paradigms that can transform compute into new knowledge and new data. Scaling Law itself reflects the inherent structure of data; it is an objective rule that will not fail. The real problem is that we have exhausted most of the high-quality, intelligent text data on the internet.
Here are the details:
1. Trajectory: The "Tale of Two Cities" for Self-Supervised Learning and Reinforcement Learning
Shuchao Bi likens the development of artificial intelligence over the past decade to a tale of two cities: Self-Supervised Learning and Reinforcement Learning. The two developed independently and converged only in recent years, jointly driving the current generative AI revolution.
The First City: Self-Supervised Learning and the Power of Scale
The wave of self-supervised learning began around 2012. That year, a large-scale deep learning model called AlexNet, utilizing GPUs and massive datasets, achieved astonishing results in the ImageNet image recognition challenge, with an error rate far lower than any previous method. The symbolic significance of this event was that it proved that with enough data and computing power, neural networks could surpass decades of human-designed visual algorithms. This was a nightmare for the computer vision field at the time: hand-crafted features that researchers had painstakingly tuned for decades suddenly became nearly worthless overnight. The event reignited academic and industrial interest in neural networks and is widely regarded as the beginning of the deep learning revolution.
a. From Word2Vec to Everything2Vec (2013):
Google's Word2Vec model demonstrated how words could be represented as vectors that support meaningful mathematical operations, such as vector('king') - vector('man') + vector('woman') approximately equaling vector('queen'). This showed that the semantics of language could be embedded into algebraic structures. More importantly, these embedding vectors performed excellently in downstream tasks, leading to the trend of Everything2Vec. Recommendation items, videos, users: everything could be represented as a vector, which greatly accelerated a wide range of applications.
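As a concrete illustration, here is a minimal sketch of that analogy using gensim's downloadable GoogleNews vectors; the choice of this particular pretrained embedding set, and the exact similarity score, are assumptions for demonstration only:

```python
# A minimal sketch of Word2Vec vector arithmetic, assuming the gensim
# library and its downloadable GoogleNews embeddings are available.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # ~1.6 GB pretrained embeddings

# vector('king') - vector('man') + vector('woman') should land near 'queen'
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```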
b. Evolution of Architecture and Optimization:
ResNet (Residual Network, 2015, by Kaiming He and collaborators): A core challenge in deep learning is that very deep networks are extremely difficult to train, because gradients tend to vanish or explode during backpropagation. ResNet solved this by introducing skip connections: the input to a block of layers is added directly to the block's output, so the signal can always flow through an identity path. A ResNet can also be understood as an implicit ensemble of networks of every depth, from shallow to deep. These connections make the loss surface exceptionally smooth, greatly simplifying optimization, and almost all modern neural networks adopt a similar structure.
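A minimal sketch of a residual block in PyTorch may help; this is an illustrative simplification, not the exact block from the ResNet paper:

```python
# A minimal sketch of a residual block (simplified, not the exact
# ResNet paper block): the skip connection adds the input x to the
# block's output, giving gradients an unobstructed identity path.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add input to output
```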
Adam Optimizer (2014): Adam provided a standardized learning algorithm that adapts a per-parameter step size automatically, sparing researchers from hand-tuning large numbers of learning-rate settings. It is particularly effective on large-scale, noisy datasets, simplifies the training process, and remains a mainstream optimizer to this day.
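For reference, the update rule from the original Adam paper (Kingma & Ba, 2014), where m_t and v_t are exponential moving averages of the gradient and its square:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}

\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{t}}, \qquad
\theta_t = \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}
```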
c. Transformer (2017) - The Culmination of the Revolution
Early models for processing sequential data (such as RNNs and LSTMs) faced two major bottlenecks: first, their recurrent structure made parallelization difficult, limiting the scale of models and data; second, they still suffered from vanishing gradients on long sequences. The 2017 paper "Attention Is All You Need" proposed the Transformer architecture, which abandoned recurrence entirely and relied solely on the self-attention mechanism. By stacking multi-head attention layers and feed-forward networks, it achieved excellent data efficiency and parallelism. This made it possible to train unprecedentedly large models, and the Transformer became the backbone of almost all frontier language and multimodal models.
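The heart of the architecture, scaled dot-product self-attention, fits in a few lines. This NumPy sketch omits multi-head projections, masking, and training; the random weights are purely illustrative:

```python
# A minimal sketch of scaled dot-product self-attention, the core
# Transformer operation. No recurrence: every position is computed
# in parallel, and every token can attend to every other token.
import numpy as np

def self_attention(X: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # pairwise similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of values

# Usage: 5 tokens, model width 8, random weights for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```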
From AlexNet to Transformer, the development trajectory of self-supervised learning clearly corroborates Rich Sutton's "The Bitter Lesson":
General methods that fully exploit computational scale will ultimately outperform methods that rely on human-designed cleverness and inductive biases.
We should not try to force human priors into models, but rather create a model with the simplest possible structure that simply wants to learn, and then train it with massive data and computation.
The Second City: Reinforcement Learning and the Exploration of Intelligence
The development of reinforcement learning, however, presents a different picture, focusing more on decision-making, exploration, and interaction with the environment.
From Games to Surpassing Humans:
Deep Q-Network (DQN, 2015): DeepMind demonstrated that DQN could far surpass human performance across dozens of Atari games. These agents even discovered alien-seeming strategies that human players had never conceived.
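For reference, DQN minimizes the temporal-difference loss below, where $Q_{\theta^{-}}$ is a periodically frozen copy of the network that stabilizes the bootstrapped target:

```latex
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(s,a,r,s')}\!\left[\Big( r + \gamma \max_{a'} Q_{\theta^{-}}(s',a') \;-\; Q_{\theta}(s,a) \Big)^{2}\right]
```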
AlphaGo (2016) & AlphaGo Zero (2017): AlphaGo's success was a milestone in AI development history. It initially learned from human Go games, combining deep neural networks, self-play, and Monte Carlo Tree Search, to defeat the world Go champion. Its successor, AlphaGo Zero, went even further, not using any human data at all, and reaching levels superior to all previous versions purely through self-play. This is like a martial arts master who, unable to find an opponent, begins to fight himself, thereby reaching a higher realm.
AlphaZero (2018): This model extended this capability to other board games (such as chess), proving the generality of its method.
However, despite the significant social impact of these gaming achievements, they failed to create substantial economic value directly. Bi points out the fundamental reason: these systems relied heavily on specialized environments that do not generalize broadly. In other words, they were superintelligences for specific tasks, not general intelligence.
Convergence of Two Cities: Pre-trained Models and the Marriage of Reinforcement Learning
The real transformation occurred when these two paths converged. The magic happened when reinforcement learning no longer started from scratch but was combined with language models pre-trained on massive data and therefore rich in world knowledge.
InstructGPT (2022) & ChatGPT (2022): By using Reinforcement Learning from Human Feedback (RLHF), researchers transformed a pre-trained model that merely completed text into a helpful conversational AI capable of understanding and following human instructions. ChatGPT's release ignited global enthusiasm, and it now serves over 500 million weekly active users. Its applications range from daily Q&A and content creation to life-saving medical diagnostics, demonstrating unprecedented practical value.
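The standard RLHF objective behind this step (as in the InstructGPT paper, omitting its auxiliary pre-training term) maximizes a learned reward $r_\phi$ while a KL penalty keeps the policy $\pi_\theta$ close to the pre-trained reference model:

```latex
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[\, r_\phi(x,y) \,\big]
\;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
```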
The key to this transformation is that reinforcement learning is now applied in a general environment with extremely high economic value, and its starting point is a general "prior"—the pre-trained language model—that already possesses vast knowledge. Most of the generality still comes from the pre-training phase, while reinforcement learning plays the role of alignment, making the model's behavior match human expectations. As Yann LeCun's cake analogy goes: self-supervised learning is the body of the cake, supervised learning is the icing, and reinforcement learning is just the cherry on top. Although RL currently accounts for a small share of total compute, Bi believes that reaching more advanced AGI and ASI will require investing far more compute in reinforcement learning, enabling models to adapt to new environments, even ones humans have never seen.
2. Current Challenges: Open Questions on the Road to AGI
After reviewing the glorious past, Shuchao Bi points out that the road to Artificial General Intelligence (AGI) is not smooth, and we are currently facing a series of core open questions. These questions mainly revolve around data, efficiency, exploration, and safety.
Core Bottleneck: Data, Not Algorithms
Many have observed a recent slowdown in model performance improvements and declared that Scaling Law has failed. But Bi offers a starkly different view: what has failed is not Scaling Law, but data. He believes that Scaling Law itself is a reflection of the intrinsic structure of data; it is an objective rule that will not fail. The real problem is that we have exhausted most of the high-quality, intelligent text data on the internet.
a. Nature of Scaling Law:
Bi tends to believe that Scaling Law originates from the power-law distribution of data. In the real world, simple, common knowledge (like arithmetic) is abundant, while complex, rare knowledge (like algebraic geometry) is much scarcer. Models need exponentially more compute to learn the rarer, deeper patterns in data. This also explains emergent abilities: a model's capabilities do not grow smoothly; rather, the model suddenly masters a new skill (like calculus) once compute crosses a threshold, precisely because it finally has enough capacity to pick up the extremely rare relevant patterns in the data.
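One concrete form of such a law is the Chinchilla-style loss fit (Hoffmann et al., 2022), where loss falls as a power law in both parameter count N and data D; the exponents below are that paper's fitted values, cited for illustration:

```latex
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34,\;\; \beta \approx 0.28
```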
b. Data Dilemma:
Learning is fundamentally bounded by data. Without more, better, and smarter data, simply increasing model parameters and compute yields diminishing returns. The fundamental challenge therefore becomes how to acquire new, high-quality data.
How to Create New Data? Hopes and Challenges of High-Compute Reinforcement Learning
Since human data is nearly exhausted, a natural thought is: can we transform computational resources into data? After all, human knowledge itself is produced by the human brain interacting with the environment (i.e., consuming biological computation). Theoretically, silicon-based computers can also do this. DeepMind's AlphaGo and AlphaDev have already demonstrated the feasibility of this in specific domains. However, extending this paradigm to general domains still faces several major challenges:
a. Limits of Verifiability: Currently, methods for generating new data through reinforcement learning are mainly limited to domains where results can be easily verified, such as mathematical problems (with standard answers) or code generation (which can be tested with unit tests). But in more open-ended, creative domains, how to define a reliable reward signal to judge the quality of generated content remains an unresolved problem.
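As a minimal sketch of what such a verifiable reward looks like for code generation (all names here are hypothetical illustrations, not a real RL framework's API), the reward can simply be the fraction of unit tests a candidate program passes:

```python
# A minimal sketch of a verifiable reward for code generation: run a
# (possibly model-generated) candidate against unit tests and score
# it by pass rate. All names are hypothetical illustrations.
from typing import Callable, List, Tuple

def unit_test_reward(candidate: Callable, tests: List[Tuple[tuple, object]]) -> float:
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate earns no credit for this test
    return passed / len(tests)

# Usage: a perfect candidate (Python's built-in sorted) scores 1.0
tests = [(([3, 1, 2],), [1, 2, 3]), (([],), [])]
print(unit_test_reward(sorted, tests))  # 1.0
```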
b. Exploration Dilemma: In closed environments like Go, new strategies can be discovered through guided random exploration (e.g., Monte Carlo Tree Search). In language models, however, the combinatorial space is unimaginably vast, and randomly generated tokens will almost never form meaningful content, so far more efficient exploration strategies are needed. Bi believes one promising direction is for models to interpolate and extrapolate from their existing vast knowledge base; such guided exploration alone might be enough to push the boundaries of intelligence. AlphaDev's success in discovering faster sorting routines, where no improvement had been found in over a decade, provides encouraging evidence for this direction.
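The exploration rule inside classic Monte Carlo Tree Search, the UCT criterion (AlphaGo uses a prior-guided variant), makes this trade-off explicit: pick the action balancing estimated value $Q(s,a)$ against how rarely it has been tried, measured by visit counts $N$:

```latex
a^{*} \;=\; \arg\max_{a}\;\left[\, Q(s,a) \;+\; c\,\sqrt{\frac{\ln N(s)}{N(s,a)}} \,\right]
```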
c. Can RL Create New Ideas? Recent research suggests that current reinforcement learning (like RLHF) mostly elicits, rather than creates, abilities already present in foundation models: it makes the model output correct answers more reliably, but the seeds of those answers were already planted during pre-training. Bi has reservations about this conclusion, believing that more advanced RL paradigms will be capable of generating genuinely novel knowledge.
The Learning Efficiency Gap: Human Brain vs. Machine
Another core issue is data efficiency. Compared to humans, current AI learns extremely inefficiently. A human learning a new board game might need only a few minutes of explanation and a few practice games (equivalent to thousands of tokens), but bringing an AI model to the same level might require millions of samples or more.
Bi speculates that the root of this efficiency difference may lie in the different learning objectives:
AI's Learning Method:
Current language models learn by predicting the next token. This means the model must not only learn semantics and logic but is also forced to waste substantial compute fitting the random, superficial structure of language (e.g., the same meaning can be expressed in a hundred different ways, yet the model is trained to predict the specific wording).
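Concretely, pre-training minimizes token-level cross-entropy, which rewards matching the exact surface form of the text rather than the underlying idea:

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_{\theta}\big(x_t \mid x_{<t}\big)
```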
Human Learning Method: When humans learn, we are not predicting the next word. We are predicting and understanding at a higher, more abstract level. We focus on the essence of ideas, not their superficial linguistic form.
How to design a new model architecture or loss function that can learn at a more abstract level like humans is key to achieving higher data efficiency. Whoever solves this problem may usher in the next AI paradigm, with significance no less than the Transformer.
Safety and Alignment: Indispensable Cornerstones
As models become more capable, safety issues also become increasingly prominent. Bi divides them into three categories:
Content Safety: Models may generate harmful, unsafe content, similar to traditional trust and safety issues.
Malicious Use: Bad actors may use powerful AI for criminal activities.
Loss of Control and Misalignment Risk: This is the most severe challenge, where the model's own goals do not align with human values, potentially leading to catastrophic consequences.
Ensuring that AI development is safe, controllable, and aligned with human interests is a core issue that all leading AI research institutions must seriously address.
3. How AI Will Reshape Our World
In the final part of his speech, Shuchao Bi shared his vision for the future of AI. He quoted Sam Altman: "The days are long, but the decades are short." This sentence reminds us that people often overestimate the short-term impact of AI but severely underestimate its disruptive power in the medium to long term. Bi foresees that when we have a prior model with general knowledge, combined with unlimited reinforcement learning computation and a good interaction environment, the result will be the birth of superintelligence.
AI for Science: A New Paradigm for Scientific Discovery
Bi is extremely excited about AI's application in science. He believes that scientific discovery is essentially a search problem in a massive search space. Historically, scientists have painstakingly searched for pebbles of truth in this space through intuition, experimentation, and theory. The power of AI lies in its ability to greatly compress this search space, turning what used to be serendipitous discoveries into systematically achievable goals.
AI will become the new mathematics of science:
He quotes the chief scientist of Isomorphic Labs (Alphabet's drug-design company spun out of DeepMind): "Not using AI for drug design is like not using mathematics for scientific research." AI will become the foundational tool for all scientific fields in the next decade.
Formation of a Positive Flywheel:
- Model-guided search: AI models (like AlphaFold) analyze problems and propose high-probability hypotheses (e.g., which protein structures might be effective).
- Automated experimental verification: Robots and automated equipment in laboratories conduct high-throughput experiments based on AI's proposals.
- Data feedback and model iteration: Experimental results are quickly fed back to the AI, which continuously learns and evolves on this new data, thereby proposing more precise hypotheses.
This "hypothesis -> experiment -> feedback" loop will operate at an astonishing speed, far exceeding the efficiency of human scientists, thereby accelerating breakthroughs in materials science, drug development, physics, and other fields. Bi even dreams that in the future, we could build a general scientific model, instead of specialized models for each discipline, to solve century-old problems like the Riemann hypothesis.
AI for Education: Achieving True Personalization and Elite Learning
Education is one of the areas where AI can bring the most fairness and efficiency. One of the biggest injustices in the current education system is the scarcity and unequal distribution of high-quality educational resources. AI has the potential to fundamentally change this situation in two ways:
- Lowering learning barriers: AI can reorganize and present complex knowledge points in a way that is most suitable for individual learners to understand. It can generate countless personalized examples and explanations, making previously daunting subjects approachable.
- Raising the ceiling: AI can act as an always-available, all-knowing personal tutor. Research shows that one-on-one tutoring can raise learning outcomes severalfold, and for highly curious learners AI can be an accelerator toward 10x learning. Bi offers himself as an example: with AI he can pick up the fundamentals of an entirely new field in a weekend. He boldly hypothesizes that in the future, five years of study might yield not one doctoral degree but the equivalent of five or even ten doctoral-level educations in different fields.
Disruptions in Other Fields
- AI Agents: In the next one to two years, we will see more reliable and capable AI agents become a reality, able to perform complex digital tasks on behalf of humans. This is more of an engineering execution problem than a fundamental research problem.
- AI for Healthcare: AI has already demonstrated the ability to surpass most ordinary medical providers in diagnosis. In the future, if AI can access a person's complete health history and vital sign data, it will not only treat diseases but also perform precise preventive health management.
- Embodied AI: This is a longer-term challenge, because we lack robot-interaction data on the scale of internet text, and efficiently tokenizing actions remains an open problem. But once achieved, embodied AI will have a huge impact on the real economy and could even replace humans in exploring the dangerous deep sea and distant space.
Bi believes that, in a sense, humanity's centuries-long civilization process—from inventing printing to record knowledge, to inventing computers and the internet to collect data—seems to have been preparing for the birth of AGI. Now, this moment is approaching us at an unprecedented speed.
Summary
From Shuchao Bi's speech, we can distill two core frameworks and mental models for understanding the future development of AI. I believe Bi offers a comparatively credible perspective, and I hope these insights help everyone cut through some of the noise of conventional wisdom.
Mental Model One: "The Bitter Lesson" – Embrace Scale, Discard Bias
This is the foundational idea running through the entire speech, derived from Rich Sutton's classic article "The Bitter Lesson." It requires us to fundamentally shift our mindset when thinking about AI development.
Core Principle: General Methods + Massive Compute = Ultimate Victory
History has repeatedly shown that efforts to hardcode human knowledge, rules, and heuristics into systems, while seemingly effective in the short term, will ultimately be surpassed by methods that are more general, simpler, and capable of benefiting from large-scale computation.
What to do: Focus on two scalable things – Search and Learning
- Learning: Refers to the model's ability to automatically discover patterns and structures from data, represented by neural network-based self-supervised learning. We should design general architectures (like Transformer) that can absorb massive data, rather than designing complex modules for specific tasks.
- Search: Refers to the ability to explore a vast space of possibilities to find optimal solutions, represented by methods like Monte Carlo Tree Search in reinforcement learning.
What to avoid: Over-reliance on human inductive bias
When designing an algorithm, it is easy to bake in one's own intuitions about the problem (i.e., bias). In traditional computer vision, for example, researchers hand-designed feature detectors such as edge and corner detectors, yet the success of deep learning showed that letting the model learn these features from raw pixels works far better. Turing made this point over 70 years ago: rather than trying to simulate an adult brain (already full of biases and knowledge), we should simulate a child's brain and then educate it properly (i.e., with data and training).
Practical Application:
- When choosing research directions or technical solutions, prioritize methods with strong scalability. Ask yourself: If my computing resources increase by 100 times, will the performance of this method improve linearly or even superlinearly?
- When building models, keep the architecture simple and general. Trust the power of data and computation, instead of trying to teach the model too much with clever tricks. Let the model just want to learn.
This mental model explains why deep learning has achieved breakthroughs in multiple fields such as vision and language, and predicts that future progress will continue to rely on the exponential growth of computing and data scale.
Mental Model Two: The Compute-Data Flywheel – A Self-Reinforcing Loop Towards Superintelligence
Facing the bottleneck of depleted high-quality human data, Bi outlines a positive flywheel framework that allows AI itself to create new knowledge, thereby driving intelligence growth. This framework is a natural extension of "The Bitter Lesson," with the core idea of transforming computational resources into data assets.
- The Flywheel's Engine: Scaling Laws. This is the underlying empirical law ensuring that investing more high-quality data and compute yields stronger model capabilities.
- The Flywheel's Starting Fuel: The entirety of human knowledge. We first utilize existing human data (text, code, images, etc.) to pre-train a powerful foundation model (like GPT-4). This model is the starting point of the flywheel; it possesses a broad and general prior knowledge of the world.
- The Flywheel's Operating Mechanism: A "generate-verify-learn" closed loop (a code sketch follows this list).
Step One: AI Hypothesizes (Hypothesis Generation). Utilizing the foundation model's powerful reasoning and knowledge capabilities, it performs guided exploration and search within a specific problem domain (such as mathematics, materials science), generating new ideas, solutions, or designs. This step makes the model's latent capabilities explicit.
Step Two: Environment Provides Feedback (Verification & Feedback). The AI-generated hypothesis is put into a verifiable environment for testing. This environment can be a mathematical prover, a physical simulator, a code compiler, or an automated wet lab. The environment returns a clear signal: whether the hypothesis is correct, valid, or superior.
Step Three: Successful Explorations Convert to New Data (New Data Creation). All verified successful exploration results (e.g., a new mathematical theorem, a more efficient algorithm, a higher-performing molecular structure) are considered new, high-quality, AI-generated data.
Step Four: Model Evolves Through Learning (Model Evolution). This newly generated high-quality data is used to continuously train or fine-tune the foundation model. This strengthens the model's capabilities in that domain, allowing it to propose more profound and effective hypotheses in the next cycle.
- The Flywheel's Ultimate Goal: Achieving self-driven growth of intelligence. Through this continuously accelerating flywheel, AI systems will be able to break free from reliance on human data and enter a path of self-improvement and self-evolution. Compute is efficiently converted into new knowledge, and new knowledge in turn improves the efficiency of that conversion. This path is considered the most likely route to ASI (Artificial Superintelligence).
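To make the mechanism concrete, here is a minimal sketch of the loop described above; every name in it (generate_hypotheses, verify, finetune, and so on) is a hypothetical placeholder rather than a real API:

```python
# A minimal sketch of the "generate-verify-learn" flywheel.
# All objects and methods are hypothetical placeholders.
def flywheel(model, environment, num_rounds: int):
    for _ in range(num_rounds):
        # Step 1: the model proposes hypotheses via guided exploration
        hypotheses = model.generate_hypotheses(environment.problem)
        # Step 2: a verifier (prover, simulator, compiler, wet lab) checks them
        verified = [h for h in hypotheses if environment.verify(h)]
        # Step 3: verified successes become new, high-quality training data
        new_data = [environment.to_training_example(h) for h in verified]
        # Step 4: the model learns from its own verified discoveries
        model.finetune(new_data)
    return model
```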
This framework not only provides a clear roadmap for addressing the data bottleneck but also offers profound insights into how future AI might generate disruptive impacts in frontier fields like science. It requires us to view AI not just as a tool, but as a partner capable of exploring the unknown and creating new knowledge with us.