Under $8,000! Sina Weibo's 1.5B Small Model Surpasses Near-Trillion Parameter Models

A small model with only 1.5 billion parameters and a training cost of under $8,000 has defeated the near-trillion-parameter DeepSeek-R1 (671 billion parameters), which has hundreds of times more parameters, on top-tier mathematical competition benchmarks. It even rivals Gemini 2.5 Flash and Claude Opus 4.


Surprisingly, the model behind this is VibeThinker-1.5B, recently released and open-sourced by Sina Weibo.


This little model, with only 1.5 billion parameters, proves that intelligent algorithmic design might be more powerful than simply piling up parameters.

The Core: A Strategy of Divergence Followed by Convergence

VibeThinker-1.5B's power does not stem from revolutionary model architecture but from a training philosophy behind it called the Spectrum-to-Signal Principle (SSP).


Traditional model training, especially during the fine-tuning stage, has a very direct goal: maximizing the probability (Pass@1) that the model gives the correct answer in a single attempt. Both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages are optimized around this single objective.

The SSP principle argues that this approach has fundamental limitations.

It's like a strict teacher who only rewards the single correct standard answer, thereby stifling students' creativity in exploring other possible solutions. This training method makes the model rigid in its thinking, converging too early to a narrow problem-solving path, thus limiting the upper bound of its reasoning capabilities.

VibeThinker-1.5B takes the opposite approach. It completely decouples the objectives of the SFT and RL stages, assigning them distinct yet complementary missions.

The first stage, Supervised Fine-Tuning (SFT), is defined as the "spectrum" stage.

Its goal is no longer to pursue single-attempt accuracy but to generate a rich and diverse spectrum of solutions containing various plausible problem-solving ideas. In plain terms, it encourages the model to think broadly, coming up with as many reasonable-looking solutions as possible for a given problem.

The evaluation metric for this stage is no longer Pass@1 but Pass@K. This metric measures whether at least one correct answer is present among the K independent answers generated by the model. A high Pass@K means the model has a broad solution space and rich reserves of problem-solving paths, providing fertile ground for subsequent optimization.
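As a rough illustration, here is the unbiased Pass@K estimator commonly used in the code-generation literature (sample n answers per problem, count c correct ones); the VibeThinker report may compute the metric differently, and the numbers below are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: the probability that at least one of K answers
    drawn from n samples (of which c are correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw of k includes a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 16 sampled solutions per problem, 3 of them correct
print(pass_at_k(n=16, c=3, k=1))   # ~0.19 (roughly single-attempt accuracy)
print(pass_at_k(n=16, c=3, k=8))   # 0.90 (diversity pays off at larger K)
```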

This is akin to a top creative team brainstorming: the first step isn't to judge which idea is best but to encourage everyone to propose as many ideas as possible, no matter how wild. The quantity and diversity of ideas (the spectrum) determine the ultimate ceiling for great creativity.

The second stage, Reinforcement Learning (RL), is defined as the "signal" stage.

Once the model has learned divergent thinking through the SFT stage, the RL stage's task becomes convergent focusing. It acts like an experienced editor or decision-maker, identifying the most correct and efficient signal from the broad spectrum generated by SFT and amplifying it.

Through reward mechanisms, RL guides the model to increase the probability of generating the best answer among many possibilities. Since the SFT stage has already provided a sufficiently rich pool of candidate solutions, the optimization in the RL stage becomes more efficient; it no longer needs to explore from scratch but rather selects and reinforces from a high-quality candidate pool.

The essence of the SSP principle is its recognition that optimizing for diversity (Pass@K) first, and then accuracy (Pass@1), can achieve a higher performance ceiling than optimizing for accuracy from start to finish. A model with open-ended thinking that can generalize will ultimately be far better at finding the correct answer than a rigid model that can only follow a fixed path.

Output diversity is core to a model's robustness and creativity.

When a model can approach problems from multiple angles and through various paths, it is less likely to get stuck in local optima and more likely to find breakthrough solutions when facing novel and complex problems. The SSP framework systematically integrates this understanding into the entire model training process.

Model Training Process: An Art Form

The elegance of theory requires ingenious practice for implementation. VibeThinker-1.5B applies the SSP principle to every detail of its training, with its specific methods divided into two core steps: diversity exploration distillation and maximum entropy-guided policy optimization.

Step One: Extracting Diversity Essence with Distillation

To create the broadest possible solution spectrum in the SFT (Supervised Fine-Tuning) stage, the team designed a clever two-stage diversity exploration distillation process.

First is domain-aware diversity probing.

Rather than mixing all knowledge indiscriminately, the team recognized that different domains require different kinds of diverse thinking. In mathematics, for example, they subdivided the field into N subdomains such as algebra, geometry, calculus, and statistics.

Then, they used a powerful large language model to automatically construct specialized test sets for each subdomain. During the SFT training process, the model would save a checkpoint periodically (e.g., every k steps). These checkpoints would then be evaluated in the "exam rooms" of various subdomains using the Pass@K metric.

Ultimately, in each subdomain, the checkpoint with the highest Pass@K score was crowned as the diversity expert model for that domain. For example, M*Algebra is the model most skilled at solving algebra problems in multiple ways, while M*Geometry is the champion of divergent thinking in geometry.
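A minimal sketch of what this selection loop could look like, assuming a caller-supplied `evaluate_pass_at_k` function and a list of checkpoint identifiers (both hypothetical; the paper does not publish this code):

```python
def select_domain_experts(checkpoints, domain_probe_sets, evaluate_pass_at_k, k=16):
    """For each subdomain, keep the SFT checkpoint with the highest Pass@K.

    checkpoints:        list of checkpoint identifiers saved periodically during SFT
    domain_probe_sets:  dict mapping subdomain name -> list of probe problems
    evaluate_pass_at_k: callable(checkpoint, probe_set, k) -> float (hypothetical evaluator)
    """
    experts = {}
    for domain, probe_set in domain_probe_sets.items():
        scores = {ckpt: evaluate_pass_at_k(ckpt, probe_set, k) for ckpt in checkpoints}
        experts[domain] = max(scores, key=scores.get)  # e.g. "algebra" -> its best checkpoint
    return experts
```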

This process is like selecting, from thousands of candidates, the intern with the most creative potential for each department.

Next is expert model fusion.

After selecting experts from various domains, their talents needed to be integrated to create an all-rounder SFT model with maximized diversity. This used a technique called Model Merging.

Simply put, it involves taking a weighted average of the parameters of these expert models. The formula can be expressed as:

M_SFT^Merge = w_1 · M*_1 + w_2 · M*_2 + … + w_N · M*_N

The sum of the weights w_i is 1, ensuring that the merged model's parameter scale remains unchanged. In VibeThinker-1.5B's implementation, the team adopted the simplest equal-weight scheme (w_i = 1/N), meaning that the diversity capability of each domain was equally injected into the final SFT model.
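As a sketch (assuming the experts are PyTorch state dicts with identical architectures; the team's actual merging code may differ), equal-weight merging is just an element-wise average of parameters:

```python
import torch

def merge_experts(expert_state_dicts, weights=None):
    """Weighted average of expert parameters; with equal weights (w_i = 1/N),
    the merged model has the same parameter count as any single expert."""
    n = len(expert_state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n
    assert abs(sum(weights) - 1.0) < 1e-6, "weights must sum to 1"
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(w * sd[name].to(torch.float32)
                           for w, sd in zip(weights, expert_state_dicts))
    return merged
```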

This merged model, M_SFT^Merge, which combines the strengths of all experts, not only achieved top-tier performance on the diversity metric Pass@K but also excelled in single-attempt accuracy Pass@1.

This indicates that pursuing breadth of thinking does not weaken its depth.

On the contrary, a broader cognitive spectrum seems to reinforce the path to the most correct answer. This powerful SFT model laid an unparalleled solid foundation for the subsequent RL optimization stage.

Step Two: Guiding the Model to Explore the Learning Sweet Spot with Entropy

After entering the RL (Reinforcement Learning) signal stage, the team faced a new problem: how to utilize training data most efficiently?

Traditional RLHF (Reinforcement Learning from Human Feedback) typically uses static datasets, which is inefficient for a continuously evolving model. Repeatedly practicing problems the model has already mastered is a waste of time; problems far beyond its current capability will frustrate the model and hinder learning.

Here, VibeThinker-1.5B introduced the MaxEnt-Guided Policy Optimization (MGPO) framework.

The name sounds complex, but its core idea is very intuitive, derived from information theory. It posits that a problem's value for model training is maximized when the model is most uncertain about that problem.

Imagine a student. For 1+1=2, they get it right every time; practicing it a hundred more times won't teach them anything new. For the Riemann Hypothesis, they understand nothing; looking at it a hundred more times will be futile. They learn fastest from problems they feel they almost understand but not quite, where they get it right sometimes and wrong sometimes.

This "sometimes right, sometimes wrong" state, in information theory, is the state of maximum entropy.

For a given problem, a model's answer has only two outcomes: correct or incorrect. When the model, after multiple attempts, has a 50% probability p_c(q) of answering correctly, its uncertainty is at its peak, and entropy is maximized. This point is the model's "learning sweet spot" or crucial learning frontier.
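For a binary correct/incorrect outcome, this is simply the entropy of a Bernoulli variable, H(p_c) = -p_c·log(p_c) - (1 - p_c)·log(1 - p_c), which reaches its maximum of log 2 exactly at p_c = 0.5.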

The core of the MGPO framework is to dynamically identify these most challenging problems for the model and guide it to prioritize learning resources toward them.

It achieves this through an entropy-biased regularization weighting scheme. This scheme calculates the distance (using KL divergence) between the model's current performance (its probability of answering correctly, p_c(q)) and the ideal maximum entropy state (p_0 = 0.5).

The farther the distance (meaning the model either masters the problem too well or doesn't understand it at all), the lower the assigned weight; the closer the distance (model performance is near the 50% fluctuating state), the higher the assigned weight.
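One way such an entropy-biased weight could look in practice (a sketch of the stated idea, not necessarily the exact formula used by MGPO):

```python
import math

def entropy_biased_weight(p_correct: float, lam: float = 5.0, eps: float = 1e-6) -> float:
    """Down-weight problems far from the maximum-entropy state p = 0.5.

    Uses the KL divergence between Bernoulli(p_correct) and Bernoulli(0.5):
    zero divergence (p_correct = 0.5) gives weight 1; problems the model always
    or never solves get weights near 0. `lam` controls how sharply the weight decays.
    """
    p = min(max(p_correct, eps), 1.0 - eps)
    kl = p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)
    return math.exp(-lam * kl)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p_correct={p:.2f} -> weight={entropy_biased_weight(p):.3f}")
# the weight peaks at p_correct = 0.5 (the learning sweet spot) and decays toward 0.0 and 1.0
```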

The model automatically focuses its attention on those ambiguous areas where it is most likely to achieve breakthroughs.

In this way, MGPO ensures that every computational resource is spent effectively, greatly improving learning efficiency, allowing the model to quickly lock onto and amplify the strongest signal from the broad spectrum provided by the SFT stage.

Performance That Challenges Industry Consensus

VibeThinker-1.5B delivered a groundbreaking performance across a series of authoritative benchmarks covering mathematics, coding, and knowledge domains.

The evaluation benchmarks include:

Mathematics: MATH-500, the highly challenging Harvard-MIT Mathematics Tournament (HMMT 2025), and the American Invitational Mathematics Examination (AIME 2024 and AIME 2025).

Coding: LiveCodeBench V5 and V6, assessing general programming ability.

Knowledge: GPQA-Diamond, a graduate-level test containing PhD-level questions in biology, physics, and chemistry.

VibeThinker-1.5B was compared with models of similar scale, specifically those with fewer than 3 billion parameters.

(Table: benchmark scores of VibeThinker-1.5B versus its base model and other models under 3 billion parameters)

The data in the table clearly shows that VibeThinker-1.5B has undergone a transformative evolution compared to its base model (Qwen2.5-Math-1.5B).

On AIME25, the score soared from 4.3 to 74.4; HMMT25 improved from 0.6 to 50.4; and LiveCodeBench V5 broke through from 0 to 55.9.

More importantly, VibeThinker-1.5B not only surpassed its parameter-equivalent competitors but also overwhelmed larger models.

Its score on AIME25 (74.4) is more than double that of the 3-billion parameter SmolLM (36.7). The advantage is similarly huge on HMMT25 (50.4 vs 26.0) and LiveCodeBench V5 (55.9 vs 27.6). This undeniably establishes its dominant position among models below the 3-billion parameter class.

It was then pitted against large reasoning models, and even proprietary models from industry giants, whose parameter scales are 10 to hundreds of times larger than VibeThinker-1.5B's.

(Table: benchmark scores of VibeThinker-1.5B versus large reasoning models and proprietary models)

The results are astounding.

On AIME25, a highly challenging math benchmark, the 1.5-billion parameter VibeThinker-1.5B (74.4 points) defeated the 671-billion parameter DeepSeek R1 (70.0 points) and nearly tied with OpenAI's o3-mini-Medium (74.8 points) and MiniMax-M1 (74.6 points).

On HMMT25, its performance (50.4 points) also surpassed DeepSeek R1 (41.7 points).

This result directly challenges the industry's cornerstone belief that reasoning ability is strongly correlated with parameter scale.

It eloquently proves that with sophisticated algorithmic design and training strategies, a small-scale model has the full potential to achieve or even surpass the performance of giant models hundreds of times its size on complex logical reasoning tasks.

In coding tasks, VibeThinker-1.5B has a slightly larger gap compared to top-tier large models, primarily due to its base model's stronger focus on mathematical data.

In broad knowledge question-answering like GPQA, the gap is even more pronounced. This indicates that small-parameter models may indeed have inherent physical limitations in storing and processing massive, encyclopedic general knowledge.

To further highlight its specialization in reasoning, VibeThinker-1.5B was also compared with top general-purpose large models such as Kimi K2, DeepSeek V3, and GPT-4.1.

These models often have parameters in the hundreds of billions or even trillions. Although they have also been trained on mathematical and coding data, their design goal is general conversation, not specialized Chain-of-Thought (CoT) reasoning.

(Table: benchmark scores of VibeThinker-1.5B versus large general-purpose models)

On mathematical benchmarks, it significantly outperformed all these trillion-parameter general models. This powerfully demonstrates that for tasks requiring deep logical reasoning, a specialized "small and beautiful" model can be far more effective than a general "large and comprehensive" model.

Cost and Credibility: The Final Pieces of the Puzzle

VibeThinker-1.5B's achievements are not only in performance but also in its extreme cost-effectiveness.

The entire post-training process (including SFT and RL stages) only cost approximately 3900 GPU hours on NVIDIA H800 GPUs. According to market rental prices at the time, the total computational cost was less than $8,000.
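(For a rough sanity check: at an assumed H800 rental rate of about $2 per GPU-hour, 3,900 GPU-hours works out to roughly $7,800, consistent with the under-$8,000 figure.)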


Achieving, for under $8,000, a level of performance that typically requires $300,000 to $500,000 in training compute represents an astonishing cost advantage of 30 to 60 times.

This massive cost advantage means that powerful AI reasoning capabilities are no longer the exclusive domain of a few giants. It enables more small and medium-sized companies, research institutions, and universities to participate in cutting-edge AI development, greatly promoting the democratization of AI research.

Furthermore, in terms of inference deployment cost, a 1.5 billion parameter model can easily run on edge devices like mobile phones and cars. Its service cost is reduced by 20 to 70 times compared to giant models, paving the way for the widespread adoption of AI applications.

Of course, for any model that shows astonishing performance, a crucial question must be answered: was the data contaminated? Did the model merely memorize answers rather than truly learning to solve problems?

The VibeThinker-1.5B team adopted strict data de-contamination measures, using methods such as 10-gram matching to ensure no semantic overlap between training data and evaluation test sets.
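A minimal sketch of what word-level 10-gram matching could look like (illustrative only; the team's actual decontamination pipeline and tokenization are not detailed here):

```python
def ngrams(text: str, n: int = 10) -> set:
    """Word-level n-grams of a text, lowercased for matching."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_example: str, test_questions: list[str], n: int = 10) -> bool:
    """Flag a training example that shares any 10-gram with an evaluation question."""
    train_grams = ngrams(train_example, n)
    return any(train_grams & ngrams(q, n) for q in test_questions)

# Training examples flagged by a check like this would be dropped from the SFT/RL data.
```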

More compelling evidence comes from the timeline.

VibeThinker-1.5B's base model was released in September 2024. However, the AIME25 and HMMT25 benchmarks, where it performed exceptionally well, were only publicly released in 2025. This means these test questions could not possibly have been present in its base model's training data.

Additionally, its base model scored 0 on coding tasks, while VibeThinker-1.5B increased its score to over 50 through post-training. These "from scratch" leaps in capability strongly prove that its performance improvement stems from innovative training methods, not data leakage.

VibeThinker-1.5B demonstrates that in the core cognitive domain of logical reasoning, sophisticated algorithmic design can surpass brute-force parameter stacking.

References:

https://github.com/WeiboAI/VibeThinker

https://arxiv.org/abs/2511.06221

https://huggingface.co/WeiboAI

https://modelscope.cn/organization/WeiboAI
