A New Revolution in Reward Models! SWIFT Reads the "Inner Voice" Instead of the Text, Creating a Faster, Stronger, and More Cost-Effective AI Judge

Introduction

You might have heard of a method called "Best-of-N" (picking the best out of N). Simply put, the model generates N different answers to a problem, and then a "judge" picks the best one. The method works well, but there is a catch: this "judge" is far too expensive!
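To make the setup concrete, here is a minimal Best-of-N sketch in Python; `generate` and `score` are placeholders for any LLM sampler and any judge, not names from the paper:

```python
def best_of_n(prompt, generate, score, n=8):
    """Generate N candidate answers and let a judge pick the best one."""
    candidates = [generate(prompt) for _ in range(n)]  # N independent samples
    return max(candidates, key=score)                  # the judge's top pick
```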

Typically, this "judge" is itself a massive neural network, known as a "Reward Model". Training and running it takes enormous compute and data, a textbook case of brute-force scaling. This is not only costly but also slow, which greatly limits how widely the technique can be deployed.

Is there a smarter, more efficient way? Researchers from Shanghai Jiao Tong University, National University of Singapore, and the University of Illinois Chicago have provided a striking answer. They published a paper titled "Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling", proposing a brand-new lightweight technique – "SWIFT".

[Figure: overview diagram from the SWIFT paper]

Paper: https://arxiv.org/abs/2505.12225

Code: https://github.com/aster2024/SWIFT


Dilemmas of Traditional Methods

Imagine you ask a student (LLM) to solve a complex math problem, and to ensure accuracy, you ask them to write down 8 different solutions (Best-of-8).

The traditional method is like hiring a costly team of PhD experts (large reward models) to review each of the 8 solutions (the generated text) and score them to pick the best answer. The team is professional, but hiring them is prohibitively expensive, and every round of review takes a long time.

These are the pain points of existing Best-of-N methods:

● Massive Parameter Count: Reward models often have billions or even tens of billions of parameters, comparable to another LLM.

● Data Hunger: They require vast amounts of labeled data to train the "judge's" discernment.

● Computationally Expensive: Both training and inference (scoring) consume enormous computational resources and time.

SWIFT's Ingenious Approach

SWIFT takes a different path. It asks: why spend a fortune on "external help" when you can listen to the student's own "inner thoughts"?

As an LLM generates an answer step by step, it produces a large number of "Hidden States" internally. You can think of these as the model's "thought process" or "neural signals" at each moment. These signals contain rich information about the model's uncertainty and confidence regarding the currently generated content.
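If you run an open-weights model yourself, these signals are easy to tap. Here is a minimal sketch using the Hugging Face `transformers` library (the model name is just a small public example, not the one used in the paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("2 + 2 = 4 because", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple with one tensor per layer (plus the embedding
# layer), each of shape (batch, seq_len, hidden_dim) — the "neural signals"
# that SWIFT reads instead of the generated text.
states = torch.stack(out.hidden_states)  # (num_layers + 1, batch, seq_len, hidden_dim)
print(states.shape)
```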

SWIFT's core idea is: instead of relying on external text, it directly "listens in" on the model's internal hidden states to judge its confidence in its own reasoning process. Its approach is remarkably clever and efficient:

1. Signal Extraction: For each token in the generated answer, SWIFT extracts its hidden states across all network layers of the LLM.

2. Linear Scoring: It uses an extremely lightweight linear model (just a weight matrix and a bias) to compute two values for each token: a "reward score" and an "importance weight" (a gating value).

3. Weighted Summation: Finally, the tokens' "reward scores" are averaged, weighted by their "importance weights", to produce the final score for the entire answer.

This process is like a master of mind-reading who can not only sense the student's confidence fluctuations at critical steps but also determine which steps are more important to the final answer, thereby providing an accurate evaluation.
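Here is what those three steps can look like in code: a minimal PyTorch sketch of a gated linear scorer. The class and variable names are illustrative, and details such as how the layers are pooled and how the gate is normalized are assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class GatedTokenScorer(nn.Module):
    """Sketch of SWIFT-style scoring: two tiny linear maps over token features."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.reward = nn.Linear(feature_dim, 1)  # per-token "reward score"
        self.gate = nn.Linear(feature_dim, 1)    # per-token "importance weight" (gating value)

    def forward(self, token_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (seq_len, feature_dim), e.g. each token's hidden
        # states pooled across the LLM's layers.
        r = self.reward(token_feats).squeeze(-1)               # reward per token
        g = torch.sigmoid(self.gate(token_feats)).squeeze(-1)  # importance in (0, 1)
        return (g * r).sum() / g.sum().clamp_min(1e-8)         # gated weighted average
```

Scoring an answer is then a single pass through two small linear layers, and Best-of-N reduces to taking the argmax over N such scalars.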

How Impressive is SWIFT? The Data Speaks for Itself!

The proof is in the pudding. Here's how SWIFT performs, as shown by the data:

1. Efficiency: Truly "Moving a Thousand Pounds with Four Ounces"

This is SWIFT's most striking advantage. The researchers presented a compelling comparison in the paper (see table below):

[Table: efficiency comparison between SWIFT and traditional reward models]

That's right, you didn't misread! SWIFT has less than 0.005% as many parameters as a traditional reward model and needs orders of magnitude less training data. In practice, it scores answers hundreds to thousands of times faster, at a tiny fraction of the compute (FLOPs). Tasks that once required high-end server clusters can now run comfortably on personal devices.

2. Accuracy: Not Just Fast, But Stronger!

You might expect such a small model to sacrifice performance. On the contrary! Across multiple standard benchmarks (the MATH and GSM8K mathematical-reasoning sets, code understanding, and more), SWIFT's Best-of-N accuracy consistently surpassed the bulky baseline models.

[Figure: SWIFT's Best-of-N accuracy compared across benchmark test sets]

This demonstrates that "intrinsic signals" are more effective than simply analyzing external text. The LLM's "internal monologue" holds key clues for determining correctness.

3. Flexibility: Diverse Applications, Unlimited Potential

SWIFT's power extends far beyond this:

● Scalability: Give it more training data, and its performance will continue to improve.

● Applicable to Closed-Source Models: For commercial models that do not expose hidden states (like GPT-4), SWIFT can still be trained on their logits (the raw scores that sit behind the output probabilities) and still performs strongly; see the sketch after this list.

● Synergistic Combination: SWIFT can be combined with traditional reward models to further enhance performance, achieving a "1+1 > 2" effect.

● Ultimate Efficiency: It doesn't even need to use hidden states from all LLM layers; using signals from just a few layers can make the model smaller and faster while maintaining high performance.
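For the closed-source case, here is one hedged guess at what logit-based input could look like: per-token confidence signals computed from the top-k log-probabilities that such APIs typically return, fed into the same tiny linear scorer. The function below is hypothetical, not the paper's recipe:

```python
import math

def logprob_features(top_logprobs: dict[str, float], chosen_token: str) -> list[float]:
    """Hypothetical per-token features from API-returned top-k log-probs."""
    vals = sorted(top_logprobs.values(), reverse=True)
    probs = [math.exp(v) for v in vals]
    entropy = -sum(p * math.log(p) for p in probs)        # uncertainty over the top-k (approx.)
    margin = vals[0] - vals[1] if len(vals) > 1 else 0.0  # gap between the top two tokens
    chosen_lp = top_logprobs.get(chosen_token, vals[-1])  # confidence in the sampled token
    return [chosen_lp, entropy, margin]
```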

Conclusion: New Insights for AI Development

SWIFT's emergence undoubtedly provides a new, elegant, and highly efficient paradigm for the development of large language models. It shows that beyond pursuing "bigger and stronger," there is a smarter route: looking inward and unearthing the treasures within the model itself.

This work not only lowers the barrier to using advanced AI, letting more developers and small-to-medium enterprises share in these advances, but also points the way toward greener, more economical, and more efficient AI systems. Perhaps the next evolution of AI won't depend solely on stacking more parameters, but on how deeply we can understand and exploit the model's internal "inner voice."
