SLOT: Sample-Specific Inference Optimization Tool Arrives, Boosting Accuracy by 10% Without SFT or RL


Recently, while much of the field is still debating which labels and rewards to use for training large models, and which baselines make for fair comparison, Westlake University's MAPLE Lab has found a different approach: if LLMs perform poorly on complex instructions and normally need a separate SFT or RL stage to fix that, why not let the model "learn" the specific problem temporarily at inference time? This seemingly "outrageous" idea has led to astonishing performance improvements.

Imagine if you were taking an exam and could spend a few seconds "adapting" to each specific question before answering – wouldn't your performance be better?

This is precisely the core idea proposed by the Westlake University research team in their latest paper. Their SLOT (Sample-specific Language Model Optimization at Test-time) method treats each input prompt as a miniature training sample, letting the model "learn" and understand that specific problem before generating an answer.

Even more surprisingly, this method is ridiculously simple:

Optimizes only a lightweight delta parameter vector (modifying only the last-layer features)

Needs only a few steps of gradient descent (e.g., 3)

Adds almost negligible computational overhead (only a 7.9% increase in inference time)

Is completely plug-and-play, with no modification to the original model required


Paper Title: SLOT: Sample-specific Language Model Optimization at Test-time

Paper Address: https://arxiv.org/pdf/2505.12392

GitHub Address: https://github.com/maple-research-lab/SLOT

Explosive Results

Multiple Benchmark Records Broken

Even when compared against the strongest high-performance baselines, the experimental results are remarkable (all experiment logs are available in the open-source GitHub repository):

Qwen2.5-7B's accuracy on the GSM8K math reasoning task soared from 57.54% to 66.19%, an increase of 8.65 percentage points.

DeepSeek-R1-Distill-Llama-70B achieved 68.69% on GPQA Diamond, setting a new record for 70B-level open-source models.

On the highly challenging AIME 2024 math competition problems, multiple models achieved improvements of over 10%.


Core Innovation

Treating Prompt as a "Test-time Training Sample"

Traditional LLMs often "fail" when encountering complex or specially formatted instructions, potentially ignoring formatting requirements or giving incorrect answers.

SLOT's solution is elegant and simple: for a single problem, directly add a delta vector to the last layer features and minimize the cross-entropy loss on the problem prompt itself.

Since only an additive delta vector on the last layer needs to be optimized, each problem requires just one forward pass through the network: by caching the intermediate features fed into the last layer, optimizing delta incurs almost no additional computational overhead.
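In symbols, the objective looks roughly like this (notation ours for illustration, not taken from the paper): given prompt tokens $x_1, \dots, x_n$ with cached last-layer features $h_1, \dots, h_n$, SLOT solves

$$
\delta^* = \arg\min_{\delta} \; -\frac{1}{n-1} \sum_{t=1}^{n-1} \log p_\theta\!\left(x_{t+1} \mid h_t + \delta\right),
$$

where $p_\theta$ applies the frozen LM head to the shifted features. A few gradient steps on $\delta$ alone approximate the minimizer, and $h_t + \delta^*$ is then used in place of $h_t$ when generating the answer.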


The method is so simple that the code itself is the clearest explanation. Here is how to apply SLOT to your own work using the Transformers version of the code (a vLLM version is also open-source).

Taking Qwen2ForCausalLM in modeling_qwen2.py as an example, the research team inserts the following logic right after the forward function obtains hidden_states: first, initialize an all-zero delta vector and add it to the last-layer hidden states; then, treating the current prompt as training data and delta as the only learnable parameter, optimize delta with a cross-entropy loss to obtain a sample-specific value; finally, use the optimized delta when generating the subsequent tokens.
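For intuition, here is a minimal, self-contained sketch of that loop (ours, not the authors' exact code; the function and variable names are ours, and the official Transformers and vLLM implementations live in the GitHub repo above):

```python
import torch
import torch.nn.functional as F

def slot_delta(hidden_states, lm_head, input_ids, steps=3, lr=0.01):
    """Optimize a sample-specific delta on cached last-layer features.

    hidden_states: [1, seq_len, d] last-layer features of the prompt (frozen)
    lm_head:       the model's output projection (frozen in practice)
    input_ids:     [1, seq_len] prompt token ids -- the "training data"
    """
    h = hidden_states.detach()  # cached: the backbone is never re-run
    delta = torch.zeros(1, 1, h.size(-1), device=h.device, requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):  # a few gradient steps, e.g. 3
        logits = lm_head(h + delta)  # delta broadcasts over all positions
        # next-token cross-entropy on the prompt itself
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```

During generation, the same delta is added to the last-layer hidden states of every new token before the LM head, so the per-sample adaptation carries through the whole answer.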


Why is it so effective?

In-depth Analysis Reveals the Secret

Through analysis, the research team found that the SLOT-optimized delta significantly reshapes the model's output distribution over the vocabulary:


Boosted tokens: reasoning, think, thinking, and other reasoning-related words

Suppressed tokens: digits (0-9), modal verbs (should, will), and the end-of-sequence token </s>

This means SLOT encourages the model to "think deeply" and avoid premature ending of reasoning or falling into superficial pattern matching.
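One way to reproduce this kind of vocabulary analysis yourself (a sketch under the same assumptions as the code above; `h`, `delta`, and `lm_head` come from that sketch, and a Hugging Face `tokenizer` is assumed to be in scope):

```python
import torch

# Compare the LM-head distribution at the last prompt position
# with and without the optimized delta.
with torch.no_grad():
    logits_base = lm_head(h[:, -1])                # [1, vocab]
    logits_slot = lm_head(h[:, -1] + delta[:, 0])  # [1, vocab]
shift = (logits_slot - logits_base).squeeze(0)

top = torch.topk(shift, k=10)      # most boosted tokens
bottom = torch.topk(-shift, k=10)  # most suppressed tokens
print("boosted:   ", tokenizer.convert_ids_to_tokens(top.indices.tolist()))
print("suppressed:", tokenizer.convert_ids_to_tokens(bottom.indices.tolist()))
```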

The highlight is that, unlike SFT or RL fine-tuning algorithms, this method does not require:

Modification of model architecture

Additional training data

Complex sampling strategies

Expensive computational resources

Broad Applicability

From 1.5B to 70B, from Foundational Models to Reasoning Experts

SLOT shows stable improvements across various scales and types of models:

Qwen series: Improvements seen from 1.5B to 32B.

Llama series: Including Llama-3.1.

DeepSeek-R1 series: Even models already specialized for reasoning capabilities still achieve significant improvements.

Particularly noteworthy is that SLOT's improvements are most pronounced on the most challenging tasks:

C-Eval Hard subset: +8.55%

AIME 2024: Some models improved by over 13%

GPQA Diamond: improved from 65.66% to 68.69% (open-source SOTA level)

Conclusion

In the era of large models, while everyone is chasing "bigger and stronger," SLOT proves with a ridiculously simple idea that sometimes, letting the model "understand" the problem before answering can lead to astonishing results.

