While much of the field is still debating which labels and rewards to use for training large models, and which baselines make for a fair comparison, Westlake University's MAPLE Lab has taken a different route. LLMs often handle complex instructions poorly, and fixing that usually requires a separate SFT or RL stage, so why not let the model temporarily "learn" the specific problem at inference time? This seemingly "outrageous" idea has led to astonishing performance improvements.
Imagine taking an exam and being able to spend a few seconds "adapting" to each question before answering. Wouldn't you perform better?
This is the core idea of the Westlake University team's latest paper. Their method, SLOT (Sample-specific Language Model Optimization at Test-time), treats each input prompt as a small piece of training data, letting the model "learn" and understand that specific problem before it generates an answer.
Even more surprisingly, this method is ridiculously simple:
It optimizes only a lightweight delta parameter vector (modifying only the last-layer features)
It needs only a few steps of gradient descent (e.g., 3 steps)
It adds little computational overhead (inference time increases by only 7.9%)
It is completely plug-and-play, with no modification to the original model required
Paper Title: SLOT: Sample-specific Language Model Optimization at Test-time
Paper Link: https://arxiv.org/pdf/2505.12392
GitHub Repository: https://github.com/maple-research-lab/SLOT
Explosive Results
Multiple Benchmark Records Broken
Even when compared against the most challenging high-performance baselines, the experimental results are remarkable (all logs are available in the open-source GitHub):
Qwen2.5-7B's accuracy on the GSM8K math reasoning task soared from 57.54% to 66.19%, an increase of 8.65 percentage points.
DeepSeek-R1-Distill-Llama-70B achieved 68.69% on GPQA Diamond, setting a new record for 70B-level open-source models.
On the highly challenging AIME 2024 math competition problems, multiple models achieved improvements of over 10%.
Core Innovation
Treating Prompt as a "Test-time Training Sample"
Traditional LLMs often "fail" when encountering complex or specially formatted instructions, potentially ignoring formatting requirements or giving incorrect answers.
SLOT's solution is elegant and simple: for a single problem, directly add a delta vector to the last layer features and minimize the cross-entropy loss on the problem prompt itself.
Because only an additive delta vector on the last layer is optimized, each problem requires just one forward pass through the full network. The intermediate results fed into the last layer are cached, so optimizing delta adds almost no computational overhead.
The method is simple enough that the authors skip pseudocode and formulas and go straight to the implementation. SLOT can be applied to your own work through the Transformers version of the code (a vLLM version is also open source).
Taking the Qwen2ForCausalLM model in modeling_qwen2.py as an example, the research team inserts the SLOT code into the forward function, right after hidden_states is obtained. First, an all-zero delta vector is initialized and added to the last hidden states. Next, the current prompt serves as the training data: with delta as the only learnable parameter, a cross-entropy loss is minimized to obtain a sample-specific delta. Finally, the optimized delta is used when generating the subsequent tokens.
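The three steps can be sketched on toy data. This is a minimal NumPy illustration, not the team's Transformers code. The function names (slot_optimize, prompt_loss, logits_with_delta), the learning rate of 0.01, and the use of plain gradient descent with a hand-derived gradient are all assumptions made for the example. Random arrays stand in for the cached last-layer features and the output projection of a real model.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logits_with_delta(hidden, delta, lm_head):
    # delta has shape (d,) and is broadcast over every position of hidden
    return (hidden + delta) @ lm_head.T

def prompt_loss(hidden, delta, lm_head, prompt_ids):
    # Mean next-token cross-entropy on the prompt itself:
    # the feature at position t is used to predict token t + 1.
    probs = softmax(logits_with_delta(hidden[:-1], delta, lm_head))
    targets = prompt_ids[1:]
    return float(-np.log(probs[np.arange(len(targets)), targets]).mean())

def slot_optimize(hidden, lm_head, prompt_ids, steps=3, lr=0.01):
    """hidden: (T, d) cached last-layer features of the prompt (toy stand-in),
    lm_head: (V, d) output projection, prompt_ids: (T,) prompt token ids."""
    h, targets = hidden[:-1], prompt_ids[1:]
    delta = np.zeros(hidden.shape[1])  # step 1: all-zero delta
    for _ in range(steps):             # step 2: a few gradient steps on the prompt
        grad_logits = softmax(logits_with_delta(h, delta, lm_head))
        grad_logits[np.arange(len(targets)), targets] -= 1.0  # softmax minus one-hot
        delta -= lr * (grad_logits @ lm_head).mean(axis=0)    # gradient of the mean loss w.r.t. delta
    return delta

# Toy data (assumption): random features, projection, and token ids.
rng = np.random.default_rng(0)
T, d, V = 6, 8, 16
hidden = rng.normal(size=(T, d))
lm_head = rng.normal(size=(V, d))
prompt_ids = rng.integers(0, V, size=T)

delta = slot_optimize(hidden, lm_head, prompt_ids)
loss_before = prompt_loss(hidden, np.zeros(d), lm_head, prompt_ids)
loss_after = prompt_loss(hidden, delta, lm_head, prompt_ids)

# Step 3: reuse the optimized delta when scoring the next token.
next_token_logits = logits_with_delta(hidden[-1], delta, lm_head)
```

In this sketch the optimization loop touches only the cached features and the output projection, which is why the expensive part of the network runs once per prompt. During generation, the same delta would be added to the last-layer feature of every new token.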
Why is it so effective?
In-depth Analysis Reveals the Secret
The research team found through analysis that the SLOT-optimized delta significantly adjusts the probability distribution of output vocabulary:
Enhanced vocabulary: reasoning, think, thinking, and other reasoning-related words
Suppressed vocabulary: numerical symbols (0-9), modal verbs (should, will), end token </s>
This means SLOT encourages the model to "think deeply" rather than ending its reasoning prematurely or falling back on superficial pattern matching.
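One way to see why delta acts on the vocabulary in this way: if delta is added directly before a linear output projection, the change in logits is the projection of delta alone, independent of the token position. A minimal sketch of ranking tokens by that shift follows. The helper names (logit_shift, rank_tokens), the six-token vocabulary, the identity projection, and the delta values are all hypothetical and hand-picked to mirror the pattern described above. They are not measurements from the paper.

```python
import numpy as np

def logit_shift(lm_head, delta):
    # (h + delta) @ W.T - h @ W.T == W @ delta for any feature h
    return lm_head @ delta

def rank_tokens(shift, vocab, k=2):
    order = np.argsort(shift)                      # ascending by logit change
    boosted = [vocab[i] for i in order[::-1][:k]]  # largest increases first
    suppressed = [vocab[i] for i in order[:k]]     # largest decreases first
    return boosted, suppressed

# Hypothetical toy setup: identity projection so each token's shift equals one delta entry.
vocab = ["think", "reasoning", "should", "7", "</s>", "the"]
lm_head = np.eye(len(vocab))
delta = np.array([0.9, 0.7, -0.4, -0.6, -0.8, 0.0])  # hand-picked, not measured

shift = logit_shift(lm_head, delta)
boosted, suppressed = rank_tokens(shift, vocab)
```

With a real model, lm_head would be the model's output projection matrix and vocab the tokenizer's vocabulary. The ranking only holds exactly if nothing nonlinear sits between delta and the logits.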
The highlight is that, unlike SFT or RL fine-tuning algorithms, this method does not require:
Modification of model architecture
Additional training data
Complex sampling strategies
Expensive computational resources
Broad Applicability
From 1.5B to 70B, from Foundational Models to Reasoning Experts
SLOT shows stable improvements across various scales and types of models:
Qwen series: Improvements seen from 1.5B to 32B.
Llama series: Including Llama-3.1.
DeepSeek-R1 series: Even models already specialized for reasoning capabilities still achieve significant improvements.
Particularly noteworthy is that SLOT's improvements are most pronounced on the most challenging tasks:
C-Eval Hard subset: +8.55%
AIME 2024: Some models improved by over 13%
GPQA Diamond: Improved from 65.66 to 68.69 (open-source SOTA level)
Conclusion
In the era of large models, when everyone is pursuing "bigger and stronger," SLOT shows with a "ridiculously simple" idea that sometimes letting the model "understand" the problem before answering can lead to astonishing results.