TPO: A New Approach for On-the-Fly Preference Alignment during Inference
To make large language models (LLMs) behave more in line with human expectations, a series of training-time alignment methods (e.g., RLHF and DPO) perform preference optimization by fine-tuning model parameters. However, this 'training-time alignment' mode is not only time-consuming and labor-intensive, but also requires retraining from scratch whenever preferences change (e.g., updated safety standards), which makes it slow to respond to shifting demands.
Is there a way to skip the tedious retraining and let models align with human preferences quickly, during inference? Recently, Shanghai AI Lab proposed Test-Time Preference Optimization (TPO). In a nutshell: TPO lets an LLM adjust its own output through iterative textual feedback while generating each response, achieving 'plug-and-play' alignment without updating model weights. Unlike RLHF and DPO, which optimize parameters through offline training, TPO completes preference optimization entirely during inference, leaving the model parameters unchanged. The paper shows that TPO is a practical, lightweight alternative that dynamically aligns model outputs with human preferences at inference time.
Paper Title: Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Paper Address: arxiv.org/abs/2501.12895
Hugging Face Address: https://huggingface.co/papers/2501.12895
GitHub Address: https://github.com/yafuly/TPO
TPO = Textual Gradient Descent
The core intuition of TPO is to let the model keep improving its responses based on feedback as it generates, essentially performing 'gradient descent' in textual space. Simply put, the model uses its instruction-following and reasoning capabilities to translate numerical reward signals into readable textual suggestions, which then steer subsequent responses. The whole process requires no explicit gradient computation or weight updates; the output is optimized purely through natural language interaction.
Figure 1 illustrates the three key steps of TPO, simulating a language-based 'gradient descent' process.
As shown in Figure 1, TPO's alignment process involves several steps similar to gradient optimization: the model first generates an initial response, then obtains feedback signals, generates improvement suggestions, and finally updates the response accordingly, repeating iterations as needed. The specific process is as follows:
Candidate Response Generation: Given a user query, the language model first generates multiple initial responses and scores them using a pre-trained reward model. We select the highest-scoring response as 'chosen' and the lowest-scoring response as 'rejected'.
Textual Loss Calculation: Next, the LLM compares the chosen and rejected responses. Through a carefully designed prompt, the model generates a commentary explaining why the chosen response is better and where the rejected one falls short. This is equivalent to computing a 'textual loss': a natural-language description of how far, and why, the current response deviates from human preferences.
Textual Gradient Calculation: Then, a new prompt asks the model to propose improvement suggestions based on the commentary. These suggestions can be seen as 'textual gradients' for the response – indicating how to adjust the response to better meet preferences.
Update Response: Finally, the model consults these textual suggestions to generate one or more improved responses. The new responses typically address the previously identified weaknesses, which is equivalent to taking a step along the textual gradient to update the output.
Through this cycle, the model's output is gradually 'polished' to better satisfy the reward model (i.e., the proxy for human preferences). The process mirrors the three steps of traditional gradient descent: compute the loss → compute the gradient → update the parameters, except that in TPO all three steps are carried out by the model at the textual level. Unlike numerical optimization methods that directly modify model weights, TPO optimizes the output content while keeping the parameters fixed, which makes it safer and more controllable. In a sense, TPO lets the model perform a small round of 'self-training' at inference time, using natural language feedback to unlock the potential of the pre-trained model itself.
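To make the loop concrete, here is a minimal Python sketch of the cycle described above. It is an illustrative reconstruction rather than the authors' implementation: `llm_generate` and `reward_score` are hypothetical stand-ins for an LLM sampling call and a pre-trained reward model, and the prompts are simplified paraphrases, not the paper's actual templates.

```python
def tpo(query, llm_generate, reward_score, width=5, depth=2):
    """Sketch of test-time preference optimization via textual feedback.

    llm_generate(prompt) -> str and reward_score(response) -> float are
    assumed interfaces; model weights are never updated.
    """
    # Step 1: sample `width` candidate responses for the query.
    pool = [llm_generate(query) for _ in range(width)]

    for _ in range(depth):
        # Score candidates with the reward model; pick chosen / rejected.
        ranked = sorted(pool, key=reward_score)
        rejected, chosen = ranked[0], ranked[-1]

        # Step 2: "textual loss" -- explain why the chosen response beats
        # the rejected one and where the rejected one falls short.
        critique = llm_generate(
            f"Query: {query}\n"
            f"Preferred response: {chosen}\n"
            f"Rejected response: {rejected}\n"
            "Explain why the preferred response is better and what the "
            "rejected one is missing."
        )

        # Step 3: "textual gradient" -- turn the critique into concrete
        # improvement suggestions.
        suggestions = llm_generate(
            f"Critique:\n{critique}\n"
            "List concrete suggestions for writing a better response."
        )

        # Step 4: update -- regenerate candidates conditioned on the
        # suggestions, i.e. take one step along the textual gradient.
        pool = [
            llm_generate(
                f"Query: {query}\n"
                f"Previous best response: {chosen}\n"
                f"Suggestions: {suggestions}\n"
                "Write an improved response."
            )
            for _ in range(width)
        ]

    # Return the highest-scoring response after the final round.
    return max(pool, key=reward_score)
```

In the paper's terms, `width` corresponds to the sampling width N and `depth` to the number of optimization iterations D, both discussed in the scaling section below.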
Alignment Effect and Performance
The authors evaluated TPO on multiple benchmarks, covering tasks ranging from instruction following (e.g., AlpacaEval, Arena) and preference alignment (the HH-RLHF dataset) to safety (BeaverTails and XSTest) and mathematics (MATH-500). The results show that with only a few iteration steps (e.g., two rounds of TPO optimization), both originally unaligned base models and models already aligned with RLHF achieve significant performance improvements.
Figure 2 shows how TPO improves model output quality during inference (vertical axis: reward model score; horizontal axis: TPO iteration steps).
As shown in Figure 2, over the TPO iterations the reward-score curve of the unaligned model (SFT) rises steadily and eventually surpasses that of the aligned model (Instruct); the dashed line in the figure marks the fixed score baseline of the model without TPO. Meanwhile, even for models that have already been aligned (Instruct models), TPO further improves output quality.
Figure 3: TPO performance on unaligned models (SFT).
It is particularly noteworthy that a Llama-3.1-70B-SFT base model, which had not undergone any preference training, surpassed its reinforcement learning-aligned counterpart, Llama-3.1-70B-Instruct, in preference scores across almost all evaluation benchmarks after just two steps of TPO optimization.
Figure 4: TPO performance on aligned models.
Furthermore, on models that have already been aligned, TPO can further improve their performance across various tasks without additional training.
A Combined 'Width + Depth' Test-Time Scaling Paradigm
A core advantage of TPO is that it not only achieves instant alignment during inference, but also provides a flexible, adjustable 'width + depth' test-time scaling strategy: by controlling the number of candidates generated per round (width) and the number of optimization iterations (depth), it can significantly improve output quality and preference consistency.
This matters in practice: we often do not want to, or simply cannot, generate dozens of candidates up front (e.g., BoN-60), for instance due to memory limits. Achieving comparable quality through step-by-step optimization at a smaller resource cost is far more practical.
The paper systematically analyzes the roles of width and depth:
Sampling width (N) determines the diversity of answers available for selection in each round of optimization. A larger width means richer candidates and makes it easier to obtain a high-quality starting point, but costs more memory;
Optimization depth (D) controls how many rounds TPO spends refining the output. Greater depth gives the model more opportunities to incorporate feedback and improve its generation, but adds iteration time;
Width and depth are complementary: width accelerates convergence, and depth enhances refinement. Together they achieve better results while keeping costs controllable (a rough budget comparison is sketched below).
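As a rough illustration of this trade-off, the snippet below counts per-query generation calls under the sketch shown earlier (one plausible accounting, not necessarily the paper's exact bookkeeping) and compares a D2-N5 configuration with Best-of-N sampling:

```python
def tpo_generation_calls(width: int, depth: int) -> int:
    # Initial candidates, plus per round: `width` new candidates and
    # two feedback generations (textual loss + textual gradient).
    return width + depth * (width + 2)


def best_of_n_calls(n: int) -> int:
    # Best-of-N samples all candidates up front and keeps the best one.
    return n


print(tpo_generation_calls(width=5, depth=2))  # 19 calls for D2-N5
print(best_of_n_calls(60))                     # 60 calls for BoN-60
```

Besides the smaller call count, TPO only ever holds `width` candidates at a time, whereas drawing all 60 BoN samples in a single batch creates exactly the memory pressure mentioned above.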
Figure 5: Left: Impact of search width on TPO; Right: TPO's win rate against BoN.
As shown in Figure 5, the left plot displays TPO's test-time optimization curves on the HH-RLHF dataset under different width settings. From N=5 to N=20, TPO's performance keeps improving and clearly outperforms the 'Sequential Revision' baseline, which simply revises its previous response round after round. Even more striking: just two rounds of TPO with 5 responses per round (D2-N5) were enough to surpass the Best-of-N strategy that samples 60 candidates (BoN-60).
This suggests that rather than exhaustively generating a large pool of candidates from the start, it is better to perform 'smart iterations' guided by feedback. TPO's combination of width and depth is essentially an efficient test-time optimization method, offering a new path for LLMs to unlock performance in resource-constrained environments.
Summary and Outlook: Inference can also be the starting point for alignment
TPO demonstrates a lightweight, flexible, and interpretable new paradigm: without tuning any parameters, and using only natural language feedback, it achieves preference optimization at inference time. Compared to training-time alignment methods, TPO's computational overhead is very small. By further improving already aligned models and enabling 'plug-and-play' rapid improvement of unaligned models, TPO not only lowers the barrier to alignment but also pushes the boundaries of LLM inference-time capability.
More importantly, the idea behind TPO generalizes: express the optimization process in natural language, then let the model understand and execute it autonomously. This offers a general path toward future LLM controllability, safety, and even personalized customization.
Looking ahead, we believe TPO is just the beginning. Optimization, debugging, and feedback mechanisms at inference time still hold great potential, and the ability of large language models to 'understand feedback and revise output' will be developed further in the process.
Alignment is not necessarily the end point of training; it can also be the starting point of inference.