GPT models becoming more conservative? Stanford Manning team proposes Verbalized Sampling to make models "think a bit more"


Source | PaperWeekly

Over the past two years, almost all aligned large language models, from GPT-4 to Claude to DeepSeek, have exhibited similar symptoms: increasingly similar responses, a more uniform tone, and diminished creativity. No matter how large the model or how refined the training, they all seem to be pushed toward the same "average answer."

Researchers from Northeastern University, Christopher Manning's group at Stanford, and West Virginia University observed that this is not algorithmic degradation, but a systematic contraction introduced in the post-training phase: the more "safely aligned" a model becomes, the more homogeneous its outputs tend to be.

To address this, they propose a new method that requires no retraining—Verbalized Sampling (VS). It doesn't alter model parameters but instead uses prompts to make the model explicitly state its internal probability distribution when generating multiple candidate answers.

For example: "Please generate 5 possible answers and, for each one, state the probability you assign to it."

In this way, the model no longer returns only the single "most likely answer"; it directly reveals the range of answers it considers plausible, together with how much weight it gives each of them.

In systematic evaluations, VS improved output diversity on creative writing tasks by 1.6–2.1×, increased human evaluation scores by 25.7%, and recovered approximately 66.8% of the base model's pre-alignment diversity, all without any additional training.


Paper Title: Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Paper Link: https://arxiv.org/abs/2510.01171

Project Homepage: https://www.verbalized-sampling.com/

Code Link: https://github.com/CHATS-lab/verbalized-sampling

Research Background

In the post-training phase, language models are typically aligned through RLHF to make generated results more consistent with human judgment. However, this process is not neutral.

The authors point out that selection behavior in preference annotation implicitly introduces a Typicality Bias—that is, annotators tend to choose answers that are linguistically more familiar and natural, rather than solely based on factual or logical correctness.

Once absorbed by the model, this bias is continually amplified in reward modeling. To formally analyze this process, the paper defines the following reward function:

$$\tilde{R}(y \mid x) = R(y \mid x) + \alpha \log p_B(y \mid x) \tag{1}$$

where R(y|x) is the true task utility, p_B(y|x) is the reference distribution of the base model, α is the strength of the human preference for typicality, and Z_x is a normalization constant (it appears in the closed-form solution in equation (3) below).

When α > 0, the model systematically prefers outputs deemed high-likelihood by the base model during reward learning.

Under the standard RLHF framework, the optimization objective for policy π is:

$$\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[\tilde{R}(y \mid x)\big] - \beta \, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, p_B(\cdot \mid x)\big) \tag{2}$$

where β is the KL-regularization coefficient, controlling how far the policy may deviate from the base distribution (i.e., the exploration intensity).

Substituting equation (1) into equation (2) yields the closed-form optimal solution:

$$p_{\pi}(y \mid x) = \frac{1}{Z_x} \, p_B(y \mid x)^{1 + \alpha/\beta} \exp\!\left(\frac{R(y \mid x)}{\beta}\right) \tag{3}$$

This equation reveals the mathematical mechanism of preference amplification: when α > 0, the optimal policy p_π(y|x) raises the reference distribution p_B(y|x) to the power 1 + α/β > 1, so probability mass concentrates further in high-likelihood regions and the generative distribution is sharpened. As α increases, the entropy of the distribution keeps decreasing and model outputs gradually converge.
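For readers who want to see where the exponent comes from, the substitution is a short calculation (standard KL-regularized RLHF algebra, written here in the article's notation):

\begin{aligned}
p_\pi(y \mid x)
  &\propto p_B(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,\tilde{R}(y \mid x)\Big)
    && \text{(maximizer of objective (2))} \\
  &= p_B(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\big(R(y \mid x) + \alpha \log p_B(y \mid x)\big)\Big)
    && \text{(substitute reward (1))} \\
  &= p_B(y \mid x)^{1 + \alpha/\beta}\,\exp\!\Big(\tfrac{R(y \mid x)}{\beta}\Big).
\end{aligned}

Dividing by the normalizer Z_x gives equation (3).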

The paper further validates this mechanism through empirical analysis of preference datasets like HELPSTEER: even when controlling for correctness, human annotators significantly prefer responses with higher base model likelihood. The typicality weight α estimated through linear fitting ranges approximately from 0.5–0.7, indicating this bias is prevalent.

Figure 1. When α > 0, the reward function causes the distribution to be exponentially sharpened, and the output space contracts into a few high-likelihood modes.

Thus, the authors propose a core argument: mode collapse is not an algorithmic accident or an optimization flaw, but an inherent product of human preference data. Typicality bias causes language models to gradually lose generative diversity in their pursuit of being "more human-like."

Verbalized Sampling Mechanism

Having established how typicality bias sharpens the distribution, the authors propose Verbalized Sampling (VS), an inference-time sampling method in which the model verbalizes its own distribution. The goal is not to retrain the model, but to have it actively express the diversity it perceives while generating.

2.1 Core Idea

Traditional sampling draws randomly from the model's internal logits distribution. A higher temperature T yields a flatter distribution and greater diversity; a lower T concentrates the outputs. However, temperature adjustment is merely mathematical noise control; it does not change what the model "thinks," and the model still never makes explicit where its uncertainty lies.
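To make the contrast concrete, here is a minimal sketch of what temperature does to a single sampling step over a logit vector (illustrative only; real decoders apply this token by token, and the function name is ours):

import numpy as np

def temperature_sample(logits, T=1.0, rng=None):
    """Draw one index from softmax(logits / T): higher T flattens the distribution, lower T sharpens it."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / T
    probs = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Same logits, different temperatures: T=0.3 almost always picks the top token, T=1.5 spreads the picks.
logits = [2.0, 1.0, 0.5]
print([temperature_sample(logits, T=0.3) for _ in range(5)])
print([temperature_sample(logits, T=1.5) for _ in range(5)])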

The key to Verbalized Sampling is to make the model verbalize this distribution using language. Researchers use a simple prompt to ask the model:

Generate N possible responses to the question below. For each response, assign a probability that represents how likely you think it is correct or reasonable.

The model is guided to generate N candidate answers and provide an explicit probability for each. For example:

(1) Answer A — “I think this is likely correct with probability 0.6.”

(2) Answer B — “… probability 0.25.”

(3) Answer C — “… probability 0.15.”

These verbalized probabilities are estimated by the model itself and then normalized into a set of actionable sampling weights. The final output is no longer a sample randomly drawn from hidden logits but is resampled from the distribution "declared" by the model itself.
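As a rough sketch of that post-processing step (our own illustration, not the authors' reference implementation; extracting the (text, probability) pairs from the model's reply is assumed to have happened already):

import random

def resample_from_verbalized(candidates, rng=None):
    """candidates: list of (text, verbalized_probability) pairs extracted from the model's reply."""
    rng = rng or random.Random()
    texts, stated = zip(*candidates)
    total = sum(stated)
    weights = [p / total for p in stated]              # normalize the model-stated probabilities into sampling weights
    return rng.choices(texts, weights=weights, k=1)[0] # resample from the distribution the model declared

candidates = [("Answer A", 0.6), ("Answer B", 0.25), ("Answer C", 0.15)]
print(resample_from_verbalized(candidates))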

Figure 2. Model asked to generate multiple candidates and verbalize the probability of each answer.

2.2 Language-based Self-Calibration

Through this process, the model performs a "linguistic calibration" during generation: it needs to simultaneously judge "what possible answers exist" and "how confident I am in each of them."

The authors found that these verbalized probabilities are highly correlated with the model's internal confidence – when the model self-assesses 70% certainty, its actual correctness rate is often close to 0.7. Therefore, VS not only restores diversity but also improves the consistency of generative confidence.
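If you want to sanity-check that kind of calibration claim on your own task, a generic reliability check (not the paper's evaluation code) is to bin answers by their verbalized probability and compare each bin's average stated confidence with its observed accuracy:

from collections import defaultdict

def calibration_bins(records, n_bins=10):
    """records: list of (verbalized_probability, is_correct) pairs; returns per-bin (confidence, accuracy, count)."""
    bins = defaultdict(list)
    for p, correct in records:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, correct))
    table = []
    for b in sorted(bins):
        ps, cs = zip(*bins[b])
        table.append((sum(ps) / len(ps), sum(cs) / len(cs), len(cs)))
    return table

# Toy data: a well-calibrated model's 0.7-confidence answers should be right about 70% of the time.
toy = [(0.7, True), (0.7, True), (0.7, False), (0.2, False), (0.2, False), (0.2, True)]
for conf, acc, n in calibration_bins(toy):
    print(f"stated confidence {conf:.2f} -> observed accuracy {acc:.2f} (n={n})")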

The researchers further proposed an "upper bound constraint strategy": when a verbalized probability exceeds a certain threshold (e.g., 0.3), it is re-normalized to encourage the model to allocate more weight to tail-end candidates. This constraint is equivalent to lowering the "sharpening exponent" at the linguistic level, thereby effectively countering the distribution concentration caused by α > 0 in the background section.
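Read operationally, that constraint can be as simple as clipping each verbalized probability at the cap before renormalizing, which shifts weight from the head candidate to the tail. A sketch of one such implementation follows (the 0.3 cap is the article's example value, not a prescribed constant):

def cap_and_renormalize(candidates, cap=0.3):
    """candidates: list of (text, verbalized_probability); cap: per-candidate upper bound before renormalizing."""
    capped = [(text, min(p, cap)) for text, p in candidates]
    total = sum(p for _, p in capped)
    return [(text, p / total) for text, p in capped]

# The dominant candidate's weight drops from 0.60 to about 0.43, and the tail candidates gain accordingly.
print(cap_and_renormalize([("Answer A", 0.6), ("Answer B", 0.25), ("Answer C", 0.15)]))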

2.3 Comparison with Temperature Sampling

In experiments, the authors systematically compared VS with traditional temperature sampling. Results show that VS can significantly enhance generative diversity without compromising factual correctness and safety. In Creative Writing tasks, VS achieved a 1.6–2.1× improvement in diversity; human evaluation metrics increased by 25.7%, and 66.8% of the base model's original distribution entropy was recovered.

Figure 3. VS achieves a more robust balance between diversity and factuality through verbalized distribution resampling.

2.4 Implementation Features

Verbalized Sampling is completed entirely at the inference stage: no retraining, no parameter modification, no additional reward models required. Its implementation only requires adding an instruction to the prompt template, enabling the model to verbalize probabilities during generation and sample accordingly. This process is not only lightweight and interpretable but can also be directly combined with any aligned language model.
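As a minimal sketch of that prompt-template change (the wording is adapted from the examples later in this article; the function name and default values are ours):

def vs_prompt(user_query, k=5, prob_threshold=None):
    """Wrap a user query in a Verbalized Sampling instruction."""
    instruction = (
        f"Generate {k} responses to the user query, each within a separate <response> tag. "
        "Each <response> must include a <text> and a numeric <probability>."
    )
    if prob_threshold is not None:
        instruction += (
            " Please sample at random from the tails of the distribution, "
            f"such that the probability of each response is less than {prob_threshold}."
        )
    return f"{instruction}\n<user_query>{user_query}</user_query>"

print(vs_prompt("Write a short story about a bear.", prob_threshold=0.10))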

Experimental Results

Verbalized Sampling (VS) has been systematically validated across multiple open-ended generation tasks, showing that it significantly enhances output diversity without sacrificing factuality and safety. Experiments covered typical scenarios such as creative writing, open-ended QA, social simulation, and synthetic data generation, all conducted under identical model and prompt conditions to ensure fair comparison.

3.1 Creative Writing

Across three tasks—poetry, stories, and jokes—VS-Standard and its variants (VS-CoT, VS-Multi) significantly improved semantic diversity (approx. 1.6–2.1×) compared to baselines like Direct / Sequence, and yielded a 25.7% gain in human evaluation scores.

Simultaneously, VS-CoT / VS-Multi are closer to the Pareto front in terms of the "diversity-quality" trade-off; by setting a probability threshold in the prompt, diversity can be adjusted as needed (lower threshold means bolder exploration).

Figure 4. a–c show average semantic diversity comparisons across three tasks; d illustrates the diversity-quality trade-off; e–f indicate that larger models benefit more from VS; g–i demonstrate adjustable diversity through 'probability thresholds.'

3.2 Post-Training Stages

In the longitudinal evaluation of the Tulu-3 series (covering SFT, DPO, RLVR stages), baseline methods showed significant collapse as alignment progressed; VS, however, maintained 30%+ diversity at each stage, achieving an approx. 182.6% improvement over Direct at the Post-DPO node and recovering approx. 66.8% of the base model's original diversity.

This suggests that VS is not just a "generate multiple versions" prompting trick but an effective mechanism to counteract distribution sharpening in the post-training pipeline.

Figure 5. VS continuously mitigates diversity collapse along the SFT→DPO→RLVR training progression.

3.3 Intuitive Examples

Given the same theme "An Astronaut on a Horse," Direct often converges to a narrow, realistic style; VS descriptions, however, naturally branch into distinctly different narrative and visual approaches like watercolor, retro neon, and Baroque oil painting, demonstrating significant cross-style and cross-tone diversity.

Figure 6. Visual comparison of the same topic: Direct has a single style, while VS presents broad diversity.

VS's improvement comes from letting the model first express its distribution and then choosing from it, rather than simply raising the sampling temperature. In an interpretable and controllable way, it steadily brings the creativity compressed by alignment back to a level readers can actually perceive.

Try It Yourself

The authors encourage researchers and developers to experience the effects of Verbalized Sampling (VS) firsthand and provide complete Colab access and example tasks that can be run directly to visualize results.

You can launch VS with one click via this Colab: https://colab.research.google.com/drive/1UDk4W5w6gF0dQ9Tpu0sPQethEht51GXL#offline=true&sandboxMode=true

Code Example:

# Minimal VS example
from verbalized_sampling import sample  # pip install verbalized-sampling

prompt = "Write a short story about a bear."

# Generate k responses with verbalized probabilities
responses = sample(prompt, k=5, return_probs=True)

# responses is an iterable of (text, probability) pairs
for i, (text, p) in enumerate(responses, 1):
    print(f"  {i}. p={p:.3f} → {text[:100]}…")

This Colab supports:

Using any OpenAI / Anthropic / Gemini model;

Switching VS modes (Standard / CoT / Multi);

Controlling the number of generations and probability threshold;

Visualizing "diversity-quality" curves and sample distributions.

Example One: System Prompt

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate <response> tag. Responses should each include a <text> and a numeric <probability>. Please sample at random from the tails of the distribution, such that the probability of each response is less than 0.10.

Example Two: Direct Use in Chat Interface

Paste the following prompt into a chat interface (ChatGPT, Claude, Gemini, etc.) to use:

Generate 10 responses to the user query, each within a separate <response> tag. Each response should be 50-100 words. Each <response> must include a <text> and a numeric <probability>. Randomly sample the responses from the full distribution.

<user_query>Write a short story about a bear.</user_query>

After running it, you will see the model generate 10 story versions with probability annotations, from "bear in the forest" to "tax accountant bear" to "interstellar bear": different styles and different settings, each carrying the model's explicit estimate of its probability.

Example Three: API Call

Use the following curl command to experience Verbalized Sampling (VS-Standard) via the OpenAI API. You can replace gpt-4.1 with any model version you wish to test.

export OPENAI_API_KEY="your_openai_key"

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1",
    "messages": [
      {
        "role": "system",
        "content": "Generate 10 responses to the input prompt, each within a separate <response> tag. Each response should be 50-100 words. Each <response> must include a <text> and a numeric <probability>. Randomly sample the responses from the full distribution. Return ONLY the responses, with no additional explanations or text."
      },
      {
        "role": "user",
        "content": "Write a short story about a bear."
      }
    ],
    "temperature": 1.0
  }'

Conclusion

Verbalized Sampling (VS) demonstrates a highly inspiring path: without adjusting parameters or requiring additional training, merely through prompt design, models can regain the generative space compressed by the alignment process. It prompts us to rethink "what a large model's output truly represents" – not just an optimized answer, but also the model's expression of uncertainty.

Through explicit verbalization, models can find a new balance between factual correctness and expressive diversity: maintaining reliability while showcasing breadth of thought. In the tension between alignment and creativity, VS offers a pragmatic engineering solution. It reminds us that improving model capabilities doesn't always require larger networks or more expensive training; it can also come from smarter ways of asking questions.


