Research: LLM's Prefilling Feature Has Become Its Jailbreak Vulnerability!

Research: LLM's prefilling feature has become its jailbreak vulnerability!

A recently published study reveals a shocking fact:

The "prefilling" feature in Large Language Models (LLMs), originally intended to enhance output control, has instead become the most effective tool for bypassing security restrictions, with an attack success rate as high as99.82%!

The study, titled "Prefill-Based Jailbreak," demonstrates a new type of jailbreak attack method that no longer focuses on the traditional user input side, but insteaddirectly manipulates the AI assistant's first response text, thereby cleverly bypassing security review mechanisms.

Paper address:

https://arxiv.org/pdf/2504.21038v1

This discovery overturns our understanding of AI security, and we need to rethink the security boundaries of large language models.

What is Prefilling Technology?

The prefilling feature was originally designed for large language models to improve output quality, allowing users topreset the starting text of the AI assistant's response.

This feature is widely present in major mainstream models:

Prefilling in Claude

When using the Claude API, users can guide the model's response by prefilling the Assistant message.

This technique allows users to direct Claude's actions, skip preambles, force specific formats (like JSON or XML), and even help Claude maintain consistency in role-playing scenarios.

Example of Claude's prefilling implementation:

import anthropicclient = anthropic.Anthropic()message = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=1000, messages=[ {"role": "user","content": "Extract the name, size, price, and color from the following product description and output them as a JSON object. <description>SmartHome Mini is a compact smart home assistant, available in black or white, priced at just $49.99. At only 5 inches wide, it lets you control lights, thermostats, and other connected devices with your voice or app—wherever you place it in your home. This affordable little hub brings convenient hands-free control to your smart devices. </description>" }, {"role": "assistant", "content": "{"# Prefilling the curly brace forces JSON output } ])

DeepSeekPrefilling

DeepSeek API now supports the Chat Prefix Completion feature, allowing users to specify a prefix for the last assistant message for the model to complete. This feature can also be used to connect messages truncated due to reaching the max_tokens limit and resend the request to continue the truncated content.

Example of DeepSeek's prefilling implementation:

# Ensure the last message is of the assistant role and set its prefix parameter to True# For example: {"role": "assistant", "content": "Once upon a time,", "prefix": True}# The following is an example using Chat Prefix Completion# In this example, the beginning of the assistant message is set to '```python ' to force the output to start with a code blockimport requestsimport jsonurl = "https://api.deepseek.com/beta/v1/chat/completions"headers = {"Content-Type": "application/json","Authorization": f"Bearer {api_key}"}data = {"model": "deepseek-chat","messages": [ {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence"}, {"role": "assistant", "content": "```python\n", "prefix": True} ],"stop": ["```"],"max_tokens": 500}response = requests.post(url, headers=headers, data=json.dumps(data))print(response.json())

GeminiPrefilling

Although the Gemini API does not have an explicit prefilling feature in the official documentation, researchers found similar vulnerabilities exist in some cases.

According to the study, a similar effect can be achieved through specific message construction.

From Helper Function to Security Vulnerability: The Principle of Prefilling Attacks

The research team found that this feature, originally used to enhance output control, could become the most powerful jailbreak tool.

They proposed two attack variants:

Static Prefilling (SP): Using fixed generic text like "Okay, here's how to" to guide the model into generating harmful responses
Optimized Prefilling (OP): Iteratively optimizing the prefilling text to maximize the attack success rate

These methods are effective becauseprefilling directly interferes with the model's autoregressive generation mechanism.

Matthew Rogers (@rogerscissp) also pointed out:

It's just sending fake context. Why do people describe simple vectors with complex vocabulary? It's clever though.

Experimental Results: Astonishing Success Rate

The research team conducted experiments on six state-of-the-art large language models, and the results were shocking:

On DeepSeek V3, theOptimized Prefilling (OP) attack achieved a success rate as high as 99.82%
When combined with existing jailbreak techniques, the success rate further increased to99.94%

The study used two evaluation metrics:

String Matching (SM): Detecting whether the output contains predefined harmful content strings
Model Judging (MJ): Using another LLM to evaluate whether the output contains harmful information

Below is a comparison of attack success rates for some models:

Notably, the Claude model showed stronger resistance, possibly suggesting that Claude implements some kind of external harmful content detection mechanism.

Why are Prefilling Attacks So Effective?

To prove that the prefilling technique is indeed a key factor for attack success, researchers conducted a control experiment comparing four methods:

Irrelevant Prefilling: Adding irrelevant text to the response
Prompt Suffix: Requesting a specific starting phrase in the user prompt
Static Prefilling (SP): The method proposed in this study
Optimized Prefilling (OP): The improved method proposed in this study

The results showed that the attack success rates of the first two control methods were extremely low (only 0.5%-7%), while the prefilling methods were significantly effective (up to 99.61%). This strongly indicates thatprefilling technology can indeed break the security boundaries of language models.

This is because the prefilling feature directly manipulates the model's initial generation state, which is equivalent toforcing a specific thought path for the model, making subsequent generated content highly likely to deviate from security boundaries. As the study states:

Unlike traditional jailbreak methods, this attack bypasses LLM's security mechanisms by directly manipulating the probability distribution of subsequent tokens, thereby controlling the model's output.

Defense Challenges and Security Recommendations

This research finding has significant implications for the field of AI security. Researchers pointed out that existing security measures primarily focus on detecting issues at the user input side, while neglecting security risks at the AI assistant response side.

For model providers, the researchers offer the following recommendations:

Implement strict content validation: Conduct strict review when processing prefilled content
Introduce response monitoring mechanisms: Monitor AI responses in real-time to promptly interrupt potentially harmful content
Redesign the prefilling feature: Balance functionality with security

For end-users, vigilance is advised:

Use the prefilling feature cautiously: Especially when dealing with sensitive tasks
Regularly update APIs and clients: Ensure the latest security patches are obtained
Implement multi-layered defense: Do not rely on a single security mechanism

Technical Principle: How Prefilling Affects Model Generation

From a technical perspective, the effectiveness of prefilling attacks is key to the autoregressive nature of large language models – where the generation of subsequent tokens heavily depends on the preceding content.

Some APIs (like Claude) allow users to directly prefill the LLM's response using a specified beginning, which makes the aforementioned optimization process unnecessary. In this case, it can be achieved by prefilling strings of the target behavior (e.g., "Okay, here's how to make a bomb").

Researchers found that even simple prefilling text can significantly alter the model's behavior:

Initial probability distribution interference: Prefilling text directly changes the probability distribution of initial tokens
Conditional generation trajectory setting: Once the initial trajectory is set, the model tends to continue generating in that direction
Security check bypass: Prefilling text may bypass security checks at the input stage

Future Research Directions

This research opens a new perspective in the field of AI security and points to several important future research directions:

Defense Mechanism Development: How to strengthen prefilling security without affecting functionality

Multimodal Prefilling Attacks: Is the prefilling technique applicable to multimodal LLMs?

Cross-model Attack Transfer: How effective is prefilling text optimized on one model for other models?

Conclusion

The security risks of the prefilling feature once again prove that AI security is an endless game of offense and defense.

As the capabilities of large language models continue to improve, we must not only focus onwhat they can answer, but also considerhow they answer.

This research also serves as a warning to us:in the field of AI, sometimes the most convenient features may hide the biggest security risks.

To truly build reliable AI systems, we need to find a better balance between functionality, performance, and security.

How should model providers improve the prefilling feature?

Should it be completely canceled, or should a safer implementation method be sought?

What do you think?

👇

In addition, I have also used AI to collect AI information from the entire network, and used AI to select, review, translate, and summarize it before publishing it in the Knowledge Planet of "AGI Hunt".

This is an AI information stream that contains only information and no emotions (not a recommendation stream, no selling courses, no lecturing, no teaching you how to live, only providing information)

Welcome to join! Also welcome to join the group and chat with 2000+ group members