Have you noticed a strange phenomenon: when it comes to "vibe coding," some people easily get a complete Flask application, while others only receive a few lines of if-else statements? Researchers from the University of Cambridge's Department of Computer Science and Technology recently published a study that scientifically confirms our intuition – AI indeed "tailors its output to the user." They designed two comprehensive evaluation systems: the first is a "synthetic evaluation pipeline," which tests AI's sensitivity by artificially creating various prompt variations; the second is "role evaluation," where AI impersonates users with different backgrounds to generate prompts, then observes the differences in code quality. The study covered four mainstream models: GPT-4o mini, Claude 3 Haiku, Gemini 2.0 Flash, and Llama 3.3 70B, confirming that AI can truly "read" your technical proficiency from your prompting style, and then generate code accordingly.
Synthetic Evaluation: Three Ways to "Torment" AI
The researchers designed three ways to "torment" prompts and see how sensitively the AI reacts. Let's look at concrete examples to understand just how "cruel" these methods are:
1. Keyboard Typographical Errors
The most ruthless trick, based on QWERTY keyboard distance, randomly replaces characters, simulating real typing errors.
Original Prompt: "Write a Python function to calculate factorial"
Transformed: "Wrtie a Pytjon functuon to calculsre factorual"
Looks like a mess from typing in a hurry on a phone, but can AI understand it?
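To make this concrete, here is a minimal sketch of QWERTY-distance typo injection. It is not the authors' code; the neighbor map and function name are illustrative, and the map covers only a handful of keys.

```python
import random

# Small hand-picked subset of QWERTY adjacencies, for illustration only.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "f": "drtgvc",
    "e": "wsdr", "r": "edft", "t": "rfgy", "y": "tugh",
    "u": "yihj", "i": "uojk", "o": "iklp", "p": "ol",
    "n": "bhjm", "c": "xdfv", "l": "kop", "w": "qase",
}

def add_keyboard_typos(prompt: str, error_rate: float, seed: int = 0) -> str:
    """Replace a fraction of characters with a physically adjacent key."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i, ch in enumerate(chars):
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < error_rate:
            chars[i] = rng.choice(neighbors)
    return "".join(chars)

print(add_keyboard_typos("Write a Python function to calculate factorial", error_rate=0.3))
```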
2. Synonym Replacement
Uses the WordNet database to replace words with others of similar meaning.
Original Prompt: "Create a simple web application"
Transformed: "Build a basic internet program"
The meaning is exactly the same, but the expression is completely different.
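A bare-bones version of this transformation can be written with NLTK's WordNet interface. The function below is an illustrative sketch, not the paper's implementation; it ignores part-of-speech matching for simplicity.

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def replace_with_synonyms(prompt: str, replace_prob: float = 0.5, seed: int = 0) -> str:
    """Swap words for a random WordNet synonym with a given probability."""
    rng = random.Random(seed)
    out = []
    for word in prompt.split():
        # Collect alternative lemmas for any sense of the word.
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()
            if lemma.name().lower() != word.lower()
        }
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(sorted(synonyms)))
        else:
            out.append(word)
    return " ".join(out)

print(replace_with_synonyms("Create a simple web application"))
```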
3. Paraphrasing
Lets another AI rephrase the original prompt, maintaining semantic meaning but changing the expression.
Original Prompt: "Implement a sorting algorithm"
Transformed: "Could you help me develop a method that arranges data elements in a specific order?"
Changes from a concise technical instruction to a polite request for help.
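The article does not say which model performs the rewriting, so the sketch below simply uses the OpenAI Python client as one possible stand-in; any chat-capable LLM would do, and the prompt wording is my own assumption.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def paraphrase(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Ask a separate model to restate the prompt without changing its meaning."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Paraphrase the user's request. Keep the meaning identical; change only the wording."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(paraphrase("Implement a sorting algorithm"))
```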
Evaluation metric: the study uses TSED (Tree Similarity Edit Distance), which is better suited than traditional BLEU or BERTScore for comparing code, because it directly reflects differences in syntax-tree structure.
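To make the metric concrete, here is a rough sketch of the idea behind TSED, built on Python's ast module and the zss tree-edit-distance package. The labeling and normalization below are assumptions; the paper's exact implementation may differ.

```python
import ast
import zss  # Zhang-Shasha tree edit distance: pip install zss

def to_zss(node: ast.AST) -> zss.Node:
    """Convert a Python AST into a zss tree labeled by node type."""
    tree = zss.Node(type(node).__name__)
    for child in ast.iter_child_nodes(node):
        tree.addkid(to_zss(child))
    return tree

def tree_size(node: ast.AST) -> int:
    return 1 + sum(tree_size(c) for c in ast.iter_child_nodes(node))

def tsed(code_a: str, code_b: str) -> float:
    """Normalized tree similarity: 1.0 means identical syntax trees."""
    tree_a, tree_b = ast.parse(code_a), ast.parse(code_b)
    distance = zss.simple_distance(to_zss(tree_a), to_zss(tree_b))
    return max(0.0, 1 - distance / max(tree_size(tree_a), tree_size(tree_b)))

print(tsed("def f(n): return n + 1",
           "def f(n):\n    result = n + 1\n    return result"))
```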
Validation of Paraphrasing Method Effectiveness: The generated paraphrases achieved textual diversity (SacreBLEU scores spanning 0 to 1.0) while maintaining high semantic similarity (BERTScore between 0.95 and 1.0), confirming the effectiveness of the paraphrasing method.
Four-Role "Social Experiment"
More interesting is the role evaluation section, where researchers created four typical users:
Role: Junior Engineer; Background Characteristics: Computer science major, internship experience; Typical Expression Style: Focuses on implementation details and testing
Role: Chief Engineer; Background Characteristics: Experienced industry expert; Typical Expression Style: Emphasizes "cloud deployment" and "scalability"
Role: Astrophysicist; Background Characteristics: Scientist doing research with Python; Typical Expression Style: Values "data processing efficiency" and "scientific computing precision"
Role: English Teacher; Background Characteristics: Regular user with no programming experience; Typical Expression Style: "Can you help me make a program that simulates a calculator?"
They had AI play these roles to describe the same programming task, such as "write code for a calculator," and then observed the differences in generated prompts and final code.
Results show: The expression styles of different roles varied greatly, and the quality of the resulting code was also vastly different.
Experimental Results: Data Doesn't Lie
The data revealed some unexpected patterns:
Keyboard Errors = Fatal Blow
Code similarity decreased sharply between an error rate of 0.0 and 0.6
Eventually stabilized at a TSED value of around 0.3
Meaning the generated code has fundamental differences from the original version
Impact Comparison of Keyboard Errors vs. Synonym Replacement: All models were extremely sensitive to keyboard errors (left panel), with code similarity declining sharply, but were relatively robust to synonym replacement (right panel), where Gemini 2.0 Flash was the most stable.
Synonyms and Paraphrasing = Relatively Mild
Most models could maintain over 0.5 similarity
Gemini 2.0 Flash performed most stably
Mild Impact of Paraphrasing Augmentation: The paraphrasing augmentation experiment showed a trend similar to synonym replacement, an initial noticeable drop followed by a slow decline, indicating that meaning-preserving prompt variations have a relatively mild effect on the AI.
Role Differences: Technical Background Determines Code Quality
The results of the role evaluation were even more dramatic:
Code Quality Staircase Effect
Role: Chief Engineer; Type of Code Obtained: Complete Flask application; Characteristics: Includes database design and deployment considerations
Role: Junior Engineer; Type of Code Obtained: Structured classes and tests; Characteristics: Focuses on implementation details and code conventions
Role: Astrophysicist; Type of Code Obtained: Scientific computing code; Characteristics: Emphasizes numerical precision, lacks engineering considerations
Role: English Teacher; Type of Code Obtained: Operational instructions; Characteristics: Sometimes no code was generated at all
The Unique Case of the Astrophysicist
This role is particularly interesting – although not a professional developer, they use Python for research purposes to process data:
Prompt Characteristics: Clearly specifies programming language and data processing requirements
Typical Expression: "Implement in Python, needs to handle large numerical arrays"
Code Features: Focuses on the accuracy of scientific computing and data processing efficiency
Defects: Lacks systematic software engineering considerations
Linguistic Validation
The researchers confirmed the objective existence of these differences using a linguistic analysis framework: the stronger the technical background of the role, the closer the generated prompts were in vocabulary selection and sentence structure to the expression habits of professional developers.
Visualization of Language Usage Patterns for the Four Roles: LDA topic modeling of each role's language use. Line thickness indicates the number of shared entities, and the map on the right shows how entities are distributed across concepts. Interestingly, the astrophysicist (Ethan) shares some language with the two software engineers, but far less than the engineers share with each other, and the gap with the English teacher (Harold) is the largest.
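For readers who want to poke at this kind of analysis themselves, a toy version of LDA topic modeling can be run with scikit-learn. The persona prompts below are invented for illustration and are not taken from the study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical prompts written in each persona's voice (illustrative only).
prompts = [
    "Deploy a scalable Flask calculator service with CI and cloud deployment",   # chief engineer
    "Write a tested calculator class following our code conventions",            # junior engineer
    "Implement a Python calculator handling large numerical arrays precisely",   # astrophysicist
    "Can you help me make a program that works like a calculator?",              # English teacher
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(prompts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(doc_term)  # per-prompt topic distribution

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")
print(topic_mix.round(2))
```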
Interesting Point: The Astrophysicist's Middle Ground
The astrophysicist role demonstrated an interesting middle ground in the experiment:
Language Features
More professional than the English teacher: Uses terms like "numerical computation," "data analysis," "algorithm efficiency"
Less systematic than a software engineer: Lacks considerations for "architecture design" or "user experience"
Practical Case
Expression in a Calculator Task:
"Needs a calculator tool that supports high-precision floating-point operations and can handle scientific notation"
Final Code Characteristics:
Includes the use of NumPy library and considerations for numerical stability
Lacks error handling and user interface design
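As a purely hypothetical illustration (not code from the study), the output for such a prompt tends to look something like this: numerically careful, but with no input validation and no interface.

```python
import numpy as np

def calc(values, operation):
    """High-precision array calculator, astrophysicist style."""
    arr = np.asarray(values, dtype=np.float64)
    if operation == "sum":
        return np.sum(arr)                  # float64 handles scientific-notation inputs
    if operation == "product":
        return np.exp(np.sum(np.log(arr)))  # log-space product to avoid overflow
    if operation == "mean":
        return np.mean(arr)

print(calc([1.5e300, 2.0, 3.0], "product"))
# Notably absent: error handling (negative values break the log trick,
# unknown operations silently return None) and any user interface.
```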
Real-world Significance
This type of role, "having programming experience but not being a professional developer," is actually very common in reality – many researchers and data analysts fall into this category.
Practical Insights for Vibe Coding Users
What insights does this discovery offer for your daily use of programming tools?
Core Insight
It turns out the "chemical reaction" between us and AI varies so much! If you often feel that others can generate more elegant code with the same tools, now you know why – it might not be the tool's problem, but the "level" of the prompt.
Immediately Usable Upgrade Strategies
1. Professionalize Your Language
Avoid: "Help me write a program"
Instead: "Implement a RESTful API, including user authentication and data validation"
2. Specify Technical Requirements
Specify architectural patterns: "Use MVC architecture"
State performance requirements: "Support concurrent processing"
Mention deployment considerations: "Containerized deployment"
3. Multi-Version Comparison Method
Try several phrasings and then pick the best result (a small automation sketch follows this list):
Version A: Describe from a functional perspective
Version B: Describe from a technical implementation perspective
Version C: Describe from a system architecture perspective
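If you want to automate the comparison rather than eyeball it in a chat window, a minimal loop might look like the sketch below. The prompts, model name, and client choice are all illustrative assumptions, not recommendations from the article.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set; any code-capable model works

client = OpenAI()

task_versions = {
    "functional": "Build a calculator app that supports the four basic operations and exponents.",
    "technical": "Implement a Python calculator module with a Calculator class, unit tests, and input validation.",
    "architectural": "Design a calculator service with an MVC structure, a REST endpoint, and containerized deployment in mind.",
}

def generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Generate all three versions and inspect them side by side before picking one.
for label, prompt in task_versions.items():
    code = generate(prompt)
    print(f"=== {label} ({len(code.splitlines())} lines) ===")
    print(code[:400], "...\n")
```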
Key Principle
AI will judge what level of code to provide based on your prompting style – since you know this "unspoken rule," why not make good use of it?
Data Contamination: The "Backdoor" of Training Data
The study also unexpectedly revealed an important issue: data contamination is more serious than we imagined. Classic LeetCode problems showed abnormal stability across all models, even with poorly written prompts, indicating that these problems have been "memorized." This raises questions about the validity of benchmark testing.
What is LeetCode? LeetCode is the world's most renowned programming practice website, containing thousands of algorithm and data structure problems, and is an essential practice platform for programmer interviews. Because these problems are widely circulated online, they are likely to be included in AI model training data.
Researchers' Solution
The researchers therefore created 22 original programming tasks covering simulation, algorithms, data science, and other areas, which better reflect the models' real sensitivity.
Visual Evidence of the Data Contamination Phenomenon: The effect is clear at a glance. Old LeetCode problems (top) maintain high similarity even with severely corrupted prompts, while the researchers' original tasks (bottom) decline sharply after only 10% prompt modification.
Technical Implementation Details: Reproducible Evaluation Framework
From a technical perspective, the design of this evaluation framework is quite ingenious. The synthetic evaluation pipeline is fully modular: the augmentation functions and distance functions can be swapped out independently, and it supports any LLM and any programming language. The role evaluation uses LDA topic modeling and visual analysis to quantify differences in language use across the roles. All experiments were run at temperature 0, and each condition was repeated 5 times and averaged to ensure reliable results. The research code is open source: https://anonymous.4open.science/r/code-gen-sensitivity-0D19/README.md
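The open-source repository linked above contains the real implementation; as a rough impression of what "modular" means here, the pipeline boils down to three swappable pieces, which a minimal sketch (with made-up names, not the repository's API) might wire together like this:

```python
from statistics import mean
from typing import Callable

def evaluate_sensitivity(
    prompts: list[str],
    augment: Callable[[str, float], str],     # e.g. a typo or synonym perturbation like the sketches above
    generate: Callable[[str], str],           # any LLM call, run at temperature 0
    similarity: Callable[[str, str], float],  # e.g. the TSED sketch above
    strength: float,
    repeats: int = 5,
) -> float:
    """Average code similarity between outputs for original and perturbed prompts."""
    scores = []
    for prompt in prompts:
        baseline = generate(prompt)
        for _ in range(repeats):
            perturbed_prompt = augment(prompt, strength)  # augment should randomize on each call
            scores.append(similarity(baseline, generate(perturbed_prompt)))
    return mean(scores)
```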
End of article, Author: Xiu Mao