AI Can Read Between the Prompts! Vibe Coding: Regular User vs. Programmer – Cambridge's Latest Report

Have you noticed a strange phenomenon: when it comes to "vibe coding," some people easily get a complete Flask application, while others only receive a few lines of if-else statements? Researchers from the University of Cambridge's Department of Computer Science and Technology recently published a study that scientifically confirms our intuition – AI indeed "tailors its output to the user." They designed two comprehensive evaluation systems: the first is a "synthetic evaluation pipeline," which tests AI's sensitivity by artificially creating various prompt variations; the second is "role evaluation," where AI impersonates users with different backgrounds to generate prompts, then observes the differences in code quality. The study covered four mainstream models: GPT-4o mini, Claude 3 Haiku, Gemini 2.0 Flash, and Llama 3.3 70B, confirming that AI can truly "read" your technical proficiency from your prompting style, and then generate code accordingly.

Synthetic Evaluation: Three Ways to "Torment" AI

The researchers designed three ways to "torment" prompts and see how sensitively the AI reacts. A concrete example for each shows just how "cruel" these methods are:

1. Keyboard Typographical Errors

The most ruthless trick: based on QWERTY keyboard distance, characters are randomly replaced with nearby keys, simulating real typing errors.

Original Prompt: "Write a Python function to calculate factorial"

Transformed: "Wrtie a Pytjon functuon to calculsre factorual"

Looks like a mess from typing in a hurry on a phone, but can AI understand it?
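As a rough illustration only (not the paper's actual code), here is a minimal Python sketch of this kind of QWERTY-neighbour typo injection; the neighbour map and the error_rate parameter are illustrative assumptions:

```python
import random

# Partial QWERTY-neighbour map; a full implementation would cover the whole keyboard.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "b": "vghn", "c": "xdfv", "d": "serfcx", "e": "wsdr",
    "f": "drtgvc", "g": "ftyhbv", "h": "gyujnb", "i": "ujko", "j": "huikmn",
    "k": "jiolm", "l": "kop", "m": "njk", "n": "bhjm", "o": "iklp",
    "p": "ol", "q": "wa", "r": "edft", "s": "awedxz", "t": "rfgy",
    "u": "yhji", "v": "cfgb", "w": "qase", "x": "zsdc", "y": "tghu", "z": "asx",
}

def add_keyboard_typos(prompt: str, error_rate: float, seed: int = 0) -> str:
    """Replace each letter with a QWERTY neighbour with probability `error_rate`."""
    rng = random.Random(seed)
    out = []
    for ch in prompt:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < error_rate:
            typo = rng.choice(neighbours)
            out.append(typo.upper() if ch.isupper() else typo)
        else:
            out.append(ch)
    return "".join(out)

print(add_keyboard_typos("Write a Python function to calculate factorial", error_rate=0.3))
```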

2. Synonym Replacement

Uses the WordNet database to replace words with others of similar meaning.

Original Prompt: "Create a simple web application"

Transformed: "Build a basic internet program"

The meaning is exactly the same, but the expression is completely different.
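A minimal sketch of WordNet-based synonym replacement, assuming the NLTK library; the replacement probability and word handling here are simplified and not the paper's exact procedure:

```python
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

def replace_with_synonyms(prompt: str, replace_prob: float = 0.5, seed: int = 0) -> str:
    """Swap words for WordNet synonyms with probability `replace_prob`."""
    rng = random.Random(seed)
    out = []
    for word in prompt.split():
        synsets = wordnet.synsets(word.lower())
        lemmas = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()}
        lemmas.discard(word.lower())
        if lemmas and rng.random() < replace_prob:
            out.append(rng.choice(sorted(lemmas)))
        else:
            out.append(word)
    return " ".join(out)

print(replace_with_synonyms("Create a simple web application"))
```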

3. Paraphrasing

Lets another AI rephrase the original prompt, maintaining semantic meaning but changing the expression.

Original Prompt: "Implement a sorting algorithm"

Transformed: "Could you help me develop a method that arranges data elements in a specific order?"

Changes from a concise technical instruction to a polite request for help.
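As an illustration only, here is how one might ask another model to paraphrase a prompt, assuming the OpenAI Python client; the model name and system instruction are assumptions, not the study's setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def paraphrase(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Ask a separate model to restate the prompt while preserving its meaning."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Rephrase the user's request in different words "
                        "without changing its meaning. Return only the rephrased text."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

print(paraphrase("Implement a sorting algorithm"))
```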

Evaluation Metric: the study uses TSED (Tree Similarity Edit Distance), which is better suited than traditional BLEU or BERTScore for evaluating code similarity because it directly reflects differences in syntax-tree structure.
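A rough sketch of a TSED-style score over Python ASTs, assuming the zss (Zhang-Shasha) tree-edit-distance package and the common normalization 1 - distance / max(tree sizes); the paper's exact parser and normalization may differ:

```python
import ast

from zss import Node, simple_distance  # pip install zss

def to_zss(node: ast.AST) -> Node:
    """Convert a Python AST node into a zss tree labelled by node type."""
    tree = Node(type(node).__name__)
    for child in ast.iter_child_nodes(node):
        tree.addkid(to_zss(child))
    return tree

def count_nodes(node: ast.AST) -> int:
    return 1 + sum(count_nodes(c) for c in ast.iter_child_nodes(node))

def tsed(code_a: str, code_b: str) -> float:
    """Normalised tree similarity: 1 - edit_distance / max(tree sizes), clipped at 0."""
    tree_a, tree_b = ast.parse(code_a), ast.parse(code_b)
    distance = simple_distance(to_zss(tree_a), to_zss(tree_b))
    max_nodes = max(count_nodes(tree_a), count_nodes(tree_b))
    return max(1.0 - distance / max_nodes, 0.0)

print(tsed("def f(n): return n * 2", "def f(n):\n    return n + n"))
```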

Paraphrase Diversity vs. Semantic Similarity Comparison

Validation of Paraphrasing Method Effectiveness: The generated paraphrases achieved textual diversity (SacreBLEU 0-1.0) while maintaining high semantic similarity (BERTScore 0.95-1.0), demonstrating the effectiveness of the paraphrasing method.
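To see what this check looks like in practice, here is a hedged sketch using the sacrebleu and bert-score packages; the exact models and settings used in the study are not reproduced here:

```python
import sacrebleu              # pip install sacrebleu
from bert_score import score  # pip install bert-score

original = "Implement a sorting algorithm"
rephrased = "Could you help me develop a method that arranges data elements in a specific order?"

# Low BLEU -> high surface-level diversity (little n-gram overlap).
# sacrebleu reports 0-100, so rescale to 0-1 to match the figure's axis.
bleu = sacrebleu.sentence_bleu(rephrased, [original]).score / 100.0

# High BERTScore F1 -> the meaning is still close to the original.
_, _, f1 = score([rephrased], [original], lang="en", verbose=False)

print(f"SacreBLEU: {bleu:.2f}  BERTScore F1: {f1.item():.2f}")
```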

Four-Role "Social Experiment"

More interesting is the role evaluation section, where researchers created four typical users:

Role: Junior Engineer; Background Characteristics: Computer science major, internship experience; Typical Expression Style: Focuses on implementation details and testing

Role: Chief Engineer; Background Characteristics: Experienced industry expert; Typical Expression Style: Emphasizes "cloud deployment" and "scalability"

Role: Astrophysicist; Background Characteristics: Scientist doing research with Python; Typical Expression Style: Values "data processing efficiency" and "scientific computing precision"

Role: English Teacher; Background Characteristics: Regular user with no programming experience; Typical Expression Style: "Can you help me make a program that simulates a calculator?"

They had AI play these roles to describe the same programming task, such as "write code for a calculator," and then observed the differences in generated prompts and final code.

Results show: The expression styles of different roles varied greatly, and the quality of the resulting code was also vastly different.
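A minimal sketch of how such persona-conditioned prompts could be generated, assuming the OpenAI Python client; the persona descriptions below are paraphrased from the table above and are not the paper's actual system prompts:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONAS = {
    "junior_engineer": "You are a junior software engineer with a CS degree and internship experience.",
    "chief_engineer": "You are a chief engineer who cares about cloud deployment and scalability.",
    "astrophysicist": "You are an astrophysicist who uses Python to process research data.",
    "english_teacher": "You are an English teacher with no programming experience.",
}

def persona_prompt(persona: str, task: str = "write code for a calculator") -> str:
    """Ask the model, in character, to phrase the task the way that user would."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user",
             "content": f"In your own words, ask an AI assistant to {task}. "
                        f"Reply with the request only."},
        ],
    )
    return response.choices[0].message.content.strip()

for name in PERSONAS:
    print(name, "->", persona_prompt(name))
```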

Experimental Results: Data Doesn't Lie

The data revealed some unexpected patterns:

Keyboard Errors = Fatal Blow

Code similarity decreased sharply between an error rate of 0.0 and 0.6

Eventually stabilized at a TSED value of around 0.3

At that level, the generated code differs fundamentally from the original version

Overall Synthetic Evaluation Results

Impact Comparison of Keyboard Errors vs. Synonym Replacement: All models were extremely sensitive to keyboard errors (left figure), with code similarity declining sharply; they were relatively robust to synonym replacement (right figure), with Gemini 2.0 Flash performing most stably.

Synonyms and Paraphrasing = Relatively Mild

Most models could maintain over 0.5 similarity

Gemini 2.0 Flash performed most stably

Paraphrasing Enhanced Evaluation Results

Mild Impact of Paraphrasing Enhancement: The paraphrasing-enhancement experiment showed a trend similar to synonym replacement, an initial noticeable drop followed by a slow decline, showing that semantics-preserving prompt variations have a relatively mild impact on the AI.

Role Differences: Technical Background Determines Code Quality

The results of the role evaluation were even more dramatic:

Code Quality Staircase Effect

Role: Chief Engineer; Type of Code Obtained: Complete Flask application; Characteristics: Includes database design and deployment considerations

Role: Junior Engineer; Type of Code Obtained: Structured classes and tests; Characteristics: Focuses on implementation details and code conventions

Role: Astrophysicist; Type of Code Obtained: Scientific computing code; Characteristics: Emphasizes numerical precision, lacks engineering considerations

Role: English Teacher; Type of Code Obtained: Operational instructions; Characteristics: Sometimes no code was generated at all

The Unique Case of the Astrophysicist

This role is particularly interesting: although not a professional developer, this user relies on Python to process research data:

Prompt Characteristics: Clearly specifies programming language and data processing requirements

Typical Expression: "Implement in Python, needs to handle large numerical arrays"

Code Features: Focuses on the accuracy of scientific computing and data processing efficiency

Defects: Lacks systematic software engineering considerations

Linguistic Validation

The researchers confirmed the objective existence of these differences using a linguistic analysis framework: the stronger the technical background of the role, the closer the generated prompts were in vocabulary selection and sentence structure to the expression habits of professional developers.

Language Usage Patterns of Four Roles

Visualization of Language Usage Patterns for the Four Roles: LDA topic modeling analyzes each role's language usage patterns. Line thickness indicates the number of shared entities, and the map on the right shows how entities are distributed across concepts. Interestingly, the astrophysicist (Ethan) shares some language with the two software engineers, but noticeably less than the engineers share with each other, while the gap with the English teacher (Harold) is the largest.
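A toy version of this kind of LDA topic analysis, using scikit-learn; the per-role prompts below are invented placeholders, not the paper's data:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-ins for the prompts each persona generated.
prompts_by_role = {
    "chief_engineer": ["Build a calculator service with cloud deployment and scalability in mind."],
    "junior_engineer": ["Implement a calculator class with unit tests following code conventions."],
    "astrophysicist": ["Implement a calculator in Python that handles large numerical arrays precisely."],
    "english_teacher": ["Can you help me make a program that simulates a calculator?"],
}

docs = [" ".join(texts) for texts in prompts_by_role.values()]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)  # per-role topic distribution

for role, mix in zip(prompts_by_role, topic_mix):
    print(role, mix.round(2))
```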

Interesting Point: The Astrophysicist's Middle Ground

The astrophysicist role demonstrated an interesting middle ground in the experiment:

Language Features

More professional than the English teacher: Uses terms like "numerical computation," "data analysis," "algorithm efficiency"

Less systematic than a software engineer: Lacks considerations for "architecture design" or "user experience"

Practical Case

Expression in a Calculator Task:

"Needs a calculator tool that supports high-precision floating-point operations and can handle scientific notation"

Final Code Characteristics (a hypothetical sketch follows this list):

Includes the use of NumPy library and considerations for numerical stability

Lacks error handling and user interface design
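A hypothetical illustration (not taken from the paper) of the kind of code this persona tends to elicit: NumPy, attention to numerical range and precision, and no error handling or user interface:

```python
import numpy as np

def array_multiply(values, factor):
    """Element-wise multiplication in float64; handles scientific-notation magnitudes."""
    return np.asarray(values, dtype=np.float64) * factor

def stable_product(values):
    """Product computed in log-space to avoid overflow with very large values."""
    arr = np.asarray(values, dtype=np.float64)
    return np.exp(np.sum(np.log(arr)))

# Works for large arrays and extreme magnitudes, but note what is missing:
# no input validation, no error handling (e.g. zeros or negatives passed to
# stable_product), and no user interface.
print(array_multiply([1.5e300, 2.5e-300], 2.0))
print(stable_product([1e150, 1e140, 1e-200]))
```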

Real-world Significance

This type of role, "having programming experience but not being a professional developer," is actually very common in reality – many researchers and data analysts fall into this category.

Practical Insights for Vibe Coding Users

What insights does this discovery offer for your daily use of programming tools?

Core Insight

It turns out the "chemical reaction" between us and AI varies so much! If you often feel that others can generate more elegant code with the same tools, now you know why – it might not be the tool's problem, but the "level" of the prompt.

Immediately Usable Upgrade Strategies

1. Professionalize Your Language

Avoid: "Help me write a program"

Instead: "Implement a RESTful API, including user authentication and data validation"

2. Specify Technical Requirements

Specify architectural patterns: "Use MVC architecture"

State performance requirements: "Support concurrent processing"

Mention deployment considerations: "Containerized deployment"

3. Multi-Version Comparison Method

Try multiple ways of phrasing and then choose the best result (a sketch of this loop follows the list):

Version A: Describe from a functional perspective

Version B: Describe from a technical implementation perspective

Version C: Describe from a system architecture perspective
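A small sketch of this comparison loop, assuming the OpenAI Python client; the three phrasings and the model name are illustrative only:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

versions = {
    "functional": "Write a calculator program that adds, subtracts, multiplies and divides.",
    "implementation": "Implement a Calculator class in Python with unit tests for each operation.",
    "architecture": "Design a calculator as a REST API with input validation and clear separation of concerns.",
}

def generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for name, prompt in versions.items():
    code = generate(prompt)
    print(f"--- {name} ({len(code)} chars) ---")
    print(code[:300])  # inspect each candidate and keep the one you prefer
```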

Key Principle

AI will judge what level of code to provide based on your prompting style – since you know this "unspoken rule," why not make good use of it?

Data Contamination: The "Backdoor" of Training Data

The study also unexpectedly revealed an important issue: data contamination is more serious than we imagined. Classic LeetCode problems showed abnormal stability across all models, even with poorly written prompts, indicating that these problems have been "memorized." This raises questions about the validity of benchmark testing.

What is LeetCode? LeetCode is the world's most renowned programming practice website, containing thousands of algorithm and data structure problems, and is an essential practice platform for programmer interviews. Because these problems are widely circulated online, they are likely to be included in AI model training data.

Researchers' Solution

The researchers therefore created 22 original programming tasks covering simulation, algorithms, data science, and other fields, which better reflect the models' true sensitivity.

Sensitivity Comparison of Three Datasets

Visual Evidence of the Data Contamination Phenomenon: The effect is clear at a glance. Old LeetCode problems (top) maintain high similarity even with severely corrupted prompts, while the original dataset (bottom) declines sharply after only 10% prompt modification.

Technical Implementation Details: Reproducible Evaluation Framework

From a technical perspective, the design of this evaluation framework is quite ingenious. The synthetic evaluation pipeline is fully modular: enhancement functions and distance functions can be swapped independently, and it supports any LLM and programming language. Role evaluation uses LDA topic modeling and visual analysis to quantify differences in language use across roles. All experiments were run at temperature 0, and each condition was repeated 5 times and averaged to ensure reliable results. The research code is open source: https://anonymous.4open.science/r/code-gen-sensitivity-0D19/README.md
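As a mental model only, here is a minimal sketch of what such a modular pipeline might look like; the names and signatures are assumptions, and the actual implementation lives in the linked repository:

```python
from statistics import mean
from typing import Callable

# Pluggable pieces, mirroring the modular design described above:
# any enhancement function (prompt -> prompt) and any distance/similarity
# function (code, code -> float) can be swapped in independently.
Enhance = Callable[[str], str]
Similarity = Callable[[str, str], float]

def evaluate_sensitivity(
    prompts: list[str],
    generate: Callable[[str], str],   # LLM call, run at temperature 0
    enhance: Enhance,                 # e.g. keyboard typos, synonyms, paraphrasing
    similarity: Similarity,           # e.g. the TSED sketch from earlier
    repeats: int = 5,                 # repeat and average, as in the described setup
) -> float:
    """Average similarity between code from original and enhanced prompts."""
    scores = []
    for prompt in prompts:
        for _ in range(repeats):
            baseline = generate(prompt)
            perturbed = generate(enhance(prompt))
            scores.append(similarity(baseline, perturbed))
    return mean(scores)
```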

End of article, Author: Xiu Mao


