Notes compiled by: Liu Yi, Master's student at Tianjin University, research focus on knowledge graphs and large models
Paper link: https://doi.org/10.1162/coli_a_00548
Published in: Computational Linguistics (2025) 51 (2): 467–504.
1. Motivation
Causal reasoning is a core ability of human thinking, but large language models (LLMs) still struggle with complex causal reasoning tasks such as abductive reasoning and counterfactual reasoning. The core difficulty is that complex causal structures in text (e.g., branching relationships among multiple events) are usually only implicit in temporal descriptions, which makes them hard for LLMs to learn explicitly. In contrast, conditional statements in program code express causal relationships more directly and more frequently, so code language models (Code-LLMs) may possess stronger causal reasoning abilities. This motivates two questions: are Code-LLMs better at causal reasoning than general LLMs, and can code prompts describe causal structures more clearly than text prompts?
2. Contributions
The main contributions of this paper are:
(1) Designed code prompts to handle causal reasoning tasks by utilizing conditional statements in code to represent causal structures.
(2) Comprehensively evaluated multiple general LLMs (e.g., LLAMA-2, GPT-3) and Code-LLMs (e.g., CODELLAMA, CODEX) on abductive and counterfactual reasoning tasks under zero-shot and one-shot settings, verifying that Code-LLMs outperform general LLMs in causal reasoning and that, for most models, code prompts are more effective than text prompts.
(3) Revealed, through intervention experiments that modify the information, structure, format, and programming language of prompts, that programming structures (especially conditional statements) are the key factor that makes code prompts effective.
(4) Demonstrated that fine-tuning large language models on code corpora with only conditional statements can improve their causal reasoning abilities.
3. Method
The paper considers two challenging causal reasoning tasks: abductive reasoning and counterfactual reasoning. Abductive reasoning requires the model to generate a plausible cause of the outcome that is consistent with the premise. Counterfactual reasoning asks what would happen in a counterfactual branch, i.e., how the ending should change when a counterfactual condition replaces the original one. Figure 1 gives an overview: the left panel shows the causal relationships between the events in these tasks, and the right panel lists the research questions addressed in this work, namely how to activate and how to enhance the causal reasoning abilities of LLMs. The research questions are as follows:
• Are code language models better at causal reasoning than general language models?
• Are code prompts better than text prompts in describing causal structures?
• Which aspects of code prompts make them effective?
• How can code data be used to improve the causal reasoning abilities of large language models?
Figure 1: Overview of tasks and research questions
The code prompt construction for abductive reasoning is shown in Figure 2: the causal structure is represented by the execution flow in the main function, which executes the premise and, if the hypothesis is satisfied, then executes the outcome. Event content is embedded in each function as a comment, and the target function is placed at the end for the model to complete (an illustrative sketch is given below the figure caption).
Figure 2: Code prompt construction for abductive reasoning
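For concreteness, the following is an illustrative reconstruction of such a prompt, not the paper's verbatim template; the event texts and function names are invented placeholders.

```python
# Illustrative reconstruction of the abductive-reasoning code prompt (assumed, not the
# paper's verbatim template); the event texts and function names are placeholders.
abductive_prompt = '''
def premise():
    # Andy wanted to surprise his wife for their anniversary.
    pass

def outcome():
    # His wife was thrilled when she came home.
    pass

def main():
    premise()            # the premise is executed first
    if hypothesis():     # the missing cause the model must supply
        outcome()        # the outcome holds only if the hypothesis is satisfied

# Complete the function below by describing the hypothesis event as a comment.
def hypothesis():
'''  # the prompt ends mid-function so that the model continues the generation
```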
The code prompt construction for counterfactual reasoning is shown in Figure 3: an if-elif structure distinguishes the original branch from the counterfactual branch, keeping the causal logic explicit, and the task requirements are clarified through comments, e.g., that modifications to the original ending should be minimized (an illustrative sketch follows the figure caption).
Figure 3: Code prompt construction for counterfactual reasoning
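Again as an assumed illustration rather than the paper's exact template, such a prompt might look as follows, with placeholder event texts:

```python
# Illustrative reconstruction of the counterfactual-reasoning code prompt (assumed);
# event texts and function names are placeholders.
counterfactual_prompt = '''
def premise():
    # Tom took his dog to the park.
    pass

def initial_event():
    # It suddenly started to rain heavily.
    pass

def counterfactual_event():
    # The weather stayed sunny all afternoon.
    pass

def original_ending():
    # Tom hurried home and dried off his dog.
    pass

def main():
    premise()
    if initial_event():
        original_ending()          # the original branch
    elif counterfactual_event():
        counterfactual_ending()    # the counterfactual branch to be written

# Rewrite the ending for the counterfactual branch, modifying the original ending
# as little as possible.
def counterfactual_ending():
'''
```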
For the text prompt design, natural language is used to describe the same causal structures. For model comparison, pairs of LLMs such as <LLAMA-2, CODELLAMA> and <GPT-3, CODEX> were selected; each pair shares the same architecture but differs in the proportion of text versus code in the training corpus.
For exploring construction factors in code prompts, four aspects were selected for intervention: information, structure, format, and language.
4. Experiments
In the result evaluation experiments, abductive reasoning uses the ART dataset and counterfactual reasoning uses the TimeTravel dataset; the evaluation metrics are BLEU-4, ROUGE-L, CIDEr, and BERTScore (a sketch of how such metrics can be computed is given below). Under the zero-shot setting, automatic evaluation results for abductive reasoning are shown in Table 1 and for counterfactual reasoning in Table 2; under the one-shot setting, results are shown in Table 3; human evaluation results are shown in Table 4.
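As a hedged sketch rather than the authors' evaluation scripts, the reference-based metrics could be computed with the Hugging Face evaluate library roughly as follows; CIDEr is omitted because it is not bundled with evaluate:

```python
# A sketch of computing the reported reference-based metrics with Hugging Face `evaluate`
# (not the paper's evaluation code); CIDEr is omitted since it is not bundled with `evaluate`.
import evaluate

predictions = ["He bought her a bouquet of roses."]   # model outputs
references  = ["He surprised her with flowers."]      # gold references

bleu  = evaluate.load("bleu").compute(predictions=predictions,
                                      references=[[r] for r in references],
                                      max_order=4)    # BLEU-4
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bert  = evaluate.load("bertscore").compute(predictions=predictions,
                                           references=references, lang="en")

print(f"BLEU-4: {bleu['bleu']:.4f}")
print(f"ROUGE-L: {rouge['rougeL']:.4f}")
print(f"BERTScore F1: {sum(bert['f1']) / len(bert['f1']):.4f}")
```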
In the prompt intervention experiments, two types of prior information were studied for the information aspect: task instructions and function names. In the "no instruction" setting, task instructions are removed from the prompt; in the "function name perturbation" setting, the original function names are replaced with anonymous names such as X. For structure, one intervention converts the conditional structure into a sequential structure and another randomly shuffles the positions of the functions within the conditional structure (see the sketch below). For format, two alternatives to the original format are tested: class and print. For language, the original Python program is translated into two other programming languages, Java and C, to assess the impact of the programming language. Experimental results are shown in Table 5.
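As a minimal assumed illustration of the structure intervention, the conditional main function from the abductive prompt could be flattened into a sequential one, which keeps the same information but removes the explicit causal branching:

```python
# Assumed illustration of the "conditional -> sequential" structure intervention;
# only main() changes, the event functions stay the same.
conditional_main = '''
def main():
    premise()
    if hypothesis():
        outcome()
'''

sequential_main = '''
def main():
    premise()
    hypothesis()
    outcome()
'''
```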
In the fine-tuning experiments, fine-tuning data were constructed from the existing code corpus CodeAlpaca-20k (Chaudhary, 2023), which contains 20,000 instruction-following examples originally used to fine-tune LLaMA into the Code Alpaca model. Three 7B models, LLAMA-2, QWEN1.5, and DEEPSEEK-LLM, were fine-tuned for one epoch with a batch size of 128, using the AdamW optimizer, a learning rate of 2e-5, and a warmup ratio of 0.03; the maximum sequence length was set to 512, which covers most CodeAlpaca data. Hyperparameters such as batch size, number of epochs, and learning rate were selected by grid search on the validation sets of ART and TimeTravel. Automatic evaluation results for models fine-tuned on conditional statements are shown in Table 6; a sketch of this setup is given below. To study how performance improves with the amount of training data, the proportion of training data was varied from 0% to 100% and the fine-tuned models were evaluated, with results shown in Figure 4.
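A minimal sketch of this setup, assuming the Hugging Face stack, a simple keyword filter for conditional statements, and the dataset id sahil2801/CodeAlpaca-20k (none of which are confirmed by the notes above), might look like this:

```python
# Minimal sketch (assumed, not the authors' released code): keep only CodeAlpaca examples
# containing conditional statements, then fine-tune a 7B causal LM with the stated
# hyperparameters. Dataset id and field names are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "meta-llama/Llama-2-7b-hf"   # or QWEN1.5 / DEEPSEEK-LLM 7B checkpoints
data = load_dataset("sahil2801/CodeAlpaca-20k", split="train")  # assumed hub id

# Keep only instruction-following examples whose output contains a conditional statement.
data = data.filter(lambda ex: "if " in ex["output"] or "elif " in ex["output"])

tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token

def to_features(ex):
    text = f"{ex['instruction']}\n{ex['input']}\n{ex['output']}"
    return tok(text, truncation=True, max_length=512)  # 512 covers most CodeAlpaca data

data = data.map(to_features, remove_columns=data.column_names)

args = TrainingArguments(
    output_dir="llama2-cond-ft",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # effective batch size 128 on a single device
    num_train_epochs=1,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    optim="adamw_torch",
    bf16=True,
)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained(MODEL),
    args=args,
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```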
Table 1: Automatic evaluation results for abductive reasoning
Table 2: Automatic evaluation results for counterfactual reasoning
Table 3: Evaluation results under one-shot settings
Table 4: Human evaluation results
Table 5: Prompt intervention experiment results
Table 6: Automatic evaluation results for models fine-tuned on conditional statements
Figure 4: Model performance after fine-tuning with different proportions of training data
Experiments show that describing complex causal structures with code is clearer and easier for most models to understand, and all models perform better under the one-shot setting than under the zero-shot setting. Code language models outperform the corresponding general language models in most settings, and for most models code prompts outperform text prompts; the advantages of code language models and code prompts are robust across different settings. Conditional structures that properly describe the relationships between events are crucial for model reasoning, while the models are more robust to interventions on format and programming language, indicating that performance can be further improved through fine-grained prompt engineering. Additionally, in most cases the largest performance gains occur between 0% and 20% of the training data, suggesting that causal reasoning abilities can be substantially improved with only a small amount of conditional-statement code (fewer than a thousand examples); conditional statements are thus the key to enhancing causal reasoning abilities.
5. Conclusion
This paper investigates the causal reasoning abilities of code language models (Code-LLMs) and the effectiveness of code prompts in causal reasoning tasks. The results show that, on complex causal reasoning tasks, code language models outperform general language models with the same architecture. Compared with text prompts, code prompts describe causal structures more effectively and improve the performance of a wide range of language models. The paper further analyzes the importance of different aspects of code prompts and finds that providing a reasonable causal structure in code helps models generate reasonable outputs. Based on these observations, the paper hypothesizes that fine-tuning models on code corpora containing conditional statements can improve causal reasoning abilities, and validates this hypothesis experimentally. These findings indicate that code, and in particular its conditional statements, can play an important role in eliciting and enhancing the causal reasoning abilities of language models through both prompting and fine-tuning.