PKU, Tsinghua, UvA, CMU, etc. Jointly Release: Latest Survey on Logical Reasoning Abilities of Large Models

Research on large models is gradually shifting from scaling-law-driven pre-training to post-training aimed at logical reasoning. Given the effectiveness and generality of symbolic logical reasoning, improving the logical reasoning ability of large models has become a key route to mitigating the hallucination problem.

To advance research on the logical reasoning capabilities of large language models, researchers from Peking University, Tsinghua University, the University of Amsterdam (UvA), Carnegie Mellon University (CMU), and MBZUAI conducted a comprehensive survey of the latest methods and evaluation benchmarks in this field. Their jointly published report, "Empowering LLMs with Logical Reasoning: A Comprehensive Survey," systematically organizes existing methods and explores future research directions for two key scientific questions: logical question answering and logical consistency.

The survey has been accepted to the IJCAI 2025 Survey Track, and the author team will also give a Tutorial on the same topic at IJCAI 2025, comprehensively discussing the challenges, methods, and opportunities in this research area.

Paper title: Empowering LLMs with Logical Reasoning: A Comprehensive Survey

Paper link: https://arxiv.org/abs/2502.15652

Overview

Although large language models (LLMs) have achieved remarkable success on many natural language tasks, recent studies show that their logical reasoning capabilities remain significantly deficient. The paper summarizes the logical reasoning difficulties of large models along two dimensions:

  • Logical Question Answering: LLMs often struggle to generate correct answers when performing complex reasoning such as deduction, induction, or abduction under given premises and constraints. For example, suppose the premises are "Metals conduct electricity; insulators do not conduct electricity; if something is made of iron, then it is a metal; nails are made of iron," and the question is "Is the following statement true, false, or undetermined: Nails do not conduct electricity." To answer correctly, a large language model must deduce the reasoning chain "nails → made of iron → metal → conduct electricity," and thereby conclude that the statement is actually "false" (a minimal sketch of this chain appears after this list).

  • Logical Consistency: LLMs are prone to generating contradictory answers between different questions. For example, the Macaw question answering model answers "yes" to both "Is a magpie a bird?" and "Do birds have wings?" but gives a negative answer to "Do magpies have wings?"
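To make the deduction chain above concrete, here is a minimal, self-contained forward-chaining sketch in Python (our own illustration, not code from the survey) that derives the conclusion for the nails example:

```python
# Minimal forward-chaining sketch (illustrative only, not from the survey):
# encode the premises of the nails example and derive the chain
# nail -> made_of_iron -> metal -> conducts_electricity.

facts = {"made_of_iron(nail)"}
rules = [
    # (antecedent, consequent) pairs standing in for the natural-language premises
    ("made_of_iron(nail)", "metal(nail)"),           # "if something is made of iron, it is a metal"
    ("metal(nail)", "conducts_electricity(nail)"),   # "metals conduct electricity"
]

# Repeatedly apply rules until no new facts are derived (fixed point).
changed = True
while changed:
    changed = False
    for antecedent, consequent in rules:
        if antecedent in facts and consequent not in facts:
            facts.add(consequent)
            changed = True

# The statement "nails do not conduct electricity" contradicts a derived fact,
# so its truth value is "false".
print("conducts_electricity(nail)" in facts)  # True -> the statement is false
```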

To advance research in this field, we systematically review the latest technical methods and establish a corresponding taxonomy. Specifically, for logical question answering, existing methods can be grouped into categories based on external solvers, prompting techniques, and pre-training and fine-tuning. For logical consistency, we discuss its common forms, including negation consistency, implication consistency, transitivity consistency, factual consistency, and their composites, and summarize the corresponding technical approaches for each type.

In addition, we summarize commonly used benchmark datasets and evaluation metrics and discuss several promising research directions, such as extending to modal logic to handle uncertainty and developing efficient algorithms that satisfy multiple types of logical consistency simultaneously.

The specific structure of the article is shown in the figure below.


Figure 1: Taxonomy of the survey on logical reasoning in large models, covering the two key scientific problems of logical question answering and logical consistency

Two aspects of the logical reasoning dilemma for large models

Although large language models have shown remarkable performance on a wide range of natural language tasks such as text generation, classification, and translation, they still face significant challenges in complex logical reasoning. One reason is that their pre-training corpora consist mainly of human-written text, which contains few high-quality logical reasoning samples (such as deductive proofs). Moreover, objectives like next-token prediction and masked language modeling teach grammar, semantics, and world knowledge, but do not guarantee logical reasoning ability. These limitations lead to poor performance on the following two tasks that require logical reasoning.

Logical Question Answering

Large language models often fail to generate correct answers in logical question answering, which requires them to perform complex deduction, induction, or abduction reasoning given a series of premises and inference rules. Specifically, these logical problems can be roughly divided into two categories:

  • Judging whether a statement can be derived from the given information, i.e., outputting the truth value of the statement: true, false, or undetermined.

  • Finding all options from multiple choices that do not violate the given premises and constraints.

Surprisingly, on the logical reasoning dataset FOLIO, the 13B-parameter LLaMA model achieves an accuracy of only 33.63% with 8-shot prompting, barely above the 33.33% random-guess baseline over the three labels true, false, and undetermined. This severely limits the practical application of large language models in scenarios such as intelligent question answering and autonomous decision making.

Logical Consistency

When reasoning about complex problems, large language models are prone to giving contradictory answers to different questions, or answers that contradict knowledge bases or logical rules; we call this a violation of logical consistency.

Logical consistency can take many forms. For example, the 70B-parameter LLaMA-2 model answers "true" to both "Is an albatross a creature?" and "Isn't an albatross a creature?", which violates the law of non-contradiction. As another example, the Macaw question answering model answers "yes" to both "Is a magpie a bird?" and "Do birds have wings?" but answers "no" to "Do magpies have wings?", which violates the syllogistic inference rule.

Many studies have shown that training only on large question answering datasets cannot ensure the logical consistency of large language models. These contradictory answers raise concerns about the reliability and trustworthiness of large language models, especially limiting their practical deployment in high-risk scenarios such as medical diagnosis, legal consultation, and industrial process control.

We can view logical question answering and logical consistency as two sides of the same coin for the logical reasoning ability of large language models. Next, we will summarize the latest research progress in these two aspects.

Methods to improve logical question answering ability

To better understand the boundaries of large language model logical reasoning capabilities and explore more effective technical methods, researchers have developed many relevant evaluation tasks and benchmark datasets to assess the performance of large models in logical question answering tasks. Based on this, many studies have explored methods to enhance the logical reasoning capabilities of large language models. These methods can be broadly divided into three categories: methods based on external solvers, prompting methods, and pre-training and fine-tuning methods. The specifics are introduced below.

1. Methods based on external solvers

The general idea is to translate a logical problem expressed in natural language (NL) into a symbolic language (SL) representation, invoke an external solver to perform the logical reasoning, and then produce the final answer via ensemble strategies such as majority voting, as shown in Figure 2. A minimal sketch of this pipeline follows the figure.


Figure 2: Methods based on external solvers to improve the logical question answering ability of large models
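As an illustration of this pipeline, the sketch below hand-codes the symbolic translation of the nails example from earlier and delegates the reasoning to the Z3 SMT solver (assuming the z3-solver package is installed; in a real system the NL-to-SL translation would be produced by the LLM rather than written by hand):

```python
# Solver-based sketch: the NL premises are assumed to have already been
# translated into symbolic form; Z3 performs the actual reasoning.
# Requires: pip install z3-solver
from z3 import Bool, Implies, Not, Solver, unsat

made_of_iron, metal, conducts = Bool("made_of_iron"), Bool("metal"), Bool("conducts_electricity")

premises = [
    Implies(made_of_iron, metal),  # "if something is made of iron, it is a metal"
    Implies(metal, conducts),      # "metals conduct electricity"
    made_of_iron,                  # "nails are made of iron"
]

def entailed(statement) -> bool:
    """Check whether the premises logically entail `statement`."""
    s = Solver()
    s.add(premises)
    s.add(Not(statement))  # entailment holds iff premises ∧ ¬statement is unsatisfiable
    return s.check() == unsat

# "Nails do not conduct electricity" is refuted: its negation is entailed.
print(entailed(conducts))       # True  -> nails conduct electricity
print(entailed(Not(conducts)))  # False -> the statement cannot be derived
```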

2. Prompt-based methods

One idea is to design prompts that make LLMs explicitly construct a logical reasoning chain when answering questions; another is to design prompts that perform the conversion between NL and SL expressions, thereby improving the logical reasoning ability of large models.
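A minimal sketch of the first idea, assuming a generic chat-style LLM API and a prompt wording of our own choosing (not a prompt taken from any specific paper in the survey):

```python
# Chain-of-thought-style prompt construction (illustrative; the wording is an assumption).

premises = [
    "Metals conduct electricity.",
    "Insulators do not conduct electricity.",
    "If something is made of iron, then it is a metal.",
    "Nails are made of iron.",
]
question = "Is the following statement true, false, or undetermined: Nails do not conduct electricity."

prompt = (
    "Premises:\n"
    + "\n".join(f"- {p}" for p in premises)
    + f"\n\nQuestion: {question}\n"
    "Let's reason step by step, citing one premise per step, "
    "then answer with exactly one of: true, false, undetermined.\n"
)
print(prompt)  # this string can then be sent to any LLM chat/completion API
```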

3. Pre-training and fine-tuning methods

Because pre-training corpora contain few high-quality multi-step logical reasoning or proof samples, pre-training and fine-tuning methods augment the training data with deductive proofs or natural language examples that contain explicit logical reasoning processes, and then pre-train or fine-tune the model on the augmented data.
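For illustration, one such training record might pair a logical question with a written-out deductive proof as the target; the JSONL schema below is a hypothetical format, not that of any particular dataset in the survey:

```python
# Hypothetical fine-tuning record with an explicit deductive proof as the target.
import json

record = {
    "prompt": (
        "Premises: Metals conduct electricity. If something is made of iron, it is a metal. "
        "Nails are made of iron.\n"
        "Question: Is 'Nails do not conduct electricity' true, false, or undetermined?"
    ),
    "completion": (
        "Nails are made of iron, so nails are metal (rule: iron -> metal). "
        "Nails are metal, so nails conduct electricity (rule: metal -> conducts). "
        "Therefore the statement is false."
    ),
}

# Append the record to a JSONL training file.
with open("logic_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```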

Methods to improve logical consistency

Developing reliable large language models and ensuring their safe deployment is becoming increasingly important, especially when they are used as knowledge sources. For trustworthiness, logical consistency is crucial: logically consistent models avoid contradictions between answers to different questions, which reduces hallucinations and strengthens end users' confidence in their reliability in practice.

Logical consistency requires that large models not contradict their own answers, knowledge bases, or logical rules when reasoning about complex problems and answering different questions. Ensuring that a model does not contradict itself is also known as self-consistency. Many studies have shown that training only on large datasets cannot guarantee that a model's answers are logically consistent.

We classify the various types of logical consistency according to the logical relationships that should hold among one, two, or more propositions, and discuss the methods and evaluation benchmarks for enhancing each type of logical consistency in large models.

1. Negation Consistency

Negation consistency requires that a model's judgments about a single proposition not be contradictory: p and ¬p cannot both be true, and exactly one of them holds. Formally, ¬(p ∧ ¬p) must hold, which in classical logic is equivalent to p ∨ ¬p.
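A minimal sketch of how negation consistency could be checked automatically, assuming a placeholder `ask_model` interface that returns a yes/no answer from whatever model is being evaluated:

```python
# Negation-consistency check (sketch): query a statement and its negation and
# flag the case where the model gives the same answer to both.

def ask_model(question: str) -> bool:
    """Placeholder: return the model's yes/no answer to `question`."""
    raise NotImplementedError

def violates_negation_consistency(statement: str, negated_statement: str) -> bool:
    a, b = ask_model(statement), ask_model(negated_statement)
    return a == b  # consistent answers must differ: exactly one of p, ¬p is true

# Example from the text (LLaMA-2 70B reportedly answers "true" to both):
# violates_negation_consistency("Is an albatross a creature?",
#                                "Isn't an albatross a creature?")
```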

2. Implication Consistency

Implication consistency is based on the logical rule p → q: given the constraint p → q and the premise p, it follows that q is true. If the model asserts p as true but outputs "q is false," we say its answers violate implication consistency.

For example, given the physical fact "All iron is metal (iron → metal)", a large model should not simultaneously answer "This material is iron (p)" as "true" and "This material is metal (q)" as "false."
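Implication consistency can be checked in the same spirit; the sketch below again assumes a placeholder yes/no `ask_model` interface, and the example questions are hypothetical paraphrases of p and q:

```python
# Implication-consistency check (sketch) for a known rule p -> q:
# the only forbidden answer pattern is p = true and q = false.

def ask_model(question: str) -> bool:
    """Placeholder: return the model's yes/no answer to `question`."""
    raise NotImplementedError

def violates_implication_consistency(p_question: str, q_question: str) -> bool:
    p, q = ask_model(p_question), ask_model(q_question)
    return p and not q

# Example with the rule "iron -> metal":
# violates_implication_consistency("Is this material iron?", "Is this material metal?")
```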


