Synced Report
Editor: Chen Ping
This research delves into common failure modes of LLMs: greediness, frequency bias, and the knowing-doing gap.
The success of Large Language Models (LLMs) has sparked interest in building agents on top of them. A key assumption behind LLM agents is that LLMs can draw on common sense and Chain-of-Thought (CoT) reasoning, allowing agents to explore effectively and solve problems efficiently in complex domains.
However, LLM agents suffer from suboptimal exploration and the knowing-doing gap, which is the inability to effectively translate knowledge within the model into action.
In this paper, researchers from Google DeepMind systematically investigate why LLMs perform suboptimally in decision-making scenarios. In particular, this paper delves into three common failure modes: greediness, frequency bias, and the knowing-doing gap.
Building on this, this paper proposes fine-tuning automatically generated CoT reasoning processes through Reinforcement Learning (RL) to alleviate these shortcomings. Experiments show that RL fine-tuning can effectively improve the decision-making capabilities of LLMs – enhancing agents' exploratory behavior and narrowing the knowing-doing gap.
Paper Title: LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
Paper Link: https://www.alphaxiv.org/abs/2504.16078
Method Introduction
This paper systematically analyzes three typical deficiencies in small to medium-sized LLMs: greediness, frequency bias, and the knowing-doing gap. The analysis shows that because LLMs prematurely commit to greedy action-selection strategies, action coverage stagnates (up to 55% of actions remain unexplored) and final performance consistently falls short of optimal.
Specifically, this paper finds that smaller LLMs (2B) tend to mechanically replicate high-frequency actions from the context (regardless of their reward differences), a phenomenon defined as frequency bias.
In contrast, larger LLMs (27B), although significantly reducing frequency bias, still maintain greedy behavior.
It is also noteworthy that this paper quantifies the knowing-doing gap and finds that while LLMs can correctly understand task requirements, they are unable to effectively execute known solutions due to their adherence to greedy actions.
To overcome these deficiencies, this paper proposes a Reinforcement Learning Fine-Tuning (RLFT) method based on automatically generated Chain-of-Thought (CoT) reasoning.
The RLFT method relies on rewards obtained from environment interaction to fine-tune the model's self-generated CoT rationales. During RLFT, the model learns to iteratively refine its reasoning, favoring CoT patterns and actions that yield higher rewards (see Figure 1). The method is tailored specifically to decision-making scenarios.
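As a rough structural sketch (not the authors' code), the loop can be pictured as follows. The environment, policy, and helper callables are hypothetical placeholders passed in from outside; the clipped loss refers to the fine-tuning objective given below, and the prompt-building step is illustrated after the next paragraph.

```python
# Structural sketch of RL fine-tuning on self-generated CoT rationales.
# Everything here is a hypothetical placeholder (environment, LLM policy,
# helper callables), not the paper's implementation.

def rlft_step(env, policy, ref_policy, optimizer,
              make_prompt, clipped_loss, rollout_len=50, context_window=10):
    """Collect one rollout and apply a single PPO-style update."""
    history, transitions = [], []
    obs = env.reset()
    for _ in range(rollout_len):
        prompt = make_prompt(obs, history, context_window)   # instruction + last C steps
        rationale, action = policy.generate(prompt)          # CoT text + parsed action
        next_obs, reward = env.step(action)
        history.append((obs, action, reward))
        transitions.append((prompt, rationale, action, reward))
        obs = next_obs

    # Environment rewards determine which rationale/action patterns are
    # reinforced; the loss is a clipped surrogate with a KL penalty toward
    # the reference policy (see the fine-tuning objective below).
    loss = clipped_loss(policy, ref_policy, transitions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```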
Context Representation: At step t, the input tokens include the input instruction, the output instruction, and the recent interaction history. The history representation contains the trajectory of the most recent C states, actions, and rewards.
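To make this layout concrete, here is a minimal, hypothetical prompt builder; the template wording and field names are assumptions for illustration, not the paper's exact format.

```python
# Hypothetical prompt construction: instruction + last C (state, action, reward)
# steps + output instruction. The template wording is an illustrative assumption.

def build_context(instruction, output_instruction, history, C):
    recent = history[-C:]  # keep only the most recent C interactions
    lines = [instruction]
    for t, (state, action, reward) in enumerate(recent):
        lines.append(f"Step {t}: state={state}, action={action}, reward={reward}")
    lines.append(output_instruction)
    return "\n".join(lines)

# Example usage on a 5-armed bandit with C = 3.
history = [("bandit", 2, 0.1), ("bandit", 4, 1.3), ("bandit", 4, 0.9), ("bandit", 1, 0.2)]
prompt = build_context(
    "You are playing a 5-armed bandit. Maximize total reward.",
    "Think step by step, then output the next arm to pull.",
    history,
    C=3,
)
print(prompt)
```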
Fine-tuning Objective: This paper fine-tunes with the clipped objective introduced by Schulman et al. (PPO) and applies an additional KL constraint with respect to the reference policy:
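The equation is not reproduced in this write-up; in standard PPO notation, a clipped surrogate with a KL penalty toward the reference policy takes roughly the following form (the paper's exact symbols may differ):

$$
\mathcal{L}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big] - \beta\,\mathbb{E}_t\Big[\mathrm{KL}\big(\pi_\theta(\cdot\mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s_t)\big)\Big],
\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}
$$

Here $\hat{A}_t$ is an advantage estimate, $\epsilon$ the clipping range, and $\beta$ the weight of the KL term; fine-tuning maximizes this objective.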
Experimental Results
Comparison Models: Experiments compared three sizes of the Gemma2 model: 2B, 9B, and 27B.
Environments: Multi-Armed Bandits (MABs) and the game of Tic-Tac-Toe.
Why Do LLMs Perform Poorly in Decision Making?
Previous studies have found that LLM agents perform poorly in interactive environments and lack sufficient exploration. Therefore, this paper first investigates the reasons for poor model performance and identifies three common failure modes: (1) greediness, (2) frequency bias, and (3) the knowing-doing gap. These three failure modes are found to persist across various model sizes.
Greediness is the first and most prevalent failure mode, characterized by LLMs excessively favoring the best-performing action observed so far among a small subset of actions. To illustrate this failure mode, this paper shows the average action coverage achieved by Gemma2 2B/9B/27B with and without CoT enabled, across 64 MABs (with 10 and 20 arms) over 50 interaction steps (see Figures 3a and 3b).
Results show that the models adopt a greedy strategy prematurely, causing action coverage to stagnate after 10 steps. Increasing the number of arms makes greediness more pronounced, with the largest model covering only 45% of all actions. Thus, although these models improve substantially over random agents (see Figure 3c), their regret remains high compared to UCB (Upper Confidence Bound).
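To make "action coverage" and "regret relative to UCB" concrete, below is a small self-contained sketch on a Gaussian 10-armed bandit; the arm means, horizon, and UCB variant (UCB1) are arbitrary illustrative choices, not the paper's exact setup.

```python
import math
import random

# Self-contained illustration of action coverage and cumulative regret on a
# Gaussian multi-armed bandit, with UCB1 as the exploration baseline.

random.seed(0)
true_means = [random.gauss(0, 1) for _ in range(10)]   # 10-armed bandit
best_mean = max(true_means)
horizon = 50

def pull(arm):
    return random.gauss(true_means[arm], 1.0)

counts = [0] * len(true_means)
values = [0.0] * len(true_means)
chosen, regret = [], 0.0

for t in range(1, horizon + 1):
    # UCB1: try each arm once, then pick the arm with the highest upper bound.
    if 0 in counts:
        arm = counts.index(0)
    else:
        arm = max(range(len(true_means)),
                  key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running mean estimate
    chosen.append(arm)
    regret += best_mean - true_means[arm]                # regret vs. the optimal arm

coverage = len(set(chosen)) / len(true_means)            # fraction of arms tried at least once
print(f"action coverage: {coverage:.0%}, cumulative regret: {regret:.2f}")
```

A greedy agent that commits to the best arm seen after a few pulls would plateau at much lower coverage; that stagnation is what Figure 3 measures.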
The next common failure mode explored in this paper is frequency bias, characterized by the model repeatedly selecting the most frequent action in the context, even if that action yields low rewards.
Results show that Gemma2 2B is severely affected by repeated actions: the entropy of its choices keeps decreasing as the number of repetitions grows (see Figure 4a), i.e., its frequency bias strengthens with repetition. The 27B model largely overcomes frequency bias (see Figure 4c), but it remains severely affected by greediness.
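One simple way to probe frequency bias (not necessarily the paper's exact protocol) is to place a low-reward action many times in the context and check how concentrated the model's choices become; the sketch below builds such a probe and measures the entropy of sampled actions, with the actual LLM call left out.

```python
import math
from collections import Counter

# Hypothetical frequency-bias probe: a low-reward action dominates the context
# by frequency, and we measure how concentrated the model's sampled choices are.
# The LLM call itself is omitted; `sampled_actions` stands in for its outputs.

def biased_history(repeated_action, repeats, good_action):
    """Context where a bad action dominates by frequency despite low reward."""
    steps = [(repeated_action, 0.1)] * repeats + [(good_action, 1.0)]
    return "\n".join(f"action={a}, reward={r}" for a, r in steps)

def action_entropy(sampled_actions):
    """Shannon entropy (nats) of the empirical action distribution."""
    counts = Counter(sampled_actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(biased_history(repeated_action=3, repeats=8, good_action=7))
# Pretend these were sampled from a small model that copies the frequent action:
print(action_entropy([3, 3, 3, 3, 7, 3, 3, 3]))   # low entropy -> strong frequency bias
```

Low entropy in the presence of a clearly better alternative indicates the model is copying the frequent action rather than reasoning about rewards.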
Knowing-doing gap. The agent clearly knows how to solve the task: 87% of its rationales are computed correctly (see Figure 5). Yet even when the rationale is correct, the model frequently chooses the greedy action (58%) rather than the optimal one (21%). This discrepancy shows that the LLM knows the algorithm but fails to act on it consistently.
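The gap itself can be tallied by crossing whether each rationale is correct with whether the executed action is optimal, greedy, or something else. The sketch below computes such a breakdown from pre-labeled rollouts; the labels are assumed to come from an external checker (e.g. comparing the rationale's computed values against UCB), and the sample records are made up.

```python
from collections import Counter

# Hypothetical tally of the knowing-doing gap: each record pairs a judgment of
# the CoT rationale ("correct"/"incorrect") with the action actually taken
# ("optimal", "greedy", or "other"). Records here are illustrative dummies.

records = [
    ("correct", "greedy"), ("correct", "optimal"), ("correct", "greedy"),
    ("correct", "greedy"), ("incorrect", "other"), ("correct", "optimal"),
]

rationale_accuracy = sum(r == "correct" for r, _ in records) / len(records)
among_correct = Counter(a for r, a in records if r == "correct")
total_correct = sum(among_correct.values())

print(f"correct rationales: {rationale_accuracy:.0%}")
for action_type, count in among_correct.items():
    print(f"  {action_type} action given a correct rationale: {count / total_correct:.0%}")
```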
Effectiveness of RL Fine-tuning
Next, this paper investigates the impact of RLFT on cumulative regret (relative to the optimal strategy) and whether it can alleviate these failure modes.
Results show that RLFT reduces regret. Across environments, the LLMs perform significantly better than random baselines, and RLFT lowers regret for the 2B and 9B models.
Furthermore, RLFT alleviates greediness: fine-tuned agents learn to explore more, counteracting their tendency to commit prematurely to the best action seen so far.