Editor | Yifeng
Are you using MCP incorrectly?
MCP (Model Context Protocol) is often described as a "USB interface" for large models, and many developers' first instinct is to stack more specialized tools into it (grep, sed, tmux...), as if that would make the AI more capable.
However, a trending post on Hacker News put forward a completely opposite conclusion:
👉 The more tools, the bigger the mess. The best design for MCP is to keep only one tool: a code executor.
Developers know from experience how brittle command-line tools can be:
• Poor compatibility across platforms and versions
• Newlines and special characters frequently cause errors
• Sessions get confused and processes run out of control
The author's key realization is that these are not minor bugs but structural problems.
So the question becomes: what exactly is wrong with command-line tools, and why is the answer not more small tools but a single "super tool": an interpreter that can run Python/JS directly?
Why do MCP calls to command-line tools always crash?
The author states that calling command-line tools is the most frustrating part, because:
Once the AI gets one small detail wrong, it either has to start over from scratch or switch to another tool.
Behind this are two obvious flaws:
First, poor platform and version compatibility.
Command-line tools often depend on specific environments and are sometimes barely documented. The result: almost every first call hits a snag.
A typical example is handling non-ASCII characters: Claude Sonnet and Opus sometimes don't know how to pass newlines or control characters through the shell.
This is not rare. A C source file, for instance, conventionally needs a trailing newline, and AI tools tend to get stuck on exactly this detail, spiraling into a series of "creative" tool loops to work around it.
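As a small illustration of why this trips models up (the file name and contents are invented), writing the trailing newline from Python is unambiguous, whereas threading "\n" through shell quoting is exactly where they tend to slip:

```python
from pathlib import Path

# Hypothetical example: a C source file that must end with a trailing newline.
# Getting "\n" through shell quoting and echo flags is where models often slip;
# writing the bytes from Python sidesteps the escaping problem entirely.
source = '#include <stdio.h>\nint main(void) { puts("hi"); return 0; }\n'
Path("hello.c").write_text(source)

# Confirm the file really ends with a newline.
assert Path("hello.c").read_text().endswith("\n")
```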
Second, call chains are too long and state is hard to manage.
Some agents (especially Claude Code) add an extra "safety pre-check" before executing a shell call. Claude first uses the smaller Haiku model to determine if the call is dangerous before deciding whether to execute it.
Multi-round calls are even harder. Asking it to drive LLDB remotely through tmux should work in theory, but in practice it often "forgets": it changes the session name midway, loses track of the session it already opened, and so cannot finish the task.
In summary, once command-line tools enter a multi-round calling scenario, stability becomes their biggest weakness.
And this, in turn, obscures the original advantages of CLI tools.
The command line's strength is composability, and MCP is weakening it
Command-line tools are not really standalone tools; they are a whole toolkit designed to be composed through a programming language (bash).
In bash, you can chain grep, awk, sed, tmux, and other small tools together, where the output of one tool directly becomes the input of the next, solving complex problems with a single command line.
This is the "composability" of the command line.
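A hypothetical sketch of that composability (the log file and field positions are invented): three small tools run in a single pass, each feeding the next, here launched from Python to keep the examples in one language.

```python
import subprocess

# Hypothetical log file and pattern; the point is the chaining, not the data.
# One pass: grep filters lines, awk picks a field, sort | uniq -c summarizes.
pipeline = "grep 'ERROR' app.log | awk '{print $3}' | sort | uniq -c"
result = subprocess.run(pipeline, shell=True, capture_output=True, text=True)
print(result.stdout)
```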
However, once you switch to MCP, this kind of composition without extra reasoning disappears (at least in today's implementations).
Why?
Because MCP's calling model treats tools as black boxes: it calls one tool at a time, gets the result, and then enters the next round of reasoning.
This means that if AI wants to replicate the flexible composition of bash, it must re-reason and call tools step-by-step, a process that is both slow and prone to errors.
A classic example is using tmux to remotely control lldb. In CLI, AI would chain it like this:
• It first uses tmux send-keys to input commands
• Then uses tmux capture-pane to capture output
• It might even insert sleep to wait, then continue capturing, to avoid reading results too early
When it runs into tricky character-encoding issues, it even switches approaches, for example converting to base64 and decoding on the other end.
Under MCP, this process would be broken into many rounds. Each step requires re-reasoning the state (e.g., session name, breakpoint location, last output fragment), and if any link in the chain breaks, the whole process restarts.
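For contrast, here is a rough sketch of what those steps look like when they live in one script instead of separate MCP rounds. The session name, target binary, and timing are assumptions; a real script would poll for the prompt instead of sleeping.

```python
import subprocess
import time

SESSION = "dbg"  # hypothetical session name the agent must not "forget"

def tmux(*args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["tmux", *args], capture_output=True, text=True)

# Start a detached session running lldb on a made-up binary.
tmux("new-session", "-d", "-s", SESSION, "lldb ./mytool")
time.sleep(1)  # crude wait for the prompt

# Send a command, wait, then capture the pane output, all in one scripted flow.
tmux("send-keys", "-t", SESSION, "breakpoint set -n main", "Enter")
time.sleep(1)
print(tmux("capture-pane", "-t", SESSION, "-p").stdout)

tmux("kill-session", "-t", SESSION)
```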
The author also emphasized another CLI strength: allowing AI to first write small scripts, then reuse them, then assemble them, ultimately forming a stable set of automation scripts.
With MCP's black-box calls, this self-reinforcing "script, then reuse" path currently has little chance to emerge naturally.
A better MCP approach
The author's radical solution: forget the dozens of tools; MCP needs only one "super tool."
That super tool is a stateful Python/JS interpreter that executes code.
Shell tools have limits; sooner or later you find yourself fighting the tools, especially when an agent has to maintain complex sessions.
MCP, meanwhile, is inherently stateful, so a more practical idea is to expose a single "super tool": a stateful Python interpreter. It executes code via eval() and keeps context between calls, letting the agent work in a way it already knows.
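A minimal sketch of that idea, assuming nothing about any particular implementation: one namespace persists across calls, and every snippet runs inside it (exec is used here so multi-statement snippets work, where the post mentions eval()).

```python
import contextlib
import io
import traceback

# One persistent namespace: variables defined in one call survive into the next.
state: dict = {}

def run_code(code: str) -> str:
    """Execute a code snippet in the shared namespace and return its output."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, state)
    except Exception:
        return traceback.format_exc()
    return buf.getvalue()

# Two "tool calls": the second one sees the variable defined by the first.
run_code("x = 41")
print(run_code("print(x + 1)"))  # -> 42
```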
The author's experiment is pexpect-mcp. On the surface it is a tool called pexpect_tool, but it is really a persistent Python interpreter environment running on the MCP server, with the pexpect library preinstalled. pexpect is a Python port of the classic expect tool, able to script interactions with command-line programs.
The MCP server thus becomes a stateful Python interpreter, and the tool interface it exposes is simple and direct: execute the incoming Python snippet and carry over the accumulated context from all previous calls.
The tool interface description reads roughly as follows:
Executes Python code in a pexpect session; it can spawn processes and interact with them.
Parameters:
code: Python code to execute. Use the variable 'child' to interact with the spawned process.
pexpect is already imported; pexpect.spawn(...) can be used directly to launch a process.
timeout: Optional timeout in seconds; defaults to 30.
Example:
child = pexpect.spawn('lldb ./mytool')
child.expect("(lldb)")
Returns:
The result of the code execution, or an error message
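To give a feel for the code payload of a single call under this interface, here is a hedged sketch of driving LLDB with pexpect; the binary name is illustrative, and expect_exact is used so the "(lldb)" prompt is matched literally rather than as a regex.

```python
import pexpect

# Hypothetical target binary; the interaction pattern is what matters.
child = pexpect.spawn("lldb ./mytool", encoding="utf-8", timeout=30)
child.expect_exact("(lldb)")   # wait for the prompt

child.sendline("run")          # run the program
child.expect_exact("(lldb)")

child.sendline("bt")           # request a backtrace
child.expect_exact("(lldb)")
print(child.before)            # everything printed before the last prompt
```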
Under this model, MCP's role is no longer a "toolset" but a code executor, bringing several direct benefits:
• MCP handles session management and interaction
• The code written by the agent is almost identical to the script itself
• After the session ends, it can be easily organized into reusable debugging scripts
Practical Validation: A Leap in Efficiency and Reusability
To validate the effect of pexpect-mcp, the author used it to debug a known crashing C program (demo-buggy).
The process is as follows:
1. First debug (simulating traditional MCP mode): AI interacts with LLDB via pexpect_tool to pinpoint the crash cause (unallocated memory, array out-of-bounds). This took about 45 seconds and involved 7 tool calls.
2. Scripting: AI automatically exports the entire debugging process as an independent, readable Python script (debug_demo.py).
3. Reuse validation: In a new session, only 1 tool call was used to execute uv run debug_demo.py. The script reproduced the crash analysis within 5 seconds, accurately pinpointing the root cause.
The author stresses the most important point: the script is standalone; a human can run it directly, without relying on MCP at all.
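The post does not reproduce debug_demo.py, but a standalone export could plausibly look like this sketch (the exact LLDB commands and output handling are assumptions):

```python
#!/usr/bin/env python3
"""Hypothetical reconstruction of an exported debugging script (debug_demo.py)."""
import pexpect

def main() -> None:
    child = pexpect.spawn("lldb ./demo-buggy", encoding="utf-8", timeout=30)
    child.expect_exact("(lldb)")

    child.sendline("run")            # reproduce the crash
    child.expect_exact("(lldb)")
    print(child.before)              # the crash report

    child.sendline("bt")             # backtrace pointing at the bad access
    child.expect_exact("(lldb)")
    print(child.before)

    child.sendline("quit")

if __name__ == "__main__":
    main()
```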
The success case of pexpect-mcp reveals a more universal MCP design direction: instead of exposing a pile of fragmented and error-prone black box tools, it's better to use the programming language itself as the interaction interface.
Innovation: Hand-rolling a small MCP
A common problem with MCP: the more tools you register, the more likely the context is to rot, and the input limits become a real constraint.
But if MCP exposes not a pile of tools, but a programming language, then it indirectly opens up all the capabilities the model learned during training.
Even when you need to build something completely new, the programming language is at least familiar ground for the AI. You can hand-roll a small MCP of your own that lets it:
• Export the application's internal state
• Provide database query assistance (even for sharded architectures)
• Provide data reading API
In the past, AI could only understand these interfaces by reading code; now, it can also directly call and further explore them through a stateful Python/JavaScript session.
Even better, this gives the agent a chance to debug the MCP itself: thanks to the flexibility of Python and JavaScript, it can even help you troubleshoot the MCP server's internal state.
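One possible shape for such a hand-rolled MCP, sketched with the FastMCP helper from the official MCP Python SDK; the tool name, the get_app_state and query_db helpers, and the returned data are all assumptions, and SDK details may vary by version.

```python
import contextlib
import io
import traceback

from mcp.server.fastmcp import FastMCP  # official MCP Python SDK (pip install mcp)

mcp = FastMCP("my-app-debugger")

def get_app_state() -> dict:
    # Hypothetical helper exposing internal application state.
    return {"version": "1.2.3", "shards": 4}

def query_db(sql: str) -> list:
    # Hypothetical read-only query helper; plug in the real (sharded) lookup here.
    return []

# The helpers are pre-loaded into one persistent namespace.
namespace = {"get_app_state": get_app_state, "query_db": query_db}

@mcp.tool()
def run_python(code: str) -> str:
    """Execute Python code in a persistent namespace with app helpers available."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)
    except Exception:
        return traceback.format_exc()
    return buf.getvalue()

if __name__ == "__main__":
    mcp.run()
```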
Community debate: How should AI operate on code?
The discussion around this post actually touches the underlying philosophy of AI programming.
How exactly should AI operate on code?
Should it stay at the level of text (strings), or understand and manipulate code through more structured interfaces?
We have already seen that the brittleness of CLI tools (newline errors, chaotic session management) is essentially a limitation of working on strings.
So the question arises: if it is better for AI to write "real code," should it go a step further and understand the AST? (Note: an AST, or Abstract Syntax Tree, represents code as a tree in which each node is a variable, function, or statement; for compilers and IDEs it is a more precise structured interface than plain text.)
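For a concrete sense of what the structured view adds, Python's standard ast module turns a line of code from a string into exactly such a tree:

```python
import ast

# The same line of code, seen as a string versus as a syntax tree.
code = "total = price * quantity"
tree = ast.parse(code)
print(ast.dump(tree, indent=2))
# The dump shows an Assign node whose target is the name 'total' and whose
# value is a BinOp multiplying the names 'price' and 'quantity':
# structure, not characters.
```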
One camp believes:
Editors should lean more on language servers and other structured capabilities instead of letting agents fumble with old tools like grep, sed, and awk. For most languages, operations should not be on strings but on token streams and ASTs.
Another faction counters:
Realistically, AI is still better suited to operating on the code itself. They agree that current tool usage is inefficient, but AI mainly works on code rather than syntax trees, for several reasons:
1. Training sets contain far more code than syntax trees.
2. Code is almost always a more concise representation.
There have been attempts to train on AST edge information with graph neural networks or transformers, but beating mainstream LLMs that way would likely require major breakthroughs (and enormous capital). In practice, letting agents use ast-grep (a syntax-aware search-and-replace tool) works well: everything is still treated as code, but replacements happen in a syntax-aware way.
Others emphasize the universality of strings:
Strings are dependency-free, universal interfaces: you can do almost anything with them, across any language and any file. Other abstractions sharply limit what you can do. Besides, large language models (LLMs) are not trained on ASTs; they are trained on strings, just like programmers.
This points to an underlying issue:
LLMs learn the way humans write code, not a machine-optimal structured representation.
If someone did train models on ASTs at scale in the future, it would require enormous compute and funding, and might also sacrifice general world knowledge.
But perhaps in the future, a more efficient, machine-centric paradigm will emerge.
Do you think this approach will change the way we program in AI IDEs today? Feel free to share your thoughts in the comments.