Karpathy Forms an LLM 'Council': GPT-5.1, Gemini 3 Pro, and Others Become His Strongest Think Tank


From short videos to AI models, people's content consumption habits are shifting toward efficiency once again.

When reading long articles, papers, or vast amounts of information, more and more people are no longer patiently browsing from start to finish, but instead prefer to directly obtain high-density, quickly absorbable knowledge. Having a large model provide a summary directly—for example, a comment like "@Yuanbao, summarize it"—has become a common practice.

This isn't necessarily a bad thing. It precisely shows that in the AI era, efficiently acquiring information is itself a leap in human capability.

Even big names in the AI field are no exception, OpenAI founding member and former Tesla AI Director Andrej Karpathy among them. A few days ago, he tweeted that he has "started the habit of reading everything with LLMs."


This mirrors how many people now read: combining one's own insights from reading with a large model's summaries yields a more complete understanding.

Of course, with so many large language models available, their abilities to extract information and organize viewpoints vary greatly across different types of content. To get higher-quality results, Karpathy decided to put four of the latest and strongest large models to work together.

So, on Saturday, Karpathy vibe-coded a new project: an LLM council made up of four of the latest large models, to serve as his think tank.

His reasoning: instead of sending questions to just one favorite LLM provider, why not assemble several of them into your own "LLM council"?


This LLM council is a web application whose interface looks just like ChatGPT, but every user query actually goes through the following process:

1) The question is distributed to every model in the council (via OpenRouter); currently, for example:

• openai/gpt-5.1

• google/gemini-3-pro-preview

• anthropic/claude-sonnet-4.5

• x-ai/grok-4

2) Then all models can see each other's anonymized responses and review and rank them;

3) Finally, a "Chairman model (Chairman LLM)" takes this content as context and generates the final response.
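The fan-out in step 1 can be sketched in a few lines. This is a minimal illustration, not Karpathy's actual code: it assumes an API key in the `OPENROUTER_API_KEY` environment variable and uses OpenRouter's OpenAI-compatible chat-completions endpoint; `build_payload`, `ask_one`, and `ask_council` are hypothetical helper names.

```python
# Minimal sketch of the dispatch step: fan one question out to several
# council models in parallel via OpenRouter's OpenAI-compatible API.
# Helper names are illustrative, not Karpathy's actual implementation.
import concurrent.futures
import json
import os
import urllib.request

COUNCIL = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]

def build_payload(model: str, question: str) -> dict:
    """Assemble one chat-completions request body for a council member."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }

def ask_one(model: str, question: str) -> str:
    """Send the question to a single model and return its answer text."""
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(build_payload(model, question)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def ask_council(question: str) -> dict:
    """Query all council members concurrently; return {model: answer}."""
    with concurrent.futures.ThreadPoolExecutor(len(COUNCIL)) as pool:
        futures = {m: pool.submit(ask_one, m, question) for m in COUNCIL}
        return {m: f.result() for m, f in futures.items()}
```

Using a thread pool keeps total latency close to the slowest single model rather than the sum of all four.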

This will look familiar: it is strikingly similar to the "large model committee" that YouTuber PewDiePie recently built with vibe coding.

Specifically, PewDiePie used 8 instances of the same model (gpt-oss-20b), each configured with a different prompt (and thus a different personality), to form a committee. When he asks a question, each instance gives an answer, then the instances vote on the answers to select the best one.
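That voting step reduces to a simple tally. The sketch below is illustrative only (the function name is made up, and PewDiePie's actual setup may differ); it assumes each persona votes by returning the index of the answer it prefers:

```python
# Sketch of committee voting: each persona votes for the answer it
# considers best (by index); the answer with the most votes wins.
# Illustrative only; not the actual setup described in the article.
from collections import Counter

def pick_winner(answers: list[str], votes: list[int]) -> str:
    """Return the answer with the most votes. Ties go to whichever
    leading answer appeared first in the vote list (Counter order)."""
    tally = Counter(votes)
    winning_index, _ = tally.most_common(1)[0]
    return answers[winning_index]
```

For example, with three candidate answers and eight personas voting `[1, 1, 0, 1, 2, 1, 0, 1]`, the answer at index 1 wins with five votes.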

Karpathy's project, however, uses different large models, making it more diverse.

Viewing multiple models' responses side-by-side to the same question is very interesting. Especially with the mutual evaluation and voting mechanism between multiple large models, it's like a whole new "cyber cricket fight".

Often, these models are willing to admit that another model's response is better than their own, making this process a very interesting model evaluation method.

For example, when Karpathy reads with the "LLM council," the models consistently rate GPT-5.1 as the best performer with the richest insights, consistently rank Claude last, and place the others somewhere in between. Karpathy doesn't fully agree with this ranking: subjectively, GPT-5.1 feels a bit verbose and sprawling to him, Gemini 3 is more concise and polished, and Claude is too terse.

Who doesn't love watching debates between large models?

Specifically, the entire project has three stages:

Stage 1: Initial Opinions

The user's question is sent individually to all models in the parliament, and their responses are collected. All responses are displayed in a "tab view" for users to check one by one.

Stage 2: Peer Review

Each LLM sees the other models' responses. The backend anonymizes model identities to avoid "self-bias" or preference for specific models. Each LLM is asked to rank the other responses based on accuracy and insightfulness.

Stage 3: Final Response

The designated "council chairman" LLM receives all responses and rankings, synthesizes this information into a final output, and presents it to the user.
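Stages 2 and 3 hinge on two pieces of plumbing: hiding model identities before peer review, and packing everything into the chairman's context. A minimal sketch of both, under the assumption that responses and rankings are plain dictionaries (the helper names are assumptions, not the repo's actual API):

```python
# Sketch of stage 2/3 plumbing: anonymize responses for peer review,
# then assemble the chairman's context. Helper names are illustrative.

def anonymize(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Replace model names with neutral labels ("Response A", ...) so
    reviewers can't favor a specific provider. Returns (labeled, key),
    where key maps each label back to the real model name."""
    labeled, key = {}, {}
    for i, (model, text) in enumerate(sorted(responses.items())):
        label = f"Response {chr(ord('A') + i)}"
        labeled[label] = text
        key[label] = model
    return labeled, key

def build_chairman_prompt(question: str, labeled: dict[str, str],
                          rankings: dict[str, list[str]]) -> str:
    """Combine the question, anonymized answers, and each reviewer's
    ranking into one context for the chairman model to synthesize."""
    parts = [f"Question: {question}", "", "Candidate answers:"]
    for label, text in labeled.items():
        parts.append(f"{label}: {text}")
    parts.append("")
    parts.append("Peer rankings (best first):")
    for reviewer, order in rankings.items():
        parts.append(f"{reviewer}: {' > '.join(order)}")
    parts.append("")
    parts.append("Write the best possible final answer using the above.")
    return "\n".join(parts)
```

Keeping the label-to-model key on the backend lets the UI reveal which model wrote which answer after the rankings are in, without the reviewers ever seeing it.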

Some netizens think this format could ultimately become a benchmark test.

That said, the LLM council's data-flow design likely still has a vast unexplored design space; ways of building multi-model ensembles seem far from fully researched.

If you're interested in this project, Karpathy has open-sourced it.

• Project address: https://github.com/karpathy/llm-council

One reminder: Karpathy won't provide any support for this project. It is offered as-is, as a small tool to inspire others, and he doesn't plan to develop it further.

In our own earlier tests, we also vibe-coded a comparable project, somewhat similar to Karpathy's LLM council, using two different model deployments.

Maybe we can also open-source this small project for everyone to play with?

Reference links:

https://x.com/karpathy/status/1992381094667411768

https://github.com/karpathy/llm-council

Main Tag: LLM Council

Sub Tags: Andrej Karpathy, AI Tool, Multi-Model Ensemble, Large Language Models

