Google Enters the CUA Battleground, Launches Gemini 2.5 Computer Use: Allowing AI to Directly Operate the Browser

Synced Review Report

Editor: Panda

Google’s Computer Use model is here!

Early this morning, Google DeepMind announced the launch of Gemini 2.5 Computer Use, a computer-use model built on Gemini 2.5.

Given that Google recently released the Chrome DevTools MCP server, the arrival of Gemini 2.5 Computer Use is not entirely surprising. Simply put, like OpenAI’s Computer-Using Agent (CUA), this DeepMind model lets AI directly control the user’s browser. Building on the base model’s visual understanding and reasoning capabilities, it can click, scroll, and type within the browser on the user’s behalf.


Let’s look at two official demonstrations.

Prompt: From https://tinyurl.com/pet-care-signup , get all details for any pet with a California residency and add them as a guest in my spa CRM at https://pet-luxe-spa.web.app/. Then, set up a follow up visit appointment with the specialist Anima Lavar for October 10th anytime after 8am. The reason for the visit is the same as their requested treatment.

Prompt: My art club brainstormed tasks ahead of our fair. The board is chaotic and I need your help organizing the tasks into some categories I created. Go to sticky-note-jam.web.app and ensure notes are clearly in the right sections. Drag them there if not.

As the demos show, whether gathering information online and acting on it, or tidying up a messy board of notes, Gemini 2.5 Computer Use completed the tasks accurately, and quite quickly.

On the relevant benchmarks, Gemini 2.5 Computer Use also achieves state-of-the-art (SOTA) results:


Furthermore, it leads the compared models on speed:


Currently, developers can access these capabilities through the Gemini API in Google AI Studio and Vertex AI. Users can also try it in a Browserbase-hosted demo environment (limited to 5-minute workflows, with no mid-run user intervention): https://gemini.browserbase.com/

We ran several trials in the demo environment. Overall, Gemini 2.5 Computer Use was highly accurate on simple tasks, but slightly more complex ones often failed.

For example, it handled a simple task like “Find the John Wick page on Wikipedia” without issue.

However, when tasks grew slightly more complex, the model failed. “Find the John Wick page on Wikipedia, summarize its information, and provide a Chinese version” did not complete, nor did “Open the official Nobel Prize website and provide the schedule for this year’s Nobel announcements,” nor the following task:

Prompt: Browse jiqizhixin.com, find reports about Gemini from the last six months, organize them into a Markdown file, and summarize them.

In addition, DeepMind has also released the Gemini 2.5 Computer Use System Card: https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Computer-Use-Model-Card.pdf


How Gemini 2.5 Computer Use Works

The model’s core capability is exposed through the new computer_use tool in the Gemini API. Developers run it inside a loop.

The inputs should include:

  • The user request;
  • A screenshot of the current environment;
  • The history of recently executed actions.

The inputs can also exclude specific functions from the default set of supported UI actions and/or register custom functions.

Gemini 2.5 Computer Use Model Workflow

After analyzing these inputs, the model generates a response, typically a function call representing a UI action (such as clicking or typing). For certain operations (like making a purchase), the model will also request end-user confirmation. The client-side code then executes the action.

Once the action is complete, the system returns the latest screenshot and the current URL as the function response to the model, restarting the loop.

This iterative process continues until the task is complete, an error occurs, or it is terminated due to safety mechanisms or a user decision.
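The loop described above can be sketched as follows. This is a minimal simulation of the protocol, not the real Gemini SDK: the model call, action names, and executor below are illustrative stand-ins (a real client would call the Gemini API with the computer_use tool and drive an actual browser, e.g. via an automation library):

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A proposed UI action; names here are illustrative, not the real API."""
    name: str                          # e.g. "click_at", "type_text", "done"
    args: dict = field(default_factory=dict)
    needs_confirmation: bool = False   # set for sensitive steps like purchases

def fake_model(user_request, screenshot, history):
    """Stand-in for the model call: given the request, the current
    screenshot, and recent actions, propose the next UI action."""
    if len(history) == 0:
        return Action("click_at", {"x": 100, "y": 200})
    if len(history) == 1:
        return Action("type_text", {"text": "John Wick"})
    return Action("done")  # model signals the task is finished

def execute(action):
    """Stand-in for the client-side executor: perform the action in the
    browser, then capture the new screenshot and current URL."""
    return f"<screenshot after {action.name}>", "https://example.com"

def run_agent_loop(user_request, max_steps=10, confirm=lambda a: True):
    history = []
    screenshot = "<initial screenshot>"
    url = "about:blank"
    for _ in range(max_steps):
        action = fake_model(user_request, screenshot, history)
        if action.name == "done":
            break
        if action.needs_confirmation and not confirm(action):
            break  # user declined a high-risk action
        # Execute, then feed the fresh screenshot back as the function
        # response, restarting the loop.
        screenshot, url = execute(action)
        history.append(action)
    return history, url
```

The `max_steps` cap and the `confirm` callback mirror the article’s termination conditions: the loop ends when the task completes, a step budget is exhausted, or a safety/confirmation check stops it.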

Google states that the current Gemini 2.5 Computer Use model is primarily optimized for web browsers but also shows strong potential for control over mobile UI. However, it is not yet optimized for desktop operating system-level control.

Safety Mechanism Design

Google also shared their safety mechanism design for the model in their blog post.

Google stated: “Building agents responsibly is the only way for AI to benefit everyone. AI agents capable of directly operating a computer introduce unique risks, including user misuse, unexpected model behavior, and prompt injection and scams in web environments. Therefore, we highly prioritize safety precautions in the design.”

In the Gemini 2.5 Computer Use model, Google integrated safety mechanisms directly into the training phase to address three main risk categories (detailed in the System Card).

Furthermore, Google provides developers with safety control options to prevent the model from automatically performing potentially high-risk or harmful operations, such as:

  • Damaging system integrity;
  • Jeopardizing safety;
  • Bypassing captchas;
  • Controlling medical devices.

The controls implemented by Google include:

  • Per-step Safety Service: During inference, an independent safety service evaluates every action proposed by the model.
  • System Instructions: Developers can set rules requiring the agent to decline or request user confirmation before executing specific high-risk actions.
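The two controls above can be pictured as a per-action gate. The sketch below uses a static rule table purely for illustration; Google’s actual per-step safety service is a separate evaluation system, and these category and action names are assumptions, not the real API:

```python
# Illustrative action categories (hypothetical names, not the real API).
BLOCKED = {"bypass_captcha", "control_medical_device"}   # never allowed
NEEDS_CONFIRMATION = {"purchase", "send_email"}          # ask the user first

def safety_decision(action_name: str) -> str:
    """Return one of "block", "require_confirmation", or "allow"
    for a proposed action, mimicking a per-step safety check."""
    if action_name in BLOCKED:
        return "block"
    if action_name in NEEDS_CONFIRMATION:
        return "require_confirmation"
    return "allow"
```

In a real client, a "block" decision would terminate the loop, and "require_confirmation" would pause it until the user approves the step.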

Conclusion

Google DeepMind's high-profile entry with Gemini 2.5 Computer Use not only demonstrates leading performance across multiple benchmarks but also pushes competition in the AI agent arena into a fiercer phase.

From OpenAI to Anthropic, and now Google, the tech giants are vying to define how we will interact with computers. Although current models still look nascent on complex real-world tasks, this is precisely what the moment before a technological dawn looks like. What we see today is not just a new model but a clear signal: the dominance of the keyboard and mouse is being challenged, and an era in which natural language directly drives the digital world is fast approaching.

References

https://blog.google/technology/google-deepmind/gemini-computer-use-model/

https://x.com/GoogleAIStudio/status/1975648565222691279

https://x.com/GoogleDeepMind/status/1975648789911224793
