Microsoft Open-Sources Browser Agent for Real-time Tracking and Control, Exceeding 4000 Stars

Microsoft has announced on its official website that it has open-sourced Magentic-UI, an agent designed for browser-based web tasks.

Magentic-UI builds on Microsoft's previously open-sourced Magentic-One and adds human-in-the-loop control to improve the agent's execution efficiency and accuracy.

According to GAIA benchmark data, when paired with a simulated user who provides auxiliary information, Magentic-UI's task completion rate rose from 30.3% in autonomous mode to 51.9%, a relative improvement of about 71%. Furthermore, Magentic-UI asked the simulated user for help in only 10% of tasks, averaging just 1.1 help requests in each task where it did so.
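The 71% figure is a relative gain over the autonomous baseline, not a difference in percentage points. A quick check, using only the two completion rates quoted above:

```python
# Completion rates quoted above from the GAIA evaluation (percent).
autonomous = 30.3
with_simulated_user = 51.9

# Absolute gain, in percentage points.
absolute_gain = with_simulated_user - autonomous

# Relative gain over the autonomous baseline, in percent.
relative_gain = absolute_gain / autonomous * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.0f}%")
```

In other words, the completion rate rose by 21.6 percentage points, which is roughly 71% of the 30.3% baseline.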

Open-source address: https://github.com/microsoft/magentic-ui

Magentic-UI: Human-Centered Design

One of Magentic-UI's biggest highlights is its human-centered approach. Unlike traditional agents that solely pursue full automation, Magentic-UI deeply integrates humans into all stages of task execution.

Traditional agents often aim for autonomous task completion, emphasizing machine independence and automation. As a result, users may not fully understand what the agent is doing or why, which makes it difficult to intervene and correct problems promptly.

Magentic-UI, on the other hand, adopts a human-computer collaboration model that gives weight to the role and value of humans in task execution. It completes tasks by working closely with users, who can control the agent's behavior in real time, making adjustments and providing guidance as needed.

In the planning phase, Magentic-UI engages in collaborative planning with users. Instead of directly following preset programs or algorithms to create a task plan, it first communicates with the user to understand their needs and expectations. It then generates a preliminary step-by-step plan and allows users to directly modify this plan through a plan editor or by providing text feedback.

Users can add, delete, or reorder steps in the plan, or even rewrite certain steps, based on their experience and understanding of the task, to ensure the plan better meets actual needs. This collaborative planning approach enables users to incorporate their professional knowledge and experience into the task plan, thereby improving the quality and efficiency of task completion.
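The editing operations described above (add, delete, reorder, rewrite) can be pictured as operations on an ordered list of steps. The following is a minimal sketch only; the `Plan` class and its method names are hypothetical and are not taken from Magentic-UI's code.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical editable plan: an ordered list of step descriptions."""
    steps: list[str] = field(default_factory=list)

    def add(self, index: int, step: str) -> None:
        # Insert a new step before the given position.
        self.steps.insert(index, step)

    def delete(self, index: int) -> None:
        # Remove the step at the given position.
        del self.steps[index]

    def move(self, src: int, dst: int) -> None:
        # Reorder: take the step at src and place it at dst.
        self.steps.insert(dst, self.steps.pop(src))

    def rewrite(self, index: int, step: str) -> None:
        # Replace the wording of an existing step.
        self.steps[index] = step


# A preliminary plan as the agent might propose it (made-up example task).
plan = Plan([
    "Search for flights",
    "Open the airline website",
    "Subscribe to the newsletter",
    "Book the cheapest flight",
])

# The user fixes the order, drops an unwanted step, and refines the rest.
plan.move(1, 0)
plan.delete(2)
plan.rewrite(2, "List the three cheapest flights")
plan.add(2, "Filter to nonstop flights")

for number, step in enumerate(plan.steps, start=1):
    print(number, step)
```

Keeping the plan as plain, ordered text is what makes it easy for a user to edit it either through an editor or by describing the change in words.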

During task execution, Magentic-UI also emphasizes collaboration with users. It tells the user in real time which actions it is about to take, such as which button to click, what content to input, or which webpage to visit. At the same time, it reports the webpage information it observes back to the user.

Users can pause the agent at any time and give it feedback in natural language to point out problems, suggest improvements, or make corrections. They can even take direct control of the browser to complete certain steps themselves, then hand control back to the agent. This collaborative execution lets users spot and resolve issues the agent encounters as they arise, preventing task failures or undesirable consequences caused by incorrect agent operations.

Magentic-UI also features an "action guard" mechanism, which asks for the user's permission before performing potentially irreversible actions. Such actions may include closing tabs, clicking buttons with side effects, or submitting forms.

Users decide, based on their own judgment, whether to allow the agent to perform these operations, which avoids the risks of the agent acting blindly. Magentic-UI also runs the browser and code executors in sandboxed, isolated environments, further improving operational security and limiting the potential threats posed by the agent's actions.
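The permission check before irreversible actions can be sketched as a small gate in front of the executor. This is an illustration under stated assumptions, not Magentic-UI's implementation: the action names, the `IRREVERSIBLE` set, and `run_action` are all hypothetical, and the approval callback stands in for the user's decision.

```python
from typing import Callable

# Hypothetical action kinds treated as potentially irreversible,
# based on the examples given in the text above.
IRREVERSIBLE = {"close_tab", "submit_form", "click_with_side_effects"}


def run_action(
    kind: str,
    execute: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run `execute`, asking `approve` first if the action may be irreversible."""
    if kind in IRREVERSIBLE and not approve(kind):
        return "skipped"
    return execute()


asked: list[str] = []  # records which actions required the user's permission


def deny_everything(kind: str) -> bool:
    # Stand-in for a user who refuses every permission request.
    asked.append(kind)
    return False


visit_result = run_action("visit_page", lambda: "done", deny_everything)
submit_result = run_action("submit_form", lambda: "done", deny_everything)

print(visit_result, submit_result, asked)
```

Harmless actions run without interrupting the user, while the guarded ones only run after explicit approval.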

Brief Introduction to the Magentic-UI Framework

When a user submits an automation task to Magentic-UI, the system first receives the user's input, which can be a simple text command or a more complex request with attached images. Magentic-UI's core component, the orchestrator, then uses its underlying large language model (LLM) to turn that input into a preliminary step-by-step plan. This plan details all the steps required to complete the task, including webpages to visit, operations to perform, and other tools that might need to be invoked.

After generating the preliminary plan, Magentic-UI does not immediately begin execution. Instead, it enters a crucial collaborative planning phase. During this stage, users can directly modify the plan generated by Magentic-UI through an intuitive plan editing interface. Users can add, delete, or reorder steps in the plan, or even completely rewrite certain steps.

Magentic-UI responds to the user's suggested changes in real time and adjusts the plan accordingly. This process lets users build their expertise and expectations into the task plan, improving the accuracy and efficiency of task completion.

Once the user has confirmed or modified the plan, it moves to the execution stage. Magentic-UI's execution process is highly transparent and collaborative. The system tells the user in real time which actions it is about to take, such as clicking a button, entering search terms, or visiting a specific webpage.

At the same time, Magentic-UI reports the webpage information it observes back to the user in real time. Users can pause Magentic-UI at any time and give feedback in natural language, pointing out problems or suggesting improvements. If users believe a step requires manual intervention, they can even take direct control of the browser, complete that step, and then return control to Magentic-UI.

Another important feature of Magentic-UI is its ability to learn plans. After completing a task, it can learn from user feedback and from the execution process, saving the step-by-step plan to a plan library.

When a user later submits a similar task, Magentic-UI can quickly retrieve and reuse the corresponding plan, significantly improving task execution efficiency. Users can also view and modify saved plans at any time, adjusting and optimizing them to better handle different task scenarios.
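The save-and-retrieve cycle can be sketched as a lookup keyed by task description. The article does not say how Magentic-UI decides that two tasks are similar, so the string-similarity matching below (Python's standard `difflib`) is a stand-in assumption, and `save_plan`, `find_plan`, and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical plan library: task description -> saved list of steps.
plan_library: dict[str, list[str]] = {}


def save_plan(task: str, steps: list[str]) -> None:
    """Store a copy of the steps under the task description."""
    plan_library[task] = list(steps)


def find_plan(task: str, threshold: float = 0.6) -> Optional[list[str]]:
    """Return the saved plan whose task text is most similar, if similar enough."""
    best_task, best_score = None, 0.0
    for saved_task in plan_library:
        score = SequenceMatcher(None, task.lower(), saved_task.lower()).ratio()
        if score > best_score:
            best_task, best_score = saved_task, score
    if best_task is None or best_score < threshold:
        return None
    return plan_library[best_task]


save_plan(
    "Find the cheapest flight from Seattle to Boston",
    ["Open the airline website", "Search for flights", "Sort by price"],
)

# A near-identical task reuses the saved plan; an unrelated one does not.
reused = find_plan("Find the cheapest flight from Seattle to Denver")
unrelated = find_plan("Summarize today's weather report")
print(reused, unrelated)
```

Because the saved plan is returned as ordinary data, the user can still review and edit it before it is executed again.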

Magentic-UI currently has more than 4,000 stars on GitHub and is released under the MIT license, which permits commercial use.

The content for this article is sourced from Microsoft. Please contact us for removal if there is any infringement.
