Google Releases 76-Page AI Agent Whitepaper! Your "AI Avatar" Is Online

Recently, Google published a 76-page whitepaper on AI agents!

Agents perceive their environment and strategically take action using tools to achieve specific goals.

At their core, they combine reasoning, logical thinking, and the ability to pull in external information to complete tasks that base models struggle with and to make more complex decisions.

These agents possess autonomous capabilities; they can pursue goals, proactively plan subsequent actions, and act without explicit instructions.

Reference link: https://www.kaggle.com/whitepaper-agent-companion

The whitepaper delves into methods for evaluating agents and introduces the real-world applications of Google's agent products.

Anyone involved in generative AI development knows that progressing from an idea to a proof of concept is not difficult, but ensuring the final product is high quality and deploying it to production is much more challenging.

When deploying agents to production environments, quality and reliability are the biggest issues. Agent Operations (AgentOps) processes are an effective solution for optimizing the agent building process.

Agent Operations

Over the past two years, generative AI (GenAI) has undergone tremendous transformation, and enterprise customers are increasingly focused on how to truly apply solutions to actual business needs.

Agent Operations (AgentOps) is a branch of generative AI operations focused on making agents run efficiently and reliably.

AgentOps adds several key components, including the management of internal and external tools, the setup and orchestration of core agent prompts (like goals, profiles, action instructions), the implementation of memory functions, task decomposition, and more.
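To make that concrete, here is a minimal, hypothetical sketch of what such an agent definition might look like as a versioned artifact. Every field name below is an assumption chosen for illustration, not a schema from the whitepaper.

```python
from dataclasses import dataclass, field

# Hypothetical, minimal agent definition illustrating the pieces AgentOps manages
# alongside the model itself: tools, the core prompt (goal, profile, action
# instructions), memory, and task decomposition. All names are illustrative.
@dataclass
class AgentDefinition:
    name: str
    version: str                      # versioned like any other deployable artifact
    goal: str                         # core prompt: what the agent is trying to achieve
    profile: str                      # core prompt: persona / role description
    instructions: list[str]           # core prompt: action instructions
    tools: list[str] = field(default_factory=list)   # internal and external tools it may call
    memory_backend: str = "none"      # e.g. conversation buffer, vector store
    planner: str = "none"             # strategy used for task decomposition

support_agent = AgentDefinition(
    name="ticket-triage",
    version="0.3.1",
    goal="Resolve or escalate inbound support tickets.",
    profile="A concise, policy-aware support assistant.",
    instructions=["Check the knowledge base before answering.",
                  "Escalate anything involving refunds over $500."],
    tools=["kb_search", "create_ticket"],
    memory_backend="conversation_buffer",
    planner="decompose_then_execute",
)
```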

Development Operations (DevOps) is the cornerstone of the entire technical operations system.

Model application development inherits some of DevOps's concepts and methods, while Machine Learning Operations (MLOps) builds on DevOps and tailors it to the characteristics of models.

Operations are inseparable from version control, automated deployment through Continuous Integration / Continuous Delivery (CI/CD), testing, logging, security assurance, and metric measurement capabilities.

Each system is typically optimized against metrics: measure how it is performing, evaluate results and business metrics, then use automated processes to collect more comprehensive metrics so that performance improves step by step.

Whether it is called "A/B testing," "Machine Learning Operations," or "metric-driven development," the underlying principles are the same, and AgentOps follows them as well.

It should be noted that new technical practices do not completely replace old ones.

Best practices from DevOps and MLOps are still indispensable for the smooth operation of AgentOps; they are the foundation upon which AgentOps runs.

For example, when agents call tools, APIs are involved, and the APIs used in this process are the same as those used by non-agent software.

Agent Success Metrics

Most agents are designed around achieving specific goals, and goal completion rate is a key metric.

A large goal can often be broken down into several key tasks or involve some key user interaction points. These key tasks and interactions should be monitored and evaluated separately.

Each business metric, goal, or key interaction data is typically summarized in common ways, such as calculating the number of attempts, successes, success rate, etc.

In addition, metrics obtained from application telemetry systems, such as latency and error rate, are also very important for agents.

Monitoring these high-level metrics is an important way to understand the running status of agents.
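As a rough illustration (not taken from the whitepaper), these outcomes can be rolled up into exactly those aggregates. The event schema below is hypothetical.

```python
from statistics import mean

# Hypothetical telemetry events for one agent: each record is one attempt at a
# goal or key task, with its outcome and latency. Schema is illustrative only.
events = [
    {"task": "book_meeting",  "success": True,  "latency_s": 2.1, "error": False},
    {"task": "book_meeting",  "success": False, "latency_s": 4.8, "error": True},
    {"task": "summarize_doc", "success": True,  "latency_s": 1.3, "error": False},
]

def summarize(events):
    """Roll events up into attempts, success rate, mean latency, and error rate per task."""
    by_task = {}
    for e in events:
        t = by_task.setdefault(e["task"], {"attempts": 0, "successes": 0,
                                            "latencies": [], "errors": 0})
        t["attempts"] += 1
        t["successes"] += e["success"]
        t["latencies"].append(e["latency_s"])
        t["errors"] += e["error"]
    return {task: {"attempts": t["attempts"],
                   "success_rate": t["successes"] / t["attempts"],
                   "mean_latency_s": mean(t["latencies"]),
                   "error_rate": t["errors"] / t["attempts"]}
            for task, t in by_task.items()}

print(summarize(events))
```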

User feedback is also a metric that cannot be ignored.

During agent or task execution, a simple feedback form can help reveal where the agent performs well and where it needs improvement.

This feedback can come from regular users, or it can be from enterprise employees, quality inspectors, or experts in relevant fields.

Agent Evaluation

To turn an agent from the proof-of-concept stage into a product that can be truly put into production, a robust automated evaluation framework is essential.

Evaluating Agent Capabilities

Before evaluating specific agent application scenarios, one can first refer to some public benchmarks and technical reports.

There are public benchmarks for many basic capabilities, such as model performance, propensity for hallucination, tool calling, and planning abilities.

For example, benchmarks like Berkeley Function Calling Leaderboard (BFCL) and τ-bench can demonstrate the agent's tool calling capabilities.

The PlanBench benchmark focuses on evaluating planning and reasoning abilities in multiple domains.

Tool calling and planning are just part of an agent's capabilities. Agent behavior is influenced by the LLM and other components it uses.

The interaction methods between agents and users are also traceable in traditional dialogue design systems and workflow systems. The evaluation metrics and methods of these systems can be borrowed to measure agent performance.

Comprehensive agent benchmarks like AgentBench provide a thorough evaluation of agents in various scenarios, testing overall performance from input to output.

Currently, many companies and organizations have established dedicated public benchmarks for specific application scenarios, such as Adyen's data analysis leaderboard DABstep.

Most benchmark reports discuss common agent failure modes, which can provide ideas for establishing an evaluation framework suitable for the application scenario.

In addition to referencing public evaluations, it is also necessary to test agent behavior in various different scenarios.

One can simulate user interaction with the agent and observe its response. It is important to evaluate not only the final answer provided but also the process by which it arrived at the answer, which is the action trace.

Software engineers can link agent evaluation to automated code testing. In code testing, automated testing saves time and gives developers more confidence in software quality.

The same holds for automated agent evaluation.

Carefully preparing evaluation datasets is very important; they must accurately reflect the situations the agent will encounter in real-world applications. This point is even more crucial than dataset preparation in software testing.
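A minimal sketch of what such an automated evaluation could look like, assuming a pytest-style harness and a placeholder run_agent entry point; both the dataset and the checks are illustrative assumptions, not the whitepaper's framework.

```python
import pytest

EVAL_CASES = [
    # Each case should reflect a situation the agent will actually face in production.
    {"prompt": "What is your return policy for opened items?",
     "must_mention": ["30 days", "receipt"]},
    {"prompt": "Cancel my order #1234",
     "must_mention": ["cancelled"]},
]

def run_agent(prompt: str) -> str:
    """Placeholder for the real agent call (an API or SDK invocation in practice)."""
    raise NotImplementedError

@pytest.mark.parametrize("case", EVAL_CASES)
def test_final_response_contains_required_facts(case):
    # Automated checks against an evaluation dataset, run the same way unit tests are.
    response = run_agent(case["prompt"]).lower()
    for phrase in case["must_mention"]:
        assert phrase.lower() in response
```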

Evaluating Action Traces and Tool Usage

Before responding to a user, an agent typically performs a series of operations.

For example, it might compare the user input with conversation history to disambiguate a term; it might also look up policy documents, search knowledge bases, or call APIs to save tickets.

Each of these operations is a step along the path to the goal, and the recorded sequence of steps is referred to as the action trace.

Every time an agent executes a task, there is such an action trace.

For developers, comparing the agent's actual action trace with the expected action trace is very helpful in identifying issues.

By comparison, one can find errors or inefficiencies and improve the agent's performance.

However, not all metrics are applicable to every situation.

Some application scenarios require the agent to strictly follow the ideal action trace, while others allow for a certain degree of flexibility and deviation.

This evaluation method also has obvious limitations, namely, the need for a reference action trace as a basis for comparison.
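One simple way to make the comparison concrete, sketched here as an assumption rather than the whitepaper's prescription, is to score a trace both for exact order and for looser overlap with the reference trace.

```python
# Minimal sketch of comparing an actual action trace to a reference trace.
# Exact match suits scenarios that must follow the ideal trace strictly;
# step recall/precision tolerate deviations. Traces here are just tool-call names.
def exact_match(actual: list[str], expected: list[str]) -> bool:
    return actual == expected

def step_recall(actual: list[str], expected: list[str]) -> float:
    """Fraction of expected steps the agent actually performed."""
    return sum(step in actual for step in expected) / len(expected)

def step_precision(actual: list[str], expected: list[str]) -> float:
    """Fraction of performed steps that were expected (penalizes detours)."""
    return sum(step in expected for step in actual) / len(actual)

expected = ["disambiguate_term", "search_policy_docs", "call_ticket_api"]
actual   = ["search_policy_docs", "search_knowledge_base", "call_ticket_api"]

print(exact_match(actual, expected))    # False
print(step_recall(actual, expected))    # 2/3
print(step_precision(actual, expected)) # 2/3
```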

Evaluating Final Responses

The core of final response evaluation is: did the agent achieve the set goal?

Custom success criteria can be set according to one's own needs to measure this.

For example, evaluating whether a retail chatbot can accurately answer product-related questions; or judging whether a research agent can effectively summarize research results in an appropriate tone and style.

To automate the evaluation process, automated evaluators can be used. An automated evaluator is essentially an LLM that plays the role of a judge.

Given an input prompt and the agent's generated response, the automated evaluator evaluates the response based on a set of criteria predefined by the user, thereby simulating the human evaluation process.

However, it is important to note that since this evaluation may not have an absolute factual basis as a reference, precisely defining the evaluation criteria is crucial.
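A minimal sketch of such an autorater, with call_llm standing in for whichever model API is used and an illustrative rubric:

```python
# Minimal sketch of an LLM-as-judge ("autorater"). `call_llm` is a placeholder
# for a real model call; the criteria below are illustrative assumptions.
JUDGE_PROMPT = """You are grading an AI agent's answer.

User request:
{prompt}

Agent response:
{response}

Criteria: factually grounded, on-topic, appropriate tone, no missing steps.
Reply with a single line: PASS or FAIL, then a one-sentence justification."""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (API/SDK of your choice)."""
    raise NotImplementedError

def judge(prompt: str, response: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(prompt=prompt, response=response))
    return verdict.strip().upper().startswith("PASS")
```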

Human-in-the-Loop Evaluation

Human-in-the-loop evaluation is of great value in tasks that require subjective judgment and creative problem-solving.

At the same time, it can also be used to calibrate and verify automated evaluation methods to see if they are truly effective and meet expectations.

Human-in-the-loop evaluation mainly has the following advantages:

Subjectivity: Humans can evaluate qualities that are difficult to quantify, such as creativity, common sense, and nuances, which are challenging for machines to grasp.

Contextual Understanding: Human evaluators can consider the context and impact of the agent's actions from a broader perspective, making more comprehensive judgments.

Iterative Improvement: Feedback from humans can provide valuable insights for optimizing the agent's behavior and learning process, helping the agent to continuously improve.

Evaluating the Evaluators: Human feedback can also serve as a reference for calibrating and optimizing automated evaluators, making the automated evaluators' assessments more accurate.
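As a small illustration of that last point, human labels and autorater verdicts on the same examples can be compared directly; the data here is made up.

```python
# Minimal sketch of "evaluating the evaluator": compare autorater verdicts to
# human labels on the same responses. True = passed, False = failed.
human_labels     = [True, True, False, True, False, False]
autorater_labels = [True, False, False, True, False, True]

agreement = sum(h == a for h, a in zip(human_labels, autorater_labels)) / len(human_labels)
false_passes = sum((not h) and a for h, a in zip(human_labels, autorater_labels))

print(f"agreement rate: {agreement:.2f}")                      # 0.67 on this toy data
print(f"autorater passed {false_passes} response(s) humans rejected")
```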

The evaluation of multimodal generation (such as images, audio, video) is even more complex and requires specialized evaluation methods and metrics.

Multi-Agent Systems and Their Evaluation

Today, AI systems are shifting toward multi-agent architectures.

In this architecture, multiple agents with specialized capabilities collaborate to collectively achieve complex goals.

A multi-agent system is like a team of experts, each leveraging their expertise in their respective fields.

Each agent is an independent entity; they may use different LLMs, assume unique roles, and have different task contexts.

These agents communicate and collaborate with each other to achieve common goals.

This is significantly different from traditional single-agent systems, where all tasks are handled by a single LLM.

Understanding Multi-Agent Architecture

Multi-agent architecture decomposes a complex problem into different tasks, which are assigned to specialized agents for processing.

Each agent has a defined role, and they interact dynamically to optimize decision-making processes, improve knowledge retrieval efficiency, and ensure the smooth execution of tasks.

This architecture enables more structured reasoning, decentralized problem-solving models, and scalable task automation.

Multi-agent systems utilize modularity, collaboration, and hierarchical design principles to build a powerful AI ecosystem.

Agents can be divided into different types based on their functions, for example (a sketch of how these roles might compose follows the list):

Planning Agents: Responsible for breaking down high-level goals into structured subtasks and developing detailed plans for subsequent work.

Retrieval Agents: Optimize the knowledge acquisition process by dynamically retrieving relevant data from external sources, providing information support to other agents.

Execution Agents: Undertake specific computational work, generate response content, or interact with APIs to perform various actual operations.

Evaluation Agents: Monitor and verify the responses generated by other agents to ensure they meet task objectives and are logically consistent and accurate.

Through the collaborative work of these components, multi-agent architecture is no longer limited to simple prompt-based interaction methods, achieving adaptive, explainable, and efficient AI-driven workflows.
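A minimal sketch of how such roles might compose, with each agent reduced to a plain function; in practice each could wrap its own LLM, role prompt, and tools, and all names here are illustrative.

```python
# Planner / retriever / executor / evaluator agents composed into one workflow.
def planning_agent(goal: str) -> list[str]:
    """Break a high-level goal into structured subtasks."""
    return [f"research: {goal}", f"draft answer for: {goal}"]

def retrieval_agent(subtask: str) -> list[str]:
    """Fetch relevant snippets from external sources (stubbed here)."""
    return [f"snippet relevant to '{subtask}'"]

def execution_agent(subtask: str, context: list[str]) -> str:
    """Produce a response for the subtask using the retrieved context."""
    return f"result for '{subtask}' using {len(context)} snippet(s)"

def evaluation_agent(result: str) -> bool:
    """Verify the result meets the task objective (trivial check here)."""
    return result.startswith("result for")

def run(goal: str) -> list[str]:
    results = []
    for subtask in planning_agent(goal):
        context = retrieval_agent(subtask)
        result = execution_agent(subtask, context)
        if evaluation_agent(result):          # only keep verified results
            results.append(result)
    return results

print(run("summarize Q3 churn drivers"))
```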

Multi-Agent Evaluation

Multi-agent system evaluation is developed on the basis of single-agent system evaluation.

The success metrics for agents have not changed in essence; business metrics remain the core focus, including goal and key task completion status, as well as application telemetry metrics such as latency and error rate.

Tracking the operational process of multi-agent systems is helpful in identifying issues and debugging the system during complex interactions.

Evaluating action traces and evaluating final responses are two methods that are also applicable to multi-agent systems.

In a multi-agent system, a complete action trace may involve the participation of multiple or even all agents.

Even when multiple agents collaborate to complete a task, the final answer presented to the user is a single one, and this answer can be evaluated separately.

Since the task flow of multi-agent systems is usually more complex and involves more steps, it is possible to delve into each step for detailed evaluation. Action trace evaluation is a feasible and scalable evaluation method.

Agentic Retrieval Augmented Generation

In Agentic Retrieval Augmented Generation (Agentic RAG), agents retrieve necessary information through multiple searches.

In the healthcare sector, agentic RAG can help doctors navigate complex medical databases, research papers, and patient records, providing them with comprehensive and accurate information.
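A minimal sketch of the iterative retrieval loop that distinguishes agentic RAG from single-pass RAG; search, needs_more_info, and answer are illustrative placeholders.

```python
# Agentic RAG sketch: instead of one retrieval pass, the agent issues
# follow-up searches until it judges the gathered context sufficient.
def search(query: str) -> list[str]:
    """Placeholder for a retrieval call (vector store, search API, etc.)."""
    return [f"document about {query}"]

def needs_more_info(question: str, context: list[str]) -> str | None:
    """Decide whether the context suffices; return a follow-up query if not."""
    return None if len(context) >= 3 else f"more detail on {question}"

def answer(question: str, context: list[str]) -> str:
    """Placeholder for the final grounded generation step."""
    return f"answer to '{question}' grounded in {len(context)} documents"

def agentic_rag(question: str, max_rounds: int = 4) -> str:
    context = search(question)
    for _ in range(max_rounds):
        follow_up = needs_more_info(question, context)
        if follow_up is None:
            break
        context += search(follow_up)          # iterative, multi-search retrieval
    return answer(question, context)

print(agentic_rag("contraindications for drug X in elderly patients"))
```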

Vertex AI Search is a fully managed service that provides Google-quality search and Retrieval Augmented Generation (RAG). It covers stages such as data ingestion, processing, embedding, indexing/ranking, generation, validation, and serving.

Vertex AI Search includes components such as a layout parser and a vector ranking API, and it also provides a RAG engine that can be orchestrated via the Python SDK and supports numerous other components.

For developers who wish to build their own search engine, each of the above components is available as a separate API, and the RAG engine can easily orchestrate the entire process using Python interfaces similar to LlamaIndex.

Agents in the Enterprise

Enterprises develop and use agents to assist employees in specific tasks or run automatically in the background.

Business analysts can easily uncover industry trends and create highly persuasive data-driven presentations with the help of AI-generated insights; HR teams can use agents to optimize employee onboarding processes.

Software engineers can rely on agents to proactively detect and fix vulnerabilities, iterate development more efficiently, and accelerate deployment processes.

Marketers can leverage agents to deeply analyze marketing effectiveness, optimize content recommendations, and flexibly adjust marketing campaigns to improve performance.

Currently, two types of agents are emerging:

Assistant Agents: These agents interact with users, receive tasks and execute them, and then provide results back to the user.

Assistant agents can be general-purpose or specialized for specific domains or tasks.

For example, agents that help schedule meetings, analyze data, write code, draft marketing copy, assist salespeople in capturing sales opportunities, and even agents that conduct in-depth research on specific topics based on user requests.

Their response methods differ; some can quickly return information or complete tasks synchronously, while others require longer running times (e.g., deep research agents).

Automation Agents: These agents run in the background, listening for events, monitoring changes in systems or data, and then making rational decisions and taking action.

These actions include operating backend systems, performing test validation, solving problems, notifying relevant employees, etc.
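A minimal sketch of that background loop, with the event source, decision step, and notification all stubbed out as illustrative placeholders.

```python
import time

def poll_events() -> list[dict]:
    """Placeholder for reading from a queue, webhook buffer, or change feed."""
    return [{"type": "deployment_failed", "service": "checkout"}]

def decide(event: dict) -> str | None:
    """Return an action for events that warrant one; ignore the rest."""
    if event["type"] == "deployment_failed":
        return f"roll back {event['service']} and notify the on-call engineer"
    return None

def notify_owner(action: str) -> None:
    print(f"[automation-agent] {action}")

def run_in_background(poll_interval_s: float = 30.0, max_iterations: int = 1) -> None:
    # Bounded here so the sketch terminates; a real agent would loop indefinitely.
    for _ in range(max_iterations):
        for event in poll_events():
            action = decide(event)
            if action:
                notify_owner(action)
        time.sleep(poll_interval_s)

run_in_background(poll_interval_s=0.0)
```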

Today, knowledge workers are no longer simply calling agents to perform tasks and waiting for results; they are gradually becoming agent managers.

For ease of management, new user interfaces will emerge in the future to facilitate the orchestration, monitoring, and management of multi-agent systems. These agents can both execute tasks and call or even create other agents.

NotebookLM Enterprise

NotebookLM is a research and learning tool designed to simplify the understanding and integration of complex information.

Users can upload various source materials, such as documents, notes, and other relevant files. NotebookLM, with the help of AI technology, helps users understand this content more deeply.

Imagine researching a complex topic: NotebookLM can pull scattered information together into an organized workspace.

Essentially, NotebookLM is like a dedicated research assistant, accelerating the research process and helping users move from simple information collection to deep understanding.

NotebookLM Enterprise brings these capabilities into the enterprise environment, simplifying how employees interact with data and helping them extract valuable insights.

For example, the AI-generated audio summary feature allows users to improve understanding efficiency and promote knowledge absorption by "listening" to research content.

NotebookLM Enterprise incorporates enterprise-level security and privacy features to strictly protect sensitive company data and comply with relevant policies.

Agentspace Enterprise

Google Agentspace provides a suite of AI-driven tools designed to enhance enterprise productivity by making information accessible to employees and automating complex agent workflows.

Agentspace effectively addresses the inherent limitations of traditional knowledge management systems by integrating fragmented content sources, generating grounded and personalized responses, simplifying business processes, and helping employees efficiently access information.

The architecture of Agentspace Enterprise is built upon several core principles.

Security is always a top priority for Google Agentspace.

Employees can use it to get answers to complex questions and also access various information sources uniformly, whether unstructured data like documents and emails or structured data like tables.

Enterprises can configure a series of agents according to their needs for in-depth research, creative generation and optimization, data analysis, and other tasks.

Agentspace Enterprise also supports the creation of customized AI agents to meet specific business needs.

The platform can develop and deploy context-aware agents to help employees in various departments like marketing, finance, legal, and engineering conduct research efficiently, generate content quickly, and automate repetitive tasks (including multi-step workflows).

Custom agents can connect to internal and external systems and data, align with company business domains and policy requirements, and can even train models based on proprietary business data.

Multi-Agent Architecture Real-World Applications

To illustrate the application of the multi-agent concept in practice, let's look at a comprehensive multi-agent system designed specifically for automobiles.

In this system, multiple specialized agents work collaboratively to provide users with a convenient and smooth in-car experience.

Conversational Navigation Agent: Specifically designed to help users find locations, recommend places, and navigate using APIs like Google Places and Maps.

Conversational Media Search Agent: Focused on helping users find and play music, audiobooks, and podcasts.

Messaging Agent: Helps users draft, summarize, and send messages or emails while driving.

Car Manual Agent: Specifically answers questions related to the car using a Retrieval Augmented Generation (RAG) system.

General Knowledge Agent: Answers factual questions about the world, history, science, culture, and other general topics.

Multi-agent systems decompose complex tasks into multiple specialized subtasks.

In this architecture, each agent specializes in a particular domain. This specialization makes the entire system more efficient.

The navigation agent focuses on location and route planning; the media search agent is proficient in finding music and podcast resources; the car manual agent is skilled at solving vehicle-related issues.

The system allocates resources based on task difficulty, using low-configuration resources for simple tasks and calling high-performance resources for complex tasks.

Key functions (like adjusting temperature, opening windows, etc.) are quickly responded to by edge agents, while less urgent tasks like restaurant recommendations are handled by cloud agents.

This design also inherently provides fault tolerance. When the network connection is interrupted, edge agents can still ensure basic functions operate normally, such as temperature control and basic media playback remaining unaffected, although restaurant recommendations may be temporarily unavailable.
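A minimal sketch of that routing and fallback behavior, with the intent classifier and agent registry reduced to placeholders; all names are illustrative, not the system described in the whitepaper.

```python
# Router sketch for the in-car system: classify the request, dispatch to a
# specialized agent, and prefer the on-device (edge) agent when offline.
EDGE_AGENTS = {"climate", "media_basic"}          # must keep working offline
CLOUD_AGENTS = {"navigation", "media_search", "messaging", "car_manual", "general"}

def classify(utterance: str) -> str:
    """Very rough intent routing; a real system would use a model."""
    text = utterance.lower()
    if "temperature" in text or "window" in text:
        return "climate"
    if "restaurant" in text or "navigate" in text:
        return "navigation"
    if "play" in text:
        return "media_search"
    return "general"

def dispatch(utterance: str, online: bool) -> str:
    agent = classify(utterance)
    if agent in EDGE_AGENTS:
        return f"[edge:{agent}] handled '{utterance}'"
    if online and agent in CLOUD_AGENTS:
        return f"[cloud:{agent}] handled '{utterance}'"
    return "Sorry, that feature needs a network connection right now."

print(dispatch("set the temperature to 21 degrees", online=False))  # still works offline
print(dispatch("find a good restaurant nearby", online=False))      # degrades gracefully
```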

References:

https://x.com/aaditsh/status/1919383594533072974

https://www.kaggle.com/whitepaper-agent-companion
