Understanding RAG, Agent, and Multimodality: Industry Practices and Future Trends


👉Contents

1 RAG: The Tentacles of Large Models

2 Agent: The Integrator of Large Models

3 Applications of Multimodal Technology

4 Future Development Trends of Large Models

Large models serve as the core engine of industrial transformation. Through RAG, Agent, and multimodal technologies, they are redefining the boundaries of AI interaction with reality. The synergistic evolution of these three not only overcomes core challenges such as data timeliness and domain-specific adaptation but also propels industries from efficiency revolution towards business restructuring. This article will analyze the evolution of these technologies, practical experiences, and future prospects, providing readers with a global perspective on cutting-edge trends and practical guidance for industrial upgrading.


Large model technology is accelerating its penetration into core industrial scenarios, becoming the intelligent engine driving digital transformation. The Global Machine Learning Summit (ML-Summit) focuses on innovative breakthroughs and industrial practices of large model technology, deeply exploring its frontier directions and implementation paths. As a core driving force for AI development, Retrieval-Augmented Generation (RAG) breaks through the static knowledge boundaries of large models through dynamic knowledge fusion technology; Agent redefines the human-machine collaboration paradigm with autonomous decision-making and multi-task coordination capabilities; multimodal large models unlock the potential for complex scene deployment by relying on cross-modal semantic understanding technology. The synergistic evolution of these three not only overcomes key challenges such as data timeliness, privacy security, and domain-specific adaptation but also fosters industry-level transformations from efficiency revolution to business restructuring in fields such as medical diagnosis, financial risk control, and intelligent manufacturing.

[Figure: ML-Summit Conference Large Model Content Distribution]

RAG: The dynamic knowledge engine of large models, solving issues of static knowledge boundaries, timeliness, and trustworthiness.

Agent: The intelligent execution hub of large models, empowering models with autonomous planning, decision-making, and tool invocation capabilities.

Multimodality: The perception upgrade foundation for large models, breaking through single-modal understanding limitations to achieve holistic cognition of the real world.

Knowledge Enhancement (RAG) → Behavioral Intelligence (Agent) → Perception Upgrade (Multimodality) → Complete Intelligent Agent

01

RAG: The Tentacles of Large Models

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval with generative models. Its core idea: before generating an answer, first retrieve relevant evidence from an external knowledge base (documents, databases, the web), then generate a more accurate and reliable response grounded in both the retrieved results and the user's input. The figure below shows a simplified RAG diagram.
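To make the retrieve-then-generate flow concrete, here is a minimal sketch. The `embed` function is a random stand-in for a real embedding model, and the final prompt would be handed to whatever LLM the system uses; nothing here reflects a specific RAG framework.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: swap in a real model such as BGE."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# Toy in-memory knowledge base of (document, vector) pairs.
docs = [
    "RAG retrieves evidence from an external knowledge base before generating.",
    "Vector stores index document embeddings for similarity search.",
]
index = [(d, embed(d)) for d in docs]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: -float(q @ pair[1]))
    return [d for d, _ in ranked[:k]]

def rag_prompt(query: str) -> str:
    """Assemble the retrieve-then-generate prompt; pass it to any LLM."""
    evidence = "\n".join(retrieve(query))
    return f"Answer using only this evidence:\n{evidence}\n\nQuestion: {query}"

print(rag_prompt("What does RAG do before generating?"))
```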

[Figure: simplified RAG diagram (source: internet)]

In this analogy, the LLM is the brain that generates answers, and retrieval is the set of tentacles that gathers evidence: RAG is a large model system with tentacles reaching into an external knowledge base.

1.1 Why is RAG Needed?

Large models perform well in many areas, but they still have limitations, and these limitations make RAG an important complement to them.

Model capability: once a large model is trained, its knowledge is frozen. For example, if we ask ChatGPT about the Dongfang Zhenxuan essay incident, it says it doesn't know, because GPT-4's training data was cut off in October 2023. RAG effectively mitigates such issues by connecting the model to up-to-date knowledge bases.

[Figure: ChatGPT Timeliness]

Data privacy: large models struggle to cover private and proprietary data. Deploying a local RAG system mitigates this as well.

Interpretability: retrieval results give answers a factual basis, reducing speculative output; generated answers can also cite their source documents, enhancing trustworthiness.

Cost optimization: feeding full documents into long-context models is costly. RAG retrieves only the key fragments, compressing the input length and making long-text processing more efficient.

[Figure: LLM vs. RAG Differences]

RAG not only addresses the limitations of large models but also brings higher generation quality and cost optimization. RAG can provide customized professional answers according to the needs of different domains.

1.2 RAG Challenges

Although RAG offers many advantages, it faces several challenges in practical applications, especially during construction. Building a RAG system involves four main steps: converting documents to data, chunking the data, vectorizing the chunks, and storing the vectors.
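A hedged sketch of those four steps follows; the parser and embedding function are hypothetical stand-ins, and a real pipeline would plug in an OCR/PDF parser, an embedding model such as BGE, and a vector database.

```python
import numpy as np

def parse_document(path: str) -> str:
    """Step 1 (hypothetical helper): convert a source file to plain text.
    Real pipelines use OCR / PDF parsers here."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Step 2: fixed-size chunks with overlap; `size` controls the
    granularity-vs-context trade-off discussed in 1.2.1."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed_chunks(chunks: list[str]) -> np.ndarray:
    """Step 3 (placeholder): swap in a real embedding model such as BGE."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(chunks), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def build_store(path: str) -> list[tuple[str, np.ndarray]]:
    """Step 4: persist (chunk, vector) pairs; a real system would write
    to a vector database rather than an in-memory list."""
    chunks = chunk_text(parse_document(path))
    return list(zip(chunks, embed_chunks(chunks)))
```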

1.2.1 Difficulties in Text Vectorization

Documents are primarily text but also contain images, tables, and formulas. With documents running to millions of characters, chunking involves a trade-off between text granularity and context completeness: choosing an appropriate chunk size is what balances retrieval precision against recall.

[Figure: Challenges in RAG Construction]

1.2.2 Difficulties with Multimodal Documents

Processing structured multimodal content such as images and charts in multimodal documents is more complex. How to fuse data from different modalities (text, images, videos) to improve understanding accuracy is a challenge.

[Figure: Complex Multimodal Document Structure (source: internet)]

Currently, the processing pipeline for complex document structures involves several stages: document parsing (OCR text and coordinates, image recognition and coordinates, tool-based parsers, etc.), document structuring (indexing the data in reading order), and document understanding (organizing the data into serializable structures). Overall, the document parsing pipeline is long, involves many steps, and its intermediate results are difficult to verify.

[Figure: Conventional Parsing Pipeline for Complex Documents (source: internet)]

1.2.3 Difficulties in Controllable Retrieval

Retrieval errors are a common problem in RAG applications, stemming from noisy data, poor chunking (broken context), or weaknesses in the embedding stage (e.g., insufficient BGE model capability). Recall and precision pull in opposite directions, so the RAG system requires controllable processing.
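One common way to make retrieval controllable is to over-retrieve, rerank with a stronger scorer, and apply a confidence threshold so low-quality evidence is dropped instead of being passed to the generator. The sketch below is illustrative only: the hard-coded candidates and the keyword-overlap scorer stand in for a real first-stage retriever and a cross-encoder reranker such as a BGE reranker.

```python
# Toy first-stage results: (chunk, vector-similarity score) -- high recall, noisy.
candidates = [
    ("RAG retrieves evidence before generating.", 0.71),
    ("Unrelated marketing text about pricing.", 0.64),
    ("Chunking affects retrieval precision.", 0.69),
]

def rerank_score(query: str, chunk: str) -> float:
    """Placeholder for a cross-encoder reranker; crude keyword overlap
    keeps the sketch runnable."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)

def controlled_retrieve(query: str, threshold: float = 0.2) -> list[str]:
    """Over-retrieve, rerank, then keep only chunks above a confidence
    threshold -- trading recall for precision in a tunable way."""
    scored = [(c, rerank_score(query, c)) for c, _ in candidates]
    scored.sort(key=lambda t: -t[1])
    return [c for c, s in scored if s >= threshold]

print(controlled_retrieve("how does RAG retrieve evidence"))
```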

[Figure: An Approach to Controllable RAG Processing]

1.3 RAG Development

Because of technical bottlenecks in multimodal data processing and vectorized retrieval, the stability of RAG systems is often constrained. A unified processing paradigm for multimodal documents and a new generation of retrieval architectures have therefore become two key paths for pushing the boundaries of RAG capability.

1.3.1 Multimodal Document Processing

In Visual Question Answering (VQA) tasks, multimodal document parsing requires integrating text and layout understanding. For example, to answer "the difference in resolution parameters between two brands," the model must not only recognize the text in the image but also parse the layout logic and table structure relating the text fragments. To answer accurately, the model must preserve the original structural features of the text while processing it.

[Figure: Multimodal Model Extracting Text and Visual Question Answering]

Multimodal document processing not only maps data from different modalities (text, images, tables) into the same semantic space, improving data usability and retrieval efficiency, but also helps the model understand documents.

1.3.2 Memory-Driven RAG

Another development direction for RAG is memory-driven RAG. Compared with traditional vector-based RAG, memory-driven RAG uses the LLM's KV cache as a dynamic index, offering greater flexibility and adaptability. As shown in the figure, standard vector RAG and MemoRAG differ clearly in both principle and usage.

[Figure: Differences between Vector RAG and MemoRAG]

Application scenarios: for fast retrieval of static knowledge (e.g., standard customer-service Q&A), prioritize vector RAG, with embedding models such as BGE (BAAI General Embedding) or Jina Embeddings (optimized for long text). For dynamic interaction and lifelong learning (e.g., a personalized medical assistant), explore memory-driven RAG such as MemoRAG (from BAAI/Zhiyuan), which combines KV-cache compression with dynamic memory indexing.
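As a toy illustration of the usage difference (not of MemoRAG's internals), the sketch below contrasts the two paths: the vector path queries a fixed index per request, while the memory path keeps updating a running session memory that later answers draw on. All class and method names here are hypothetical.

```python
class VectorRAG:
    """Static knowledge: index built once, queried independently per request."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def answer(self, query: str) -> str:
        # Crude keyword-overlap lookup stands in for vector search.
        hit = max(self.docs, key=lambda d: len(set(query.split()) & set(d.split())))
        return f"(from index) {hit}"

class MemoryRAG:
    """Dynamic knowledge: each interaction updates a running memory,
    loosely analogous to MemoRAG's reusable dynamic memory."""
    def __init__(self):
        self.memory: list[str] = []

    def answer(self, query: str) -> str:
        context = " | ".join(self.memory[-3:])  # recent compressed memory
        self.memory.append(query)               # memory grows with use
        return f"(with memory: {context}) answering: {query}"

v = VectorRAG(["RAG combines retrieval and generation."])
m = MemoryRAG()
print(v.answer("what is RAG"))
print(m.answer("my allergy is penicillin"))
print(m.answer("what medication should I avoid?"))  # sees earlier turn
```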


02

Agent: The Integrator of Large Models

Agent technology is an important integrator of large models, capable of autonomously executing tasks, making decisions, and interacting with environments. As the SpongeBob illustration below suggests, a large model gradually evolves into a powerful intelligent agent.

[Figure: a large model evolving into an agent (source: internet)]

2.1 Agent Overview

An AI agent is a computer program, built with AI techniques, that can independently perform certain tasks and react to its environment. It can be viewed as an intelligent entity that perceives its environment, makes its own decisions, and acts to change that environment. The figure below shows a simplified Agent system diagram.

[Figure: Agent System Diagram]

An Agent combines an LLM with planning, feedback, and tools to form a complete intelligent system. It comprises perception, decision, and execution layers, and ultimately exhibits autonomy, reactivity, proactiveness, and sociality.
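A minimal sketch of that perceive-decide-act loop, with a rule-based stand-in for the LLM planner and two stub tools (all names are illustrative, not any particular framework's API):

```python
import datetime

def tool_clock(_: str) -> str:
    """Stub tool: report the current time."""
    return datetime.datetime.now().isoformat()

def tool_search(query: str) -> str:
    """Stub tool: pretend to search the web."""
    return f"(stub) search results for: {query}"

TOOLS = {"clock": tool_clock, "search": tool_search}

def llm_decide(observation: str) -> tuple[str, str]:
    """Placeholder for the LLM planner: map an observation to
    (tool_name, tool_input). A real agent would prompt an LLM here."""
    if "time" in observation:
        return "clock", ""
    return "search", observation

def agent_loop(goal: str, max_steps: int = 3) -> str:
    observation = goal
    for _ in range(max_steps):          # perceive -> decide -> act
        tool, arg = llm_decide(observation)
        observation = TOOLS[tool](arg)  # tool feedback becomes next observation
    return observation

print(agent_loop("what time is it"))
```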


2.2 Agent Practices

Many Agent open-source projects exist, and practical experience with these projects can deepen understanding of Agents. Agent practices are divided into two types: autonomous agents and generative agents.

2.2.1 Autonomous Intelligence vs. Generative Intelligence

Autonomous agents: intelligent systems that autonomously execute tasks, make decisions, and interact with the environment. Generative agents: intelligent systems that use generative models to create new data or content. As shown in the figure, Auto-GPT (an autonomous agent) queries and answers itself, while the Stanford Town virtual world is an example of generative agents.

[Figure: Auto-GPT self-dialogue and the Stanford Town virtual world]

Distinctions between Autonomous Agents and Generative Agents:

[Figure: comparison of autonomous agents and generative agents]

2.2.2 Agent Core Frameworks

Mature Agent frameworks can reduce development costs, with MetaGPT and AutoGen being the two most popular frameworks currently. MetaGPT simulates a collaborative software company structure by assigning different roles to GPT models to handle complex tasks; AutoGen, as an open-source framework, focuses on developing large language model applications through multi-agent conversations and enhanced LLM reasoning.

[Figure: MetaGPT vs. AutoGen Comparison]

MetaGPT and AutoGen each have their characteristics: MetaGPT is the "digital CTO" of a software company; AutoGen is the "Lego factory" for customized AI. MetaGPT is more suitable for software development tasks requiring full automation and collaboration, while AutoGen is better for LLM application development that needs flexible customization and conversation.
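As a concrete taste of AutoGen's conversation-driven style, here is a minimal two-agent sketch following its documented assistant/user-proxy pattern. It assumes the 0.2-era `pyautogen` package (newer releases have reorganized the API), and the model name and API key are placeholders.

```python
# Requires: pip install pyautogen  (0.2-era interface)
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# The assistant plans and writes code; the user proxy executes it and
# feeds the results back, driving a multi-turn conversation.
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",                      # fully automated loop
    code_execution_config={"work_dir": "coding"},  # run generated code here
)

# A single call kicks off the whole agent conversation.
user_proxy.initiate_chat(assistant, message="Write and test a FizzBuzz function.")
```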

2.2.3 Multi-Agent Systems

Real-world tasks are often too complex for a single Agent to handle and require multiple Agents to collaborate. As the comic below illustrates, going from a requirement to a delivered product involves planning, requirements analysis, framework design, system solutions, coding implementation, and functional testing before final delivery. Such complex work demands multi-person collaboration, and Multi-Agent systems offer significant advantages on tasks of this kind.

[Figure: comic of the path from requirement to delivered product]

Single-agent and multi-agent systems differ significantly in both task types and core technologies.

[Figure: Comparison of Single-Agent and Multi-Agent]

1. Task Decomposition Capability: Through distributed sub-task division and collaboration, Multi-Agent systems can decompose tasks, improving task processing efficiency (see the sketch after this list).

2. Performance Breakthrough: Through parallel architecture and redundant fault-tolerant design, Multi-Agent systems can significantly improve computational efficiency and system robustness.

3. Dynamic Environment Adaptation: Through real-time interaction networks, Multi-Agent systems can quickly adapt to dynamic environments, better coping with complex changing environments.
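A minimal sketch of points 1 and 2: a toy planner decomposes a compound task and dispatches the subtasks to worker agents in parallel. `worker_agent` is a placeholder for a real LLM-backed agent, and the semicolon-splitting planner is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(task: str) -> list[str]:
    """Toy planner: split a compound task into independent subtasks."""
    return [t.strip() for t in task.split(";") if t.strip()]

def worker_agent(subtask: str) -> str:
    """Placeholder worker: a real system would dispatch to an LLM agent."""
    return f"done: {subtask}"

def multi_agent_run(task: str) -> list[str]:
    subtasks = decompose(task)                 # 1. task decomposition
    with ThreadPoolExecutor() as pool:         # 2. parallel execution
        results = list(pool.map(worker_agent, subtasks))
    return results                             # aggregate the sub-results

print(multi_agent_run("collect data; analyze data; write report"))
```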

2.3 Agent Applications

Although Agent technology has demonstrated its powerful application value in multiple fields, we also face some challenges.

2.3.1 Application Challenges

The figure below shows challenges along several dimensions: technical capability, system design, security, and economic viability.

[Figure: Agent application challenges]

Solutions to the above problems:

1. Complex task planning: solve complex tasks step by step through hierarchical decomposition.

2. Dynamic environment adaptation: Meta-Learning + World Models can improve Agent's adaptability in dynamic environments.

3. Multi-agent collaboration: Through game theory and federated learning, multi-agent systems achieve efficient collaboration.

4. Improved interpretability: Causal inference models + decision tree distillation can enhance Agent's interpretability, making Agent's decision-making process more transparent.

5. Value alignment: Reinforcement Learning from Human Feedback (RLHF) can address Agent's value alignment issues.

2.3.2 Industry Applications

Agent technology has demonstrated its powerful application value in multiple fields.

[Figure: Agent Industry Application Effects]

The practical application of Agents consistently faces the complexities of the real world. To handle tasks such as visual defect detection in industrial quality inspection or chart parsing in financial reports, the single-modal limitation must be overcome—this is precisely the technical mission of multimodal large models.

03

Applications of Multimodal Technology

Multimodal large models have a wide range of applications, covering multiple industries and domains. This article shares the work of three teams: Zidong Taichu's multimodal pre-training, 360 Team's open-world object detection, and Tencent Team's WeChat Channels multimodal review.

3.1 Zidong Taichu - Multimodal Task Unification

Unifying traditional CV tasks such as object detection, segmentation, and OCR into image-text large models is one of the core technologies in the Zidong Taichu project. By using the LLM's autoregressive prediction as a unified encoding, the approach achieves a unified representation while explicitly enhancing the local perception capability of image-text large models.

Task design: to strengthen the visual local understanding of multimodal large models, traditional CV tasks are unified into the MLLM's autoregressive task. The dataset adds 900k entries containing boxes, masks, and fine-grained localization annotations. Different multimodal tasks are realized through instruction following, such as referring expression detection and referring expression segmentation.
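The talk did not show the exact data schema, but a record along the following hypothetical lines illustrates how detection can be cast as instruction following, with the grounding box serialized into the text answer so that one autoregressive decoder covers both language and localization (the coordinate and tag formats here are assumptions):

```python
# Hypothetical instruction-tuning record casting detection as text generation.
sample = {
    "image": "kitchen_001.jpg",
    "instruction": "Detect: the red mug on the counter",
    "answer": "<box>(312, 208), (455, 340)</box>",  # assumed coordinate format
}

# Referring expression segmentation differs only in the output serialization.
seg_sample = {
    "image": "kitchen_001.jpg",
    "instruction": "Segment: the red mug on the counter",
    "answer": "<mask>polygon: (312,208) (455,208) (455,340) (312,340)</mask>",
}

print(sample["instruction"], "->", sample["answer"])
```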

[Figure: CV and Text Task Unification (source: Zidong Taichu team, ML-Summit)]

Training strategy: the first stage uses image-text pairs to achieve cross-modal alignment; the second stage uses multimodal referring-expression tasks and a series of fine-grained tasks to strengthen the model's fine-grained capability; the third stage applies reinforcement learning so that the model better follows user instructions and understands user intent.

[Figure: Training Strategies in Different Stages (source: Zidong Taichu team, ML-Summit)]

Model effect: the trained multimodal large model possesses not only strong general capabilities but also visual grounding functions. On visual grounding it surpasses CogVLM-17B, the best localization-optimized model of its time, and on object detection and open object counting it achieves, for the first time, higher precision than multiple specialized detection and counting models.


3.2 360 Research Institute - Open World Object Detection

360 Research Institute's open-world object detection technology has been widely applied in smart hardware, autonomous driving, and other fields. Traditional small models struggle to meet the detection demands of open scenarios due to insufficient generalization capabilities, and this task is precisely a key link for multimodal large models to build universal perception capabilities. Why has detection capability become an essential attribute for multimodal large models? Its necessity is mainly reflected in the following four aspects:

[Figure: four reasons detection capability is essential for multimodal large models]

Although object detection can help multimodal large models improve their capabilities, several challenges must be addressed in practice: first, data acquisition and annotation bottlenecks, since data for unknown categories is scarce; second, complex data distributions and difficulty identifying long-tail categories; finally, weak cross-category transfer and insufficient environmental adaptability.

3.3 Tencent - Multimodal WeChat Channels Review

With the rapid expansion of the content ecosystem on the WeChat Channels platform, video content and user comments are continuously growing at a high rate. Manual review is facing obvious efficiency bottlenecks and quality challenges in handling massive review tasks. To effectively improve the timeliness and accuracy of content review, there is an urgent need to build a comprehensive solution covering algorithm model optimization, review mechanism innovation, standard system improvement, and data interpretability enhancement.

Model layer: Introduction of vertical large models.

Vertical large models bring powerful natural-language understanding for accurately identifying potential violations, and multimodal models can process various data types, comprehensively covering review needs.

Review layer: Segmented review process.

Suspected low violation (white channel): For content suspected of low violation, simplify the review process and reduce manual intervention, thereby significantly improving review efficiency.

Suspected high violation (black channel): For content suspected of high violation, provide early warning of violating information, helping reviewers focus on high-violation content.
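A schematic of that two-channel routing, assuming the multimodal model emits a violation score in [0, 1]; the thresholds and channel names below are illustrative, not Tencent's production values.

```python
def route_for_review(violation_score: float) -> str:
    """Route content by the model's violation score (thresholds illustrative).
    White channel: likely-clean content gets a simplified automated pass.
    Black channel: likely-violating content is flagged for priority review."""
    if violation_score < 0.2:
        return "white_channel: auto-pass with random spot checks"
    if violation_score > 0.8:
        return "black_channel: early warning + priority manual review"
    return "gray_zone: standard manual review queue"

print(route_for_review(0.93))  # -> black channel
print(route_for_review(0.05))  # -> white channel
```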

[Figure: WeChat Channels Review System Solution]

Multi-dimensional feature input: video frames, text content (titles, image OCR, ASR transcripts, comments), and other multi-dimensional data help the model determine more accurately whether content is harmful (see the fusion sketch after this list).

Model base pre-training: Constructing vertical scenario pre-training datasets through model assistance + manual annotation, selecting general multimodal bases for pre-training on vertical data.

Data optimization and fine-tuning: Based on manual review feedback, multiple rounds of iterative optimization training were conducted to ensure higher accuracy and robustness in practical applications.
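To make the multi-dimensional input concrete, here is a hedged fusion sketch: each modality is encoded separately and the vectors are concatenated into one input for a harmful-content classifier. The random encoders are stand-ins, and real systems typically fuse modalities with cross-attention inside the model rather than by concatenation.

```python
import numpy as np

def encode(feature: str, dim: int = 128) -> np.ndarray:
    """Hypothetical per-modality encoder: in the real system these are the
    video, OCR/ASR-text, and comment-text towers of the multimodal base."""
    rng = np.random.default_rng(abs(hash(feature)) % (2**32))
    return rng.standard_normal(dim)

def fuse(frames: list[str], title: str, ocr: str, asr: str,
         comments: list[str]) -> np.ndarray:
    """Early fusion by concatenation: one joint vector for the classifier."""
    parts = [encode(f) for f in frames]
    parts += [encode(title), encode(ocr), encode(asr)]
    comment_vec = (np.mean([encode(c) for c in comments], axis=0)
                   if comments else np.zeros(128))
    parts.append(comment_vec)
    return np.concatenate(parts)

vec = fuse(["frame_0.jpg"], "video title", "ocr text", "asr transcript", ["nice!"])
print(vec.shape)
```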

[Figure: Multimodal Information Data Flow Fusion]

Tencent's video review system integrates text RAG (policy-library retrieval) with multimodal content understanding, using a review Agent to actively intercept violating content.

04

Future Development Trends of Large Models

  • Algorithm Level: Models will evolve from network architecture, dynamic learnability, and multimodal alignment to unified full-modal capabilities (AGI).

  • Product Level: We will see more and more complex systems based on large models, featuring human-machine collaborative interaction capabilities.

  • Domain Level: Deep integration in various vertical domains will promote the restructuring of social resources. Capabilities will extend from software to hardware, with AI robots directly used in the real world.


Future large models will exhibit a triple helix development: RAG evolving towards multimodal knowledge graphs, building a virtual-real integrated cognitive network; Agent evolving towards embodied intelligence, forming environment-adaptive decision systems; and multimodality upgrading to neuro-symbolic systems, achieving interpretable perception and reasoning. The deep integration of these three will give rise to a new generation of industrial intelligent agents, realizing a complete closed loop of perception-cognition-decision-execution in scenarios such as surgical robots and smart grids.

Note: Some images in the article are from the internet and public papers. The diagrams in the Multimodal Task Unification section are from the Zidong Taichu team's sharing at the ML-Summit conference.

-End-

Original Author | Jiang Jin



