Many of you probably get a headache at the thought of creating a PowerPoint (PPT) presentation. You rack your brain over the content and still can't come up with anything appealing. When you finally manage to write a few paragraphs, they feel dry and lack highlights. And even with careful formatting, the result looks awkward when you look at it again the next day.
To address this, researchers from the Institute of Software, Chinese Academy of Sciences, the University of Chinese Academy of Sciences, and Shanghai Jiexin Technology have jointly open-sourced PPTAgent.
PPTAgent analyzes high-quality reference slides the way a human would, extracts their content patterns and layout structures, and then edits and refines slides step by step based on the input document. It also has a self-correction mechanism that helps the generated presentation meet user requirements in content, design, and coherence, saving considerable time and effort.
Open Source Address: https://github.com/icip-cas/PPTAgent
PPTAgent's core technology and innovation lie in its unique two-stage presentation generation method, inspired by the natural process of human PPT creation.
Traditional PPT generation methods often convert text content directly into slides, which can easily produce presentations that lack visual appeal and structural coherence. PPTAgent addresses this problem by imitating the way humans work: “selecting reference slides and editing them step by step”.
In the first stage, PPTAgent analyzes the reference presentation in depth. It first clusters the slides into two groups: structural slides and content slides. Structural slides support the overall organization of the presentation, such as title pages and table-of-contents pages.
Content slides convey specific information, such as bullet points or charts. Using a large model, PPTAgent identifies the structural role of each slide and groups structural slides by their text features.
Content slides are converted into images, and a hierarchical clustering method groups similar slide images together. PPTAgent then uses a multimodal large model to analyze these images and identify the layout pattern of each cluster. This gives later slide generation a clear reference and helps keep the generated presentation structurally consistent and logically coherent.
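The grouping of content slides can be illustrated with a minimal sketch. It assumes each slide image has already been reduced to a feature vector. The embedding values, the average-linkage criterion, the 0.9 similarity threshold, and the function names are illustrative assumptions, not PPTAgent's actual implementation or settings.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_slides(features, threshold=0.9):
    """Agglomerative (average-linkage) clustering of slide feature vectors.

    Repeatedly merges the two most similar clusters until no pair reaches
    an average similarity of `threshold`. Returns lists of slide indices.
    """
    clusters = [[i] for i in range(len(features))]

    def avg_sim(c1, c2):
        total = sum(cosine(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters (a < b by construction).
        (a, b), best = max(
            (((a, b), avg_sim(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical image embeddings for four slides: two layouts, two slides each.
features = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [0.0, 1.0, 0.2], [0.1, 0.9, 0.2]]
groups = cluster_slides(features)
print(groups)
```

Each resulting group would then be handed to the multimodal model, which names the layout pattern the slides in that group share.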
In terms of content pattern extraction, PPTAgent further defines a detailed extraction framework. Each slide element is assigned a category, description, and content, which makes the organization of slide content clearer and more intuitive.
For example, a slide may include elements such as title, body text, and images, each with a clear description and data content. This detailed content pattern extraction provides a solid foundation for subsequent slide generation, enabling PPTAgent to better understand the layout and content organization of slides.
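As a rough illustration, such a content schema could be modeled as follows. The class names, field types, and sample values here are assumptions made for this sketch; only the three attributes (category, description, content) come from the description above.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class SlideElement:
    category: str     # e.g. "title", "body text", "image"
    description: str  # the role the element plays on the slide
    content: list     # text lines, or image paths for image elements


@dataclass
class SlideSchema:
    layout_name: str  # hypothetical label for the layout cluster
    elements: list = field(default_factory=list)


schema = SlideSchema(
    layout_name="title + bullets + picture",
    elements=[
        SlideElement("title", "main heading of the slide", ["Quarterly results"]),
        SlideElement("body text", "key points as bullets", ["Revenue up", "Costs flat"]),
        SlideElement("image", "supporting chart", ["charts/revenue.png"]),
    ],
)

# asdict() gives a plain nested dict that is easy to serialize into a prompt.
schema_dict = asdict(schema)
print(schema_dict["elements"][0])
```

A structured schema like this tells the model which slots a layout offers and what kind of content each slot expects.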
In the second stage, PPTAgent's innovation is its editing-based generation method. Unlike traditional methods that generate slides from scratch, PPTAgent creates new slides by selecting appropriate reference slides and editing them step by step. This preserves the carefully designed layouts and styles of the reference slides while updating and refining their content through editing operations. PPTAgent provides a set of editing APIs for editing, deleting, and copying slide elements.
These APIs, combined with HTML rendering technology, enable large models to understand and modify slide content in a more intuitive way. Compared to the traditional XML format, the HTML format is more concise and easier to operate, thereby improving the efficiency and accuracy of the generation process.
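A minimal sketch of what editing a reference slide through such APIs might look like is shown below. The function names (`replace_text`, `delete_element`, `clone_element`, `render_html`) and the dict-based slide model are hypothetical and chosen for illustration; they are not PPTAgent's actual API.

```python
import copy
from html import escape


def render_html(slide):
    """Render a slide (a list of element dicts) as a compact HTML string."""
    parts = [f'<{e["tag"]} id="{e["id"]}">{escape(e["text"])}</{e["tag"]}>' for e in slide]
    return '<div class="slide">' + "".join(parts) + "</div>"


def _find(slide, element_id):
    """Return the list index of the element with the given id."""
    for index, element in enumerate(slide):
        if element["id"] == element_id:
            return index
    raise KeyError(f"no element with id {element_id}")


def replace_text(slide, element_id, text):
    slide[_find(slide, element_id)]["text"] = text


def delete_element(slide, element_id):
    del slide[_find(slide, element_id)]


def clone_element(slide, element_id, new_id):
    """Copy an element and insert the copy right after the original."""
    index = _find(slide, element_id)
    clone = copy.deepcopy(slide[index])
    clone["id"] = new_id
    slide.insert(index + 1, clone)


reference = [
    {"id": 0, "tag": "h1", "text": "Old title"},
    {"id": 1, "tag": "p", "text": "First point"},
    {"id": 2, "tag": "p", "text": "Footer note"},
]

slide = copy.deepcopy(reference)   # the reference slide itself stays untouched
replace_text(slide, 0, "PPTAgent overview")
clone_element(slide, 1, new_id=3)  # need a second bullet: copy the first
replace_text(slide, 3, "Second point")
delete_element(slide, 2)           # drop an element the new slide does not need
html = render_html(slide)
print(html)
```

Because every edit targets an existing element by id, the new slide keeps the reference slide's structure and only its content changes.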
PPTAgent also introduces a self-correction mechanism to make generation more robust. During slide generation, the editing operations produced by the model are executed in a read-eval-print loop (REPL) environment. When an operation cannot be applied to the reference slide, the REPL returns execution feedback that helps the large model adjust its editing operations.
Through this iterative correction, PPTAgent reduces the risk of producing erroneous or inconsistent slides, which helps keep the content and structure of the final presentation at a high standard.
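The feedback loop described above can be sketched roughly as follows. Here `fake_model` stands in for a real large-model call, and the action format, the retry limit of 2, and all function names are assumptions made for this sketch rather than PPTAgent's actual code.

```python
import copy


def execute(slide, actions):
    """Apply edit actions to a copy of the slide; raise on the first invalid one."""
    working = copy.deepcopy(slide)
    for name, element_id, *args in actions:
        if element_id not in working:
            raise KeyError(f"{name}: element {element_id} does not exist")
        if name == "replace_text":
            working[element_id] = args[0]
        elif name == "delete_element":
            del working[element_id]
        else:
            raise ValueError(f"unknown action {name}")
    return working


def generate_with_retries(slide, propose_edits, max_retries=2):
    """Ask for edits, execute them, and feed any error back to the proposer."""
    feedback = None
    for attempt in range(max_retries + 1):
        actions = propose_edits(feedback)
        try:
            return execute(slide, actions), attempt
        except (KeyError, ValueError) as error:
            feedback = str(error)  # execution feedback for the next attempt
    raise RuntimeError(f"gave up after {max_retries + 1} attempts: {feedback}")


def fake_model(feedback):
    """Hypothetical stand-in for the model: wrong id first, corrected after feedback."""
    if feedback is None:
        return [("replace_text", 5, "New title")]
    return [("replace_text", 0, "New title")]


reference = {0: "Old title", 1: "Old body"}  # element id -> text
result, retries = generate_with_retries(reference, fake_model)
print(result, retries)
```

In this run the first proposal targets a nonexistent element, the error message is passed back, and the second proposal succeeds, so one retry is used.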
To test PPTAgent's performance, the researchers selected 50 reference presentations from the Zenodo10K dataset and collected 50 input documents from the same domains. Across 5 domains, with 10 input documents and 10 reference presentations per domain, this yields 5 × 10 × 10 = 500 presentation generation tasks.
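One way to read these numbers, assuming every document in a domain is paired with every reference presentation from the same domain, is the small enumeration below. The domain, document, and reference names are placeholders, not entries from the actual dataset.

```python
from itertools import product

# Placeholder names: 5 domains, each with 10 documents and 10 references.
domains = [f"domain_{d}" for d in range(5)]

tasks = [
    (domain, f"{domain}/doc_{i}", f"{domain}/ref_{j}")
    for domain in domains
    for i, j in product(range(10), repeat=2)
]

documents = {doc for _, doc, _ in tasks}
references = {ref for _, _, ref in tasks}
print(len(tasks), len(documents), len(references))
```

Under this pairing assumption, the 50 documents and 50 reference presentations combine into exactly 500 tasks.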
The results show that PPTAgent significantly outperforms existing presentation generation methods in terms of content, design, and coherence. For example, compared to rule-based DocPres and template-based KCTV, PPTAgent improved content quality by 12.1% to 28.6%, design by 13.2% to 40.9%, and coherence by a substantial 25.5% to 36.6%. These results indicate that PPTAgent can generate high-quality, visually appealing, and structurally coherent presentations.