Ushering in the Era of On-Device Long Text! OpenBMB's New Architecture Boosts MiniCPM up to 220x Faster

Reported by Synced

Editor: Zenan

On-device large models are undergoing a qualitative change.

On-device language models have finally seen a transformative innovation.

Last Friday, at the 2025 BAAI Conference, OpenBMB, a well-known Chinese AI startup, officially released MiniCPM 4.0, the latest generation of its 'Pocket Rocket' model, pushing AI development to 'full throttle'.

At the launch event, OpenBMB's CEO announced that MiniCPM 4.0 is the industry's first language model to achieve system-level, context-aware sparsity. It reaches an extremely high sparsity of 5%, which makes long-text inference feasible on devices and ushers in the era of on-device long text.

The released MiniCPM 4.0 comes in two parameter versions, 8B and 0.5B, both pushing the boundaries of on-device model capabilities.

According to reports, MiniCPM 4.0-8B, the new generation of context-aware sparse efficient architecture model, combines innovations at the architecture, algorithm, data, and system levels. Compared with models of similar scale such as Qwen3-8B, Llama-3-8B, and GLM-4-9B, it achieves a stable 5x increase in long-text inference speed, with up to 220x acceleration in extreme scenarios, while delivering best-in-class model performance. It also significantly reduces long-text caching requirements: in 128K long-text scenarios, MiniCPM 4.0-8B needs only 1/4 of the cache storage space that Qwen3-8B does.

The model, pre-training data, and on-device inference framework are all open-source.

GitHub Link: https://github.com/openbmb/minicpm

Technical Report: https://github.com/OpenBMB/MiniCPM/blob/main/report/MiniCPM_4_Technical_Report.pdf

Huggingface Link: https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b

Model Scope Link: https://www.modelscope.cn/collections/MiniCPM-4-ec015560e8c84d

While the MiniCPM 4.0 series defends its title as the world's strongest on-device model, it also marks another technological breakthrough in the large model field that, following DeepSeek, originates from the underlying architecture.

Speed Increased Hundredfold

Strongest On-Device, Punching Above Its Weight

MiniCPM 4.0's improvements are comprehensive, further solidifying the leading position of OpenBMB's 'Pocket Rocket' series in a range of on-device inference tasks.

OpenBMB reported that MiniCPM 4.0-8B matches the performance of Qwen3-8B and surpasses Gemma-3-12B on popular AI benchmarks such as MMLU, CEval, MATH500, and HumanEval.

MiniCPM 4.0-0.5B, a smaller language model designed for a wider range of devices, achieves high-speed inference of 600 tokens per second, with performance surpassing Qwen3-0.6B.

It's worth noting that Qwen3-0.6B, launched only in April, had already surpassed Gemma 4B in performance. This 'punching above its weight' trend is very welcome: it means more applications will be able to afford large models in the future.

To further enhance efficiency and adapt to more scenarios, OpenBMB designed an 'efficient dual-frequency shifting mechanism' for the new model, allowing it to automatically switch attention modes based on task characteristics: enabling sparse attention to reduce computational complexity when processing long texts and deep thinking tasks, and switching to dense attention in short text scenarios to ensure accuracy. This enables efficient responses across different tasks.
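The switching idea can be pictured as a simple dispatcher. This is a minimal sketch and not OpenBMB's implementation: the function name `choose_attention_mode`, the 8,192-token threshold, and the `deep_thinking` flag are all assumptions made for illustration, since the article does not describe how the model decides.

```python
def choose_attention_mode(num_tokens: int, deep_thinking: bool = False,
                          long_text_threshold: int = 8192) -> str:
    """Pick an attention mode from task characteristics.

    Assumed rule: long inputs and deep-thinking tasks use sparse attention
    to cut compute; everything else uses dense attention for accuracy.
    """
    if deep_thinking or num_tokens >= long_text_threshold:
        return "sparse"
    return "dense"


# A short chat turn stays dense; a 128K-token document goes sparse.
print(choose_attention_mode(512))
print(choose_attention_mode(128 * 1024))
```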

MiniCPM 4.0 also significantly reduces caching requirements for long-text tasks. In 128K scenarios, MiniCPM 4.0-8B requires only 1/4 of the cache storage space compared to Qwen3-8B.
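To get a feel for what a 4x cache reduction means at 128K tokens, the size of a standard key-value cache can be estimated with simple arithmetic. The configuration below (36 layers, 8 key-value heads, head dimension 128, 16-bit values) is an illustrative assumption for an 8B-class model, not a figure from the article, and the sketch does not model how MiniCPM 4.0 achieves its saving.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Dense KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Illustrative, assumed configuration (not taken from the article).
tokens_128k = 128 * 1024
baseline = kv_cache_bytes(tokens_128k, num_layers=36, num_kv_heads=8, head_dim=128)
quarter = baseline // 4

print(f"baseline cache: {baseline / 2**30:.1f} GiB")
print(f"one quarter:    {quarter / 2**30:.1f} GiB")
```

Under these assumed numbers the dense cache comes to 18 GiB, so a quarter of it is 4.5 GiB.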

MiniCPM 4.0 also further improves operational efficiency. Spanning algorithms, systems, and hardware inference, it is the first large model built on a fully self-developed on-device stack, making system-level software and hardware sparsity truly deployable.

Based on MiniCPM-4.0, OpenBMB continues to emphasize its application-oriented advantages: this generation of 'Pocket Rocket' models has been adapted to mainstream chip platforms including Intel, Qualcomm, MediaTek, and Huawei Ascend, and can be deployed using open-source frameworks such as vLLM, SGLang, llama.cpp, LlamaFactory, and XTuner. Enhanced MCP support ensures convenient model application.

With this technological breakthrough in small on-device models, the AI models embedded in various manufacturers' mobile phones and in-car systems may soon see a wave of updates, and many apps may be 'rewritten'.

Behind the Powerful Performance

OpenBMB Achieves Architecture-Level Innovation

As is well-known, DeepSeek has recently led technological breakthroughs in the AI field, with architectural innovations in its V3, R1, and other models significantly enhancing AI's deep thinking capabilities.

Today, advanced capabilities such as strong reasoning and long-text processing have become standard for large model applications. Only when a model deeply understands the structure and semantics of long texts can its generated content stay consistent. In applications, long-text understanding also means AI can become a true 'personal assistant,' capable of remembering more personal information and context.

And only by deploying models on devices can AI response latency be reduced, enabling future intelligent products while keeping personal data secure.

'Current cloud-based large model technologies still have some limitations at the application level; using them is like using old search engines,' said Liu Zhiyuan, co-founder and chief scientist of OpenBMB. 'If AI's ultimate goal is AGI (Artificial General Intelligence), then its form should be like Jarvis in Iron Man, knowing your personal information and understanding your preferences. These things require large models to have long-term memory.'

On the other hand, running such highly capable AI on devices has become a new challenge for engineers.

In the technical report for MiniCPM-4, OpenBMB engineers introduced their systematic innovations across four key dimensions: on-device model architecture, training data, training algorithms, and inference systems.

In terms of model architecture, OpenBMB proposed InfLLM v2, a trainable sparse attention layer that simultaneously accelerates both pre-filling and decoding phases of long-context processing, achieving efficient long-text handling while maintaining model performance.

InfLLM has already gained recognition in the AI field for long-context processing. In February 2024, the original InfLLM, published by the team of Liu Zhiyuan of Tsinghua University, an OpenBMB co-founder, explored improvements in sparse attention. In February 2025, DeepSeek's long-text processing architecture, NSA (Native Sparse Attention), adopted a similar approach, and its paper cited and compared against InfLLM.

However, previous industry methods still suffered from slow inference on short texts. InfLLM v2 addresses this short-text bottleneck. Its upgraded hybrid sparse attention structure changes how the traditional Transformer computes relevance: after dividing the text into blocks and regions, it uses an intelligent selection mechanism to 'spot check' only the most relevant key areas for attention calculation.
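The 'spot check' idea can be illustrated with a toy block-selection attention for a single query. This is a simplified sketch under stated assumptions, not InfLLM v2 itself: summarizing each block by the mean of its keys, the `block_size` and `top_k` parameters, and the function names are all hypothetical choices made for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size=4, top_k=2):
    """Attend only to the top_k key blocks that look most relevant to the query."""
    # 1. Partition token positions into contiguous blocks.
    blocks = [list(range(start, min(start + block_size, len(keys))))
              for start in range(0, len(keys), block_size)]

    # 2. Summarize each block by the mean of its keys (an assumed, cheap proxy).
    def mean_key(indices):
        return [sum(keys[i][d] for i in indices) / len(indices)
                for d in range(len(keys[0]))]

    block_scores = [dot(query, mean_key(indices)) for indices in blocks]

    # 3. Keep only the highest-scoring blocks.
    chosen = sorted(range(len(blocks)), key=lambda b: block_scores[b],
                    reverse=True)[:top_k]
    selected = sorted(i for b in chosen for i in blocks[b])

    # 4. Ordinary scaled dot-product softmax attention over the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
              for d in range(len(values[0]))]
    return output, selected


# Eight tokens in two blocks; only the second block points the same way as the query.
keys = [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 4
values = [[float(i)] for i in range(8)]
output, selected = block_sparse_attention([1.0, 0.0], keys, values, top_k=1)
print(selected, output)
```

With `top_k` set to the total number of blocks, the function reduces to ordinary dense attention, which makes the cost trade-off easy to see: the attention step scales with the number of selected tokens rather than the full context.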

At the inference layer, MiniCPM 4.0 accelerates on-device inference through self-developed technologies: the CPM.cu inference framework, BitCPM extreme low-bit quantization, and the ArkInfer cross-platform deployment framework.

The CPM.cu inference framework efficiently combines sparsity, speculative decoding, and quantization, resulting in a 5x speed improvement. Within it, FR-Spec's lightweight speculative sampling works like a small model acting as an 'intern' for a large model, with the small model's vocabulary burden reduced to speed up its computation. Through an innovative vocabulary pruning strategy, the small model focuses on drafting with high-frequency basic vocabulary and avoids wasting computation on low-frequency, difficult words; the large model then verifies and corrects the drafts.
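The draft-then-verify loop with a pruned draft vocabulary can be sketched with toy models. Everything here is an illustrative assumption rather than the FR-Spec implementation: the greedy acceptance rule, the `toy_scores` stand-in model, and the hand-picked `frequent_vocab` are hypothetical, and a real system would verify all drafted tokens in one parallel pass of the large model instead of a Python loop.

```python
TARGET = ["the", "cat", "sat", "on", "the", "xylophone"]


def toy_scores(context):
    """Toy stand-in for a language model: all weight on the next word of TARGET."""
    scores = {"<eos>": 0.0}
    if len(context) < len(TARGET):
        scores[TARGET[len(context)]] = 1.0
    return scores


def draft_tokens(draft_scores, context, allowed_vocab, num_draft):
    """Greedily draft tokens, choosing only from the pruned high-frequency vocabulary."""
    drafted, ctx = [], list(context)
    for _ in range(num_draft):
        scores = draft_scores(ctx)
        token = max(allowed_vocab, key=lambda t: scores.get(t, float("-inf")))
        drafted.append(token)
        ctx.append(token)
    return drafted


def verify(target_scores, context, drafted):
    """Accept drafted tokens while they match the large model's greedy choice."""
    accepted, ctx = [], list(context)
    for token in drafted:
        scores = target_scores(ctx)
        best = max(scores, key=scores.get)
        if best != token:
            accepted.append(best)  # the large model corrects the first mismatch
            return accepted
        accepted.append(token)
        ctx.append(token)
    return accepted


frequent_vocab = ["the", "cat", "sat", "on", "<eos>"]  # "xylophone" is pruned as rare
drafted = draft_tokens(toy_scores, [], frequent_vocab, num_draft=6)
final = verify(toy_scores, [], drafted)
print(drafted)
print(final)
```

In this toy run the draft gets the five common words right and can only guess `<eos>` for the rare one, which the verifier replaces with `xylophone`. The draft model never has to score the rare word at all, which is where the saving is meant to come from.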

The BitCPM quantization algorithm achieves industry SOTA-level 4-bit quantization and explores a ternary (three-value, 1.58-bit) quantization scheme. Through refined mixed-precision strategies and adaptive quantization algorithms, the model maintains excellent performance even after a 90% size reduction.
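The '1.58 bit' figure comes from the information content of a three-valued weight: log2(3) is about 1.585. The sketch below shows one common way to map weights to {-1, 0, +1}, using a mean-absolute-value scale. This is an assumed scheme for illustration only; the article does not describe BitCPM's actual algorithm, and the function names are hypothetical.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map each weight to -1, 0, or +1 using a single mean-absolute-value scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ternary_dequantize(codes, scale):
    """Recover approximate weights from ternary codes and the stored scale."""
    return [c * scale for c in codes]


weights = [0.9, -1.1, 0.05, 2.0, -0.02]
codes, scale = ternary_quantize(weights)
print(codes)                                      # every entry is -1, 0, or 1
print(ternary_dequantize(codes, scale))
print(f"bits per weight: {math.log2(3):.3f}")
```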

The ArkInfer cross-platform deployment framework is optimized for a variety of on-device chips, enabling efficient speculative sampling and constrained decoding on major platforms and ensuring that the on-device model zoo works seamlessly across them.

At the model training and data level, OpenBMB proposed UltraClean, an efficient and accurate strategy for filtering and generating pre-training data, which reduces verification costs by 90%. It sets strict admission criteria for internet corpora so that only data that truly improves model performance enters the pre-training corpus. Large-scale quality inspection uses the lightweight FastText tool, and processing 15 trillion tokens in this workflow requires only 1,000 CPU hours.
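The admission step amounts to scoring each document and keeping only those above a threshold. The sketch below shows that shape with a pluggable scorer. The `toy_quality_score` heuristic is a hypothetical stand-in for a trained lightweight classifier, and the 0.5 threshold is likewise an assumption; neither comes from the article.

```python
def filter_corpus(documents, quality_score, threshold=0.5):
    """Keep only documents whose quality score meets the admission threshold."""
    return [doc for doc in documents if quality_score(doc) >= threshold]


def toy_quality_score(text):
    """Hypothetical stand-in for a classifier's 'high quality' probability:
    the fraction of words that are alphabetic and at least four letters long."""
    words = text.split()
    if not words:
        return 0.0
    good = sum(1 for w in words if w.isalpha() and len(w) >= 4)
    return good / len(words)


documents = [
    "Sparse attention reduces compute cost",
    "click here !!! $$$ win",
]
kept = filter_corpus(documents, toy_quality_score)
print(kept)
```

Because the scorer is passed in as a function, a real classifier could replace the toy heuristic without changing the filtering loop.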

OpenBMB utilized UltraChat-v2 to synthesize tens of billions of high-quality aligned tokens, strengthening key capabilities such as knowledge, instruction following, long-text processing, and tool use.

In the MiniCPM 4 series, OpenBMB applied 'ModelTunnel V2' for more efficient training strategy search: training experiments are run on smaller models (0.01B-0.5B) and the results are then transferred to larger models. For MiniCPM 4, OpenBMB reduced the number of small-model searches; compared to ModelTunnel V1, only half as many experiments are needed to find the optimal configuration.

With the support of high-quality data and efficient training strategies, MiniCPM 4.0 reached the same capability level as open-source models of similar size (Qwen3-8B) at only 22% of the training cost.

Through multi-dimensional optimization, MiniCPM 4 truly achieved the industry's only end-to-end on-device full-process optimization, becoming another milestone in the AI field for exploring high-efficiency language models.

OpenBMB reported that, with further adaptation, MiniCPM 4 supports a variety of applications, including trustworthy survey generation and tool use based on the Model Context Protocol (MCP), demonstrating its wide applicability.

This year marks the explosion of large model applications. As a startup, OpenBMB insists on building foundational models, laying a solid groundwork for future intelligent on-device applications.

OpenBMB's High-Efficiency Model Exploration

Another Path Beyond DeepSeek

As competition in large model technology escalates, the scaling laws-driven approach has entered deep waters. On one hand, increasingly large model parameters are hitting computational power and parallelization bottlenecks; on the other hand, the volume of training data challenges companies' acquisition and processing capabilities. In such circumstances, a small number of players who have long researched new model paradigms are gradually coming to the forefront.

Among domestic AI startups, DeepSeek has driven a new round of global large model technological progress with its V3, R1, and other model innovations. Meanwhile, in the direction of on-device models, OpenBMB has consistently been in the spotlight.

Interestingly, both OpenBMB and DeepSeek pursue efficient large models with strong reasoning, starting from hardware-software co-optimization and spanning the entire pipeline. Unlike DeepSeek, which focuses on raising the upper limit of model capability and on cloud deployment, OpenBMB's team has consistently explored on-device sparsity solutions.

Improving AI efficiency and reducing usage costs is OpenBMB's founding mission. With the success of the Transformer architecture, language models have continuously expanded in scale, and people have been seeking more effective model paradigms. Model sparsity is considered a very promising solution. OpenBMB is one of the earliest teams in China to explore the path of sparsity, and its research has consistently led the industry.

As early as 2019, OpenBMB's founding team began exploring sparse FFN related work, and their research was followed by companies like Google and Apple.

In June 2021, the team participated in the release of the hundred-billion-parameter efficient and easy-to-use large MoE model, CPM-2. In the same year, OpenBMB's team proposed in their work 'MoEfication: Transformer Feed-forward layers are Mixtures of Experts' that converting dense models into MoE models with equivalent parameters could achieve significant inference acceleration.

In July 2024, OpenBMB open-sourced the MiniCPM-S model, which uses sparse activation to reduce the inference energy consumption of large models under equivalent parameter conditions.

Late last year, Tsinghua University and OpenBMB's team proposed the brain-inspired efficient sparse architecture Configurable Foundation Model, revolutionizing the previous MoE architecture. It emphasizes decomposing large models into several modules based on functionality, achieving complex capabilities through module retrieval, combination, updating, and growth. From an implementation perspective, the new architecture significantly enhances the 'knowledge density' of large models and promotes low-power inference for on-device models.

From a broader perspective, while tech giants are investing heavily in cloud computing infrastructure for large models, the importance of deploying advanced models on devices is equally self-evident: it means reaching more than 7 billion smartphones worldwide, as well as future AI PCs and intelligent in-car systems.

Interestingly, in a recent series of studies, OpenBMB researchers have summarized the 'Densing Law' for large models, suggesting that as technology continues to evolve, the capability density of language models doubles on average every 100 days, and people can continue to train more computationally efficient and powerful foundational large models.
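The doubling claim can be turned into a quick calculation. The sketch below assumes a clean exponential with a 100-day doubling period, and reads 'capability density' as capability per parameter, so that the parameters needed for a fixed capability halve every period. Both the function names and that reading are assumptions for illustration.

```python
def density_multiplier(days, doubling_period_days=100.0):
    """How many times capability density grows after the given number of days."""
    return 2.0 ** (days / doubling_period_days)


def params_for_same_capability(params_today, days, doubling_period_days=100.0):
    """Parameters needed later to match today's capability, under the assumed law."""
    return params_today / density_multiplier(days, doubling_period_days)


print(density_multiplier(100))                  # one doubling period
print(density_multiplier(365))                  # growth over one year
print(params_for_same_capability(8e9, 200))     # an 8B model's capability, 200 days later
```

Under this assumed reading, density grows roughly 12.5x per year, and what takes 8B parameters today would take about 2B after 200 days.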

MiniCPM-4.0 advances AI capability density to a higher level, echoing DeepSeek R1's high point in model capabilities.

Moving in this direction, OpenBMB plans to continue releasing more MiniCPM series foundational models and multimodal models in the near future.

The next generation of 'Pocket Rocket' will bring us even greater surprises.
