Mianbi MiniCPM4: 3x Inference Speed, Outperforming Same-Size Qwen3, Putting Pressure on Alibaba


Mianbi Intelligence released the MiniCPM4 series, which includes 10 models such as MiniCPM4-8B and MiniCPM4-MCP.


The main characteristic of MiniCPM4 is its fast inference speed.

MiniCPM4

MiniCPM4's Overall Architecture

[Figure: MiniCPM4 overall architecture]

MiniCPM4's Innovations

Efficient Model Architecture:


InfLLM v2: A trainable sparse attention mechanism. When processing 128K-long texts, each token computes relevance against fewer than 5% of the other tokens, sharply reducing the computational overhead of long-context attention.
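To make the block-selection idea concrete, below is a minimal PyTorch sketch of top-k block-sparse attention for a single query: coarse block representations are scored against the query, and full attention is computed only over the selected blocks. The function name, shapes, and the mean-pooled block representation are illustrative assumptions, not the InfLLM v2 implementation, which uses trainable semantic-kernel representations and fused CUDA kernels.

import torch

def topk_block_sparse_attention(q, k, v, block_size=64, topk=4):
    # q: (1, d) single query; k, v: (n, d) long context (illustrative shapes only)
    n, d = k.shape
    n_blocks = n // block_size
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    v_blocks = v[: n_blocks * block_size].view(n_blocks, block_size, d)

    # Score each block with a coarse representation (mean-pooled keys here,
    # standing in for InfLLM v2's trainable block representations).
    block_repr = k_blocks.mean(dim=1)                      # (n_blocks, d)
    block_scores = block_repr @ q.squeeze(0) / d ** 0.5    # (n_blocks,)

    # Keep only the top-k most relevant key-value blocks.
    sel = torch.topk(block_scores, k=min(topk, n_blocks)).indices
    k_sel = k_blocks[sel].reshape(-1, d)
    v_sel = v_blocks[sel].reshape(-1, d)

    # Dense attention restricted to the selected blocks.
    attn = torch.softmax(q @ k_sel.T / d ** 0.5, dim=-1)
    return attn @ v_sel

q, k, v = torch.randn(1, 128), torch.randn(4096, 128), torch.randn(4096, 128)
out = topk_block_sparse_attention(q, k, v)   # (1, 128), touching only 4 of 64 blocks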

Efficient Learning Algorithms:


Model Wind Tunnel 2.0: Introduces a scaling-prediction method for downstream task performance, enabling more precise searches over model training configurations.

BitCPM: Compresses model parameters to ternary values (three possible values), an extreme quantization that reduces parameter bit-width by roughly 90% (a toy sketch of the idea follows this list).

Training Engineering Optimization: Adopts FP8 low-precision computation combined with a multi-token prediction (MTP) training strategy.
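The following is a toy sketch of the ternary quantization idea, assuming the common absmean scheme (round-to-nearest against a per-tensor scale). BitCPM itself relies on quantization-aware training, which this forward-only snippet does not show.

import torch

def ternary_quantize(w):
    # Map full-precision weights to {-1, 0, +1} plus one per-tensor scale
    # (absmean scheme; an illustration of the idea, not BitCPM's algorithm).
    scale = w.abs().mean()
    q = (w / (scale + 1e-8)).round().clamp_(-1, 1)
    return q, scale

w = torch.randn(4, 8)
q, scale = ternary_quantize(w)
w_hat = q * scale                      # dequantized approximation
print(q.unique())                      # values drawn from {-1., 0., 1.}
print((w - w_hat).abs().mean())        # mean quantization error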

High-Quality Training Data:

UltraClean: An iterative data-cleaning strategy built on efficient data verification; the resulting high-quality Chinese and English pre-training dataset, Ultra-FineWeb, has been open-sourced.

UltraChat v2: A large-scale, high-quality supervised fine-tuning dataset covering knowledge-intensive, reasoning-intensive, instruction-following, long-text understanding, and tool-calling data.

Efficient Inference System:


CPM.cu: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding (see the speculative-sampling sketch after this list).

ArkInfer: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities.
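As a rough illustration of the speculative-sampling component, here is a simplified greedy-verification sketch of the general technique (not CPM.cu's API): a small draft model proposes several tokens, and the target model verifies them in a single forward pass, keeping the longest agreeing prefix. draft_model and target_model are hypothetical callables returning logits of shape (batch, seq_len, vocab_size); a real implementation reuses KV caches and uses probabilistic accept/reject rather than exact matching.

import torch

def speculative_decode_step(draft_model, target_model, ids, n_draft=4):
    # Draft model proposes n_draft tokens greedily.
    draft = ids
    for _ in range(n_draft):
        logits = draft_model(draft)[:, -1]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft[:, ids.shape[1]:]                            # (1, n_draft)

    # One target-model pass scores every proposed position in parallel.
    target_logits = target_model(draft[:, :-1])
    target_pred = target_logits[:, ids.shape[1] - 1:].argmax(-1)  # (1, n_draft)

    # Accept the longest prefix where draft and target agree, then append the
    # target model's own token at the first disagreement.
    agree = (proposed == target_pred).squeeze(0).long()
    n_accept = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:, :n_accept]
    correction = target_pred[:, n_accept:n_accept + 1]
    return torch.cat([ids, accepted, correction], dim=-1)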

MiniCPM4 in Practice

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Users can call the chat interface directly
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)

# Or use the standard generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)

# Strip the prompt tokens from each output sequence before decoding
output_token_ids = [
    model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)

MiniCPM4-8B supports a sparse attention mechanism for efficient long-sequence inference, which requires the infllmv2_cuda_impl library.

The installation method is as follows:

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e .  # or python setup.py install

To enable InfLLM v2, add the following sparse_config field to the config.json file in the model directory:

{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
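If you prefer to patch the file programmatically, here is a minimal sketch; config_path is an assumed local path to the downloaded checkpoint directory, so adjust it to wherever the model is stored:

import json

config_path = "MiniCPM4-8B/config.json"   # assumed local checkpoint path

with open(config_path) as f:
    config = json.load(f)

# Add the InfLLM v2 sparse-attention settings shown above.
config["sparse_config"] = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,
    "dense_len": 8192,
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)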

Parameter Description:

kernel_size (default: 32): The size of the semantic kernel.

kernel_stride (default: 16): The stride between adjacent kernels.

init_blocks (default: 1): The number of initial blocks processed for each query token. This ensures attention is focused on the beginning of the sequence.

block_size (default: 64): The block size for key-value blocks.

window_size (default: 2048): The size of the local sliding window.

topk (default: 64): Specifies that each Token only uses the top-k most relevant key-value blocks for attention calculation.

use_nope (default: false): Whether to use NOPE technology in block selection to improve performance.

dense_len (default: 8192): Sparse attention offers limited benefit for short sequences, so the model uses standard (dense) attention for sequences shorter than this length and switches to sparse attention once the length is exceeded. Set this to -1 to always use sparse attention regardless of sequence length.
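A tiny sketch of the dense/sparse switch as described above (the function is illustrative, not part of the model's API):

def choose_attention_mode(seq_len, dense_len=8192):
    # dense_len = -1 means "always sparse"; otherwise short sequences stay dense.
    if dense_len == -1:
        return "sparse"
    return "dense" if seq_len < dense_len else "sparse"

print(choose_attention_mode(4096))                 # dense
print(choose_attention_mode(32768))                # sparse
print(choose_attention_mode(4096, dense_len=-1))   # sparse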

https://arxiv.org/pdf/2506.07900

https://huggingface.co/openbmb/MiniCPM4-8B

