Mianbi MiniCPM4: 3x Inference Speed, Outperforming Same-Size Qwen3, Putting Pressure on Alibaba


Mianbi Intelligence released the MiniCPM4 series, which includes 10 models such as MiniCPM4-8B and MiniCPM4-MCP.


The main characteristic of MiniCPM4 is its fast inference speed.

MiniCPM4

MiniCPM4's Overall Architecture

[Figure: MiniCPM4 overall architecture]

MiniCPM4's Innovations

Efficient Model Architecture:


InfLLM v2: A trainable sparse attention mechanism. When processing 128K-long texts, each token computes relevance against fewer than 5% of the other tokens, sharply reducing the computational overhead of long-context attention.
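To make the block-selection idea concrete, below is a minimal PyTorch sketch of top-k block-sparse attention for a single query: coarse block representations are scored against the query, and full attention is computed only over the selected blocks. The function name, shapes, and the mean-pooled block representation are illustrative assumptions, not the InfLLM v2 implementation, which uses trainable semantic-kernel representations and fused CUDA kernels.

import torch

def topk_block_sparse_attention(q, k, v, block_size=64, topk=4):
    # q: (1, d) single query; k, v: (n, d) long context (illustrative shapes only)
    n, d = k.shape
    n_blocks = n // block_size
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    v_blocks = v[: n_blocks * block_size].view(n_blocks, block_size, d)

    # Score each block with a coarse representation (mean-pooled keys here,
    # standing in for InfLLM v2's trainable block representations).
    block_repr = k_blocks.mean(dim=1)                      # (n_blocks, d)
    block_scores = block_repr @ q.squeeze(0) / d ** 0.5    # (n_blocks,)

    # Keep only the top-k most relevant key-value blocks.
    sel = torch.topk(block_scores, k=min(topk, n_blocks)).indices
    k_sel = k_blocks[sel].reshape(-1, d)
    v_sel = v_blocks[sel].reshape(-1, d)

    # Dense attention restricted to the selected blocks.
    attn = torch.softmax(q @ k_sel.T / d ** 0.5, dim=-1)
    return attn @ v_sel

q, k, v = torch.randn(1, 128), torch.randn(4096, 128), torch.randn(4096, 128)
out = topk_block_sparse_attention(q, k, v)   # (1, 128), touching only 4 of 64 blocks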

Efficient Learning Algorithms:


Model Wind Tunnel 2.0: Introduces a scaling-prediction method for downstream task performance, enabling more precise searches over model training configurations.

BitCPM: Compresses model parameters to ternary values (three possible values), an extreme quantization that reduces parameter bit-width by roughly 90% (a toy sketch of the idea follows this list).

Training Engineering Optimization: Adopts FP8 low-precision computation combined with a multi-token prediction (MTP) training strategy.
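The following is a toy sketch of the ternary quantization idea, assuming the common absmean scheme (round-to-nearest against a per-tensor scale). BitCPM itself relies on quantization-aware training, which this forward-only snippet does not show.

import torch

def ternary_quantize(w):
    # Map full-precision weights to {-1, 0, +1} plus one per-tensor scale
    # (absmean scheme; an illustration of the idea, not BitCPM's algorithm).
    scale = w.abs().mean()
    q = (w / (scale + 1e-8)).round().clamp_(-1, 1)
    return q, scale

w = torch.randn(4, 8)
q, scale = ternary_quantize(w)
w_hat = q * scale                      # dequantized approximation
print(q.unique())                      # values drawn from {-1., 0., 1.}
print((w - w_hat).abs().mean())        # mean quantization error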

High-Quality Training Data:

UltraClean: An iterative data-cleaning strategy built on efficient data verification; the resulting high-quality Chinese and English pre-training dataset, Ultra-FineWeb, has been open-sourced.

UltraChat v2: A large-scale, high-quality supervised fine-tuning dataset covering knowledge-intensive, reasoning-intensive, instruction-following, long-text understanding, and tool-calling data.

Efficient Inference System:


CPM.cu: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding (see the speculative-sampling sketch after this list).

ArkInfer: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities.
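As a rough illustration of the speculative-sampling component, here is a simplified greedy-verification sketch of the general technique (not CPM.cu's API): a small draft model proposes several tokens, and the target model verifies them in a single forward pass, keeping the longest agreeing prefix. draft_model and target_model are hypothetical callables returning logits of shape (batch, seq_len, vocab_size); a real implementation reuses KV caches and uses probabilistic accept/reject rather than exact matching.

import torch

def speculative_decode_step(draft_model, target_model, ids, n_draft=4):
    # Draft model proposes n_draft tokens greedily.
    draft = ids
    for _ in range(n_draft):
        logits = draft_model(draft)[:, -1]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft[:, ids.shape[1]:]                            # (1, n_draft)

    # One target-model pass scores every proposed position in parallel.
    target_logits = target_model(draft[:, :-1])
    target_pred = target_logits[:, ids.shape[1] - 1:].argmax(-1)  # (1, n_draft)

    # Accept the longest prefix where draft and target agree, then append the
    # target model's own token at the first disagreement.
    agree = (proposed == target_pred).squeeze(0).long()
    n_accept = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:, :n_accept]
    correction = target_pred[:, n_accept:n_accept + 1]
    return torch.cat([ids, accepted, correction], dim=-1)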

MiniCPM4 in Practice

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Users can call the chat interface directly
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)

# Or use the standard generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)

# Strip the prompt tokens from each output sequence before decoding
output_token_ids = [
    model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)

MiniCPM4-8B supports a sparse attention mechanism for efficient long-sequence inference, which requires the infllmv2_cuda_impl library.

The installation method is as follows:

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e .  # or python setup.py install

To enable InfLLM v2, add the following sparse_config field to the config.json file in the model directory:

{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
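If you prefer to patch the file programmatically, here is a minimal sketch; config_path is an assumed local path to the downloaded checkpoint directory, so adjust it to wherever the model is stored:

import json

config_path = "MiniCPM4-8B/config.json"   # assumed local checkpoint path

with open(config_path) as f:
    config = json.load(f)

# Add the InfLLM v2 sparse-attention settings shown above.
config["sparse_config"] = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,
    "dense_len": 8192,
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)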

Parameter Description:

kernel_size (default: 32): The size of the semantic kernel.

kernel_stride (default: 16): The stride between adjacent kernels.

init_blocks (default: 1): The number of initial blocks processed for each query token. This ensures attention is focused on the beginning of the sequence.

block_size (default: 64): The block size for key-value blocks.

window_size (default: 2048): The size of the local sliding window.

topk (default: 64): Specifies that each Token only uses the top-k most relevant key-value blocks for attention calculation.

use_nope (default: false): Whether to use NOPE technology in block selection to improve performance.

dense_len (default: 8192): Sparse attention offers limited benefit for short sequences, so the model uses standard (dense) attention for sequences shorter than this length and switches to sparse attention once the length is exceeded. Set this to -1 to always use sparse attention regardless of sequence length.
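A tiny sketch of the dense/sparse switch as described above (the function is illustrative, not part of the model's API):

def choose_attention_mode(seq_len, dense_len=8192):
    # dense_len = -1 means "always sparse"; otherwise short sequences stay dense.
    if dense_len == -1:
        return "sparse"
    return "dense" if seq_len < dense_len else "sparse"

print(choose_attention_mode(4096))                 # dense
print(choose_attention_mode(32768))                # sparse
print(choose_attention_mode(4096, dense_len=-1))   # sparse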

https://arxiv.org/pdf/2506.07900

https://huggingface.co/openbmb/MiniCPM4-8B

