Mianbi Intelligence (OpenBMB) has released the MiniCPM4 series: ten models in total, including MiniCPM4-8B and MiniCPM4-MCP.
MiniCPM4's defining characteristic is fast inference.
MiniCPM4
MiniCPM4's Overall Architecture
MiniCPM4's Innovations
Efficient Model Architecture:
InfLLM v2: A trainable sparse attention architecture; when processing 128K-long texts, each token computes relevance scores against fewer than 5% of the tokens, sharply reducing the attention cost on long inputs (a minimal sketch of the block-selection idea appears after this list).
Efficient Learning Algorithms:
Model Wind Tunnel 2.0: Introduces a method for predicting downstream-task performance under scaling, enabling more precise searches over model training configurations.
BitCPM: Quantizes model parameters down to just three values (ternary quantization), an extreme scheme that reduces parameter bit-width by roughly 90% (see the ternary sketch after this list).
Training Engineering Optimization: Adopts FP8 low-precision computation technology combined with a Multi-token Prediction training strategy.
High-Quality Training Data:
UltraClean: Builds an iterative data-cleaning strategy around efficient data verification, and open-sources the resulting high-quality Chinese-English pre-training dataset UltraFineWeb.
UltraChat v2: Constructs a large-scale, high-quality supervised fine-tuning dataset covering knowledge-intensive, reasoning-intensive, instruction-following, long-text understanding, and tool-calling data.
Efficient Inference System:
CPM.cu: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient pre-filling and decoding.
ArkInfer: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities.
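To make the InfLLM v2 bullet concrete, here is a minimal PyTorch sketch of top-k block-sparse attention: keys and values are grouped into fixed-size blocks, each block is summarized by its mean key, and a query attends only to the highest-scoring blocks. The function, shapes, and parameter values are illustrative assumptions, not the actual InfLLM v2 kernel, which additionally uses semantic kernels, initial blocks, and a local sliding window.

import torch
import torch.nn.functional as F

def topk_block_sparse_attention(q, k, v, block_size=64, topk=8):
    # Group keys/values into fixed-size blocks: (n_blocks, block_size, d).
    # Assumes seq_len is divisible by block_size, for simplicity.
    d = q.shape[-1]
    k_blocks = k.view(-1, block_size, d)
    v_blocks = v.view(-1, block_size, d)
    # Score each block cheaply via its mean key, then keep only the top-k blocks.
    block_scores = k_blocks.mean(dim=1) @ q                 # (n_blocks,)
    top = block_scores.topk(min(topk, block_scores.numel())).indices
    k_sel = k_blocks[top].reshape(-1, d)                    # keys from selected blocks
    v_sel = v_blocks[top].reshape(-1, d)                    # values from selected blocks
    # Standard scaled dot-product attention, but only over the selected subset.
    attn = F.softmax((k_sel @ q) / d ** 0.5, dim=-1)
    return attn @ v_sel

# 8 blocks x 64 tokens = 512 of 8192 keys attended densely (~6% of the sequence).
q, k, v = torch.randn(128), torch.randn(8192, 128), torch.randn(8192, 128)
print(topk_block_sparse_attention(q, k, v).shape)  # torch.Size([128])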
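And a toy illustration of what "three values" means for BitCPM, using BitNet-style absmean ternary rounding as an assumed stand-in. This one-shot rounding only sketches the representation; BitCPM itself obtains ternary weights through quantization-aware training, not post-hoc rounding like this.

import torch

def ternary_quantize(w: torch.Tensor):
    # Per-tensor absmean scale, as popularized by BitNet-style methods.
    scale = w.abs().mean().clamp(min=1e-8)
    # Round to exactly three values: -1, 0, +1.
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale  # dequantize as w_q * scale

w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
print(w_q)          # entries are only -1.0, 0.0, or 1.0
print(w_q * scale)  # ternary reconstruction of w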
MiniCPM4 in Practice
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Users can call the chat interface directly:
# response, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(response)

# Or use the standard generate interface:
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)

# Slice off the prompt so only the newly generated tokens are decoded.
output_token_ids = [
    model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
MiniCPM4-8B supports a sparse attention mechanism (InfLLM v2) for efficient long-sequence inference. It depends on the infllmv2_cuda_impl library, installed as follows:
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install
To enable InfLLM v2, add a sparse_config field to the config.json in the model directory:
{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
Parameter Description:
kernel_size (default: 32): The size of the semantic kernel.
kernel_stride (default: 16): The stride between adjacent kernels.
init_blocks (default: 1): The number of initial blocks each query token always attends to, ensuring the beginning of the sequence remains visible to attention.
block_size (default: 64): The block size for key-value blocks.
window_size (default: 2048): The size of the local sliding window.
topk (default: 64): Each token attends only to its top-k most relevant key-value blocks.
use_nope (default: false): Whether to use the NOPE technique during block selection to improve performance.
dense_len (default: 8192): Sparse attention brings limited benefit on short sequences, so the model can fall back to standard (dense) attention there: sequences shorter than dense_len tokens use dense attention, while longer ones switch to sparse attention. Set this to -1 to always use sparse attention regardless of sequence length.
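To avoid hand-editing JSON, the sparse_config block can also be merged into the model's config.json with a few lines of Python; the local checkpoint path below is a placeholder assumption, not a fixed location:

import json
from pathlib import Path

# Hypothetical path to a locally downloaded MiniCPM4-8B checkpoint; adjust as needed.
config_path = Path("MiniCPM4-8B/config.json")

config = json.loads(config_path.read_text())
config["sparse_config"] = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,   # serialized as JSON false
    "dense_len": 8192,
}
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))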
Paper: https://arxiv.org/pdf/2506.07900
Model: https://huggingface.co/openbmb/MiniCPM4-8B