ModelBest's MiniCPM4 runs inference 3x faster, outpacing the same-size Qwen3 and putting pressure even on Alibaba

ModelBest has released the MiniCPM4 series, which comprises 10 models, including MiniCPM4-8B and MiniCPM4-MCP.

Fast inference remains the headline feature of MiniCPM4.

MiniCPM4

Overall architecture of MiniCPM4

Key innovations in MiniCPM4

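Efficient model architecture:

InfLLM v2: a trainable sparse attention mechanism. When processing 128K-token long texts, each token computes relevance against fewer than 5% of the tokens, which significantly reduces the computational cost of long-context processing.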
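To make the idea of block-level sparse attention concrete, here is a minimal pure-Python sketch of top-k block selection for a single query. It is not the InfLLM v2 implementation: scoring each block by the dot product with its mean key is a simplifying assumption, and the function name `select_blocks` and its parameters are hypothetical.

```python
import math
import random

def select_blocks(query, keys, block_size=64, topk=2, init_blocks=1, window_blocks=1):
    """Return sorted indices of the key-value blocks one query attends to.

    Assumption: a block's relevance is the dot product between the query
    and the block's mean key (a stand-in for a learned block representation).
    """
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append(sum(q * k for q, k in zip(query, mean_key)))
    # Always keep the first blocks and the most recent (local window) blocks.
    forced = set(range(min(init_blocks, n_blocks)))
    forced |= set(range(max(0, n_blocks - window_blocks), n_blocks))
    # Fill the remaining budget with the highest-scoring other blocks.
    rest = sorted((b for b in range(n_blocks) if b not in forced),
                  key=lambda b: scores[b], reverse=True)
    return sorted(forced | set(rest[:topk]))

rng = random.Random(0)
dim, seq_len = 8, 1024
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
query = [rng.gauss(0, 1) for _ in range(dim)]

selected = select_blocks(query, keys)
fraction = len(selected) * 64 / seq_len
print(selected, f"{fraction:.0%} of the keys are attended")
```

In this toy setting the query touches 4 of 16 blocks. With a much longer sequence and a fixed block budget, the attended fraction shrinks further, which is where the savings on long texts come from.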

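Efficient learning algorithms:

Model Wind Tunnel 2.0: introduces a scaling-prediction method for downstream task performance, enabling a more precise search over model training configurations.

BitCPM: compresses model parameters to ternary values (three possible values per parameter), cutting the parameter bit width by about 90%.

Training engineering optimizations: FP8 low-precision computation combined with a Multi-token Prediction training strategy.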
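The 90% figure quoted for BitCPM is consistent with simple arithmetic: a three-valued parameter carries log2(3) ≈ 1.58 bits of information, compared with 16 bits for a BF16 weight. The sketch below checks that number and shows one way to map weights to three values. The 16-bit baseline and the absmean rounding rule are assumptions for illustration, not BitCPM's documented procedure, and the function names are hypothetical.

```python
import math

def ternary_quantize(weights):
    """Map floats to codes in {-1, 0, +1} plus one shared scale.

    Assumption: scale by the mean absolute value, then round and clip.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate weights from ternary codes."""
    return [c * scale for c in codes]

codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.3])
print(codes)  # [1, 0, -1, 0]

bits_per_weight = math.log2(3)           # information content of one ternary value
reduction = 1 - bits_per_weight / 16     # relative to an assumed 16-bit baseline
print(f"{bits_per_weight:.2f} bits per weight, {reduction:.1%} smaller than 16-bit")
```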

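High-quality training data:

UltraClean: builds an iterative data-cleaning strategy on top of efficient data verification, and open-sources Ultra-FineWeb, a high-quality Chinese and English pre-training dataset.

UltraChat v2: a large-scale, high-quality supervised fine-tuning dataset covering knowledge-intensive data, reasoning-intensive data, instruction-following data, long-text understanding data, and tool-calling data.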

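Efficient inference systems:

CPM.cu: integrates sparse attention, model quantization, and speculative sampling to deliver efficient prefilling and decoding.

ArkInfer: supports efficient deployment across multiple backend environments and provides flexible cross-platform adaptation.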
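For readers unfamiliar with speculative sampling, the following toy sketch shows the greedy variant of the idea: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with. This is a generic illustration, not CPM.cu's algorithm; `target_next` and `draft_next` are hypothetical stand-ins for real models, and the usual bonus token after a fully accepted draft is omitted for brevity.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding.

    The output is identical to plain greedy decoding with target_next;
    the draft model only changes how many target checks can be batched.
    """
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        draft = []
        for _ in range(min(k, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target model verifies the proposals (one batched pass in a real system).
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)      # keep the target's token either way
            if proposed != expected:
                break                      # discard the rest of the draft
        tokens.extend(accepted)
    return tokens

# Toy "models": the target counts modulo 10; the draft is wrong after a 4.
target = lambda t: (t[-1] + 1) % 10
draft = lambda t: 0 if t[-1] == 4 else (t[-1] + 1) % 10

result = speculative_decode(target, draft, [0], max_new_tokens=12)
print(result)
```

Because every emitted token is the one the target model would have chosen, the draft model affects speed only, never the output.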

MiniCPM4 in practice

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Option 1: use the chat interface directly
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)

# Option 2: use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)

# Strip the prompt tokens so that only the newly generated tokens are decoded
output_token_ids = [
    model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)

MiniCPM4-8B supports sparse attention for efficient long-sequence inference. This requires the infllmv2_cuda_impl library.

Install it as follows:

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install

To enable InfLLM v2, add a sparse_config field to the config.json file in the model directory:

{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
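If you would rather not edit the file by hand, a small script can add the field. The sketch below uses only the Python standard library; the helper name `enable_infllm_v2` is hypothetical, and the path in the usage note is a placeholder for wherever your local copy of the model lives.

```python
import json
from pathlib import Path

# Values copied from the sparse_config fragment above.
SPARSE_CONFIG = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,
    "dense_len": 8192,
}

def enable_infllm_v2(config_path, sparse_config=SPARSE_CONFIG):
    """Add (or overwrite) the sparse_config field in a model's config.json."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["sparse_config"] = dict(sparse_config)
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n",
                    encoding="utf-8")
    return config
```

Usage would look like `enable_infllm_v2("/path/to/MiniCPM4-8B/config.json")`. All other fields in the file are left untouched.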

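Parameter descriptions:

kernel_size (default: 32): the size of the semantic kernels.

kernel_stride (default: 16): the stride between adjacent kernels.

init_blocks (default: 1): the number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.

block_size (default: 64): the block size for key-value blocks.

window_size (default: 2048): the size of the local sliding window.

topk (default: 64): each token computes attention over only the top-k most relevant key-value blocks.

use_nope (default: false): whether to use the NOPE technique in block selection to improve performance.

dense_len (default: 8192): sparse attention offers limited benefit on short sequences, so the model can use standard (dense) attention for shorter texts. Sequences with fewer tokens than dense_len use dense attention, and longer sequences switch to sparse attention. Set this to -1 to always use sparse attention regardless of sequence length.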
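As a rough sanity check, the default values above can be tied back to the "fewer than 5%" claim for 128K contexts. The sketch below estimates an upper bound on how many key-value tokens one query attends to. It assumes the initial blocks, the sliding window, and the top-k blocks never overlap, and it treats a sequence of exactly dense_len tokens as sparse; both are assumptions, and the function names are hypothetical.

```python
SPARSE_CONFIG = {
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "dense_len": 8192,
}

def uses_sparse_attention(seq_len, dense_len):
    """Dense below dense_len, sparse otherwise; -1 forces sparse attention."""
    if dense_len == -1:
        return True
    return seq_len >= dense_len  # boundary handling is an assumption

def attended_tokens_upper_bound(seq_len, cfg=SPARSE_CONFIG):
    """Upper bound on the key-value tokens a single query attends to."""
    if not uses_sparse_attention(seq_len, cfg["dense_len"]):
        return seq_len  # dense attention sees every token
    budget = (cfg["init_blocks"] + cfg["topk"]) * cfg["block_size"] + cfg["window_size"]
    return min(seq_len, budget)

seq_len = 128 * 1024
attended = attended_tokens_upper_bound(seq_len)
print(attended, f"{attended / seq_len:.1%}")  # 6208 4.7%
```

Under these assumptions a query in a 128K-token sequence attends to at most 6,208 tokens, about 4.7% of the context, while a 4K-token prompt stays on the dense path.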

https://arxiv.org/pdf/2506.07900

https://huggingface.co/openbmb/MiniCPM4-8B
