Chinese Team Trains "Spiking Large Model," Boosting Inference Speed by 100 Times

The Chinese Academy of Sciences Institute of Automation (CASIA) recently unveiled a groundbreaking project: SpikingBrain, billed as a "brain-inspired large model."

Simply put, it brings the working mechanism of biological neurons into an AI model. Neurons in the brain "don't work unless activated": they fire only when a signal arrives. This is the so-called "spiking" mechanism.

Traditional Transformer models have a critical flaw: computational complexity grows quadratically, O(n²), with sequence length. Processing the text of an entire book can therefore take an extremely long time.

SpikingBrain uses three methods to solve this problem (a minimal code sketch follows the list):

1. Linear Attention: Reduces the original O(n²) computational complexity to O(n). Processing 1 million tokens, which would originally take on the order of one trillion attention operations, now needs only on the order of one million.

2. Spiking Encoding: Converts continuous values into discrete pulses (spikes). For example, instead of directly computing 5 × weight for the number 5, the model emits 5 pulses, and each pulse triggers only an addition. The team estimates this can cut energy consumption by 97.7%.

3. Mixture of Experts (MoE): In the 76B-parameter model, only 12B parameters are active at any one time. This mirrors how different regions of the human brain handle different functions, rather than all neurons firing simultaneously.
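
To make the three mechanisms concrete, here is a minimal NumPy sketch. It is not the SpikingBrain code: the feature map, the pulse encoding, and the routing scheme are simplified assumptions meant only to illustrate the ideas.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized linear attention: build K^T V once (d x d), then multiply by Q.
    Work grows linearly with sequence length n instead of quadratically."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6      # simple positive feature map (an assumption)
    Qf, Kf = phi(Q), phi(K)                        # both (n, d)
    KV = Kf.T @ V                                  # (d, d), independent of n
    norm = Qf @ Kf.sum(axis=0, keepdims=True).T    # (n, 1) normalizer
    return (Qf @ KV) / norm                        # (n, d)

def spike_multiply(value, weight):
    """Spiking-style multiply: emit `value` unit pulses and accumulate the weight
    by addition, so 5 * weight becomes five additions of weight."""
    total = 0.0
    for _ in range(int(value)):                    # one addition per pulse
        total += weight
    return total

def moe_forward(x, experts, router_logits, k=2):
    """Sparse Mixture-of-Experts: route the input to the top-k experts only,
    so just a fraction of the total parameters is active per token."""
    top = np.argsort(router_logits)[-k:]           # indices of the selected experts
    gates = np.exp(router_logits[top])
    gates = gates / gates.sum()                    # softmax over the chosen experts
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Tiny usage example with random data (shapes are illustrative).
n, d = 8, 4
Q, K, V = np.random.randn(3, n, d)
print(linear_attention(Q, K, V).shape)             # (8, 4)
print(spike_multiply(5, 0.3))                      # ~1.5
experts = [lambda x, s=s: s * x for s in (0.5, 1.0, 2.0)]
print(moe_forward(np.ones(d), experts, np.random.randn(3), k=2))
```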

They released two models:

  • SpikingBrain-7B: Purely linear model
  • SpikingBrain-76B: Hybrid model with MoE (12B active parameters)

With a 4M-token input (roughly 4 million characters), the 7B model's Time to First Token (TTFT) was more than 100 times faster than the original Qwen2.5: on the order of 1 second versus 100 seconds, a massive difference.
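
For context, a hedged back-of-envelope comparison (not the team's measurement) shows why the first-token gap widens with prompt length; the asymptotic ratio is far larger than the observed 100x, which is expected once constant factors, the non-attention layers, and memory bandwidth are taken into account.

```python
# Back-of-envelope only, not the team's measurement.
n = 4_000_000                        # 4M-token prompt
quadratic_work = n * n               # pairwise attention scores in a standard Transformer
linear_work = n                      # linear attention grows with n (times a constant)
print(quadratic_work // linear_work) # 4,000,000x difference in raw operation counts
```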

More interestingly, they used only about 150B tokens for training (versus the roughly 10T the original model required), around 2% of the data volume, yet reached about 90% of the original model's performance.

This project also holds special significance: the entire training process was conducted on domestic Chinese GPUs from MetaX (沐曦).

The MetaX C550 GPU cluster ran continuously for two weeks without interruption, successfully training the 76B parameter model. This proves that non-NVIDIA platforms can also be used to train large models.

They rewrote a significant amount of CUDA code, adapted Triton operators, and specifically optimized the communication framework. MFU (Model FLOPs Utilization) reached 23.4%, which counts as a solid result on domestic hardware.
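
For readers unfamiliar with the metric, MFU is the fraction of the hardware's theoretical peak FLOPs that the training run actually sustains. A minimal sketch follows; the 6 × parameters × tokens rule of thumb and the placeholder numbers are assumptions for illustration, not figures reported by the team (apart from the 23.4% they cite).

```python
def mfu(params, tokens_per_second, peak_flops_per_gpu, num_gpus):
    """Model FLOPs Utilization: achieved training FLOPs/s over the hardware's peak.
    Uses the common ~6 * params FLOPs-per-token approximation for training."""
    achieved = 6 * params * tokens_per_second
    return achieved / (peak_flops_per_gpu * num_gpus)

# Hypothetical numbers for illustration only; not the cluster's actual specs.
print(f"{mfu(params=76e9, tokens_per_second=50_000, peak_flops_per_gpu=100e12, num_gpus=1000):.1%}")
```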

This technology is best suited for two scenarios:

1. Ultra-long Text Processing: Applications that handle hundreds of thousands of characters, such as legal documents, academic papers, or long-form fiction writing, see significant speed advantages.

2. Edge Device Deployment: They deployed the 1B model on a CPU and achieved a 15-fold speedup at a 256k sequence length, which puts mobile phones and embedded devices within reach.

Several points are worth considering about this work:

First, we don't necessarily have to stick rigidly to the Transformer architecture. Although linear attention is theoretically less precise than quadratic attention, the practical difference is not that large.

Second, biological inspiration remains useful. The brain thinks on roughly 20 watts of power, while GPUs often draw several kilowatts; the gap is enormous. The spiking mechanism points to one way of cutting power consumption.

Finally, domestic hardware substitution is no longer just a dream. There is still a gap compared to NVIDIA, but this at least proves feasibility.

Main Tag: Artificial Intelligence

Sub Tags: Neuromorphic Computing, Chinese Tech, Computational Efficiency, Large Language Models

