DeepSeek R2 hasn't arrived yet, but this year's ACL Best Paper has already "spoiled" a glimpse of DeepSeek's next-generation model.
Yesterday, ACL, the top conference in the global natural language processing field, announced this year's Best Paper.
This conference is known as the "World Cup" of natural language processing: it serves as a bellwether for where large language models are headed over the next year or two, and the cutting-edge techniques that surface there are often quickly adopted across the industry. The Transformer architecture, which went on to revolutionize the entire AI field, first rose to prominence in this very community.
This year, a paper jointly completed by DeepSeek and Peking University won the "Best Paper Award": "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention".
Paper link🔗 https://arxiv.org/abs/2502.11089
As the title suggests, this is a very technical paper, packed with keywords: Sparse Attention, Hardware-Aligned, Trainable...
But even so, it is worth a careful read for anyone who cares about the future of large models, because it is the first work to take sparse attention from an inference-time trick to a complete, end-to-end training pipeline, while maintaining model performance and delivering up to an 11-fold inference speedup.
First author of the paper, Jingyang Yuan (third from left), a doctoral candidate from Peking University, and his supervisor Zhang Ming (first from right). Image source: https://x.com/aclmeeting/status/1950745647214161930
For DeepSeek, this is not just academic recognition, but possibly a technical preview for the deployment of its next-generation models.
Why is "Long Text" So Difficult? AI's "Attention Deficit Disorder"
To understand the brilliance of DeepSeek's technology, one must first grasp the "pain points" of current large models when processing long texts.
Currently, one of the core technologies behind every large model is the attention mechanism, introduced in the paper that laid the foundation for large language models: "Attention Is All You Need."
You can think of the attention mechanism as a student listening in class.
The traditional "Full Attention" mechanism is like a student with an excellent memory but extremely low efficiency. Every time the teacher says a new word (Query), the student has to compare this word with every word ever spoken since the first day of school (Keys/Values) to understand the new word's meaning.
The comparison chart in the paper shows that NSA (red) performs better than or on par with Full Attention (orange) in various benchmark tests, while achieving significant speed improvements across all stages, including decoding, forward, and backward propagation.
When the text is short, this is not a problem.
However, when the text reaches hundreds of thousands of words, the computational load of this "comparing every word with all preceding words" explodes quadratically. This not only makes the model's response extremely slow but also makes training and inference costs exorbitantly high.
This is why the large models we use today, even as their context windows keep growing, noticeably slow down once the input approaches those limits, and why long-context usage is priced higher.
The paper also mentions that for traditional attention mechanisms, at a context length of 64k, softmax attention (a module within the traditional attention mechanism) accounts for 70%–80% of the entire inference latency.
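To make this quadratic blow-up concrete, here is a minimal NumPy sketch of standard scaled dot-product attention (a generic illustration, not any particular model's implementation): the score matrix has one entry for every query-key pair, so doubling the context length quadruples both the memory and the compute required.

```python
import numpy as np

def full_attention(Q, K, V):
    """Vanilla scaled dot-product attention for a single head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): every token scored against every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

for n in (1_000, 2_000, 4_000):                      # context length in tokens
    d = 64
    Q = K = V = np.random.randn(n, d).astype(np.float32)
    full_attention(Q, K, V)
    print(f"n={n:>5}: score matrix holds {n * n:,} entries")
```

Going from 1,000 to 2,000 tokens quadruples the score-matrix entries from 1,000,000 to 4,000,000; that is exactly the scaling that makes 64k-token contexts so expensive.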
DeepSeek's Solution: "Grasping Key Points" Like Humans Do
To solve this problem, various sparse attention and related efficiency techniques have emerged recently.
This month's Kimi K2 technical report, for example, described an automatically adjusted QK-Clip mechanism alongside a design with a trillion total parameters but only tens of billions activated, keeping the sparsity training-friendly.
Manus likewise published a blog post this month laying out "six context engineering principles," including improving the KV-cache hit rate and using the file system as persistent context.
A 2024 paper summarized the context lengths supported by large language models at the time.
However, whether they limit attention by token distance or prune the KV cache, most of these approaches still share two problems:
1. They can only be used during the inference stage; full attention is still required during the training stage.
2. Sparse attention is fast in theory but slow in practice, especially in multi-GPU deployments and on hardware such as the A100/V100.
The idea behind sparse attention is simple: there is no need to look at every word; just focus on the important parts. But this is easier said than done. Many earlier methods either failed to deliver real speedups or dropped critical information and degraded performance.
Image source: https://x.com/casper_hansen_/status/1950649481617342803
The best paper from DeepSeek and Peking University introduces NSA (Native Sparse Attention), which tackles both problems. Its core idea is to mimic how a human reads a long report:
Skimming Summaries (Token Compression): First, NSA packages earlier content in long texts into "compressed blocks," allowing it to quickly grasp rough global information, much like reading chapter summaries. This ensures the model doesn't forget key premises mentioned hundreds of pages earlier.
In-depth Reading of Key Points (Token Selection): After understanding the general idea, the model will "select" the most relevant original detail blocks from before for in-depth reading, based on the current content to be processed. For example, when answering a question about Chapter 3, it will focus on the original text of Chapter 3, rather than scanning the entire document.
Strong Retention of Recent Information (Sliding Window): Just as we clearly remember the last few paragraphs we have read, NSA keeps a dedicated "sliding window" so that the most recent context always receives fine-grained attention.
NSA Architecture Overview: NSA acts like a smart reader, processing information in three ways (Compression, Selection, Sliding Window) and dynamically deciding which information is more important via a "gating mechanism."
Most ingeniously, NSA dynamically learns how to balance these three reading strategies through a "gating mechanism."
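To make the gating idea concrete, here is a toy PyTorch sketch (shapes and names are my own, not the paper's code): the gate scores are produced from the current token's features and squashed through a sigmoid, one gate per branch, and the final output is the gated sum of the three branch outputs.

```python
import torch

class GatedThreeBranch(torch.nn.Module):
    """Toy illustration of NSA's gated combination of three attention branches."""
    def __init__(self, d):
        super().__init__()
        # The gate scores come from the current token's features (a single linear
        # layer with sigmoid stands in here for the paper's MLP gate).
        self.gate = torch.nn.Linear(d, 3)

    def forward(self, x, out_compressed, out_selected, out_window):
        # x: (d,) features of the current token
        # out_*: (d,) outputs of the compression / selection / sliding-window branches
        g = torch.sigmoid(self.gate(x))   # (3,) one gate value in [0, 1] per branch
        return g[0] * out_compressed + g[1] * out_selected + g[2] * out_window

d = 128
mix = GatedThreeBranch(d)
x = torch.randn(d)
branches = [torch.randn(d) for _ in range(3)]   # stand-ins for the three branch outputs
print(mix(x, *branches).shape)                  # torch.Size([128])
```

In the real model the three inputs are the attention outputs over compressed blocks, selected blocks, and the sliding window; the sketch only shows how the learned gate blends them.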
Furthermore, NSA is "natively trainable," meaning the model learns this efficient way of allocating attention from the very start of pre-training, rather than having a sparse mechanism bolted on at inference time, after the model has already been trained.
This allows NSA's sparse mode to perfectly synergize with other parts of the model, ultimately achieving a dual leap in performance and efficiency.
Test Results: Faster Training, Stronger Inference, Performance Boost Instead of Decline
DeepSeek proved NSA's powerful capabilities with detailed experimental data in the paper.
Performance Boost Instead of Decline: In a series of standard tests for general knowledge, reasoning, and coding abilities such as MMLU and GSM8K, the 27B model equipped with NSA outperformed traditional full attention models in 7 out of 9 metrics.
Especially in the DROP and GSM8K tests, which assess reasoning ability, the improvement was significant. This indicates that by sparsifying and filtering out noisy information, the model might instead focus more on key logic.
Outstanding Long Text Understanding: In the classic "Needle in a Haystack" test, NSA achieved 100% retrieval accuracy over ultra-long contexts of 64k tokens (roughly 80,000 characters), precisely locating the target information no matter where it was hidden.
In the more complex LongBench evaluation, NSA's average score also surpassed most baseline methods, including full attention.
Blazing Fast Speed: This is the most exciting part. Compared with FlashAttention-2, one of the most efficient full attention implementations available, NSA achieves the following when processing 64k-length sequences:
Comparison of Triton-based NSA kernel with Triton-based FlashAttention-2 kernel. NSA's implementation significantly reduces latency across all context lengths, and the improvement becomes more pronounced as input length increases.
Training Speed: Forward computation accelerated by 9.0 times, backpropagation accelerated by 6.0 times. This means significantly improved efficiency in training new models.
Inference Speed: For the generation and response phase (decoding), which users care about most, the speed increased by an astonishing 11.6 times.
This means that long analyses that used to take half a minute to obtain might be completed in just a few seconds in the future.
The Future of DeepSeek: Faster, Stronger, Cheaper?
Context length is becoming the battlefield for new capabilities in large models. Whether it's cross-file code completion, long document summarization, or complex multi-turn dialogues, models need to quickly locate, understand, and reason within context lengths of hundreds of thousands or even millions of tokens.
This research, led by researchers from DeepSeek and Peking University, almost certainly means that NSA technology will become one of the core competencies of future DeepSeek large language models.
NSA's kernel design ensures that the GPU always computes on data held in its fastest on-chip memory (SRAM).
NSA has completed full pre-training validation on a 27B-parameter MoE model. The training framework is based on DeepSeek's self-developed MoE system, the design is compatible with the GQA architecture and the FlashAttention-2 kernel, and the key kernels have been rewritten in Triton (OpenAI's open-source GPU programming language).
This means it is not just a "possible" research but a "ready for deployment" system module.
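The actual kernels are written in Triton, but the scheduling idea behind "hardware-aligned" selection can be sketched in plain PyTorch (function and variable names below are mine, purely for illustration): all query heads that share one KV group score the candidate blocks together and then load the same top-scoring blocks, so each block is read from memory once per group rather than once per head.

```python
import torch

def pick_shared_blocks(q_heads, block_summary_keys, top_n=4):
    """
    Simplified illustration of NSA-style block selection (not the paper's kernel).
    q_heads:            (H, d) queries of all heads in one GQA group at the current step
    block_summary_keys: (B, d) one summary key per KV block (from the compression branch)
    Returns indices of the top_n blocks the whole group will attend to, so every head
    loads the same contiguous blocks -- a memory-access pattern GPUs handle well.
    """
    scores = (q_heads @ block_summary_keys.T).softmax(dim=-1)  # (H, B) per-head block scores
    group_scores = scores.sum(dim=0)                           # pool importance across the group
    top_n = min(top_n, group_scores.numel())
    return group_scores.topk(top_n).indices

# Example: a group of 4 query heads choosing 4 of 32 candidate blocks
idx = pick_shared_blocks(torch.randn(4, 64), torch.randn(32, 64))
print(sorted(idx.tolist()))
```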
For ordinary users like us, this means we will eventually be able to hand an entire book, dozens of financial reports, or a complete GitHub project's codebase to the AI for in-depth analysis, summarization, and Q&A, without splitting it up by hand.
DeepSeek's response speed will also be faster, and the significant improvement in computational efficiency will eventually translate into lower API prices, reducing our usage costs.
Comparison of some model pricing, image source: https://artificialanalysis.ai/
From being a "price butcher" to a technology leader, DeepSeek is steadily building its moat through solid technological innovations like NSA.
This appears to be not just a victory for academia but a call to action for the entire AI application ecosystem to accelerate once again.
Now, let's wait and see what surprises the next generation of DeepSeek large models, equipped with "Native Sparse Attention," will bring us.