Deep Learning: Mamba Core Author's New Work Replaces DeepSeek's Attention Mechanism, Designed for Inference

Tri Dao, one of the authors of Mamba, the architecture that once shook the Transformer's dominance, has just released new work proposing two attention mechanisms tailored specifically for inference.

While maintaining model quality, these mechanisms can double decoding speed and throughput, substantially improving the model's long-context inference capability.


The three authors of this research are all from Princeton University, and the paper has two main contributions:

First, it proposes Grouped-Tied Attention (GTA), which is comparable in quality to the GQA attention mechanism already integrated into Llama 3 but reduces KV cache usage by about 50%.

Second, it proposes Grouped Latent Attention (GLA), which matches the quality of the MLA attention mechanism used by DeepSeek but offers faster decoding, in some cases up to 2 times faster than FlashMLA.

According to Ted Zadouri, one of the authors:

GTA is an effective alternative to GQA, and GLA is a practical alternative to MLA.


In a nutshell, by optimizing the attention mechanism's memory usage and compute pattern, this work significantly improves the inference efficiency and hardware utilization of large language models without sacrificing generation quality, with a particularly pronounced advantage in long-context scenarios.

After the related paper was published, many researchers offered their congratulations.


So, what exactly does this research cover?

Introducing Inference-Aware Attention Mechanisms

In summary, the paper's core contribution is the introduction of inference-aware attention mechanisms: attention is redesigned to address memory redundancy, computational inefficiency, and long-context bottlenecks during the model inference phase.

According to Tri Dao, this research started with an idea:

In an era where AI development is driven by inference, what should the "ideal" architecture look like?

Especially when it comes to long-context inference, current Large Language Models (LLMs) face two major challenges: memory access bottlenecks and parallelism limitations.

This means that when the model generates text, it must fetch a large amount of "historical data" (the KV cache) from memory at every step, which not only slows down the generation of each token but also makes it hard to spread the work efficiently across multiple chips.

To address this, the team decided to redesign the attention mechanism in two directions:

Higher hardware efficiency: by increasing the "computation per byte of memory loaded" (arithmetic intensity), reduce reliance on memory bandwidth (a rough calculation of this quantity follows this list);

Maintaining parallel scalability: optimize decoding speed without sacrificing the model's parallel training/inference capabilities.
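To make "arithmetic intensity" concrete, here is a rough back-of-the-envelope sketch (our own illustration, not from the paper; the head counts, head dimension, and FP16 byte size are assumptions) of FLOPs per byte of KV cache loaded for one decoding step:

```python
# Rough arithmetic-intensity estimate for one attention decoding step.
# All head counts, dims, and dtype sizes below are illustrative assumptions.

def decode_arithmetic_intensity(n_q_heads, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """FLOPs per byte of KV cache loaded when decoding a single token."""
    # QK^T and softmax(QK^T)V each cost ~2 * head_dim FLOPs per query head per cached position.
    flops = 2 * 2 * n_q_heads * ctx_len * head_dim
    # Every cached K and V vector must be read once from HBM.
    bytes_loaded = 2 * n_kv_heads * ctx_len * head_dim * bytes_per_elem
    return flops / bytes_loaded

# More query heads sharing fewer KV heads => more compute per byte loaded.
print(decode_arithmetic_intensity(32, 32, 128, 8192))  # MHA-like: ~1 FLOP/byte
print(decode_arithmetic_intensity(32, 8, 128, 8192))   # GQA-like: ~4 FLOPs/byte
```

Raising this ratio is exactly the lever GTA and GLA pull: the less cache that has to be read per unit of compute, the less decoding is bottlenecked by memory bandwidth.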

The resulting GTA and GLA reduce KV cache usage while maintaining model quality comparable to existing solutions and significantly improving decoding speed.

The "existing solutions" mentioned here primarily refer to two methods well-known in academia:

First is the Grouped Query Attention (GQA) mechanism, which reduces memory footprint by having groups of query heads share key/value heads, and hence the KV cache. It is well suited to large-scale serving and is currently used in open-source models such as Llama 3.
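As a quick illustration of the grouping idea, the following minimal PyTorch sketch shows a GQA-style decode step in which groups of query heads read the same cached key/value head; the head counts and dimensions are illustrative assumptions, not Llama 3's actual configuration:

```python
import torch
import torch.nn.functional as F

# Minimal GQA-style decode step: 8 query heads share 2 KV heads (illustrative sizes).
batch, n_q_heads, n_kv_heads, head_dim, ctx_len = 1, 8, 2, 64, 128
group_size = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, 1, head_dim)           # current token's queries
k_cache = torch.randn(batch, n_kv_heads, ctx_len, head_dim)
v_cache = torch.randn(batch, n_kv_heads, ctx_len, head_dim)

# Expand each KV head so every query head in a group reads the same cache entry.
k = k_cache.repeat_interleave(group_size, dim=1)          # (1, 8, ctx_len, head_dim)
v = v_cache.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v)             # (1, 8, 1, head_dim)
print(out.shape)
```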

Second is the Multi-head Latent Attention (MLA) mechanism, introduced by DeepSeek (first in DeepSeek-V2) and popularized by its subsequent models. It compresses each token's keys and values into a shared low-rank latent vector, so only this compact latent needs to be cached, greatly reducing KV cache size.

However, GQA still needs to store separate key and value states for each query group, and MLA's single shared latent limits parallelization across devices during decoding, so further improvements are needed.

Below, we will elaborate on the team's new methods, GTA and GLA.

Grouped-Tied Attention (GTA)

The core design idea of GTA is to tie the key and value states together and reuse them across query heads, reducing the number of memory transfers.

Specifically (right image), it divides the heads of multi-head attention into several groups, and the heads within each group share the same key and value states. During computation, heads in the same group read the same KV cache; only their query parameters remain independent.

In contrast, traditional Multi-Head Attention (MHA, middle image) keeps independent keys and values for every query head, so with no sharing, far more memory is needed to store them all.

GQA (left image) also shares KV within groups, but each group still stores separate key and value states, whereas GTA goes further and ties the key and value into a single shared state, achieving more thorough KV reuse.

[Figure: comparison of GQA (left), MHA (middle), and GTA (right)]
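To illustrate the tying idea, here is a deliberately simplified PyTorch sketch of a GTA-style decode step in which each group caches a single tied state used as both key and value; it omits details of the paper's actual formulation (such as how rotary position embeddings are handled on the key side), and all sizes are assumptions:

```python
import torch
import torch.nn.functional as F

# Simplified GTA-style decode step: heads in a group share ONE cached state that
# serves as both key and value ("tied"), roughly halving the cache versus GQA,
# which stores separate K and V per group. Illustrative sketch only.
batch, n_q_heads, n_groups, head_dim, ctx_len = 1, 8, 2, 64, 128
group_size = n_q_heads // n_groups

q = torch.randn(batch, n_q_heads, 1, head_dim)
kv_cache = torch.randn(batch, n_groups, ctx_len, head_dim)   # one tied state per group

# Every query head in a group attends against the same tied state,
# which is used both as keys and as values.
kv = kv_cache.repeat_interleave(group_size, dim=1)           # (1, 8, ctx_len, head_dim)
out = F.scaled_dot_product_attention(q, kv, kv)              # tied K = V
print(out.shape)  # (1, 8, 1, 64)

# Cache per token: n_groups * head_dim values, versus 2 * n_groups * head_dim for GQA.
```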

Grouped Latent Attention (GLA)

GLA's design combines two key ideas:

Latent compression: as in MLA, the keys and values are compressed into low-dimensional latent representations, so only this compact latent needs to be cached;

Grouped head mechanism: the latent is split into groups, and each group of query heads shares its own latent head, which allows the latent cache to be sharded across devices.

During decoding, compared with MLA (left image), GLA reduces the amount of KV cache each device needs to load: instead of every device reading the full shared latent, each device reads only the latent group assigned to it, which cuts memory access.

And because the amount of KV cache on each device is reduced, more requests can be processed simultaneously.

[Figure: decoding with MLA (left) vs. GLA (right)]
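The sketch below illustrates, in plain PyTorch on a single process, how a GLA-style decode step could shard the latent cache by group, so that each "device" (here just a list entry) loads only its own latent shard. The dimensions are illustrative assumptions, and a real implementation would also apply the MLA-style up/down projections and fused kernels that are omitted here:

```python
import torch
import torch.nn.functional as F

# Simplified GLA-style decode step. The latent KV cache is split into groups;
# each group of query heads attends only to its own latent shard, so under
# tensor parallelism a device loads just 1/n_groups of the latent cache.
batch, n_q_heads, n_groups, latent_dim, ctx_len = 1, 8, 2, 256, 128
heads_per_group = n_q_heads // n_groups

# Queries already projected ("absorbed") into the latent space, as in MLA decoding.
q_latent = torch.randn(batch, n_q_heads, 1, latent_dim)

# One latent cache shard per group -- this is what a single device would hold.
latent_cache = [torch.randn(batch, 1, ctx_len, latent_dim) for _ in range(n_groups)]

outputs = []
for g in range(n_groups):
    q_g = q_latent[:, g * heads_per_group:(g + 1) * heads_per_group]  # this group's heads
    c_g = latent_cache[g].expand(-1, heads_per_group, -1, -1)         # shared latent shard
    outputs.append(F.scaled_dot_product_attention(q_g, c_g, c_g))     # attend in latent space
out = torch.cat(outputs, dim=1)  # (1, 8, 1, latent_dim); a real model up-projects this
print(out.shape)
```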

Effective Alternatives to "GQA and MLA"

So, how effective are GTA and GLA?

The team conducted experiments on models at four scales: small (183M), medium (433M), large (876M), and XL (1471M). These models were trained on the FineWeb-Edu-100B dataset, using a GPT-3-style architecture and the Llama 3 tokenizer.

The tested metrics are mainly divided into two categories:

Quality metrics: Perplexity, downstream task accuracy (7 benchmarks like Winogrande, SciQ);

Efficiency metrics: Per-Token decoding latency, throughput, KV cache occupancy.

Experiments compared GQA, MLA, FlashMLA, traditional MHA, and other attention mechanisms.

Perplexity experiments showed that GTA outperformed GQA on the medium and larger models, suggesting that GTA may be better suited to further model scaling; GLA was comparable to MLA in most scenarios, indicating that GLA's design is sound and strikes a good balance between parallel computation and model quality.


The overall performance gap between the schemes on downstream tasks (covering typical commonsense reasoning, logical reasoning, and knowledge Q&A scenarios) was not significant.

But in terms of the trend (shown in the figures below), GTA and GLA maintain or improve downstream task performance as models scale from medium to XL.

[Figures: downstream task accuracy across model scales]

Regarding KV cache, without sacrificing model quality, GTA reduces KV cache by about 50% compared to GQA, verifying the effectiveness of "parameter tying + grouped reuse".
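As a rough sanity check of the roughly 50% figure, the snippet below compares per-token cache sizes under assumed head counts and FP16 precision, ignoring any small per-group positional components; these are illustrative numbers, not the paper's exact model configurations:

```python
# Per-token KV cache size (bytes) under illustrative settings: 8 KV groups,
# head_dim 128, FP16 (2 bytes). Not the paper's exact model configurations.
n_kv_groups, head_dim, bytes_per_elem = 8, 128, 2

gqa_bytes = 2 * n_kv_groups * head_dim * bytes_per_elem   # separate K and V per group
gta_bytes = 1 * n_kv_groups * head_dim * bytes_per_elem   # one tied KV state per group

print(f"GQA: {gqa_bytes} B/token, GTA: {gta_bytes} B/token "
      f"({100 * (1 - gta_bytes / gqa_bytes):.0f}% smaller)")  # ~50% reduction
```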

At the same time, for a query length of 1, MLA is already close to the compute bound (reaching 610 TFLOPS), while GLA has not yet saturated compute resources (360 TFLOPS).

And as the sequence length grows from 1K to 64K, GLA's decoding is up to 2 times faster than FlashMLA.

In addition, in real-time serving tests with 64 concurrent requests, GLA consistently delivered higher output throughput (higher is better) than MLA under the same parallelism scheme.


Next, the team also compared the output throughput of GLA and MLA on the DeepSeek Coder V2 Base (236B) model when using FP8 precision, under different prefill lengths and decoding lengths.

The results show that GLA-8's output throughput was significantly higher than MLA's at prefill lengths of 32K and 64K. This indicates that GLA outperforms MLA in throughput when handling long contexts.

GLA-8 also showed higher output throughput when handling unbalanced loads. This indicates that GLA can utilize resources more effectively when processing requests of different lengths, improving overall performance.


All the above experiments confirm the authors' statement: "GTA and GLA" are effective alternatives to "GQA and MLA".


All Paper Authors from Princeton University

There are three authors on the paper, including Tri Dao, all from Princeton University.


Ted Zadouri is currently a Ph.D. student at Princeton University, researching machine learning.

He previously had two internships at Intel (researching deep learning) and a brief stint as a researcher at AI startup Cohere.


Hubert Strauss is a research engineer at Princeton University working on machine learning and deep learning models.

He graduated from Arts et Métiers, a well-known engineering school in France, and later earned a master's degree in Operations Research from Georgia Tech.

After graduating, he held several internships and industry positions; before becoming an engineer at Princeton University, he worked as a machine learning engineer responsible for model training and Transformer optimization.


Tri Dao is currently an Assistant Professor of Computer Science at Princeton University and Chief Scientist at the generative AI startup Together AI.

He is renowned in academia for his work on optimizing Transformer model attention mechanisms.

Among his most influential contributions is co-authoring the Mamba architecture, which has achieved SOTA performance across various modalities including language, audio, and genomics.

Especially in language modeling, the Mamba-3B model outperforms Transformer models of equivalent size in both pre-training and downstream evaluations, and can rival Transformer models twice its size.

He also co-authored FlashAttention versions 1-3, which are widely used to accelerate Transformers and have sped up attention by 4-8 times.


Anyway, returning to this research, paper author Ted Zadouri frankly stated:

This is just the first step towards the "ideal" architecture for test-time inference!


Paper: https://arxiv.org/abs/2505.21487

Code: https://github.com/Dao-AILab/grouped-latent-attention

References:

[1]https://x.com/tri_dao/status/1928170648863473892

[2]https://x.com/gm8xx8/status/1927572103806554262

[3]https://x.com/tedzadouri/status/1928167296821854363


