The Mamba Architecture Heads to ICLR 2026: Can the Transformer, AI's Core Brain, Hold Onto Its Throne?

The Transformer architecture dominates the field of large AI models. Some argue that the Transformer alone is sufficient to achieve AGI!

Others believe that several fundamental architectural innovations are still needed, a view expressed, for instance, in Huawei's "Intelligent World 2035" report.

As the scale of model training and deployment explodes, the demand for compute and energy appears endless. How can AI become smart, affordable, and fast all at once?

The computational cost of the Transformer's attention mechanism scales quadratically with sequence length: double the length of the text being processed and the required computation roughly quadruples. On top of that, its inference memory footprint (the KV cache) grows linearly with sequence length. Achieving AGI with this architecture would demand unimaginable amounts of energy and compute.

Academia and industry are searching for alternatives.

Mamba has entered the stage. Mamba-3 is currently undergoing double-blind review for the top conference, ICLR 2026.

Instead of patching up the Transformer framework, Mamba took a different approach, finding inspiration in an older, more foundational theory: the State Space Model (SSM).

The story of Mamba-3 is essentially one of balancing, and jointly advancing, efficiency and intelligence. Starting from the most critical practical issue, inference efficiency, it introduces three key improvements: a more expressive recurrence, smarter state-update rules, and a Multi-Input Multi-Output (MIMO) architecture designed to maximize hardware performance.

What new tricks does Mamba-3 employ, and how does it challenge the Transformer?

Building a Tower from the Ground Up

To understand the sophistication of Mamba-3, let's first discuss the State Space Model (SSM).

This concept was not originally developed for Natural Language Processing (NLP); its origins lie in predicting continuously changing systems, such as signals in a circuit, shifting weather patterns, or the trajectory of a moving object. Mathematically and conceptually, it is closely related to Recurrent Neural Networks (RNNs): both process sequences step by step and maintain a 'memory' that influences the next decision.

As the name suggests, the core of SSM is the 'state space'. You can think of it as a snapshot describing all current conditions of a system, containing all key variables. An SSM takes an input sequence x(t), maps it to an unseen latent state h(t)—similar to the hidden state in an RNN—and then predicts the output y(t) based on this state.

All SSMs operate around two core equations:

  • State Equation: h'(t)=A*h(t)+B*x(t)

  • Output Equation: y(t)=C*h(t)+D*x(t)

The four parameters A, B, C, and D are typically weight matrices that define the system's dynamics. In traditional fields like control theory, these matrices are fixed, representing a known system. In deep learning, they become parameters optimized through training, represented by the neural network's learnable weights.

The classic form of the SSM is designed for continuous signals, but the text, images, and sounds we process are discrete data in a computer. This necessitates a 'discretization' step.

You can imagine discretization as periodically sampling a continuously flowing signal. This process introduces a new parameter, the step size (Δ), which dictates how often we sample. There are many discretization methods, but most modern SSMs, including Mamba, use a simple approach called Zero-Order Hold (ZOH).

After discretization, the SSM can process sequential data just like an RNN.
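
To make the pipeline concrete, here is a minimal NumPy sketch of a toy linear SSM: the continuous parameters are discretized with zero-order hold and then run step by step like an RNN. All shapes and values are hypothetical placeholders, not taken from any trained model.

```python
import numpy as np
from scipy.linalg import expm

# Toy continuous-time SSM: h'(t) = A*h(t) + B*x(t),  y(t) = C*h(t) + D*x(t)
N = 4                                   # state dimension (hypothetical)
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 2.0, N))  # stable diagonal A, in the spirit of S4-style models
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
D = rng.normal(size=(1, 1))
dt = 0.1                                # step size Δ

# Zero-order hold: A_bar = exp(Δ·A),  B_bar = A^{-1} (exp(Δ·A) - I) B
A_bar = expm(dt * A)
B_bar = np.linalg.solve(A, A_bar - np.eye(N)) @ B

def ssm_step_by_step(x_seq):
    """Discrete recurrence h_k = A_bar h_{k-1} + B_bar x_k,  y_k = C h_k + D x_k."""
    h, ys = np.zeros((N, 1)), []
    for x in x_seq:
        h = A_bar @ h + B_bar * x       # the RNN-like state update
        ys.append((C @ h + D * x).item())
    return np.array(ys)

print(ssm_step_by_step(np.sin(np.arange(20) * 0.3))[:5])  # one output per input step
```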

However, early discrete SSMs were impractical because they inherited some of the old flaws of RNNs, such as low training efficiency and poor memory retention, making it hard to capture relationships between elements far apart in a sequence—the so-called 'long-range dependency' problem.

A turning point occurred in 2021 when researcher Albert Gu and colleagues proposed the Structured State Space Sequence Model, or S4. This work paved the way for the later Mamba architecture.

The S4 model achieved two major feats.

First, enabling efficient training via convolution. Although discrete SSMs are as fast as RNNs during inference, they are notoriously slow to train. The S4 authors observed that because the SSM recurrence involves only linear additions and multiplications (it is linear and time-invariant), it can be unrolled into a one-dimensional convolution whose kernel maps the input sequence x directly to the output y in a single pass. Such convolutions can be computed very efficiently using the Fast Fourier Transform.

This yields an elegant benefit: during training, when the entire input sequence is known, S4 can compute in parallel as efficiently as a Convolutional Neural Network (CNN); during inference, when tokens must be generated one by one, it reverts to the RNN form, with high speed and a minimal memory footprint. The best of both worlds.
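
A hedged sketch of that equivalence, using the same kind of toy SSM as above: the recurrence can be unrolled into a causal convolution with kernel K = (C·B̄, C·Ā·B̄, C·Ā²·B̄, …) and evaluated with an FFT-based convolution, which is the parallel, training-friendly view S4 exploits. Dimensions and values are again illustrative.

```python
import numpy as np
from scipy.linalg import expm
from scipy.signal import fftconvolve

# Same kind of toy SSM as above (hypothetical values); the D skip term is omitted for brevity.
N, dt = 4, 0.1
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 2.0, N))
B, C = rng.normal(size=(N, 1)), rng.normal(size=(1, N))
A_bar = expm(dt * A)
B_bar = np.linalg.solve(A, A_bar - np.eye(N)) @ B

def run_as_rnn(x_seq):
    """Step-by-step recurrence (the fast inference view)."""
    h, ys = np.zeros((N, 1)), []
    for x in x_seq:
        h = A_bar @ h + B_bar * x
        ys.append((C @ h).item())
    return np.array(ys)

def run_as_convolution(x_seq):
    """One causal convolution with kernel K[k] = C @ A_bar^k @ B_bar (the parallel training view)."""
    K, AkB = [], B_bar
    for _ in range(len(x_seq)):
        K.append((C @ AkB).item())
        AkB = A_bar @ AkB
    return fftconvolve(x_seq, np.array(K))[: len(x_seq)]  # FFT-based, truncated to the causal part

x = np.sin(np.arange(32) * 0.3)
print(np.allclose(run_as_rnn(x), run_as_convolution(x)))  # True: the two views agree
```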

Second, solving the long memory problem using structured matrices. To allow the SSM to remember more distant information, S4 did not initialize its weight matrices A and B randomly like conventional machine learning models. Instead, it adopted a technique called HiPPO, deriving the structure of the matrices from special orthogonal polynomials (like Legendre polynomials). This unique initialization acts like a memory-enhancement plugin for the model, boosting its performance when handling long sequences.
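
For illustration, below is a sketch of the HiPPO-LegS-style matrix construction commonly used for this initialization; exact scaling conventions differ between papers, so treat the constants as an assumption rather than a faithful reproduction of S4's code.

```python
import numpy as np

def hippo_legs(N):
    """Illustrative HiPPO-LegS-style matrices (scaling conventions vary across papers).

    A is lower triangular with entries derived from Legendre polynomials; initializing the
    SSM with this structure, rather than randomly, is what gives S4 its long-range memory.
    """
    n = np.arange(N)
    A = -np.sqrt((2 * n[:, None] + 1) * (2 * n[None, :] + 1))  # candidate off-diagonal entries
    A = np.tril(A, k=-1) + np.diag(-(n + 1.0))                 # keep the n > k part, set the diagonal
    B = np.sqrt(2 * n + 1.0)[:, None]
    return A, B

A, B = hippo_legs(8)
print(np.linalg.eigvals(A).real.max() < 0)  # True: all modes decay, so the dynamics are stable
```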

Subsequent variants of S4, such as DSS, S5, and our protagonist, the Mamba series, differ in their specific initialization schemes but retain the core idea of HiPPO: imposing structure, typically a diagonal form for A, so that the model can stably update its state and retain long-range dependencies.

The Evolution of Mamba

In 2023, Albert Gu (him again) and Tri Dao introduced the Mamba architecture in their paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." It was one of the first attention-free architectures able to compete directly with the Transformer in language modeling.

Mamba's core innovations are twofold.

The first is the 'Selective State Space Model'. It equips the traditional SSM with a 'selection' switch, allowing the model to dynamically decide which historical information to remember and which to ignore based on the importance of the current input. This capability was previously considered the exclusive domain of the Transformer's self-attention mechanism.
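
Here is a heavily simplified sketch of the selection mechanism (shapes, projection names, and the per-channel layout are illustrative; the real Mamba block adds gating, a short convolution, and a hardware-aware scan). The key point is that the step size Δ and the matrices B and C become functions of the current input instead of fixed parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, T = 4, 8, 32                    # hypothetical sizes
W_delta = rng.normal(size=(d_model, d_model)) * 0.1   # projections that make the SSM "selective"
W_B = rng.normal(size=(d_state, d_model)) * 0.1
W_C = rng.normal(size=(d_state, d_model)) * 0.1
A = -np.exp(rng.normal(size=(d_model, d_state)))  # per-channel diagonal A, negative => stable

def selective_scan(x_seq):
    """Δ_t, B_t, C_t are computed from x_t, so the model decides per token what to keep or forget."""
    h, ys = np.zeros((d_model, d_state)), []      # one small state vector per channel
    for x in x_seq:                               # x: (d_model,)
        delta = np.log1p(np.exp(W_delta @ x))     # softplus keeps the step sizes positive
        B_t, C_t = W_B @ x, W_C @ x               # input-dependent input/output maps
        A_bar = np.exp(delta[:, None] * A)        # per-channel discretized decay
        h = A_bar * h + (delta * x)[:, None] * B_t[None, :]  # selective state update
        ys.append(h @ C_t)                        # read out one value per channel
    return np.stack(ys)                           # (T, d_model)

print(selective_scan(rng.normal(size=(T, d_model))).shape)   # (32, 4)
```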

The second is 'Hardware-Aware Parallel Scan'. This is a highly engineering-focused optimization that designs an efficient algorithm specifically tailored to the computational characteristics of modern Graphics Processing Units (GPUs) to handle SSM's recursive calculations, maximizing hardware utilization.

A year later, the same two authors published another paper, further exploring the deep connections between SSM and Transformer, proposing a faster and stronger improved version, Mamba-2.

Mamba-2 discovered that the computation process for a large class of SSMs can be equivalently represented as a masked matrix multiplication. This finding allowed Mamba-2 to utilize the highly efficient implementation of matrix multiplication, boosting training speed by 50% compared to Mamba-1. It also supported larger state dimensions, enabling the model to handle more complex tasks, especially with long sequences.
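
The idea can be checked numerically with a toy example (scalar per-step decays, as in Mamba-2; names and sizes are illustrative, and the real kernel evaluates this block by block for efficiency): the step-by-step recurrence and the masked, attention-like matrix multiplication produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_state = 6, 4                                  # hypothetical sizes
a = rng.uniform(0.8, 1.0, size=T)                  # scalar per-step decay (Mamba-2 style)
B = rng.normal(size=(T, d_state))                  # per-step input projections B_t
C = rng.normal(size=(T, d_state))                  # per-step output projections C_t
x = rng.normal(size=T)                             # a single scalar channel, for simplicity

# Recurrent view: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t
h, y_rec = np.zeros(d_state), []
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec.append(C[t] @ h)

# Dual view: y = (L ∘ C B^T) x, with mask L[t, j] = a_{j+1} * ... * a_t for j <= t, else 0
L = np.zeros((T, T))
for t in range(T):
    for j in range(t + 1):
        L[t, j] = np.prod(a[j + 1 : t + 1])        # empty product = 1 when j == t
M = L * (C @ B.T)                                  # masked, attention-like matrix
print(np.allclose(y_rec, M @ x))                   # True: same outputs, now one big matmul
```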

Now, the story progresses to Mamba-3.

Mamba-3 builds upon Mamba-2, representing another evolution focused on inference efficiency. It introduces three core methodological improvements.

The first is 'Trapezoidal Discretization'. It replaces the relatively crude method (Euler method) used in Mamba-2 with a more precise mathematical technique (trapezoidal rule) for converting continuous signals into discrete sequences. This improvement makes the model's recursive updates more expressive.

The second is the 'Complex State Space Model'. By introducing complex numbers to define the SSM, the model's state update capability is greatly enhanced, solving the inability of many linear models to handle tasks requiring precise state tracking (such as parity counting).

The third is the 'Multi-Input Multi-Output SSM'. This design is purely aimed at boosting decoding speed and hardware efficiency. It changes the state updates from being based on outer products to being based on matrix multiplication, drastically increasing the computational 'arithmetic intensity' and preventing the GPU from being 'starved'.

Mamba-3's New Tricks

More Accurate Discretization: The Trapezoidal Rule

Structured SSMs are theoretically defined as continuous-time systems, but the data processed in practice is discrete. The conversion from continuous to discrete—discretization—is a crucial step.

Mamba-2 used the Euler method, which you can picture as approximating the area under a curve with a rectangle, using the function value at only one end of each interval. Its local error is O(Δt²): simple, but not especially precise.

Mamba-3 switches to a more refined technique: a generalized trapezoidal rule. Instead of a rectangle, it uses a trapezoid that accounts for both the start and end of the interval, blended through a data-dependent convex combination. This shrinks the local error to O(Δt³), one full order of accuracy better.
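
A small numerical illustration of the accuracy gap, using a generic scalar ODE rather than the paper's exact update rule: halving the step size cuts the Euler error by about 4x but the trapezoidal error by about 8x, matching the O(Δt²) versus O(Δt³) local-error rates.

```python
import numpy as np

# Toy ODE h'(t) = -h(t) + 0.5*sin(t), h(0) = 1, with a known closed-form solution.
f = lambda t, h: -h + 0.5 * np.sin(t)
exact = lambda t: 1.25 * np.exp(-t) + 0.25 * (np.sin(t) - np.cos(t))

h0, t0 = 1.0, 0.0
for dt in (0.2, 0.1, 0.05):
    euler = h0 + dt * f(t0, h0)                               # rectangle rule: one endpoint only
    trap = h0 + 0.5 * dt * (f(t0, h0) + f(t0 + dt, euler))    # trapezoid: average both endpoints
    print(f"dt={dt:5.2f}  Euler err={abs(euler - exact(dt)):.2e}"
          f"  Trapezoid err={abs(trap - exact(dt)):.2e}")
# Halving dt cuts the Euler error ~4x (local error O(dt^2)),
# but the trapezoidal error ~8x (local error O(dt^3)).
```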

During state updates, Mamba-3 not only considers the input at the current time step but also glances back at the input from the previous time step. This small 'revisit' makes the model’s ability to capture sequence dynamics more nuanced and powerful.

This improvement not only boosts the model's expressiveness but also eliminates the need for a component many previous linear models relied upon—short causal convolution. This makes the overall model architecture simpler and more unified.

Smarter State Updates: Complex Numbers and Rotation

Modern SSMs, in their pursuit of efficiency, have continually simplified their core state transition matrices. The S4 model used complex 'Normal plus Low-Rank' matrices; Mamba simplified this to real diagonal matrices; and Mamba-2 further reduced it to a scalar. While these simplifications did not lead to significant performance degradation on language modeling tasks, they weakened the model's ability in simple state tracking tasks.

Take, for example, determining whether the number of 1s in a binary sequence is odd or even (the parity task). This is trivial for a single-layer LSTM (Long Short-Term Memory network) but nearly impossible for Mamba-2, whose state transition matrices have only real eigenvalues.

The reason is that real eigenvalues can only represent state 'scaling' and 'flipping', but not 'rotation'. Tasks like parity, where the internal state transition is periodic—like a switch toggling between 'on' and 'off'—are most naturally represented mathematically by rotation.
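
A toy demonstration of the point (not Mamba-3's actual parameterization): if every '1' rotates a 2-D state by 180 degrees and every '0' leaves it alone, the sign of the state encodes the parity of the sequence, which is exactly the kind of periodic behavior that positive real scaling cannot express.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def parity_via_rotation(bits):
    """Rotate a 2-D state by pi for every 1 and by 0 for every 0; the sign encodes parity."""
    h = np.array([1.0, 0.0])
    for b in bits:
        h = rot(np.pi * b) @ h        # data-dependent rotation: the kind of update Mamba-2 lacks
    return 0 if h[0] > 0 else 1       # pointing along +x => even count of 1s, -x => odd

bits = [1, 0, 1, 1, 0]
print(parity_via_rotation(bits), sum(bits) % 2)   # both print 1: three 1s, odd parity
```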

To restore this capability, Mamba-3 introduced complex numbers.

It proved that a complex-valued SSM, after discretization, is equivalent to a real-valued SSM with doubled state dimension, whose state transition matrix consists of a series of 2x2 rotation matrix blocks.

Furthermore, it demonstrated that this rotation operation can be equivalently 'absorbed' into the input and output projection matrices B and C. This leads to a surprising conclusion: using a complex SSM is equivalent to applying a data-dependent Rotary Position Embedding (RoPE) to the input (B) and output (C) of a conventional SSM with scalar state transitions.
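
This equivalence is easy to verify numerically on a toy example (random data-dependent angles, a 2-D state; this mirrors the idea rather than the paper's exact construction): an SSM whose state matrix is a per-step rotation produces the same outputs as an identity-state SSM whose B and C vectors are pre-rotated by the cumulative angle, just as RoPE rotates queries and keys.

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
T = 10
theta = rng.uniform(0.0, 1.0, T)       # data-dependent rotation angles (stand-ins)
B = rng.normal(size=(T, 2))            # per-step input projections B_t
C = rng.normal(size=(T, 2))            # per-step output projections C_t
x = rng.normal(size=T)

# (1) SSM whose state matrix is a per-step rotation: h_t = R(theta_t) h_{t-1} + B_t x_t
h, y_rotation_in_A = np.zeros(2), []
for t in range(T):
    h = rot(theta[t]) @ h + B[t] * x[t]
    y_rotation_in_A.append(C[t] @ h)

# (2) Identity-state SSM where the rotation is absorbed into B_t and C_t (the "RoPE trick")
Theta = np.cumsum(theta)               # cumulative angle up to step t
h, y_rotation_in_BC = np.zeros(2), []
for t in range(T):
    h = h + (rot(-Theta[t]) @ B[t]) * x[t]                  # pre-rotated input projection
    y_rotation_in_BC.append((rot(-Theta[t]) @ C[t]) @ h)    # pre-rotated output projection

print(np.allclose(y_rotation_in_A, y_rotation_in_BC))  # True: identical outputs
```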

RoPE is used in many large models (like Llama), where it helps the model understand word order by injecting absolute or relative position information into word vectors. What Mamba-3 does here is transform RoPE from a 'data-independent', fixed positional encoding into a 'data-dependent', dynamic state rotator.

This implementation, dubbed the 'RoPE trick' by the authors, allows Mamba-3 to gain powerful state tracking capabilities with minimal computational overhead, easily solving tasks like parity and modular arithmetic that Mamba-2 could not handle.

Ultimate Hardware Efficiency: From Outer Product to Matrix Multiplication

In autoregressive generation (producing one token at a time), performance is typically measured in tokens per second (TPS). Here, models like Mamba have an inherent advantage: they maintain only a fixed-size hidden state, whereas the Transformer must maintain a KV cache that grows linearly with sequence length.

However, TPS alone doesn't capture hardware efficiency. A more fundamental metric is 'arithmetic intensity': the ratio of floating-point operations (FLOPs) performed to the number of bytes moved between memory and the compute units.

Modern GPUs are like enormously powerful factories: their compute throughput (operations per second) far exceeds their memory bandwidth (bytes per second). If arithmetic intensity is too low, the GPU spends most of its time waiting for data to arrive from memory rather than computing, a situation known as being 'memory-bound'.

Mamba-2's state update is an outer product operation. Its arithmetic intensity is a constant, far below the ideal value for modern GPUs. This means Mamba-2 cannot fully utilize the power of the GPU during decoding.

Mamba-3 made a seemingly simple but highly effective change: it replaced the outer-product state update with a matrix multiplication.

In the context of signal processing, this corresponds precisely to generalizing from a Single-Input Single-Output (SISO) system to a Multi-Input Multi-Output (MIMO) system.

Under the MIMO formulation, arithmetic intensity is proportional to a newly introduced rank r. By adjusting the size of r, we can flexibly increase the arithmetic intensity, pushing the decoding process from 'memory-bound' towards 'compute-bound', thereby utilizing the hardware more fully and achieving higher TPS. Crucially, this process does not increase inference memory usage (the size of state H remains unchanged).
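
A rough back-of-the-envelope sketch of why this helps (per decode step and per head; the byte counts and the assumption that state traffic dominates are simplifications, not figures from the paper): the memory traffic for the state stays fixed while the FLOPs grow with the rank r, so arithmetic intensity rises roughly in proportion to r.

```python
def decode_step_stats(d_state, d_head, r, bytes_per_el=2):
    """Rough FLOPs and memory traffic for one state update H += B_t @ X_t (per head, bf16).

    r = 1 corresponds to the rank-1 outer-product (SISO-style) update; r > 1 is the MIMO update.
    Assumes the dominant traffic is reading and writing the (d_state x d_head) state H.
    """
    flops = 2 * d_state * r * d_head                     # (d_state x r) @ (r x d_head) matmul
    bytes_moved = 2 * d_state * d_head * bytes_per_el    # read + write the state H
    return flops, bytes_moved, flops / bytes_moved

for r in (1, 4, 16, 64):
    flops, moved, intensity = decode_step_stats(d_state=128, d_head=64, r=r)
    print(f"rank r={r:3d}: ~{flops/1e3:7.1f} kFLOPs, ~{moved/1e3:5.1f} kB moved, "
          f"intensity ~ {intensity:5.1f} FLOPs/byte")
# r = 1 gives a small, fixed intensity (memory-bound decoding); increasing r raises it
# toward the compute-bound regime without growing the state H that must be stored.
```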

These three key improvements collectively form the core Mixer primitive of Mamba-3. The overall Mamba-3 architecture also underwent adjustments, alternating between Mamba-3 blocks and SwiGLU blocks, and adopting pre-normalization.

Architecture Performance Showdown

Regarding language modeling performance, the authors used 100 billion tokens from the FineWeb-Edu dataset to pre-train Mamba-3 alongside baseline models like Transformer, Gated DeltaNet, and Mamba-2, across four different parameter scales: 180M, 440M, 820M, and 1.5B.

The results show that Mamba-3 comprehensively outperforms the baseline models across various downstream tasks at all model scales.

In terms of retrieval capability (the ability to accurately find information within long texts), the Transformer still holds an advantage thanks to its KV cache, which gives lossless access to the entire history. This is a common weakness of all models with a fixed-size state.

Experiments show that Mamba-3 performs well on tasks like associative recall and question answering, but struggles with tasks requiring information extraction from semi-structured or unstructured data. However, on the synthetic "Needle in a Haystack" (NIAH) task, Mamba-3's performance surpassed or matched the baseline and demonstrated better generalization ability than Mamba-2.

In terms of inference efficiency, under the common setting of bf16 precision and a state dimension of 128, both the SISO and MIMO versions of Mamba-3 decode faster than Mamba-2 and Gated DeltaNet.

The performance-efficiency trade-off makes Mamba-3's advantage even clearer. Plotting state size (a proxy for inference speed; smaller is faster) against pre-training perplexity (a proxy for model quality; lower is better), the MIMO version of Mamba-3 pushes the Pareto frontier forward without increasing state size, that is, without sacrificing speed.

Finally, ablation studies verified the effectiveness of Mamba-3's various improvements.

Trapezoidal discretization and the introduced bias term work synergistically to significantly improve model performance. In state tracking tasks, Mamba-3 equipped with RoPE nearly perfectly solves parity and modular arithmetic tasks, whereas Mamba-3 without RoPE and Mamba-2 performed little better than random guessing.

The Mamba-3 story is an exploration of finding a superior balance between computational efficiency and model capability.

For long-text tasks requiring lossless memory and precise retrieval, the fixed-size state memory mechanism remains its Achilles' heel compared to the Transformer. The authors concede that combining Mamba-3 with external retrieval mechanisms to build a hybrid architecture might be an important direction for the future.

Do you think Mamba-3 will replace the Transformer, or will it be a beneficial complement?

References:

https://openreview.net/pdf/a4e02db9a98e8b5cb40d677e00e4c8017a282772.pdf

https://openreview.net/forum?id=HwCvaJOiCj

https://www.ibm.com/think/topics/state-space-model

https://www.ibm.com/think/topics/mamba-model

https://goombalab.github.io/blog/2024/mamba2-part1-model

https://jalammar.github.io/illustrated-transformer
