ByteDance Seed's New Work DeltaFormer: An Attempt at a Next-Generation Model Architecture



Source | Zhihu

Author | Xiaoming Tongxue

Here's a brief introduction to my recent work at Seed, hoping to stimulate further discussion.

https://arxiv.org/pdf/2505.19488

An overview of the core component differences between Deltaformer and Transformer. In short, the value component v in standard attention's (q, k, v) is replaced by a pseudo-value u: u is first obtained by attending over the previous u's with a write query w against the keys k and combining the result with v, and the final output is then computed by ordinary attention over (q, k, u).

Motivation

The Inherent Contradiction Between Expressiveness and Parallelizability

From a high-level perspective, there is an irreconcilable tension between expressiveness and parallelizability. Producing the correct output for some problems objectively requires a certain computational depth. Colloquially, when solving a problem some steps can be done in parallel, but certain key steps must be done one after another; if the length of this sequential chain falls below a certain lower bound, the correct answer simply cannot be obtained. For this reason, computational-complexity researchers in the last century also began to study parallel complexity. Within the class P, they carved out several subclasses based on the operation types allowed at a single node, the fan-in allowed at a single node, and the critical-path length of the whole computation graph, such as AC^0, TC^0, NC^1, and NC^2.

Schematic diagram of different complexity classes. Note that the containment relationships shown are not all strictly proven: it has been proven that AC^0 != TC^0, while the strictness of the other containments remains open, although it is generally believed that NC^1 != TC^0. In addition, there are many smaller classes between NC^1 and NC^2, such as SL and NL. The logarithmic-precision Transformer has been proven to lie in TC^0.
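For orientation, the standard definitions of these circuit classes (textbook facts, not taken from the paper) are sketched below:

```latex
% Poly-size circuit families, classified by depth and allowed gates:
%   AC^0 : constant depth, unbounded fan-in AND/OR/NOT gates
%   TC^0 : AC^0 plus MAJORITY (threshold) gates
%   NC^1 : O(log n) depth, fan-in-2 gates
%   NC^2 : O(log^2 n) depth, fan-in-2 gates
\[
\mathsf{AC}^0 \subseteq \mathsf{TC}^0 \subseteq \mathsf{NC}^1 \subseteq \mathsf{NC}^2 \subseteq \mathsf{P}
\]
```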

Perhaps there is a vast opportunity between LSTM and Transformer

LSTM, which rose to popularity at the end of the last century, is an inherently sequential model: its computation lies in P but resists parallelization. Over the last decade, however, GPUs have redefined the environment, making the highly parallel Transformer the most popular backbone of today's large models. At the same time, the fundamental tension between parallelism and expressiveness leaves large models with defects, such as weak counting ability, so that they must rely on chain-of-thought to solve complex problems.

So, is there an architecture that remains highly parallelizable, only slightly less so than the Transformer, yet has greater expressiveness? Previous research tells us that there are still many complexity classes between TC^0 and P, which leaves room for imagination: perhaps there really exists a model that can be implemented with high parallelism on GPUs while being more expressive than the Transformer.

Resurgence of the Delta Rule

Unlike the Transformer and linear attention, which ignore the previous state and simply write or append keys and values, the Delta rule takes the previous state into account and modifies it on every write. This topic was studied extensively last century, including by Schmidhuber[1], Sutton[2], and Hinton[3]; although it was called fast weight programming at the time, the core idea is the same. In 2021, Schmidhuber[4] revisited it. In the GPU era, however, an implementation that cannot be highly parallelized on GPUs cannot become the next-generation model; otherwise we would simply revert to LSTMs and keep increasing the hidden size. In 2024, Songlin Yang[5] and collaborators discovered the parallel potential of the Delta rule and parallelized DeltaNet on GPUs, which led to the Delta rule's resurgence. The resulting model achieves expressiveness beyond TC^0 and therefore performs well on state-tracking tasks.

Transformer + Delta rule = Deltaformer

DeltaNet, however, is limited by its finite state space, so even its basic long-context retrieval ability is constrained, whereas the Transformer's long-context retrieval is quite good. The purpose of this work is to integrate the two organically, seeking a model that completely surpasses the Transformer architecture.

Method

Deltaformer = Delta rule + Kernel trick

The kernel trick is also an old technique; kernel SVMs held their ground throughout the SVM era. Implicitly lifting features into an infinite-dimensional space may be a good way to increase memory capacity.

Introduce a kernel $\kappa(x, y) = \phi(x)^\top \phi(y)$, where $\phi$ is a mapping from a finite-dimensional space to an infinite-dimensional one that we generally never write out explicitly. We then rewrite the Delta rule as follows:

Delta rule + kernel trick version

The biggest problem is that $\phi(\cdot)$ and $S$ here are infinite-dimensional and cannot be computed on a computer.

Fortunately, after some derivation, the infinite-dimensional quantities involving $\phi$ and $S$ can be eliminated, leaving only kernel evaluations $\kappa(\cdot,\cdot)$ between finite-dimensional vectors and the finite-dimensional pseudo-values $u$.
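A minimal sketch of why the elimination works (my reconstruction: the state is assumed to be a sum of outer products and $\beta_t$ is assumed to scale only the correction term; the paper's exact normalization may differ):

```latex
% Parameterize the infinite-dimensional state as a sum of outer products over pseudo-values u_i:
%   S_t = \sum_{i \le t} u_i \,\phi(k_i)^\top
\begin{align*}
\text{write:}\quad S_t &= S_{t-1} + \bigl(v_t - \beta_t\, S_{t-1}\phi(w_t)\bigr)\phi(k_t)^\top
  && \Rightarrow\;\; u_t = v_t - \beta_t \sum_{i<t} \kappa(w_t, k_i)\, u_i,\\
\text{read:}\quad  o_t &= S_t\,\phi(q_t)
  && \Rightarrow\;\; o_t = \sum_{i \le t} \kappa(q_t, k_i)\, u_i.
\end{align*}
% Only the kernel \kappa(x, y) = \phi(x)^\top \phi(y) and the finite-dimensional u_i remain;
% \phi and S never have to be materialized.
```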

The writing method is: $u_t = v_t - \beta_t \sum_{i<t} \kappa(w_t, k_i)\, u_i$

The reading method is: $o_t = \sum_{i \le t} \kappa(q_t, k_i)\, u_i$

Of course, other variations are possible inside this framework, for example using different kernels $\kappa$ for the write and read rules, and adding some learnable parameters.

We use softmax as the similarity function $\kappa$, and we obtain Deltaformer, a Delta-rule-upgraded version of the Transformer.
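For concreteness, here is a naive per-token sketch in PyTorch of the write/read rules above. This is my own reconstruction: the function name, the 1/sqrt(d) scaling, and the handling of the first token are assumptions, not the paper's implementation.

```python
import torch

def deltaformer_naive(q, k, v, w, beta):
    """Per-token sketch of the write/read rules above (a reconstruction, not the paper's kernel).

    q, k, v, w: (T, d) tensors; beta: (T,) write strengths.
    """
    T, d = v.shape
    scale = d ** -0.5                                        # assumed attention scaling
    u = torch.zeros_like(v)
    o = torch.zeros_like(v)
    for t in range(T):
        if t > 0:
            a = torch.softmax(scale * (k[:t] @ w[t]), dim=-1)    # write attention over the past
            u[t] = v[t] - beta[t] * (a @ u[:t])                  # delta-rule correction of v_t
        else:
            u[t] = v[t]                                          # nothing to correct at t = 0
        b = torch.softmax(scale * (k[:t + 1] @ q[t]), dim=-1)    # causal read attention
        o[t] = b @ u[:t + 1]                                     # standard attention over (q, k, u)
    return o
```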

Next, we need to answer two questions:

• 1) How to implement it efficiently on GPUs

• 2) How to prove that this formulation can solve state-tracking tasks beyond TC^0

Chunk-wise algorithm

The difficult part is the calculation of $u$; the calculation of $o$ can be done with Flash Attention as usual.

Directly unrolling the recurrence $u_t = v_t - \beta_t \sum_{i<t} \kappa(w_t, k_i)\, u_i$ is fine during the decode phase, but during training such step-by-step recursive computation is no better than a non-linear RNN.

However, we can also write it in a more compact matrix form: $(I + A)\,U = V$, where $A$ is strictly lower triangular with entries $A_{ti} = \beta_t\, \kappa(w_t, k_i)$ for $i < t$.

Then we have $U = (I + A)^{-1} V$. Directly inverting such a large matrix is highly parallel, but the memory traffic (I/O) it requires is prohibitive.

Let a subscript c denote the blocks belonging to the current chunk and a subscript p denote those of all previous chunks. Restricting the block rows of $(I + A)U = V$ to the current chunk gives:

$(I + A_{cc})\,U_c + A_{cp}\,U_p = V_c$

Therefore:

$U_c = (I + A_{cc})^{-1}\,(V_c - A_{cp}\,U_p)$
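A minimal sketch of this chunk-wise solve in PyTorch (the names are mine, and the dense torch.linalg.solve stands in for whatever inversion routine the actual I/O-aware kernel uses):

```python
import torch

def solve_u_chunkwise(A, V, chunk_size):
    """Solve (I + A) U = V chunk by chunk, where A is strictly lower triangular.

    A: (T, T) write scores (beta_t * kappa(w_t, k_i) below the diagonal, 0 elsewhere).
    V: (T, d) values.  Returns U: (T, d).
    """
    T, d = V.shape
    U = torch.empty_like(V)
    for s in range(0, T, chunk_size):
        e = min(s + chunk_size, T)
        rhs = V[s:e] - A[s:e, :s] @ U[:s]                # A_cp @ U_p: contribution of previous chunks
        I_cc = torch.eye(e - s, dtype=A.dtype, device=A.device)
        U[s:e] = torch.linalg.solve(I_cc + A[s:e, s:e], rhs)   # (I + A_cc)^{-1} rhs
    return U
```

Since $I + A_{cc}$ is unit lower triangular, the per-chunk solve could equally be done by forward substitution or a truncated Neumann series instead of a dense solve.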

Using this method, $U$ can be computed chunk by chunk. If the sequence length is $n$, the chunk size is $C$, and the head dimension is $d$, with the block inverses computed as above, the total FLOPs are on the order of $O(n^2 d + nC^2)$: forming and applying the attention-score blocks costs about $n^2 d$, and the $n/C$ per-chunk inverses add about $nC^2$.

Deltaformer can track the exchange of n elements

We theoretically prove that the expressive upper bound of this architecture can reach NC^1. We study the task of tracking the exchange of n elements, which is an NC^1-hard problem, and use a constructive method for the proof; for the details please refer to the original paper. The conclusion is that Deltaformer can track the exchange of $n$ objects with a head dimension of $O(n)$.

Theorem proving Deltaformer's ability to track the exchange of n elements.
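To make the task concrete, swap tracking can be formalized as composing transpositions. This formalization is standard and is given here only for orientation; the paper's actual construction is not reproduced.

```latex
% Tracking the exchange of n elements = maintaining a permutation under a stream of swaps.
% Starting from the identity \pi_0 = \mathrm{id}, after observing the t-th swap \tau_t:
\[
\pi_t \;=\; \tau_t \circ \pi_{t-1},
\qquad \text{and a query asks for } \pi_t(x) \text{ for some element } x.
\]
% Composing transpositions in order is the word problem of the symmetric group,
% which is already NC^1-complete for the fixed group S_5 (Barrington),
% hence not in TC^0 unless TC^0 = NC^1.
```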

Experiment

Deltaformer can track swaps, but Transformer struggles


For example, we find that tracking the exchange of even 5 elements is quite difficult for the Transformer. However, the choice of kernel function matters a great deal for this swap-tracking task.

Deltaformer can determine connectivity in directed acyclic graphs


This is also quite reasonable: consider the inverse operation $(I + A)^{-1}$ inside Deltaformer. If $A$ encodes whether node $i$ and node $j$ are adjacent, then the powers of $A$ appearing in its expansion also encode whether node $i$ and node $j$ are reachable in $k$ steps, and hence whether they are connected at all. (From another perspective, this is because the inversion operation, whose complexity goes well beyond TC^0, expands the Transformer's expressiveness.)
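A tiny self-contained illustration of that intuition, using the sign-free identity $(I - A)^{-1} = I + A + A^2 + \cdots$ for a DAG adjacency matrix $A$ (the example and variable names are mine, not from the paper):

```python
import torch

# Adjacency matrix of a small DAG: 0 -> 1 -> 2 -> 3, plus edge 0 -> 2.
A = torch.zeros(4, 4)
for i, j in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    A[i, j] = 1.0

# For a DAG, A is nilpotent, so (I - A)^{-1} = I + A + A^2 + ... is a finite sum
# whose (i, j) entry counts the number of paths from i to j.
path_counts = torch.linalg.inv(torch.eye(4) - A)
reachable = path_counts > 0   # nonzero entries = pairs with at least one path (diagonal = trivial path)

print(path_counts)            # entry (0, 3) is 2: paths 0->1->2->3 and 0->2->3
print(reachable)
```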

For more toy model experiments and interesting phenomena, please refer to our original paper.

Conclusion

We propose the Deltaformer model, which combines the memory capacity of the Transformer with the ability to be trained efficiently on GPUs, while also breaking through the TC^0 expressiveness limitation of the Transformer. We hope this work can serve as an inspiration for designing more expressive models in the future.

References

[1] Schmidhuber: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=2f0becffd2f44b198d28074d01722e4c7905dae2

[2] Sutton: https://web.cs.umass.edu/publication/docs/1980/UM-CS-1980-018.pdf

[3] Hinton: https://www.cs.toronto.edu/~fritz/absps/fastweights.pdf

[4] Schmidhuber: https://proceedings.neurips.cc/paper_files/paper/2021/file/3f9e3767ef3b10a0de4c256d7ef9805d-Paper.pdf

[5] Songlin Yang: https://arxiv.org/pdf/2406.06484


