Recently, Shanghai AI Lab, in collaboration with South China University of Technology, Hong Kong University of Science and Technology (Guangzhou), Nanjing University, and The Chinese University of Hong Kong, announced their research achievement: Liger (a portmanteau of lion and tiger, as in the hybrid animal), short for Linearizing LLMs to gated recurrent structures. It is a linearization technique that efficiently converts pre-trained large language models into linear models with gated recurrent structures.
This research has been accepted by ICML 2025, and all code and models have been open-sourced.
Paper Title:
Liger: Linearizing Large Language Models to Gated Recurrent Structures
Paper Link:
https://arxiv.org/abs/2503.01496
Code Link:
https://github.com/OpenSparseLLMs/Linearization
Model Link:
https://huggingface.co/collections/linear-moe-hub/liger-67d904bffd7f9b77ade7747d
Large language models such as Llama and Mistral have achieved excellent performance across a wide range of sequence modeling tasks, and the effectiveness of the Transformer architecture they build on has been extensively validated. However, this architecture also has inherent drawbacks:
1. The attention mechanism has quadratic computational complexity with respect to sequence length. Generating each new token requires attending over the entire history, making Transformer models inefficient in long-sequence scenarios;
2. The KV-Cache must store the keys and values of all previous tokens for later computation, so GPU memory pressure grows with sequence length (both points are illustrated in the sketch below).
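To make the bottleneck concrete, the following minimal PyTorch sketch shows one decoding step of standard causal attention with a KV cache. It is an illustration only, not the paper's code: single head, no batching, and hypothetical tensor shapes.

import torch

def softmax_attention_step(q_t, k_t, v_t, k_cache, v_cache):
    # One decoding step of standard causal attention with a KV cache.
    # The cache grows by one entry per generated token, so per-step compute
    # and memory are proportional to the current sequence length (O(n^2) overall).
    k_cache = torch.cat([k_cache, k_t.unsqueeze(0)], dim=0)  # (t, d), grows every step
    v_cache = torch.cat([v_cache, v_t.unsqueeze(0)], dim=0)  # (t, d)
    scores = (k_cache @ q_t) / q_t.shape[-1] ** 0.5          # attend over the whole history
    out = torch.softmax(scores, dim=0) @ v_cache             # (d,)
    return out, k_cache, v_cache

d = 64
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(5):  # decode a few tokens
    q_t, k_t, v_t = (torch.randn(d) for _ in range(3))
    out, k_cache, v_cache = softmax_attention_step(q_t, k_t, v_t, k_cache, v_cache)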
The efficiency bottlenecks of traditional Transformer architecture models are becoming increasingly prominent. How to achieve efficient inference without sacrificing performance has become a common focus for both academia and industry.
Against this backdrop, model architectures based on linear sequence modeling methods are emerging, as linear recurrent models have distinct architectural advantages:
1. The linear attention mechanism has linear computational complexity with respect to sequence length. Generating the next token only requires accessing a fixed-size memory/state, leading to high computational efficiency;
2. No KV-Cache is needed, and GPU memory usage during inference remains constant regardless of sequence length (see the sketch after this list).
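For contrast with the softmax-attention sketch above, here is a minimal sketch of one decoding step of plain (ungated) linear attention. The only state carried across steps is a fixed d x d matrix; the update rule shown is the generic linear-attention recurrence, not Liger's specific formulation.

import torch

def linear_attention_step(q_t, k_t, v_t, state):
    # One decoding step of ungated linear attention.
    # The recurrent state S_t = S_{t-1} + k_t v_t^T has a fixed d x d size,
    # so per-step compute and memory stay constant regardless of sequence length.
    state = state + torch.outer(k_t, v_t)
    out = q_t @ state
    return out, state

d = 64
state = torch.zeros(d, d)   # fixed-size memory, independent of sequence length
for _ in range(5):          # decode a few tokens
    q_t, k_t, v_t = (torch.randn(d) for _ in range(3))
    out, state = linear_attention_step(q_t, k_t, v_t, state)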
Because this efficiency directly addresses the inherent drawbacks of the Transformer architecture, linear recurrent models are a promising candidate for a foundational LLM architecture.
However, validating the effectiveness of an emerging model architecture is not easy. Training a model with a very large number of parameters typically requires hundreds of billions or even trillions of high-quality training tokens, and the compute demands are enormous, since the model must be pre-trained from random initialization on massive GPU clusters.
Training such linear recurrent models from scratch is therefore costly, and the resulting models usually struggle to match existing Transformer LLMs, which deters most researchers from paying such a high price for a linear LLM that might not even perform well.
Given that well-pre-trained Transformer LLMs (Llama, Mistral, etc.) already exist, adapting their architecture into a linear recurrent one and continuing training from those weights may be a much cheaper route; we call this model architecture linearization.
However, to approximate the behavior of the Transformer's softmax attention, current linear models add various modules on top of plain linear attention, such as feature mappings and gating mechanisms, which do improve the performance of the original linear models to some extent.
Nevertheless, existing linearization methods have not explored how best to convert a Transformer into a linear model with a gated recurrent structure. Moreover, in the linearization setting these extra modules must be initialized and trained, which increases architectural complexity and the mismatch with the pretrained model, adding extra linearization cost.
This is where Liger comes in. It is an efficient, concise, and general linearization technique that requires only a minimal fine-tuning cost to linearize pre-trained Transformer LLMs into gated recurrent structures, recovering over 93% of the original model's performance while achieving linear computational complexity for sequence modeling.
Method Description
Liger's core objective is to achieve model structure conversion through concise and low-cost training, directly migrating pre-trained LLM weights to a gated recurrent architecture, avoiding the high cost of training from scratch.
Simplify to Essence: Cleverly Using Model Parameter Redundancy
Linear recurrent models with gating mechanisms normally require independently designed gating modules, which introduces additional trainable parameters and increases model complexity. Liger instead exploits the inherent parameter redundancy in LLMs and repurposes the key projection to construct the gating mechanism:
Specifically, gating information is extracted directly from the key projection through a parameter-free pooling operation, so no new trainable parameters are needed. Because linear recurrent models remove the softmax operation, the unnormalized QK product can blow up numerically and fail to match the distribution of the original outputs, which is why linear recurrent models usually introduce a learnable feature mapping function to approximate softmax attention.
In Liger's concrete implementation, the feature mapping is simplified to softmax functions applied separately to Q and K, which normalizes the QK product and keeps it compatible with the original LLM's attention mechanism. At the same time, no trainable parameters are introduced: by fully reusing the LLM's weights, architectural complexity and mismatch are reduced and multi-stage training is avoided, further lowering linearization cost and improving model performance.
The Liger method is compatible with various linear recurrent model architectures with gating mechanisms, making it highly flexible and efficient.
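Below is a minimal sketch of one gated recurrent step in the spirit of this design. It is based on the description above rather than the released code: the exact pooling used to derive the gate and the precise state-update rule are assumptions, and real implementations operate on multi-head, chunked tensors.

import torch
import torch.nn.functional as F

def gated_step(q_t, k_t, v_t, state):
    # Gate derived from the key path via a parameter-free pooling (assumption:
    # mean-pool the key vector and squash to (0, 1)); no new trainable parameters.
    g_t = torch.sigmoid(k_t.mean())
    # Softmax feature map applied separately to q and k, standing in for a
    # learnable feature mapping and keeping the qk product normalized.
    q_t, k_t = F.softmax(q_t, dim=-1), F.softmax(k_t, dim=-1)
    # Gated recurrent update of a fixed-size d x d state.
    state = g_t * state + torch.outer(k_t, v_t)
    return q_t @ state, state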
Lightweight Fine-tuning: LoRA Helps Linear Structure Adaptation
After model structure conversion, Liger employs Low-Rank Adaptation (LoRA) technology to fine-tune the model to adapt to the linear recurrent model architecture.
Liger's linearization only changes the order of operations among the attention layer's Q, K, and V, using the right-product kernel trick to compute attention in linear time. Consequently, LoRA low-rank adaptation is applied only to the attention layer's QKV projection matrices, with no full-parameter fine-tuning of the whole model. The training objective is autoregressive next-token prediction, and the loss is the standard cross-entropy loss L = -Σ_t log p_θ(x_{t+1} | x_{≤t}).
LoRA lightweight fine-tuning allows Liger linearization to fully preserve LLM pre-training knowledge, reducing linearization costs and quickly recovering most of the performance.
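As a concrete illustration of the fine-tuning recipe, here is a minimal LoRA wrapper for a linear layer; the class, its hyperparameters, and the usage comment are illustrative assumptions rather than the released training code.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen pretrained linear layer plus a trainable low-rank update:
    # y = W x + (alpha / r) * B A x, where only A and B are trained.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # keep pretrained weights intact
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Hypothetical usage: wrap only the attention q/k/v projections of each block,
# e.g. attn.q_proj = LoRALinear(attn.q_proj), and train the linearized model
# with next-token cross-entropy while everything else stays frozen.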
Hybrid Mechanism: Liger Attention
To further enhance linearization performance, the paper proposes Liger Attention, a hybrid attention mechanism that combines Sliding Window Attention (SWA) with Gated Recurrent Modeling (GRM), mixing linear sequence modeling and softmax attention within a single layer while retaining linear computational complexity.
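The sketch below shows the shape of such an intra-layer hybrid: each token's output mixes a gated-recurrent branch with a sliding-window softmax branch. The decay constant, window size, and the simple additive mixing are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def hybrid_attention(q, k, v, window=64, decay=0.99):
    # q, k, v: (seq_len, d). Each output mixes a linear/gated recurrent branch
    # (fixed-size state) with local sliding-window softmax attention.
    n, d = q.shape
    out = torch.zeros_like(v)
    state = torch.zeros(d, d)
    for t in range(n):
        state = decay * state + torch.outer(F.softmax(k[t], dim=-1), v[t])
        o_lin = F.softmax(q[t], dim=-1) @ state           # gated recurrent branch
        s = max(0, t - window + 1)                        # window of recent tokens
        scores = (k[s:t + 1] @ q[t]) / d ** 0.5
        o_swa = F.softmax(scores, dim=0) @ v[s:t + 1]     # local softmax branch
        out[t] = o_lin + o_swa                            # simple additive mix (assumption)
    return out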
Liger can also be used to linearize inter-layer hybrid architectures efficiently: inserting one standard attention module after every seven gated recurrent modules both captures long-range dependencies and enhances the processing of key information through local attention, further improving model adaptability.
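The inter-layer layout can be expressed as a simple layer-type pattern; the 7:1 ratio comes from the text above, while the layer count and names below are placeholders.

# One softmax-attention layer after every seven gated-recurrent layers.
num_layers = 32  # placeholder depth
layer_types = [
    "softmax_attention" if (i + 1) % 8 == 0 else "gated_recurrent"
    for i in range(num_layers)
]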
Experimental Analysis
The authors compared Liger with various existing linearization methods. The results show that, at a lower training cost than the other methods, Liger recovers over 93% of the pre-trained Transformer's performance with only 20M training tokens. Across language modeling tasks it approaches or surpasses existing SOTA linearization methods and comes very close to Transformer-based LLMs such as Llama and Mistral.
Thanks to the architectural advantages of linear models, Liger's inference time increases linearly with sequence length. For a 16K sequence length, inference is 2 times faster than Flash Attention. When processing a 32K length sequence, Liger's GPU memory usage remains constant at 16.37GB, while the original Llama-3 based on Flash Attention could not complete inference due to out-of-memory (OOM).
Liger also scales well: across model sizes from 1B to 8B parameters, it consistently shows stable performance recovery.
At the same time, the Liger technique is highly flexible and general, proving very effective for linearizing various linear recurrent model architectures with gating mechanisms. This provides a shortcut for validating the effectiveness of emerging linear model architectures.
Please refer to the original paper for specific technical details and more result analysis.
In summary, Liger is an extremely efficient, concise, and general linearization technique that requires only a minimal fine-tuning cost to linearize pre-trained Transformer-based LLMs into gated recurrent structures.
It not only competes with or even surpasses original Transformer-based large language models in sequence modeling tasks but also benefits from the efficiency of linear model architectures, providing a promising path for more efficient deployment of large-scale LLMs with linear time inference and constant memory footprint.