Microsoft and Others Propose 'Chain-of-Model', a New Paradigm That Matches Transformer Performance with Better Scalability and Flexibility

Machine Heart Report

Editor: Chen Chen

With the advent of large language models (LLMs), scaling up the Transformer architecture has been seen as a promising path to reshaping the AI landscape and achieving strong performance across a wide range of tasks. Exploring how to scale Transformer models has therefore become a growing trend in both industry and academia.

Against this backdrop, the parameter scale of LLMs has grown exponentially, from billions to trillions. This explosive growth makes training extremely expensive and leaves models unable to offer inference options suited to different deployment environments.

Given these ever-growing scaling demands, how to develop and effectively use LLMs that can handle user instructions across diverse scenarios has become an open and pressing challenge for the community.

Currently, extending LLM architectures presents the following issues:

  • Unlike human intelligence, which acquires new knowledge incrementally, existing scaling strategies cannot preserve knowledge already learned at smaller scales and always require training from scratch, which is inefficient.

  • Existing LLM architectures (such as dense models or MoE) always activate a fixed number of parameters and lack mechanisms for dynamically adjusting capacity to the difficulty of the problem being solved.

In this paper, researchers from Microsoft, Fudan University, Zhejiang University, and ShanghaiTech University propose a new concept, CoR (Chain-of-Representation), which generalizes the notion of representation to a much broader scope.

  • Paper Title: Chain-of-Model Learning for Language Model

  • Paper Address: https://arxiv.org/pdf/2505.11820

Specifically, the paper observes that any representation can be viewed, along the hidden dimension, as a combination of multiple sub-representations. The paper defines this combination as a chain of representations, with each sub-representation corresponding to one chain. Under this definition, activating different numbers of preceding chains yields features that encode knowledge at different granularities, referred to as 'scales', as shown in Figure 1.

[Figure 1: The Chain-of-Representation (CoR) concept]

The crucial question is therefore how to establish connections between CoR features so that features can be transformed across scales.

To achieve this, the paper proposes a novel learning paradigm called Chain-of-Model (CoM) for modeling CoR features.

Its core idea is to introduce causal dependencies between different scales, ensuring that each scale can only use information from its preceding scales. To this end, the paper proposes Chain-of-Layer (CoL), which reconstructs each network layer in terms of CoR features.

Based on the CoM framework, the paper applies the CoL idea to every layer of the Transformer, reconstructing the language-model architecture as the Chain-of-Language-Model (CoLM).

Furthermore, following the CoL principle, the paper introduces a key-value sharing mechanism in the attention module that requires all keys and values to be computed within the first chain, yielding a variant named CoLM-Air. This mechanism gives CoLM-Air even greater scalability and flexibility.

Results on multiple benchmarks show that the CoLM family achieves performance comparable to standard Transformers while offering better scalability and flexibility.

Method Introduction

First, the definition of Chain-of-Representation:

[Definition 1: Chain-of-Representation]

According to Definition 1, each chain corresponds to one sub-representation in CoR. By activating only the first few chains, CoR can encode different scales; a single representation can thus encode n different scales. When n = 1, CoR reduces to the original representation. Figure 1 illustrates the concept.
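To make the definition concrete, here is a minimal PyTorch sketch (our own illustration, not code from the paper) of viewing a hidden vector as a chain of sub-representations and selecting a scale by keeping only the first few chains; the chain widths are arbitrary assumptions.

```python
import torch

# View a 768-dim hidden representation as a chain of n = 3 sub-representations.
chain_dims = [256, 256, 256]                      # illustrative chain widths
hidden = torch.randn(1, sum(chain_dims))          # full representation (scale n)

# Splitting along the hidden dimension gives the individual chains.
chains = torch.split(hidden, chain_dims, dim=-1)

def activate_scale(chains, i):
    """Scale i uses only the first i chains (in the sense of Definition 1)."""
    return torch.cat(chains[:i], dim=-1)

small = activate_scale(chains, 1)    # 256-dim representation
medium = activate_scale(chains, 2)   # 512-dim representation
full = activate_scale(chains, 3)     # 768-dim, identical to `hidden`
```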

Based on this definition, the challenge becomes how to design layers that establish connections between CoR inputs and CoR outputs, achieving multi-scale feature transformation while ensuring that the output features still satisfy the CoR criterion of Definition 1.

This requires that each scale use only information from its preceding scales. To enforce this, the paper introduces Chain-of-Layer (CoL), which injects this causality into the hidden states of CoR, as shown below:

[Definition 2: Chain-of-Layer]

CoL has three fundamental properties: universality, causality, and composability.

Most importantly, CoL is composable: stacking multiple CoL layers preserves the CoL properties. This makes it possible to extend CoL from the layer level to the model level.
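As a rough sketch of what a CoL-style layer can look like (an illustration under our own assumptions, not the paper's implementation), the linear case below makes the i-th output chain depend only on input chains 0..i, which is exactly the causality constraint:

```python
import torch
import torch.nn as nn

class ChainOfLinear(nn.Module):
    """Sketch of a Chain-of-Layer linear: output chain i reads only input
    chains 0..i, giving a block lower-triangular weight structure."""

    def __init__(self, in_chains, out_chains):
        super().__init__()
        assert len(in_chains) == len(out_chains)
        self.in_chains = list(in_chains)
        # blocks[i][j] maps input chain j into output chain i (only j <= i exists).
        self.blocks = nn.ModuleList([
            nn.ModuleList([
                nn.Linear(in_chains[j], out_chains[i], bias=(j == i))
                for j in range(i + 1)
            ])
            for i in range(len(out_chains))
        ])

    def forward(self, x):
        xs = torch.split(x, self.in_chains, dim=-1)
        outs = []
        for i, row in enumerate(self.blocks):
            # Output chain i sums contributions from input chains 0..i only.
            outs.append(sum(block(xs[j]) for j, block in enumerate(row)))
        return torch.cat(outs, dim=-1)
```

Because every output chain ignores all later input chains, running only the first k input and output chains gives the same result as running the full layer and then truncating; stacking such layers (composability) preserves this property, which is what lets a prefix of chains behave as a standalone smaller network.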

The paper then gives a third definition:

[Definition 3: Chain-of-Model]

According to Definition 3, a model that satisfies the CoM criterion inherits all the properties of CoL, such as universality and causality. In other words, any existing model can be regarded as a CoM with n = 1. CoM can integrate multiple sub-models of different scales into a single model and supports expansion on top of existing models, directly giving foundation models better scalability and flexibility.

Next, the paper describes how CoM is applied to language models, covering the linear layer and each Transformer module (embedding, self-attention, feed-forward, normalization) as well as the training objective; the resulting model is named CoLM (Chain-of-Language-Model). On top of CoLM, the paper introduces the key-value sharing mechanism, named CoLM-Air, which provides additional flexibility.

Figure 2 compares a standard linear layer with the Chain-of-Linear layer.

[Figure 2: Linear layer vs. Chain-of-Linear layer]

Figure 3 illustrates the difference between standard attention and Chain-of-Attention:

[Figure 3: Attention vs. Chain-of-Attention]
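The key-value sharing behind CoLM-Air can be sketched as follows (a simplified single-head illustration based on our reading of the description above; the paper's Chain-of-Attention additionally chains the query and output projections):

```python
import torch
import torch.nn.functional as F

def colm_air_attention(x, first_chain_dim, wq, wk, wv):
    """Single-head sketch: queries come from all activated chains, but keys
    and values are computed only from the first chain, so the KV cache is
    identical regardless of how many chains are used."""
    first = x[..., :first_chain_dim]                 # first chain only
    q = x @ wq                                       # queries: all active chains
    k = first @ wk                                   # keys: first chain only
    v = first @ wv                                   # values: first chain only
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    # standard causal mask over the sequence dimension
    causal = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Since the keys and values never touch the later chains, the KV cache computed at the smallest scale can be reused unchanged by larger scales, which is what enables the prefilling speedups and cache transfer discussed in the experiments.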

Readers interested in this section can refer to the original paper for more details.

Experimental Results

The results in Table 1 indicate that CoLM achieves results comparable to the baselines while providing faster prefilling and greater flexibility.

[Table 1: Comparison with baselines]

Considering the universality and causality of CoM, any model with a single chain can be regarded as a special case of CoM and can be extended into a multi-chain structure. The paper therefore proposes Chain Expansion: take a fully trained model as the initial chain and expand it by adding new chains, as sketched below.
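For intuition, here is a minimal sketch of this expansion step for a single linear layer (reusing the hypothetical ChainOfLinear class sketched earlier; the chain widths and the assumption that the pretrained layer has a bias are ours):

```python
import torch.nn as nn

def expand_linear(pretrained: nn.Linear, new_in: int, new_out: int) -> "ChainOfLinear":
    """Wrap a trained linear layer as chain 0 of a two-chain ChainOfLinear
    and append one freshly initialized chain (Chain Expansion sketch)."""
    layer = ChainOfLinear(
        in_chains=[pretrained.in_features, new_in],
        out_chains=[pretrained.out_features, new_out],
    )
    # The chain-0 -> chain-0 block inherits the pretrained weights, so the
    # expanded layer reproduces the original model when only chain 0 is active.
    layer.blocks[0][0].load_state_dict(pretrained.state_dict())
    return layer
```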

To validate this, the paper selects two LLaMA variants (TinyLLaMA-v1.1 and LLaMA-3.2-1B) as the initial chains for expansion.

The results in Table 2 show improvements of 0.92 and 0.14 points over TinyLLaMA-v1.1 and LLaMA-3.2-1B, respectively. Since LLaMA-3.2-1B is a stronger baseline, larger gains would require more compute, yet the method still improves it under a limited budget. Overall, these results indicate that the method remains effective at improving strong baselines even under resource constraints.

[Table 2: Chain Expansion results]

Elastic inference aims to provide dynamic inference capabilities that match the needs of different deployment scenarios. The results in Table 3 further highlight CoLM's potential for elastic inference.

[Table 3: Elastic inference results]

As Figure 5 shows, CoLM-Air achieves faster prefilling than a LLaMA model with a similar parameter count, and the speedup grows as the sequence length increases. This demonstrates that CoLM-Air can effectively accelerate the prefilling stage.

[Figure 5: Prefilling speed vs. sequence length]

Benefiting from the causal property of the CoM architecture, CoLM is composed of multiple chained modules, where each chain can inherit the capabilities of its preceding chains. Building on this, the paper proposes Chain Tuning: fine-tune only the later chains while freezing the first few. By keeping the initial chains' parameters intact, this both reduces tuning costs and effectively mitigates catastrophic forgetting.

Furthermore, under the CoLM-Air configuration with the first chain frozen, the key-value pairs produced by the fine-tuned model can be seamlessly transferred back to the original model without additional computation. Experiments show that Chain Tuning needs to fine-tune only about 42% of the model parameters to improve performance, and it is compatible with parameter-efficient fine-tuning methods such as LoRA.
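In practice, Chain Tuning amounts to switching off gradients for the parameters that belong to the frozen chains. The sketch below assumes a hypothetical naming convention in which each parameter's name records its chain index; the real CoLM code may organize this differently.

```python
import torch.nn as nn

def chain_tune_setup(model: nn.Module, frozen_chains=(0,)):
    """Freeze the parameters of the listed chains and leave the rest trainable
    (hypothetical naming: parameter names contain 'chains.<i>.')."""
    for name, param in model.named_parameters():
        param.requires_grad = not any(f"chains.{i}." in name for i in frozen_chains)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"fine-tuning {trainable / total:.0%} of parameters")
```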



Main Tag: AI Architecture

Sub Tags: Large Language Models, Deep Learning Research, Model Scaling, Transformer Models

