In-depth Dissection of Large Models: From DeepSeek-V3 to Kimi K2, Understanding Mainstream LLM Architectures


The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, with a readership covering NLP master's and doctoral students, university faculty, and industry researchers.

The community's vision is to promote exchange and progress among academia, industry, and enthusiasts in natural language processing and machine learning, especially for beginners.

Source | Ahead of AI, Synced Review

Author | Sebastian Raschka

Seven years have passed since the GPT architecture was first proposed.

Looking back from GPT-2 (2019) to DeepSeek-V3 and Llama 4 (2024–2025), an interesting pattern emerges: despite continuous improvements in model capability, the overall architecture has remained remarkably consistent over these seven years.

Of course, there have still been many refinements in the details. For example, positional encoding has evolved from the original absolute positional embeddings to Rotary Positional Embedding (RoPE); attention has gradually shifted from standard Multi-Head Attention to the more efficient Grouped-Query Attention; and the GELU activation function has been replaced by SwiGLU.

However, have there been any "disruptive innovations" among these changes? Over seven years, has the architecture of large language models truly made a qualitative leap, or is it still continuously refining within the original framework?

This blog post is from Sebastian Raschka, a renowned AI researcher and blogger, and author of "Python Machine Learning".


The blog surveys 8 mainstream large language models, including Chinese models such as DeepSeek and Kimi. It carefully dissects each model's architectural design and innovations, and walks through how modern LLM architectures are built and how they are evolving.


Figure 1: Schematic diagrams of some LLM architectures covered in this article.

DeepSeek V3/R1

DeepSeek R1 caused a huge stir when it was released in January 2025.

DeepSeek R1 is an inference model built on the DeepSeek V3 architecture, which was originally launched in December 2024. Although this article focuses on architectures released in 2025, the author believes DeepSeek V3 should be included.

This section will focus on two key architectural techniques introduced by DeepSeek V3 that enhance its computational efficiency and make it stand out among many large language models.

If you are interested in DeepSeek V3's key technologies, please refer to the DeepSeek-V3 technical report.

Multi-Head Latent Attention (MLA)

Before discussing Multi-Head Latent Attention, it's worth mentioning Grouped-Query Attention (GQA), which has been widely adopted in recent years as a more computationally and parameter-efficient alternative to traditional Multi-Head Attention (MHA).

Here's a brief explanation of GQA: Unlike MHA where each attention head has its own set of keys and values, GQA groups multiple attention heads, allowing them to share the same key and value projections, thereby reducing memory usage.

As shown in Figure 2, assuming there are 2 key-value groups and 4 attention heads, attention heads 1 and 2 can share the first set of keys and values, while attention heads 3 and 4 share the second set. This approach reduces the total key and value computation, lowers memory usage, and improves efficiency.


Figure 2: Comparison of Multi-Head Attention (MHA) and Grouped-Query Attention (GQA).

The core idea of GQA is: by having multiple query heads share a set of keys and values, the total number of keys and values is reduced. This brings two main benefits:

  1. Reduced total model parameters;
  2. Reduced memory bandwidth usage for key and value tensors in the KV cache during inference, as fewer key-value pairs need to be accessed.
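To make this concrete, below is a minimal, self-contained sketch of grouped-query attention in PyTorch. The sizes and the `repeat_interleave` expansion of the shared key/value heads are illustrative assumptions, not the code of any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions, not taken from a real model)
batch, seq_len, d_model = 2, 16, 256
n_q_heads, n_kv_heads = 4, 2              # 4 query heads share 2 key/value groups
head_dim = d_model // n_q_heads

x = torch.randn(batch, seq_len, d_model)

# Queries get n_q_heads projections; keys/values only get n_kv_heads projections
W_q = nn.Linear(d_model, n_q_heads * head_dim, bias=False)
W_k = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
W_v = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

q = W_q(x).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
k = W_k(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
v = W_v(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Expand K/V so that each group is shared by n_q_heads // n_kv_heads query heads
group_size = n_q_heads // n_kv_heads
k = k.repeat_interleave(group_size, dim=1)   # (batch, n_q_heads, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
print(out.shape)   # torch.Size([2, 16, 256])
```

Note that the KV cache only needs to hold the `n_kv_heads` sets of keys and values per token, which is where the parameter and memory savings come from.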

The Multi-Head Latent Attention (MLA) introduced next provides a different memory-saving strategy, and it works more closely with the KV cache mechanism.

Unlike GQA, which "shares key-value heads," MLA compresses key and value tensors into a low-dimensional latent space before storing them in the KV cache. During inference, these compressed tensors are re-projected back to their original dimensions before use (as shown in Figure 3). Although this process introduces an additional matrix multiplication, it significantly saves memory usage.


Figure 3: Comparison of Multi-Head Latent Attention (MLA, applied in DeepSeek V3 and R1) and conventional Multi-Head Attention (MHA).
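To illustrate just the compression idea, here is a minimal sketch with made-up dimensions; it omits details of DeepSeek's actual implementation (such as query compression and the decoupled RoPE path):

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions)
batch, seq_len = 2, 16
d_model, n_heads, head_dim, d_latent = 256, 4, 64, 32

x = torch.randn(batch, seq_len, d_model)

# Down-projection: this small latent vector is what gets stored in the KV cache
W_down_kv = nn.Linear(d_model, d_latent, bias=False)

# Up-projections: re-expand the latent into full-size keys and values at use time
W_up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False)
W_up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False)

kv_latent = W_down_kv(x)    # (2, 16, 32)  -> cached: 32 numbers per token
k = W_up_k(kv_latent)       # (2, 16, 256) -> reconstructed when attention runs
v = W_up_v(kv_latent)       # (2, 16, 256)

print(kv_latent.shape, k.shape, v.shape)
```

In this toy setup, instead of caching 2 × 4 × 64 = 512 values per token for keys and values, only the 32-dimensional latent is cached, at the cost of one extra matrix multiplication whenever the keys and values are reconstructed.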

It is worth noting that MLA is not a technology first introduced by DeepSeek V3; its predecessor, DeepSeek V2, already used (and even first proposed) this mechanism.

MLA is a clever trick that significantly reduces the memory footprint of the KV cache while preserving modeling quality; in fact, it even slightly outperforms traditional MHA in modeling performance. Next, let's move on to the next architectural component.

Mixture-of-Experts (MoE)

Another important component of the DeepSeek architecture that deserves close attention is its application of MoE (Mixture-of-Experts) layers. Although MoE was not first invented by DeepSeek, this technology has made a comeback in 2025, and its presence can be seen in many architectures introduced later in this article.

The core idea of MoE is to replace each FeedForward module in the Transformer with multiple "expert layers" (each expert layer is essentially a FeedForward network). That is, the original single FeedForward structure is replaced by multiple parallel FeedForward submodules, as shown in Figure 5.


Figure 5: The right figure shows the structure of the Mixture-of-Experts (MoE) module in DeepSeek V3/R1, compared to the ordinary FeedForward module used in standard LLMs in the left figure.

The FeedForward module within the Transformer block (the dark gray block in the figure above) usually accounts for a large portion of the model's total parameters.

Therefore, replacing a single FeedForward module with multiple FeedForward modules (i.e., building an MoE layer) significantly increases the model's total parameter count. The key trick, however, is that not all experts are activated for every token; instead, a "router" selects a small subset of experts for each token. This design gives the model an enormous parameter capacity for absorbing knowledge during training, while sparse activation keeps the computational overhead of inference comparatively low.

For example: DeepSeek-V3 has 256 experts in each MoE module, with a total parameter count of up to 671 billion. However, during inference, only 9 experts (1 shared expert + 8 experts selected by the router) are actually activated for each token.
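The following toy sketch illustrates top-k routing plus a shared expert. The sizes are deliberately tiny and hypothetical, and DeepSeek's actual router additionally uses load-balancing techniques that are omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, top_k=2):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared_expert = make_ffn()                 # always active
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (n_tokens, d_model)
        scores = self.router(x)                         # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)                # weights of selected experts

        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    routed[mask] += top_w[mask, slot, None] * expert(x[mask])

        # Only top_k routed experts + 1 shared expert run per token,
        # even though all n_experts contribute to the total parameter count.
        return routed + self.shared_expert(x)

tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)    # torch.Size([10, 64])
```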


Figure 6: Annotated diagram of DeepSeekMoE

For more details on DeepSeek's MoE design, please refer to the corresponding paper.

The DeepSpeedMoE paper first pointed out that introducing a "shared expert" can noticeably improve overall modeling performance. The likely reason is that common or repetitive patterns do not need to be learned separately by multiple experts; the shared expert can handle them uniformly, freeing the capacity of the other experts to focus on more specialized knowledge patterns.

OLMo 2

The OLMo series models, released by the non-profit Allen Institute for AI, have attracted widespread attention due to their high transparency in training data, code, and technical reports.

OLMo models have clear structures and standardized designs. More importantly, due to their extremely high transparency, they provide an excellent reference paradigm for the development of large language models.

What are the notable architectural design choices in OLMo 2?

They mainly concern normalization: the placement of the RMSNorm layers and the introduction of QK-Norm (Query-Key normalization).

Another point worth mentioning is that OLMo 2 still uses the traditional Multi-Head Attention (MHA) mechanism, and has not adopted newer attention structures such as MLA or GQA.

For more detailed information about OLMo 2, please refer to the OLMo 2 paper.

Normalization Layer Placement Selection

Overall, OLMo 2 largely follows the design of the original GPT model in terms of architecture, similar to most current mainstream large language models. But it also has some notable differences, starting with the design of the normalization layers.

Like Llama, Gemma, and most modern LLMs, OLMo 2 replaces LayerNorm with RMSNorm as the normalization method.

What is truly worth discussing is the placement of RMSNorm. In the original Transformer architecture, the two normalization layers were placed after the attention module and the feed-forward module, respectively. This structure is called Post-LN or Post-Normalization.

However, GPT and most subsequent LLM models place the normalization layers before the attention module and the feed-forward module. This approach is called Pre-LN or Pre-Normalization.

The figure below shows the structural comparison of Post-Norm and Pre-Norm:


Figure 8: Comparison of Post-Norm, Pre-Norm, and the Post-Norm variant used by OLMo 2.

As early as 2020, Xiong et al. pointed out that Pre-LN can lead to more stable gradients during model initialization. In addition, researchers also mentioned that Pre-LN can be trained normally even without using learning rate warm-up, which is usually difficult for Post-LN to achieve.

In OLMo 2, the normalization layers are not placed before the attention layer and the feed-forward network but after them, as shown in the figure above. However, unlike the original Transformer architecture, these normalization layers remain inside the residual connections (skip connections).
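The three placements can be sketched roughly as follows. This is a schematic with stand-in modules, not OLMo 2's actual code (and `torch.nn.RMSNorm` requires a recent PyTorch version):

```python
import torch
import torch.nn as nn

d = 64
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)   # stand-in sublayer
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
norm1, norm2 = nn.RMSNorm(d), nn.RMSNorm(d)

def pre_norm_block(x):
    # Pre-Norm (GPT-2, Llama 3, ...): normalize the sublayer *input*
    h = norm1(x)
    x = x + attn(h, h, h, need_weights=False)[0]
    return x + ffn(norm2(x))

def post_norm_block(x):
    # Original Transformer Post-Norm: normalize *after* the residual addition
    x = norm1(x + attn(x, x, x, need_weights=False)[0])
    return norm2(x + ffn(x))

def olmo2_block(x):
    # OLMo 2 variant: normalize the sublayer *output*, inside the residual branch
    x = x + norm1(attn(x, x, x, need_weights=False)[0])
    return x + norm2(ffn(x))

x = torch.randn(2, 16, d)
print(olmo2_block(x).shape)   # torch.Size([2, 16, 64])
```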

So, why did they adjust the position of the normalization layers?

The reason is that this design helps improve training stability, as will be shown in the figure below.


Figure 9: Comparison of training stability between Pre-Norm (adopted by GPT-2, Llama 3, etc.) and the Post-Norm variant adopted by OLMo 2.

Unfortunately, the results shown in this figure include both the normalization order adjustment and QK-Norm, the latter being an independent concept. Therefore, it is difficult to clearly determine how much the change in normalization position contributed to the improvement in training stability.

QK-Norm

QK-Norm is essentially another RMSNorm layer, placed inside the Multi-Head Attention module, normalizing the queries and keys before Rotary Positional Embedding (RoPE) is applied.
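As a minimal sketch (illustrative shapes; the RoPE step itself is omitted and only marked with a comment):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions)
batch, n_heads, seq_len, head_dim = 2, 4, 16, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# QK-Norm: an extra RMSNorm over the head dimension of queries and keys,
# applied *before* RoPE (the RoPE rotation is omitted in this sketch)
q_norm = nn.RMSNorm(head_dim)
k_norm = nn.RMSNorm(head_dim)

q = q_norm(q)
k = k_norm(k)
# ... RoPE would be applied to q and k here ...

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([2, 4, 16, 64])
```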

As mentioned earlier, QK-Norm combined with Post-Norm helps stabilize the training process. For more details on QK-Norm, please refer to the paper that first proposed it.

In short, the main design highlights in the OLMo 2 architecture are the placement of RMSNorm: placing RMSNorm after the attention module and the feed-forward module (a variant of Post-Norm), and introducing an additional RMSNorm for query and key in the attention mechanism (i.e., QK-Norm). These two modifications, used together, help stabilize the training loss.

The figure below shows the architectural comparison between OLMo 2 and Llama 3; it can be seen that, apart from OLMo 2 still using traditional MHA instead of GQA, the overall structures of the two are relatively similar.


Figure 10: Architectural comparison of Llama 3 and OLMo 2.

Gemma 3

Google's Gemma series models have always performed very well, but compared to popular models like the Llama series, they seem to always receive slightly less attention.

Gemma 3 uses another "trick" in its architecture to reduce computational costs: sliding window attention.

With the help of the sliding window attention mechanism, the Gemma 3 team significantly reduced the memory requirements of the KV cache, as shown in the figure below.


Figure 11: KV cache memory saving effect of Gemma 3.

If regular self-attention is regarded as a "global" attention mechanism because each element in the sequence can access all other elements, then sliding window attention can be regarded as a "local" attention mechanism because it limits the context range around the current query position. The figure below shows the principle of this mechanism.


Figure 12: Comparison of conventional attention mechanism (left) and sliding window attention mechanism (right).

It should be noted that the sliding window attention mechanism can be used with both multi-head attention and grouped-query attention (GQA); Gemma 3 uses GQA.

As mentioned above, sliding window attention is also known as "local attention" because its focus context is limited to a local window around the current query position, and this window slides as the query position moves. In contrast, the conventional attention mechanism is "global," where each token can access all other tokens.
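A small sketch of the difference between a global causal mask and a sliding-window causal mask (window size and sequence length are arbitrary illustration values):

```python
import torch

seq_len, window = 8, 4   # illustrative values

pos = torch.arange(seq_len)
# Global causal mask: token i may attend to all tokens j <= i
global_mask = pos[None, :] <= pos[:, None]

# Sliding-window causal mask: token i may only attend to the last `window` tokens,
# i.e. tokens j with i - window < j <= i
local_mask = global_mask & (pos[:, None] - pos[None, :] < window)

print(global_mask.int())
print(local_mask.int())
# With sliding-window attention, each such layer only ever needs to keep the last
# `window` keys and values in its KV cache, which is where the memory saving comes from.
```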

Although sliding window attention is the most prominent feature of the Gemma 3 architecture, as a supplement to the previous OLMo 2 section, here is a brief introduction to the placement of normalization layers in Gemma 3.

A small but interesting detail is that Gemma 3 uses both Pre-Norm and Post-Norm forms of RMSNorm around its GQA module.

This is similar to Gemma 2's approach but is still worth emphasizing because it differs from the following mainstream practices:

  1. Original Transformer architecture using Post-Norm;
  2. Pre-Norm popularized by GPT-2 and adopted by many subsequent architectures;
  3. The special Post-Norm variant seen in OLMo 2 earlier.

This dual normalization strategy of Gemma 3 demonstrates an unusual normalization design choice, which may be related to its trade-off between inference efficiency and training stability.


Figure 14: Architectural comparison of OLMo 2 and Gemma 3; note the additional normalization layers in Gemma 3.

This placement of normalization layers is relatively intuitive because it combines the advantages of Pre-Norm and Post-Norm.

The author believes that adding a bit more normalization does no harm.
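Schematically, the placement looks roughly like this (a sketch with stand-in modules, not Gemma's actual code):

```python
import torch
import torch.nn as nn

d = 64
attn = nn.Identity()   # stand-in for the grouped-query attention sublayer
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
pre_attn_norm, post_attn_norm = nn.RMSNorm(d), nn.RMSNorm(d)
pre_ffn_norm, post_ffn_norm = nn.RMSNorm(d), nn.RMSNorm(d)

def gemma3_block(x):
    # RMSNorm both before and after each sublayer, inside the residual branch
    x = x + post_attn_norm(attn(pre_attn_norm(x)))
    x = x + post_ffn_norm(ffn(pre_ffn_norm(x)))
    return x

x = torch.randn(2, 16, d)
print(gemma3_block(x).shape)   # torch.Size([2, 16, 64])
```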

For more details on Gemma 3, please refer to the Gemma 3 technical report.

Mistral Small 3.1

Mistral Small 3.1 24B was released in March 2025, shortly after Gemma 3. It is noteworthy because it outperforms Gemma 3 27B in multiple benchmarks while offering faster inference.

The main reason for Mistral Small 3.1's lower inference latency compared to Gemma 3 is likely its custom tokenizer, smaller KV cache, and fewer layers. Besides that, it generally adopts a standard architecture, as shown in the figure below.


Figure 16: Architectural comparison of Gemma 3 27B and Mistral Small 3.1 24B.

Interestingly, earlier Mistral models used sliding window attention, but this design seems to have been abandoned in Mistral Small 3.1.

Unlike Gemma 3, which uses sliding windows, Mistral uses conventional GQA.

The author speculates that although sliding window attention can reduce memory usage, it does not necessarily reduce inference latency, which is a priority performance metric for Mistral Small 3.1.

Llama 4

The detailed introduction of Mixture of Experts (MoE) models earlier comes in handy again.

Llama 4 also adopts the MoE architecture, while the rest of its design follows a more standard approach, with the overall architecture being very similar to DeepSeek-V3, as shown in the figure below.


Figure 17: Architectural comparison of DeepSeek V3 (671 billion parameters) and Llama 4 Maverick (400 billion parameters).

Although the overall architecture of Llama 4 Maverick looks very similar to DeepSeek-V3, there are still some notable differences.

First, Llama 4 uses GQA, as its predecessor models did, while DeepSeek-V3 uses MLA.

Both models are very large architectures, with DeepSeek-V3 having approximately 68% more total parameters than Llama 4 Maverick. However, in terms of the number of parameters actually involved in computation during inference, DeepSeek-V3 activates 37 billion parameters, which is more than twice that of Llama 4 Maverick (17 billion).

In terms of MoE settings, Llama 4 Maverick uses a more traditional architecture: only 2 experts are activated at a time, and each expert has a hidden layer dimension of 8192; while DeepSeek-V3 activates 9 experts at a time, with each expert having a hidden layer dimension of 2048. In addition, DeepSeek inserts MoE layers in every Transformer Block except the first 3 layers, while Llama 4 alternately uses MoE modules and dense modules, i.e., adding MoE every other Block.
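The difference in where the MoE layers sit can be sketched as follows (made-up layer count; purely illustrative):

```python
# Illustrative sketch of MoE layer placement (made-up layer count)
n_layers = 12

# DeepSeek-V3 style: dense FeedForward in the first 3 blocks, MoE in all later ones
deepseek_style = ["dense" if i < 3 else "moe" for i in range(n_layers)]

# Llama 4 Maverick style: alternate dense and MoE blocks
llama4_style = ["dense" if i % 2 == 0 else "moe" for i in range(n_layers)]

print(deepseek_style)   # ['dense', 'dense', 'dense', 'moe', 'moe', ...]
print(llama4_style)     # ['dense', 'moe', 'dense', 'moe', ...]
```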

One clear point is that MoE architecture has seen significant development and popularization in 2025.

Qwen3

The Qwen team has consistently released high-quality open-source large language models. In the NeurIPS 2023 LLM Efficiency Challenge, all winning solutions were built on Qwen2.

And now the Qwen3 series has once again placed itself among the top models at its respective parameter scales.

Qwen3 (Dense)

Let's first look at the Qwen3 dense model architecture. As of this writing, Qwen3 0.6B is possibly one of the smallest open-weight models of the current generation.

When run locally, it achieves a high generation rate (tokens per second) with low VRAM usage, making it very suitable for lightweight deployment. And because of its small parameter count, it is also very friendly for those who want to run local training experiments (e.g., for teaching purposes).


Figure 18: Architectural comparison of Qwen3 0.6B and Llama 3 1B. It can be seen that Qwen3 architecture is deeper (has more transformer layers), while Llama 3 architecture is wider (has more attention heads).

Qwen3 (MoE)

As mentioned earlier, the Qwen3 series also includes two MoE (sparse) variants. So why does a series like Qwen3 release both regular (dense) and MoE (sparse) versions?

As stated at the beginning of this article, MoE variants are designed to reduce the inference cost of large-scale foundation models. Providing both Dense and MoE versions allows users to flexibly choose based on different goals and resource constraints.

By releasing both types of models simultaneously, the Qwen3 series can cover a wider range of application scenarios: dense models emphasize robustness, simplicity, and fine-tunability; MoE models are geared towards inference efficiency in large-scale deployments.


Figure 19: Architectural comparison of DeepSeek-V3 and Qwen3 235B-A22B.

As shown in the figure above, DeepSeek-V3 and Qwen3 235B-A22B are very similar in architecture. However, it is worth noting that the Qwen3 MoE models dropped the shared expert (earlier models such as Qwen2.5-MoE did use a shared expert mechanism).

Unfortunately, the Qwen3 team did not publicly state the reason for abandoning shared experts.

The author speculates that it might be because after increasing the number of experts from 2 in Qwen2.5-MoE to 8 in Qwen3, training stability no longer relied on shared experts. Therefore, they chose to omit shared experts to save additional computation and VRAM overhead (avoiding increasing from 8 to 8+1 experts). However, this does not explain why DeepSeek-V3 still retains the shared expert mechanism to this day.

SmolLM3

SmolLM3 may not be as widely known as the other large models discussed in this article, but the author believes it is still worth covering because it delivers excellent modeling performance at a size of only about 3 billion parameters, positioning it between Qwen3's 1.7-billion and 4-billion-parameter models, as shown in the figure below.

In addition, SmolLM3, like OLMo, also publicly discloses a large amount of training details, which is uncommon in the industry, and therefore particularly commendable.


Figure 20: Comparison of SmolLM3's win rate against Qwen3 1.7B and 4B, and Llama 3 3B and Gemma 3 4B.

As shown in the architectural comparison figure below, SmolLM3's overall structure is relatively standard. However, perhaps the most interesting point is that it uses a No Positional Encoding (NoPE) mechanism.


Figure 21: Side-by-side architectural comparison of Qwen3 4B and SmolLM3 3B.

In the context of LLMs, NoPE is an older idea: remove explicit positional information injection entirely, whether the absolute positional embeddings commonly used in early GPT architectures or the now-mainstream RoPE (Rotary Positional Embedding).

In Transformer-based language models, positional encoding is usually necessary because the self-attention mechanism is by default insensitive to the order of tokens in the input sequence, meaning each token is processed independently. To solve this problem, absolute positional embeddings add an additional embedding layer to combine positional information with token embeddings, thereby providing the model with sequence order awareness.


Figure 22: Illustrates the mechanism of absolute positional embeddings.

In contrast, RoPE injects positional information by rotating the Query and Key vectors according to the token's position.
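A minimal sketch of that rotation is shown below (the "split-half" formulation; real implementations precompute and cache the cos/sin tables and handle dtypes more carefully):

```python
import torch

def rope(x, base=10000.0):
    """Apply a rotary positional embedding to x of shape (..., seq_len, head_dim)."""
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    # Per-channel rotation frequencies, decreasing geometrically with channel index
    freqs = 1.0 / torch.pow(base, torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()        # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]        # split channels into two halves
    # Rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 4, 16, 64)   # (batch, heads, seq_len, head_dim), illustrative
print(rope(q).shape)            # torch.Size([1, 4, 16, 64])
```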

In a NoPE layer, no positional information is injected at all: no fixed, learned, or relative positional encoding of any kind.

Even without explicit positional encoding, the model can still know which tokens are ahead through the causal attention mask. This mask prevents each token from accessing tokens after it, thus ensuring the correctness of the autoregressive order. That is, a token at position t can only "see" tokens at positions less than or equal to t.

Beyond simply dropping positional encoding, NoPE also shows advantages in length generalization: as the input sequence grows longer, model performance degrades less, as shown in the figure below.


Figure 23: Shows the superior performance of NoPE in length generalization.

That said, the SmolLM3 team did not apply NoPE in every layer in practice; as a compromise, they use NoPE in every 4th layer (or rather, omit RoPE in every 4th layer).
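In configuration terms this is just a per-layer switch, roughly like the sketch below (illustrative layer count; the variable names are hypothetical):

```python
# Sketch: omit RoPE (i.e., use NoPE) in every 4th layer; illustrative layer count
n_layers = 12
use_rope = [(layer_idx + 1) % 4 != 0 for layer_idx in range(n_layers)]
print(use_rope)
# [True, True, True, False, True, True, True, False, True, True, True, False]
# Where use_rope is False, queries and keys are used as-is; the causal mask alone
# provides the model with ordering information.
```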

For more details on NoPE, please refer to the original NoPE paper.

Kimi K2

Kimi K2 has recently caused a huge stir in the AI community due to its excellent performance. As an open-source weight model, it performs comparably to top-tier closed-source models like Google's Gemini, Anthropic's Claude, and OpenAI's ChatGPT in multiple benchmarks.

One notable aspect is that it is the first production-grade model at this scale to be trained with a variant of the Muon optimizer instead of the traditional AdamW.

To the author's knowledge, this is the first time the Muon optimizer has been applied in an ultra-large model (previously, scalability was only demonstrated on models up to 16 billion parameters). This choice resulted in extremely ideal training loss curves, which is likely a significant reason why Kimi K2 stood out in the aforementioned benchmarks.

Kimi K2's parameter count reaches 1 trillion (1T), which is undoubtedly impressive. It is likely the largest LLM of this generation (as of this writing), setting aside the unreleased Llama 4 Behemoth, proprietary closed-source models, and Google's 1.6-trillion-parameter Switch Transformer, which uses a different (encoder-decoder) architecture.

Architecturally, Kimi K2 builds on the DeepSeek-V3 architecture discussed at the beginning of this article, scaled up and refined, as shown in the figure below. This also brings things full circle: Kimi K2 pushes the design philosophy of DeepSeek-V3 to its extreme.

Figure: Architectural comparison of DeepSeek V3 and Kimi K2.

As shown in the figure above, Kimi K2's overall architecture is largely consistent with DeepSeek V3, with the main differences being:

  • Kimi K2 uses more experts in the MoE module,
  • and fewer attention heads in the MLA module.

Even after several years, LLM releases are still full of surprises. New techniques keep things exciting, and we look forward to further architectural improvements in large models.

For more information, please refer to the original blog post.
