Must-Read: In-depth Comparison of Mainstream LLM Architectures, Covering Llama, Qwen, DeepSeek, and Six Other Models

Large Language Model (LLM) Architecture Comparison

I came across an exceptionally well-written article – a comparison of large language model architectures. The article covers the architectures of Llama-3.2, Qwen3-4B, SmolLM3-3B, DeepSeek-V3, Qwen3-235B-A22B, and Kimi-K2, and discusses their differences and advantages in detail. I highly recommend it to anyone studying large models!

English article: https://sebastianraschka.com/blog/2025/the-big-llm-architecture-comparison.html

More than seven years have passed since the initial GPT architecture emerged. At first glance, from GPT-2 in 2019 to DeepSeek-V3 and Llama 4 in 2024-2025, one might be surprised by how structurally similar these models remain.

Of course, positional embeddings have evolved from absolute positional encoding to Rotary Positional Embeddings (RoPE), Multi-Head Attention (MHA) has largely been replaced by Grouped-Query Attention (GQA), and more efficient activation functions like SwiGLU have replaced GELU, among others. But beneath these subtle improvements, are we truly seeing groundbreaking changes, or merely refinements upon existing architectures?

Comparing different LLMs and identifying the key factors that influence their performance (for better or worse) is extremely challenging: datasets, training techniques, and hyperparameters vary greatly and are often poorly documented.

However, I believe it is still valuable to delve into the structural changes of the architectures themselves to understand the latest developments among LLM developers in 2025 (some architectures are shown in Figure 1).

img

Figure 1: Some architectures covered in this article.

Therefore, in this article, the author will not discuss benchmark performance or training algorithms, but rather focus on the architectural evolution of current mainstream open-source models.

Table of Contents

1. DeepSeek V3/R1

1.1 Multi-Head Latent Attention (MLA)

1.2 Mixture-of-Experts (MoE)

1.3 DeepSeek Summary

2. OLMo 2

2.1 Normalization Layer Placement

2.2 QK-Norm

2.3 OLMo 2 Summary

3. Gemma 3

3.1 Sliding Window Attention

3.2 Normalization Layer Placement in Gemma 3

3.3 Gemma 3 Summary

3.4 Gemma 3n

4. Mistral Small 3.1

5. Llama 4

6. Qwen3

6.1 Qwen3 (Dense Models)

6.2 Qwen3 (MoE)

7. SmolLM3

7.1 No Positional Embeddings (NoPE)

8. Kimi 2

1. DeepSeek V3/R1

As you may have heard multiple times, DeepSeek R1 caused a huge sensation when it was released in January 2025. DeepSeek R1 is a reasoning model built on the DeepSeek V3 architecture, which was released in December 2024.

Although the focus of this article is on architectures released in 2025, the author believes it is reasonable to include DeepSeek V3, as it gained widespread attention and adoption with the release of DeepSeek R1 in 2025.

If you are interested in the specific training process of DeepSeek R1, you might find an article I wrote earlier this year on understanding reasoning LLMs helpful.

In this section, the author will focus on two key architectural techniques introduced in DeepSeek V3 that improve computational efficiency and differentiate it from many other LLMs:

Multi-Head Latent Attention (MLA)

Mixture-of-Experts (MoE)

1.1 Multi-Head Latent Attention (MLA)

Before discussing Multi-Head Latent Attention (MLA), let's briefly review the relevant background knowledge to understand its motivation. For this, we start with Grouped-Query Attention (GQA), which has recently become a new standard alternative to Multi-Head Attention (MHA) due to its superior computational and parameter efficiency.

The core idea of GQA can be briefly summarized as follows: unlike MHA, where each attention head has its own independent key and value, GQA groups multiple query heads to share the same set of key and value projections to reduce memory footprint.

For example, as shown in Figure 2, if there are 2 key-value groups and 4 attention heads, then attention heads 1 and 2 might share one set of keys and values, while attention heads 3 and 4 share another. This reduces the total amount of key-value computations, thereby lowering memory usage and increasing efficiency (with no significant impact on model performance according to ablation studies).

img

Figure 2: Comparison of MHA and GQA. Here, the group size is 2, meaning one key-value pair is shared by 2 queries.

Thus, the core idea of GQA is to reduce the number of key-value heads by having multiple query heads share keys and values. This not only (1) reduces the model's parameter count but also (2) decreases the memory bandwidth usage of key-value tensors during inference, as fewer keys and values need to be stored and retrieved from the KV cache.
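To make the idea concrete, here is a minimal, self-contained sketch of the key/value sharing in GQA (illustrative tensor shapes only, not any particular model's implementation; the causal mask is omitted for brevity):

import torch

# Toy dimensions: 4 query heads share 2 key/value heads (group size 2).
b, seq_len, head_dim = 1, 6, 8
n_q_heads, n_kv_heads = 4, 2
group_size = n_q_heads // n_kv_heads

queries = torch.randn(b, n_q_heads, seq_len, head_dim)
keys    = torch.randn(b, n_kv_heads, seq_len, head_dim)   # fewer key heads than in MHA
values  = torch.randn(b, n_kv_heads, seq_len, head_dim)   # fewer value heads than in MHA

# Only the n_kv_heads keys/values need to be stored in the KV cache.
# For the attention math, they are repeated to line up with the query heads:
keys_expanded   = keys.repeat_interleave(group_size, dim=1)    # (b, 4, seq_len, head_dim)
values_expanded = values.repeat_interleave(group_size, dim=1)  # (b, 4, seq_len, head_dim)

attn_scores  = queries @ keys_expanded.transpose(2, 3) / head_dim**0.5
attn_weights = torch.softmax(attn_scores, dim=-1)              # causal mask omitted for brevity
context      = attn_weights @ values_expanded                  # (b, 4, seq_len, head_dim)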

(If you are interested in the code implementation of GQA, you can check out my GPT-2 to Llama 3 conversion guide, which includes a version without KV caching; a variant with KV caching is also available.)

Although GQA is primarily a workaround to compensate for MHA's computational inefficiencies, ablation studies (e.g., in the original GQA paper and the Llama 2 paper) show that it performs comparably to standard MHA in terms of LLM modeling performance.

Now, Multi-Head Latent Attention (MLA) proposes a different memory-saving strategy that is particularly suitable for KV caches. Unlike GQA, which shares key and value heads, MLA compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache.

During inference, these compressed tensors are first restored to their original size before they can be used, as shown in Figure 3. While this adds extra matrix multiplication operations, it significantly reduces memory footprint.
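Below is a simplified sketch of this compress-then-decompress idea, assuming illustrative dimensions and layer names (it ignores details of the actual DeepSeek implementation, such as the decoupled RoPE keys and query compression):

import torch
import torch.nn as nn

# Compress K/V into a low-dimensional latent before caching, then project back up at attention time.
d_model, d_latent, n_heads, head_dim = 1024, 128, 8, 128   # illustrative sizes

W_down_kv = nn.Linear(d_model, d_latent, bias=False)             # compression
W_up_k    = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # decompression for keys
W_up_v    = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # decompression for values

x = torch.randn(1, 6, d_model)       # (batch, seq_len, d_model)

kv_latent = W_down_kv(x)             # (1, 6, 128) -- this small tensor is what goes into the KV cache

# During attention, the cached latent is restored to full-size keys and values:
keys   = W_up_k(kv_latent).view(1, 6, n_heads, head_dim).transpose(1, 2)   # (1, 8, 6, 128)
values = W_up_v(kv_latent).view(1, 6, n_heads, head_dim).transpose(1, 2)   # (1, 8, 6, 128)

The memory saving comes from caching kv_latent (dimension 128 here) rather than the full per-head keys and values (8 × 128 values per token in this toy setup).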

img

Figure 3: MLA (used in DeepSeek V3 and R1) compared to regular MHA.

(Incidentally, Queries are also compressed, but only during training, not inference.)

Furthermore, MLA is not new to DeepSeek V3; its DeepSeek-V2 predecessor also used (and even first introduced) this technique. The V2 paper also contains some interesting ablation studies, which might explain why the DeepSeek team chose MLA over GQA (see Figure 4).

img

Figure 4: Annotated table from the DeepSeek-V2 paper, https://arxiv.org/abs/2405.04434

As shown in Figure 4, GQA's performance appears to be inferior to MHA, while MLA slightly outperforms MHA in terms of model performance, which is likely why the DeepSeek team chose MLA over GQA. (It would be even more interesting to see a comparison of "per-token KV cache" savings between MLA and GQA!)

Before moving on to the next architectural component, let's summarize this section: MLA is a clever trick that both reduces KV cache memory usage and slightly outperforms MHA in terms of model performance.

1.2 Mixture of Experts (MoE)

Another significant architectural component in DeepSeek is its application of Mixture-of-Experts (MoE) layers. Although MoE is not original to DeepSeek, it has regained popularity this year, and many models we will discuss later also adopt this approach.

You may already be familiar with MoE, but a quick refresher might still be helpful. The core idea of MoE is to replace each FeedForward module in a Transformer block with multiple expert layers, where each expert layer is also a FeedForward neural network module. This means we replace a single feedforward layer with multiple feedforward layers, as shown in Figure 5.

img

Figure 5: Mixture-of-Experts (MoE) module in DeepSeek V3/R1 (right) versus LLM using standard feedforward blocks (left).

The feedforward neural network within a Transformer block (dark gray module in the figure) typically accounts for a large portion of the model's total parameters. (Note that Transformer blocks and feedforward neural networks are repeated many times in LLMs; in DeepSeek-V3, they are repeated 61 times.)

Therefore, replacing a single feedforward block with multiple feedforward blocks (as done in MoE settings) significantly increases the total number of model parameters. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a routing mechanism selects a small subset of experts for each token. (To save time, or rather, article length, the author will discuss routing strategies in more detail later.)

Since only a few experts are activated at a time, MoE modules are often referred to as sparse modules, in contrast to dense modules that always use the entire parameter set. However, MoE increases the capacity of LLMs through a large number of parameters, meaning they can absorb more knowledge during training. Sparsity maintains inference efficiency because we don't use all parameters simultaneously.

For example, each MoE module in DeepSeek-V3 has 256 experts, and the model's total parameter count reaches 671 billion. But during inference, only 9 experts (1 shared expert plus 8 experts selected by routing) are activated at a time. This means only 37 billion parameters are used per inference step, instead of the full 671 billion.

A notable feature of DeepSeek-V3's MoE design is the use of a "shared expert": an expert that is always activated for every token. This idea is not new; it was already introduced in the DeepSeekMoE (2024) and DeepSpeedMoE (2022) papers.

img

Figure 6: Annotated diagram from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models," https://arxiv.org/abs/2401.06066

In the DeepSpeedMoE paper, the benefits of shared experts were first pointed out; they found that shared experts improved overall model performance compared to not having them. This is likely because common or repetitive patterns do not need to be learned by multiple independent experts, thereby allowing these experts more room to learn more specialized patterns.
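To illustrate the routing idea described above (a shared expert that always fires, plus a small top-k subset of routed experts), here is a minimal sketch. It is not DeepSeek-V3's actual implementation; the real router additionally uses load-balancing techniques, and each expert is computed densely here purely for readability:

import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)

        def make_ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model)
            )

        self.experts = nn.ModuleList([make_ffn() for _ in range(num_experts)])
        self.shared_expert = make_ffn()                 # activated for every token

    def forward(self, x):                               # x: (batch, seq_len, d_model)
        scores = self.router(x)                         # (batch, seq_len, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(top_vals, dim=-1)       # renormalize over the selected experts

        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Combined routing weight of expert e per token (0 if the token was not routed to it)
            w_e = (weights * (top_idx == e)).sum(dim=-1, keepdim=True)
            if torch.any(w_e > 0):
                routed = routed + w_e * expert(x)       # naive dense compute; real kernels gather tokens
        return self.shared_expert(x) + routed

In a DeepSeek-V3-like configuration this would correspond to num_experts=256 and top_k=8, plus the always-on shared expert, matching the "1 shared + 8 routed" pattern mentioned above.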

1.3 DeepSeek Summary

In summary, DeepSeek-V3 is a large model with 671 billion parameters, whose performance surpassed other open-source models, including Llama 3 (405 billion parameters), at the time of its release. Despite its larger size, thanks to its Mixture-of-Experts (MoE) architecture, it is more efficient during inference, as only a small subset of parameters (just 37 billion) is activated per token.

Another key distinguishing feature is that DeepSeek-V3 uses Multi-Head Latent Attention (MLA) instead of Grouped-Query Attention (GQA). Both MLA and GQA are inference-efficient alternatives to standard Multi-Head Attention (MHA), especially when using KV caching. Although MLA is more complex to implement, research in the DeepSeek-V2 paper shows that it has better model performance than GQA.

2. OLMo 2

The OLMo series of models developed by the Allen Institute for AI has garnered significant attention due to its transparency in training data and code, as well as its relatively detailed technical reports.

While you might not see OLMo models topping any benchmarks or leaderboards, their designs are very clean, and more importantly, their transparency makes them an excellent blueprint for developing LLMs.

Although OLMo models are popular for their transparency, their performance is also quite good. In fact, upon their release in January (earlier than Llama 4, Gemma 3, and Qwen 3), OLMo 2 models had already reached the Pareto frontier of computational efficiency and performance, as shown in Figure 7.

img

Figure 7: Comparison of model benchmark performance (higher is better) vs. pre-training cost (FLOPs; lower is better) for different LLMs. This figure is taken from the OLMo 2 paper and annotated, https://arxiv.org/abs/2501.00656

As mentioned earlier in this article, the author aims to focus only on the architectural details of LLMs (rather than training or data) to keep the article concise. So, what interesting architectural design choices are there in OLMo 2? It mainly boils down to normalization: the placement of RMSNorm layers and the addition of QK-Norm, which the author will discuss below.

It's worth noting that OLMo 2 still uses traditional Multi-Head Attention (MHA), not MLA or GQA.

2.1 Normalization Layer Placement

Overall, OLMo 2 largely follows the architecture of the original GPT models, similar to other contemporary LLMs. However, there are some notable deviations. Let's start with the normalization layers.

Like Llama, Gemma, and most other large language models, OLMo 2 also switched from LayerNorm to RMSNorm.

But since RMSNorm is already "old hat" (it's essentially a simplified version of LayerNorm with fewer trainable parameters), the author will skip the discussion of RMSNorm vs. LayerNorm. (Interested readers can find a code implementation of RMSNorm in my GPT-2 to Llama conversion guide.)

However, the placement of the RMSNorm layers is worth discussing. The original Transformer model (from the "Attention is all you need" paper) placed both normalization layers after the attention module and the feedforward module, respectively, in the Transformer block.

This is also known as Post-Normalization (Post-LN or Post-Norm).

GPT models and many subsequent LLMs, however, place the normalization layers before the attention module and the feedforward module, which is called Pre-Normalization (Pre-LN or Pre-Norm). A comparison of Post-Normalization and Pre-Normalization is shown in the figure.

img

Figure 8: Comparison of Post-Normalization, Pre-Normalization, and OLMo 2's Post-Normalization variant.

In 2020, research by Xiong et al. showed that Pre-Normalization (Pre-LN) yields more desirable gradient behavior during initialization. Furthermore, the researchers noted that Pre-Normalization even performs well without careful tuning of learning rate warm-up, whereas learning rate warm-up is a crucial tool for Post-Normalization (Post-LN).

Now, the reason I bring this up is that OLMo 2 uses a form of Post-Normalization (but with RMSNorm instead of LayerNorm, so I refer to it as Post-Norm). In OLMo 2, the normalization layers are placed after the attention and feedforward layers, rather than before, as shown in the figure above. However, note that unlike the original Transformer architecture, these normalization layers are still located within the residual layers (skip connections).
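Expressed as a schematic forward pass (with placeholder attention, feedforward, and norm modules, not OLMo 2's actual code), the two placements look like this:

# GPT-2 / Llama 3-style Pre-Norm: normalize *before* each sub-layer
def pre_norm_block(x, attn, ff, norm1, norm2):
    x = x + attn(norm1(x))
    x = x + ff(norm2(x))
    return x

# OLMo 2-style Post-Norm: normalize each sub-layer's *output*,
# but still inside the residual connection (only the shortcut bypasses the norm)
def olmo2_post_norm_block(x, attn, ff, norm1, norm2):
    x = x + norm1(attn(x))
    x = x + norm2(ff(x))
    return x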

So, why did they change the placement of the normalization layers? The reason is that it helps with training stability, as shown in the figure below.

img

Figure 9: Graph showing training stability comparison between Pre-Normalization (as in GPT-2, Llama 3, and many other models) and OLMo 2's Post-Normalization variant.

Unfortunately, this figure shows the results of reordering combined with QK-Norm, which is a separate mechanism. Therefore, it is difficult to judge how much the normalization layer reordering itself contributes.

2.2 QK-Norm

Since QK-Norm was mentioned in the previous section, and other LLMs we will discuss later, such as Gemma 2 and Gemma 3, also use it, let's briefly discuss what it is.

QK-Norm is essentially another RMSNorm layer. It is placed inside the Multi-Head Attention (MHA) module and applied to queries (q) and keys (k) before applying Rotary Positional Embeddings (RoPE). To illustrate this, here is a snippet of the Grouped-Query Attention (GQA) layer I wrote for Qwen3 from scratch (the application of QK-Norm in GQA is similar to MHA in OLMo):

class GroupedQueryAttention(nn.Module):
    def __init__(
        self, d_in, num_heads, num_kv_groups,
        head_dim=None, qk_norm=False, dtype=None
    ):
        # ...
        if qk_norm:
            self.q_norm = RMSNorm(head_dim, eps=1e-6)
            self.k_norm = RMSNorm(head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin):
        b, num_tokens, _ = x.shape

        # Apply projections
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        # ...

        # Optional normalization
        if self.q_norm:
            queries = self.q_norm(queries)
        if self.k_norm:
            keys = self.k_norm(keys)

        # Apply RoPE
        queries = apply_rope(queries, cos, sin)
        keys = apply_rope(keys, cos, sin)

        # Expand K and V to match the number of heads
        keys = keys.repeat_interleave(self.group_size, dim=1)
        values = values.repeat_interleave(self.group_size, dim=1)

        # Attention calculation
        attn_scores = queries @ keys.transpose(2, 3)
        # ...
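For completeness, the q_norm and k_norm modules above are ordinary RMSNorm layers applied over the head dimension. A minimal RMSNorm sketch might look like the following (details such as dtype casting may differ from the actual Qwen3 or OLMo 2 implementations):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # learnable scale, no bias or mean-centering

    def forward(self, x):
        # Scale each feature vector by the reciprocal of its root-mean-square
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)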

As mentioned, QK-Norm works in conjunction with Post-Normalization to jointly stabilize the training process. It's important to note that QK-Norm was not invented by OLMo 2; the concept dates back to the 2023 paper "Scaling Vision Transformers to 22 Billion Parameters."

2.3 OLMo 2 Summary

In short, OLMo 2's notable architectural design decisions primarily concern the placement of RMSNorm: RMSNorm is placed after rather than before the attention module and feedforward module (a variant of Post-Normalization), and RMSNorm is added to queries and keys within the attention mechanism (QK-Norm), both of which contribute to stabilizing training loss.

Below is a side-by-side comparison of OLMo 2 and Llama 3; it can be seen that apart from OLMo 2 still using traditional MHA instead of GQA, their architectures are relatively similar. (However, the OLMo 2 team released a 32-billion parameter variant using GQA three months later.)

img

Figure 10: Llama 3 and OLMo 2 architecture comparison.

3. Gemma 3

Google's Gemma models have consistently performed well, and in my opinion, they have been somewhat underestimated compared to other popular models like the Llama series.

Gemma's uniqueness lies in its relatively large vocabulary (to better support multiple languages) and a stronger focus on the 27-billion parameter model size (rather than 8 billion or 70 billion). However, note that Gemma 3 also comes in smaller sizes: 1 billion, 4 billion, and 12 billion parameters.

The 27-billion parameter size strikes a very good balance: it's much more powerful than 8-billion parameter models but less resource-intensive than 70-billion parameter models, and it runs smoothly on my Mac Mini. So, what other notable features does Gemma 3 have? As discussed earlier, other models like DeepSeek-V3/R1 adopt the Mixture-of-Experts (MoE) architecture to reduce memory required for inference given a fixed model size (the MoE approach is also used by several other models we will discuss later).

Gemma 3 employs a different "trick" to reduce computational cost: sliding window attention.

3.1 Sliding Window Attention

With the help of sliding window attention (originally introduced by the LongFormer paper in 2020 and already adopted by Gemma 2), the Gemma 3 team significantly reduced the KV cache's memory requirements, as shown in the figure below.

img

Figure 11: Annotated diagram from the Gemma 3 paper (https://arxiv.org/abs/2503.19786), showing KV cache memory savings through sliding window attention.

So, what is sliding window attention? If we view regular self-attention as a global attention mechanism because each sequence element can access all other sequence elements, then sliding window attention can be seen as a local attention mechanism because it restricts the context size around the current query position. This is illustrated in the figure below.

img

Figure 12: Comparison of regular attention (left) and sliding window attention (right).

Note that sliding window attention can be used in conjunction with Multi-Head Attention (MHA) and Grouped-Query Attention (GQA); Gemma 3 uses Grouped-Query Attention.

As mentioned above, sliding window attention is also referred to as local attention because the local window surrounds and moves with the current query position. In contrast, regular attention is global because each token can access all other tokens.

Now, as briefly mentioned earlier, Gemma 2's predecessor architecture also used sliding window attention. What's different about Gemma 3 is that they adjusted the ratio between global (regular) attention and local (sliding) attention.

For example, Gemma 2 used a mixed attention mechanism, combining sliding window (local) attention and global attention in a 1:1 ratio. Each token could attend to a window of 4k tokens around it.

Gemma 2 used sliding window attention every other layer, while Gemma 3 now employs a 5:1 ratio, meaning there is only 1 full attention layer for every 5 sliding window (local) attention layers; furthermore, the sliding window size was reduced from 4096 (Gemma 2) to only 1024 (Gemma 3). This shifts the model's focus towards more efficient local computation.
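As a rough illustration of what this means in code, here is a sketch of a sliding-window causal mask and the 5:1 layer schedule (the mask convention and the exact offset of the global layers are assumptions for illustration, not Gemma 3's actual implementation):

import torch

def sliding_window_causal_mask(seq_len, window_size):
    # True marks positions that must NOT be attended to.
    # Query position i may attend to key positions j with i - window_size < j <= i.
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)     # dist[i, j] = i - j
    return (dist < 0) | (dist >= window_size)      # mask future tokens and tokens too far in the past

def is_global_layer(layer_idx):
    # 5 local (sliding-window) layers for every 1 global layer -> every 6th layer is global
    return (layer_idx + 1) % 6 == 0

mask_local  = sliding_window_causal_mask(seq_len=8, window_size=4)  # local attention (window of 4)
mask_global = sliding_window_causal_mask(seq_len=8, window_size=8)  # window >= seq_len: plain causal attention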

According to their ablation studies, sliding window attention has a negligible impact on model performance, as shown in the figure below.

img

Figure 13: Annotated diagram from the Gemma 3 paper (https://arxiv.org/abs/2503.19786), showing that sliding window attention has almost no impact on the perplexity of LLM generated output.

Although sliding window attention is Gemma 3's most prominent architectural feature, I want to briefly discuss the placement of normalization layers, building on the previous section on OLMo 2.

3.2 Normalization Layer Placement in Gemma 3

A small but interesting detail is that Gemma 3 uses RMSNorm both before and after its grouped query attention module, meaning it employs both Pre-Normalization (Pre-Norm) and Post-Normalization (Post-Norm) settings simultaneously.

This is similar to Gemma 2 but still worth emphasizing because it differs from (1) the Post-Normalization used in the original Transformer ("Attention is all you need"), (2) the Pre-Normalization popularized by GPT-2 and subsequently used in many other architectures, and (3) the Post-Normalization variant we saw earlier in OLMo 2.
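In the same schematic style as the OLMo 2 sketch above (placeholder modules, not Gemma 3's actual code), this "sandwich" placement looks like:

# Gemma 3-style block: RMSNorm both before and after each sub-layer,
# with the residual shortcut bypassing all of the norms
def gemma3_style_block(x, attn, ff, norm1, norm2, norm3, norm4):
    x = x + norm2(attn(norm1(x)))   # Pre-Norm + Post-Norm around attention
    x = x + norm4(ff(norm3(x)))     # Pre-Norm + Post-Norm around the feedforward
    return x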

img

Figure 14: OLMo 2 and Gemma 3 architecture comparison; note the additional normalization layers in Gemma 3.

The author believes this normalization layer placement method is relatively intuitive because it combines the advantages of both Pre-Normalization and Post-Normalization. In the author's opinion, a little more normalization never hurts. In the worst case, if the extra normalization is redundant, it would only lead to some inefficiencies. But in practice, due to the relatively low computational cost of RMSNorm, this should not have any significant impact.

3.3 Gemma 3 Summary

In conclusion, Gemma 3 is a high-performing open-source LLM that, in the author's opinion, is somewhat underestimated in the open-source community. Its most interesting highlight is the improved efficiency through the use of sliding window attention (it would be interesting to combine it with MoE in the future).

Furthermore, Gemma 3 uniquely places its normalization layers, with RMSNorm layers both before and after the attention module and feedforward module.

3.4 Gemma 3n

A few months after Gemma 3 was released, Google unveiled Gemma 3n, a Gemma 3 model optimized for efficiency on small devices, with the goal of running on mobile phones.

One of the changes Gemma 3n implements for higher efficiency is the so-called "Per-Layer Embedding (PLE)" parameter layer. The core idea is to keep only a subset of the model parameters in GPU memory. Token-layer specific embeddings, such as embeddings for text, audio, and visual modalities, are streamed on demand from the CPU or Solid State Drive (SSD).

The figure below shows the memory saving effect of PLE, listing 5.44 billion parameters for the standard Gemma 3 model. This likely refers to the 4-billion parameter variant of Gemma 3.

img

Figure 15: Annotated diagram from the Google Gemma 3n blog (https://developers.googleblog.com/en/introducing-gemma-3n/), illustrating PLE memory savings.

The difference between 5.44 billion and 4 billion parameters is due to Google's interesting way of reporting LLM parameter counts. They often exclude embedding parameters to make a model seem smaller, except when, as in this case, it is convenient to include them to make the model seem larger. This practice is not unique to Google; it has become common across the field.

Another interesting trick is the MatFormer concept (short for Matryoshka Transformer). For example, Gemma 3n uses a shared LLM (Transformer) architecture that can be sliced into smaller, independently usable models. Each slice is trained to operate independently, so during inference, we only run the required portion (rather than the entire large model).

4. Mistral Small 3.1

Mistral Small 3.1 24B was released in March, shortly after Gemma 3, and it's noteworthy that it outperformed Gemma 3 27B on several benchmarks (except math) and was faster.

The lower inference latency of Mistral Small 3.1 may be attributed to its customized tokenizer, as well as a smaller KV cache and fewer layers. Other than that, it employs a standard architecture, as shown in the figure below.

img

Figure 16: Gemma 3 27B and Mistral 3.1 Small 24B architecture comparison.

Interestingly, earlier Mistral models utilized sliding window attention, but this technique seems to have been abandoned in Mistral Small 3.1. Since Mistral uses regular grouped-query attention rather than the sliding-window variant used in Gemma 3, it can benefit from more optimized inference code (e.g., FlashAttention), which may yield additional savings in inference compute. For example, the author speculates that while sliding window attention reduces memory usage, it doesn't necessarily reduce inference latency, which is the focus of Mistral Small 3.1.

5. Llama 4

The detailed introduction to Mixture-of-Experts (MoE) earlier in this article comes in handy again. Llama 4 also adopts the MoE approach, while otherwise following a relatively standard architecture, very similar to DeepSeek-V3, as shown in the figure below. (Llama 4 includes native multimodal support, similar to models like Gemma and Mistral. However, since this article focuses on language modeling, we will only consider the text model part.)

img

Figure 17: DeepSeek V3 (671 billion parameters) and Llama 4 Maverick (400 billion parameters) architecture comparison.

While Llama 4 Maverick's overall architecture is very similar to DeepSeek-V3, there are still some interesting differences worth highlighting.

First, Llama 4, like its predecessors, uses Grouped-Query Attention (GQA), while DeepSeek-V3 employs the Multi-Head Latent Attention (MLA) we discussed at the beginning of this article. Both DeepSeek-V3 and Llama 4 Maverick are very large architectures, with DeepSeek-V3 having approximately 68% more total parameters than Llama 4 Maverick. Moreover, DeepSeek-V3 has 37 billion active parameters, more than double Llama 4 Maverick's 17 billion active parameters.

Llama 4 Maverick uses a more classical MoE setup with fewer but larger experts (2 active experts, each with a hidden layer size of 8192), while DeepSeek-V3 has 9 active experts, each with a hidden layer size of 2048. Additionally, DeepSeek uses MoE layers in every Transformer block (except the first 3), while Llama 4 alternates between MoE and dense modules every other Transformer block.
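A small sketch of the two placement schedules described above (the block types are placeholders, and the exact offsets are assumptions for illustration):

def build_ffn_schedule(num_layers, style):
    # Returns, for each Transformer block, whether its feedforward part is dense or MoE.
    schedule = []
    for i in range(num_layers):
        if style == "deepseek_v3":
            schedule.append("moe" if i >= 3 else "dense")      # MoE in every block except the first 3
        elif style == "llama4_maverick":
            schedule.append("moe" if i % 2 == 1 else "dense")  # MoE in every other block
    return schedule

print(build_ffn_schedule(8, "deepseek_v3"))      # ['dense', 'dense', 'dense', 'moe', 'moe', ...]
print(build_ffn_schedule(8, "llama4_maverick"))  # ['dense', 'moe', 'dense', 'moe', ...]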

Given the numerous subtle architectural differences, it is challenging to accurately determine their impact on the final model performance. However, the main conclusion is that the Mixture-of-Experts (MoE) architecture has significantly increased in popularity in 2025.

6. Qwen3

The Tongyi Qianwen (Qwen) team has been committed to providing high-quality open-source LLMs. I remember assisting in co-mentoring the LLM Efficiency Challenge at NeurIPS 2023, where all the best winning solutions were based on Qwen2.

Now, Qwen3 is another popular model series, topping leaderboards in its respective scale categories. It has six dense models: 0.6B, 1.7B, 4B, 8B, 14B, and 32B. Additionally, there are two MoE models: 30B-A3B and 235B-A22B.

(Incidentally, the lack of a space in "Qwen3" is not a typo; the author simply tried to preserve the original spelling chosen by the Tongyi Qianwen developers.)

6.1 Qwen3 (Dense Models)

Let's first discuss the dense model architecture. As of writing this article, the 0.6B model is likely the smallest open-source model of the current generation. In my personal experience, it performs exceptionally well at such a small size. If you plan to run it locally, it has excellent tokens-per-second throughput and low memory footprint. More importantly, due to its small size, it is also easy to train locally (for educational purposes).

Therefore, Qwen3 0.6B has largely replaced Llama 3 1B in most cases. A comparison of these two architectures is shown in the figure below.

img

Figure 18: Qwen3 0.6B and Llama 3 1B architecture comparison; note that Qwen3 is a deeper architecture with more layers, while Llama 3 is a wider architecture with more attention heads.

If you are interested in a readable Qwen3 implementation that does not rely on external third-party LLM libraries, I recently implemented a Qwen3 from scratch (pure PyTorch).

The computational performance data shown in the figure above is based on my from-scratch PyTorch implementation running on an A100 GPU. It can be seen that Qwen3 uses less memory because its overall architecture is smaller, with a smaller hidden dimension and fewer attention heads. However, it uses more Transformer blocks than Llama 3, resulting in slower execution (lower tokens-per-second generation speed).

6.2 Qwen3 (MoE)

As mentioned earlier, Qwen3 also offers two MoE versions: 30B-A3B and 235B-A22B. Why do some architectures, like Qwen3, offer both regular (dense) and MoE (sparse) versions?

As discussed at the beginning of this article, MoE variants help reduce the inference cost of large foundation models. Providing both dense and MoE versions offers flexibility based on the user's goals and constraints.

Dense models are generally easier to fine-tune, deploy, and optimize on various hardware.

On the other hand, MoE models are optimized for large-scale inference. For example, given a fixed inference budget, they can achieve higher overall model capacity (i.e., absorb more knowledge during training due to the larger model) without proportionally increasing inference cost.

By releasing both types, the Qwen3 series can support a wider range of use cases: dense models for robustness, simplicity, and fine-tuning, while MoE models are for efficient large-scale serving.

To summarize this section, let's compare DeepSeek-V3 and Qwen3 235B-A22B (note that A22B stands for "22 billion active parameters"); the former has noticeably more active parameters (37 billion vs. 22 billion).

img

Figure 19: DeepSeek-V3 and Qwen3 235B-A22B architecture comparison.

As shown in the figure above, the DeepSeek-V3 and Qwen3 235B-A22B architectures are strikingly similar. However, it is worth noting that the Qwen3 model abandoned the use of shared experts (earlier Qwen models, such as Qwen2.5-MoE, did use shared experts).

Unfortunately, the Qwen3 team did not disclose their reasons for abandoning shared experts. If the author had to guess, perhaps when they increased the number of experts from 2 (in Qwen2.5-MoE) to 8 (in Qwen3), shared experts were simply not necessary for the training stability of their setup. Thus, by using only 8 instead of 8+1 experts, they saved additional computational/memory costs. (However, this does not explain why DeepSeek-V3 still retains shared experts.)

7. SmolLM3

SmolLM3 may not be as popular as other LLMs mentioned in this article, but the author believes it is still an interesting model to include because it offers very good model performance at a relatively small size of 3 billion parameters, positioning it between the 1.7-billion and 4-billion parameter Qwen3 models, as shown in the figure below.

Furthermore, it shares many training details, similar to OLMo, which is rare and always appreciated!

img

Figure 20: Annotated diagram from the SmolLM3 release announcement (https://huggingface.co/blog/smollm3) comparing SmolLM3 win rates with Qwen3 1.7B and 4B, and Llama 3 3B and Gemma 3 4B.

As shown in the architecture comparison below, the SmolLM3 architecture looks quite standard. However, its most interesting aspect is perhaps its use of No Positional Embeddings (NoPE).

img

Figure 21: Side-by-side architecture comparison of Qwen3 4B and SmolLM3 3B.

7.1 No Positional Embeddings (NoPE)

In LLM applications, No Positional Embeddings (NoPE) is an older idea, traceable to a 2023 paper ("The Impact of Positional Encoding on Length Generalization in Transformers"); it removes explicit injection of positional information (such as the classical absolute positional embedding layers of early GPT architectures or today's Rotary Positional Embeddings, RoPE).

In Transformer-based Large Language Models (LLMs), positional encoding is typically required because the self-attention mechanism treats tokens independently of their order. Absolute positional embeddings solve this with an additional embedding layer that adds positional information to the token embeddings.

img

Figure 22: Modified diagram from the author's book "Build a Large Language Model (From Scratch)" (https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167), illustrating absolute positional embeddings.

On the other hand, RoPE's solution is to rotate the query and key vectors relative to the token position.

However, in NoPE (No Positional Embeddings) layers, no such positional signal is added at all: no fixed, no learned, no relative. Nothing.

Even without positional embeddings, the model still knows which tokens come before others, thanks to the causal attention mask. This mask prevents each token from attending to future tokens. Thus, a token at position t can only see tokens at positions less than or equal to t, which preserves the autoregressive ordering.
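A minimal sketch makes this concrete: the attention below uses only a causal mask, with no RoPE and no positional embeddings added anywhere (illustrative shapes only):

import torch

b, n_heads, seq_len, head_dim = 1, 4, 6, 8
q = torch.randn(b, n_heads, seq_len, head_dim)   # note: no positional embeddings were added to the inputs
k = torch.randn(b, n_heads, seq_len, head_dim)
v = torch.randn(b, n_heads, seq_len, head_dim)
# ... and no apply_rope(...) is called on q and k

causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = (q @ k.transpose(2, 3)) / head_dim**0.5
scores = scores.masked_fill(causal_mask, float("-inf"))   # a token at position t only sees positions <= t
context = torch.softmax(scores, dim=-1) @ v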

Therefore, although no explicit positional information is added, a sense of direction is still implicitly present in the model's structure, which LLMs can learn to leverage during regular gradient-descent based training if it benefits the optimization objective. (For more information, please refer to the theorems in the NoPE paper.)

Overall, the NoPE paper not only found that injecting positional information is not necessary but also that NoPE has better length generalization, meaning LLM response performance degrades less as sequence length increases, as shown in the figure below.

img

Figure 23: Annotated diagram from the NoPE paper (https://arxiv.org/abs/2305.19466), showing NoPE has better length generalization.

Note that the experiments above were conducted using relatively small GPT-style models with approximately 100 million parameters and relatively small context sizes. To what extent these findings generalize to larger contemporary LLMs is currently unclear.

This is likely why the SmolLM3 team "applied" NoPE (or rather, omitted RoPE) only in every fourth layer rather than throughout the model.
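A sketch of what "NoPE in every fourth layer" could look like in code (the exact indexing convention is an assumption for illustration; rope_fn stands in for a RoPE helper such as the apply_rope used in the earlier Qwen3 snippet):

def maybe_apply_rope(queries, keys, cos, sin, layer_idx, rope_fn):
    # NoPE layer: skip RoPE entirely, injecting no positional information
    if (layer_idx + 1) % 4 == 0:
        return queries, keys
    # All other layers: apply RoPE to queries and keys as usual
    return rope_fn(queries, cos, sin), rope_fn(keys, cos, sin)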

8. Kimi 2

Kimi 2 has recently caused a huge buzz in the AI community as it is a high-performing open-source model. According to benchmarks, it is comparable to top-tier closed-source models such as Google's Gemini, Anthropic's Claude, and OpenAI's ChatGPT.

A notable feature is its use of a variant of the relatively new Muon optimizer instead of AdamW. To the author's knowledge, this is the first time Muon has been used in a production model of this scale (previously it had only been shown to scale up to around 16 billion parameters). This results in an excellent training loss curve, which likely contributed to pushing this model to the top of the aforementioned benchmarks.

While people comment on the unusually smooth loss curve (as there are no spikes), the author doesn't think it's unusually smooth (e.g., see the OLMo 2 loss curve in the figure below; furthermore, the L2 norm of the gradients might be a better measure of training stability). However, how well its loss curve decays is indeed striking.

However, as mentioned in the introduction of this article, training methodology is another topic to be discussed later.

The model itself has 1 trillion parameters, which is truly impressive.

As of writing this article, it is likely the largest LLM of this generation (considering Llama 4 Behemoth has not yet been released, excluding closed-source LLMs, and Google's 1.6 trillion parameter Switch Transformer is an encoder-decoder architecture from another generation).

Kimi 2 also comes full circle, adopting the DeepSeek-V3 architecture we introduced at the beginning of this article, just scaled up, as shown in the figure below.

img

Figure 25: DeepSeek V3 and Kimi K2 architecture comparison.

As shown in the figure, Kimi 2 and DeepSeek V3 are basically identical, except that Kimi 2 uses more experts in its Mixture-of-Experts (MoE) module and fewer heads in its Multi-Head Latent Attention (MLA) module.

Kimi 2 did not appear out of nowhere. The earlier Kimi 1.5 model, discussed in the paper "Kimi k1.5: Scaling Reinforcement Learning with LLMs," was also impressive. However, it was unfortunately released on the same day (January 22) as the DeepSeek R1 model paper. Furthermore, to the author's knowledge, the Kimi 1.5 weights were never publicly shared.

Therefore, the Kimi K2 team likely learned these lessons and shared Kimi K2 as an open-source model before DeepSeek R2 was released. As of writing this article, Kimi K2 is the most impressive open-source model.

If you found this article helpful, don't forget to like it and show some love.

Author: Zhi Great

Reprints welcome; please cite the source.
