Kimi K2's Key Training Technique: QK-Clip!

Author: Su Jianlin | Original text: https://kexue.fm/archives/11126

Su Jianlin Reveals Kimi K2's Key Training Technique QK-Clip: Advancing Muon Further on the Scale-up Path!

Four months ago, we released Moonlight, validating the effectiveness of the Muon optimizer on a 16B MoE model. In Moonlight, we confirmed the necessity of adding Weight Decay to Muon and proposed a technique for transferring Adam hyperparameters by aligning Update RMS, which enabled Muon to be quickly applied to LLM training. However, when we tried to extend Muon to models with hundreds of billions of parameters, we encountered a new obstacle – MaxLogit explosion.

To solve this problem, we proposed a simple yet extremely effective new method, which we call “QK-Clip.” This method addresses the MaxLogit phenomenon from a fundamental perspective, without compromising model performance, and has become one of the key training techniques for our recently released trillion-parameter model, “Kimi K2.”

Problem Description

Let's briefly introduce the MaxLogit explosion phenomenon. Recalling the definition of Attention:

Attention(Q, K, V) = Softmax(QK^T)V

The scaling factor is omitted here, as it can always be absorbed into the definition of Q (or K). The "Logit" in "MaxLogit explosion" refers to the Attention matrix before Softmax, i.e., QK^T, and MaxLogit refers to the maximum value over all Logits, which we denote as:

MaxLogit = max_{i,j} (QK^T)_{i,j}

The max here is taken over all entries, including the batch dimension, so MaxLogit is a single scalar. MaxLogit explosion means that this value keeps increasing as training progresses, at a linear or even superlinear rate, showing no sign of leveling off for a considerably long time.
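To make the metric concrete, here is a minimal monitoring sketch (not the authors' code), assuming the raw per-head attention logits of a training batch are available as a tensor:

```python
import torch

def max_logit(logits: torch.Tensor) -> torch.Tensor:
    """logits: [batch, num_heads, q_len, k_len], attention scores before Softmax.
    Returns a single scalar: the max over the batch, all heads and all positions."""
    return logits.amax()

def max_logit_per_head(logits: torch.Tensor) -> torch.Tensor:
    """Per-head variant (used later for QK-Clip): max over batch and positions,
    keeping the head dimension. Returns a tensor of shape [num_heads]."""
    return logits.amax(dim=(0, 2, 3))
```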

[Figure: MaxLogit rising steadily over the course of training.]

MaxLogit is essentially an outlier metric, and its explosion indicates that outliers have grown beyond control. More concretely, writing q_i = x_i W_q and k_j = x_j W_k for the hidden states x_i, x_j feeding the Query and Key projections, we have:

|q_i k_j^T| = |x_i W_q W_k^T x_j^T| <= ||x_i|| ||x_j|| ||W_q||_2 ||W_k||_2

where ||W||_2 denotes the spectral norm. Since the hidden states are usually passed through RMSNorm, ||x_i|| and ||x_j|| stay bounded. MaxLogit explosion therefore implies that the spectral norm of W_q or W_k risks tending towards infinity, which is clearly not good news.

Since Softmax outputs stay bounded between 0 and 1 no matter how large the logits get, in lucky cases the phenomenon has no severe consequences: at worst an Attention Head is wasted. In worse scenarios, however, it can cause Grad Spikes or even training collapse, so it is safer to avoid MaxLogit explosion altogether.

Previous Attempts

In “Muon Sequel: Why Did We Choose to Try Muon?” we briefly analyzed that Weight Decay can, to some extent, prevent MaxLogit explosion. Thus, the probability of MaxLogit explosion in small models is very low. Even for a 16B model like Moonlight, MaxLogit would at most rise to 120 before automatically decreasing.

[Figure: Moonlight's MaxLogit automatically decreased.]

In other words, MaxLogit explosion mainly occurs in models with very large parameter counts: the larger the model, the more unstable factors there are in training, and the harder it is for Weight Decay to keep things under control. Increasing Weight Decay would strengthen this control, but at a significant cost to performance, so this route is not feasible. Another, more direct idea is to cap the Logit itself:

Logit_capped = softcap * tanh(Logit / softcap)

This logit soft-capping comes from Google's Gemma2. Because tanh is bounded, the capped Logit is guaranteed to be bounded. However, nothing guarantees that the Logit before capping stays bounded (we verified this ourselves), so soft-capping merely turns one problem into another without actually solving it.
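For concreteness, a one-function sketch of this soft-capping (the cap value is illustrative):

```python
import torch

def softcap_logits(logits: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # tanh is bounded in (-1, 1), so the capped logits lie in (-cap, cap);
    # the pre-cap logits, however, can still grow without bound.
    return cap * torch.tanh(logits / cap)
```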

Perhaps Google itself realized this, which is why the later Gemma3 dropped soft-capping and adopted "QK-Norm" instead:

QK-Norm: q <- RMSNorm(q), k <- RMSNorm(k), applied to each Head's Query and Key vectors before the dot product
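A minimal sketch of QK-Norm, assuming the per-head q/k tensors are materialized; learnable gain parameters are omitted:

```python
import torch

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    """q, k: [batch, num_heads, seq_len, head_dim]. RMS-normalize each head's
    query and key vectors before the dot product (gains omitted)."""
    q = q * torch.rsqrt(q.pow(2).mean(dim=-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(dim=-1, keepdim=True) + eps)
    return q, k
```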

QK-Norm is indeed a very effective way to suppress MaxLogit. However, it only suits MHA, GQA, and similar variants, not MLA, because QK-Norm requires the per-head Q and K to be materialized. For MLA, the K used during training differs from the K used during decoding (as shown below): in the decoding phase, the training-form per-head K is never fully materialized, so there is nothing to normalize. In other words, QK-Norm cannot be performed during decoding.

MLA Training: q_i k_j^T = q_i^c (k_j^c)^T + q_i^r (k_j^r)^T, with the per-head content key k_j^c materialized from the shared compressed latent c_j

MLA Decoding: the content-key projection is absorbed into the query, so attention is computed directly against the cached latent c_j and the shared RoPE key k_j^r; the per-head k_j^c is never materialized
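A toy numerical check of this point, under the author's simplifying assumptions (one head, the RoPE parts q^r/k^r and the scaling factor ignored, all names illustrative): the two views give identical scores, but only the training view ever forms the per-head key.

```python
import torch

torch.manual_seed(0)
d_model, d_latent, d_head = 64, 16, 8
x    = torch.randn(5, d_model, dtype=torch.double)          # hidden states of 5 tokens
W_c  = torch.randn(d_model, d_latent, dtype=torch.double)   # down-projection to the shared latent
W_q  = torch.randn(d_model, d_head, dtype=torch.double)     # query projection of one head
W_kc = torch.randn(d_latent, d_head, dtype=torch.double)    # per-head content-key projection

c = x @ W_c          # compressed KV latent: this is what gets cached
q = x @ W_q

# Training / prefill view: the per-head key k_c is materialized, so QK-Norm could act on it.
k_c = c @ W_kc
scores_train = q @ k_c.T

# Decoding view: W_kc is absorbed into the query and attention runs on the latent directly,
# so k_c never exists and there is nothing for QK-Norm to normalize.
scores_decode = (q @ W_kc.T) @ c.T

assert torch.allclose(scores_train, scores_decode)
```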

Why use MLA? We have discussed this in two articles: “The Road to Transformer Upgrades: 21, What Makes MLA Good? (Part 1)” and “The Road to Transformer Upgrades: 21, What Makes MLA Good? (Part 2)”, so we won't repeat it here. In short, we hope that MLA can also have a means similar to QK-Norm to guarantee MaxLogit suppression.

Direct Approach

During this period, we also tried some indirect methods, such as separately lowering the learning rate of Q and K, or separately increasing their Weight Decay, but none of them worked. The closest to success was Partial QK-Norm: MLA's QK^T splits into four parts, qr, qc, kr, and kc, of which the first three can still be materialized during decoding, so we applied RMSNorm to those three. The result was that MaxLogit could indeed be suppressed, but performance during long-context (length) activation was very poor.

After multiple failures, we couldn't help but reflect: all of our previous attempts were merely "indirect means" of suppressing MaxLogit. What would a truly direct means, one guaranteed to solve MaxLogit explosion, look like? From the inequality |q_i k_j^T| <= ||x_i|| ||x_j|| ||W_q||_2 ||W_k||_2, it is natural to think of clipping the singular values of W_q and W_k, but that is still essentially indirect, and singular value clipping is not cheap to compute.

However, post-hoc scaling of Q and K is clearly feasible in principle; the question is when to scale and by how much. Finally, one day, inspiration struck, and the author realized: MaxLogit itself is the most direct signal for triggering the scaling! Specifically, whenever MaxLogit exceeds the desired threshold max_logit_threshold, we simply multiply QK^T by γ = max_logit_threshold / MaxLogit, so the new MaxLogit cannot exceed max_logit_threshold. This multiplication can be absorbed into the weights of Q and K, each taking a factor of sqrt(γ), which leads to the initial version of QK-Clip:

W_q <- W_q * sqrt(γ_l), W_k <- W_k * sqrt(γ_l), where γ_l = min(1, max_logit_threshold / MaxLogit_l)

Here, MaxLogit_l is the MaxLogit of the l-th Attention layer, and W_q, W_k are that layer's Query and Key weights. In other words, after each optimizer update, we decide whether to clip the Q and K weights based on the magnitude of MaxLogit_l, with the clipping strength determined directly by the ratio of MaxLogit_l to the threshold max_logit_threshold; this directly guarantees that the clipped weights no longer suffer MaxLogit explosion. And since the operation acts directly on the weights, it does not affect the inference mode, making it naturally compatible with MLA.
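As a sketch (not the production implementation, which has to handle sharded weights), the initial per-layer rule could look like this, applied right after the optimizer step:

```python
import torch

@torch.no_grad()
def qk_clip_v0(W_q: torch.Tensor, W_k: torch.Tensor,
               max_logit: float, tau: float = 100.0) -> None:
    """Initial (per-layer) QK-Clip. max_logit is the monitored MaxLogit of this
    Attention layer on the current step; tau is max_logit_threshold."""
    gamma = min(1.0, tau / max_logit)
    # Scaling QK^T by gamma is split evenly between the two weight matrices.
    W_q.mul_(gamma ** 0.5)
    W_k.mul_(gamma ** 0.5)
```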

Fine-Tuning

The initial version of QK-Clip did successfully suppress MLA's MaxLogit, but after carefully observing the model's “internal workings,” we found that it suffered from “over-clipping.” After fixing this issue, we arrived at the final version of QK-Clip.

We know that regardless of the Attention variant, there are multiple Heads. Initially, we monitored only one MaxLogit metric per Attention layer, taking the Max of all Head Logits together, which led to QK-Clip clipping all Heads together. However, when we monitored each Head's MaxLogit separately, we found that in reality, only a few Heads per layer experienced MaxLogit explosion. If all Heads were clipped by the same ratio, most Heads would be “innocently affected,” which is the meaning of over-clipping.

Simply put, QK-Clip involves multiplying by a number less than 1. For a MaxLogit-exploding Head, this number is just enough to counteract the growth trend, but for other Heads, it's simply a reduction (they have no or very weak growth trends). Being continuously multiplied by a number less than 1 for a long time can easily lead to the values tending towards zero, which is a manifestation of “over-clipping.”

Therefore, to avoid such collateral damage, we should monitor MaxLogit and apply QK-Clip per Head. A further subtle detail hides here: the initial version distributed the clipping factor evenly between Q and K, each taking sqrt(γ). But MLA's QK^T has four parts, qr, qc, kr, and kc, and kr is shared by all Heads, so clipping kr would again cause collateral damage. For the (qr, kr) pair, therefore, we clip only qr, and by the full factor γ.

After the above adjustments, the final version of QK-Clip is:

For each Head h of layer l, let γ_l^h = min(1, max_logit_threshold / MaxLogit_l^h); then W_qc^h <- W_qc^h * sqrt(γ_l^h), W_kc^h <- W_kc^h * sqrt(γ_l^h), W_qr^h <- W_qr^h * γ_l^h, and the shared W_kr is left unchanged.

where l indexes the Attention layer and h indexes the Head.
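A minimal per-head sketch of this final rule for one MLA layer, assuming the weights are stored with an explicit leading head dimension (the real distributed implementation is more involved):

```python
import torch

@torch.no_grad()
def qk_clip(W_qc: torch.Tensor, W_kc: torch.Tensor, W_qr: torch.Tensor,
            max_logit_per_head: torch.Tensor, tau: float = 100.0) -> None:
    """W_qc, W_kc, W_qr: weights of shape [num_heads, ...]; the shared RoPE key
    weight W_kr is deliberately left untouched. max_logit_per_head: [num_heads]."""
    def scale(W: torch.Tensor, s: torch.Tensor) -> None:
        # broadcast the per-head factor over the remaining weight dimensions
        W.mul_(s.view((-1,) + (1,) * (W.dim() - 1)))

    gamma = (tau / max_logit_per_head).clamp(max=1.0)   # [num_heads]
    scale(W_qc, gamma.sqrt())   # (q_c, k_c) pair: each side takes sqrt(gamma)
    scale(W_kc, gamma.sqrt())
    scale(W_qr, gamma)          # (q_r, k_r) pair: k_r is shared, so q_r takes the full gamma
```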

Road to Expansion

At this point, the operational details of QK-Clip have been fully introduced. It directly uses our desired MaxLogit as a signal to make the smallest possible changes to the Q and K weights, achieving the effect of controlling the MaxLogit value within the specified threshold. Since this method directly modifies weights, it has better compatibility than QK-Norm and can be used with MLA.

In Kimi K2's training, we set the threshold max_logit_threshold to 100. The total training steps were approximately 220k steps. From roughly 7k steps onwards, Heads with MaxLogit exceeding max_logit_threshold began to appear. For a considerable period thereafter, Muon Update and QK-Clip were in a “tug-of-war,” where Muon aimed to increase MaxLogit and QK-Clip aimed to decrease it, maintaining a subtle balance. Interestingly, after 70k steps, the MaxLogit of all Heads actively decreased below 100, and QK-Clip stopped being active.

[Figure: After nearly 70k steps of tug-of-war between Muon and QK-Clip, MaxLogit decreased on its own.]

This indicates that under the action of Weight Decay, as long as we can stabilize the training, the model is likely to proactively reduce MaxLogit in the end. The role of QK-Clip is precisely to help the model pass through the initial training phase more smoothly. Some readers might worry that QK-Clip would degrade performance, but we conducted comparative experiments on small models. Even when MaxLogit was suppressed significantly (e.g., to 30) using QK-Clip, no substantial difference in performance was observed. Coupled with the phenomenon that the model proactively reduces MaxLogit in the mid-to-late stages, we have reason to believe that QK-Clip is lossless in terms of performance.

We also observed in experiments that Muon is generally more prone to MaxLogit explosion than Adam. So, to some extent, QK-Clip is an additional update rule specifically for Muon; it is one of Muon's “secret weapons” for ultra-large-scale training, which is also the meaning of this article's title. For this, we combined the Muon modifications proposed in our Moonlight with QK-Clip and named it “MuonClip”:

MuonClip: after each Muon update (with the Update-RMS alignment and Weight Decay introduced in Moonlight), apply the per-head QK-Clip step above to the attention weights.
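The sketch below shows how such a step could be wired together; it is not the Moonlight/K2 internals. It reuses the qk_clip sketch above, replaces Muon's Newton-Schulz iteration with an exact SVD-based msign for brevity, and assumes each attention layer exposes its weights and its monitored per-head MaxLogit:

```python
import torch

def msign(M: torch.Tensor) -> torch.Tensor:
    # Exact matrix sign via SVD; real Muon approximates this with Newton-Schulz iterations.
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

@torch.no_grad()
def muon_clip_step(matrix_params, momenta, attn_layers,
                   lr=2e-2, wd=0.1, beta=0.95, tau=100.0):
    # 1. Muon: momentum -> orthogonalized update, with the 0.2 * sqrt(max(n, m))
    #    Update-RMS alignment from Moonlight and decoupled Weight Decay.
    for W, M in zip(matrix_params, momenta):
        M.mul_(beta).add_(W.grad)
        update = msign(M) * 0.2 * max(W.shape) ** 0.5
        W.mul_(1 - lr * wd).add_(update, alpha=-lr)
    # 2. QK-Clip: clip the attention weights of any head whose MaxLogit exceeded tau.
    for layer in attn_layers:
        qk_clip(layer.W_qc, layer.W_kc, layer.W_qr,
                layer.max_logit_per_head, tau=tau)
```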

Note that "Muon is generally more prone to MaxLogit explosion than Adam" does not mean that only Muon suffers from it. DeepSeek-V3 was trained with Adam, and we also observed MaxLogit explosion in the open-source DeepSeek-V3 model. Likewise, Gemma2, which used logit soft-capping to keep logits bounded, was trained with Adam. So although we emphasize the value of QK-Clip for Muon, readers who insist on Adam can just as well combine it with Adam to form AdamClip.

Reasoning

Why is Muon more likely to cause MaxLogit explosion? In this section, the author attempts to provide a theoretical explanation for your reference.

From the inequality |q_i k_j^T| <= ||x_i|| ||x_j|| ||W_q||_2 ||W_k||_2, MaxLogit explosion usually signals that the spectral norm of W_q or W_k is starting to explode. In fact, the definition of the spectral norm itself contains a Max operation, so the two are intrinsically related. The question can therefore be rephrased as "why does Muon make spectral norms explode more easily?" And since the spectral norm equals the largest singular value, we can go one step further: "why does Muon tend to push singular values up more?"

So what distinguishes Muon from Adam? Muon's update is obtained by orthogonalizing the momentum (the msign operation), so all of its singular values are equal and its effective rank is full. A generic matrix, by contrast, usually has singular values of very different sizes, with the first few dominating; in terms of effective rank it is low-rank, and we assume Adam's update behaves like such a generic matrix. This assumption is not new; higher-order muP, for example, also assumes that Adam's update is low-rank.

In formulas, write the SVDs as W = U_W Σ_W V_W^T, ΔW_Muon = U_M Σ_M V_M^T, and ΔW_Adam = U_A Σ_A V_A^T. Then:

Σ_M is (a scalar multiple of) the identity matrix, whereas Σ_A is typically dominated by its first few singular values.
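A small numerical illustration of this contrast (the "Adam-like" update is simulated here as a low-rank matrix plus noise, which is an assumption, not measured Adam statistics):

```python
import torch

torch.manual_seed(0)
# An "Adam-like" update: a few dominant directions plus small noise (low effective rank).
G = sum(torch.randn(256, 1) @ torch.randn(1, 512) for _ in range(3))
G = G + 0.01 * torch.randn(256, 512)

U, S, Vh = torch.linalg.svd(G, full_matrices=False)
muon_update = U @ Vh                              # msign(G): Muon-style orthogonalized update

print(S[:5] / S.sum())                            # the first three values carry almost all the mass
print(torch.linalg.svdvals(muon_update)[:5])      # all (numerically) 1: a flat, full-rank spectrum
```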

Clearly, if a singular vector pair of W is closely aligned with one of the update's singular vector pairs, the corresponding singular values will accumulate step after step, pushing up W's singular values. Because Muon's update is full-rank, the probability that it "collides" with W's singular vectors is much higher than Adam's, so Muon is more likely to inflate the singular values of the parameters.

Of course, the above analysis is general and not limited to the weights of Q and K. In fact, in Moonlight, we have already verified that models trained with Muon generally have higher singular value entropy for their weights, which corroborates the above conjecture. The special nature of Attention Logit lies in its bilinear form QK^T. The multiplication of Q and K makes the risk of explosion greater and easily leads to a “worse getting worse” vicious cycle, ultimately contributing to MaxLogit explosion.
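For reference, one common way to compute the singular value entropy mentioned here (this sketch normalizes the singular values by their sum, as in the usual effective-rank definition; Moonlight's exact normalization may differ):

```python
import torch

def singular_value_entropy(W: torch.Tensor) -> torch.Tensor:
    s = torch.linalg.svdvals(W)
    p = s / s.sum()                                    # normalized singular value distribution
    return -(p * torch.log(p.clamp_min(1e-12))).sum()  # exp(entropy) gives the effective rank
```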

[Figure: Comparison of the singular value entropy (equivalent to effective rank) of model weights trained with Muon vs. Adam.]

Finally, “Muon's collision probability being much greater than Adam's” is relative. In reality, singular vectors colliding is still a low-probability event, which explains why only a small number of Attention Heads experience MaxLogit explosion.

Further Extensions

By now, the important computational and experimental details of QK-Clip have all been covered. One practical note: although the idea behind QK-Clip is simple, implementing it in distributed training takes some care, because per-Head clipping must be applied to parameter matrices that are sharded across devices (modifying a Muon implementation is not too hard; doing it on top of Adam is slightly more involved).

For the author and their team, QK-Clip is not just a specific method for solving the MaxLogit explosion problem; it also represents a “sudden realization” after repeatedly attempting to solve the problem through indirect means and failing: Since we have a clear metric, we should seek direct approaches that guarantee a solution, rather than wasting time on possibly, but not necessarily, effective ideas like lowering LR, increasing Weight Decay, or partial QK-Norm.

From a methodological perspective, QK-Clip's approach is not limited to solving MaxLogit explosion. It can be considered an “antibiotic” for many training instability issues. An antibiotic, in this context, might not be the most elegant solution, but it is often one of the most direct and effective. QK-Clip possesses this characteristic, and it can be generalized to “clip wherever instability occurs.”

For example, in some cases, models may experience a “MaxOutput explosion.” In such situations, we could consider clipping the weights W based on the MaxOutput value. Analogous to QK-Clip's per-Head operation, here we would also need to consider per-Dim operations, but per-Dim clipping would clearly be too costly, possibly requiring a compromise. In short, “clip wherever instability occurs” provides a unified approach, but the specific details depend on individual implementation.
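As a hedged illustration of "clip wherever instability occurs" (the names and the per-matrix granularity are assumptions, not the authors' recipe): if a layer's MaxOutput explodes, rescale its weight after the optimizer step, just as QK-Clip does for attention.

```python
import torch

@torch.no_grad()
def output_clip(W: torch.Tensor, max_output: float, tau: float) -> None:
    # Per-matrix version; the per-dim analogue of per-head QK-Clip would be finer
    # grained but correspondingly more expensive.
    W.mul_(min(1.0, tau / max_output))
```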

Finally, QK-Clip's approach of manually setting update rules based on certain signals was, to some extent, inspired by DeepSeek's Loss-Free load balancing strategy. We pay tribute to DeepSeek once again!

Article Summary

This article introduces QK-Clip, a new approach to the MaxLogit explosion problem. Unlike QK-Norm, it is a post-adjustment scheme for Q and K weights that does not change the model's forward computation, making it more widely applicable. It is an important stabilization strategy for the “Muon + MLA” combination in ultra-large-scale training, and one of the key technologies for our recently released trillion-parameter model, Kimi K2.

References

Moonlight

Muon Optimizer

Kimi K2

Muon Sequel: Why Did We Choose to Try Muon?

Gemma2

Gemma3

The Road to Transformer Upgrades: 21, What Makes MLA Good? (Part 1)

The Road to Transformer Upgrades: 21, What Makes MLA Good? (Part 2)

Singular Value Clipping

DeepSeek-V3

Effective Rank

Higher-order muP

Loss-Free


