Edited by | Tu Min
Produced by | CSDN (ID: CSDNnews)
It's not a top-tier conference paper, it was never posted to arXiv, and it can't even be called "officially published." Yet this pure blog post landed its author an offer from OpenAI, and the technique it describes is even rumored to be in use for GPT-5's training.
It sounds like a joke, but this researcher named Keller Jordan truly achieved it.
Keller Jordan's blog post is titled "Muon: An optimizer for hidden layers in neural networks" (https://kellerjordan.github.io/posts/muon/), in which he proposes a new optimizer called Muon.
Simply put, this article is neither in a paper format nor peer-reviewed, but it unexpectedly went viral due to its excellent practical results. Even more surprisingly, it became his stepping stone to OpenAI.
This news was first publicly shared on X by Keller Jordan's collaborator, Yuchen Jin, co-founder of AI cloud platform startup Hyperbolic Labs.
Yuchen Jin wrote:
"Many PhDs (including my past self) often fall into a misconception: believing that publishing papers at top conferences is the ultimate goal.
But “publication” does not equal “impact.”
Muon is just a blog post, yet it helped Keller get an OpenAI offer — and now he might be using it to train GPT-5.
I'm grateful he listed me as a second author. I just ran some experiments with NanoGPT, testing Muon's scalability on larger language models, and it completely crushed AdamW (the former king of optimizers)!
This taught me: whether in research or in life, what one should pursue is impact, not dazzling titles."
Top-Conference Papers ≠ Impact
Yuchen Jin's words sparked a lot of discussion.
After all, in academia, top conference papers are almost the "hard currency" for measuring a researcher's level and career potential. For PhDs especially, having their name on papers at conferences like NeurIPS, ICLR, CVPR, or ACL weighs heavily on whether they can enter a top-tier lab, land a faculty position, or secure funding.
Yet Keller leapfrogged all of that with an "informal" blog post, which upends conventional wisdom.
In fact, Keller Jordan publicly stated his position as early as February this year. He wrote on X that he didn't write a formal arXiv paper for Muon because he simply doesn't believe there is any necessary connection between "writing an optimizer paper with beautiful data and impressive charts" and "whether the optimizer is actually useful."
He prioritizes real-world training performance: "I only believe in practical benchmarks."
In his view, rather than investing a lot of time in papers with cumbersome formatting requirements and lengthy review cycles, it is better to focus on practical implementation and real-world effectiveness. An idea often takes months or longer to go from conception to publication; by the time it finally appears it may well be "outdated," and even a published paper can be buried under wave after wave of top-conference submissions, with few people actually reading or using it.
At a time when AI is accelerating the iteration speed across various fields, this perspective is not uncommon.
Former Google researcher Hieu Pham commented on this matter:
“Once upon a time, 'publishing a paper' equaled 'making an impact.' ResNet, Seq2Seq, Adam, Attention, Transformers, MoE… these classic achievements all appeared in paper form. But the real problem is that we haven't realized this era has passed. I myself have made similar mistakes. Fortunately, we still have a chance to choose again.”
He added that, concerning optimizers, “Tens of thousands of papers on optimizers have been published in the industry, but actual progress in SOTA (State-Of-The-Art) has only happened once — from Adam to AdamW. Other supposed improvements are basically just enhancements of these two, like FSDP. Therefore, we really should stop writing such papers. There's no need to cite AdamW; everyone knows where it came from.”
Yuchen Jin, also a PhD graduate, lamented the limitations of the academic ecosystem: “This is the lamentable part of academia. I once had a lab mate who couldn't publish papers at any top computer systems conferences, which made it difficult for him to secure a faculty position at a prestigious university. But eventually, he became a Vice President at Google.”
Unconventional “Hardcore Scholar”
Today, Keller Jordan's experience also offers new insights: it turns out that one can still enter top-tier labs without writing papers.
As Muon receives more and more attention from researchers, Keller has reiterated his view: "Hundreds of papers on optimizers have been published, but the so-called optimal performance (SOTA) has only improved a few times. So we can conclude: almost all optimizer papers are 'fake.' If you're going to write another 'fake optimizer' paper, please don't cite Muon. I don't need your citation."
While sharp, these remarks also reflect Keller Jordan's insistence on “practical results over academic embellishment” and his distinct personality.
Looking at Keller's resume, he is indeed an out-and-out “hardcore scholar.”
According to his LinkedIn profile, Keller attended UC Santa Cruz, focusing on machine learning and data science, then studied operating systems and computer security at UC Berkeley. In 2020, he earned dual degrees in Mathematics and Computer Science from UC San Diego with a 3.94 GPA (out of 4).
After graduation, he worked at Hive as a Machine Learning Engineer and later became a Visiting Researcher at the Complexity Science Hub Vienna, continuing to deepen his hands-on AI work.
By December 2024, shortly after releasing Muon, Keller successfully joined OpenAI, breaking conventional academic norms for entering top AI labs.
So, the question is: what is the magic of his informal blog post? Why did it attract so much attention without top conference endorsement or a paper format?
Next, let's take a look at Muon's actual performance and characteristics.
What makes Muon appealing compared to other optimizers?
Muon is an optimizer specifically designed for hidden layers in neural networks. It has currently set new training speed records for popular tasks like NanoGPT and CIFAR-10.
First, from practical tests, Muon has achieved excellent results:
· On CIFAR-10, the time to train from scratch to 94% accuracy dropped from 3.3 to 2.6 A100-seconds.
· On the NanoGPT "FineWeb" task, it reached a validation loss of 3.28 about 1.35× faster.
· The speed advantage held when scaling to 774M and 1.5B parameters.
· Training a 1.5-billion-parameter transformer with Muon reached GPT-2 XL level on HellaSwag in just 10 hours on an 8×H100 GPU cluster, versus 13.3 hours with AdamW.
The figures below compare Muon with other optimizers on the NanoGPT task in terms of sample efficiency and actual training time:
Figure 1 Optimizer comparison by sample efficiency
Figure 2 Optimizer comparison by wall-clock time
Below is a comparison of Muon and AdamW when training a 1.5 billion parameter language model:
Figure 3 Muon vs. AdamW in short-term training of 1.5 billion parameters
From a design perspective, Muon's core recipe is: first generate updates with SGD-momentum, then post-process each update matrix with a Newton-Schulz (NS) iteration, and finally apply the result to the model parameters.
Its implementation is also relatively simple:
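Below is a minimal PyTorch-style sketch of that recipe for a single 2-D weight matrix. The function names (newton_schulz, muon_step) are illustrative rather than the repository's actual API, and the quintic coefficients follow the values reported in the blog post, so treat this as a sketch, not the reference implementation:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately orthogonalize a 2-D update matrix with a quintic
    # Newton-Schulz iteration (coefficients as reported in the blog post).
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    X = X / (X.norm() + eps)          # scale so the spectral norm is at most ~1
    transposed = G.size(0) > G.size(1)
    if transposed:                    # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

def muon_step(weight, grad, momentum_buffer, lr=0.02, beta=0.95):
    # One Muon update: SGD-momentum, Newton-Schulz post-processing, then apply.
    momentum_buffer.mul_(beta).add_(grad)     # 1) accumulate momentum
    update = newton_schulz(momentum_buffer)   # 2) orthogonalize the update matrix
    weight.data.add_(update, alpha=-lr)       # 3) descend along the processed update
```

In the full optimizer this step is applied only to the 2-D weight matrices of hidden layers; embeddings, the output head, and scalar parameters are left to a conventional optimizer such as AdamW, as the blog post recommends.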
The role of the Newton-Schulz iteration is to approximately orthogonalize the update matrix, meaning it performs the following operation:
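Written out (a standard way to state this orthogonalization, where G is the momentum-smoothed update matrix and G = USVᵀ its singular value decomposition):

$$\mathrm{Ortho}(G) \;=\; \underset{O}{\arg\min}\,\lVert O - G \rVert_F \quad \text{s.t.}\ \ O^{\top}O = I \ \text{or}\ OO^{\top} = I, \qquad \text{equivalently}\ \ \mathrm{Ortho}(G) = UV^{\top}.$$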
In other words, the actual effect of the NS iteration is: to replace the update matrix originally obtained from SGD-momentum with the closest “semi-orthogonal matrix.”
Muon's PyTorch implementation is available on GitHub: https://github.com/KellerJordan/Muon
Final Thoughts
Keller's experience is not to deny the value of academia, but to remind us: in the rapidly evolving landscape of AI, the source of influence is quietly changing.
A blog post with excellent practical results might be more convincing than a perfectly formatted but hard-to-implement paper.
This is reminiscent of DeepSeek, a team that likewise rose to prominence by putting "technical effectiveness first": no splashy build-up, no elaborate packaging, just solid performance and stable results that let it break through intense large-model competition and quickly win community recognition.
For AI researchers today, perhaps it's time to rethink: what is truly worth investing time in? Is it a paper that “looks strong,” or a model that “runs fast enough”? The viral success of Keller and Muon might just be the beginning of this shift.