Xinzhiyuan Report
Editor: Aeneas Hao Kun
【Xinzhiyuan Guide】Originally intending to simply generate some synthetic data for practice, a Stanford Chinese team unexpectedly managed to outperform PyTorch's expert kernels! The AI-generated kernels written in pure CUDA-C by the Stanford Chinese team instantly amazed the industry and landed on the Hacker News hot list. The team even stated: "We didn't originally intend to release this result."
Just now, the Stanford HAI Chinese prodigy team has once again produced an astonishing masterpiece.
Their fast AI-generated kernels, written in pure CUDA-C, have surprisingly surpassed PyTorch!
In this process, without relying on libraries like CUTLASS and Triton or domain-specific languages (DSLs), the performance can approach or even exceed PyTorch's built-in, expert-optimized standard production-grade kernels in some cases.
The author team consists of familiar names – Anne Ouyang, Azalia Mirhoseini, and Percy Liang. Interestingly, they even frankly stated that they didn't originally intend to publish this result.
Upon release, this discovery ignited the tech community and has now climbed to the second spot on the Hacker News main list.
Speaking of which, this discovery contains many unexpected elements.
Originally, their goal was to generate synthetic data to train better kernel generation models, and the design for synthetic data generation was quite simple.
However, something unexpected happened: the synthetic data generation itself, intended only for testing, began to produce highly excellent kernels, even surpassing human-expert-optimized PyTorch baselines, and it leveraged advanced optimizations and hardware features.
Prior to this, it was a very challenging task.
Consequently, the researchers decided to write a blog post ahead of schedule to share their findings.
In summary, the highlights of the research are as follows:
Matrix Multiplication (Matmul, FP32): Achieved 101.3% of PyTorch FP32 torch.matmul performance
2D Convolution (Conv2D, FP32): Achieved 179.9% of PyTorch FP32 torch.nn.Conv2D performance
Softmax (FP32): Achieved 111.8% of PyTorch FP32 torch.softmax performance
Layer Normalization (LayerNorm, FP32): Achieved 484.4% of PyTorch FP32 torch.nn.LayerNorm performance
2D Convolution + ReLU + Max Pooling (Conv2D + ReLU + MaxPool, FP32): Achieved 290.1% of PyTorch FP32 reference implementation performance, and 189.0% of PyTorch FP32 torch.compile() reference implementation performance
The above results were benchmarked on an Nvidia L40S GPU. Performance percentage is defined as reference time divided by generated kernel time.
Netizens: Forcing LLM inference is really interesting
On Hacker News, netizens also engaged in a lively discussion about this.
For example, why it's easier to achieve performance improvements with FP32 kernels than with PyTorch, the reasons are quite interesting.
If AI can truly achieve more optimized kernels at lower costs, the potential is indeed enormous.
What's most stunning is that whether it's Google's recent AlphaEvolve or o3 discovering a zero-day vulnerability in the Linux kernel, they all remind us that—
Gemini Pro 2.5 and o3 have reached a new level of capability, and ideas that previously failed with other models are now suddenly working.
It can be said that we have reached a point where LLMs can iterate and test at a much faster rate than humans, and the brute force of information combination, progress, and intelligent application seems to be succeeding!
Next, let's look at the specific content from the Stanford researchers' blog.
Full Blog Post
Methodology
The researchers adopted the task setup of KernelBench (an AI-based kernel generation benchmark they released in December 2024).
Specifically, given a piece of torch code, the LLM writes custom kernels to replace the original torch operators, aiming for acceleration.
According to KernelBench's initial design, the reference code defaults to FP32 precision; lower-precision solutions are also allowed under a given tolerance threshold (1e-02).
Furthermore, due to the abundance of optimization techniques for specific scales, each problem in KernelBench has specific input sizes set.
Therefore, this benchmark aims to find the fastest kernel for a specific problem size, rather than a single high-speed kernel applicable to any problem size.
Moreover, the researchers run both the torch reference code and the generated code simultaneously, and verify its correctness by comparing whether the numerical outputs are consistent under various random inputs.
Currently, for the problem of optimizing kernels, the most common method for scaling computational resources during testing in the industry is sequential revision.
This is a multi-round iterative loop: the model first makes incremental modifications to the kernel, then checks its correctness and performance, and then tries again based on the results.
That is, either fix problematic kernels or further improve the performance of existing kernels.
This iterative process is very intuitive and easy to implement. The model fixes broken kernels, fine-tunes available ones, and gradually optimizes them into better-performing versions.
The main limitation of this method lies in the lack of diversity in optimization ideas.
Sequential loops often fall into local optima, such as repeatedly trying the same type of transformation, or endlessly tweaking along optimization paths that lack potential.
The result is inefficient use of computational resources during testing, and it makes it difficult for the model to generate fundamentally innovative optimization ideas.
To address this problem, the researchers introduced two key changes:
Leveraging natural language to reason about optimization ideas
Instead of directly generating new kernels at each step, they generate optimization ideas in natural language conditioned on previously tried ideas, and then materialize these ideas into new code variants.
Branching out at each optimization step
Instead of improving only one candidate solution at each step, they branch out, allowing each idea to spawn multiple implementation versions, with the best-performing kernel serving as the seed for the next round of optimization.
(The researchers also maintain a library of high-performing existing kernels to provide seeds).
This approach unlocks large-scale parallel processing capabilities, enabling them to explore fundamentally different optimization directions in each round, avoiding getting stuck in narrow optimization paths.
As a result, this test-time loop is no longer just "dialoguing" with the compiler like sequential revision, but is closer to a structured exploratory search.
This search is guided by explicit optimization hypotheses and performed with large-scale parallel evaluation.
The researchers ran 10 problems from KernelBench Level 1 for testing.
They adjusted the problem sizes to ensure that kernel launch overhead was negligible relative to the overall runtime of the problem.
Then, 5 rounds of experiments were conducted using OpenAI o3 and Gemini 2.5 Pro models.
The figure below shows the distribution of rounds in which the best-performing kernels were first discovered.
It can be seen that most optimal results appeared in later rounds (total 5 rounds), with the vast majority appearing in rounds 4 or 5.
As the search scope expanded, the researchers also found that many high-performance kernel optimization strategies are highly similar, concentrating on a few common patterns, which is consistent with their experience of manually writing kernels.
The main optimization categories are summarized as follows—
Memory access optimization: Improves the efficiency of data transfer between different memory levels (global memory, shared memory, registers) and ensures data access patterns maximize bandwidth and minimize conflicts.
Asynchronous operations and latency hiding: Hides latency by overlapping long-running operations (e.g., global memory access) with computation or other memory transfers.
Data type and precision optimization: Where permissible, uses lower-precision data types (such as FP16 or BF16) as much as possible to reduce memory bandwidth requirements, improve cache efficiency, and potentially leverage specialized hardware acceleration units.
Computation and instruction optimization: Improves the efficiency of arithmetic operations themselves, reduces the number of instructions, or utilizes specialized hardware instructions.
Parallelism and occupancy enhancement: Maximizes the number of active warps on Streaming Multiprocessors (SMs) to better hide latency and improve overall throughput.
Control flow and loop optimization: Reduces overhead introduced by loops, branches, and index calculations.
Summary
The method adopted by the researchers in this study aligns with an increasingly significant trend in AI research—
Combining powerful reasoning capabilities with parallel exploration of multiple hypotheses can lead to performance improvements.
As some recent research (e.g., AlphaEvolve, Gemini 2.5 Pro Deep Think) emphasizes, we don't always need large-scale retraining.
Sometimes, clever search and branching strategies are sufficient to spark scientific innovation and solve complex problems, and extensive searching with the help of validators can yield even greater benefits.
However, this does not mean that we don't need further training.
On the contrary, the researchers' method also helps generate higher-quality synthetic data for improving future model training (which requires more problem instances).
Therefore, it is both a powerful test-time scaling method and a step towards developing more intelligent and data-efficient models.
Moreover, the results presented by the researchers this time are only preliminary. The quality of these optimization results seems considerable, but there is still ample room for improvement, such as generating even better optimization ideas, producing higher-quality final code, and applying this method to increasingly complex kernels.
Currently, two specific examples that the researchers are actively improving include:
FP16 Matmul: Achieved 52% of torch.matmul performance
FP16 Flash Attention: Achieved 9% of torch.nn.functional.scaled_dot_product_attention performance
In modern machine learning tasks, FP32 is less common than FP16 or BF16, and optimizations for FP32 are often fewer on newer hardware.
This may partly explain why FP32-based kernels are more likely to surpass PyTorch in performance.
About the Authors
Anne Ouyang
Anne Ouyang is currently a Computer Science (CS) Ph.D. student at Stanford University, conducting research at the Scaling Intelligence Lab.
Her research interests primarily focus on scalable self-improving machine learning systems, while also broadly covering empirical ML and performance engineering.
Previously, she earned her bachelor's and master's degrees from MIT and worked at the NVIDIA cuDNN team, where she was responsible for writing CUDA kernels to accelerate deep learning workloads on GPUs.
Azalia Mirhoseini
Azalia Mirhoseini is an Assistant Professor of Computer Science at Stanford University and the founder of the Scaling Intelligence Lab, also serving as a Senior Research Scientist at Google DeepMind.
Her lab is dedicated to developing scalable, self-evolving AI systems and methodologies with the aim of advancing artificial general intelligence.
Before joining Stanford, she worked for many years at top AI labs in the industry, including Google Brain and Anthropic.
Her notable past achievements include:
Proposing the Mixture of Experts (MoE) neural architecture – currently widely used by cutting-edge AI models;
Leading the AlphaChip project – a groundbreaking work that applied deep reinforcement learning to layout optimization, successfully used in the design of advanced chips like Google AI accelerators (TPU) and data center CPUs;
Conducting in-depth research on scaling test-time computation
Percy Liang
Percy Liang is an Associate Professor of Computer Science at Stanford University and the Director of the Center for Research on Foundation Models (CRFM). He is also the creator of CodaLab Worksheets, through which he strongly advocates for research reproducibility.
He currently focuses on improving the accessibility and interpretability of foundation models (especially large language models) through open source and rigorous benchmarking.
He has conducted extensive research in the fields of machine learning and natural language processing, with specific directions including robustness, interpretability, human-computer interaction, learning theory, knowledge grounding, semantics, and inference.
Previously, he received his bachelor's degree from MIT in 2004 and his Ph.D. from UC Berkeley in 2011.
Kernel Optimization Process
Finally, here's an example of the optimization trajectory for an automatically generated Conv2D idea, with a torch reference baseline time of 1.41 milliseconds.
Round 0: 7.02 ms, 20.1% of reference performance
Idea: Given PyTorch code, replace operations with a CUDA Kernel.
Round 1: 7.54 ms, 18.8% of reference performance
Idea: Utilize read-only cache by loading immutable tensors using __ldg.
Round 2: 3.46 ms, 41.0% of reference performance
Idea: Convert convolution to FP16 Tensor-Core GEMM.
Author's comment: This is an algorithmic optimization, converting convolution to implicit GEMM, which is crucial for efficient convolution on Tensor Cores.
Round 3: 3.67 ms, 38.7% of reference performance
Idea: Employ double-buffered cp.async pipeline to overlap global memory loading with Tensor-Core computation.
Round 4: 3.46 ms, 41.0% of reference performance
Idea: Given PyTorch code, replace operations with a CUDA Kernel for implicit matrix multiplication (implicit matmul). A given GEMM kernel might be helpful.
Author's comment: Since the optimization involves using GEMM, an existing good GEMM kernel generated previously was used as a seed at the beginning of this round. This idea was manually written.
Round 5: 1.91 ms, 74.9% of reference performance
Idea: Inside each K-tile loop, precompute k_idx decomposed kernel/input indices and reuse them in shared memory to avoid redundant arithmetic operations.
Round 6: 1.37 ms, 103.6% of reference performance
Idea: Precompute and cache N-dimensional GEMM indices in shared memory to reduce redundant arithmetic operations within the input data loading loop.
Round 7: 1.38 ms, 102.9% of reference performance
Idea: Parallelize CUDA kernel output by using dedicated per-warp shared memory buffers to eliminate warp-level serialization when writing to global memory.
Round 8: 1.37 ms, 103.6% of reference performance
Idea: Precompute and cache base input coordinates in shared memory to reduce redundant arithmetic operations during input data loading.
Round 9: 1.36 ms, 105.1% of reference performance
Idea: Software pipeline B-fragment loading to overlap the next B-tile's shared memory reads with the current B-tile's WMMA computations.
Round 10: 1.07 ms, 133.6% of reference performance
Idea: Reuse precomputed N-dimensional GEMM decomposition in shared memory for output address calculations, avoiding redundant and costly division/modulo operations.
Round 11: 1.21 ms, 117.4% of reference performance
Idea: Remove hi/lo decomposition in half WMMA operations; if final precision is acceptable, rely on standard FP16 accumulation for performance improvement.
Round 12: 1.01 ms, 141.2% of reference performance
Idea: Dynamically compute K-dimensional indices during the pipeline's loading stage and use double buffering to overlap global memory loading of Asub (weights) and Bsub (inputs) K-loops with MMA computation.
Round 13: 0.795 ms, 179.9% of reference performance
Idea: Achieve vectorized shared memory writes for loading Asub_pipe and Bsub_pipe by using wider data types like half2.
Final Code
The finally generated Conv2D kernel code uses advanced CUDA techniques, the kind that would be very challenging for humans to write themselves!
References:
https://crfm.stanford.edu/2025/05/28/fast-kernels.html
https://news.ycombinator.com/item?id=44139454