The next step in GPU inference acceleration is kernel fusion.
Compiled by Zheng Jiamei
Edited by Ma Xiaoning
The Stanford Hazy Research team has just announced a major optimization achievement: they have integrated the forward inference of the open-source model Llama-3.2-1B into a "Megakernel," pushing low-latency inference capabilities to their limit.
In certain latency-critical applications, such as conversational AI and interactive workflows with a human in the loop, the response speed of a large language model is not merely important; it can make or break the user experience.
The team argues that the real bottleneck for LLM inference speed lies in memory loading. Their research found that existing open-source inference engines (such as vLLM and SGLang), even on top-tier GPUs like the H100, utilize less than 50% of the available memory bandwidth in ultra-low-latency single-sequence generation tasks.
This is mainly because the forward pass of each Transformer layer is broken into dozens or even hundreds of small CUDA kernels, each performing a tiny operation (RMSNorm, attention, the MLP, rotary position embedding, and so on), which introduces a large number of switches and waits between kernels.
Worse, the combined launch and teardown overhead of these kernels is not fully hidden by mechanisms such as CUDA Graphs or PDL (Programmatic Dependent Launch); in short-duration tasks it is instead amplified.
In other words, the GPU spends a lot of time "waiting to work" instead of "working." The Hazy team's research is precisely focused on this problem.
1 Megakernel: A Fusion Approach Designed from Scratch
First, the experimental results: Megakernel reduced inference latency on H100 to less than 1 millisecond, with memory bandwidth utilization as high as 78%, an improvement of 2.5 times compared to vLLM and 1.5 times compared to SGLang; on the more advanced B200 platform, latency further decreased to 600-680 microseconds, approaching the theoretical limit.
Breaking a complete forward pass down by time:
about 250 microseconds go to storing activations, waiting for consistency, and loading data;
about 200 microseconds go to RMSNorm and matrix-vector multiplication (with matvec accounting for 95% of that);
weight loading takes only about 30 microseconds, with the pipelining mechanism performing stably;
synchronization between warps and barriers introduces roughly 40 microseconds of delay;
miscellaneous overheads such as setup, parameter passing, and page-status marking total approximately 80 microseconds.
Overall, under careful scheduling, the Hazy team's Megakernel has almost squeezed the current hardware performance to its limit.
These results come from a radical yet effective design approach proposed by the Hazy team: fusing the entire forward pass into a single CUDA kernel, which they call a Megakernel.
In their experiments, they built a lightweight "instruction interpreter" system that runs on the GPU, based on their existing ThunderMLA architecture. The system pre-allocates an "execution plan" for each Streaming Multiprocessor (SM), containing a sequence of instructions, where each instruction corresponds to a structural unit of the Transformer model (a code sketch of this interpreter loop follows the list below).
These instructions include:
Fused instructions for RMSNorm, QKV projection, and RoPE;
Attention matrix multiplication and reduction computation (supporting long-sequence GQA);
O-projection and residual addition;
MLP's RMSNorm, gate activation (SiLU), and up-projection;
Down-projection and final residual;
Last layer RMSNorm + language modeling head.
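As a rough illustration of the idea, a per-SM interpreter can be pictured as a persistent thread block that walks a statically prepared instruction list and dispatches on an opcode. This is only a sketch: the opcode names, struct layout, and dispatch scheme below are assumptions, not the Hazy team's actual code.

```cuda
#include <cuda_runtime.h>

// Hypothetical opcodes mirroring the fused instructions listed above.
enum Opcode {
    OP_RMSNORM_QKV_ROPE,   // fused RMSNorm + QKV projection + RoPE
    OP_ATTENTION,          // attention matmul + reduction (GQA)
    OP_O_PROJ_RESIDUAL,    // O-projection + residual add
    OP_MLP_GATE_UP,        // RMSNorm + SiLU gate + up-projection
    OP_MLP_DOWN_RESIDUAL,  // down-projection + final residual
    OP_LM_HEAD,            // last-layer RMSNorm + LM head
    OP_HALT
};

struct Instruction {
    Opcode op;
    int    args[4];        // e.g. layer index, weight offsets, page hints
};

// One statically prepared plan per SM.
struct ExecutionPlan {
    const Instruction* instructions;
    int                count;
};

// Launched with one persistent block per SM; each block walks its own plan,
// so the entire forward pass needs only a single kernel launch.
__global__ void megakernel_interpreter(const ExecutionPlan* plans) {
    ExecutionPlan plan = plans[blockIdx.x];
    for (int i = 0; i < plan.count; ++i) {
        Instruction inst = plan.instructions[i];
        switch (inst.op) {
            case OP_RMSNORM_QKV_ROPE:  /* fused norm + QKV + RoPE    */ break;
            case OP_ATTENTION:         /* attention matmul + reduce  */ break;
            case OP_O_PROJ_RESIDUAL:   /* O-projection + residual    */ break;
            case OP_MLP_GATE_UP:       /* SiLU gate + up-projection  */ break;
            case OP_MLP_DOWN_RESIDUAL: /* down-projection + residual */ break;
            case OP_LM_HEAD:           /* final norm + LM head       */ break;
            case OP_HALT:              return;
        }
        __syncthreads();  // keep the block in lockstep between instructions
    }
}
```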
Each instruction is built upon a unified CUDA template, achieving standardized encapsulation of load, store, and compute operations. Inter-instruction dependencies are statically arranged by the interpreter before runtime, and each SM can repeatedly reuse the same schedule to process multiple tokens.
Furthermore, to ensure efficient data paths, the interpreter statically orchestrates these execution plans according to the model structure, avoiding dynamic branching during scheduling, thereby improving throughput and concurrent execution capabilities.
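The "unified CUDA template" mentioned above can be imagined roughly as follows: every instruction exposes the same load/compute/store interface, and a generic driver sequences the three phases. The interface and the example struct are illustrative assumptions, not the team's actual template.

```cuda
// Generic driver: every instruction type plugs into the same three-phase shape.
template <typename Op>
__device__ void execute(typename Op::Args args, char* smem_page) {
    Op::load(args, smem_page);      // stage weights/activations into shared memory
    __syncthreads();
    Op::compute(args, smem_page);   // run the math on the staged data
    __syncthreads();
    Op::store(args, smem_page);     // write results back to global memory
}

// Example instruction conforming to the interface (bodies elided).
struct RmsNormQkvRope {
    struct Args { const float* x; const float* w; float* out; int dim; };
    __device__ static void load(Args, char*)    { /* async-copy weight tiles */ }
    __device__ static void compute(Args, char*) { /* RMSNorm + QKV + RoPE    */ }
    __device__ static void store(Args, char*)   { /* write q, k, v           */ }
};
```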
To enable pipelined computation and avoid shared-memory conflicts, the team also implemented paged management of GPU shared memory (sketched in code after this list). For example:
The first 213 KB of shared memory is divided into 13 pages of 16 KiB each;
The remaining portion is used to store instruction parameters, page allocation information, etc.;
Each instruction explicitly requests a page before loading and returns it to the interpreter scheduler after completion;
When a page is freed, the interpreter immediately allocates it to the next waiting instruction.
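A minimal sketch of such a page pool is shown below, with the sizes taken from the article (13 pages of 16 KiB each). The atomic free-mask allocator is an assumption about the mechanics; in the actual Megakernel the interpreter hands pages out according to its static schedule.

```cuda
#include <cuda_runtime.h>

#define PAGE_SIZE (16 * 1024)  // 16 KiB per page
#define NUM_PAGES 13           // 13 pages, ~213 KB total

struct PagePool {
    unsigned int free_mask;                // bit i set => page i is free
    char pages[NUM_PAGES][PAGE_SIZE];      // the shared-memory pages themselves
};

// Acquire any free page; spin until one is released by a finishing instruction.
__device__ int acquire_page(PagePool* pool) {
    while (true) {
        unsigned int mask = pool->free_mask;
        if (mask != 0) {
            int page = __ffs(mask) - 1;    // index of the first free page
            unsigned int bit = 1u << page;
            // Atomically clear the bit; if another warp grabbed it first, retry.
            if (atomicAnd(&pool->free_mask, ~bit) & bit) return page;
        }
    }
}

// Return a page so the interpreter can hand it to the next waiting instruction.
__device__ void release_page(PagePool* pool, int page) {
    atomicOr(&pool->free_mask, 1u << page);
}
```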
This mechanism ensures that the next computation phase can start pre-loading weights as early as possible, thereby maximizing bandwidth utilization and eliminating "bubbles."
However, a Megakernel cannot rely on the implicit synchronization that normally happens between separate kernel launches. The Hazy team therefore uses a counter system: they maintain a set of integers in global memory, and each instruction increments its corresponding counter when it completes. If an instruction depends on the result of an earlier step, it waits until that counter reaches a specific value before executing.
For example, in the MLP down-projection stage, the team splits the intermediate state into 4 chunks, and each chunk triggers the computation that depends on it as soon as it is written, so the stages overlap in a pipeline. By specifying the dependency graph precisely, the team also avoids global barriers, sharply reducing the time wasted waiting between instructions and keeping the kernel's execution as close as possible to the theoretical concurrency limit.
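In code, this counter-based dependency tracking might look like the sketch below, using libcu++ device-scope atomics. The granularity and polling strategy are assumptions; the article only states that each instruction bumps a counter and dependents wait for a target value.

```cuda
#include <cuda/atomic>

// Device-scope counter living in global memory.
using Counter = cuda::atomic<int, cuda::thread_scope_device>;

// Producer side: after an instruction writes one chunk of its output, it
// publishes the chunk by bumping the counter (release makes the writes visible).
__device__ void signal_chunk(Counter* counter) {
    counter->fetch_add(1, cuda::std::memory_order_release);
}

// Consumer side: an instruction that depends on, say, all 4 chunks of the MLP
// intermediate state spins until the counter reaches that target value.
__device__ void wait_for(Counter* counter, int target) {
    while (counter->load(cuda::std::memory_order_acquire) < target) {
        // spin; in practice a single thread can poll, then __syncthreads()
    }
}
```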
In addition, the research team measured the cost of CUDA asynchronous barriers and found that even checking a barrier that has already been "passed" still takes about 60 ns each time, so synchronization costs are far from negligible. In practice, especially for critical operations like matrix-vector multiplication, they found that on the Hopper architecture (e.g., H100) ordinary CUDA cores (rather than Tensor Cores) can be more effective, while on the Blackwell architecture Tensor Cores come out ahead.
This also indicates that for different generations of hardware, the optimal implementation path for Megakernel should adapt to micro-architectural differences, rather than using a single solution for all platforms.