AI Inference Soars 7.5x! NVIDIA Rubin CPX Redefines AI Profitability, Turning a $100M Investment into a $5B Return

On September 9th, the AI world was once again upended by that man. Yes, we're talking about the leather-jacketed Jensen Huang, founder and CEO of NVIDIA. At the AI Infra Summit, with a smile and an air of nonchalance, he unveiled a new category of GPU called Rubin CPX.


In the past, if a conversation with an AI ran a little long, the model would start to hallucinate and we'd have to open a fresh window. Now AI is racing toward the 'agent' paradigm, which requires multi-step reasoning, persistent memory, and the ability to process extraordinarily long contexts. Imagine asking an AI to analyze a software project with millions of lines of code, or to generate a complete movie outright. The volume of data to process, measured in tokens, is astronomical. A traditional GPU facing such a task is like a sprinter asked to run a marathon: either the compute falls short or the memory bandwidth can't keep up, and frustrating bottlenecks pile up.

The newly released Rubin CPX, short for Rubin Context Processing Unit, is built specifically to solve this 'marathon' problem. It is designed for context windows of over 1 million tokens. More impressive still is the approach it embodies, called 'disaggregated inference'. Simply put, the grand task of AI inference is split into two stages and handed to two specialized 'experts', and efficiency surges as a result: NVIDIA claims up to 7.5 times the performance of its previous flagship platform, and a return on investment (ROI) of an astonishing 30 to 50 times.

Jensen Huang stated at the launch event: "The Vera Rubin platform will mark another leap forward in AI computing, introducing both the next-generation Rubin GPU and a new processor class called CPX." He added: "Just as RTX revolutionized graphics and physical AI, Rubin CPX is the first CUDA GPU purpose-built for massive-context AI, where models reason across millions of tokens of knowledge at once."

Those are bold claims. How exactly does Rubin CPX pull them off, and just how powerful is this 'new nuclear bomb'?

Let Specialized GPUs Do Specialized Work

Let's first look at the two halves of AI inference. Until now, inference has been like a single chef doing everything, from washing and chopping the vegetables to stir-frying them. That works fine for a simple dish like tomato scrambled eggs. But ultra-long-context tasks are a 'Buddha Jumps Over the Wall' banquet. The model must first spend a long time 'prepping ingredients', that is, digesting massive amounts of input data. This is the context (prefill) phase, and it is compute-bound. Only once the ingredients are ready does the 'cooking' begin: generating output token by token. This is the generation (decode) phase, where serving speed is everything and memory bandwidth is the limiting factor, making it bandwidth-bound.
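To make the split concrete, here is a minimal toy sketch of the two phases in Python. It is purely illustrative (the 'attention' math is a stand-in), not NVIDIA's implementation:

```python
# Toy sketch of the two inference phases described above. "prefill" touches
# every input token at once (compute-bound), while "decode" emits one token
# at a time, re-reading the KV cache on every step (bandwidth-bound).

def prefill(prompt_tokens):
    """Context phase: process the whole prompt in one dense pass.
    Cost grows with prompt length, so long contexts need raw FLOPs."""
    return [(tok, tok * 31 % 97) for tok in prompt_tokens]  # stand-in for K/V tensors

def decode(kv_cache, max_new_tokens):
    """Generation phase: emit tokens one by one. Each step re-reads the
    entire KV cache, so throughput hinges on memory traffic, not FLOPs."""
    output = []
    for _ in range(max_new_tokens):
        next_tok = sum(v for _, v in kv_cache) % 50_000     # stand-in for attention
        output.append(next_tok)
        kv_cache.append((next_tok, next_tok * 31 % 97))     # cache grows as we generate
    return output

kv = prefill(list(range(1_000)))   # long context: heavy compute, done once
print(decode(kv, 5))               # decode loop: memory-bound, paid per token
```

The asymmetry is the whole point: prefill runs once over the entire prompt, while decode re-reads the ever-growing cache for every single output token.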


Take generating an hour-long video as an example. The model must first encode that hour of footage into roughly 1 million tokens. In the first stage, a traditional GPU exhausts itself just 'prepping the ingredients': compute runs out and latency balloons. In the second stage, the 'serving hatch' is too narrow: memory bandwidth is insufficient, so the generated content cannot be streamed out efficiently.
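A back-of-envelope calculation shows why. Everything below is a loud assumption (a hypothetical 100B-parameter model, 4-bit weights, perfect utilization); only the ~1M-tokens-per-hour figure comes from NVIDIA's example:

```python
# Rough arithmetic behind the two bottlenecks, under stated assumptions.

params     = 100e9   # assumed model size (hypothetical)
ctx_tokens = 1e6     # ~1 hour of video, per NVIDIA's example

# Context phase: each input token costs roughly 2*params FLOPs
# (linear term only; attention adds a quadratic-in-context term on top).
prefill_flops = 2 * params * ctx_tokens
print(f"prefill: ~{prefill_flops:.0e} FLOPs before the first output token")

# Generation phase: each output token must stream all weights from memory.
weight_bytes = params * 0.5                  # 4-bit weights ~ 0.5 bytes/param
target_tps   = 100                           # assumed serving speed, tokens/s
print(f"decode: ~{weight_bytes * target_tps / 1e12:.1f} TB/s of bandwidth needed")
```

One phase wants a mountain of FLOPs up front; the other wants terabytes per second of sustained memory traffic. No single memory technology is cheapest at both.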

NVIDIA's 'disaggregated inference' architecture upgrades the kitchen by hiring two master chefs. One is the Rubin CPX, the 'prep master': immensely powerful and specialized in the context phase, it chews through any amount of input with raw compute. The other is the standard Rubin GPU, the 'cooking and serving master': equipped with ultra-fast HBM4 high-bandwidth memory, it specializes in streaming out results at speed during the generation phase.

With this division of labor, each chef runs at full power in their area of expertise, and resource waste all but disappears. To keep the two chefs in sync, NVIDIA also provides a 'back-of-house manager', the Dynamo platform, which coordinates the critical KV cache handoff, routes tasks, and manages memory, ensuring a smooth, seamless transition between the two phases.
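Conceptually, the serving flow looks like the sketch below. The class and method names are hypothetical, invented purely for illustration; this is not the real Dynamo API:

```python
# Conceptual sketch of disaggregated serving. All names here are made up;
# the point is the flow: prefill on a context worker, hand off the KV
# cache, then decode on a bandwidth-optimized worker.

from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: list                      # stand-in for the real key/value tensors

class ContextWorker:                  # plays the role of Rubin CPX
    def prefill(self, prompt):
        return KVCache(tokens=list(prompt))

class GenerationWorker:               # plays the role of the HBM4 Rubin GPU
    def decode(self, kv: KVCache, n: int):
        # each real step would re-read kv; here we just acknowledge its size
        return [f"tok{len(kv.tokens) + i}" for i in range(n)]

class Orchestrator:                   # plays the role of the Dynamo layer
    def __init__(self):
        self.ctx, self.gen = ContextWorker(), GenerationWorker()

    def serve(self, prompt, n_out):
        kv = self.ctx.prefill(prompt)      # phase 1: compute-heavy
        # KV cache ships between workers here (the rack interconnect, in reality)
        return self.gen.decode(kv, n_out)  # phase 2: bandwidth-heavy

print(Orchestrator().serve("analyze this million-token repo".split(), 3))
```

The KV cache handoff is the one new cost this design introduces, which is exactly why an orchestration layer like Dynamo matters.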

The 'prep master' Rubin CPX is itself a formidable piece of hardware. It uses a monolithic die based on the latest Rubin architecture and is packed with cutting-edge technology. It delivers up to 30 petaFLOPS of NVFP4 compute, i.e. 30 quadrillion floating-point operations per second, optimized for low-precision inference. For memory it uses 128 GB of GDDR7, striking a strong balance between cost and bandwidth that suits the high data throughput of the context phase. Even better, it integrates hardware video decoders and encoders, so it can ingest long video streams directly and skip a lot of pre-processing. In core attention computations, it is a full 3 times faster than the flagship GB300 NVL72.
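Reusing the hypothetical 100B-parameter model from the earlier sketch, 30 petaFLOPS translates into single-digit seconds of pure prefill compute for a million-token prompt (ideal utilization assumed):

```python
# What 30 petaFLOPS buys in the 'prep' stage, under the same hypothetical
# 100B-parameter model as above (ideal utilization, linear term only).
prefill_flops = 2 * 100e9 * 1e6              # ~2e17 FLOPs for a 1M-token prompt
print(f"~{prefill_flops / 30e15:.1f} s of pure compute on one Rubin CPX")
```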

Pushing Boundaries with Brute-Force Hardware and Unreasonable Specs

Of course, a single CPX, however powerful, is just one component. NVIDIA's traditional strength is 'teaming up'. The Rubin CPX is the core combat power of the NVIDIA Vera Rubin NVL144 CPX platform: simply put, a single rack packed with top-tier hardware, a veritable AI supercomputer in a cabinet. The spec sheet is staggering. It houses 144 'prep master' Rubin CPX units and 144 'cooking master' Rubin GPUs, orchestrated by 36 Vera CPUs. It provides 100 TB of fast memory with aggregate bandwidth of up to 1.7 PB per second, that is, 1.7 quadrillion bytes. At NVFP4 precision, this behemoth's total compute reaches a terrifying 8 exaFLOPS, or 8 quintillion floating-point operations per second.


What does this mean? This single rack delivers 7.5 times the performance of NVIDIA's current flagship, the GB300 NVL72. Even against the CPX-less Vera Rubin NVL144 (3.6 exaFLOPS), it is 2.2 times more powerful. To let these performance beasts cluster into even larger formations, NVIDIA offers two top-tier networking options: the ultra-low-latency, high-throughput Quantum-X800 InfiniBand fabric, or the Spectrum-X Ethernet stack optimized for AI workloads, paired with Spectrum-XGS switches and ConnectX-9 SuperNICs, keeping data flowing unobstructed.
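These multiples are internally consistent with the figures quoted above, as a quick sanity check shows:

```python
# Cross-checking the rack-level math against the article's own numbers.
cpx_total   = 144 * 30e15     # 144 Rubin CPX units at 30 petaFLOPS each
rubin_total = 3.6e18          # Vera Rubin NVL144 without CPX, per the text
print(f"{(cpx_total + rubin_total) / 1e18:.2f} exaFLOPS")  # ~7.92, i.e. the quoted ~8
print(f"{8e18 / 3.6e18:.1f}x over the plain NVL144")       # ~2.2x, as claimed
```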

To give a more intuitive sense of how cleanly the two 'master chefs' divide the work, the table below compares their core parameters. The data comes from official NVIDIA materials and reporting by the respected hardware outlet Tom's Hardware.

| | Rubin CPX ('prep master') | Standard Rubin GPU ('cooking master') |
|---|---|---|
| Memory | 128 GB GDDR7 | HBM4 |
| NVFP4 compute | 30 petaFLOPS per unit | ~3.6 exaFLOPS across an NVL144's 144 GPUs |
| Bound by | Compute (context/prefill phase) | Memory bandwidth (generation/decode phase) |
| Notable extras | Monolithic die; integrated hardware video encode/decode | Ultra-high HBM4 bandwidth |

See? Rubin CPX uses relatively affordable GDDR7 memory to achieve extreme compute density, focusing on the hardest part: context understanding. The standard Rubin GPU leverages lavish HBM4 bandwidth to focus single-mindedly on rapid content generation. This 'specialist' division of labor is the essence of the disaggregated inference architecture and the source of its efficiency gains.

What Can Million-Token Context Really Change?

After all this talk about technology, some might ask: what actual changes can this million-token context bring to our lives? Good question. The changes are immense.

In software development, AI programming assistants, such as the familiar GitHub Copilot, could previously only help write small code snippets within a single file, essentially being 'blind' to the macroscopic structure of the entire project. But with Rubin CPX's ultra-long context capability, AI models can directly ingest the entire codebase, all relevant documentation, and even years of modification history in one go, forming a 'god's-eye view' for project-level code analysis and generation.

Michael Truell, CEO of AI programming company Cursor, is thrilled: "With NVIDIA Rubin CPX, Cursor will be able to deliver lightning-fast code generation and developer insights, transforming how software is created. This will unlock new levels of productivity and empower users to achieve ideas once out of reach."

In video generation, AI-generated video is evolving from few-second 'GIF animations' towards feature-length films. As mentioned earlier, generating an hour of high-definition video requires processing approximately 1 million tokens. Traditional GPUs spend too much time in the video content understanding phase, making real-time creation impossible.

The advent of Rubin CPX completely changes the game. Its integrated hardware video codecs can directly process video streams, significantly reducing pre-processing time. Cristóbal Valenzuela, CEO of Runway, commented: "Video generation is rapidly evolving towards longer contexts and more flexible, agent-driven creative workflows. We see Rubin CPX as a significant leap in performance that enables these demanding workloads to build more general and intelligent creative tools. This means creators – from independent artists to large studios – can achieve unprecedented speed, realism, and control in their work."

For true AI agents to achieve autonomous decision-making, they must possess long-term memory and powerful reasoning capabilities. Eric Steinberger, CEO of Magic, a company focused on AI software engineering automation, describes the future as follows: "With a 100 million-token context window, our models can see the entire codebase, years of interaction history, documentation, and libraries without fine-tuning. This allows users to guide agents through conversation and access their environment during testing, bringing us closer to autonomous agent experiences. Using GPUs like NVIDIA Rubin CPX greatly accelerates our computational workloads."

Real Returns are the Bottom Line

After all this performance and application talk, what about the commercial value? NVIDIA's official estimate is startling: the Vera Rubin NVL144 CPX platform, built around Rubin CPX, can achieve a "30 to 50 times return on investment." In other words, every $100 million in capital expenditure could generate up to $5 billion in token revenue.
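The headline arithmetic is simple enough to verify directly; note that the multiples and the capex figure are NVIDIA's own, with no independent cost breakdown public:

```python
# The revenue claim as plain arithmetic, using only NVIDIA's stated figures.
capex = 100e6
for multiple in (30, 50):
    print(f"{multiple}x ROI -> ${capex * multiple / 1e9:.0f}B in token revenue")
```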

This figure might sound like a pipe dream, but there is a chain of logic behind it. The 8 exaFLOPS of a single rack is 7.5 times the previous generation, so the cost per unit of compute drops sharply. The disaggregated architecture maximizes hardware utilization, multiplying inference throughput. And NVIDIA ships a complete software ecosystem, including the Dynamo platform mentioned earlier, NIM microservices, Nemotron multimodal models, and more, which further streamlines deployment and operations so customers can convert compute into revenue faster.

Jensen Huang summarized at the launch event: "Rubin CPX brings long-context processing performance and token revenue to unprecedented heights – far beyond the design limits of today's systems. This completely transforms AI programming assistants, from simple code generation tools to complex systems capable of understanding and optimizing large software projects."

Of course, powerful hardware cannot exist without a thriving software ecosystem. Behind Rubin CPX stands the entire NVIDIA AI empire. There's the NVIDIA Dynamo platform responsible for inference orchestration, which has already set records in MLPerf performance tests. There are enterprise-grade NVIDIA NIM microservices, providing top-tier AI inference capabilities for businesses. There's also the CUDA-X library with 6 million developers and nearly 6,000 applications, ensuring that Rubin CPX has a massive array of applications ready to run from day one. Furthermore, there's the AI Enterprise software platform, tailored for businesses, supporting full-scenario deployment from cloud, data centers to workstations.

Rubin CPX, through its disaggregated architecture and task-optimized design, precisely addresses the core pain points of long-context inference, paving the way for cutting-edge applications in software engineering, video creation, and AI agents.

The Vera Rubin NVL144 CPX platform further redefines the ceiling of AI infrastructure with its almost unbelievable performance parameters.

As Jensen Huang said: "Rubin CPX is the RTX moment for massive context AI."

From this moment on, AI may truly break free from the constraints of being a 'tool' and begin to become an intelligent partner with long-term memory, deep reasoning, and extraordinary creativity.

References:

https://nvidianews.nvidia.com/news/nvidia-unveils-rubin-cpx-a-new-class-of-gpu-designed-for-massive-context-inference

https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-workloads

https://www.tomshardware.com/tech-industry/semiconductors/nvidia-rubin-cpx-forms-one-half-of-new-disaggregated-ai-inference-architecture-approach-splits-work-between-compute-and-bandwidth-optimized-chips-for-best-performance
