Jensen Huang Confirmed NV72 Vera Rubin Cabinets in Production

Jensen Huang announced that NV72 Vera Rubin cabinets are now in full production, and that’s a bigger deal than most headlines are making it out to be. The news came at Nvidia’s Computex 2025 keynote, and Huang didn’t hold back on the roadmap details. This is no ordinary chip refresh. It’s a fundamental rethinking of how AI computation is packaged, cooled and deployed at scale.

The NV72 label refers to a complete rack-scale system containing 72 Vera Rubin GPUs in a single liquid-cooled cabinet. Nvidia is pitching the cabinets as the basis for AI training and inference workloads through 2026 and beyond. I’ve been following hardware launches for a decade, and the ambition here at the cabinet level is genuinely different from what we’ve seen before.

Table of contents

What the NV72 Vera Rubin Architecture Delivers

Memory, Bandwidth, and Tensor Core Gains Over Blackwell

Manufacturing Ramp and Production Volume Forecasts

Customer Deployments and Competitive Positioning

Inference vs. Training: Who Benefits Most

What This Means for the AI Hardware Market

Conclusion

FAQ

What the NV72 Vera Rubin Architecture Delivers

When Jensen Huang announced that NV72 Vera Rubin cabinets were in production, he revealed key architectural details, and some of them surprised me when I first looked at the specs.

The Vera Rubin GPU is built on the architecture that succeeds Blackwell. It uses Nvidia’s next-generation streaming multiprocessors with substantially improved tensor cores. That’s not just marketing hype. The silicon changes underneath are substantial.

The headline figure here is memory bandwidth. Every Vera Rubin GPU includes HBM4 stacks. Nvidia has not given specific per-chip bandwidth figures, but industry observers expect each GPU to deliver well over 8 TB/s, or almost twice what Blackwell B200 GPUs deliver with HBM3e. Twice. That’s not an incremental bump.

Tensor performance leaps in the same way. The new tensor cores natively handle FP4, FP6, FP8 and FP16 precision formats. Lower-precision computation is a huge boon for inference workloads. Training still needs FP8 or better, but the flexibility matters more than people think.

This is what makes the NV72 cabinet different from previous rack designs:

72 GPUs per cabinet, rather than 36 in the Blackwell GB200 NVL72 setup
GPU-to-GPU communication over the NVLink 6 interconnect with ultra-high bandwidth
Liquid cooling everywhere – no air-cooled option at this density (fair warning if your facility isn’t set up for it)
Integrated Vera CPUs, Nvidia’s own Arm-based successor to Grace
Single-fabric NVLink domain – all 72 GPUs share a single memory space

And the cabinet-level design means clients don’t build out individual servers. They order full racks. That makes deployment easier in ways that are easy to under-appreciate until you’ve actually tried to stand up a dense GPU cluster from scratch.

Shipments are underway, and you can expect to find updated specs on Nvidia’s official data center solutions page. Meanwhile, the move to rack-scale computing is part of a trend across the industry – but no one is doing it like this.

Memory, Bandwidth, and Tensor Core Gains Over Blackwell

To understand why the NV72 Vera Rubin cabinets Jensen Huang confirmed matter, you have to set them next to existing hardware. I’ve worked with many GPU configurations over the years, and the leap from Blackwell to Vera Rubin is genuinely large, not the usual 20% bump.

Feature	Blackwell B200 (GB200 NVL72)	Vera Rubin (NV72 Cabinet)
GPUs per cabinet	36 (in NVL72 config)	72
Memory type	HBM3e	HBM4
Memory per GPU	192 GB	Expected 288 GB+
Interconnect	NVLink 5	NVLink 6
Tensor precision	FP4, FP8, FP16	FP4, FP6, FP8, FP16
Cooling	Liquid	Liquid
CPU companion	Grace (Arm)	Vera CPU (Arm, next-gen)
Manufacturing node	TSMC 4NP	TSMC 3nm-class

The move to HBM4 is crucial. Both Samsung and SK Hynix are building HBM4 stacks with wider interfaces and higher per-pin data rates. This means that memory-bound AI models, a category that includes most large language models, run much faster. Bottom line: if your workload is memory-bound, this trumps just about any other criterion on the page.

The NVLink 6 interconnect also merits a mention. It lets all 72 GPUs communicate as if they were one giant processor. In particular, this unified memory domain means a single model can span the entire cabinet without complicated parallelism workarounds. Not having to troubleshoot distributed training configurations is remarkable in itself. I’ve spent way too many hours debugging distributed training setups.
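
To make the memory-bound argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that generating each token requires reading every model weight from memory once, so bandwidth divided by model size gives an upper bound on single-request decode speed. The 8 TB/s figure is the industry expectation quoted earlier; the 4 TB/s comparison point, the 70-billion-parameter model and the function name are illustrative assumptions, not published specs.

```python
def max_decode_tokens_per_s(bandwidth_tb_s: float, params_billion: float, bits_per_weight: int) -> float:
    """Upper bound on tokens/s for one request when weight reads dominate."""
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_tb_s * 1e12 / model_bytes


# Hypothetical 70B-parameter model stored at 8 bits per weight.
expected_hbm4 = max_decode_tokens_per_s(8.0, 70, 8)   # expected HBM4 figure cited in the article
half_bandwidth = max_decode_tokens_per_s(4.0, 70, 8)  # assumed comparison point, not a published spec

print(f"8 TB/s: {expected_hbm4:.0f} tokens/s, 4 TB/s: {half_bandwidth:.0f} tokens/s")
```

Under these assumptions the bound scales linearly with bandwidth, which is why a doubling matters so much for memory-bound models. Real deployments batch requests and reuse cached data, so measured numbers will differ.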

And moving to a TSMC 3-nanometer-class process node also helps power efficiency. Each GPU does more work per watt of power. Overall cabinet power consumption is still over 100 kW – heads up, that’s a major facilities conversation – but performance per watt goes up considerably.

FP6 precision support is new with Vera Rubin, and this one genuinely surprised me. It sits between FP4 and FP8, providing a sweet spot for certain inference tasks. It retains better model fidelity than FP4 while consuming less compute than FP8. That means operators can choose a precision calibrated to the workload instead of accepting a binary compromise.
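
One simple way to see where FP6 lands is to compare how much memory the weights alone occupy at each precision. The sketch below is illustrative only: the 70-billion-parameter model is a hypothetical example, and the calculation ignores scaling factors, metadata and any accuracy effects.

```python
BITS_PER_WEIGHT = {"FP4": 4, "FP6": 6, "FP8": 8, "FP16": 16}


def weight_footprint_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a given format."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


# Hypothetical 70B-parameter model at each precision.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_footprint_gb(70, fmt):.1f} GB")
```

For this example FP6 needs 52.5 GB, halfway between the 35 GB of FP4 and the 70 GB of FP8, which is the middle ground the paragraph above describes.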

Manufacturing Ramp and Production Volume Forecasts

Jensen Huang’s confirmation that NV72 Vera Rubin cabinets are in full production signals strong confidence in manufacturing. But what does “full production” mean in terms of volume? That’s the question to ask.

Details of the production timeline:

Q2 2025 – Engineering samples and validation units sent to key partners
Q3 2025 – Full manufacturing ramp-up at TSMC and assembly partners
Q4 2025 – First client shipments to hyperscalers (Microsoft, Google, Meta, Amazon)
H1 2026 – Wider availability to enterprise and cloud providers

TSMC manufactures the GPUs for Nvidia. The 3nm-class Vera Rubin chips require advanced CoWoS (Chip-on-Wafer-on-Substrate) packaging, which is still a real bottleneck and not just a talking point. TSMC has been actively expanding CoWoS capacity throughout 2024 and 2025, but even “aggressive expansion” in semiconductor manufacturing moves slowly.

Supply constraints are likely. Blackwell GPUs were in short supply for a long time after launch, and we’ll likely see the same with Vera Rubin cabinets. Demand from hyperscalers alone could absorb the initial manufacturing runs entirely, and that’s before enterprise customers ever get a look in.

Analyst firms’ volume predictions are:

First year shipments of NV72 cabinets: 50,000-80,000
Revenue per cabinet projected at $3-5 million
AI infrastructure total addressable market above $200 billion by 2027
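
Multiplying the first two estimates together gives a sense of scale. The short Python sketch below uses only the figures in the list above; the function name is made up for illustration.

```python
def implied_revenue_billion(cabinets: int, price_million_usd: float) -> float:
    """Cabinet count times price per cabinet, expressed in billions of dollars."""
    return cabinets * price_million_usd / 1_000


low = implied_revenue_billion(50_000, 3.0)   # fewest cabinets at the lowest price
high = implied_revenue_billion(80_000, 5.0)  # most cabinets at the highest price

print(f"Implied first-year cabinet revenue: ${low:.0f}B to ${high:.0f}B")
```

The result is a range of $150 billion to $400 billion. The upper end sits well above the $200 billion market figure in the same list, so the three estimates probably rest on different assumptions and should not be read as one consistent forecast.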

Nvidia’s manufacturing partners, including Foxconn, Quanta and Wistron, are also setting up dedicated lines to assemble the cabinets. Liquid-cooled rack integration is hard and needs specialized equipment. This is one reason why the ramp takes time even when chips are available.

Jensen Huang has stressed that Nvidia’s annual architecture cadence means successors to Vera Rubin are already in development. So if you are planning procurement, don’t wait for perfect, because there is always something newer coming. If you want to keep a close eye on the numbers, Nvidia’s investor relations page tracks quarterly production and revenue milestones.

Customer Deployments and Competitive Positioning

Jensen Huang also highlighted early customer commitments when he announced that NV72 Vera Rubin cabinets were in production. The competitive dynamics here are genuinely interesting, and a bit more subtle than the typical “Nvidia wins everything” story.

Confirmed deployment partners are:

Microsoft Azure: NV72 cabinets for Azure AI services
Google Cloud: testing Vera Rubin alongside its own TPU v6 hardware
Meta: using the cabinets for training and inference with Llama models
Amazon Web Services: NV72 instances via EC2
Oracle Cloud: AI infrastructure partnerships with Nvidia
CoreWeave: scaling GPU cloud capacity with Vera Rubin systems

Japan, France and India have likewise launched sovereign AI initiatives. These governments want domestic AI compute capacity, and the NV72 cabinet offers a complete solution that is difficult to replicate quickly with alternatives.

And that’s where things get interesting: the competition from AMD and Intel. AMD’s Instinct MI350 series is aimed at similar workloads. Intel’s Gaudi 3 accelerator targets a lower price point. But neither has the same rack-scale integration as Nvidia’s NV72. And that gap is real, not simply a spec-sheet difference.

Here’s what the competitive landscape looks like:

Nvidia NV72 Vera Rubin: Top-tier performance, highest price, deepest software ecosystem (CUDA)
AMD Instinct MI350: Good price, decent performance, expanding ROCm software support
Intel Gaudi 3: Affordable, less mature software, better for particular inference workloads
Google TPU v6: Only in Google Cloud, optimized for JAX/TensorFlow workloads
Custom ASICs (Amazon Trainium, Microsoft Maia): Proprietary, tuned for certain internal workloads

So Nvidia still reigns. The CUDA software ecosystem remains the company’s biggest moat, and I don’t say that lightly. Most AI researchers develop for CUDA first. The switching costs are genuinely painful. AMD’s ROCm has improved a lot but still lags behind in library support and developer tooling. The gap is narrowing, but it hasn’t closed yet.

Nvidia also has a distinct integration advantage with the NV72 cabinet approach. The customer doesn’t have to work out networking, cooling or power distribution; everything is pre-configured. It’s a simple value proposition for enterprises that want to get AI infrastructure up and running quickly.

Inference vs. Training: Who Benefits Most

Jensen Huang has confirmed that NV72 Vera Rubin cabinets are in production. What does this mean for inference and training? These two kinds of workload have different needs, so it’s worth being precise about who benefits most.

Training workloads require high memory capacity, high bandwidth and fast GPU-to-GPU communication. Large language models such as GPT-5-class systems need hundreds of GPUs working in concert. That’s where the NV72 cabinet comes in: with its unified NVLink 6 domain, all 72 GPUs share gradients and activations without network bottlenecks. That’s the real draw for frontier model development.

For inference workloads, throughput and latency matter more than raw compute. They also benefit greatly from lower-precision formats such as FP4 and FP6. Vera Rubin’s tensor cores are built for this, which is why the architecture delivers more inference requests per second per watt than Blackwell. I’ve seen the cost of inference at scale compound first hand, so this matters.

Why this matters economically:

Training costs are one-time (per model version). You train once, then run.
Inference costs are ongoing. Each user query has a compute cost.
Inference now accounts for more than 60% of AI compute spend at large cloud providers.
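
The difference between a one-time cost and a per-query cost is easy to model. The Python sketch below is a toy illustration: the $50 million training cost and the $2 per thousand queries are made-up numbers chosen for round arithmetic, and the function names are hypothetical.

```python
def inference_spend(queries: float, cost_per_1k_queries: float) -> float:
    """Total inference cost in dollars for a given number of queries."""
    return queries / 1_000 * cost_per_1k_queries


def break_even_queries(training_cost: float, cost_per_1k_queries: float) -> float:
    """Number of queries at which cumulative inference spend equals training spend."""
    return training_cost / cost_per_1k_queries * 1_000


def inference_share(training_cost: float, queries: float, cost_per_1k_queries: float) -> float:
    """Fraction of total compute spend that goes to inference."""
    spend = inference_spend(queries, cost_per_1k_queries)
    return spend / (spend + training_cost)


TRAINING_COST = 50_000_000.0  # assumed one-time cost per model version, in dollars
COST_PER_1K = 2.0             # assumed inference cost per thousand queries, in dollars

print(f"Break-even at {break_even_queries(TRAINING_COST, COST_PER_1K):,.0f} queries")
print(f"Inference share at 37.5B queries: {inference_share(TRAINING_COST, 37.5e9, COST_PER_1K):.0%}")
```

With these assumed inputs, inference spend matches the training bill at 25 billion queries and reaches 60% of the total at 37.5 billion. The exact numbers are arbitrary, but the shape is the point: past the break-even, every efficiency gain on inference compounds.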

So Nvidia built Vera Rubin with inference efficiency as a key design objective. Its FP4 tensor cores deliver about 2x the inference throughput of Blackwell. Larger HBM4 memory pools also mean that larger models can fit on fewer GPUs, a cost reduction hiding behind a performance spec.

For enterprises deploying AI applications in production, this means lower cost per query. Or they can serve more users on the same hardware budget. Either way, the economics are much improved. And that’s ultimately what drives most teams’ procurement decisions, I’ve found.

Vera Rubin results will likely appear in MLPerf benchmark submissions once systems ship to customers. These standardized benchmarks are the most trustworthy way to compare vendors, far more dependable than anything in a vendor press release, even one from Nvidia.

Nvidia’s TensorRT inference optimization software is already being updated for Vera Rubin. Early access partners are seeing substantial speedups on popular models including Llama 3, Mixtral and Stable Diffusion variants. But those early figures are usually best-case scenarios, so wait for independent benchmarks before planning capacity around them.
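
The "fewer GPUs" point comes down to a ceiling division. The sketch below uses the per-GPU memory figures from the comparison table (192 GB for Blackwell, an expected 288 GB for Vera Rubin). The 1,000 GB weight size, roughly a trillion parameters at 8 bits each, is a hypothetical example, and the calculation ignores KV cache, activations and other runtime overhead.

```python
import math


def gpus_needed(weights_gb: float, memory_per_gpu_gb: float) -> int:
    """Minimum number of GPUs whose combined memory holds the weights."""
    return math.ceil(weights_gb / memory_per_gpu_gb)


weights_gb = 1_000.0  # hypothetical model: about 1 trillion parameters at 8 bits each

blackwell_gpus = gpus_needed(weights_gb, 192.0)   # per-GPU memory from the table
vera_rubin_gpus = gpus_needed(weights_gb, 288.0)  # expected per-GPU memory from the table

print(f"Blackwell: {blackwell_gpus} GPUs, Vera Rubin: {vera_rubin_gpus} GPUs")
```

For this example the count drops from six GPUs to four, which is where the hidden cost saving comes from.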

What This Means for the AI Hardware Market

With Jensen Huang confirming that NV72 Vera Rubin cabinets are in full production, the ripple effects go far beyond Nvidia. The whole AI hardware ecosystem has to react, and parts of it aren’t ready.

Power infrastructure is becoming a key bottleneck. A single NV72 cabinet will draw over 100 kW, so data centers need substantial electrical capacity and cooling infrastructure. The main constraint on deploying AI may be the availability of power, not the supply of chips. That’s a structural issue that can’t be fixed by building more fabs.
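
A quick way to size the facilities conversation is to scale the per-cabinet draw by the number of cabinets and a power usage effectiveness (PUE) factor, the ratio of total facility power to IT equipment power. In the Python sketch below, the 100 kW figure comes from the article, while the PUE of 1.2 and the 100-cabinet deployment are assumptions chosen for illustration.

```python
def facility_power_mw(cabinets: int, kw_per_cabinet: float = 100.0, pue: float = 1.2) -> float:
    """Total facility power in MW: IT load scaled by PUE for cooling and other overhead."""
    return cabinets * kw_per_cabinet * pue / 1_000


it_load_mw = facility_power_mw(100, pue=1.0)  # IT load only, no overhead
total_mw = facility_power_mw(100)             # with the assumed PUE of 1.2

print(f"IT load: {it_load_mw:.1f} MW, facility total: {total_mw:.1f} MW")
```

Under these assumptions a 100-cabinet deployment needs about 10 MW for the cabinets alone and roughly 12 MW at the facility level. Since the article puts cabinet draw at "over 100 kW", treat both numbers as a floor.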

The U.S. Department of Energy has recognized power usage by data centers as an emerging issue. There are new nuclear and renewable projects in the pipeline to support the growth of AI infrastructure and that says something about the scope of what is coming.

Supply chain effects are equally important:

HBM4 memory production must ramp up quickly
CoWoS advanced packaging capacity is strained
Demand for liquid cooling components is rising
Rack-level power distribution requires specialist equipment
Data center build times are stretching to 18-24 months

Then there’s the cost of an NV72 cabinet, at $3 million to $5 million each, which means only well-funded organizations can buy directly. This widens the divide between the AI haves and the AI have-nots. Smaller organizations are increasingly turning to the cloud for access to the latest hardware, and that trend will only accelerate.

The shift to cabinet-level sales also marks a substantial change in Nvidia’s business model. The company is selling full infrastructure units instead of individual GPUs or servers. That raises revenue per customer while simplifying deployment, which is good for Nvidia’s margins and, frankly, not terrible for customers either.

It will be interesting to see how AMD responds competitively. AMD’s MI350 accelerators offer attractive performance at lower price points. Although AMD lacks Nvidia’s rack-scale integration, its open-source ROCm software stack appeals to budget-conscious buyers. Plus, any serious enterprise evaluation should include AMD in the mix, because the savings can be considerable depending on your workload.

Conclusion

Jensen Huang confirmed that NV72 Vera Rubin cabinets are now in full production, and the consequences are huge. This is not just a faster GPU but an entirely new way of packaging and delivering AI compute. I’ve seen enough product cycles to know when something is truly different, and this one is.

The figures say it all. Seventy-two GPUs per rack. HBM4 memory with record-breaking bandwidth. NVLink 6 for unified memory throughout the entire system. Faster inference with FP4 and FP6 precision. Together these enhancements are a generational jump over Blackwell, not a point release.

What technology executives should do now:

Know your AI workload mix – is training or inference the primary driver of your compute requirements
Contact cloud providers – ask about NV72 Vera Rubin availability schedules on AWS, Azure and Google Cloud
Assess power infrastructure – confirm your data centers can support 100+ kW per cabinet
Check software compatibility – make sure your CUDA code will take advantage of Vera Rubin’s new tensor core features
Plan procurement early – supply shortages are a near certainty during initial ramp
Compare alternatives – AMD MI350 and cloud-native offerings may be cheaper for some workloads

Jensen Huang confirmed that NV72 Vera Rubin cabinets are in full production, which means the AI hardware industry is changing right now, not six months from now. Organizations that plan ahead will be the first to enjoy the performance benefits. Those waiting for supply to return to normal could find themselves a whole generation behind.

FAQ

What did Jensen Huang confirm about NV72 Vera Rubin cabinets?

Jensen Huang confirmed NV72 Vera Rubin cabinets have entered full production during Nvidia’s Computex 2025 keynote. Specifically, he stated that manufacturing partners are actively building complete rack-scale systems. These cabinets each contain 72 Vera Rubin GPUs, and first customer shipments are expected in Q4 2025 for hyperscale cloud providers.

How does the NV72 Vera Rubin cabinet differ from Blackwell GB200 NVL72?

The NV72 Vera Rubin cabinet doubles the GPU count per rack compared to Blackwell configurations. It uses HBM4 memory instead of HBM3e, providing significantly higher bandwidth. Additionally, it features NVLink 6 interconnects and a newer TSMC 3nm-class manufacturing process. The Vera Rubin architecture also introduces FP6 precision support for optimized inference workloads.

How much does an NV72 Vera Rubin cabinet cost?

Nvidia hasn’t disclosed official pricing. However, industry analysts estimate each NV72 Vera Rubin cabinet costs between $3 million and $5 million. This price includes all 72 GPUs, networking, liquid cooling, and power distribution. Consequently, most organizations will access these systems through cloud providers rather than purchasing directly.

When will NV72 Vera Rubin cabinets be available to customers?

Hyperscale customers like Microsoft, Google, Meta, and Amazon are expected to receive first shipments in Q4 2025. Broader enterprise availability through cloud platforms should follow in H1 2026. Nevertheless, supply constraints will likely limit availability during the initial production ramp, similar to what happened with Blackwell GPUs.

Is the NV72 Vera Rubin cabinet better for AI training or inference?

It excels at both, but Nvidia specifically optimized Vera Rubin for inference efficiency. The new FP4 and FP6 tensor core support delivers dramatically better inference throughput per watt. For training, the unified NVLink 6 memory domain across all 72 GPUs makes large model training more efficient. Therefore, organizations running mixed workloads benefit the most from these cabinets.

How does Nvidia’s NV72 Vera Rubin compare to AMD’s MI350 accelerators?

Nvidia’s NV72 Vera Rubin cabinets offer superior rack-scale integration and the industry’s most mature software ecosystem through CUDA. AMD’s MI350 accelerators compete on raw performance and typically cost less per chip. However, AMD doesn’t currently offer an equivalent cabinet-level product. The choice often depends on software requirements, budget, and whether your team already has CUDA expertise.