Blackwell’s MLPerf Run Puts AI Training Bottlenecks At Rack Scale
NVIDIA says Blackwell led MLPerf Training 6.0 across all seven benchmarks, with submissions scaling to 8,192 GPUs and GB300 NVL72 training up to 1.6x faster than GB200 NVL72 at the same scale.

Blackwell Moves The Benchmark Fight To Full Racks
NVIDIA’s latest MLPerf Training 6.0 results put the AI training race at rack and cluster scale, not just at the level of an individual accelerator.
The company says its Blackwell platform delivered the fastest time to train across all seven MLPerf Training 6.0 benchmarks.
It also submitted across every benchmark in the suite and scaled one Blackwell NVL72 training run to 8,192 GPUs.
Large AI training is increasingly limited by networking, memory, reliability and power envelopes.
Faster chips help, but frontier-model training also depends on whether thousands of GPUs can behave like a coordinated system long enough to finish a run.
MoE Workloads Raise The Networking Burden
MLPerf Training 6.0 added two mixture-of-experts pretraining workloads: DeepSeek-V3 671B and GPT-OSS-20B.
Those workloads make the network more important because tokens have to move across GPUs to reach the right expert subnetworks.
NVIDIA submitted results on GB200 NVL72 and GB300 NVL72 rack-scale systems.
In each NVL72 rack, fifth-generation NVLink Switches connect 72 GPUs into a shared pool of compute and memory.
NVIDIA put the GB300 NVL72 advantage at up to 1.6x versus GB200 NVL72 when both systems were measured at the same scale.
The company tied the gain to Blackwell Ultra features including NVFP4, expanded memory capacity and a higher power ceiling.
Scale Claims Now Depend On Reliability
The largest Blackwell submission used 8,192 GPUs on DeepSeek-V3 671B with GB200 NVL72 systems.
NVIDIA also submitted at 5,120 GPUs on Llama 3.1 405B.
Partner submissions show how hyperscalers and AI clouds are turning the same platform into cluster-scale results.
Microsoft Azure ran Llama 3.1 405B training across 8,192 GPUs and hit the reference quality target in 7.07 minutes.
CoreWeave used GB300 NVL72 with Spectrum-X Ethernet to reach the DeepSeek-V3 671B quality target in 2.02 minutes at 8,192-GPU scale.
At that scale, reliability becomes part of performance.
NVIDIA says production training can run for weeks or months across hundreds of thousands of GPUs, so checkpointing, fault detection and network rerouting affect whether theoretical throughput becomes usable training time.
Partner Demand Is The Commercial Proof
The ecosystem list is broad.
NVIDIA says 19 organizations submitted in this MLPerf round, including Microsoft Azure, CoreWeave, Google Cloud, Dell Technologies, Hewlett Packard Enterprise, Supermicro, Cisco and others.
Customer examples add another layer.
Cohere said its North agentic AI platform trained 3x faster on GB200 NVL72.
Thinking Machines Lab reported 2x faster training and serving speeds on GB300 NVL72 through Google Cloud.
Higgsfield said Nebius infrastructure cut model training time by 30%, while its service has 22 million users generating more than 6 million pieces of AI content per day.
The unresolved issue is not whether NVIDIA can post benchmark wins.
It is whether data-center operators can keep feeding these rack-scale systems with power, networking, cooling and customer workloads at the pace Blackwell demand implies.
















