NVIDIA Blackwell Ultra Sets the Bar in New MLPerf Inference Benchmark

09/09/2025

Inference performance is critical, as it directly influences the economics of an AI factory. The higher the throughput of AI factory infrastructure, the more tokens it can produce at a high speed - increasing revenue, driving down total cost of ownership (TCO) and enhancing the system's overall productivity.

Less than half a year since its debut at NVIDIA GTC, the NVIDIA GB300 NVL72 rack-scale system - powered by the NVIDIA Blackwell Ultra architecture - set records on the new reasoning inference benchmark in MLPerf Inference v5.1, delivering up to 45% more DeepSeek-R1 inference throughput compared with NVIDIA Blackwell-based GB200 NVL72 systems.

Blackwell Ultra builds on the success of the Blackwell architecture, featuring 1.5x more NVFP4 AI compute and 2x more attention-layer acceleration than Blackwell, as well as up to 288GB of HBM3e memory per GPU.

The NVIDIA platform also set performance records on all new data center benchmarks added to the MLPerf Inference v5.1 suite - including DeepSeek-R1, Llama 3.1 405B Interactive, Llama 3.1 8B and Whisper - while continuing to hold per-GPU records on every MLPerf data center benchmark.

Stacking It All Up

Full-stack co-design plays an important role in delivering these latest benchmark results. Blackwell and Blackwell Ultra incorporate hardware acceleration for the NVFP4 data format - an NVIDIA-designed 4-bit floating point format that provides better accuracy compared with other FP4 formats, as well as comparable accuracy to higher-precision formats.

NVIDIA TensorRT Model Optimizer software quantized DeepSeek-R1, Llama 3.1 405B, Llama 2 70B and Llama 3.1 8B to NVFP4. In concert with the open-source NVIDIA TensorRT-LLM library, this optimization enabled Blackwell and Blackwell Ultra to deliver higher performance while meeting strict accuracy requirements in submissions.
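To give a rough feel for what quantizing weights to a 4-bit floating point format involves, here is a minimal sketch of block-scaled quantization in plain Python. It is not the TensorRT Model Optimizer API or NVIDIA's implementation. The value grid (the magnitudes a 2-exponent-bit, 1-mantissa-bit float can represent), the block size of 16 and the use of an ordinary Python float as the per-block scale are all assumptions made for illustration, and the function names are hypothetical.

```python
# Illustrative block-scaled 4-bit float quantization.
# Assumptions (not stated in the article): the value grid, the block size,
# and storing each block's scale as a regular Python float.

# Magnitudes representable by a 4-bit float with 1 sign, 2 exponent, 1 mantissa bit.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of values sharing one scale


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from the FP4 grid."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 1.0, [0.0] * len(values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = peak / FP4_MAGNITUDES[-1]
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def quantize(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each block independently."""
    return [
        quantize_block(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate values from (scale, codes) pairs."""
    restored = []
    for scale, codes in blocks:
        restored.extend(scale * c for c in codes)
    return restored


if __name__ == "__main__":
    weights = [0.1 * i for i in range(-16, 16)]
    approx = dequantize(quantize(weights))
    worst = max(abs(a - b) for a, b in zip(weights, approx))
    print(f"largest round-trip error: {worst:.4f}")
```

Giving each small block its own scale keeps a single outlier from stretching the grid for the whole tensor, which is one common way low-precision formats preserve accuracy.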

Large language model inference consists of two workloads with distinct execution characteristics: 1) context for processing user input to produce the first output token and 2) generation to produce all subsequent output tokens.

A technique called disaggregated serving splits context and generation tasks so each part can be optimized independently for best overall throughput. This technique was key to record-setting performance on the Llama 3.1 405B Interactive benchmark, helping to deliver a nearly 50% increase in performance per GPU with GB200 NVL72 systems compared with each Blackwell GPU in an NVIDIA DGX B200 server running the benchmark with traditional serving.
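The split between context and generation can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not NVIDIA Dynamo or TensorRT-LLM code: the class names, the `Handoff` record, the round-robin assignment and the `toy_next_token` stand-in for a model forward pass are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: list          # input token ids
    max_new_tokens: int   # total output tokens wanted (at least 1)


@dataclass
class Handoff:
    """State passed from the context stage to the generation stage."""
    kv_cache: list        # stand-in for the attention key/value cache
    first_token: int
    remaining: int        # output tokens still to produce


def toy_next_token(cache):
    # Hypothetical stand-in for a model forward pass.
    return (sum(cache) + len(cache)) % 100


class ContextWorker:
    """Processes the whole prompt in one pass and emits the first output token."""

    def run(self, request):
        cache = list(request.prompt)
        first = toy_next_token(cache)
        cache.append(first)
        return Handoff(cache, first, request.max_new_tokens - 1)


class GenerationWorker:
    """Produces all subsequent tokens, one step at a time, from the handed-off cache."""

    def run(self, handoff):
        cache = list(handoff.kv_cache)
        tokens = [handoff.first_token]
        for _ in range(handoff.remaining):
            nxt = toy_next_token(cache)
            cache.append(nxt)
            tokens.append(nxt)
        return tokens


def serve(requests, context_pool, generation_pool):
    """Route each request through a context worker, then a generation worker."""
    outputs = []
    for i, request in enumerate(requests):
        context_worker = context_pool[i % len(context_pool)]
        generation_worker = generation_pool[i % len(generation_pool)]
        outputs.append(generation_worker.run(context_worker.run(request)))
    return outputs


if __name__ == "__main__":
    result = serve(
        [Request(prompt=[1, 2, 3], max_new_tokens=3)],
        context_pool=[ContextWorker()],
        generation_pool=[GenerationWorker(), GenerationWorker()],
    )
    print(result)
```

Because the two pools are separate, the number of workers in each, and how each is tuned, can be chosen independently. In a real deployment the handoff step would involve moving the key/value cache between GPUs, which this sketch does not model.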

NVIDIA also made its first submissions this round using the NVIDIA Dynamo inference framework.

NVIDIA partners - including cloud service providers and server makers - submitted results using the NVIDIA Blackwell and/or Hopper platform. These partners include Azure, Broadcom, Cisco, CoreWeave, Dell Technologies, Giga Computing, HPE, Lambda, Lenovo, Nebius, Oracle, Quanta Cloud Technology, Supermicro and the University of Florida.

The market-leading inference performance on the NVIDIA AI platform is available from major cloud providers and server makers. This translates to lower TCO and enhanced return on investment for organizations deploying sophisticated AI applications.

Learn more about these full-stack technologies by reading the NVIDIA Technical Blog on MLPerf Inference v5.1. Plus, visit the NVIDIA DGX Cloud Performance Explorer to learn more about NVIDIA performance, model TCO and generate custom reports.

LINK:	https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/...