Digital Event Horizon
NVIDIA's latest rack-scale system, powered by the Blackwell Ultra architecture, has set a new benchmark for AI inference performance, delivering up to 1.4x more DeepSeek-R1 throughput than its GB200 NVL72 predecessor. The result underscores NVIDIA's continued push for performance leadership in AI infrastructure.
NVIDIA's latest innovation, the NVIDIA GB300 NVL72 rack-scale system powered by the NVIDIA Blackwell Ultra architecture, has set a new bar for AI inference performance in the MLPerf Inference v5.1 benchmark. The achievement is particularly notable because it came less than half a year after the system's debut at NVIDIA GTC, with the GB300 NVL72 delivering up to 1.4x more DeepSeek-R1 inference throughput than NVIDIA Blackwell-based GB200 NVL72 systems.
The success of the NVIDIA GB300 NVL72 rack-scale system can be attributed to several key factors. First, the Blackwell Ultra architecture features 1.5x more NVFP4 AI compute and 2x more attention-layer acceleration than the original Blackwell architecture. In addition, each GPU in the system carries up to 288GB of HBM3e memory, a significant boost in capacity for serving large models.
Another critical contributor to the system's performance is its hardware acceleration for the NVFP4 data format. This NVIDIA-designed 4-bit floating point format delivers better accuracy than other FP4 formats, and accuracy comparable to higher-precision formats, at a fraction of the memory and compute cost. The open-source NVIDIA TensorRT-LLM library enabled the submissions to achieve this higher performance while still meeting the benchmark's strict accuracy requirements.
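To make the format concrete, below is a minimal NumPy sketch of block-scaled 4-bit quantization in the spirit of NVFP4, which NVIDIA describes as an E2M1 element format with a shared scale per small block of values. The 16-element block size, the full-precision scale, and the brute-force round-to-nearest search are simplifying assumptions for illustration; the production path in TensorRT-LLM uses hardware-accelerated kernels and lower-precision block scales.

```python
import numpy as np

# Magnitudes representable by an E2M1 4-bit float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK_SIZE = 16  # assumed per-block granularity for the shared scale

def quantize_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Map one block to E2M1 codes plus a shared scale (kept in FP32 here for simplicity)."""
    amax = float(np.abs(block).max())
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0  # largest input -> largest E2M1 value
    scaled = block / scale
    # Round each element to the nearest representable E2M1 magnitude, keeping the sign.
    nearest = np.abs(np.abs(scaled)[:, None] - E2M1_VALUES[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_VALUES[nearest], scale

def dequantize_block(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes * scale

weights = np.random.randn(BLOCK_SIZE).astype(np.float32)
codes, scale = quantize_block(weights)
error = np.abs(weights - dequantize_block(codes, scale)).max()
print(f"max reconstruction error in one block: {error:.4f}")
```

The per-block scale is what lets a 4-bit format hold accuracy: outliers in one block do not force the quantization step size of every other block.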
A technique called disaggregated serving splits the context (prefill) and generation (decode) phases of inference so each can be optimized independently for best overall throughput. This technique was instrumental in the record-setting result on the Llama 3.1 405B Interactive benchmark: GB200 NVL72 systems delivered nearly 50% more performance per GPU than Blackwell GPUs in an NVIDIA DGX B200 server running the benchmark with traditional, colocated serving.
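The sketch below illustrates the control flow only, not TensorRT-LLM's actual API: requests pass through a compute-bound context (prefill) pool, hand their attention state to a queue, and finish on a bandwidth-bound generation (decode) pool. The class names and the string stand-in for the KV cache are hypothetical; in a real deployment the cache moves between GPU pools over the NVLink fabric.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: str | None = None      # filled in by the context phase
    generated: list = field(default_factory=list)

class ContextWorker:
    """Compute-bound prefill: processes the whole prompt once and emits a KV cache."""
    def prefill(self, req: Request) -> Request:
        req.kv_cache = f"kv({req.prompt})"  # stand-in for real attention state
        return req

class GenerationWorker:
    """Bandwidth-bound decode: extends the sequence one token at a time."""
    def decode(self, req: Request) -> Request:
        assert req.kv_cache is not None, "decode requires a prefilled KV cache"
        for step in range(req.max_new_tokens):
            req.generated.append(f"tok{step}")
        return req

# Disaggregated serving: the two pools are sized and batched independently,
# so long prefills never stall token generation for other requests.
context_pool = [ContextWorker()]
generation_pool = [GenerationWorker()]
handoff = Queue()

req = Request(prompt="Explain MLPerf.", max_new_tokens=4)
handoff.put(context_pool[0].prefill(req))        # phase 1: context GPUs
done = generation_pool[0].decode(handoff.get())  # phase 2: generation GPUs
print(done.generated)
```

Because the two phases stress different resources, sizing the pools separately lets both run near peak utilization instead of compromising on one shared batch configuration.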
NVIDIA's commitment to innovation and excellence is also reflected in its partnerships with cloud service providers and server makers. These partners, including Azure, Broadcom, Cisco, CoreWeave, Dell Technologies, Giga Computing, HPE, Lambda, Lenovo, Nebius, Oracle, Quanta Cloud Technology, Supermicro, and the University of Florida, submitted impressive results using the NVIDIA Blackwell and/or Hopper platforms.
The market-leading inference performance of the NVIDIA AI platform is now available from major cloud providers and server makers. For organizations deploying sophisticated AI applications, this translates to lower total cost of ownership (TCO) and better return on investment: higher throughput per GPU means fewer racks, and fewer dollars, per token served.
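A quick back-of-the-envelope calculation shows how a throughput gain becomes a TCO gain. All inputs below are placeholder assumptions, not published NVIDIA or cloud pricing; the only figure taken from the article is the 1.4x throughput uplift.

```python
def cost_per_million_tokens(rack_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost per million tokens for a rack at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return rack_cost_per_hour / (tokens_per_hour / 1e6)

RACK_COST = 300.0        # hypothetical $/hour for a full rack
BASELINE_TPS = 100_000   # hypothetical rack-level tokens/second, prior generation

old = cost_per_million_tokens(RACK_COST, BASELINE_TPS)
new = cost_per_million_tokens(RACK_COST, BASELINE_TPS * 1.4)  # 1.4x uplift from the article
print(f"${old:.3f} -> ${new:.3f} per million tokens ({1 - new / old:.0%} lower)")
```

At equal rack cost, a 1.4x throughput uplift alone lowers cost per token by roughly 29%; a real TCO comparison would also fold in power, rack pricing, and utilization.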
Learn more about these full-stack technologies by reading the NVIDIA Technical Blog on MLPerf Inference v5.1. Plus, visit the NVIDIA DGX Cloud Performance Explorer to explore NVIDIA performance and model TCO, and to generate custom reports.
In conclusion, the introduction of the NVIDIA GB300 NVL72 rack-scale system powered by the NVIDIA Blackwell Ultra architecture marks a significant milestone in the pursuit of AI inference performance. By combining cutting-edge hardware, optimized software, and strategic partnerships, NVIDIA has created a platform that sets a new standard for AI applications. As organizations continue to push the boundaries of what is possible with AI, this innovation will undoubtedly play a crucial role in shaping their success.
Related Information:
https://www.digitaleventhorizon.com/articles/NVIDIA-Blackwell-Ultra-Sets-a-New-Standard-for-AI-Inference-Performance-Achieving-Unprecedented-Throughput-with-Rack-Scale-Systems-deh.shtml
https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/
Published: Tue Sep 9 14:39:27 2025 by llama3.2 3B Q4_K_M