As the latest member of the NVIDIA Blackwell architecture family, the NVIDIA Blackwell Ultra GPU builds on the core innovations of the Blackwell architecture to accelerate training and AI reasoning. It fuses silicon innovations with new levels of system-level integration, delivering next-level performance, scalability, and efficiency for AI factories and the large-scale, real-time AI services they power.

With its energy-efficient dual-reticle design, high-bandwidth, large-capacity HBM3E memory subsystem, fifth-generation Tensor Cores, and breakthrough NVFP4 precision format, Blackwell Ultra raises the bar for accelerated computing. This in-depth look explains the architectural advances, why they matter, and how they translate into measurable gains for AI workloads.
Dual-reticle design: one GPU

Blackwell Ultra is composed of two reticle-sized dies connected by NVIDIA High-Bandwidth Interface (NV-HBI), a custom, power-efficient die-to-die interconnect that provides 10 TB/s of bandwidth. Manufactured on the TSMC 4NP process, Blackwell Ultra packs 208 billion transistors, 2.6x more than the NVIDIA Hopper GPU, while functioning as a single, CUDA-programmed accelerator. This delivers a large increase in performance while preserving the familiar CUDA programming model that developers have relied on for nearly two decades.
Benefits
Unified compute domain: 160 Streaming Multiprocessors (SMs) across two dies, providing 640 fifth-generation Tensor Cores with 15 petaFLOPS of dense NVFP4 compute.
Full coherence: Shared L2 cache with fully coherent memory accesses.
Maximum silicon utilization: Peak performance per square millimeter.
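Because the two dies are exposed as one device, existing CUDA code sees the full 160-SM compute domain and the coherent memory system with no multi-GPU plumbing. As a minimal sketch (assuming a recent CUDA toolkit; the exact values reported depend on the driver and product configuration), a standard device-properties query shows what the runtime exposes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Blackwell Ultra enumerates as a single CUDA device,
    // even though it is built from two reticle-sized dies.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, /*device=*/0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device:             %s\n", prop.name);
    std::printf("SM count:           %d\n", prop.multiProcessorCount);      // 160 on the full GPU
    std::printf("Global memory:      %.0f GB\n", prop.totalGlobalMem / 1e9); // HBM3E capacity
    std::printf("L2 cache:           %.1f MB\n", prop.l2CacheSize / 1e6);    // shared across both dies
    std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

Compiled with nvcc and run on a Blackwell Ultra system, this reports one device for the whole accelerator; the die-to-die topology is handled below the CUDA programming model.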
Figure 1. NVIDIA Blackwell Ultra GPU chip explained: two reticle-sized dies linked by the 10 TB/s NV-HBI interface, each with a GigaThread Engine with MIG control, L2 cache, and eight GPCs totaling 640 fifth-generation Tensor Cores (15 PFLOPS dense NVFP4), plus PCIe Gen 6 (256 GB/s), NVLink 5 (1,800 GB/s to NVSwitch), NVLink-C2C (900 GB/s CPU-GPU), and 288 GB of HBM3E across eight stacks
Streaming multiprocessors: compute engines for the AI factory

As shown in Figure 1, the heart of Blackwell Ultra is its 160 Streaming Multiprocessors (SMs), organized into eight Graphics Processing Clusters (GPCs) in the full GPU implementation. Every SM, shown in Figure 2, is a self-contained compute engine housing:
128 CUDA Cores for FP32 and INT32 operations, as well as FP16/BF16 and other precisions.
4 fifth-generation Tensor Cores with the NVIDIA second-generation Transformer Engine, optimized for FP8, FP6, and NVFP4 (see the kernel sketch after Figure 2).
256 KB of Tensor Memory (TMEM) for warp-synchronous storage of intermediate results, enabling higher reuse and reduced off-chip memory traffic.
Special Function Units (SFUs) for transcendental math and special operations used in AI kernels.
Figure 2. Blackwell Ultra Streaming Multiprocessor (SM) architecture, showing CUDA cores, Tensor Cores, TMEM, shared memory, SFUs, texture units, and other SM blocks
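In practice, the fifth-generation Tensor Cores and TMEM are reached through libraries such as cuBLAS, CUTLASS, and Transformer Engine rather than hand-written kernels. For orientation only, the sketch below uses the long-standing CUDA wmma API to issue a single 16x16x16 half-precision Tensor Core multiply-accumulate from one warp; it is a generic Tensor Core example (valid on any architecture from Volta onward), not Blackwell Ultra-specific code, and it does not use the NVFP4 paths or TMEM.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B,
// with FP16 inputs and FP32 accumulation on the Tensor Cores.
__global__ void wmma_tile_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);              // start from a zeroed accumulator
    wmma::load_matrix_sync(a_frag, A, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp (for example, wmma_tile_16x16x16<<<1, 32>>>(dA, dB, dC)), this produces one 16x16 tile; production kernels tile the work across all SMs, stage operands through shared memory, and on Blackwell-class GPUs keep Tensor Core intermediates in TMEM for higher reuse.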










