
Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.
But with opportunities often come challenges.
Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.
To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.
Relentless Improvements
Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.
We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.
NVIDIA has increased performance on the Llama 70B model by 3.5x.
In the most recent round of MLPerf Inference, v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.
This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine. Combined with advanced quantization techniques that are part of TensorRT Model Optimizer, it met the strict accuracy targets of the MLPerf benchmark.
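To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch of how much storage the weights of a 70B-parameter model need at different precisions. It is an illustration only: the parameter count is the nominal "70B" figure, and the calculation assumes weights alone, ignoring the KV cache, activations and the scale factors that quantized formats carry alongside the weights.

```python
# Back-of-the-envelope weight footprint for a 70B-parameter model.
# Assumption: weights only; KV cache, activations and the scale factors
# stored by quantized formats are ignored.

PARAMS = 70e9  # nominal parameter count for a "70B" model

BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_gigabytes(num_params: float, bits: int) -> float:
    """Return the storage needed for the weights, in gigabytes (1e9 bytes)."""
    return num_params * bits / 8 / 1e9


footprint = {
    name: weight_gigabytes(PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gigabytes in footprint.items():
    print(f"{name}: {gigabytes:.0f} GB")
```

Under these assumptions, each halving of the bit width halves the bytes that must be stored and read from memory for every generated token, which is where the reduction in memory traffic comes from.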
Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.
Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.
These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.
Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages much of TensorRT's deep learning optimizations with additional LLM-specific improvements.
Improving Llama in Leaps and Bounds
More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.
Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
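As a rough illustration of what these figures imply, the sketch below computes the aggregate bandwidth when all eight GPUs communicate at once, plus an idealized transfer time. It treats the quoted 900GB/s as the achieved rate, which real transfers will not reach, and the 2 GB payload is a made-up value chosen only for the example.

```python
# Idealized transfer-time estimate over NVLink.
# Assumptions: the quoted 900 GB/s is treated as the achieved rate, and the
# 2 GB payload is a hypothetical figure chosen only for illustration.

NVLINK_GB_PER_S = 900.0  # per-GPU NVLink bandwidth quoted in the article
GPUS_PER_HGX = 8         # GPUs in one HGX H200 platform


def transfer_milliseconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move payload_gb at a constant rate, ignoring latency and protocol overhead."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


aggregate_tb_per_s = NVLINK_GB_PER_S * GPUS_PER_HGX / 1000.0
example_ms = transfer_milliseconds(2.0, NVLINK_GB_PER_S)

print(f"Aggregate bandwidth with all 8 GPUs communicating: {aggregate_tb_per_s:.1f} TB/s")
print(f"Ideal time to move 2 GB between two GPUs: {example_ms:.2f} ms")
```

Since the transfer time is inversely proportional to bandwidth, a slower link between GPUs lengthens every exchange that parallel inference requires by the same factor.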
Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can create compute bottlenecks. These deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.
For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
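The difference between the two techniques can be sketched with a toy example in plain Python. This is not TensorRT-LLM code and uses no GPU API: the "devices" and "stages" are ordinary Python lists, and every function name here is hypothetical. Tensor parallelism splits a single layer's weight matrix so that several devices work on the same token at once, while pipeline parallelism assigns whole layers to successive stages.

```python
# Toy illustration of the two ways to split a model across devices.
# Assumption: "devices" are plain Python lists; no GPU or TensorRT-LLM API is used.

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain row-by-column matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> list[Matrix]:
    """Tensor parallelism: give each device a contiguous slice of the columns."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def tensor_parallel_layer(x: Matrix, w: Matrix, devices: int) -> Matrix:
    """Each device multiplies the same input by its column shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, devices)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_model(x: Matrix, layers: list[Matrix], stages: int) -> Matrix:
    """Pipeline parallelism: each stage owns consecutive whole layers and passes its output on."""
    per_stage = len(layers) // stages
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)  # conceptually executed on stage s
    return x


x = [[1.0, 2.0]]
w1 = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]        # 2x4 layer
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]    # 4x2 layer

reference = matmul(matmul(x, w1), w2)
tp_out = matmul(tensor_parallel_layer(x, w1, devices=2), w2)
pp_out = pipeline_parallel_model(x, [w1, w2], stages=2)

print(reference, tp_out, pp_out)
```

All three results match, which shows that both splits compute the same function. The toy only shows how the work is partitioned. It runs sequentially, so it does not model the concurrency that gives each technique its benefit: shards computing simultaneously in tensor parallelism, and different requests occupying different stages at the same time in pipeline parallelism.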
The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.
For production deployments that seek to maximize throughput within a given latency budget, a platform needs to be able to combine both techniques effectively, as TensorRT-LLM does.
Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.
Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.
The Virtuous Cycle
Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th