
Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.
But with opportunities often come challenges.
Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.
To keep raising performance and improve the return on infrastructure investments, NVIDIA regularly optimizes state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.
Relentless Improvements
Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library that delivers state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.
We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.
NVIDIA has increased performance on the Llama 70B model by 3.5x.
In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.
This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine and, with advanced quantization techniques that are part of TensorRT Model Optimizer, met the strict accuracy targets of the MLPerf benchmark.
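To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch. It assumes every weight is stored at the stated bit width and counts weights only; the function name `weight_memory_gb` is hypothetical, and real deployments also need memory for the KV cache, activations and quantization scale factors.

```python
# Approximate weight-only memory for a model at different precisions.
# Assumption: every parameter is stored at the stated width. KV cache,
# activations and quantization scale factors are deliberately ignored.

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model, such as Llama 2 70B
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, the weights of a 70B-parameter model take about 140 GB at FP16, 70 GB at FP8 and 35 GB at FP4, which is why halving the precision roughly halves both the storage and the bytes moved per token.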
Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.
Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.
These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.
Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.
Improving Llama in Leaps and Bounds
More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2, as well as the 70B model size and the largest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.
Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
Many LLM deployments spread the workload across multiple GPUs rather than keeping it on a single GPU, where compute can become a bottleneck. These deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.
For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
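The difference between the two techniques can be illustrated with a toy sketch in plain Python. This is not how TensorRT-LLM implements either technique: the "devices" are just list entries, no real GPUs or communication are involved, and all function names are hypothetical. It only shows where the split happens: tensor parallelism divides a single layer's weights across devices, while pipeline parallelism assigns whole layers to successive stages.

```python
# Toy illustration of tensor vs. pipeline parallelism using plain Python lists.
# Assumption: shards and stages run sequentially here; on real hardware they
# would run on separate GPUs and exchange results over the interconnect.


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Tensor parallelism: give each shard a contiguous slice of w's columns."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    step = cols // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def tensor_parallel_matvec(x, w, num_shards):
    """Every shard sees the same input and computes part of the output."""
    out = []
    for shard in split_columns(w, num_shards):
        out.extend(matvec(x, shard))  # partial outputs are concatenated
    return out


def pipeline_parallel_forward(x, layers, num_stages):
    """Pipeline parallelism: each stage owns a contiguous group of whole layers."""
    assert len(layers) % num_stages == 0, "layers must divide evenly across stages"
    per_stage = len(layers) // num_stages
    for s in range(num_stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matvec(x, w)  # activations are then handed to the next stage
    return x


if __name__ == "__main__":
    x = [1, 2]
    w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]      # 2 x 4 layer
    w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]  # 4 x 2 layer
    # Splitting one layer across two shards matches the unsplit result.
    assert tensor_parallel_matvec(x, w1, 2) == matvec(x, w1)
    # Two pipeline stages, one layer each, match running both layers in order.
    assert pipeline_parallel_forward(x, [w1, w2], 2) == matvec(matvec(x, w1), w2)
```

In the tensor-parallel case all shards work on the same token at once, which is why it helps latency; in the pipeline case each stage can be busy with a different request, which is why it helps aggregate throughput.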
The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.
For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as TensorRT-LLM does.
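As a small illustration of what combining the two techniques means in practice, the sketch below lists the ways a fixed number of GPUs can be divided between tensor-parallel and pipeline-parallel groups. It assumes the simple rule that the two sizes multiply to the GPU count; the helper name `parallel_layouts` is hypothetical, and the code does not predict which layout performs best, which has to be measured against the latency budget.

```python
# Enumerate (tensor_parallel, pipeline_parallel) sizes for a fixed GPU count.
# Assumption: tensor_parallel * pipeline_parallel == num_gpus, with no other
# form of parallelism in play.


def parallel_layouts(num_gpus: int):
    """Return every (tp, pp) pair whose product equals num_gpus."""
    return [(tp, num_gpus // tp) for tp in range(1, num_gpus + 1) if num_gpus % tp == 0]


if __name__ == "__main__":
    # An eight-GPU server, such as the HGX H200 platform described above.
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

For eight GPUs this yields four candidates, from pure pipeline parallelism (1, 8) to pure tensor parallelism (8, 1), with the mixed layouts (2, 4) and (4, 2) in between.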
Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.
Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.
The Virtuous Cycle
Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th