
Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.
But with opportunities often come challenges.
Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.
To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.
Relentless Improvements
Performance improvements let our customers and partners serve more complex models and reduce the needed infrastructure to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.
We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.
NVIDIA has increased performance on the Llama 70B model by 3.5x.
In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.
This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine, and with the advanced quantization techniques that are part of TensorRT Model Optimizer, it met the strict accuracy targets of the MLPerf benchmark.
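As a rough illustration of the memory-footprint point, the sketch below estimates weight storage for a 70-billion-parameter model at 16, 8, and 4 bits per parameter. It is back-of-the-envelope arithmetic only: it assumes exactly 70e9 parameters and ignores quantization scale factors, the KV cache, and activations. The function name `weight_footprint_gb` is hypothetical and is not part of any NVIDIA library.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    extra space for scale factors or other quantization metadata.
    """
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # assumed parameter count for a "70B" model
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
        print(f"{name}: {weight_footprint_gb(params, bits):.0f} GB")
```

Under these assumptions, halving the bit width halves the bytes that must be stored and moved per token, which is where the reduction in memory traffic comes from.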
Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.
Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.
These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.
Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT deep learning inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.
Improving Llama in Leaps and Bounds
More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2, as well as the 70B model size and the largest, 405B. These optimizations include custom quantization recipes and parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.
Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
Many LLM deployments use parallelism rather than keeping the workload on a single GPU, where compute can become a bottleneck. Deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.
For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
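The difference between the two techniques can be sketched in a few lines of plain Python. This is a conceptual toy, not how TensorRT-LLM implements either technique: the "GPUs" are simulated sequentially on one process, each layer is a bare matrix multiply, and all function names are hypothetical. Tensor parallelism splits a single layer's weight matrix across devices so that they work on the same token together, while pipeline parallelism assigns whole layers to different devices.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # shape: d_in rows by d_out columns


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a row vector x and a d_in x d_out matrix w."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def split_columns(w: Matrix, n_shards: int) -> List[Matrix]:
    """Split w column-wise into n_shards equal pieces, one per simulated GPU."""
    d_out = len(w[0])
    if d_out % n_shards != 0:
        raise ValueError("output dimension must divide evenly across shards")
    step = d_out // n_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(n_shards)]


def tensor_parallel_layer(x: Vector, w: Matrix, n_shards: int) -> Vector:
    """One layer, split across shards: each shard computes part of the output."""
    # On real hardware the partial products run concurrently on separate GPUs.
    partials = [matvec(x, shard) for shard in split_columns(w, n_shards)]
    # Gathering the pieces is the step that needs GPU-to-GPU communication.
    return [value for part in partials for value in part]


def pipeline_parallel_forward(x: Vector, layers: List[Matrix], n_stages: int) -> Vector:
    """Whole layers are grouped into stages; activations pass stage to stage."""
    if len(layers) % n_stages != 0:
        raise ValueError("layer count must divide evenly across stages")
    per_stage = len(layers) // n_stages
    stages = [layers[s * per_stage:(s + 1) * per_stage] for s in range(n_stages)]
    for stage in stages:      # each stage would live on its own GPU
        for w in stage:
            x = matvec(x, w)  # only the activation x crosses stage boundaries
    return x
```

Both functions return the same result as the unsplit computation. What differs on real hardware is where the work and the communication happen: every tensor-parallel layer needs a gather across devices, while a pipeline only hands activations from one stage to the next.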
The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.
For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as TensorRT-LLM does.
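One way to picture the selection problem is as a search over combined layouts. The sketch below is an assumption-laden illustration, not a TensorRT-LLM API: it enumerates every (tensor-parallel, pipeline-parallel) split of a fixed GPU count, then picks the highest-throughput layout whose latency fits a budget. All names are hypothetical, and the latency and throughput figures must come from the caller's own benchmarks.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Layout:
    tensor_parallel: int    # GPUs cooperating on each layer
    pipeline_parallel: int  # number of pipeline stages


@dataclass(frozen=True)
class Measurement:
    layout: Layout
    latency_ms: float               # caller-supplied benchmark result
    throughput_tokens_per_s: float  # caller-supplied benchmark result


def layouts_for(num_gpus: int) -> List[Layout]:
    """All splits where tensor_parallel * pipeline_parallel == num_gpus."""
    return [
        Layout(tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


def best_within_budget(
    measurements: Iterable[Measurement], latency_budget_ms: float
) -> Optional[Measurement]:
    """Highest-throughput measurement that meets the budget, or None."""
    eligible = [m for m in measurements if m.latency_ms <= latency_budget_ms]
    return max(eligible, key=lambda m: m.throughput_tokens_per_s, default=None)
```

For an eight-GPU server, `layouts_for(8)` yields the splits 1x8, 2x4, 4x2, and 8x1. Each would be benchmarked under the target workload before calling `best_within_budget`.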
Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.
Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.
The Virtuous Cycle
Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th