
Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.
But with opportunities often come challenges.
Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.
To deliver continuous performance gains and improve the return on infrastructure investments, NVIDIA regularly optimizes state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.
Relentless Improvements
Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.
We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.
NVIDIA has increased performance on the Llama 70B model by 3.5x.
In MLPerf Inference v4.1, the most recent round, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.
This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic and also boost computational throughput. The process takes advantage of Blackwell's second-generation Transformer Engine, and with advanced quantization techniques that are part of TensorRT Model Optimizer, the Blackwell submission met the strict accuracy targets of the MLPerf benchmark.
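To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions: it counts weight storage alone (no KV cache, activations or quantization scale factors), and the function name `weight_memory_gb` is ours, not part of any NVIDIA library.

```python
# Back-of-the-envelope weight memory for a dense model at several precisions.
# Assumption: only weights are counted; KV cache, activations and any
# quantization scale factors are ignored, so real footprints will be larger.

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    params_70b = 70e9  # a 70-billion-parameter model
    for precision in BITS_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(params_70b, precision):.0f} GB")
```

Under these assumptions, halving the bits per parameter halves the bytes that must be stored and read for every generated token, which is where the reduced memory traffic comes from.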
Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.
Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.
These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.
Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.
Improving Llama in Leaps and Bounds
More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2, at the 70B size and at the largest size, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.
Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. Deployments must balance low latency against high throughput, and the optimal parallelization technique depends on application requirements.
For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
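The difference between the two techniques can be sketched with a toy example in plain Python. This is only an illustration, not how TensorRT-LLM implements either technique: the "devices" are simulated one after another on lists, all function names are hypothetical, and row and layer counts are assumed to divide evenly.

```python
# Toy sketch: "devices" are simulated one after another on plain Python lists.
# Assumption: row and layer counts divide evenly across devices and stages.

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tensor_parallel_layer(weights, x, num_devices):
    """Split ONE layer's rows across devices, then gather the output slices."""
    rows_per_device = len(weights) // num_devices
    output = []
    for d in range(num_devices):
        shard = weights[d * rows_per_device:(d + 1) * rows_per_device]
        output.extend(matvec(shard, x))  # each device computes part of the layer
    return output


def pipeline_parallel_model(layers, x, num_stages):
    """Give each stage a contiguous group of WHOLE layers; pass activations on."""
    layers_per_stage = len(layers) // num_stages
    for s in range(num_stages):
        stage = layers[s * layers_per_stage:(s + 1) * layers_per_stage]
        for weights in stage:  # a stage only ever runs its own layers
            x = matvec(weights, x)
    return x


if __name__ == "__main__":
    layer_a = [[1, 0, 2, 0], [0, 1, 0, 2], [1, 1, 0, 0], [0, 0, 1, 1]]
    layer_b = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 2]]
    x = [1, 2, 3, 4]

    reference = matvec(layer_b, matvec(layer_a, x))
    print(tensor_parallel_layer(layer_a, x, num_devices=2) == matvec(layer_a, x))
    print(pipeline_parallel_model([layer_a, layer_b], x, num_stages=2) == reference)
```

Both functions reproduce the single-device result. On real hardware, tensor-parallel shards of one layer run at the same time, which shortens the time per token, while pipeline stages can each work on a different request at once, which raises aggregate throughput.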
The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.
For production deployments that seek to maximize throughput within a given latency budget, a platform needs to be able to combine both techniques effectively, as TensorRT-LLM does.
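As a small illustration of what combining the two techniques means in practice, the sketch below lists every way to factor a GPU count into a tensor-parallel degree and a pipeline-parallel degree. The function name `parallel_layouts` is hypothetical and is not a TensorRT-LLM API; choosing the best layout for a given latency budget would still require measurement.

```python
# Enumerate (tensor_parallel, pipeline_parallel) degrees for a GPU count.
# Assumption: every GPU is used, so the two degrees must multiply to num_gpus.

def parallel_layouts(num_gpus: int):
    """Return all (tp, pp) pairs with tp * pp == num_gpus."""
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


if __name__ == "__main__":
    # An eight-GPU system, such as the HGX H200 platform described above.
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

For eight GPUs this yields four candidate layouts, ranging from pure pipeline parallelism (1, 8) to pure tensor parallelism (8, 1), with the mixed layouts (2, 4) and (4, 2) in between.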
Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.
Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.
The Virtuous Cycle
Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th