
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the needed infrastructure to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The process takes advantage of Blackwell's second-generation Transformer Engine, and with advanced quantization techniques that are part of TensorRT Model Optimizer, the Blackwell submission met the strict accuracy targets of the MLPerf benchmark.
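To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch. It assumes a nominal 70-billion-parameter model and counts weight storage only; the KV cache, activations and the scale factors that quantized formats carry are ignored, so real deployments will differ. The function and variable names are illustrative and not part of any NVIDIA library.

```python
# Approximate weight-only memory footprint of a 70B-parameter model at
# different numeric precisions. Assumption: weights dominate and every
# parameter is stored at the stated width (no scale factors, no KV cache).

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}

NUM_PARAMS = 70_000_000_000  # nominal "70B" parameter count (assumed)


def weight_footprint_gb(num_params: int, bits: int) -> float:
    """Return weight storage in gigabytes, using 1 GB = 1e9 bytes."""
    return num_params * bits / 8 / 1e9


footprints = {
    name: weight_footprint_gb(NUM_PARAMS, bits)
    for name, bits in BITS_PER_PARAM.items()
}

for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
```

Under these assumptions, halving the bit width halves the bytes that must be stored and read for every generated token, which is where the reduction in memory footprint and memory traffic comes from.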

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages much of TensorRT's deep learning optimizations with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model across multiple GPUs more efficiently, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. These deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
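The difference between the two techniques can be illustrated with a toy sketch. It is not TensorRT-LLM code: the "GPUs" are simulated as slices of plain Python lists, the function names are hypothetical, and no real communication takes place. Tensor parallelism splits the work of a single layer across devices, so one token finishes sooner. Pipeline parallelism assigns whole layers to different devices, so several requests can be in flight at once.

```python
# Toy illustration on plain Python lists; each "GPU" is just a shard
# of the data. All names here are hypothetical and for illustration only.

def matvec(weights, x):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tensor_parallel_matvec(weights, x, num_gpus):
    """Split one layer's output rows across GPUs, then gather the pieces."""
    shard = (len(weights) + num_gpus - 1) // num_gpus  # rows per GPU
    partials = [
        matvec(weights[i * shard:(i + 1) * shard], x)  # work done on GPU i
        for i in range(num_gpus)
    ]
    # Concatenating the partial outputs stands in for an all-gather.
    return [y for part in partials for y in part]


def pipeline_parallel_forward(layers, x, num_stages):
    """Assign contiguous layers to stages; activations flow stage to stage."""
    per_stage = (len(layers) + num_stages - 1) // num_stages
    stages = [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]
    for stage in stages:  # each stage would live on its own GPU
        for weights in stage:
            x = matvec(weights, x)
    return x


layers = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
x = [1, 1]

tp_out = tensor_parallel_matvec(layers[0], x, num_gpus=2)
pp_out = pipeline_parallel_forward(layers, x, num_stages=2)
print(tp_out, pp_out)
```

In the tensor-parallel function every device must exchange results for every layer, which is why that technique leans on fast GPU-to-GPU links. In the pipeline function only the activations at stage boundaries move between devices.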

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as TensorRT-LLM does.

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...