
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library that delivers state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine and, with advanced quantization techniques that are part of TensorRT Model Optimizer, met the strict accuracy targets of the MLPerf benchmark.
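To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch. It assumes a nominal 70 billion parameters for a "70B" model and counts weights only; it ignores the KV cache, activations, and any scaling factors a quantized format may store alongside the weights. The helper name `weight_footprint_gb` is hypothetical.

```python
# Weight-only memory estimate for a dense model at different precisions.
# Assumptions: 70e9 parameters (nominal "70B"); 1 GB = 1e9 bytes;
# KV cache, activations, and quantization scale factors are ignored.

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(num_params: float, bits: int) -> float:
    """Return the storage needed for the weights, in gigabytes."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name}: {weight_footprint_gb(70e9, bits):.0f} GB")
```

Under these assumptions, halving the bits per parameter halves both the bytes that must be stored and the bytes that must be read from memory for each pass over the weights.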

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
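As a rough illustration of what these bandwidth figures imply, the sketch below computes an ideal lower bound on transfer time. It assumes the full 900GB/s is achieved and ignores latency, protocol overhead, and collective-communication patterns, so real transfers will take longer. The function names are hypothetical, and the aggregate figure is a simple sum of the per-GPU number quoted above.

```python
# Ideal (lower-bound) transfer time over a GPU-to-GPU link.
# Assumptions: bandwidth is fully achieved; 1 GB = 1e9 bytes;
# latency and protocol overhead are ignored.

NVLINK_GB_PER_S = 900.0  # per-GPU NVLink bandwidth quoted in the text
GPUS_PER_HGX = 8         # eight-GPU HGX H200 platform, per the text


def ideal_transfer_ms(payload_bytes: float,
                      bandwidth_gb_s: float = NVLINK_GB_PER_S) -> float:
    """Milliseconds needed to move payload_bytes at the given bandwidth."""
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e3


def aggregate_bandwidth_gb_s(gpus: int = GPUS_PER_HGX,
                             per_gpu_gb_s: float = NVLINK_GB_PER_S) -> float:
    """Sum of per-GPU bandwidth when all GPUs communicate at once."""
    return gpus * per_gpu_gb_s


if __name__ == "__main__":
    print(f"1 GB payload: {ideal_transfer_ms(1e9):.2f} ms")
    print(f"Aggregate: {aggregate_bandwidth_gb_s():.0f} GB/s")
```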

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. These deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
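The two techniques can be illustrated with a toy, CPU-only sketch. It is not how TensorRT-LLM implements them: the "GPUs" here are just list entries, the work runs sequentially, and every function name is hypothetical. Tensor parallelism splits one layer's weight matrix across devices so they cooperate on the same token; pipeline parallelism assigns whole layers to successive stages.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: (n x k) @ (k x m) -> (n x m)."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into equal column shards, one per simulated GPU."""
    width = len(w[0])
    assert width % parts == 0, "columns must divide evenly across shards"
    step = width // parts
    return [[row[i:i + step] for row in w] for i in range(0, width, step)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard computes a slice of the output; slices are then gathered."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Gather step: concatenate the per-shard output columns row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x: Matrix, layers: List[Matrix], stages: int) -> Matrix:
    """Consecutive layers are grouped into stages; data flows stage to stage."""
    assert len(layers) % stages == 0, "layers must divide evenly across stages"
    per_stage = len(layers) // stages
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)  # on real hardware, each stage lives on its own GPU
    return x


if __name__ == "__main__":
    x = [[1, 2]]
    w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]
    w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]
    print(tensor_parallel_matmul(x, w1, parts=2) == matmul(x, w1))
    print(pipeline_parallel_forward(x, [w1, w2], stages=2) == matmul(matmul(x, w1), w2))
```

In the tensor-parallel case all shards work on the same input at once, which is why it helps per-request latency; in the pipeline case different stages can hold different requests at the same time, which is why it helps aggregate throughput.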

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to be able to combine both techniques effectively, as TensorRT-LLM does.
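One way to picture the design space when combining the two techniques is to list the possible layouts for a fixed GPU count. The sketch below assumes the common convention that the tensor-parallel degree multiplied by the pipeline-parallel degree equals the number of GPUs, and it ignores other forms of parallelism. `parallel_layouts` is a hypothetical helper, not a TensorRT-LLM API.

```python
from typing import List, Tuple


def parallel_layouts(num_gpus: int) -> List[Tuple[int, int]]:
    """Return every (tensor_parallel, pipeline_parallel) pair whose product is num_gpus."""
    return [(tp, num_gpus // tp)
            for tp in range(1, num_gpus + 1)
            if num_gpus % tp == 0]


if __name__ == "__main__":
    # Assumption: an eight-GPU server, as in the HGX H200 platform described above.
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

Following the reasoning in the text, layouts with a higher tensor-parallel degree lean toward lower latency, while layouts with a higher pipeline-parallel degree lean toward higher aggregate throughput; which layout is best for a given latency budget has to be measured.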

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...