What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library that delivers state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

[Figure: NVIDIA has increased performance on the Llama 70B model by 3.5x.]

In the most recent round of MLPerf Inference, v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission took advantage of Blackwell's second-generation Transformer Engine and, with advanced quantization techniques that are part of TensorRT Model Optimizer, met the strict accuracy targets of the MLPerf benchmark.
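To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch in Python. It counts weight storage only and assumes every parameter is stored at the stated width; it ignores the KV cache, activations, and the scale factors that real quantization schemes carry, so the numbers are illustrative rather than measured. The function name and the 70-billion parameter count are assumptions for illustration.

```python
# Approximate weight-only memory for a model at different precisions.
# Assumption: every parameter is stored at the given bit width, with no
# extra storage for quantization scales, KV cache, or activations.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


NUM_PARAMS = 70e9  # a 70B-parameter model, used here only as an example

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB of weights")
```

Under these assumptions, halving the bit width halves both the bytes that must be held in GPU memory and the bytes that must be read for each pass over the weights.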

[Figure: Blackwell B200 delivers up to 4x more performance versus the previous generation on MLPerf Inference v4.1's Llama 2 70B workload.]

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

[Figure: These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.]

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2, at the 70B size and at the largest size, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
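As a rough illustration of why interconnect bandwidth matters, the sketch below computes a lower bound on the time needed to move a given amount of data between GPUs. It treats the quoted 900GB/s figure as the usable rate and ignores latency, protocol overhead, and contention, all of which make real transfers slower. The function name and the example sizes are assumptions chosen for illustration.

```python
# Idealized transfer-time estimate over a GPU-to-GPU link.
# Assumption: the full quoted bandwidth is usable and there is no
# per-message latency, so these are lower bounds, not measurements.
NVLINK_GB_PER_S = 900  # per-GPU figure quoted in the text


def ideal_transfer_ms(num_bytes: float, gb_per_s: float = NVLINK_GB_PER_S) -> float:
    """Return the minimum transfer time in milliseconds for num_bytes."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


for size_gb in (0.1, 1, 10):
    ms = ideal_transfer_ms(size_gb * 1e9)
    print(f"{size_gb:>5} GB -> at least {ms:.3f} ms")
```

Because parallel inference exchanges data between GPUs repeatedly while generating each token, even small per-transfer costs add up, which is why a slower fabric can become the bottleneck.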

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. These deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
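The difference between the two techniques can be sketched in a few lines of plain Python. This is a toy simulation, not TensorRT-LLM code: the "devices" are simulated sequentially in one process, the layers are bare matrix multiplies, and every function name is hypothetical. Tensor parallelism splits a single layer's weight matrix across devices so that they cooperate on the same token, while pipeline parallelism assigns whole groups of layers to each device.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into `parts` column shards of equal width."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across devices"
    width = n // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, num_devices: int) -> Matrix:
    """Each simulated device multiplies by its own column shard of w."""
    shards = split_columns(w, num_devices)
    # On real hardware these partial products would run concurrently.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: concatenate each device's output columns, row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x: Matrix, layers: List[Matrix], num_stages: int) -> Matrix:
    """Contiguous groups of layers form stages; activations pass stage to stage."""
    per_stage = len(layers) // num_stages
    assert per_stage * num_stages == len(layers), "layers must divide evenly"
    stages = [layers[i * per_stage:(i + 1) * per_stage] for i in range(num_stages)]
    for stage in stages:  # each stage would live on a different device
        for w in stage:
            x = matmul(x, w)
    return x


x = [[1, 2]]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(tensor_parallel_matmul(x, w, num_devices=2))

layers = [[[1, 1], [0, 1]], [[2, 0], [0, 2]]]
print(pipeline_parallel_forward(x, layers, num_stages=2))
```

Both functions return the same result as an unsplit computation. What changes is where the work happens: in the tensor-parallel case every device contributes to each token, which shortens per-token latency at the cost of a gather step per layer, whereas in the pipeline case each device only hands activations to the next stage, which keeps communication light and lets different requests occupy different stages at once.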

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as in TensorRT-LLM.

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

[Table: Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.]

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...