
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library that delivers state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference, v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine and, with advanced quantization techniques that are part of TensorRT Model Optimizer, met the strict accuracy targets of the MLPerf benchmark.
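To make the memory-footprint point concrete, here is a minimal back-of-the-envelope sketch. It assumes a model with exactly 70 billion parameters and counts only weight storage; the KV cache, activations, and any per-block scale factors used by real quantization formats are ignored. The function name and constants are illustrative and are not part of any NVIDIA library.

```python
# Rough weight-only memory estimate for a 70B-parameter model.
# Assumption: storage = parameters * bits / 8; KV cache, activations and
# quantization scale factors are deliberately left out.

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}
NUM_PARAMS = 70_000_000_000  # assumed parameter count for illustration


def weight_gigabytes(num_params: int, bits: int) -> float:
    """Return weight storage in GB (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


footprints = {
    name: weight_gigabytes(NUM_PARAMS, bits)
    for name, bits in BITS_PER_PARAM.items()
}

for name, gigabytes in footprints.items():
    print(f"{name}: {gigabytes:.0f} GB of weights")
```

Under these assumptions, each halving of the bit width halves the bytes that must be stored and read from memory per generated token, which is where the reduction in memory traffic comes from.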

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations for efficient inference on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as the 70B model size and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
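As a rough illustration of what these figures imply, the sketch below turns the quoted 900GB/s per-GPU number into an idealized transfer time and an aggregate figure for an eight-GPU system. It assumes the full quoted bandwidth is available to a single transfer and ignores link latency, protocol overhead, and collective-communication patterns, so real timings will be higher. The payload size and function name are hypothetical.

```python
# Idealized NVLink arithmetic using the figures quoted in the article.
# Assumption: a transfer sees the full quoted bandwidth with zero latency
# and zero protocol overhead, which real workloads will not achieve.

NVLINK_GB_PER_S = 900  # per-GPU GPU-to-GPU bandwidth quoted above
GPUS_PER_HGX = 8       # eight-GPU HGX H200 platform


def transfer_ms(payload_gb: float, bandwidth_gb_per_s: float = NVLINK_GB_PER_S) -> float:
    """Best-case time in milliseconds to move payload_gb at the given bandwidth."""
    return payload_gb / bandwidth_gb_per_s * 1000


# Aggregate bandwidth if all eight GPUs communicate at the quoted rate at once.
aggregate_gb_per_s = NVLINK_GB_PER_S * GPUS_PER_HGX

print(f"Aggregate: {aggregate_gb_per_s} GB/s across {GPUS_PER_HGX} GPUs")
print(f"Moving a hypothetical 0.9 GB payload: {transfer_ms(0.9):.2f} ms at best")
```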

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. These deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
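The difference between the two techniques can be sketched in a few lines of plain Python. This is a toy simulation on one CPU, not how TensorRT-LLM implements either technique: "GPUs" are just loop iterations, layers are small integer matrices, and all function names are illustrative. It assumes the column count divides evenly by the number of simulated GPUs and the layer count divides evenly by the number of stages.

```python
# Toy simulation of tensor vs. pipeline parallelism (no real GPUs involved).

def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split matrix w into `parts` column shards of equal width."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def tensor_parallel_layer(x, w, num_gpus):
    """Tensor parallelism: every simulated GPU works on the SAME layer,
    each computing one column shard; the partial outputs are concatenated."""
    out = []
    for shard in split_columns(w, num_gpus):
        out.extend(matvec(x, shard))
    return out


def pipeline_parallel(x, layers, num_stages):
    """Pipeline parallelism: each simulated GPU owns a contiguous block of
    DIFFERENT layers; activations are handed from one stage to the next."""
    per_stage = len(layers) // num_stages
    for stage in range(num_stages):
        for w in layers[stage * per_stage:(stage + 1) * per_stage]:
            x = matvec(x, w)
    return x


layers = [
    [[1, 2, 0, 1], [0, 1, 1, 0], [2, 0, 1, 1], [1, 1, 0, 2]],
    [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1]],
]
x = [1, 2, 3, 4]

# Single-device reference result.
reference = x
for w in layers:
    reference = matvec(reference, w)

# Tensor parallel: two simulated GPUs share every layer.
tp = x
for w in layers:
    tp = tensor_parallel_layer(tp, w, num_gpus=2)

# Pipeline parallel: two simulated GPUs, one layer each.
pp = pipeline_parallel(x, layers, num_stages=2)

print(reference, tp, pp)
```

Both schemes reproduce the single-device result. In the tensor-parallel case all GPUs cooperate on each token's layer, which shortens per-token latency but requires exchanging partial results at every layer. In the pipeline case each GPU only passes activations to the next stage, so communication is lighter and different requests can occupy different stages at once, which favors aggregate throughput.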

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform must be able to combine both techniques effectively, as TensorRT-LLM does.

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...