What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference, v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine and, with advanced quantization techniques that are part of TensorRT Model Optimizer, met the strict accuracy targets of the MLPerf benchmark.
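To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch of weight storage at different precisions. It assumes a nominal 70 billion parameters and counts weights only; quantization scale factors, the KV cache and activations are ignored, so real deployments will differ. The helper names are illustrative and not part of any NVIDIA library.

```python
# Bits used to store one parameter at each precision.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


def footprint_table(num_params: float) -> dict:
    """Map each precision name to its approximate weight footprint in GB."""
    return {
        name: weight_memory_gb(num_params, bits)
        for name, bits in BITS_PER_PARAM.items()
    }


# Assumption: a nominal 70e9-parameter model, weights only.
for name, gb in footprint_table(70e9).items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```

Halving the bits per parameter halves both the bytes that must be stored and the bytes that must be read from memory for each generated token, which is where the reduction in memory traffic comes from.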

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library to accelerate LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
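As a rough illustration of why interconnect bandwidth matters, the sketch below estimates a lower bound on the time to move a payload between GPUs. It treats the quoted 900GB/s as the usable rate and ignores latency and protocol overhead, so it is optimistic. The 64 MB payload and the 64 GB/s comparison link are hypothetical values chosen only for illustration.

```python
NVLINK_GB_PER_S = 900.0  # GPU-to-GPU bandwidth figure quoted in the article


def min_transfer_ms(payload_bytes: float,
                    bandwidth_gb_per_s: float = NVLINK_GB_PER_S) -> float:
    """Lower bound on transfer time in milliseconds (no latency, no overhead)."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


payload = 64e6  # hypothetical 64 MB of activations exchanged between GPUs
print(f"at 900 GB/s: {min_transfer_ms(payload):.3f} ms")
print(f"at 64 GB/s (hypothetical slower link): {min_transfer_ms(payload, 64.0):.3f} ms")
```

Because tensor parallelism exchanges data between GPUs at every layer, any per-transfer cost is paid many times per generated token.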

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. LLM deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
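The difference between the two techniques can be sketched in plain Python, with lists standing in for tensors and loop iterations standing in for GPUs. This is a toy model under simplifying assumptions (dimensions divide evenly, no real concurrency), not how TensorRT-LLM implements either technique; all function names are illustrative.

```python
def matmul(x, w):
    """Multiply row vector x (length k) by matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated GPU."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_layer(x, w, parts):
    """Tensor parallelism: every simulated GPU works on the SAME layer,
    computing one slice of its output; the slices are then gathered."""
    out = []
    for shard in split_columns(w, parts):
        out.extend(matmul(x, shard))
    return out


def pipeline_parallel(x, layers, stages):
    """Pipeline parallelism: consecutive layers are assigned to stages and
    activations are handed from one stage to the next."""
    per_stage = len(layers) // stages
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)
    return x


x = [1, 2]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
# Sharding the layer across two simulated GPUs gives the same result
# as computing it on one.
assert tensor_parallel_layer(x, w, parts=2) == matmul(x, w)
```

In the tensor-parallel case all GPUs cooperate on each token, which shortens the time per token. In the pipeline case each stage only holds part of the model, and on real hardware different requests occupy different stages at the same time, which is what raises aggregate throughput; the sequential loop above shows only the partitioning.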

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as TensorRT-LLM does.
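One way to picture combining the two techniques is to enumerate how a fixed number of GPUs can be factored into a tensor-parallel degree and a pipeline-parallel degree. The helper below is a hypothetical illustration, not a TensorRT-LLM API; which split is best for a given latency budget has to be measured.

```python
def parallel_configs(num_gpus: int):
    """All (tensor_parallel, pipeline_parallel) pairs whose product is num_gpus."""
    return [(tp, num_gpus // tp)
            for tp in range(1, num_gpus + 1)
            if num_gpus % tp == 0]


# Candidate layouts for one eight-GPU system.
for tp, pp in parallel_configs(8):
    print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

The two ends of this list correspond to the pure strategies described above: (8, 1) is tensor parallelism only and (1, 8) is pipeline parallelism only, with the mixed layouts in between.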

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...