
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library that delivers state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

[Chart: NVIDIA has increased performance on the Llama 70B model by 3.5x.]

In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine and, with the advanced quantization techniques that are part of TensorRT Model Optimizer, it met the strict accuracy targets of the MLPerf benchmark.
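To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch in Python. It only counts the bytes needed to store model weights at a given bit width; it assumes a round 70 billion parameters and ignores quantization scale factors, activations and the KV cache, so treat the numbers as an illustration rather than a measured result. The function name `weight_footprint_gb` is hypothetical and not part of any NVIDIA library.

```python
# Illustrative arithmetic only: weight storage versus numeric precision.
# Assumptions (not from the article): exactly 70e9 parameters, 1 GB = 1e9 bytes,
# and no overhead for quantization scales, activations, or KV cache.

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate gigabytes needed to store the weights alone."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name}: {weight_footprint_gb(70e9, bits):.0f} GB")
```

Under these assumptions, halving the bit width halves the bytes that must be stored and read from memory for every generated token, which is the mechanism behind the reduced memory traffic described above.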

[Chart: Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.]

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

[Chart: These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.]

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library to accelerate LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. These deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
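The difference between the two techniques can be sketched in a few lines of plain Python. This is a toy illustration, not how TensorRT-LLM implements either technique: the "GPUs" are just list slices, all helper names are hypothetical, and real systems overlap communication and computation across devices. Tensor parallelism splits the weight matrix of a single layer so several devices work on the same token at once, while pipeline parallelism assigns whole layers to successive stages.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix multiply, standing in for the work one device does."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Slice a weight matrix into equal column shards, one per (hypothetical) GPU."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_layer(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard computes a slice of the output; slices are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel(x: Matrix, layers: List[Matrix], stages: int) -> Matrix:
    """Consecutive layers are grouped into stages; activations flow stage to stage."""
    if len(layers) % stages != 0:
        raise ValueError("layer count must be divisible by the number of stages")
    per_stage = len(layers) // stages
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)
    return x


if __name__ == "__main__":
    x = [[1.0, 2.0]]
    w1 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    print(tensor_parallel_layer(x, w1, parts=2))
    print(pipeline_parallel(x, [w1, w2], stages=2))
```

Both functions return the same values a single device would compute. What changes is where the work happens: in the tensor-parallel case every shard contributes to each token, which is why it helps latency, whereas in the pipeline case different stages can work on different requests at the same time, which is why it helps aggregate throughput.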

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to be able to combine both techniques effectively, as TensorRT-LLM does.
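As a small illustration of what combining the two techniques means, the sketch below enumerates the ways a fixed number of GPUs can be divided between a tensor-parallel degree and a pipeline-parallel degree. The helper `parallel_layouts` is hypothetical and does not reflect TensorRT-LLM's actual configuration interface; a real deployment would also have to check that the model's layer and attention-head counts divide evenly and then benchmark each candidate.

```python
from typing import List, Tuple


def parallel_layouts(num_gpus: int) -> List[Tuple[int, int]]:
    """Return every (tensor_parallel, pipeline_parallel) pair whose product is num_gpus."""
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


if __name__ == "__main__":
    # Candidate layouts for an eight-GPU system such as the HGX H200 mentioned above.
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

The layouts at either end are the pure pipeline-parallel and pure tensor-parallel cases; the ones in between are the mixed configurations a latency-constrained deployment might evaluate.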

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

[Chart: Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.]

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...