What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and increase the return on infrastructure investments, NVIDIA regularly optimizes state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library that delivers state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

[Chart: NVIDIA has increased performance on the Llama 70B model by 3.5x.]

In the most recent round of MLPerf Inference, v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The process takes advantage of Blackwell's second-generation Transformer Engine, and with advanced quantization techniques that are part of TensorRT Model Optimizer, the Blackwell submission met the strict accuracy targets of the MLPerf benchmark.
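To make the memory-footprint point concrete, the following is a minimal back-of-the-envelope sketch, not a description of how TensorRT Model Optimizer works. It assumes a 70-billion-parameter model and counts only the bits needed to store the weights; quantization scale factors, the KV cache and activations are deliberately ignored. The function name `weight_footprint_gb` is hypothetical.

```python
# Approximate weight storage for a model at different numeric precisions.
# Assumption: every parameter is stored at the stated width, with no
# extra bytes for quantization scales, KV cache, or activations.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model, as in the Llama 70B examples
    for precision in BITS_PER_PARAM:
        size = weight_footprint_gb(params, precision)
        print(f"{precision}: {size:.0f} GB")
```

Under these assumptions, halving the bit width halves the bytes that must be stored and read from memory for each generated token, which is the intuition behind the reduced footprint and traffic described above.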

[Chart: Blackwell B200 delivers up to 4x more performance than the previous generation on MLPerf Inference v4.1's Llama 2 70B workload.]

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

[Chart: These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.]

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2, at the 70B model size and at the largest size, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
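As a rough illustration of why interconnect bandwidth matters, the sketch below estimates the idealized time to move a payload between two GPUs. It uses the 900GB/s figure quoted above; the 1 GB payload and the 100 GB/s comparison link are hypothetical values chosen only for the arithmetic, and the helper `transfer_time_ms` is not part of any NVIDIA library. Link latency and protocol overhead are ignored.

```python
def transfer_time_ms(payload_bytes: float, bandwidth_gb_per_s: float = 900.0) -> float:
    """Idealized transfer time in milliseconds: bytes divided by bandwidth.

    Assumes the link sustains its full bandwidth and adds no latency.
    """
    bytes_per_second = bandwidth_gb_per_s * 1e9
    return payload_bytes / bytes_per_second * 1e3


if __name__ == "__main__":
    payload = 1e9  # hypothetical 1 GB of data exchanged between GPUs
    print(f"900 GB/s link: {transfer_time_ms(payload):.2f} ms")
    # Hypothetical slower link, for comparison only.
    print(f"100 GB/s link: {transfer_time_ms(payload, 100.0):.2f} ms")
```

Because parallelized inference exchanges data between GPUs repeatedly while generating each token, differences of this kind accumulate quickly across a request.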

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, where compute can become a bottleneck. These deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
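The difference between the two techniques can be sketched with a toy model in plain Python. This is an illustration under simplifying assumptions, not TensorRT-LLM's API or implementation: each "GPU" is just a loop iteration, nothing runs concurrently, and every function name is hypothetical. Tensor parallelism splits the weights of a single layer across devices, so all devices cooperate on every token; pipeline parallelism assigns whole layers to successive stages, so each device owns one slice of the model's depth.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a row vector x and a matrix w (rows in, columns out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into equal-width column shards, one per simulated GPU."""
    cols = len(w[0])
    assert cols % parts == 0, "columns must divide evenly across GPUs"
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def tensor_parallel_layer(x: Vector, w: Matrix, parts: int) -> Vector:
    """One layer computed by `parts` simulated GPUs, each holding a column shard."""
    out: Vector = []
    for shard in split_columns(w, parts):  # each iteration stands in for one GPU
        out.extend(matvec(x, shard))       # gathering the partial outputs
    return out


def split_stages(layers: List[Matrix], stages: int) -> List[List[Matrix]]:
    """Assign contiguous groups of layers to pipeline stages."""
    assert len(layers) % stages == 0, "layers must divide evenly across stages"
    per_stage = len(layers) // stages
    return [layers[s * per_stage:(s + 1) * per_stage] for s in range(stages)]


def pipeline_parallel_forward(x: Vector, layers: List[Matrix], stages: int) -> Vector:
    """Forward pass where each simulated GPU runs only its own group of layers."""
    for stage in split_stages(layers, stages):  # hand-off between simulated GPUs
        for w in stage:
            x = matvec(x, w)
    return x


def reference_forward(x: Vector, layers: List[Matrix]) -> Vector:
    """Single-device forward pass used to check both parallel variants."""
    for w in layers:
        x = matvec(x, w)
    return x


if __name__ == "__main__":
    w = [[1.0, 2.0, 0.0, 1.0],
         [0.0, 1.0, 3.0, 0.0],
         [2.0, 0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 2.0]]
    layers = [w, w, w, w]
    x = [1.0, 0.0, 2.0, 1.0]
    print(tensor_parallel_layer(x, w, parts=2) == matvec(x, w))
    print(pipeline_parallel_forward(x, layers, stages=2) == reference_forward(x, layers))
```

Both variants reproduce the single-device result; what changes is how the work is divided. In a real system the tensor-parallel shards run at the same time and must exchange results within each layer, which is why that approach leans heavily on GPU-to-GPU bandwidth, while pipeline stages communicate only at stage boundaries.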

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform must be able to combine both techniques effectively, as TensorRT-LLM does.

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

[Chart: Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.]

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...