What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

Caption: NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine, and with the advanced quantization techniques that are part of TensorRT Model Optimizer, it met the strict accuracy targets of the MLPerf benchmark.
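To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch. It assumes a dense 70B-parameter model and counts weight storage only; KV cache, activations, and quantization scale factors are ignored, so real footprints will be larger. The function name `weight_memory_gb` is illustrative and not part of any NVIDIA library.

```python
# Back-of-the-envelope weight memory for a dense model at several precisions.
# Assumption: weights only; KV cache, activations, and quantization scale
# factors are ignored, so real deployments need more memory than this.

BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, halving the bits per weight halves both the bytes that must be stored and the bytes that must be read from memory for each generated token.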

Caption: Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

Caption: These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library to accelerate LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. Deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
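The difference between the two techniques can be sketched with a toy example. This is a simplified illustration on plain Python lists, not how TensorRT-LLM implements either technique: "devices" and "stages" are simulated in a single process, and all function names are hypothetical.

```python
# Toy comparison of tensor and pipeline parallelism using plain Python lists.
# "Devices" and "stages" are simulated in one process; nothing runs on a GPU.


def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tensor_parallel_matvec(weights, x, num_devices):
    """Split the rows of ONE layer across devices and concatenate the results."""
    assert len(weights) % num_devices == 0, "rows must divide evenly"
    rows_per_device = len(weights) // num_devices
    shards = [
        weights[d * rows_per_device:(d + 1) * rows_per_device]
        for d in range(num_devices)
    ]
    # Each device computes its slice of the output for the same input;
    # concatenating stands in for the GPU-to-GPU gather step.
    partial_outputs = [matvec(shard, x) for shard in shards]
    return [value for part in partial_outputs for value in part]


def pipeline_parallel_forward(layers, x, num_stages):
    """Give each stage a contiguous group of WHOLE layers and run stages in order."""
    assert len(layers) % num_stages == 0, "layers must divide evenly"
    layers_per_stage = len(layers) // num_stages
    stages = [
        layers[s * layers_per_stage:(s + 1) * layers_per_stage]
        for s in range(num_stages)
    ]
    for stage in stages:  # activations are handed from one stage to the next
        for weights in stage:
            x = matvec(weights, x)
    return x


if __name__ == "__main__":
    layer_a = [[1, 0], [0, 2]]
    layer_b = [[0, 1], [1, 0]]
    x = [3, 4]
    print(tensor_parallel_matvec(layer_a, x, num_devices=2))
    print(pipeline_parallel_forward([layer_a, layer_b], x, num_stages=2))
```

In the tensor-parallel function every device works on the same token at once, which is why the technique helps latency but needs fast interconnect for the gather step. The pipeline function runs stages one after another for a single input; in a real deployment the stages would process different requests concurrently, which is where the throughput gain comes from and which this toy does not model.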

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to be able to combine both techniques effectively, as TensorRT-LLM does.
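As a rough sketch of what combining the two techniques means, the snippet below lists the ways a fixed number of GPUs can be divided between tensor-parallel and pipeline-parallel degrees. It assumes the two degrees must multiply to the GPU count; the function name is hypothetical and does not reflect TensorRT-LLM's actual configuration interface.

```python
# Enumerate (tensor_parallel, pipeline_parallel) splits of a GPU count.
# Assumption: tp * pp must equal the number of GPUs serving one model replica.


def parallel_layouts(num_gpus):
    """Return every (tp, pp) pair of positive integers with tp * pp == num_gpus."""
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


if __name__ == "__main__":
    # An eight-GPU system such as the HGX H200 platform described above.
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

Which layout is best depends on the latency budget and request load, so in practice each candidate would be benchmarked rather than chosen by rule.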

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Caption: Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...