
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the needed infrastructure to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf, Inference v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The process takes advantage of Blackwell's second-generation Transformer Engine, and with advanced quantization techniques that are part of TensorRT Model Optimizer, the Blackwell submission met the strict accuracy targets of the MLPerf benchmark.
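
To make the memory-footprint point concrete, here is a rough sketch of how weight storage scales with precision for a 70B-parameter model. It is a back-of-the-envelope estimate, not a description of the MLPerf submission: it assumes weights only and ignores the KV cache, activations, and any quantization scale factors. The function name `weight_footprint_gb` is made up for illustration.

```python
# Back-of-the-envelope weight footprint at different precisions.
# Assumption: weights only; KV cache, activations, and quantization
# scale factors are ignored.

BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(num_params: float, bits: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_footprint_gb(params, bits):.0f} GB")
```

Under these assumptions, halving the bits per weight halves both the bytes stored and the bytes that must be read from memory for each generated token.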

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library to accelerate LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages much of TensorRT's deep learning optimizations with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model across multiple GPUs more efficiently, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.
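
As a rough sense of scale for the 900GB/s figure, the sketch below computes an idealized lower bound on the time to move a payload between two GPUs. Treating the quoted peak as the achievable transfer rate is an assumption. Real transfers add latency and protocol overhead, so measured times will be higher. The helper name `transfer_time_ms` is hypothetical.

```python
# Idealized transfer-time estimate from the bandwidth quoted in the text.
# Assumption: the full 900 GB/s is available to a single transfer and there
# is no per-message latency or protocol overhead.

NVLINK_GB_PER_S = 900.0


def transfer_time_ms(payload_gb: float,
                     bandwidth_gb_per_s: float = NVLINK_GB_PER_S) -> float:
    """Return the lower-bound transfer time in milliseconds."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


if __name__ == "__main__":
    for payload in (0.1, 1.0, 10.0):
        print(f"{payload} GB -> at least {transfer_time_ms(payload):.2f} ms")
```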

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. These deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
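
The difference between the two techniques can be sketched with a toy example. This is not TensorRT-LLM code: plain Python lists stand in for GPU tensors, "devices" are list slices processed one after another, and all function names are invented for illustration. Tensor parallelism splits the work of a single layer across devices, so one token finishes sooner. Pipeline parallelism assigns whole groups of layers to different stages, so several requests can be in flight at once.

```python
# Toy illustration of tensor vs. pipeline parallelism (not TensorRT-LLM code).
# Assumption: sizes divide evenly by the number of devices or stages.


def matvec(weights, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tensor_parallel_matvec(weights, x, num_devices):
    """Split the output rows of ONE layer across devices, then gather."""
    rows = len(weights) // num_devices
    shards = [weights[i * rows:(i + 1) * rows] for i in range(num_devices)]
    # On real hardware each shard is computed concurrently on its own GPU.
    partials = [matvec(shard, x) for shard in shards]
    # Concatenating the partial outputs plays the role of an all-gather.
    return [y for part in partials for y in part]


def pipeline_parallel_forward(layers, x, num_stages):
    """Give each stage a contiguous group of layers; pass activations along."""
    per_stage = len(layers) // num_stages
    stages = [layers[i * per_stage:(i + 1) * per_stage]
              for i in range(num_stages)]
    for stage in stages:  # each stage would live on a different GPU
        for weights in stage:
            x = matvec(weights, x)
    return x


if __name__ == "__main__":
    w = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(tensor_parallel_matvec(w, [1, 1], num_devices=2))  # [3, 7, 11, 15]

    layers = [[[1, 1], [0, 1]]] * 4
    print(pipeline_parallel_forward(layers, [1, 1], num_stages=2))  # [5, 1]
```

Both functions return the same result as running the model on one device. What changes is where the work happens and how much data must cross the interconnect at each step.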

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to be able to combine both techniques effectively, as TensorRT-LLM does.
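
As an illustration of what combining the techniques means on an eight-GPU system, the sketch below lists every way to factor the GPU count into a tensor-parallel degree and a pipeline-parallel degree. It is a simplified model that assumes every GPU is used exactly once and that the two degrees multiply to the GPU count. The function name `parallel_layouts` is hypothetical and does not correspond to any TensorRT-LLM API.

```python
# Enumerate (tensor_parallel, pipeline_parallel) layouts for a GPU count.
# Assumption: tp * pp must equal the number of GPUs.


def parallel_layouts(num_gpus: int):
    """Return all (tp, pp) pairs whose product is num_gpus."""
    return [(tp, num_gpus // tp)
            for tp in range(1, num_gpus + 1)
            if num_gpus % tp == 0]


if __name__ == "__main__":
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

In this framing, (8, 1) is pure tensor parallelism, (1, 8) is pure pipeline parallelism, and (2, 4) or (4, 2) are the mixed layouts a deployment might benchmark against its latency budget.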

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...