What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round, MLPerf Inference v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine. With the advanced quantization techniques that are part of TensorRT Model Optimizer, it met the strict accuracy targets of the MLPerf benchmark.
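To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch. It counts model weights only and assumes every parameter is stored at the named precision; real deployments also need memory for the KV cache, activations and quantization scale factors, so treat the numbers as lower bounds. The function name and the decimal-gigabyte convention are choices made for this illustration.

```python
# Bits used to store one parameter at each precision.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 1e9


# A 70-billion-parameter model, weights only.
for precision in BITS_PER_PARAM:
    gb = weight_footprint_gb(70e9, precision)
    print(f"{precision}: {gb:.0f} GB")
# FP16: 140 GB
# FP8: 70 GB
# FP4: 35 GB
```

Each halving of the bit width halves the bytes that have to be stored and moved for the weights, which is where the reduction in memory traffic comes from.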

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as the 70B model size and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, where compute can become a bottleneck. These deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
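The difference between the two techniques can be sketched in a few lines of plain Python. This is a toy illustration, not how TensorRT-LLM implements either technique: the "GPUs" are just loop iterations, the layers are small weight matrices with no nonlinearity, and all function names are hypothetical. Tensor parallelism splits a single layer's weights across devices so they cooperate on every token; pipeline parallelism gives each device a contiguous group of whole layers.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # w[i][j]: weight from input i to output j


def matvec(x: Vector, w: Matrix) -> Vector:
    """Row vector times matrix: returns x @ w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def tensor_parallel_layer(x: Vector, w: Matrix, num_gpus: int) -> Vector:
    """Split w by output columns; each shard stands in for one GPU's share."""
    step = len(w[0]) // num_gpus  # assumes the column count divides evenly
    out: Vector = []
    for g in range(num_gpus):
        shard = [row[g * step:(g + 1) * step] for row in w]
        out.extend(matvec(x, shard))  # concatenation plays the role of an all-gather
    return out


def pipeline_parallel(x: Vector, layers: List[Matrix], num_stages: int) -> Vector:
    """Assign contiguous groups of layers to stages and pass activations along."""
    per_stage = len(layers) // num_stages  # assumes the layer count divides evenly
    for s in range(num_stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matvec(x, w)  # the hand-off between stages would cross GPUs
    return x


x = [1.0, 2.0]
w1 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

single_gpu = matvec(matvec(x, w1), w2)
assert tensor_parallel_layer(x, w1, num_gpus=2) == matvec(x, w1)
assert pipeline_parallel(x, [w1, w2], num_stages=2) == single_gpu
```

Both variants reproduce the single-device result. What changes is where the communication happens: inside every layer for tensor parallelism, and only at stage boundaries for pipeline parallelism.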

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques like in TensorRT-LLM.
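As a rough sketch of what combining the two techniques means in practice, the snippet below enumerates the ways a fixed number of GPUs can be divided between tensor parallelism and pipeline parallelism. It assumes the common convention that the tensor-parallel degree multiplied by the pipeline-parallel degree equals the GPU count; the function name is hypothetical, and which layout performs best for a given latency budget has to be measured rather than computed.

```python
from typing import List, Tuple


def parallel_layouts(num_gpus: int) -> List[Tuple[int, int]]:
    """Return every (tensor_parallel, pipeline_parallel) pair whose product is num_gpus."""
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


# Candidate layouts for an eight-GPU system.
for tp, pp in parallel_layouts(8):
    print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

For eight GPUs this yields four candidates, ranging from pure pipeline parallelism (1, 8) to pure tensor parallelism (8, 1), with the mixed layouts (2, 4) and (4, 2) in between.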

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...