
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the needed infrastructure to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine and, with advanced quantization techniques that are part of TensorRT Model Optimizer, met the strict accuracy targets of the MLPerf benchmark.
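To see why narrower formats shrink the memory footprint, a back-of-the-envelope estimate can help. The sketch below assumes a model with exactly 70 billion parameters and counts weight storage only; it ignores quantization scale factors, the KV cache and activations, so real footprints will be somewhat higher. The function name and the format table are illustrative and are not part of any NVIDIA library.

```python
# Rough weight-memory estimate per precision format (illustrative only).
# Assumption: every parameter is stored at the stated bit width, with no
# extra bytes for scale factors or metadata.
BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


if __name__ == "__main__":
    params = 70_000_000_000  # assumed parameter count for a "70B" model
    for fmt in BITS_PER_PARAM:
        print(f"{fmt}: {weight_memory_gb(params, fmt):.0f} GB")
```

Under these assumptions, halving the bit width halves the bytes that must be stored and moved for the weights, which is the source of the reduced memory traffic described above.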

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as the 70B model and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, where compute can become a bottleneck. These deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
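The difference between the two techniques can be illustrated with a toy sketch. The code below uses plain Python lists and simulated "devices"; no real GPUs, interconnect or TensorRT-LLM calls are involved, and all function names are hypothetical. Tensor parallelism splits one layer's weight matrix by columns so every device works on the same input at once, while pipeline parallelism assigns whole layers to successive stages. In a real pipeline, throughput comes from keeping several micro-batches in flight across the stages, which this sketch does not model.

```python
# Toy illustration of tensor vs. pipeline parallelism using plain Python lists.
# "Devices" are simulated; this only shows how the work is partitioned.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split weight matrix w into `parts` equal column blocks."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def tensor_parallel_layer(x, w, num_devices):
    """Each device multiplies x by its column shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_devices)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x, layers, num_stages):
    """Each stage owns a contiguous group of layers and passes its output on."""
    per_stage = len(layers) // num_stages
    for stage in range(num_stages):
        for w in layers[stage * per_stage:(stage + 1) * per_stage]:
            x = matmul(x, w)
    return x


if __name__ == "__main__":
    x = [[1, 2]]
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    # The sharded result should match the single-device result.
    print(tensor_parallel_layer(x, w, num_devices=2) == matmul(x, w))
```

In the tensor-parallel case every device contributes to each token, which is why it helps latency; in the pipeline case each device only needs its own layers, which reduces per-device communication and suits throughput-oriented serving.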

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques like in TensorRT-LLM.
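As a rough illustration of what combining the two techniques means, the sketch below enumerates the ways a fixed number of GPUs can be divided into a tensor-parallel size and a pipeline-parallel size. It assumes the two sizes must multiply to the GPU count; the function name is hypothetical and this is not the TensorRT-LLM API. Choosing among the resulting layouts would depend on the latency budget and measured throughput.

```python
def parallel_layouts(num_gpus: int):
    """List (tensor_parallel, pipeline_parallel) sizes whose product is num_gpus."""
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


if __name__ == "__main__":
    # Candidate layouts for an eight-GPU system.
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```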

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...