
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the needed infrastructure to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round, MLPerf Inference v4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission takes advantage of Blackwell's second-generation Transformer Engine. With the advanced quantization techniques that are part of TensorRT Model Optimizer, it met the strict accuracy targets of the MLPerf benchmark.
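To make the memory-footprint point concrete, here is a minimal back-of-the-envelope sketch. It assumes a nominal 70 billion parameters and counts weight storage only; KV cache, activations and the scale factors that quantized formats carry are ignored, so real deployments will differ. The function and variable names are illustrative and do not come from any NVIDIA library.

```python
# Approximate weight storage for a dense model at several precisions.
# Weights only: KV cache, activations and quantization scale factors are ignored.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_gigabytes(num_params: int, bits: int) -> float:
    """Return weight storage in GB (1 GB = 10**9 bytes)."""
    return num_params * bits / 8 / 1e9


# Nominal parameter count, assumed for illustration.
params_70b = 70_000_000_000

footprints = {
    name: weight_gigabytes(params_70b, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
# FP16: 140 GB
# FP8: 70 GB
# FP4: 35 GB
```

Each halving of the bit width halves the bytes that must be stored and read per weight, which is where the reduction in memory traffic comes from.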

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as model sizes 70B and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, where compute can become a bottleneck. Deployments seek to balance low latency and high throughput, and the optimal parallelization technique depends on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
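The difference between the two techniques can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how TensorRT-LLM implements either technique: the "GPUs" are simulated as list slices, every function name is hypothetical, and a real system would overlap the per-GPU work and use collective communication over NVLink. Tensor parallelism splits one layer's weight matrix across GPUs so that all of them work on the same token at once; pipeline parallelism assigns whole layers to different GPUs and hands activations from stage to stage.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: (m x k) times (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into equal column shards, one per simulated GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across GPUs"
    step = n // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_layer(x: Matrix, w: Matrix, gpus: int) -> Matrix:
    """One layer, with its weight columns sharded across simulated GPUs."""
    shards = split_columns(w, gpus)
    # Each partial product would run on its own GPU at the same time.
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for an all-gather: concatenate the column blocks per row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x: Matrix, layers: List[Matrix], stages: int) -> Matrix:
    """Many layers, with contiguous groups of layers assigned to each stage."""
    assert len(layers) % stages == 0, "layers must divide evenly across stages"
    per_stage = len(layers) // stages
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)
        # Here the activation would be sent to the GPU that owns the next stage.
    return x


x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(tensor_parallel_layer(x, w, gpus=2) == matmul(x, w))  # True
```

In the tensor-parallel case every GPU contributes to each output, which shortens the time per token but requires communication at every layer. In the pipeline case communication happens only at stage boundaries, which suits serving many requests at once.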

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as TensorRT-LLM does.
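As a rough illustration of what combining the two techniques means on an eight-GPU system, the sketch below enumerates the ways eight GPUs can be factored into a tensor-parallel degree and a pipeline-parallel degree. The function name is hypothetical and this is not the TensorRT-LLM configuration API; which layout is best for a given latency budget has to be measured.

```python
from typing import List, Tuple


def parallel_layouts(num_gpus: int) -> List[Tuple[int, int]]:
    """Return every (tensor_parallel, pipeline_parallel) pair whose product is num_gpus."""
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


layouts = parallel_layouts(8)
for tp, pp in layouts:
    print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
# tensor parallel = 1, pipeline parallel = 8
# tensor parallel = 2, pipeline parallel = 4
# tensor parallel = 4, pipeline parallel = 2
# tensor parallel = 8, pipeline parallel = 1
```

The two ends of the list correspond to pure pipeline parallelism and pure tensor parallelism; the layouts in between trade some per-token latency for throughput.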

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...