What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the infrastructure needed to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library that delivers state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The submission took advantage of Blackwell's second-generation Transformer Engine and, with the advanced quantization techniques that are part of TensorRT Model Optimizer, met the strict accuracy targets of the MLPerf benchmark.
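To make the memory-footprint point concrete, here is a minimal back-of-the-envelope sketch. It counts only the bytes needed to store model weights at a given precision. The function name and the 70-billion-parameter figure are illustrative assumptions, and the estimate deliberately ignores the KV cache, activations and any per-block scaling factors that quantized formats typically store alongside the weights, so real deployments will need somewhat more memory.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Illustrative helper, not part of any NVIDIA library. It ignores
    quantization scale factors, KV cache and activation memory.
    """
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # assumed parameter count for a "70B" model
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
        print(f"{name}: {weight_memory_gb(params, bits):.0f} GB of weights")
```

Under these assumptions, halving the bits per parameter halves the weight storage, and with it the bytes that must be read from memory for every generated token.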

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library to accelerate LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations with additional LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2 as well as the 70B model size and the biggest model, 405B. These optimizations include custom quantization recipes, as well as parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, where compute can become a bottleneck. Deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
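The difference between the two techniques can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how TensorRT-LLM implements either technique: "GPUs" are simulated by list slices, each layer is a bare weight matrix with no activation function, and all function names are hypothetical. Tensor parallelism splits each layer's weight matrix so that every shard works on the same token at once, while pipeline parallelism assigns whole layers to consecutive stages.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Compute x @ w for a row vector x and a matrix w (rows = inputs)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w: Matrix, num_shards: int) -> List[Matrix]:
    """Tensor parallelism: each shard owns a contiguous slice of output columns."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly in this sketch"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w]
            for s in range(num_shards)]


def tensor_parallel_matvec(x: Vector, w: Matrix, num_shards: int) -> Vector:
    """All shards see the same input; their partial outputs are concatenated."""
    out: Vector = []
    for shard in split_columns(w, num_shards):
        # On real hardware these shard computations would run concurrently.
        out.extend(matvec(x, shard))
    return out


def split_layers(layers: List[Matrix], num_stages: int) -> List[List[Matrix]]:
    """Pipeline parallelism: each stage owns a contiguous block of whole layers."""
    assert len(layers) % num_stages == 0, "layers must divide evenly in this sketch"
    per_stage = len(layers) // num_stages
    return [layers[s * per_stage:(s + 1) * per_stage]
            for s in range(num_stages)]


def pipeline_forward(x: Vector, stages: List[List[Matrix]]) -> Vector:
    """Pass one input through every stage in order, stage by stage."""
    for stage in stages:
        for w in stage:
            x = matvec(x, w)
    return x
```

In the tensor-parallel case every shard contributes to the same token, which is why it helps latency but requires a gather step across GPUs for each layer. In the pipeline case a single request still visits the stages one after another, so the benefit comes from keeping different stages busy with different requests at the same time, which favors overall throughput.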

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as TensorRT-LLM does.
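As a rough illustration of what combining the two techniques means, the sketch below enumerates the ways a fixed number of GPUs can be divided into a tensor-parallel degree and a pipeline-parallel degree. The function name is hypothetical and the code does not call TensorRT-LLM. It only lists candidate layouts, and choosing among them would still require measuring latency and throughput for the actual workload.

```python
from typing import List, Tuple


def parallel_layouts(num_gpus: int) -> List[Tuple[int, int]]:
    """Return every (tensor_parallel, pipeline_parallel) pair whose product
    equals num_gpus. Illustrative only; real systems add further constraints
    such as attention-head counts and layer counts dividing evenly.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [(tp, num_gpus // tp)
            for tp in range(1, num_gpus + 1)
            if num_gpus % tp == 0]


if __name__ == "__main__":
    for tp, pp in parallel_layouts(8):
        print(f"tensor parallel = {tp}, pipeline parallel = {pp}")
```

For an eight-GPU system this yields four layouts, ranging from pure pipeline parallelism (1, 8) to pure tensor parallelism (8, 1), with mixed layouts such as (4, 2) in between.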

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...