
Think SMART: How to Optimize AI Factory Inference Performance

21/08/2025

From AI assistants doing deep research to autonomous vehicles making split-second navigation decisions, AI adoption is exploding across industries.

Behind every one of those interactions is inference - the stage after training where an AI model processes inputs and produces outputs in real time.

Today's most advanced AI reasoning models - capable of multistep logic and complex decision-making - generate far more tokens per interaction than older models, driving a surge in token usage and the need for infrastructure that can manufacture intelligence at scale.

AI factories are one way of meeting these growing needs.

But running inference at such a large scale isn't just about throwing more compute at the problem.

To deploy AI with maximum efficiency, inference must be evaluated based on the Think SMART framework:

Scale and complexity

Multidimensional performance

Architecture and software

Return on investment driven by performance

Technology ecosystem and install base

Scale and Complexity

As models evolve from compact applications to massive, multi-expert systems, inference must keep pace with increasingly diverse workloads - from answering quick, single-shot queries to multistep reasoning involving millions of tokens.

The growing size and intricacy of AI models have major implications for inference: resource intensity, latency and throughput, energy and cost, and diversity of use cases.

To meet this complexity, AI service providers and enterprises are scaling up their infrastructure, with new AI factories coming online from partners like CoreWeave, Dell Technologies, Google Cloud and Nebius.

Multidimensional Performance

Scaling complex AI deployments means AI factories need the flexibility to serve tokens across a wide spectrum of use cases while balancing accuracy, latency and costs.

Some workloads, such as real-time speech-to-text translation, demand ultralow latency and a large number of tokens per user, straining computational resources for maximum responsiveness. Others are latency-insensitive and geared for sheer throughput, such as generating answers to dozens of complex questions simultaneously.

But most popular real-time scenarios operate somewhere in the middle: requiring quick responses to keep users happy and high throughput to simultaneously serve up to millions of users - all while minimizing cost per token.

For example, the NVIDIA inference platform is built to balance both latency and throughput, powering inference benchmarks on models like gpt-oss, DeepSeek-R1 and Llama 3.1.

What to Assess to Achieve Optimal Multidimensional Performance

Throughput: How many tokens can the system process per second? The more, the better for scaling workloads and revenue.

Latency: How quickly does the system respond to each individual prompt? Lower latency means a better experience for users - crucial for interactive applications.

Scalability: Can the system setup quickly adapt as demand increases, going from one to thousands of GPUs without complex restructuring or wasted resources?

Cost Efficiency: Is performance per dollar high, and are those gains sustainable as system demands grow?
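As a rough illustration of how the throughput, latency and cost questions above can be turned into numbers, the sketch below summarizes a batch of request logs. The `RequestLog` record layout, the `summarize` helper and the GPU-hour price are hypothetical names and inputs chosen for this example; they are not part of any NVIDIA tool.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    """One completed inference request (hypothetical record layout)."""
    start_s: float        # when the prompt arrived, in seconds
    first_token_s: float  # when the first output token was emitted
    end_s: float          # when the last output token was emitted
    output_tokens: int    # tokens generated for this request


def summarize(logs, gpu_count, gpu_hour_usd):
    """Compute throughput, latency and cost figures for a set of requests."""
    if not logs:
        raise ValueError("need at least one request")
    # Wall-clock window covered by the measured requests.
    window_s = max(r.end_s for r in logs) - min(r.start_s for r in logs)
    total_tokens = sum(r.output_tokens for r in logs)
    if window_s <= 0 or total_tokens <= 0:
        raise ValueError("need a positive time window and token count")
    n = len(logs)
    # Assumed cost model: every GPU is billed for the whole window.
    cost_usd = gpu_count * gpu_hour_usd * window_s / 3600.0
    return {
        "tokens_per_s": total_tokens / window_s,
        "mean_time_to_first_token_s":
            sum(r.first_token_s - r.start_s for r in logs) / n,
        "mean_latency_s": sum(r.end_s - r.start_s for r in logs) / n,
        "usd_per_million_tokens": cost_usd / total_tokens * 1_000_000,
    }
```

Tracking these figures together, rather than one at a time, is what makes the trade-off visible: a configuration that raises tokens per second by batching more aggressively will usually also raise time to first token.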

Architecture and Software

AI inference performance needs to be engineered from the ground up. It comes from hardware and software working in sync - GPUs, networking and code tuned to avoid bottlenecks and make the most of every cycle.

Powerful architecture without smart orchestration wastes potential; great software without fast, low-latency hardware means sluggish performance. The key is architecting a system so that it can quickly, efficiently and flexibly turn prompts into useful answers.

Enterprises can use NVIDIA infrastructure to build a system that delivers optimal performance.

Architecture Optimized for Inference at AI Factory Scale

The NVIDIA Blackwell platform unlocks a 50x boost in AI factory productivity for inference - meaning enterprises can optimize throughput and interactive responsiveness, even when running the most complex models.

The NVIDIA GB200 NVL72 rack-scale system connects 36 NVIDIA Grace CPUs and 72 Blackwell GPUs with NVIDIA NVLink interconnect, delivering 40x higher revenue potential, 30x higher throughput, 25x more energy efficiency and 300x more water efficiency for demanding AI reasoning workloads.

Further, NVFP4 is a low-precision format that delivers peak performance on NVIDIA Blackwell and slashes energy, memory and bandwidth demands without skipping a beat on accuracy, so users can deliver more queries per watt and lower costs per token.
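To see why a low-precision format reduces memory and bandwidth demands, it helps to work through the weight-storage arithmetic. The sketch below is illustrative only: the `weight_bytes` helper is hypothetical, and the block layout used in the usage note (one 8-bit scale shared by each block of 16 four-bit values) is an assumption for this example rather than a detail stated in the article.

```python
def weight_bytes(num_params, bits_per_value, block_size=None, scale_bits=0):
    """Estimate bytes needed to store model weights.

    If block_size is given, one extra scale of scale_bits is assumed
    to be stored per block of block_size values.
    """
    bits = num_params * bits_per_value
    if block_size:
        blocks = -(-num_params // block_size)  # ceiling division
        bits += blocks * scale_bits
    return bits / 8
```

For a hypothetical 70-billion-parameter model, `weight_bytes(70_000_000_000, 16)` gives 140 GB for 16-bit weights, while `weight_bytes(70_000_000_000, 4, block_size=16, scale_bits=8)` gives about 39.4 GB under the assumed block layout, roughly a 3.6x reduction in the bytes that must be stored and moved.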

Full-Stack Inference Platform Accelerated on Blackwell

Enabling inference at AI factory scale requires more than accelerated architecture. It requires a full-stack platform with multiple layers of solutions and tools that work in concert.

Modern AI deployments require dynamic autoscaling from one to thousands of GPUs. The NVIDIA Dynamo platform steers distributed inference to dynamically assign GPUs and optimize data flows, delivering up to 4x more performance without cost increases. New cloud integrations further improve scalability and ease of deployment.
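The core sizing decision behind autoscaling can be sketched in a few lines. This is a generic illustration, not how NVIDIA Dynamo works internally and not its API: the `gpus_needed` function, its parameters and the headroom default are all hypothetical.

```python
import math


def gpus_needed(demand_tokens_per_s, tokens_per_s_per_gpu,
                min_gpus=1, max_gpus=1000, headroom=0.2):
    """Return a GPU count that covers demand plus spare capacity.

    headroom is the fraction of extra capacity kept in reserve so that
    short bursts do not immediately push latency up (an assumed policy).
    """
    if tokens_per_s_per_gpu <= 0:
        raise ValueError("per-GPU throughput must be positive")
    target = demand_tokens_per_s * (1 + headroom) / tokens_per_s_per_gpu
    # Clamp to the allowed range so the pool never drops to zero
    # or grows past the available hardware.
    return max(min_gpus, min(max_gpus, math.ceil(target)))
```

A real scheduler would also have to account for model placement, request routing and how quickly new capacity comes online, which is the kind of work a distributed inference platform takes on.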

For inference workloads focused on getting optimal performance per GPU, such as speeding up large mixture-of-experts models, frameworks like NVIDIA TensorRT-LLM are helping developers achieve breakthrough performance.

With its new PyTorch-centric workflow, TensorRT-LLM streamlines AI deployment by removing the need for manual engine management. These solutions aren't just powerful on their own - they're built to work in tandem. For example, using Dynamo and TensorRT-LLM, mission-critical inference providers like Baseten can immediately deliver state-of-the-art model performance even on new frontier models like gpt-oss.

On the model side, families like NVIDIA Nemotron are built with open training data for t
LINK: https://blogs.nvidia.com/blog/think-smart-optimize-ai-factory-inferenc...