
As AI models evolve and adoption grows, enterprises must perform a delicate balancing act to achieve maximum value.
That's because inference - the process of running data through a model to get an output - offers a different computational challenge than training a model.
Pretraining a model - the process of ingesting data, breaking it down into tokens and finding patterns - is essentially a one-time cost. But in inference, every prompt to a model generates tokens, each of which incur a cost.
That means that as AI model performance and use increases, so do the amount of tokens generated and their associated computational costs. For companies looking to build AI capabilities, the key is generating as many tokens as possible - with maximum speed, accuracy and quality of service - without sending computational costs skyrocketing.
As such, the AI ecosystem has been working to make inference cheaper and more efficient. Inference costs have been trending down for the past year thanks to major leaps in model optimization, leading to increasingly advanced, energy-efficient accelerated computing infrastructure and full-stack solutions.
According to the Stanford University Institute for Human-Centered AI's 2025 AI Index Report, the inference cost for a system performing at the level of GPT-3.5 dropped over 280-fold between November 2022 and October 2024. At the hardware level, costs have declined by 30% annually, while energy efficiency has improved by 40% each year. Open-weight models are also closing the gap with closed models, reducing the performance difference from 8% to just 1.7% on some benchmarks in a single year. Together, these trends are rapidly lowering the barriers to advanced AI.
As models evolve and generate more demand and create more tokens, enterprises need to scale their accelerated computing resources to deliver the next generation of AI reasoning tools or risk rising costs and energy consumption.
What follows is a primer to understand the concepts of the economics of inference, enterprises can position themselves to achieve efficient, cost-effective and profitable AI solutions at scale.
Key Terminology for the Economics of AI Inference Knowing key terms of the economics of inference helps set the foundation for understanding its importance.
Tokens are the fundamental unit of data in an AI model. They're derived from data during training as text, images, audio clips and videos. Through a process called tokenization, each piece of data is broken down into smaller constituent units. During training, the model learns the relationships between tokens so it can perform inference and generate an accurate, relevant output.
Throughput refers to the amount of data - typically measured in tokens - that the model can output in a specific amount of time, which itself is a function of the infrastructure running the model. Throughput is often measured in tokens per second, with higher throughput meaning greater return on infrastructure.
Latency is a measure of the amount of time between inputting a prompt and the start of the model's response. Lower latency means faster responses. The two main ways of measuring latency are:
Time to First Token: A measurement of the initial processing time required by the model to generate its first output token after a user prompt.
Time per Output Token: The average time between consecutive tokens - or the time it takes to generate a completion token for each user querying the model at the same time. It's also known as inter-token latency or token-to-token latency.
Time to first token and time per output token are helpful benchmarks, but they're just two pieces of a larger equation. Focusing solely on them can still lead to a deterioration of performance or cost.
To account for other interdependencies, IT leaders are starting to measure goodput, which is defined as the throughput achieved by a system while maintaining target time to first token and time per output token levels. This metric allows organizations to evaluate performance in a more holistic manner, ensuring that throughput, latency and cost are aligned to support both operational efficiency and an exceptional user experience.
Energy efficiency is the measure of how effectively an AI system converts power into computational output, expressed as performance per watt. By using accelerated computing platforms, organizations can maximize tokens per watt while minimizing energy consumption.
How the Scaling Laws Apply to Inference Cost The three AI scaling laws are also core to understanding the economics of inference:
Pretraining scaling: The original scaling law that demonstrated that by increasing training dataset size, model parameter count and computational resources, models can achieve predictable improvements in intelligence and accuracy.
Post-training: A process where models are fine-tuned for accuracy and specificity so they can be applied to application development. Techniques like retrieval-augmented generation can be used to return more relevant answers from an enterprise database.
Test-time scaling (aka long thinking or reasoning ): A technique by which models allocate additional computational resources during inference to evaluate multiple possible outcomes before arriving at the best answer.
While AI is evolving and post-training and test-time scaling techniques become more sophisticated, pretraining isn't disappearing and remains an important way to scale models. Pretraining will still be needed to support post-training and test-time scaling.
Profitable AI Takes a Full-Stack Approach In comparison to inference from a model that's only gone through pretraining and post-training, models that harness test-time scaling generate multiple tokens to solve a complex problem. This results in more accurate and relevant model outputs - but
More from Nvidia
19/03/2026
It's a double feature on GFN Thursday. This week, GeForce NOW offers smoother sights in virtual reality (VR) and a sprawling new land to conquer.
Streaming...
17/03/2026
As AI native applications scale to more users, agents and devices, the telecommu...
17/03/2026
The features on social media apps like Snapchat evolve nearly as fast as what...
17/03/2026
The paradigm of consumer computing has revolved around the concept of a personal...
12/03/2026
Editor's note: This post is part of Into the Omniverse, a series focused on ...
12/03/2026
GeForce NOW is bringing the game to the Game Developers Conference (GDC), running this week in San Francisco. While developers build the future of gaming, GeFor...
11/03/2026
Launched today, NVIDIA Nemotron 3 Super is a 120 billion parameter open model with 12 billion active parameters designed to run complex agentic AI systems at sc...
10/03/2026
Game developers and artists are building cinematic worlds and iconic characters ...
10/03/2026
Game development teams are working across larger worlds, more complex pipelines and more distributed teams than ever. At the same time, many studios still rely ...
10/03/2026
The Cat 306 CR mini-excavator weighs just under eight tons and fits inside a standard shipping container. It's the machine a contractor rents when the job s...
10/03/2026
NVIDIA and Thinking Machines Lab announced today a multiyear strategic partnersh...
09/03/2026
AI is everywhere and accelerating everything - becoming essential infrastructure...
09/03/2026
ABB Robotics and NVIDIA today announced a breakthrough partnership that brings i...
05/03/2026
March is in full bloom, and that means a fresh wave of games heading to the cloud. 15 new titles are joining the GeForce NOW library this month.
Leading the Ma...
28/02/2026
AI-RAN is moving from lab to field, showing that a software-defined approach is ...
28/02/2026
Autonomous networks - intelligent, self-managing telecommunications operations -...
26/02/2026
GeForce NOW's anniversary celebration reaches a chilling crescendo as Capcom...
26/02/2026
GeForce NOW's anniversary celebration reaches a chilling crescendo as Capcom...
24/02/2026
AI is accelerating every aspect of healthcare - from radiology and drug discover...
23/02/2026
As technologies and systems become more digitalized and connected across the world, operational technology (OT) environments and industrial control systems (ICS...
19/02/2026
The GeForce NOW anniversary celebration keeps on rolling, and this week is all about the games that make it possible. With more than 4,500 titles supported in t...
19/02/2026
AI is accelerating the telecommunications industry's transformation, becomin...
17/02/2026
India is entering a new age of industrialization, as AI transforms how the world...
17/02/2026
Agentic AI is reshaping India's tech industry, delivering leaps in services ...
17/02/2026
India is the nexus of AI innovation this week as the host of the AI Impact Summit, which brings together global heads of state and industry to chart the future ...
16/02/2026
The NVIDIA Blackwell platform has been widely adopted by leading inference provi...
12/02/2026
At leading institutions across the globe, the NVIDIA DGX Spark desktop supercomputer is bringing data center class AI to lab benches, faculty offices and studen...
12/02/2026
A diagnostic insight in healthcare. A character's dialogue in an interactive...
12/02/2026
The GeForce NOW sixth-anniversary festivities roll on this February, continuing a monthlong celebration of NVIDIA's cloud gaming service.
This week brings ...
05/02/2026
Break out the cake and green sprinkles - GeForce NOW is turning six.
Since launch, members have streamed over 1 billion hours, and the party's just getting...
04/02/2026
Editor's note: This post is part of the Nemotron Labs blog series, which exp...
03/02/2026
At 3DEXPERIENCE World in Houston, NVIDIA founder and CEO Jensen Huang and Dassau...
29/01/2026
Mercedes-Benz is marking 140 years of automotive innovation with a new S-Class b...
29/01/2026
Editor's note: This post is part of Into the Omniverse, a series focused on ...
29/01/2026
Get ready to game - the native GeForce NOW app for Linux PCs is now available in beta, letting Linux desktops tap directly into GeForce RTX performance from the...
28/01/2026
Quantum technologies are rapidly emerging as foundational capabilities for economic competitiveness, national security and scientific leadership in the 21st cen...
22/01/2026
AI-powered driver assistance technologies are becoming standard equipment, funda...
22/01/2026
The wait is over, pilots. Flight control support - one of the most community-requested features for GeForce NOW - is live starting today, following its announce...
22/01/2026
AI has taken center stage in financial services, automating the research and exe...
22/01/2026
AI-powered content generation is now embedded in everyday tools like Adobe and Canva, with a slew of agencies and studios incorporating the technology into thei...
21/01/2026
From skilled trades to startups, AI's rapid expansion is the beginning of th...
21/01/2026
From skilled trades to startups, AI's rapid expansion is the beginning of th...
15/01/2026
NVIDIA kicked off the year at CES, where the crowd buzzed about the latest gaming announcements - including the native GeForce NOW app for Linux and Amazon Fire...
13/01/2026
NVIDIA and Lilly are putting together a blueprint for what is possible in the f...
09/01/2026
Every that was easy shopping moment is made possible by teams working to hit s...
08/01/2026
The next universal technology since the smartphone is on the horizon - and it ma...
08/01/2026
In the rolling hills of Berkeley, California, an AI agent is supporting high-stakes physics experiments at the Advanced Light Source (ALS) particle accelerator....
08/01/2026
NVIDIA is wrapping up a big week at the CES trade show with a set of GeForce NOW...
07/01/2026
AI has transformed retail and consumer packaged goods (CPG) operations, enhancin...
05/01/2026
At the CES trade show running this week in Las Vegas, NVIDIA announced that the ...