
How NVIDIA's Inference Software Stack Powers the Lowest Token Cost

30/06/2026

As organizations move from AI pilots to production AI factories, infrastructure decisions have shifted from peak chip specifications to cost per token: how many useful tokens they can deliver per dollar, per watt and within required latency targets.
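As a rough illustration of the metric, cost per token can be estimated from what a GPU costs per hour and how many tokens it sustains per second. The sketch below is a minimal example under those assumptions; the function name and every number in it are hypothetical placeholders, not figures from NVIDIA or any benchmark, and it ignores latency targets and power.

```python
def cost_per_million_tokens(gpu_hour_cost_usd: float,
                            tokens_per_second_per_gpu: float) -> float:
    """Estimate serving cost in USD per one million tokens for a single GPU."""
    if tokens_per_second_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_hour_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: a GPU billed at $3.00 per hour sustaining 1,000 tokens/s.
baseline = cost_per_million_tokens(3.00, 1_000)

# Same hourly cost, but software optimizations raise throughput (hypothetical 5x).
optimized = cost_per_million_tokens(3.00, 5_000)

print(f"baseline:  ${baseline:.4f} per 1M tokens")
print(f"optimized: ${optimized:.4f} per 1M tokens")
```

At a fixed hourly cost, cost per token falls in direct proportion to throughput, which is why software-only throughput gains on unchanged hardware translate into lower token costs.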

Codesigned with NVIDIA GPUs, CPUs, networking and systems, and strengthened by a broad open source ecosystem, NVIDIA's full-stack inference software continuously improves hardware performance. On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month.

SemiAnalysis InferenceX results comparing token cost and interactivity for NVIDIA GB300 NVL72 systems with SGLang and the NVIDIA Dynamo inference framework.

Leading companies and inference providers are already seeing the compounding value of NVIDIA's inference software stack on Blackwell:

Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second.

Cognition is using the NVIDIA Dynamo inference framework to manage inference GPUs, giving its team a ready-made path to scale reinforcement learning workloads without needing to build that infrastructure from scratch.

Deep Infra uses the NVIDIA inference software stack to serve frontier open source models performantly on Blackwell from day zero, including DeepSeek V4.

Together AI used NVIDIA TensorRT-LLM on Blackwell to help Cursor accelerate the path from model optimizations to production endpoints for its real-time coding experience.

Why Software Matters for Inference Economics

Traditional web, search and software-as-a-service workloads were relatively predictable: A user might load a page, refresh a feed or update a business record. These requests typically followed similar software paths, reading from or writing to a database, and scaled by adding more of the same servers.

Agentic AI is different.

Agentic AI runs distributed, stateful workflows that span LLMs, tools, memory, security, networking and accelerated computing across the data center. Agents can reason, plan, call tools, spin up specialist subagents and manage massive context across multi-turn workflows. They turn a single request into a distributed computing problem that can span hundreds of subagents, thousands of tasks and multiple large language models, running across GPUs, CPUs, DPUs and storage systems.

The software stack determines whether that complexity turns into wasted capacity or lower cost per token.

Lower cost per token comes from turning individual optimizations into system-level performance. NVIDIA's inference software stack does this by connecting three layers:

Production Operation: Coordinates distributed serving, orchestration, autoscaling and memory management so inference can run across the right compute and storage resources.

Application Acceleration: Runs models with high performance while giving developers room to tune and customize, using runtime optimizations such as overlapping compute and communication and kernel fusion.

Infrastructure Access: Exposes NVIDIA GPU, networking, memory and system capabilities without requiring developers to manage every device instruction set or data-transfer protocol directly.

The NVIDIA software stack spans model serving, runtime scheduling, kernels, communication libraries and hardware-aware optimizations. When these layers work as one system, individual optimizations compound, enabling rapid performance gains and lower serving costs.

Disaggregated serving, large expert parallelism over NVIDIA NVLink interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. Combined, they increase throughput by up to 20x.

The chart below shows the result. Capturing that gain in production is complex, requiring coordination across the full inference stack - from production operations and model runtimes to kernels, communication libraries and hardware access. NVIDIA's inference software stack is designed to make those layers work together so each optimization can build on the others.
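The idea that each optimization builds on the others can be sketched numerically: if the techniques multiply throughput roughly independently, the combined speedup is the product of the individual factors rather than their sum. The per-technique factors below are hypothetical values chosen only for illustration (the article reports only the combined "up to 20x" figure), and real optimizations interact in ways a simple product does not capture.

```python
from math import prod

# Hypothetical per-technique throughput multipliers, NOT measured values.
speedups = {
    "disaggregated serving": 1.5,
    "large expert parallelism": 2.0,
    "NVFP4 precision": 1.8,
    "multi-token prediction": 1.6,
}

# Compounding: each technique multiplies the throughput left by the previous ones.
combined = prod(speedups.values())

# For contrast: what the total would be if the gains merely added up.
additive = 1 + sum(factor - 1 for factor in speedups.values())

print(f"compounded speedup: {combined:.2f}x")
print(f"additive speedup:   {additive:.2f}x")
```

With these placeholder factors, the compounded total is noticeably larger than the additive one, which is the sense in which gains from separate layers of the stack build on each other.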

Stacking software optimizations compounds performance gains, increasing NVIDIA Blackwell token throughput per GPU from baseline to up to 20x with disaggregated serving, large expert parallelism (Large EP), NVFP4 and multi-token prediction (MTP).

Open Source Amplifies the Full-Stack Advantage

That same full-stack foundation is amplified by the open source ecosystem. Many of today's most widely used open source AI frameworks and inference projects are built natively on NVIDIA CUDA, which means new research and software optimizations run with leading performance on NVIDIA GPUs from day zero.

PyTorch is a leading example. Launched in 2016 with native CUDA support, PyTorch has coevolved with NVIDIA's architecture, giving developers access to innovations such as Tensor Cores, Transformer Engine and NVFP4 directly through a familiar framework.

When breakthroughs such as DFlash speculative decode, which delivers up to 15x more throughput on existing hardware, or FastVideo, which generates 1080p videos in less than five seconds, land in PyTorch, they can run instantly on NVIDIA, helping AI factories convert research progress into lower token costs.

NVIDIA and PyTorch codevelopment helps bring new AI software innovations to developers, helping turn CUDA-native advances into production performance as PyTorch adoption grows.

The same open source momentum is why when a new frontier open model like DeepSeek V4 is released, leading inference frameworks like vLLM and SGLang have day-zero deployment recipes for the NVIDIA Blackwell architecture - making the model accessible across milli
LINK: https://blogs.nvidia.com/blog/inference-software-lowest-token-cost/...