Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation, legal assistants reasoning through gigabytes of case law, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses. The growing demand to decode such massive amounts of data, and to let multiple GPUs quickly scale and communicate with each other, underscores the importance of FP4 compute and the high-bandwidth, large NVLink domain provided by NVIDIA Blackwell systems. Helix Parallelism, introduced in this blog, is co-designed with Blackwell. It enables up to a 32x increase in the number of concurrent users at a given latency, compared to the best known prior parallelism methods for real-time decoding with ultra-long context.
In other words, it lets AI agents and virtual assistants serve more people, faster than ever before.
(Note: Context in this blog refers to the sequence of previously generated tokens, whose intermediate key and value representations are stored as KV cache and accessed at every decoding step.)
Decoding bottlenecks: KV cache and FFN weight reads

To support real-time decoding at scale, a system must overcome two major bottlenecks during the decoding (aka generation) phase:
Key-Value (KV) cache streaming: When handling multi-million-token contexts, each GPU must read a massive history of past tokens (KV cache) from DRAM per sample. This constant streaming can, in turn, saturate DRAM bandwidth, increase token-to-token latency (TTL), and quickly become a major bottleneck as context length grows.
Feed-Forward Network (FFN) weight loading: During autoregressive decoding, generating every new token requires loading large Feed-Forward Network (FFN) weights from DRAM. In low latency scenarios with small batch sizes, this memory access cost is not well amortized, making FFN weight reads a dominant source of latency.
These two bottlenecks, KV cache streaming and FFN weight loading, are difficult to optimize simultaneously using traditional parallelism strategies.
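A back-of-the-envelope calculation makes the scale of these two bottlenecks concrete. The sketch below uses illustrative, Llama-70B-like model shapes (the specific numbers are assumptions for illustration, not measurements from the blog):

```python
def kv_cache_read_bytes(seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    # K and V tensors for every layer, KV head, and past token,
    # streamed from DRAM per sample at every decoding step.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

def ffn_weight_read_bytes(layers, hidden, ffn_dim, dtype_bytes=2, matrices=3):
    # Gated FFN (up, gate, down projections) loaded once per decoded token,
    # amortized only across whatever batch is in flight.
    return layers * matrices * hidden * ffn_dim * dtype_bytes

# Illustrative, Llama-70B-like shapes (assumed, not exact):
kv = kv_cache_read_bytes(seq_len=1_000_000, layers=80, kv_heads=8, head_dim=128)
ffn = ffn_weight_read_bytes(layers=80, hidden=8192, ffn_dim=28672)
print(f"KV cache streamed per step:  {kv / 1e9:.0f} GB per sample")
print(f"FFN weights loaded per step: {ffn / 1e9:.0f} GB (amortized over the batch)")
```

Under these assumed shapes, a single sample with a million-token context streams hundreds of gigabytes of KV cache per decoding step, which is why DRAM bandwidth, not compute, becomes the limiter.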
Let's take Tensor Parallelism (TP) as an example: Increasing TP can help reduce FFN stalls by distributing weight loading across multiple GPUs and improving TTL, but only up to a point. In attention schemes like Grouped Query Attention (GQA), used in Llama models, or Multi-Latent Attention (MLA), found in DeepSeek models, multiple query heads share a limited number of KV heads. As illustrated in Figure 2(c), when TP exceeds the number of KV heads, the system ends up duplicating the multi-million-token KV cache per sample across GPUs for self-attention. As a result, KV read volume stays high even with increased TP, once again saturating DRAM bandwidth and limiting scalability. In the case of MLA, the upper limit for TP is just one to avoid duplication of KV cache.
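This plateau is easy to model: the KV cache can only be partitioned across at most as many GPUs as there are KV heads, and beyond that point each additional TP rank holds a duplicate copy. A minimal sketch (function name and byte counts are illustrative assumptions):

```python
def per_gpu_kv_read_bytes(tp, kv_heads, total_kv_bytes):
    """Bytes of KV cache each GPU streams per decoding step under TP sharding.

    The cache shards across at most `kv_heads` head groups; once tp exceeds
    kv_heads, extra ranks duplicate the cache and per-GPU reads stop shrinking.
    """
    return total_kv_bytes // min(tp, kv_heads)

# Assumed example: 8 KV heads, 320 GB of KV cache for one long-context sample.
total = 320_000_000_000
for tp in (4, 8, 16, 64):
    print(f"TP={tp:3d}: {per_gpu_kv_read_bytes(tp, 8, total) / 1e9:.0f} GB per GPU")
```

Per-GPU KV reads halve as TP grows to 8 (the number of KV heads), then flatline: TP=16 and TP=64 stream exactly as much KV data per GPU as TP=8.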
So how can developers scale both model size and context length without sacrificing real-time interactivity? Helix Parallelism offers a path forward.
Helix execution flow

Helix is a hybrid sharding strategy that disaggregates the parallelism strategies of attention and FFNs in a temporal pipeline, effectively addressing both KV cache and FFN weight-read bottlenecks during multi-million-token decoding.
Figure 1 (below) shows how Helix orchestrates the execution of attention and FFN within a single transformer layer. Inspired by the structure of a DNA helix, Helix interweaves multiple dimensions of parallelism (KV, tensor, and expert) into a unified execution loop. By decoupling the parallelism strategy used for attention and FFN, Helix allows each stage to operate in a configuration tuned to its own bottleneck, all while reusing the same pool of GPUs. This reuse approach keeps GPUs efficiently utilized across stages, eliminating idle time as computation flows through the model.
Figure 1. Execution flow of Helix Parallelism. Helix reuses the same pool of N GPUs per layer by switching between N=KVPxTPA during attention and N=TPFxEP during FFN.
Attention phase

Helix applies KV Parallelism (KVP) by sharding the multi-million-token KV cache along the sequence dimension across KVP GPUs, while applying Tensor Parallelism across attention heads (TPA), where TPA is kept at or below the number of KV heads to avoid duplicating the KV cache.
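The key property that makes sequence-dimension sharding work is that softmax attention can be computed per shard and merged exactly afterward, in the style of flash-decoding: each KVP rank attends over only its slice of the cache and also returns its log-sum-exp, which is enough to reweight and combine the partial outputs. The NumPy sketch below simulates KVP ranks in a loop (single query head, toy shapes; this illustrates the split-KV math, not Helix's actual kernels or communication):

```python
import numpy as np

def partial_attention(q, k_shard, v_shard):
    """One simulated KVP rank's attention over its slice of the sequence.

    Returns the shard-local softmax output plus the shard's log-sum-exp,
    which is all that is needed to merge shards into the exact result.
    """
    scores = k_shard @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    w = np.exp(scores - m)
    lse = m + np.log(w.sum())            # log of this shard's softmax mass
    return (w @ v_shard) / w.sum(), lse

def merge_shards(parts):
    """Combine per-shard outputs, weighting each by its softmax mass."""
    outs, lses = zip(*parts)
    lses = np.array(lses)
    g = np.exp(lses - lses.max())        # stable relative masses
    return sum((gi / g.sum()) * o for gi, o in zip(g, outs))

# Toy check: 4 simulated KVP ranks, one query head.
rng = np.random.default_rng(0)
seq_len, head_dim, kvp = 1024, 64, 4
q = rng.standard_normal(head_dim)
K = rng.standard_normal((seq_len, head_dim))
V = rng.standard_normal((seq_len, head_dim))

# Reference: ordinary full-sequence attention.
s = K @ q / np.sqrt(head_dim)
p = np.exp(s - s.max())
ref = (p / p.sum()) @ V

# Sequence-sharded attention: each "rank" sees only a contiguous slice.
parts = [partial_attention(q, ks, vs)
         for ks, vs in zip(np.array_split(K, kvp), np.array_split(V, kvp))]
merged = merge_shards(parts)
print(np.allclose(merged, ref))
```

Because the merge is exact, sharding the cache across KVP GPUs cuts each GPU's DRAM reads by the KVP factor without approximating attention; only the small per-shard outputs and scalars need to be exchanged.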