
Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation, legal assistants reasoning through gigabytes of case law as big as an entire encyclopedia set, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.
The growing demand to decode such massive amounts of data, and to let multiple GPUs scale and communicate quickly with each other, underscores the importance of FP4 compute and the large, high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. Helix Parallelism, introduced in this blog, is co-designed with Blackwell. It enables up to a 32x increase in the number of concurrent users at a given latency, compared to the best known prior parallelism methods for real-time decoding with ultra-long context.
In other words, it lets AI agents and virtual assistants serve more people, faster than ever before.
(Note: Context in this blog refers to the sequence of previously generated tokens, whose intermediate key and value representations are stored as KV cache and accessed at every decoding step.)
Decoding bottlenecks: KV cache and FFN weight reads
To support real-time decoding at scale, a system must overcome two major bottlenecks during the decoding (aka generation) phase:
Key-Value (KV) cache streaming: When handling multi-million-token contexts, each GPU must read a massive history of past tokens (KV cache) from DRAM per sample. This constant streaming can, in turn, saturate DRAM bandwidth, increase token-to-token latency (TTL), and quickly become a major bottleneck as context length grows.
Feed-Forward Network (FFN) weight loading: During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. In low-latency scenarios with small batch sizes, this memory access cost is amortized over only a few samples, making FFN weight reads a dominant source of latency.
These two bottlenecks, KV cache streaming and FFN weight loading, are difficult to optimize simultaneously using traditional parallelism strategies.
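To get a feel for the scale of both bottlenecks, here is a rough back-of-envelope sketch. It assumes reads are purely bandwidth-limited, and every number in it (layer count, KV heads, head dimension, FFN weight size, DRAM bandwidth) is a hypothetical placeholder rather than a figure from this blog. The helper names are made up for illustration.

```python
# Back-of-envelope estimate of DRAM traffic per decoding step.
# All model and hardware numbers below are hypothetical, chosen only for illustration.

def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache read per sample to decode one token."""
    # The factor of 2 covers keys and values.
    return 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem

def read_time_ms(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on read time if the transfer is purely bandwidth-limited."""
    return 1e3 * num_bytes / bandwidth_bytes_per_s

def ffn_read_time_per_sample_ms(ffn_weight_bytes, bandwidth_bytes_per_s, batch_size):
    """FFN weights are read once per step and shared by every sample in the batch."""
    return read_time_ms(ffn_weight_bytes, bandwidth_bytes_per_s) / batch_size

if __name__ == "__main__":
    bandwidth = 8e12  # hypothetical: 8 TB/s of DRAM bandwidth on one GPU
    kv = kv_cache_bytes(context_len=1_000_000, num_layers=32, num_kv_heads=8,
                        head_dim=128, bytes_per_elem=2)
    print(f"KV cache per sample: {kv / 1e9:.1f} GB, "
          f"read time >= {read_time_ms(kv, bandwidth):.2f} ms")
    for batch in (1, 8, 64):
        t = ffn_read_time_per_sample_ms(100e9, bandwidth, batch)  # hypothetical 100 GB of FFN weights
        print(f"batch={batch}: FFN read time per sample >= {t:.3f} ms")
```

The KV cache term grows linearly with context length and is paid per sample, while the FFN term is paid once per step and only shrinks per sample as the batch grows. That is why small-batch, long-context decoding is squeezed from both sides.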
Let's take Tensor Parallelism (TP) as an example: Increasing TP can help reduce FFN stalls by distributing weight loading across multiple GPUs and improving TTL, but only up to a point. In attention schemes like Grouped Query Attention (GQA), used in Llama models, or Multi-head Latent Attention (MLA), found in DeepSeek models, multiple query heads share a limited number of KV heads. As illustrated in Figure 2(c), when TP exceeds the number of KV heads, the system ends up duplicating the multi-million-token KV cache per sample across GPUs for self-attention. As a result, KV read volume stays high even with increased TP, once again saturating DRAM bandwidth and limiting scalability. In the case of MLA, TP cannot exceed one without duplicating the KV cache.
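The duplication effect can be illustrated with a small sketch. It assumes KV heads are split evenly across TP ranks and that each rank must hold at least one full KV head once TP exceeds the KV head count. MLA is modeled here as a single shared KV head, which is a simplification. The function names are hypothetical.

```python
# Sketch of how KV cache duplication appears once TP exceeds the number of KV heads.
# Assumes even sharding; function names are illustrative, not from any library.

def kv_heads_per_gpu(num_kv_heads, tp):
    """KV heads each GPU must hold under tensor parallelism of width tp."""
    if tp <= num_kv_heads:
        if num_kv_heads % tp != 0:
            raise ValueError("this sketch assumes tp divides num_kv_heads")
        return num_kv_heads // tp
    # Beyond this point a GPU cannot hold less than one whole KV head.
    return 1

def duplication_factor(num_kv_heads, tp):
    """Total KV heads stored across all GPUs, relative to the model's KV head count."""
    return tp * kv_heads_per_gpu(num_kv_heads, tp) / num_kv_heads

if __name__ == "__main__":
    for tp in (1, 2, 4, 8, 16, 32, 64):
        print(f"GQA with 8 KV heads, TP={tp}: "
              f"{kv_heads_per_gpu(8, tp)} head(s) per GPU, "
              f"duplication x{duplication_factor(8, tp):g}")
    # MLA modeled as one shared KV head: any TP > 1 duplicates the cache.
    print(f"MLA-like, TP=4: duplication x{duplication_factor(1, 4):g}")
```

Per-GPU KV read volume stops shrinking as soon as the per-GPU head count bottoms out at one, so extra TP ranks past that point only add copies of the cache.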
So how can developers scale both model size and context length without sacrificing real-time interactivity? Helix Parallelism offers a path forward.
Helix execution flow
Helix is a hybrid sharding strategy that disaggregates the parallelism strategies of attention and FFNs in a temporal pipeline, effectively addressing both KV cache and FFN weight-read bottlenecks during multi-million-token decoding.
Figure 1 (below) shows how Helix orchestrates the execution of attention and FFN within a single transformer layer. Inspired by the structure of a DNA helix, Helix interweaves multiple dimensions of parallelism (KV, tensor, and expert) into a unified execution loop. By decoupling the parallelism strategies used for attention and FFN, Helix allows each stage to operate in a configuration tuned to its own bottleneck, all while reusing the same pool of GPUs. This reuse keeps GPUs efficiently utilized across stages, eliminating idle time as computation flows through the model.
Figure 1. Execution flow of Helix Parallelism. Helix reuses the same pool of N GPUs per layer by switching between N=KVPxTPA during attention and N=TPFxEP during FFN.
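The constraint in Figure 1, that the same N GPUs are factored as KVP x TPA for attention and as TPF x EP for FFN, can be sketched as a simple enumeration. The cap of TPA at the number of KV heads is an assumption drawn from the duplication discussion above, and the function names are hypothetical.

```python
# Enumerate candidate layouts for one pool of N GPUs.
# Attention uses N = KVP x TPA, FFN uses N = TPF x EP (as in Figure 1).
# Capping TPA at the KV head count is an assumption made for this sketch.

def factor_pairs(n):
    """All ordered pairs (a, b) of positive integers with a * b == n."""
    return [(a, n // a) for a in range(1, n + 1) if n % a == 0]

def helix_layouts(n_gpus, num_kv_heads):
    """Return (attention_layouts, ffn_layouts) for a pool of n_gpus."""
    attention = [(kvp, tpa) for kvp, tpa in factor_pairs(n_gpus) if tpa <= num_kv_heads]
    ffn = factor_pairs(n_gpus)  # each pair is (tpf, ep)
    return attention, ffn

if __name__ == "__main__":
    attention, ffn = helix_layouts(n_gpus=64, num_kv_heads=8)
    print("attention (KVP, TPA):", attention)
    print("ffn (TPF, EP):", ffn)
```

Each stage picks its own pair independently, which is the point of decoupling: attention can lean on KVP to split the sequence while FFN leans on TPF or EP to split weights, without changing the GPU count.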
Attention phase
Helix applies KV Parallelism (KVP) by sharding the multi-million-token KV cache along the sequence dimension across KVP GPUs, while applying Tensor Parallelism across attention heads (TPA), where TP