
Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation, legal assistants reasoning through gigabytes of case law (as much text as an entire encyclopedia set), or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.
The growing demand to decode such massive amounts of data, and to let multiple GPUs quickly scale and communicate with each other, underscores the importance of FP4 compute and the high-bandwidth large NVLink domain provided by NVIDIA Blackwell systems. Helix Parallelism, introduced in this blog, is co-designed with Blackwell. It enables up to a 32x increase in the number of concurrent users at a given latency, compared to the best known prior parallelism methods for real-time decoding with ultra-long context.
In other words, it lets AI agents and virtual assistants serve more people, faster than ever before.
(Note: Context in this blog refers to the sequence of previously generated tokens, whose intermediate key and value representations are stored as KV cache and accessed at every decoding step.)
Decoding bottlenecks: KV cache and FFN weight reads
To support real-time decoding at scale, a system must overcome two major bottlenecks during the decoding (aka generation) phase:
Key-Value (KV) cache streaming: When handling multi-million-token contexts, each GPU must read a massive history of past tokens (KV cache) from DRAM per sample. This constant streaming can, in turn, saturate DRAM bandwidth, increase token-to-token latency (TTL), and quickly become a major bottleneck as context length grows.
Feed-Forward Network (FFN) weight loading: During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. In low-latency scenarios with small batch sizes, this memory access cost is not well amortized, making FFN weight reads a dominant source of latency.
These two bottlenecks, KV cache streaming and FFN weight loading, are difficult to optimize simultaneously using traditional parallelism strategies.
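As a rough illustration of why these two reads pull in different directions, the sketch below estimates a lower bound on per-step latency when DRAM reads are the only cost. The function names and every number in the usage lines are illustrative assumptions, not measurements or Blackwell specifications, and the model ignores compute, communication, and attention weight reads.

```python
def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Size of one sample's KV cache (hypothetical dense layout)."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem


def decode_step_read_ms(kv_bytes_per_sample, ffn_weight_bytes, batch_size, dram_bytes_per_s):
    """Lower bound on one decoding step if DRAM reads were the only cost."""
    # The KV cache is read once per sample; FFN weights are read once and shared by the batch.
    total_bytes = batch_size * kv_bytes_per_sample + ffn_weight_bytes
    return 1e3 * total_bytes / dram_bytes_per_s


# Illustrative numbers only (assumed, not taken from any real model or GPU).
kv = kv_cache_bytes(context_len=1_000_000, num_layers=32, num_kv_heads=8,
                    head_dim=128, bytes_per_elem=1)
for batch in (1, 8):
    ms = decode_step_read_ms(kv, ffn_weight_bytes=20e9, batch_size=batch,
                             dram_bytes_per_s=8e12)
    print(f"batch={batch}: at least {ms:.1f} ms per token")
```

In this toy model, growing the batch amortizes the FFN term but multiplies the KV term, which is why long contexts and small-batch latency targets are hard to satisfy together.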
Let's take Tensor Parallelism (TP) as an example: Increasing TP can help reduce FFN stalls by distributing weight loading across multiple GPUs and improving TTL, but only up to a point. In attention schemes like Grouped Query Attention (GQA), used in Llama models, or Multi-head Latent Attention (MLA), found in DeepSeek models, multiple query heads share a limited number of KV heads. As illustrated in Figure 2(c), when TP exceeds the number of KV heads, the system ends up duplicating the multi-million-token KV cache per sample across GPUs for self-attention. As a result, KV read volume stays high even with increased TP, once again saturating DRAM bandwidth and limiting scalability. With MLA, TP cannot exceed one without duplicating the KV cache.
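The duplication effect can be sketched with a little arithmetic. The helpers below are hypothetical and assume a TP-only layout in which every GPU must hold at least one whole KV head; MLA is modeled as a single shared KV head.

```python
def kv_fraction_per_gpu(num_kv_heads: int, tp: int) -> float:
    """Fraction of one sample's KV cache that each GPU reads under TP-only sharding."""
    # Ceil division, floored at one head: a GPU cannot hold less than one whole KV head.
    heads_per_gpu = max(1, -(-num_kv_heads // tp))
    return heads_per_gpu / num_kv_heads


def kv_replication_factor(num_kv_heads: int, tp: int) -> float:
    """Total KV bytes held across all TP ranks, relative to a single copy."""
    return tp * kv_fraction_per_gpu(num_kv_heads, tp)


# With 8 KV heads, the per-GPU fraction stops shrinking once TP reaches 8.
for tp in (1, 4, 8, 16, 64):
    print(tp, kv_fraction_per_gpu(8, tp), kv_replication_factor(8, tp))
```

Past TP equal to the number of KV heads, the per-GPU read volume is flat while the replication factor grows linearly, so adding GPUs no longer relieves DRAM bandwidth for attention.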
So how can developers scale both model size and context length without sacrificing real-time interactivity? Helix Parallelism offers a path forward.
Helix execution flow
Helix is a hybrid sharding strategy that disaggregates the parallelism strategies of attention and FFNs in a temporal pipeline, effectively addressing both KV cache and FFN weight-read bottlenecks during multi-million-token decoding.
Figure 1 (below) shows how Helix orchestrates the execution of attention and FFN within a single transformer layer. Inspired by the structure of a DNA helix, Helix interweaves multiple dimensions of parallelism (KV, tensor, and expert) into a unified execution loop. By decoupling the parallelism strategy used for attention and FFN, Helix allows each stage to operate in a configuration tuned to its own bottleneck, all while reusing the same pool of GPUs. This reuse keeps GPUs efficiently utilized across stages, eliminating idle time as computation flows through the model.
Figure 1. Execution flow of Helix Parallelism. Helix reuses the same pool of N GPUs per layer by switching between N = KVP x TPA during attention and N = TPF x EP during FFN.
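To make the pool-reuse idea concrete, the sketch below enumerates the ways a fixed pool of N GPUs could be split as KVP x TPA for attention and as TPF x EP for the FFN. The function names are hypothetical, and the filter keeping TPA at or below the number of KV heads is an assumption drawn from the TP discussion above rather than a rule stated here.

```python
def factor_pairs(n: int) -> list[tuple[int, int]]:
    """All ordered pairs (a, b) of positive integers with a * b == n."""
    return [(a, n // a) for a in range(1, n + 1) if n % a == 0]


def helix_layouts(n_gpus: int, num_kv_heads: int):
    """Candidate (KVP, TPA) and (TPF, EP) splits of the same GPU pool."""
    # Assumed constraint: keep TPA <= number of KV heads so the KV cache is not duplicated.
    attention = [(kvp, tpa) for kvp, tpa in factor_pairs(n_gpus) if tpa <= num_kv_heads]
    ffn = factor_pairs(n_gpus)  # (TPF, EP); no extra constraint is modeled here
    return attention, ffn


attention, ffn = helix_layouts(n_gpus=8, num_kv_heads=2)
print("attention (KVP, TPA):", attention)
print("ffn (TPF, EP):", ffn)
```

Both lists cover the same N GPUs, so a layer can pick one attention layout and one FFN layout independently without any device sitting idle in either phase.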
Attention phase
Helix applies KV Parallelism (KVP) by sharding the multi-million-token KV cache along the sequence dimension across KVP GPUs, while applying Tensor Parallelism across attention heads (TPA), where TP