Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation history, legal assistants reasoning through gigabytes of case law the size of an entire encyclopedia set, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.
The growing demand to decode such massive amounts of data, and to let multiple GPUs scale and communicate with each other quickly, underscores the importance of FP4 compute and the large, high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. Helix Parallelism, introduced in this blog, is co-designed with Blackwell. It enables up to a 32x increase in the number of concurrent users at a given latency, compared to the best-known prior parallelism methods for real-time decoding with ultra-long context.
In other words, it lets AI agents and virtual assistants serve more people, faster than ever before.
(Note: Context in this blog refers to the sequence of previously generated tokens, whose intermediate key and value representations are stored as KV cache and accessed at every decoding step.)
Decoding bottlenecks: KV cache and FFN weight reads
To support real-time decoding at scale, a system must overcome two major bottlenecks during the decoding (aka generation) phase:
- Key-Value (KV) cache streaming: When handling multi-million-token contexts, each GPU must read a massive history of past tokens (KV cache) from DRAM per sample. This constant streaming can, in turn, saturate DRAM bandwidth, increase token-to-token latency (TTL), and quickly become a major bottleneck as context length grows.
- Feed-Forward Network (FFN) weight loading: During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. In low-latency scenarios with small batch sizes, this memory access cost is not well amortized, making FFN weight reads a dominant source of latency.
These two bottlenecks, KV cache streaming and FFN weight loading, are difficult to optimize simultaneously using traditional parallelism strategies.
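To make the scale of the first bottleneck concrete, here is a back-of-the-envelope calculation of the KV cache traffic a single sample generates per decoding step. The model dimensions and bandwidth figure below are hypothetical, chosen only for illustration rather than taken from any specific model:

```python
# Back-of-the-envelope estimate of per-sample KV cache traffic per decode step.
# All dimensions below are illustrative assumptions, not a real model config.
context_len   = 1_000_000   # tokens of history (KV cache length)
num_layers    = 64          # transformer layers
num_kv_heads  = 8           # KV heads (GQA)
head_dim      = 128         # dimension per head
bytes_per_val = 2           # FP16/BF16 KV cache entries

kv_bytes = context_len * num_layers * num_kv_heads * head_dim * 2 * bytes_per_val  # K and V
print(f"KV cache read per sample per token: {kv_bytes / 1e9:.1f} GB")   # ~262 GB

dram_bw = 8e12  # ~8 TB/s HBM bandwidth per GPU (illustrative)
print(f"Lower bound on TTL from KV reads alone: {kv_bytes / dram_bw * 1e3:.1f} ms")  # ~33 ms
```

Even under these rough assumptions, KV cache streaming alone can consume tens of milliseconds per token on a single GPU, before any compute or FFN weight reads are accounted for.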
Let’s take Tensor Parallelism (TP) as an example: Increasing TP can help reduce FFN stalls by distributing weight loading across multiple GPUs and improving TTL, but only up to a point. In attention schemes like Grouped Query Attention (GQA)—used in Llama models—or Multi-Latent Attention (MLA)—found in DeepSeek models—multiple query heads share a limited number of KV heads. As illustrated in Figure 2(c), when TP exceeds the number of KV heads, the system ends up duplicating the multi-million-token KV cache per sample across GPUs for self-attention. As a result, KV read volume stays high even with increased TP, once again saturating DRAM bandwidth and limiting scalability. In the case of MLA, the upper limit for TP is just one to avoid duplication of KV cache.
So how can developers scale both model size and context length without sacrificing real-time interactivity? Helix Parallelism offers a path forward.
Helix execution flow
Helix is a hybrid sharding strategy that disaggregates the parallelism strategies of attention and FFNs in a temporal pipeline, effectively addressing both KV cache and FFN weight-read bottlenecks during multi-million-token decoding.
Figure 1 (below) shows how Helix orchestrates the execution of attention and FFN within a single transformer layer. Inspired by the structure of a DNA helix, Helix interweaves multiple dimensions of parallelism (KV, tensor, and expert) into a unified execution loop. By decoupling the parallelism strategies used for attention and FFN, Helix allows each stage to operate in a configuration tuned to its own bottleneck, all while reusing the same pool of GPUs. This reuse approach keeps GPUs efficiently utilized across stages, eliminating idle time as computation flows through the model.

Attention phase
Helix applies KV Parallelism (KVP) by sharding the multi-million-token KV cache along the sequence dimension across KVP GPUs, while applying Tensor Parallelism across attention heads (TPA). Here, TPA is the number of GPUs each QKV projection is split across, and it is kept less than or equal to the number of KV heads to avoid KV cache duplication.
This sharding strategy is also illustrated in Figure 2(d) through a simplified toy example. It results in a total of N = KVP x TPA GPUs collaborating on the attention computation without duplicating the KV cache across GPUs. N here represents the total number of GPUs used for end-to-end inference execution; the same set of N GPUs is reused in the FFN phase.
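The toy helper below sketches this layout under assumed values (a hypothetical 32-GPU configuration with KVP = 8, TPA = 4, and eight KV heads); it simply maps a GPU index to the slice of the sequence and the subset of KV heads that GPU owns.

```python
# Minimal sketch of the Helix attention-phase layout (hypothetical helper, not the
# actual implementation): N = KVP x TPA GPUs, where each GPU owns a contiguous slice
# of the sequence (KVP dimension) and a subset of KV heads (TPA dimension).
def attention_shard(gpu_id, KVP=8, TPA=4, context_len=1_000_000, num_kv_heads=8):
    kvp_rank, tpa_rank = divmod(gpu_id, TPA)
    seq_chunk = context_len // KVP
    heads_per_rank = num_kv_heads // TPA          # TPA <= num_kv_heads, so no duplication
    return {
        "kv_tokens": (kvp_rank * seq_chunk, (kvp_rank + 1) * seq_chunk),
        "kv_heads":  (tpa_rank * heads_per_rank, (tpa_rank + 1) * heads_per_rank),
    }

print(attention_shard(gpu_id=13))  # KVP rank 3, TPA rank 1 on the assumed 8x4 layout
```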

To avoid a pre-attention all-gather, Helix ensures that each KVP GPU holds all query heads associated with its local KV head(s) and redundantly computes QKV projections. This enables fully local FlashAttention on each KV shard.
After local FlashAttention, a single all-to-all along the query-head dimension across KVP GPUs exchanges partial attention outputs and log-sum-exp scalars. Importantly, this communication cost scales with batch size and hidden dimension and is independent of the KV cache length, making it efficient even as context length scales into the multi-millions of tokens. Each GPU then locally reconstructs the exact softmax-normalized outputs.
This all-to-all also triggers the reprovisioning of KVP GPUs into a TP group (TP = N = KVP x TPA) for the attention output linear computation. Critically, this all-to-all phase benefits from NVLink/NVL72’s high-bandwidth interconnect, enabling fast collective communication across large GPU counts. Figure 2 provides a high-level overview of different sharding schemes in attention and their corresponding layout.
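A minimal sketch of the reconstruction step is shown below, assuming each KVP GPU has already produced a partial FlashAttention output and its log-sum-exp values over its local KV shard (tensor names and shapes here are illustrative):

```python
import torch

def combine_partial_attention(partial_outs, partial_lses):
    """Recombine per-shard FlashAttention results into the exact global output.

    partial_outs: [num_shards, num_heads, head_dim] partial outputs, each softmax-
                  normalized only over its own KV shard.
    partial_lses: [num_shards, num_heads] log-sum-exp of the attention logits per shard.
    """
    global_lse = torch.logsumexp(partial_lses, dim=0)              # [num_heads]
    weights = torch.exp(partial_lses - global_lse).unsqueeze(-1)   # per-shard renormalization
    return (weights * partial_outs).sum(dim=0)                     # [num_heads, head_dim]
```

Because each shard's output only needs to be rescaled by its share of the global softmax denominator, the exchange carries just the partial outputs and one scalar per head, independent of how many tokens live in each shard.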
To further reduce TTL, we introduce Helix overlap pipeline-batch-wise (HOP-B), a fine-grained pipelining technique that overlaps communication and computation across the batch, as illustrated in Figure 3. As soon as the attention output for one token is computed, Helix launches the all-to-all exchange for that token, while simultaneously computing attention for the next. This tight overlap hides communication latency behind useful work, keeping GPU utilization high and further accelerating real-time decoding.

In Figure 3 (above), the top shows that without HOP-B, eight requests perform attention computation in lockstep, followed by sequential all-to-all communication. The bottom shows that with HOP-B, communication for one request overlaps with compute for the next, reducing TTL through fine-grained pipelining.
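A hedged sketch of this overlap pattern using CUDA streams in PyTorch is shown below. Here `attention_fn` and `alltoall_fn` are hypothetical placeholders for the local FlashAttention kernel and the cross-KVP all-to-all exchange, and a real implementation also has to manage cross-stream memory lifetimes.

```python
import torch

def decode_step_hopb(requests, attention_fn, alltoall_fn):
    # attention_fn and alltoall_fn are hypothetical placeholders for the local
    # FlashAttention kernel and the cross-KVP all-to-all exchange.
    compute_stream = torch.cuda.current_stream()
    comm_stream = torch.cuda.Stream()
    outputs = [None] * len(requests)
    for i, req in enumerate(requests):
        attn_out = attention_fn(req)              # attention for request i on the compute stream
        comm_stream.wait_stream(compute_stream)   # comm waits only for work enqueued so far
        with torch.cuda.stream(comm_stream):
            # The all-to-all for request i runs here while the next loop iteration
            # launches attention for request i+1 on the compute stream.
            outputs[i] = alltoall_fn(attn_out)
    compute_stream.wait_stream(comm_stream)       # join before the post-attention linear
    return outputs
```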
FFN phase
After attention, the same pool of N = KVP x TPA GPUs is reprovisioned without idle time to execute the FFN block. The output from the all-to-all step is already partitioned across N GPUs by hidden dimension, allowing the post-attention linear projection to run immediately in TP mode (TP = N). Each GPU performs a local matrix multiply using its weight shard and participates in an all-reduce across TP = N GPUs to construct the correct output.
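Conceptually, this is a standard row-parallel linear layer. The sketch below, with illustrative names and shapes, shows the local matmul followed by the all-reduce:

```python
import torch
import torch.distributed as dist

def post_attention_linear_tp(local_hidden, local_weight_shard, tp_group=None):
    """Sketch of the TP = N post-attention projection (names and shapes are illustrative).

    local_hidden:       [batch, hidden // N]  slice produced by the all-to-all
    local_weight_shard: [hidden // N, hidden] row shard of the projection weight
    """
    partial = local_hidden @ local_weight_shard                      # local partial product
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=tp_group)   # sum partials across TP = N GPUs
    return partial                                                   # full [batch, hidden] output on every GPU
```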
After the post-attention linear projection, Helix reconfigures the same N GPUs for FFN computation using either a 1D TP layout (N = TPF) in dense models or a 2D TP x Expert Parallel grid (N = TPF x EP) in MoE models.
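The reprovisioning itself can be thought of as re-indexing the same N GPU ranks, as in the illustrative helper below (the specific TPF and EP values are hypothetical):

```python
# Illustrative reprovisioning of the same N GPUs for the FFN phase.
# Dense model: 1D tensor parallelism with TPF = N.
# MoE model:   a 2D grid with N = TPF x EP (values below are assumptions for illustration).
def ffn_layout(gpu_id, N=32, TPF=8, EP=4, moe=True):
    if not moe:
        return {"tp_rank": gpu_id, "tp_size": N}      # 1D TP over the dense FFN
    assert N == TPF * EP
    ep_rank, tpf_rank = divmod(gpu_id, TPF)
    return {"ep_rank": ep_rank, "tpf_rank": tpf_rank}  # expert group x TP within each expert

print(ffn_layout(gpu_id=13))   # {'ep_rank': 1, 'tpf_rank': 5}
```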
Distributed KV concatenation
During decoding, each new token is broadcast to all KVP GPUs for query computation. To prevent DRAM hotspots, Helix staggers KV cache updates across KVP ranks in a round-robin fashion, e.g., tokens 1 through 16 go to KVP rank 0, tokens 17 through 32 to KVP rank 1, and so on. This ensures uniform KV growth, balances memory usage across GPUs, and maintains consistent throughput, regardless of sequence length or batch size.
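A minimal sketch of this placement rule, using the 16-token chunk size from the example above (the real chunk size may differ) and 0-indexed token positions:

```python
# Round-robin placement of newly generated tokens' KV entries across KVP ranks.
# The 16-token chunk size follows the example in the text; actual values may differ.
def kvp_rank_for_token(token_idx, num_kvp_ranks, chunk_size=16):
    return (token_idx // chunk_size) % num_kvp_ranks

# Tokens 0-15 land on KVP rank 0, 16-31 on rank 1, ..., wrapping around so the
# cache grows uniformly across all KVP GPUs as decoding proceeds.
print([kvp_rank_for_token(t, num_kvp_ranks=4) for t in (0, 15, 16, 47, 64)])
# -> [0, 0, 1, 2, 0]
```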
Simulated results on Blackwell
Helix sets a new performance benchmark for long-context LLM decoding. Figure 4 (below) shows the normalized throughput-latency Pareto frontier for DeepSeek-R1 671B during decoding with a (hypothetical) 1-million-token context. The Pareto frontier is derived from an exhaustive simulation over thousands of configurations, systematically varying model partitioning strategies (TP, EP, PP, and KVP) and batch sizes to find the best throughput-latency tradeoffs. In particular:
- For a fixed latency budget, Helix can improve the number of concurrent users by up to 32x (i.e., achieving 32x higher tokens/s/GPU).
- For low concurrency settings, Helix can improve user interactivity by up to 1.5x (i.e., reducing minimum achievable TTL by up to 1.5x).
These gains were made possible by sharding both the KV cache and FFN weights across all available devices, dramatically reducing DRAM pressure and improving compute efficiency. Helix pushes the throughput-latency Pareto frontier, allowing higher throughput even at lower latency. For further details, please refer to the paper.

Stay tuned
Helix Parallelism, co-designed with Blackwell’s latest capabilities, provides a blueprint for serving multi-million-token models at scale without compromising interactivity. Stay tuned as we bring this optimization to inference frameworks.