
KV Cache

KV (Key/Value) caches are a crucial optimization in the generation phase of auto-regressive models like GPT, used within the multi-head attention (MHA) mechanism. By storing the key (K) and value (V) tensors already computed for previous tokens, they avoid recomputing them at every decoding step and significantly speed up generation. In TensorRT-LLM, each Transformer layer has its own KV cache. Let's delve into greater detail about these caches:

Contiguous KV Cache

  • Structure: This is a monolithic tensor, a single large block of memory.

  • Shape: The shape of this tensor is [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head].

  • Memory Usage: Because the tensor is sized for the maximum sequence length up front, it tends to use more memory than necessary when sequences are shorter than that maximum. This over-allocation is inefficient, since it may take many generation steps to fully utilize the reserved space.
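
To make the over-allocation concrete, here is a rough sizing sketch for a single layer's contiguous cache. It assumes an FP16 cache (2 bytes per element), and the model dimensions are invented for the example rather than taken from any particular TensorRT-LLM configuration.

```python
# Illustrative sketch: memory footprint of one layer's contiguous KV cache.
# The tensor shape follows [max_batch_size * max_beam_width, 2, num_heads,
# max_seqlen, hidden_dim_per_head]; all concrete numbers below are assumptions.

def contiguous_kv_cache_bytes(max_batch_size, max_beam_width, num_heads,
                              max_seqlen, hidden_dim_per_head, dtype_bytes=2):
    return (max_batch_size * max_beam_width * 2 * num_heads
            * max_seqlen * hidden_dim_per_head * dtype_bytes)

# Cache sized for max_seqlen=4096 vs. what a batch of 512-token requests
# actually touches.
allocated = contiguous_kv_cache_bytes(8, 1, 32, 4096, 128)
touched = contiguous_kv_cache_bytes(8, 1, 32, 512, 128)
print(f"allocated: {allocated / 2**30:.2f} GiB, touched: {touched / 2**30:.2f} GiB")
```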

Paged KV Cache

  • Design: The paged KV cache breaks down the cache into smaller blocks.

  • Management: These blocks are managed by a cache manager that assigns blocks to requests and recycles them as needed.

  • Efficiency: This approach is more memory-efficient compared to the contiguous KV cache, especially for variable-length sequences.
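
As a toy illustration of this bookkeeping, the sketch below hands out fixed-size blocks from a shared pool and recycles them when a request finishes. The class and method names are invented for the example; they are not the TensorRT-LLM cache-manager API.

```python
# Toy paged-KV bookkeeping: requests receive blocks on demand and return them
# to the free pool when they finish. A real manager also tracks the mapping
# from logical token positions to physical blocks for the attention kernels.

class ToyBlockManager:
    def __init__(self, num_blocks, tokens_per_block):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))   # ids of unused blocks
        self.request_blocks = {}                     # request_id -> [block ids]
        self.request_tokens = {}                     # request_id -> cached tokens

    def append_token(self, request_id):
        """Reserve cache space for one more token of a request."""
        tokens = self.request_tokens.get(request_id, 0)
        if tokens % self.tokens_per_block == 0:      # current block full (or none yet)
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; request must wait")
            self.request_blocks.setdefault(request_id, []).append(self.free_blocks.pop())
        self.request_tokens[request_id] = tokens + 1

    def release(self, request_id):
        """Recycle a finished request's blocks back into the free pool."""
        self.free_blocks.extend(self.request_blocks.pop(request_id, []))
        self.request_tokens.pop(request_id, None)
```

Because blocks are allocated only as sequences grow, a short sequence holds just the blocks it actually needs instead of a full max-length slab.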

INT8/FP8 KV Caches

  • Support for Lower Precision: Although the GPT attention operator typically works with FP32, FP16, and BFloat16 activations, TensorRT-LLM also supports INT8 and FP8 KV caches to reduce the cache's memory footprint.

  • Quantization and Dequantization:

    • For quantization, K and V values are scaled to 8 bits using the kv_orig_quant_scale tensor before being written to the cache.

    • During generation, values read from the cache are dequantized on-the-fly in the MHA/MQA kernel using the kv_quant_orig_scale tensor.

  • Scaling Factor: The scaling factors are critical to accuracy; they are stored in the two tensors above and applied consistently during quantization and dequantization.
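
The round trip can be sketched in a few lines of NumPy. The per-tensor scales below play the roles of kv_orig_quant_scale and kv_quant_orig_scale; in TensorRT-LLM this work is fused into the attention kernels rather than done in Python.

```python
import numpy as np

def quantize_kv(kv, kv_orig_quant_scale):
    """Scale original-precision K/V values into int8 before caching them."""
    return np.clip(np.round(kv * kv_orig_quant_scale), -128, 127).astype(np.int8)

def dequantize_kv(kv_int8, kv_quant_orig_scale):
    """Recover approximate original-precision values when reading the cache."""
    return kv_int8.astype(np.float32) * kv_quant_orig_scale

# Example: pick the scale so the observed value range maps onto [-127, 127].
kv = np.random.randn(4, 8).astype(np.float32)      # stand-in K/V slice
amax = np.abs(kv).max()
kv_orig_quant_scale = 127.0 / amax
kv_quant_orig_scale = amax / 127.0
roundtrip = dequantize_kv(quantize_kv(kv, kv_orig_quant_scale), kv_quant_orig_scale)
print("max abs error:", np.abs(kv - roundtrip).max())
```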

Working Principle

  • In Generation Phase: The GPT attention operator uses these caches to store the values of K and V that have been computed in previous steps.

  • Role in Attention Mechanism: When the model processes a new token, it utilizes these cached values instead of recomputing them, significantly speeding up the inference process, especially for longer sequences.
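
A minimal single-head sketch of this pattern: K and V are computed only for the newest token and appended to the cache, then attention runs over everything cached so far. The projection matrices are random placeholders rather than trained weights, and a real model would do this per head and per layer.

```python
import numpy as np

d = 64                                     # head dimension (illustrative)
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
k_cache, v_cache = [], []                  # grow by one entry per generated token

def attend_with_cache(x_new):
    """x_new: hidden state of the newly generated token, shape (d,)."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)             # K/V computed for the new token only
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)                  # (t, d): every token seen so far
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # attention output for the new token

for _ in range(5):                         # toy generation loop
    out = attend_with_cache(np.random.randn(d))
```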

Importance in Auto-regressive Models

  • Efficiency: KV caches play a key role in optimizing the generation process in auto-regressive models, as they avoid redundant computation of K and V for each new token generated.

  • Memory Management: Choosing between contiguous and paged caches depends on the specific requirements of memory efficiency and the nature of the sequences being processed.

In summary, KV Caches in TensorRT-LLM offer a sophisticated way to manage the computational load in the attention mechanism of Transformer models. By intelligently caching and reusing key and value pairs, they enable faster and more memory-efficient generation of new tokens in sequences, which is critical for the performance of large-scale models like GPT.
