KV Cache

KV (key/value) caches are a crucial optimization in the generation phase of auto-regressive models such as GPT, where they are used within the multi-head attention (MHA) mechanism. They speed up generation by storing the key (K) and value (V) elements computed for earlier tokens, so those elements are not recomputed at every step. In TensorRT-LLM, each Transformer layer has its own KV cache. The sections below describe these caches in more detail.

Contiguous KV Cache

Structure: This is a monolithic tensor, a single large block of memory.
Shape: The shape of this tensor is [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head].
Memory Usage: The tensor is allocated for the maximum sequence length up front, so it uses more memory than necessary whenever sequences are shorter than max_seqlen. This over-allocation can be inefficient, since it may take many generation steps before the reserved space is fully used.
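
To make the over-allocation concrete, the following sketch computes the size of one layer's contiguous cache from the shape given above. The configuration values and the helper names (contiguous_kv_cache_shape, cache_bytes) are hypothetical and chosen only for illustration; they are not part of TensorRT-LLM.

```python
from math import prod


def contiguous_kv_cache_shape(max_batch_size, max_beam_width, num_heads,
                              max_seqlen, hidden_dim_per_head):
    """Shape of one layer's contiguous KV cache, as listed above."""
    return (max_batch_size * max_beam_width, 2, num_heads,
            max_seqlen, hidden_dim_per_head)


def cache_bytes(shape, bytes_per_element=2):
    """Total bytes for a tensor of `shape` (2 bytes per element for FP16)."""
    return prod(shape) * bytes_per_element


# Hypothetical configuration, for illustration only.
shape = contiguous_kv_cache_shape(max_batch_size=8, max_beam_width=1,
                                  num_heads=32, max_seqlen=2048,
                                  hidden_dim_per_head=128)
allocated = cache_bytes(shape)

# If every sequence currently holds 256 tokens, only this fraction of the
# allocation contains useful data.
used_fraction = 256 / 2048

print(shape)
print(allocated / 2**20, "MiB per layer")
print(used_fraction)
```

With these assumed numbers, each layer reserves 256 MiB while only one eighth of it holds data at 256 tokens per sequence.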

Paged KV Cache

Design: The paged KV cache breaks down the cache into smaller blocks.
Management: These blocks are managed by a cache manager that assigns blocks to requests and recycles them as needed.
Efficiency: This approach is more memory-efficient than the contiguous KV cache, especially for variable-length sequences, because blocks are allocated as a sequence grows rather than reserved for the maximum length up front.
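
The block assignment and recycling described above can be sketched with a toy allocator. This is a minimal illustration under assumed names (PagedKVCacheManager, append_token, release, tokens_per_block), not the cache manager that ships with TensorRT-LLM; it tracks block ids only and stores no tensor data.

```python
class PagedKVCacheManager:
    """Toy block allocator: hands out fixed-size blocks and recycles them."""

    def __init__(self, num_blocks, tokens_per_block):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of block ids
        self.num_tokens = {}    # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token, allocating a block when needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.tokens_per_block:
            # All blocks owned by this request are full (or it owns none yet).
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def release(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


manager = PagedKVCacheManager(num_blocks=8, tokens_per_block=4)
for _ in range(5):
    manager.append_token("a")  # 5 tokens need 2 blocks
for _ in range(3):
    manager.append_token("b")  # 3 tokens need 1 block

blocks_a = len(manager.block_tables["a"])
blocks_b = len(manager.block_tables["b"])
free_before_release = len(manager.free_blocks)

manager.release("a")           # request "a" finished, blocks are recycled
free_after_release = len(manager.free_blocks)
print(blocks_a, blocks_b, free_before_release, free_after_release)
```

Each request holds only as many blocks as its current length requires, which is where the memory saving over a contiguous allocation comes from.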

INT8/FP8 KV Caches

Support for Lower Precision: Although the GPT attention operator typically works with FP32, FP16, and BFloat16 inputs, TensorRT-LLM also supports INT8 and FP8 KV caches.
Quantization and Dequantization:
- For quantization, the values written to the cache are quantized to 8 bits using the scaling factor stored in the kv_orig_quant_scale tensor.
- During generation, values read from the cache are dequantized on-the-fly in the MHA/MQA kernel using the scaling factor stored in the kv_quant_orig_scale tensor.
Scaling Factor: The two scaling factors are stored in their respective tensors: kv_orig_quant_scale converts original values to quantized ones, and kv_quant_orig_scale converts quantized values back.
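
The round trip through an INT8 cache can be illustrated in plain Python. This sketch assumes a single scale for the whole tensor, derived from the largest absolute value in a small sample; how the scale is actually calibrated, and the function names quantize_int8 and dequantize_int8, are assumptions for illustration. In TensorRT-LLM the equivalent work happens inside GPU kernels.

```python
def quantize_int8(values, kv_orig_quant_scale):
    """Scale floats into the INT8 range, round, and clamp to [-128, 127]."""
    return [max(-128, min(127, round(v * kv_orig_quant_scale)))
            for v in values]


def dequantize_int8(quantized, kv_quant_orig_scale):
    """Map INT8 values back to approximate floats."""
    return [q * kv_quant_orig_scale for q in quantized]


# Hypothetical K or V values for a few elements.
values = [0.5, -1.2, 0.25, 2.0]

# Assumed calibration: map the largest magnitude onto 127.
amax = max(abs(v) for v in values)
kv_orig_quant_scale = 127.0 / amax  # original -> quantized
kv_quant_orig_scale = amax / 127.0  # quantized -> original

quantized = quantize_int8(values, kv_orig_quant_scale)
restored = dequantize_int8(quantized, kv_quant_orig_scale)
max_error = max(abs(a - b) for a, b in zip(values, restored))

print(quantized)
print(restored)
print(max_error)
```

The two scales are reciprocals of each other, and the rounding error per element stays within half of one quantization step (kv_quant_orig_scale / 2) for values inside the calibrated range.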

Working Principle

In Generation Phase: The GPT attention operator uses these caches to store the values of K and V that have been computed in previous steps.
Role in Attention Mechanism: When the model processes a new token, it computes K and V only for that token and reads the cached K and V of all earlier tokens instead of recomputing them. This significantly speeds up inference, especially for longer sequences.
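
The saving can be shown with a small single-head example that counts how often K and V are computed with and without a cache. Everything here is a simplified sketch: project_kv is a hypothetical stand-in for the real K and V projections, the token embedding doubles as the query, and LayerKVCache is an illustrative class rather than a TensorRT-LLM type.

```python
import math

projection_calls = [0]  # mutable counter of K/V projections performed


def project_kv(embedding):
    """Stand-in for the K and V projections (assumed: identity and doubling)."""
    projection_calls[0] += 1
    return list(embedding), [2.0 * x for x in embedding]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for a stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class LayerKVCache:
    """Holds the K and V vectors of all tokens processed so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        """Append the new token's K/V, then attend over the whole cache."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With a cache: one K/V projection per generated token.
cache = LayerKVCache()
for emb in embeddings:
    key, value = project_kv(emb)
    cached_out = cache.step(emb, key, value)
calls_with_cache = projection_calls[0]

# Without a cache: every step re-projects all tokens seen so far.
projection_calls[0] = 0
for t in range(1, len(embeddings) + 1):
    pairs = [project_kv(emb) for emb in embeddings[:t]]
    recomputed_out = attend(embeddings[t - 1],
                            [k for k, _ in pairs], [v for _, v in pairs])
calls_without_cache = projection_calls[0]

print(calls_with_cache, calls_without_cache)
print(cached_out == recomputed_out)
```

Both paths produce the same attention output, but the number of projections grows linearly with sequence length when a cache is used and quadratically when it is not.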

Importance in Auto-regressive Models

Efficiency: KV caches play a key role in optimizing the generation process in auto-regressive models, as they avoid redundant computation of K and V for each new token generated.
Memory Management: Choosing between contiguous and paged caches depends on the specific requirements of memory efficiency and the nature of the sequences being processed.

In summary, KV caches in TensorRT-LLM reduce the computational load of the attention mechanism in Transformer models. By caching and reusing key and value pairs, they enable faster generation of new tokens, while the paged and INT8/FP8 variants reduce the memory the cache requires. Both effects are critical for the performance of large-scale models like GPT.

