KV Cache
KV (Key/Value) caches are a crucial optimization for the generation phase of auto-regressive models like GPT, used within the multi-head attention (MHA) mechanism. They speed up generation by storing the key (K) and value (V) tensors already computed for previous tokens, so those values do not have to be recomputed at every step. In TensorRT-LLM, each Transformer layer has its own KV cache. Let's delve into greater detail about these caches:
Contiguous KV Cache
Structure: This is a monolithic tensor, a single large block of memory.
Shape: The shape of this tensor is [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head].
Memory Usage: It tends to allocate more memory than necessary, especially when sequences are shorter than max_seqlen: every request reserves slots for the full maximum length, and it may take many generation steps to use that space, if it is ever fully used. The sketch below makes this concrete.
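To illustrate the over-allocation, here is a minimal sketch that estimates the per-layer footprint of a contiguous FP16 cache. All parameter values are hypothetical, chosen only for illustration.

```python
# Minimal sketch: per-layer size of a contiguous KV cache.
# All parameter values below are hypothetical, for illustration only.
max_batch_size = 8
max_beam_width = 1
num_heads = 32
max_seqlen = 4096
hidden_dim_per_head = 128
bytes_per_elem = 2  # FP16

# [max_batch_size * max_beam_width, 2, num_heads, max_seqlen, hidden_dim_per_head]
cache_bytes = (max_batch_size * max_beam_width * 2 * num_heads
               * max_seqlen * hidden_dim_per_head * bytes_per_elem)
print(f"Per-layer contiguous KV cache: {cache_bytes / 2**20:.0f} MiB")  # 512 MiB

# A request that stops after 512 tokens still reserves all max_seqlen slots,
# so only 512 / 4096 = 12.5% of its slice is ever used.
print(f"Fraction used by a 512-token request: {512 / max_seqlen:.1%}")
```

With these numbers, a single layer reserves 512 MiB regardless of how long the sequences actually turn out to be.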
Paged KV Cache
Design: The paged KV cache breaks the cache into smaller fixed-size blocks, each holding the K and V values for a set number of tokens.
Management: These blocks are managed by a cache manager that assigns blocks to requests on demand and recycles them once the requests complete (a toy version of this pattern is sketched below).
Efficiency: This approach is more memory-efficient than the contiguous KV cache, especially for variable-length sequences, because memory is committed only as sequences actually grow.
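The following toy block manager illustrates the allocate-on-demand / recycle-on-completion pattern. It is not TensorRT-LLM's actual implementation; the class and parameter names (BlockManager, tokens_per_block) are hypothetical.

```python
# Toy paged-KV block manager: not TensorRT-LLM's implementation,
# just a sketch of the allocate-on-demand / recycle-on-completion pattern.
class BlockManager:
    def __init__(self, num_blocks: int, tokens_per_block: int):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))   # pool of free block ids
        self.request_blocks = {}                     # request id -> block ids

    def append_token(self, request_id: int, num_tokens: int) -> None:
        """Grab a new block whenever the request crosses a block boundary."""
        blocks = self.request_blocks.setdefault(request_id, [])
        needed = -(-num_tokens // self.tokens_per_block)  # ceil division
        while len(blocks) < needed:
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted")
            blocks.append(self.free_blocks.pop())

    def release(self, request_id: int) -> None:
        """Recycle all of a request's blocks once it finishes."""
        self.free_blocks.extend(self.request_blocks.pop(request_id, []))
```

Because blocks return to the shared pool as soon as a request completes, short sequences never hold memory sized for the longest possible one.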
INT8/FP8 KV Caches
Support for Lower Precision: Although the GPT attention operator typically works with FP32, FP16, and BFloat16 activations, TensorRT-LLM supports storing the KV cache in INT8 or FP8.
Quantization and Dequantization:
For quantization, inputs are scaled to 8 bits using the kv_orig_quant_scale tensor before being written to the cache.
During generation, values read from the cache are dequantized on the fly in the MHA/MQA kernel using the kv_quant_orig_scale tensor.
Scaling Factors: The scaling factors are critical here: kv_orig_quant_scale maps values from the original precision down to 8 bits, and kv_quant_orig_scale maps them back, so the two tensors hold reciprocal scales. The sketch below illustrates the relationship.
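A minimal sketch of the INT8 case, assuming kv_quant_orig_scale is the reciprocal of kv_orig_quant_scale; the scale and data values are illustrative, and in practice this work happens inside the fused attention kernels.

```python
import numpy as np

# Sketch of INT8 KV-cache scaling. The real quantization happens inside the
# fused MHA/MQA kernels; the reciprocal relationship between the two scale
# tensors is the point being illustrated here.
kv_orig_quant_scale = np.float32(0.25)                       # original -> INT8
kv_quant_orig_scale = np.float32(1.0) / kv_orig_quant_scale  # INT8 -> original

k = np.array([12.7, -3.1, 600.0], dtype=np.float32)  # illustrative K values

# Write path: scale, round, and clamp to the INT8 range before caching.
k_int8 = np.clip(np.rint(k * kv_orig_quant_scale), -128, 127).astype(np.int8)

# Read path: dequantize on the fly when attention consumes the cache.
k_restored = k_int8.astype(np.float32) * kv_quant_orig_scale
print(k_int8, k_restored)  # [  3  -1 127] [ 12.  -4. 508.]
```

Note how the out-of-range value saturates at 127 and is restored to 508 rather than 600: choosing the scale to match the actual dynamic range of K and V is what keeps this error small.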
Working Principle
In the Generation Phase: The GPT attention operator uses these caches to store the K and V values that were computed in previous steps.
Role in the Attention Mechanism: When the model processes a new token, it reads the cached values instead of recomputing K and V for the entire preceding sequence, significantly speeding up inference, especially for long sequences. A toy decode step showing this pattern follows.
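The sketch below shows a toy single-head decode step: at each step, only the new token's K and V are computed and appended, while everything else comes from the cache. The dimensions and random weights are purely illustrative.

```python
import numpy as np

# Toy single-head attention decode step showing the caching pattern:
# K/V for past tokens come from the cache; only the new token's K/V
# are computed and appended. Shapes and weights are illustrative.
d = 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Attend from the new token to all cached tokens plus itself."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)  # compute K/V once, reuse on later steps
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(3):                  # three generation steps
    out = decode_step(rng.standard_normal(d))
print(len(k_cache), out.shape)      # 3 (4,)
```

Without the cache, each step would recompute K and V for every earlier token, making the per-step cost grow with sequence length instead of staying constant.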
Importance in Auto-regressive Models
Efficiency: KV caches play a key role in optimizing generation in auto-regressive models, as they avoid redundantly recomputing K and V for all previous tokens every time a new token is generated.
Memory Management: Choosing between contiguous and paged caches depends on the specific requirements of memory efficiency and the nature of the sequences being processed.
In summary, KV Caches in TensorRT-LLM offer a sophisticated way to manage the computational load in the attention mechanism of Transformer models. By intelligently caching and reusing key and value pairs, they enable faster and more memory-efficient generation of new tokens in sequences, which is critical for the performance of large-scale models like GPT.