Attention Mechanism
This document describes the implementation of multi-head attention (MHA), multi-query attention (MQA), and group-query attention (GQA) for auto-regressive GPT-like models in TensorRT-LLM.
These attention mechanisms are used in deep learning models, particularly for sequence tasks such as language processing.
Key Points
Attention Variants
MHA: A batched matrix multiplication (Q*K^T), a softmax, and a second batched matrix multiplication (with V).
MQA & GQA: Variants of MHA that use fewer key/value (K/V) heads than query heads, reducing the size of the K/V tensors and the associated memory traffic and compute.
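As a rough illustration of both points, the following PyTorch sketch (not the TensorRT-LLM kernels; all shapes and names are illustrative) shows the batched matmul / softmax / batched matmul pattern and how MQA/GQA share each K/V head across a group of query heads:

```python
import torch

def gqa_attention(q, k, v):
    """Toy MHA/GQA/MQA sketch.
    q:    [batch, num_q_heads, seq_len, head_dim]
    k, v: [batch, num_kv_heads, seq_len, head_dim], num_kv_heads <= num_q_heads
    MHA: num_kv_heads == num_q_heads; MQA: num_kv_heads == 1; GQA: in between.
    """
    group_size = q.shape[1] // k.shape[1]
    # Broadcast each K/V head to its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    # Batched matmul, softmax, then another batched matmul.
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.matmul(torch.softmax(scores, dim=-1), v)

# Example: 16 query heads sharing 4 K/V heads (GQA).
q = torch.randn(2, 16, 10, 64)
k = torch.randn(2, 4, 10, 64)
v = torch.randn(2, 4, 10, 64)
out = gqa_attention(q, k, v)   # [2, 16, 10, 64]
```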
Input Modes - Padded and Packed Tensors
Padded mode pads shorter sequences to the maximum length, which wastes memory on padding tokens.
Packed mode concatenates the sequences without padding and passes the per-sequence lengths explicitly. It is more efficient and is recommended over padded mode.
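A minimal sketch of the two layouts (the sizes and variable names below are made up): packed mode stores only the real tokens plus the sequence lengths, while padded mode allocates every sequence at the maximum length.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sequences of hidden states with different lengths (hidden size 64).
seqs = [torch.randn(5, 64), torch.randn(2, 64), torch.randn(7, 64)]
seq_lengths = torch.tensor([s.shape[0] for s in seqs])              # [5, 2, 7]

# Padded mode: 3 * 7 = 21 token slots are allocated for 14 real tokens.
padded = pad_sequence(seqs, batch_first=True)                        # [3, 7, 64]

# Packed mode: tokens are concatenated with no padding; cumulative
# sequence lengths tell the kernels where each sequence starts and ends.
packed = torch.cat(seqs, dim=0)                                      # [14, 64]
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                        seq_lengths.cumsum(0)])                      # [0, 5, 7, 14]
```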
Context and Generation Phases in Auto-Regressive Models
Context Phase: Has different implementations depending on the context_fmha_type setting. It can either store the intermediate Q*K^T tensor in memory or compute MHA/MQA in a single fused kernel, including a Flash Attention style algorithm for longer sequences.
Generation Phase: Implemented as a single kernel that also handles pre-processing steps such as applying RoPE and quantization/dequantization.
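Conceptually, the context phase is a prefill step over the whole prompt that also produces the initial K/V tensors, and the generation phase is a per-token decode step that reuses them. The PyTorch sketch below is only a mental model of the two phases, not the fused single-kernel implementations described above.

```python
import torch

def context_phase(q, k, v):
    """Prefill: causal attention over the full prompt.
    Shapes: [batch, heads, prompt_len, head_dim]; returns the output and the
    K/V tensors that seed the KV cache."""
    seq_len = q.shape[-2]
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    out = torch.matmul(torch.softmax(scores, dim=-1), v)
    return out, (k, v)

def generation_step(q_new, k_new, v_new, kv_cache):
    """Decode: one query token; append its K/V to the cache and attend over
    everything seen so far. q_new/k_new/v_new: [batch, heads, 1, head_dim]."""
    k_cache = torch.cat([kv_cache[0], k_new], dim=2)
    v_cache = torch.cat([kv_cache[1], v_new], dim=2)
    scores = torch.matmul(q_new, k_cache.transpose(-2, -1)) / q_new.shape[-1] ** 0.5
    out = torch.matmul(torch.softmax(scores, dim=-1), v_cache)
    return out, (k_cache, v_cache)
```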
Inflight Batching
This feature batches requests that are in the context phase together with requests that are in the generation phase, reducing latency and improving GPU utilization. It requires packed input tensors.
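A purely illustrative picture of one in-flight batch (the field names below are hypothetical, not the actual plugin inputs): requests still in the context phase contribute their whole packed prompt, requests already in the generation phase contribute one token each, and all of them are served by a single forward pass.

```python
# Request A is in the context phase (6-token prompt); B and C are generating.
inflight_batch = {
    "input_ids":          [11, 53, 99, 17, 4, 8,  42,  7],  # packed: 6 + 1 + 1 tokens
    "tokens_per_request": [6, 1, 1],
    "past_kv_lengths":    [0, 12, 30],  # KV cache already held by each request
}
# One forward pass over this packed batch advances all three requests, so
# generation-phase requests are not stalled behind a separate context-only batch.
```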
KV Cache(s)
KV caches store the K and V elements of past tokens to speed up the generation phase. There are two variants: contiguous and paged KV caches.
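To contrast the two variants: a contiguous KV cache reserves one large tensor per sequence up front, whereas a paged KV cache hands out fixed-size blocks on demand and records them in a per-sequence block table. The sketch below is a toy model of the paged layout; the names and sizes do not correspond to the actual TensorRT-LLM classes.

```python
import torch

# Toy paged KV cache: one shared pool of fixed-size blocks.
num_blocks, tokens_per_block, num_heads, head_dim = 16, 4, 8, 64
k_pool = torch.zeros(num_blocks, num_heads, tokens_per_block, head_dim)
v_pool = torch.zeros_like(k_pool)

# Each sequence owns a list of block indices instead of one contiguous slab,
# so memory is allocated on demand as the sequence grows.
block_table = {0: [3, 7], 1: [1]}   # seq 0 uses blocks 3 and 7, seq 1 uses block 1

def append_kv(seq_id, pos, k_vec, v_vec):
    """Write the K/V vectors of token `pos` of sequence `seq_id` into its page.
    k_vec, v_vec: [num_heads, head_dim]."""
    block = block_table[seq_id][pos // tokens_per_block]
    slot = pos % tokens_per_block
    k_pool[block, :, slot] = k_vec
    v_pool[block, :, slot] = v_vec

append_kv(0, 5, torch.randn(num_heads, head_dim), torch.randn(num_heads, head_dim))
```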
Additional Features
Rotary Positional Embedding (RoPE): Integrated into the GPT attention operation for positional encoding (see the sketch after this list).
ALiBi: A positional bias added to the Q*K^T product.
Scaling Factors: Used in MHA to scale the Q*K^T product before the softmax (typically by 1/sqrt(head_size)).
Cross Attention: Supports both self-attention and cross-attention, making it suitable for a variety of decoder models.
Relative Attention Bias (RAB): Adds an attention bias based on relative positions, supporting both regular and implicit modes.
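As an illustration of the RoPE step mentioned above (the attention operation fuses it into its kernels; this standalone PyTorch sketch is only conceptual and the helper name is made up):

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Rotate pairs of channels of q or k by a position-dependent angle.
    x: [..., seq_len, head_dim] with even head_dim; positions: [seq_len]."""
    head_dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = positions[:, None].float() * inv_freq[None, :]    # [seq_len, head_dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Typically applied to Q and K before the Q*K^T product.
q = torch.randn(2, 8, 10, 64)
q = apply_rope(q, torch.arange(10))
```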
Important Considerations
The document emphasizes the efficiency and memory benefits of using packed mode over padded mode.
The implementation and optimizations are geared towards improving performance and reducing latency in GPT-like models.
These enhancements are significant for tasks requiring heavy sequence processing and attention mechanisms, like large-scale language models.