Attention Mechanism

This document describes the implementation of multi-head attention (MHA), multi-query attention (MQA), and grouped-query attention (GQA) for auto-regressive GPT-like models in TensorRT-LLM.

These are advanced attention mechanisms used in deep learning models, particularly for tasks involving sequences such as language processing.

Key Points

Attention Variants

MHA: A batched matrix multiplication, a softmax, and a second batched matrix multiplication, applied in sequence.
MQA & GQA: Variants of MHA with fewer key/value (K/V) heads than query heads, so several query heads share each K/V head. This reduces the size of the KV cache and the memory traffic needed to read it during generation.
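
To make the relationship between the three variants concrete, here is a minimal sketch in plain Python for a single query position. The function names and the list-based tensor layout are assumptions made for illustration and are not the TensorRT-LLM API. With as many K/V heads as query heads the function behaves as MHA, and with a single K/V head it behaves as MQA.

```python
import math

def softmax(xs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """One head, one query vector: softmax(q . K^T / sqrt(d)) . V"""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def grouped_query_attention(q_heads, k_heads, v_heads):
    """q_heads: [num_q_heads][head_dim]
    k_heads, v_heads: [num_kv_heads][seq_len][head_dim]
    (hypothetical layout chosen for readability)."""
    num_q, num_kv = len(q_heads), len(k_heads)
    assert num_q % num_kv == 0, "query heads must divide evenly into K/V groups"
    group = num_q // num_kv
    # Query head h reads the K/V head shared by its group.
    return [attend(q, k_heads[h // group], v_heads[h // group])
            for h, q in enumerate(q_heads)]
```

In this sketch the only difference between MHA, GQA, and MQA is how many entries k_heads and v_heads hold, which is also what determines how much K/V data has to be cached.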

Input Modes - Padded and Packed Tensors

Padded mode pads shorter sequences up to a maximum length, so memory is wasted on padding tokens.
Packed mode concatenates the sequences without padding and passes the individual sequence lengths to the system. It is more memory-efficient and is recommended over padded mode.
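
The difference between the two layouts can be sketched in a few lines of plain Python. The helper names and the pad_id value below are hypothetical and only illustrate the idea; they are not taken from TensorRT-LLM.

```python
def pad_sequences(sequences, pad_id=0):
    # Padded layout: every row is extended to the longest sequence.
    max_len = max(len(s) for s in sequences)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]

def pack_sequences(sequences):
    # Packed layout: tokens are concatenated and lengths are kept separately.
    tokens, lengths = [], []
    for seq in sequences:
        tokens.extend(seq)
        lengths.append(len(seq))
    return tokens, lengths

def sequence_offsets(lengths):
    # Cumulative offsets let a consumer find sequence i at
    # tokens[offsets[i]:offsets[i + 1]].
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets

batch = [[11, 12, 13], [21], [31, 32]]
padded = pad_sequences(batch)            # 3 rows x 3 columns = 9 slots
tokens, lengths = pack_sequences(batch)  # 6 tokens plus 3 lengths
offsets = sequence_offsets(lengths)
```

For this small batch the padded layout holds 9 token slots while the packed layout holds 6, and the gap grows with the spread of sequence lengths in the batch.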

Context and Generation Phases in Auto-Regressive Models

Context Phase: The implementation depends on the context_fmha_type setting. It either stores the intermediate Q*K^T tensor in memory or runs MHA/MQA in a single kernel, which uses the Flash Attention algorithm for longer sequences.
Generation Phase: Implemented as a single kernel that also handles pre-processing and applies techniques such as RoPE and quantization/dequantization.
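
A single-kernel implementation does not need to keep the full Q*K^T tensor in memory. One way to see why is the running (online) softmax that Flash Attention is built on: keys and values are consumed block by block while a running maximum, a running denominator, and a running weighted sum are rescaled as needed. The sketch below shows this for one query vector in plain Python. It illustrates the idea only and is not the TensorRT-LLM kernel; the function names and the block parameter are hypothetical.

```python
import math

def attention_reference(q, keys, values):
    # Materializes every score first, like the path that stores Q*K^T.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom
            for j in range(dim)]

def attention_streaming(q, keys, values, block=2):
    # Processes K/V in blocks and never holds more than one block of scores.
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first block this is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / denom for a in acc]
```

Both functions compute the same attention output up to floating-point rounding; the streaming version trades the stored score tensor for a small amount of rescaling work per block.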

Inflight Batching

This feature processes sequences in the context and generation phases together, which improves latency and GPU utilization. It requires packed input tensors.

KV Cache(s)

KV caches store past K and V elements to speed up the generation phase. There are two types: contiguous and paged KV caches.
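
The following sketch illustrates the paged idea, assuming a paged cache hands out fixed-size blocks from a shared pool and records them in a per-sequence block table, whereas a contiguous cache would reserve one large region per sequence. The class, its methods, and the block size are hypothetical and do not correspond to the TensorRT-LLM cache manager API.

```python
class PagedKVCache:
    """Toy paged KV cache: a shared pool of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [[None] * block_size for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # The current block is full (or this is the first token):
            # take a new block from the pool.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.blocks[table[-1]][n % self.block_size] = kv
        self.lengths[seq_id] = n + 1

    def read(self, seq_id):
        # Translate logical token positions to (block, slot) pairs.
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.get(seq_id, [])
        return [self.blocks[table[i // self.block_size]][i % self.block_size]
                for i in range(n)]

    def release(self, seq_id):
        # Finished sequences return their blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated on demand, sequences of very different lengths can share the pool without each one reserving space for the maximum length up front.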

Additional Features

Rotary Positional Embedding (RoPE): Integrated into the GPT attention operation for positional encoding.
ALiBi: Applied to the Q*K^T product.
Scaling Factors: Used in MHA for scaling the output of the Q*K^T product.
Cross Attention: Supports both self and cross-attention, making it suitable for a variety of decoder models.
Relative Attention Bias (RAB): Adds an attention bias based on relative positions, supporting both regular and implicit modes.
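
As an illustration of two of the features above, the sketch below applies RoPE to a single head vector and computes an ALiBi bias for one query/key pair. It makes several assumptions that are not taken from this document: adjacent dimensions are rotated together (models differ in how they pair dimensions), the RoPE base is 10000, and the ALiBi slopes follow the geometric sequence commonly used when the number of heads is a power of two. All function names are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    # Rotate each adjacent pair (vec[i], vec[i + 1]) by an angle that grows
    # with the token position and shrinks with the dimension index.
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def alibi_slopes(num_heads):
    # Assumed slope schedule for a power-of-two head count.
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(query_pos, key_pos, slope):
    # Linear penalty added to the Q*K^T score; larger for more distant keys.
    return -slope * (query_pos - key_pos)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

With this rotation, the dot product of a rotated query and a rotated key depends only on the distance between their positions, which is why RoPE can be applied to Q and K before the Q*K^T product, while the ALiBi bias is added to the product afterwards.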

Important Considerations

The document emphasizes the efficiency and memory benefits of using packed mode over padded mode.
The implementation and optimizations are geared towards improving performance and reducing latency in GPT-like models.
These enhancements matter most for workloads dominated by sequence processing and attention, such as large-scale language models.
