# Attention Mechanism

This document describes the implementation of multi-head attention (MHA), multi-query attention (MQA), and group-query attention (GQA) for auto-regressive GPT-like models in TensorRT-LLM.

These attention mechanisms are used in deep learning models for sequence tasks such as language processing.

### <mark style="color:blue;">**Key Points**</mark>

<mark style="color:green;">**Attention Variants**</mark>

* **MHA**: A sequence of a batched matrix multiplication (Q\*K^T), a softmax, and a second batched matrix multiplication (with V).
* **MQA & GQA**: Variants of MHA with fewer key/value (K/V) heads than query heads: MQA shares a single K/V head across all query heads, while GQA shares each K/V head across a group of query heads. Both reduce the size of the KV cache and the memory traffic of the attention kernels (see the sketch below).
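
The difference between the three variants is easiest to see in a small sketch. The NumPy snippet below is a toy illustration only, not the TensorRT-LLM kernel, and the name `grouped_attention` is invented here: with `num_kv_heads == num_q_heads` it behaves as MHA, with `num_kv_heads == 1` as MQA, and anything in between is GQA.

```python
import numpy as np

def grouped_attention(q, k, v, num_q_heads, num_kv_heads):
    """Scaled dot-product attention with grouped K/V heads (toy sketch).

    q:    [num_q_heads, seq_len, head_dim]
    k, v: [num_kv_heads, seq_len, head_dim]
    """
    group_size = num_q_heads // num_kv_heads
    head_dim = q.shape[-1]
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv_h = h // group_size                          # several Q heads share one K/V head
        scores = q[h] @ k[kv_h].T / np.sqrt(head_dim)   # first batched matmul (Q*K^T)
        scores -= scores.max(axis=-1, keepdims=True)
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)      # softmax
        out[h] = probs @ v[kv_h]                        # second batched matmul (probs*V)
    return out

# Example: 8 query heads sharing 2 K/V heads (GQA).
q = np.random.randn(8, 16, 64)
k = np.random.randn(2, 16, 64)
v = np.random.randn(2, 16, 64)
print(grouped_attention(q, k, v, num_q_heads=8, num_kv_heads=2).shape)  # (8, 16, 64)
```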

<mark style="color:green;">**Input Modes - Padded and Packed Tensors**</mark>

* In padded mode, shorter sequences are padded to the length of the longest sequence in the batch, which wastes memory and compute on the padding tokens.
* In packed mode, sequences are concatenated without padding and the per-sequence lengths are passed alongside the tensor. It is more efficient and is recommended over padded mode (a layout sketch follows this list).
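
The toy NumPy snippet below illustrates the two layouts; it is purely for illustration and does not use the TensorRT-LLM API.

```python
import numpy as np

# Three sequences of token embeddings with different lengths.
hidden = 4
seqs = [np.random.randn(n, hidden) for n in (5, 2, 3)]

# Padded mode: every sequence is padded to max_len, wasting memory on padding rows.
max_len = max(s.shape[0] for s in seqs)
padded = np.zeros((len(seqs), max_len, hidden))
for i, s in enumerate(seqs):
    padded[i, : s.shape[0]] = s          # shape [batch, max_len, hidden] = [3, 5, 4]

# Packed mode: sequences are concatenated along the token dimension, and the
# per-sequence lengths are passed alongside the tensor.
packed = np.concatenate(seqs, axis=0)    # shape [total_tokens, hidden] = [10, 4]
seq_lengths = np.array([s.shape[0] for s in seqs])  # [5, 2, 3]

print(padded.shape, packed.shape, seq_lengths)
```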

<mark style="color:green;">**Context and Generation Phases in Auto-Regressive Models**</mark>

* **Context Phase**: Processes the input prompt. Its implementation depends on the `context_fmha_type` setting: it can either materialize the intermediate Q\*K^T tensor in memory, or run MHA/MQA in a single fused kernel, including the Flash Attention algorithm for larger sequences.
* **Generation Phase**: Produces output tokens one at a time. It is implemented as a single kernel that also handles pre-processing, applying techniques such as RoPE and quantization/dequantization (a per-step sketch follows this list).
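
The per-step behaviour of the generation phase can be illustrated with a toy single-head sketch: the new token's query attends over the K/V of all previous tokens, which are read from (and appended to) a cache. This is a NumPy illustration only, not the fused TensorRT-LLM kernel, and the name `generation_step` is invented here.

```python
import numpy as np

def generation_step(q_new, k_new, v_new, k_cache, v_cache):
    """One generation-phase step for a single head (illustrative sketch only).

    q_new, k_new, v_new: [1, head_dim]       -- projections of the newly generated token
    k_cache, v_cache:    [past_len, head_dim] -- K/V of all previous tokens
    """
    # Append the new token's K/V to the cache (contiguous layout for simplicity).
    k_cache = np.concatenate([k_cache, k_new], axis=0)
    v_cache = np.concatenate([v_cache, v_new], axis=0)

    head_dim = q_new.shape[-1]
    scores = q_new @ k_cache.T / np.sqrt(head_dim)   # [1, past_len + 1]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out = probs @ v_cache                            # [1, head_dim]
    return out, k_cache, v_cache

head_dim, past_len = 64, 10
out, k_cache, v_cache = generation_step(
    np.random.randn(1, head_dim), np.random.randn(1, head_dim), np.random.randn(1, head_dim),
    np.random.randn(past_len, head_dim), np.random.randn(past_len, head_dim))
print(out.shape, k_cache.shape)  # (1, 64) (11, 64)
```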

<mark style="color:green;">**Inflight Batching**</mark>

* This feature processes sequences that are in the context phase and sequences that are in the generation phase within the same batch, improving latency and GPU utilization. It requires packed input tensors (see the sketch below).
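
Conceptually, one inflight batch mixes the full prompt of a context-phase request with the single new tokens of generation-phase requests in the same packed tensor. The hypothetical NumPy snippet below only illustrates that layout, not the scheduler or the actual TensorRT-LLM interface.

```python
import numpy as np

hidden = 4
context_tokens = np.random.randn(6, hidden)   # request A: 6 prompt tokens (context phase)
gen_token_b = np.random.randn(1, hidden)      # request B: next token (generation phase)
gen_token_c = np.random.randn(1, hidden)      # request C: next token (generation phase)

# Packed input tensor covering all three requests in a single forward pass.
packed_batch = np.concatenate([context_tokens, gen_token_b, gen_token_c], axis=0)
token_counts = np.array([6, 1, 1])            # per-request token counts for this step

print(packed_batch.shape, token_counts)       # (8, 4) [6 1 1]
```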

<mark style="color:green;">**KV Cache(s)**</mark>

* KV caches store the K and V elements of past tokens so they do not have to be recomputed, which speeds up the generation phase. Two layouts are supported: contiguous KV caches and paged KV caches (a toy paged-cache sketch follows below).
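
A contiguous cache keeps each sequence's K/V in one large pre-allocated buffer, while a paged cache stores them in fixed-size blocks allocated on demand from a shared pool and addressed through a per-sequence block table. The toy Python class below sketches the paged idea; it is not the TensorRT-LLM data structure, and all names in it are invented for illustration.

```python
import numpy as np

class PagedKVCache:
    """Toy paged KV cache for a single head (illustrative sketch only)."""

    def __init__(self, num_blocks, block_size, head_dim):
        self.block_size = block_size
        self.k_pool = np.zeros((num_blocks, block_size, head_dim))
        self.v_pool = np.zeros((num_blocks, block_size, head_dim))
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of block indices
        self.lengths = {}        # sequence id -> number of tokens stored

    def append(self, seq_id, k, v):
        """Store the K/V vectors of one new token for sequence `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:          # current block full -> allocate a new one
            table.append(self.free_blocks.pop())
        block, offset = table[-1], length % self.block_size
        self.k_pool[block, offset] = k
        self.v_pool[block, offset] = v
        self.lengths[seq_id] = length + 1

    def gather(self, seq_id):
        """Reassemble the contiguous K/V of a sequence from its blocks."""
        table, length = self.block_tables[seq_id], self.lengths[seq_id]
        k = np.concatenate([self.k_pool[b] for b in table], axis=0)[:length]
        v = np.concatenate([self.v_pool[b] for b in table], axis=0)[:length]
        return k, v

cache = PagedKVCache(num_blocks=8, block_size=4, head_dim=64)
for _ in range(6):
    cache.append(seq_id=0, k=np.random.randn(64), v=np.random.randn(64))
k, v = cache.gather(seq_id=0)
print(k.shape)  # (6, 64): 6 tokens spread over 2 blocks of size 4
```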

<mark style="color:green;">**Additional Features**</mark>

* **Rotary Positional Embedding (RoPE)**: Positional encoding fused into the GPT attention operation (a generic sketch follows this list).
* **ALiBi**: A positional bias applied to the Q\*K^T product before the softmax.
* **Scaling Factors**: Applied in MHA to scale the Q\*K^T product before the softmax.
* **Cross Attention**: The operation supports both self-attention and cross-attention, making it suitable for a variety of decoder models.
* **Relative Attention Bias (RAB)**: Adds an attention bias based on relative positions, supporting both a regular and an implicit mode.
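
For reference, a generic RoPE formulation looks like the NumPy sketch below, using the half-split channel pairing; the exact layout and how it is fused inside the TensorRT-LLM kernel may differ, and `apply_rope` is a name invented here.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Apply rotary positional embedding to Q or K (generic sketch).

    x: [seq_len, head_dim] with head_dim even; positions: [seq_len] token positions.
    Channel pairs (i, i + head_dim/2) are rotated by angle position * base**(-2i/head_dim).
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)     # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]  # [seq_len, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]             # split into rotation pairs
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 64)
q_rot = apply_rope(q, positions=np.arange(16))
print(q_rot.shape)  # (16, 64)
```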

**Important Considerations:**

* The document emphasizes the efficiency and memory benefits of using packed mode over padded mode.
* The implementation and optimizations are geared towards improving performance and reducing latency in GPT-like models.
* These enhancements matter most for workloads with heavy sequence processing and attention, such as large-scale language models.
