Attention Mechanism

This document describes the implementation of multi-head attention (MHA), multi-query attention (MQA), and group-query attention (GQA) for auto-regressive GPT-like models in TensorRT-LLM.

These attention mechanisms are widely used in deep learning models for sequence tasks such as language modelling.

Key Points

Attention Variants

  • MHA: The standard attention computation, expressed as a batched matrix multiplication (Q·K^T), a softmax, and another batched matrix multiplication (with V).

  • MQA & GQA: Variants of MHA with fewer key/value (K/V) heads than query heads: each group of query heads shares a single K/V head (one shared head in MQA, several groups in GQA). This reduces the size of the KV cache and the memory traffic during generation. A minimal sketch of the computation follows this list.
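
To make the variants concrete, here is a minimal NumPy sketch of the attention computation (a batched matmul, a softmax, and another batched matmul). The head counts and dimensions are illustrative, not TensorRT-LLM defaults: with num_kv_heads == num_q_heads this reduces to MHA, with num_kv_heads == 1 to MQA, and anything in between is GQA.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_q_heads, num_kv_heads):
    """Toy single-sequence attention: batched matmul -> softmax -> batched matmul.

    q: [num_q_heads, seq_len, head_dim]
    k, v: [num_kv_heads, seq_len, head_dim]
    num_kv_heads == num_q_heads -> MHA, == 1 -> MQA, in between -> GQA.
    """
    group_size = num_q_heads // num_kv_heads
    head_dim = q.shape[-1]
    # Each group of query heads shares one K/V head.
    k = np.repeat(k, group_size, axis=0)                     # [num_q_heads, seq_len, head_dim]
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)    # batched Q*K^T, scaled
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v                                          # batched matmul with V

# Illustrative sizes (not TensorRT-LLM defaults).
seq_len, head_dim, num_q_heads, num_kv_heads = 8, 64, 8, 2
q = np.random.randn(num_q_heads, seq_len, head_dim).astype(np.float32)
k = np.random.randn(num_kv_heads, seq_len, head_dim).astype(np.float32)
v = np.random.randn(num_kv_heads, seq_len, head_dim).astype(np.float32)
out = grouped_query_attention(q, k, v, num_q_heads, num_kv_heads)
print(out.shape)  # (8, 8, 64)
```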

Input Modes - Padded and Packed Tensors

  • Padded mode pads shorter sequences up to the maximum sequence length, which wastes memory and compute on padding tokens.

  • Packed mode concatenates the sequences into a single tensor and passes the per-sequence lengths alongside it. It is more memory-efficient and is the recommended mode. Both layouts are sketched after this list.
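
As an illustration of the two layouts (shapes and names here are hypothetical, not the TensorRT-LLM API), the sketch below builds a padded tensor of shape [batch, max_len, hidden] and the equivalent packed tensor of shape [total_tokens, hidden], together with the sequence-length vector that accompanies it.

```python
import numpy as np

hidden = 4
sequences = [np.random.randn(n, hidden).astype(np.float32) for n in (3, 5, 2)]
lengths = np.array([s.shape[0] for s in sequences], dtype=np.int32)

# Padded layout: every sequence is padded to the longest one -> wasted memory.
max_len = int(lengths.max())
padded = np.zeros((len(sequences), max_len, hidden), dtype=np.float32)
for i, s in enumerate(sequences):
    padded[i, : s.shape[0]] = s

# Packed layout: sequences are concatenated along the token dimension and the
# per-sequence lengths (or cumulative offsets) are passed alongside the tensor.
packed = np.concatenate(sequences, axis=0)                 # [total_tokens, hidden]
offsets = np.concatenate(([0], np.cumsum(lengths)))        # where each sequence starts

print(padded.shape, packed.shape, lengths, offsets)
# (3, 5, 4)      (10, 4)        [3 5 2]  [ 0  3  8 10]
```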

Context and Generation Phases in Auto-Regressive Models

  • Context Phase: Has different implementations depending on the context_fmha_type setting. It can either store the intermediate Q·K^T tensor in memory or use a single fused kernel for MHA/MQA, based on the Flash Attention algorithm for longer sequences.

  • Generation Phase: Implemented as a single kernel that also handles pre-processing steps such as applying RoPE and quantization/dequantization. A conceptual sketch of the two phases follows this list.
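
The difference between the two phases can be sketched as follows; this is a conceptual NumPy illustration of what the kernels compute, not the fused implementation itself. The context phase attends over all prompt tokens at once (with a causal mask), while the generation phase processes one new token per step against the K/V produced so far.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

head_dim, prompt_len = 64, 6
scale = 1.0 / np.sqrt(head_dim)

# Context phase: all prompt tokens at once (full Q*K^T matrix, causal mask).
q_ctx = np.random.randn(prompt_len, head_dim).astype(np.float32)
k_ctx = np.random.randn(prompt_len, head_dim).astype(np.float32)
v_ctx = np.random.randn(prompt_len, head_dim).astype(np.float32)
scores = q_ctx @ k_ctx.T * scale
causal = np.where(np.arange(prompt_len)[None, :] > np.arange(prompt_len)[:, None],
                  -np.inf, 0.0)                     # mask out future tokens
ctx_out = softmax(scores + causal) @ v_ctx          # [prompt_len, head_dim]

# Generation phase: one query token per step, attending over the cached K/V.
q_gen = np.random.randn(1, head_dim).astype(np.float32)
k_new = np.random.randn(1, head_dim).astype(np.float32)
v_new = np.random.randn(1, head_dim).astype(np.float32)
k_cache = np.concatenate([k_ctx, k_new], axis=0)    # past K plus the new token
v_cache = np.concatenate([v_ctx, v_new], axis=0)
gen_out = softmax(q_gen @ k_cache.T * scale) @ v_cache   # [1, head_dim]

print(ctx_out.shape, gen_out.shape)
```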

Inflight Batching

  • This feature batches sequences that are still in the context phase together with sequences that are in the generation phase, improving latency and GPU utilization. It requires packed input tensors; a sketch of the idea follows.
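
A hedged sketch of the idea: requests still in their context phase contribute all of their prompt tokens to the packed input, while requests in the generation phase contribute one token each, and the attention operator is told, per request, which phase it is in and how many tokens belong to it. The names and structures below are illustrative only, not TensorRT-LLM's actual interface.

```python
import numpy as np

hidden = 4

# Two requests still in the context phase (full prompts) and two in the
# generation phase (one new token each), combined into one packed batch.
context_requests = [np.random.randn(n, hidden).astype(np.float32) for n in (5, 3)]
generation_requests = [np.random.randn(1, hidden).astype(np.float32) for _ in range(2)]

packed_input = np.concatenate(context_requests + generation_requests, axis=0)
tokens_per_request = np.array([5, 3, 1, 1], dtype=np.int32)
is_context_phase = np.array([True, True, False, False])

print(packed_input.shape)                 # (10, 4): 5 + 3 + 1 + 1 tokens in one batch
print(tokens_per_request, is_context_phase)
```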

KV Cache(s)

  • KV caches store the K and V elements of past tokens so they do not have to be recomputed during the generation phase. There are two variants: contiguous and paged KV caches (both sketched below).
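
The sketch below contrasts the two layouts under hypothetical names: a contiguous cache reserves space for the maximum sequence length up front, whereas a paged cache hands out fixed-size blocks on demand and records them in a per-sequence block table. This illustrates the general idea, not TensorRT-LLM's actual data structures.

```python
import numpy as np

num_kv_heads, head_dim, max_seq_len = 2, 64, 512

# Contiguous KV cache: one big buffer per sequence, sized for the worst case.
contiguous_k = np.zeros((max_seq_len, num_kv_heads, head_dim), dtype=np.float32)

# Paged KV cache: a pool of fixed-size blocks plus a per-sequence block table.
block_size, num_blocks = 64, 32
block_pool_k = np.zeros((num_blocks, block_size, num_kv_heads, head_dim), dtype=np.float32)
free_blocks = list(range(num_blocks))
block_table = []          # blocks owned by this sequence, in order

def append_k(position, k_vec):
    """Write the K vector for token `position`, allocating a block if needed."""
    block_idx, offset = divmod(position, block_size)
    if block_idx == len(block_table):
        block_table.append(free_blocks.pop(0))    # allocate a new block on demand
    block_pool_k[block_table[block_idx], offset] = k_vec

for pos in range(70):                              # 70 tokens -> 2 blocks of 64
    append_k(pos, np.random.randn(num_kv_heads, head_dim).astype(np.float32))

print(contiguous_k.nbytes, block_table)            # full-size buffer vs. [0, 1]
```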

Additional Features

  • Rotary Positional Embedding (RoPE): Integrated into the GPT attention operation for positional encoding.

  • ALiBi: A position-dependent bias added to the Q*K^T product before the softmax.

  • Scaling Factors: The Q*K^T product is scaled before the softmax (by 1/sqrt(head_dim) in standard MHA); the scaling and the ALiBi bias are illustrated in the sketch after this list.

  • Cross Attention: The attention operator supports both self-attention and cross-attention, making it suitable for a variety of decoder models.

  • Relative Attention Bias (RAB): Adds an attention bias based on relative positions, supporting both regular and implicit modes.
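
To illustrate how the scaling factor and the ALiBi bias enter the computation, here is a minimal single-head NumPy sketch; the slope value and shapes are illustrative. The scores are scaled by 1/sqrt(head_dim) and a bias that grows linearly with the query/key distance is added before the softmax.

```python
import numpy as np

seq_len, head_dim = 6, 64
q = np.random.randn(seq_len, head_dim).astype(np.float32)
k = np.random.randn(seq_len, head_dim).astype(np.float32)

# Scaling factor applied to the Q*K^T product (1/sqrt(head_dim) in standard MHA).
scores = q @ k.T / np.sqrt(head_dim)

# ALiBi: add a bias proportional to the distance between key and query positions.
slope = 0.5                                          # illustrative per-head slope
positions = np.arange(seq_len)
alibi_bias = slope * (positions[None, :] - positions[:, None])  # key_pos - query_pos
scores = scores + alibi_bias                         # bias added before the softmax

print(scores.shape)  # (6, 6)
```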

Important Considerations

  • The document emphasizes the efficiency and memory benefits of using packed mode over padded mode.

  • The implementation and optimizations are geared towards improving performance and reducing latency in GPT-like models.

  • These enhancements are significant for tasks requiring heavy sequence processing and attention mechanisms, like large-scale language models.
