tensorrt_llm.functional.gpt_attention

The tensorrt_llm.functional.gpt_attention function performs multi-head attention in GPT-like (Generative Pre-trained Transformer) models within the TensorRT-LLM framework.

Attention is the mechanism transformer models use to capture relationships between different positions in the input sequence, which makes this function central to any GPT-like model. The parameters and usage of the function are described below:

Parameters:

  1. qkv (Tensor): The input Query-Key-Value (QKV) tensor with shape [batch_beam_size, max_seqlen, qkv_dim] in padded mode or [1, num_tokens, qkv_dim] in packed mode.

  2. past_key_value (Tensor): Stores Key-Value (KV) cache data, important for efficient sequential processing in transformer models.

  3. sequence_lengths (Tensor): Contains the length of each sequence.

  4. host_past_key_value_lengths (Tensor): Host tensor (on CPU) that stores the lengths of past key-value pairs.

  5. host_max_attention_window_sizes (Tensor): Host tensor that defines the maximum attention window size.

  6. context_lengths (Tensor): Stores the context-phase sequence length for each request.

  7. cache_indirection (Tensor): Helps reconstruct paths during beam search operations.

  8. host_request_types (Tensor): Indicates if a request is in context or generation phase.

  9. num_heads, num_kv_heads, hidden_size_per_head: Define the configuration of the attention heads.

  10. q_scaling (float): Value used to compute the scaling factor applied to the output of the Q*K^T product.

  11. rotary_embedding_dim, rotary_embedding_base, rotary_embedding_scale_type, rotary_embedding_scale, rotary_embedding_max_positions: Parameters for Rotary Positional Embeddings (RoPE).

  12. position_embedding_type (PositionEmbeddingType): Type of position embedding used.

  13. kv_orig_quant_scale, kv_quant_orig_scale: Scaling factors for quantization and dequantization in the KV cache.

  14. kv_cache_quant_mode (QuantMode): Specifies the quantization mode for the KV cache.

  15. max_context_length (int): Length of the longest input sequence.

  16. mask_type (AttentionMaskType): Type of mask used in the attention mechanism.

  17. alibi_slopes (Tensor): Slopes for ALiBi (Attention with Linear Biases).

  18. tp_size, tp_rank: Parameters for tensor parallelism.

  19. kv_cache_block_pointers, host_kv_cache_block_pointers: Block pointers for the KV cache.

  20. do_cross_attention, cross_qkv, cross_qkv_length, encoder_input_lengths: Parameters for cross-attention in encoder-decoder models.

  21. relative_attention_bias, max_distance: Parameters for relative attention bias.

  22. host_context_lengths (Tensor): Host tensor with lengths of different inputs.

  23. qkv_bias (Tensor): Bias for QKV.
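The two qkv layouts in parameter 1 are easier to compare with concrete numbers. The sketch below is plain Python and does not call TensorRT-LLM. It assumes that qkv_dim is the fused width (num_heads + 2 * num_kv_heads) * hidden_size_per_head and that the beam width is 1, so batch_beam_size equals the batch size. The helper names qkv_dim and qkv_shape are hypothetical and exist only for this illustration.

```python
def qkv_dim(num_heads: int, num_kv_heads: int, hidden_size_per_head: int) -> int:
    """Width of a fused QKV tensor (assumed layout: Q heads, then K heads, then V heads)."""
    return (num_heads + 2 * num_kv_heads) * hidden_size_per_head


def qkv_shape(seq_lens, num_heads, num_kv_heads, hidden_size_per_head, packed):
    """Expected qkv shape for a batch with the given per-sequence token counts."""
    dim = qkv_dim(num_heads, num_kv_heads, hidden_size_per_head)
    if packed:
        # Packed mode: all tokens are concatenated into one row, with no padding.
        return (1, sum(seq_lens), dim)
    # Padded mode: every sequence is padded to the length of the longest one.
    return (len(seq_lens), max(seq_lens), dim)


lens = [5, 3, 7]  # token counts of three requests
padded = qkv_shape(lens, num_heads=32, num_kv_heads=8, hidden_size_per_head=128, packed=False)
packed = qkv_shape(lens, num_heads=32, num_kv_heads=8, hidden_size_per_head=128, packed=True)
print(padded)  # (3, 7, 6144)
print(packed)  # (1, 15, 6144)
```

In this example padded mode stores 21 token slots while packed mode stores only the 15 real tokens, which is why the per-sequence length tensors are needed to locate sequence boundaries in packed mode.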

Returns:

A tuple of two tensors. The first tensor is the output of the attention layer. The second tensor is optional and carries the KV cache produced by the layer.

Usage:

This function is used to construct the attention mechanism of a GPT-like model in TensorRT-LLM. You typically call it while building the model, passing the necessary tensors (such as qkv and past_key_value) and configuration parameters (such as the number of heads and the scaling factors). The function then adds an attention layer operation to the model's computational graph.
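To make concrete what kind of operation ends up in the graph, here is a pure-Python reference for single-head causal attention. It is a sketch for intuition only, not the TensorRT-LLM implementation. It assumes a causal mask and assumes that q_scaling enters the computation as 1 / (q_scaling * sqrt(head_size)); the function name causal_attention is hypothetical.

```python
import math


def causal_attention(q, k, v, q_scaling=1.0):
    """Single-head causal attention over lists of float vectors (one per position)."""
    head_size = len(q[0])
    # Assumed convention: q_scaling divides the usual 1/sqrt(head_size) factor.
    norm_factor = 1.0 / (q_scaling * math.sqrt(head_size))
    out = []
    for i, qi in enumerate(q):
        # Causal mask: position i only attends to positions 0..i.
        scores = [norm_factor * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[j][d] for j, w in enumerate(weights))
            for d in range(len(v[0]))
        ])
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention(q, k, v)
print(out[0])  # the first token sees only itself: [1.0, 2.0]
print(out[1])  # a weighted mix of both value rows
```

Under this assumed convention, a larger q_scaling shrinks the scores before the softmax and therefore flattens the attention weights toward a uniform average.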

Points to Note:

  • The function is designed to support various modes and configurations, including different types of attention (self-attention, cross-attention), rotary positional embeddings, and quantization modes.

  • It's crucial to understand the role of each parameter and how they fit into your specific model architecture.

  • Given the complexity and the number of parameters, it's important to refer to the official TensorRT-LLM documentation for detailed explanations and examples.
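As one illustration of the configurations mentioned above, the following sketch shows how rotary positional embeddings use rotary_embedding_dim and rotary_embedding_base. It is a generic RoPE reference in plain Python, not the TensorRT-LLM kernel. It assumes adjacent entries are rotated as pairs (other implementations pair the first half with the second half), that rotary_embedding_dim is even and no larger than the vector length, and that no scaling is applied. The helper name apply_rope is hypothetical.

```python
import math


def apply_rope(vec, position, rotary_embedding_dim, rotary_embedding_base=10000.0):
    """Rotate the first rotary_embedding_dim entries of vec; leave the rest unchanged."""
    out = list(vec)
    # Assumed pairing: entries (0, 1), (2, 3), ... each form one 2-D rotation.
    for i in range(0, rotary_embedding_dim, 2):
        # The rotation frequency decays geometrically with the pair index.
        theta = position * rotary_embedding_base ** (-i / rotary_embedding_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
key = [0.5, -1.0, 2.0, 0.0, 1.0, 1.0]

# Only the first 4 entries are rotated; the last 2 pass through untouched.
rotated = apply_rope(query, position=3, rotary_embedding_dim=4)
print(rotated[4:])  # [5.0, 6.0]

# The query-key score depends only on the relative offset between positions.
score_a = dot(apply_rope(query, 5, 4), apply_rope(key, 3, 4))
score_b = dot(apply_rope(query, 2, 4), apply_rope(key, 0, 4))
print(abs(score_a - score_b) < 1e-9)
```

The final check demonstrates the property that makes RoPE useful for attention: shifting both positions by the same amount leaves the query-key dot product unchanged.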

In summary, tensorrt_llm.functional.gpt_attention is a highly configurable function for adding a crucial component (attention mechanism) to GPT-like models within the TensorRT-LLM framework. Its flexibility allows it to be adapted to various model configurations and requirements.
