Best Practices for Tuning the Performance of TensorRT-LLM
Introduction
TensorRT-LLM is a powerful library for optimizing and deploying large language models (LLMs) on NVIDIA GPUs.
To achieve optimal performance, it is essential to understand the various build options and runtime configurations available in TensorRT-LLM.
This tutorial will provide an in-depth explanation of the key optimization techniques and best practices for tuning the performance of TensorRT-LLM models.
Optimization Types and Build Options
GPT Attention Plugin and Context Fused Multi-Head Attention
Arguments/Python Classes:
--gpt_attention_plugin: Enables the GPT attention plugin
--context_fmha: Enables fused multi-head attention during the context phase
Explanation: The GPT attention plugin utilizes efficient kernels and enables in-place updates of the KV cache, reducing memory consumption and eliminating unnecessary memory copy operations. Enabling fused multi-head attention during the context phase triggers a single kernel that performs the MHA/MQA/GQA block, further optimizing performance.
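As a concrete illustration, here is a minimal sketch that assembles a build command with these two flags. It assumes a trtllm-build-style CLI; the checkpoint and engine paths are placeholders, and the exact flag spellings and accepted values (for example, whether the plugin flag takes a dtype) can differ between TensorRT-LLM versions.

```python
import shlex

# Hypothetical paths; replace with your converted checkpoint and engine output directory.
checkpoint_dir = "/models/my_model/trt_ckpt"
engine_dir = "/models/my_model/trt_engine"

# Assumes a trtllm-build-style CLI; check `trtllm-build --help` for your version.
build_cmd = [
    "trtllm-build",
    "--checkpoint_dir", checkpoint_dir,
    "--output_dir", engine_dir,
    "--gpt_attention_plugin", "float16",  # efficient attention kernels, in-place KV cache updates
    "--context_fmha", "enable",           # single fused MHA/MQA/GQA kernel in the context phase
]
print(shlex.join(build_cmd))  # run with subprocess.run(build_cmd, check=True) once verified
```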
Remove Input Padding
Arguments/Python Classes:
--remove_input_padding: Enables the removal of input padding
Explanation: Removing input padding packs different tokens together, reducing computations and memory consumption. When input padding is removed, the maximum number of tokens can be set to a lower value, allowing for more efficient memory allocation and execution of requests.
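The effect on the token budget can be seen with some rough arithmetic; the numbers below are purely illustrative.

```python
# Rough illustration of why removing padding lets the token budget be lowered.
# All numbers are hypothetical.
max_batch_size = 16
max_input_len = 1024
avg_input_len = 300   # what requests actually look like in practice

padded_tokens = max_batch_size * max_input_len   # every sequence padded to the max
packed_tokens = max_batch_size * avg_input_len   # only real tokens once padding is removed

print(f"padded budget: {padded_tokens} tokens")  # 16384
print(f"packed budget: {packed_tokens} tokens")  # 4800
```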
Maximum Number of Tokens
Arguments/Python Classes:
--max_num_tokens: Sets the maximum number of tokens
Explanation: Tuning the --max_num_tokens parameter is crucial for optimal performance. It should be estimated from the maximum batch size, the input length, and a rough estimate of the number of requests in the context phase. Increasing --max_num_tokens appropriately improves performance by allowing TensorRT-LLM to allocate more memory for the KV cache and execute more requests together.
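A hedged sketch of that estimation is shown below; the way the factors are combined and the numbers themselves are assumptions for illustration, not an official formula.

```python
# Illustrative heuristic for estimating --max_num_tokens from the factors above.
max_batch_size = 64          # requests in flight
avg_input_len = 512          # typical prompt length
ctx_requests_estimate = 4    # rough number of requests in the context phase per step

# Generation-phase requests contribute roughly one token each; context-phase
# requests contribute their full (packed) prompt.
max_num_tokens = max_batch_size + ctx_requests_estimate * avg_input_len
print(f"--max_num_tokens {max_num_tokens}")   # 2112 with these assumptions
```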
Paged KV Cache
Arguments/Python Classes:
--paged_kv_cache: Enables the paged KV cache
Explanation: The paged KV cache efficiently manages memory for the KV cache, leading to increased batch sizes and improved efficiency. It is enabled by default and can be controlled using the --paged_kv_cache argument.
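To get a feel for how much KV cache a given memory budget holds (and therefore how large a batch the paged KV cache can support), a back-of-the-envelope calculation like the one below can help; the model dimensions and memory budget are hypothetical.

```python
# Back-of-the-envelope KV cache sizing. Model dimensions and budget are hypothetical.
num_layers = 32
num_kv_heads = 8          # GQA: fewer KV heads than query heads
head_dim = 128
bytes_per_elem = 2        # FP16/BF16 KV cache

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
kv_budget_bytes = 0.9 * 24 * 1024**3   # ~90% of a 24 GB GPU left for the KV cache

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"tokens that fit:    {int(kv_budget_bytes / bytes_per_token):,}")
```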
In-flight Sequence Batching
Explanation: In-flight sequence batching schedules sequences in the context phase together with sequences in the generation phase, enhancing efficiency and reducing latency. It is enabled by default when the GPT attention plugin, input padding removal, and paged KV cache are all enabled.
Multi-Block Mode
Arguments/Python Classes:
--multi_block_mode: Enables multi-block mode
Explanation: Multi-block mode can be beneficial when the batch size and number of attention heads are not large enough to fully utilize the GPU. It is recommended to enable multi-block mode using the --multi_block_mode argument when the input sequence length is greater than 1024 and the product of sequence count and number of heads is less than half the number of streaming multiprocessors.
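That rule of thumb translates directly into a small check; the SM count is hard-coded here as an assumption and would normally come from querying the GPU.

```python
# Direct translation of the rule of thumb above. num_sms is an assumption; on a real
# system it could come from torch.cuda.get_device_properties(0).multi_processor_count.
def should_enable_multi_block(input_seq_len: int, seq_count: int, num_heads: int,
                              num_sms: int = 108) -> bool:
    return input_seq_len > 1024 and seq_count * num_heads < num_sms / 2

print(should_enable_multi_block(input_seq_len=4096, seq_count=2, num_heads=16))  # True
print(should_enable_multi_block(input_seq_len=512,  seq_count=8, num_heads=32))  # False
```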
Custom AllReduce Plugin
Arguments/Python Classes:
--use_custom_all_reduce: Enables the custom AllReduce plugin
Explanation: On NVLink-based nodes, enabling the custom AllReduce plugin activates a latency-optimized algorithm for the AllReduce operation. It is recommended to use the --use_custom_all_reduce argument on NVLink-based systems but not on PCIe-based systems.
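A minimal sketch of making that choice programmatically is shown below; detecting NVLink is left as a placeholder, and whether the flag takes an enable/disable value or acts as a plain switch depends on your TensorRT-LLM version.

```python
# Conditionally emit the flag based on the interconnect, per the recommendation above.
# has_nvlink is a placeholder; detecting NVLink (e.g. via `nvidia-smi topo -m`) is left to you.
def all_reduce_flags(has_nvlink: bool) -> list[str]:
    return ["--use_custom_all_reduce", "enable" if has_nvlink else "disable"]

print(all_reduce_flags(has_nvlink=True))   # NVLink-based node
print(all_reduce_flags(has_nvlink=False))  # PCIe-based node
```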
Embedding Parallelism, Embedding Sharing, and Look-Up Plugin
Arguments/Python Classes:
--use_parallel_embedding: Enables embedding parallelism
--use_embedding_sharing: Enables embedding sharing
--use_lookup_plugin: Enables the look-up plugin
--use_gemm_plugin: Enables the GEMM plugin
--embedding_sharding_dim: Sets the sharding dimension of the embedding lookup table
Explanation: Embedding parallelism shards the embedding table across multiple GPUs, reducing memory usage and improving throughput. Embedding sharing allows the embedding table to be shared between the look_up and lm_head layers. To enable these features, the model must share the embedding table, and both the look-up and GEMM plugins must be enabled. The sharding dimension of the embedding lookup table should be set correctly using the --embedding_sharding_dim argument.
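The sketch below shows one way the flags might be combined; it assumes a build.py/trtllm-build-style CLI, the dtype values for the look-up and GEMM plugins are assumptions, and the meaning of the sharding-dim value (vocab vs. hidden dimension) should be checked against your version's documentation.

```python
# Flag combination for sharded and shared embeddings, as described above.
embedding_flags = [
    "--use_parallel_embedding",        # shard the embedding table across GPUs
    "--embedding_sharding_dim", "0",   # sharding dimension (assumption: 0 = vocab dimension)
    "--use_embedding_sharing",         # share the table between look_up and lm_head
    "--use_lookup_plugin", "float16",  # required for sharing (dtype value is an assumption)
    "--use_gemm_plugin", "float16",    # required for sharing (dtype value is an assumption)
]
print(" ".join(embedding_flags))
```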
Horizontal Fusion in Gated-MLP
Arguments/Python Classes:
--use_fused_mlp: Enables horizontal fusion in Gated-MLP
Explanation: Horizontal fusion in Gated-MLP combines two Matmul operations into a single one, followed by a separate SwiGLU kernel. It is recommended to enable this feature using the --use_fused_mlp argument when both the model and batch sizes are large. However, for FP8 post-training quantization (PTQ), enabling horizontal fusion may slightly reduce accuracy.
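A small decision helper reflecting this guidance might look like the following; the inputs are deliberately coarse booleans, since "large" is workload-dependent.

```python
# Fuse the Gated-MLP Matmuls for large models/batches, but be cautious with FP8 PTQ
# when accuracy is critical, per the note above.
def use_fused_mlp(large_model: bool, large_batch: bool,
                  fp8_ptq: bool, accuracy_critical: bool) -> bool:
    if fp8_ptq and accuracy_critical:
        return False          # fusion may cost a little accuracy under FP8 PTQ
    return large_model and large_batch

print(use_fused_mlp(True, True, fp8_ptq=False, accuracy_critical=False))  # True
```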
GEMM Plugin
Arguments/Python Classes:
--use_gemm_plugin: Enables the GEMM plugin
Explanation: The GEMM plugin utilizes NVIDIA cuBLASLt to perform GEMM operations. It is recommended to enable the GEMM plugin for better performance and smaller GPU memory usage when using FP16 and BF16 precision. However, for FP8 precision, it is recommended to disable the GEMM plugin.
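The precision-based recommendation can be expressed as a tiny helper; the flag value convention (passing a dtype) is an assumption that may differ across versions.

```python
# Enable the cuBLASLt-backed GEMM plugin for FP16/BF16, leave it disabled for FP8.
def gemm_plugin_flag(dtype: str) -> list[str]:
    if dtype in ("float16", "bfloat16"):
        return ["--use_gemm_plugin", dtype]   # faster, smaller GPU memory usage
    return []                                 # e.g. FP8: keep the plugin disabled

print(gemm_plugin_flag("float16"))
print(gemm_plugin_flag("fp8"))
```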
BERT Attention Plugin and Context Fused Multi-Head Attention
Arguments/Python Classes:
--bert_attention_plugin: Enables the BERT attention plugin
--context_fmha: Enables fused multi-head attention during the context phase
Explanation: The BERT attention plugin and context fused multi-head attention are recommended for the BERT model. They are enabled by default via the --bert_attention_plugin and --context_fmha arguments.
Runtime Options
GPT Model Type
Explanation: The GPT model type can be set to V1, inflight_batching, or inflight_fused_batching. It is recommended to use inflight_fused_batching to increase throughput and reduce latency.
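A minimal sketch of selecting this option is shown below, assuming Triton TensorRT-LLM backend-style runtime parameters; the parameter name and accepted strings should be verified against your backend version.

```python
# Pick the recommended model type from the three documented options.
VALID_MODEL_TYPES = {"V1", "inflight_batching", "inflight_fused_batching"}

gpt_model_type = "inflight_fused_batching"   # recommended for throughput and latency
assert gpt_model_type in VALID_MODEL_TYPES
print({"gpt_model_type": gpt_model_type})
```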
Max Tokens in Paged KV Cache and KV Cache Free GPU Memory Fraction
Arguments/Python Classes:
max_tokens_in_paged_kv_cache: Sets the maximum number of tokens in the KV cache manager
kv_cache_free_gpu_mem_fraction: Sets the maximum fraction of GPU memory used for the KV cache
Explanation: These parameters control the maximum number of tokens handled by the KV cache manager. Setting them properly determines how much memory the KV cache manager has available during inference, and increasing that memory tends to improve achievable throughput. To target high throughput, it is recommended to leave max_tokens_in_paged_kv_cache unset and test with a high value (e.g., 0.95) for kv_cache_free_gpu_mem_fraction.
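The sketch below reflects that recommendation; the parameter names assume a Triton TensorRT-LLM backend-style configuration.

```python
# Leave max_tokens_in_paged_kv_cache unset and steer memory with the fraction instead.
kv_cache_params = {
    # "max_tokens_in_paged_kv_cache": ...,   # intentionally left unset
    "kv_cache_free_gpu_mem_fraction": 0.95,  # start high when targeting throughput
}
print(kv_cache_params)
```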
Batch Scheduler Policy
There are two batch scheduler policies: MAX_UTILIZATION and GUARANTEED_NO_EVICT.
The MAX_UTILIZATION policy packs as many requests as possible at each iteration of the forward loop, maximizing GPU utilization but risking the need to pause requests if the KV cache size limit is reached. The GUARANTEED_NO_EVICT policy guarantees that a started request is never paused.
If the goal is to maximize throughput, MAX_UTILIZATION should be tried, keeping in mind that it may impact latency if requests have to be paused.
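Expressed as a small helper, the trade-off looks like this:

```python
# Choose a scheduler policy based on the trade-off described above.
def pick_scheduler_policy(optimize_for_throughput: bool) -> str:
    # MAX_UTILIZATION packs more work per iteration but may pause requests when the
    # KV cache runs out; GUARANTEED_NO_EVICT never pauses a started request.
    return "MAX_UTILIZATION" if optimize_for_throughput else "GUARANTEED_NO_EVICT"

print(pick_scheduler_policy(optimize_for_throughput=True))
```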
TensorRT Overlap
When TensorRT overlap is enabled, available requests are partitioned into two micro-batches that can run concurrently, allowing TensorRT-LLM to hide exposed CPU runtime.
It is recommended to enable TensorRT overlap to increase throughput, but it may not provide performance benefits when the model size is not large enough to overlap the host overhead or when the number of requests is too small.
Maximum Attention Window Size
Arguments/Python Classes:
max_attention_window_size: Sets the maximum number of tokens attended to when generating one token
Explanation: The max_attention_window_size flag sets the maximum number of tokens attended to when using techniques like sliding window attention. When it is set to a value smaller than max_input_length + max_output_length, only the KV cache of the last max_attention_window_size tokens is stored, improving runtime performance at the expense of some accuracy. Users can adjust this value to balance performance and accuracy.
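The trade-off is easy to quantify with rough numbers; the values below are hypothetical.

```python
# Capping the attention window bounds KV cache growth per sequence.
max_input_length = 4096
max_output_length = 1024
max_attention_window_size = 2048   # smaller than max_input_length + max_output_length

full_window_tokens = max_input_length + max_output_length          # 5120 tokens of KV kept
capped_tokens = min(full_window_tokens, max_attention_window_size)  # only the last 2048 kept

print(f"KV tokens stored per sequence: {capped_tokens} instead of {full_window_tokens}")
```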
Chunked Context
Arguments/Python Classes:
enable_chunked_context: Enables context chunking
max_num_tokens: Sets the maximum number of tokens
Explanation: Enabling context chunking via enable_chunked_context increases the chance of batching context-phase and generation-phase requests together, balancing the amount of computation per iteration and increasing throughput. When context chunking is enabled, performance can be tuned by adjusting max_num_tokens. The recommended value for max_num_tokens is N * tokens_per_block, where N is an integer starting from 1 that is increased until the best performance is achieved.
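A simple way to generate candidate values for such a sweep is sketched below; the tokens_per_block value is an assumption (a common paged-KV block size) and should match your build.

```python
# Candidate --max_num_tokens values for a performance sweep, as suggested above.
tokens_per_block = 64   # assumption; use the block size your engine was built with

candidates = [n * tokens_per_block for n in range(1, 9)]   # N = 1..8, extend until perf plateaus
print(candidates)   # [64, 128, 192, 256, 320, 384, 448, 512]
```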
Conclusion
Tuning the performance of TensorRT-LLM models involves careful consideration of build options and runtime configurations.
By leveraging the optimization techniques and best practices discussed in this tutorial, such as the GPT attention plugin, input padding removal, paged KV cache, in-flight sequence batching, and various runtime options, you can significantly enhance the performance and efficiency of your TensorRT-LLM deployments.
Remember to experiment with different settings and configurations to find the optimal balance between performance, memory usage, and accuracy for your specific use case. The TensorRT-LLM library provides a wide range of options and flexibility to customize and optimize your LLM inference pipeline.
For more detailed information and advanced optimization techniques, refer to the TensorRT-LLM documentation and the NVIDIA NeMo framework.