# Best Practices for Tuning the Performance of TensorRT-LLM

### <mark style="color:blue;">Introduction</mark>

TensorRT-LLM is a powerful library for optimizing and deploying large language models (LLMs) on NVIDIA GPUs.

To achieve optimal performance, it is essential to understand the various build options and runtime configurations available in TensorRT-LLM.

This tutorial will provide an in-depth explanation of the key optimization techniques and best practices for tuning the performance of TensorRT-LLM models.

### <mark style="color:blue;">Optimization Types and Build Options</mark>

#### <mark style="color:green;">GPT Attention Plugin and Context Fused Multi-Head Attention</mark>

* Arguments/Python Classes:
  * `--gpt_attention_plugin`: Enables the GPT attention plugin
  * `--context_fmha`: Enables fused multi-head attention during the context phase
* Explanation: The GPT attention plugin utilizes efficient kernels and enables in-place updates of the KV cache, reducing memory consumption and eliminating unnecessary memory copy operations. Enabling fused multi-head attention during the context phase triggers a single kernel that performs the MHA/MQA/GQA block, further improving performance.
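
A minimal sketch of how these two options might be passed at engine-build time is shown below. The command name (`trtllm-build`), the checkpoint and output paths, and the exact value syntax are assumptions that depend on your TensorRT-LLM version; only the flag names come from this guide.

```bash
# Hypothetical engine build enabling the GPT attention plugin and context FMHA.
# Paths and value syntax are illustrative; check the build tool's --help for
# the exact spelling in your TensorRT-LLM version.
trtllm-build \
    --checkpoint_dir ./llama-7b-ckpt \
    --output_dir ./llama-7b-engine \
    --gpt_attention_plugin float16 \
    --context_fmha enable
```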

#### <mark style="color:green;">Remove Input Padding</mark>

* Arguments/Python Classes:
  * `--remove_input_padding`: Enables the removal of input padding
* Explanation: Removing input padding packs different tokens together, reducing computations and memory consumption. When input padding is removed, the maximum number of tokens can be set to a lower value, allowing for more efficient memory allocation and execution of requests.

#### <mark style="color:green;">Maximum Number of Tokens</mark>

* Arguments/Python Classes:
  * `--max_num_tokens`: Sets the maximum number of tokens
* Explanation: Tuning the `--max_num_tokens` parameter is crucial for optimal performance. It should be estimated from the maximum batch size, the input length, and a rough estimate of the number of requests in the context phase. Increasing `--max_num_tokens` appropriately improves performance by letting TensorRT-LLM schedule more tokens per iteration and execute more requests together.
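
One rough way to pick a starting value, following the estimation described above, is sketched below. The 20% context-phase share and the other numbers are purely illustrative assumptions, not TensorRT-LLM defaults; treat the result as a starting point to refine by benchmarking.

```bash
# Rough starting estimate for --max_num_tokens (illustrative heuristic only).
MAX_BATCH_SIZE=64      # maximum number of concurrent requests
MAX_INPUT_LEN=1024     # maximum input length per request
CTX_SHARE_PCT=20       # assume ~20% of scheduled requests are in the context phase

# Context-phase requests contribute up to their full input length in tokens;
# generation-phase requests contribute roughly one token each per iteration.
CTX_REQUESTS=$(( MAX_BATCH_SIZE * CTX_SHARE_PCT / 100 ))
GEN_REQUESTS=$(( MAX_BATCH_SIZE - CTX_REQUESTS ))
MAX_NUM_TOKENS=$(( CTX_REQUESTS * MAX_INPUT_LEN + GEN_REQUESTS ))
echo "Starting point for --max_num_tokens: ${MAX_NUM_TOKENS}"
```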

#### <mark style="color:green;">Paged KV Cache</mark>

* Arguments/Python Classes:
  * `--paged_kv_cache`: Enables the paged KV cache
* Explanation: The paged KV cache efficiently manages memory for the KV cache, leading to increased batch sizes and improved efficiency. It is enabled by default and can be controlled using the `--paged_kv_cache` argument.

#### <mark style="color:green;">In-flight Sequence Batching</mark>

* Explanation: In-flight sequence batching schedules sequences in the context phase together with sequences in the generation phase, enhancing efficiency and reducing latency. It is enabled by default when the GPT attention plugin, input padding removal, and paged KV cache are all enabled.
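
Because in-flight batching depends on all three of those options, a build sketch might enable them together. As before, the command name, paths, and value syntax are assumptions; the flag names are the ones discussed in this guide.

```bash
# Hypothetical build that satisfies the prerequisites for in-flight batching:
# GPT attention plugin + input padding removal + paged KV cache.
trtllm-build \
    --checkpoint_dir ./llama-7b-ckpt \
    --output_dir ./llama-7b-engine \
    --gpt_attention_plugin float16 \
    --remove_input_padding enable \
    --paged_kv_cache enable
```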

#### <mark style="color:green;">Multi-Block Mode</mark>

* Arguments/Python Classes:
  * `--multi_block_mode`: Enables multi-block mode
* Explanation: Multi-block mode can be beneficial when the batch size and number of attention heads are not large enough to fully utilize the GPU. It is recommended to enable multi-block mode using the `--multi_block_mode` argument when the input sequence length is greater than 1024 and the product of sequence count and number of heads is less than half the number of streaming multiprocessors.
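
The rule of thumb above can be written as a quick check. The numbers below are illustrative, and the SM count depends on the GPU (108 is the A100 figure).

```bash
# Quick check of the multi-block-mode rule of thumb (illustrative values).
SEQ_LEN=2048    # input sequence length
NUM_SEQS=1      # number of sequences processed together
NUM_HEADS=32    # attention heads
NUM_SMS=108     # streaming multiprocessors (e.g. 108 on an A100)

if [ "$SEQ_LEN" -gt 1024 ] && [ $(( NUM_SEQS * NUM_HEADS )) -lt $(( NUM_SMS / 2 )) ]; then
    echo "Consider building with --multi_block_mode enabled"
fi
```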

#### <mark style="color:green;">Custom AllReduce Plugin</mark>

* Arguments/Python Classes:
  * `--use_custom_all_reduce`: Enables the custom AllReduce plugin
* Explanation: On NVLink-based nodes, enabling the custom AllReduce plugin activates a latency-optimized algorithm for the AllReduce operation. It is recommended to use the `--use_custom_all_reduce` argument on NVLink-based systems, but not on PCIe-based systems.

#### <mark style="color:green;">Embedding Parallelism, Embedding Sharing, and Look-Up Plugin</mark>

* Arguments/Python Classes:
  * `--use_parallel_embedding`: Enables embedding parallelism
  * `--use_embedding_sharing`: Enables embedding sharing
  * `--use_lookup_plugin`: Enables the look-up plugin
  * `--use_gemm_plugin`: Enables the GEMM plugin
  * `--embedding_sharding_dim`: Sets the sharding dimension of the embedding lookup table
* Explanation: Embedding parallelism enables sharding of the embedding table across multiple GPUs, reducing memory usage and improving throughput. Embedding sharing allows the sharing of the embedding table between the `look_up` and `lm_head` layers. To enable these features, the model must share the embedding table, and both the look-up and GEMM plugins must be enabled. The sharding dimension of the embedding lookup table should be set correctly using the `--embedding_sharding_dim` argument.
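
A hedged build sketch combining these options is shown below; as with the earlier examples, the command name, paths, and value syntax are assumptions, while the flag names follow the list above.

```bash
# Hypothetical build enabling embedding parallelism and sharing, together with
# the look-up and GEMM plugins they depend on.
trtllm-build \
    --checkpoint_dir ./gpt-ckpt \
    --output_dir ./gpt-engine \
    --use_parallel_embedding \
    --use_embedding_sharing \
    --use_lookup_plugin float16 \
    --use_gemm_plugin float16 \
    --embedding_sharding_dim 0
```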

#### <mark style="color:green;">Horizontal Fusion in Gated-MLP</mark>

* Arguments/Python Classes:
  * `--use_fused_mlp`: Enables horizontal fusion in Gated-MLP
* Explanation: Horizontal fusion in Gated-MLP combines two Matmul operations into a single one followed by a separate SwiGLU kernel. It is recommended to enable this feature using the `--use_fused_mlp` argument when both the model and batch sizes are large. However, for FP8 post-training quantization (PTQ), enabling horizontal fusion may slightly reduce accuracy.

#### <mark style="color:green;">GEMM Plugin</mark>

* Arguments/Python Classes:
  * `--use_gemm_plugin`: Enables the GEMM plugin
* Explanation: The GEMM plugin utilizes NVIDIA cuBLASLt to perform GEMM operations. It is recommended to enable the GEMM plugin for better performance and smaller GPU memory usage when using FP16 and BF16 precision. However, for FP8 precision, it is recommended to disable the GEMM plugin.
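
The precision-dependent advice for the GEMM plugin and the Gated-MLP fusion can be summarized as two hedged build variants; the flag names come from this guide, while the command name, paths, and value syntax are assumptions.

```bash
# FP16 / BF16 engines: enable the GEMM plugin (cuBLASLt-backed GEMMs) and, for
# large models and batch sizes, horizontal fusion in the Gated-MLP.
trtllm-build --checkpoint_dir ./ckpt-fp16 --output_dir ./engine-fp16 \
    --use_gemm_plugin float16 \
    --use_fused_mlp

# FP8 engines: leave the GEMM plugin disabled (omit the flag), and weigh
# --use_fused_mlp against the slight accuracy impact noted above for FP8 PTQ.
trtllm-build --checkpoint_dir ./ckpt-fp8 --output_dir ./engine-fp8
```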

#### <mark style="color:green;">BERT Attention Plugin and Context Fused Multi-Head Attention</mark>

* Arguments/Python Classes:
  * `--bert_attention_plugin`: Enables the BERT attention plugin
  * `--context_fmha`: Enables fused multi-head attention during the context phase
* Explanation: The BERT attention plugin and context fused multi-head attention are recommended for the BERT model. They are enabled by default and can be controlled with the `--bert_attention_plugin` and `--context_fmha` arguments.

### <mark style="color:blue;">Runtime Options</mark>

#### <mark style="color:green;">GPT Model Type</mark>

* Explanation: The GPT model type can be set to `V1`, `inflight_batching`, or `inflight_fused_batching`. It is recommended to use `inflight_fused_batching` to increase throughput and reduce latency.

#### <mark style="color:green;">Max Tokens in Paged KV Cache and KV Cache Free GPU Memory Fraction</mark>

* Arguments/Python Classes:
  * `max_tokens_in_paged_kv_cache`: Sets the maximum number of tokens in the KV cache manager
  * `kv_cache_free_gpu_mem_fraction`: Sets the maximum fraction of GPU memory used for the KV cache
* Explanation: These parameters control the maximum number of tokens handled by the KV cache manager. Setting them properly helps manage the available memory for the KV cache manager during inference. Increasing the memory available to the KV cache manager tends to improve achievable throughput. It is recommended to leave `max_tokens_in_paged_kv_cache` unset and test with a high value (e.g., 0.95) for `kv_cache_free_gpu_mem_fraction` to target high throughput.
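
As a back-of-the-envelope illustration of what the fraction controls, assuming it applies to the GPU memory left over once the engine is loaded (all numbers below are made up for the example):

```bash
# Illustrative estimate of the memory budget handed to the KV cache manager.
TOTAL_GPU_MEM_GB=80    # e.g. an 80 GB GPU
ENGINE_MEM_GB=16       # memory already taken by weights and activations (example)
FRACTION=0.95          # kv_cache_free_gpu_mem_fraction

awk -v t="$TOTAL_GPU_MEM_GB" -v e="$ENGINE_MEM_GB" -v f="$FRACTION" \
    'BEGIN { printf "KV cache budget: about %.1f GB\n", f * (t - e) }'
```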

### <mark style="color:blue;">Batch Scheduler Policy</mark>

There are two batch scheduler policies: <mark style="color:yellow;">`MAX_UTILIZATION`</mark> and <mark style="color:yellow;">`GUARANTEED_NO_EVICT`</mark>.

The <mark style="color:yellow;">`MAX_UTILIZATION`</mark> policy packs as many requests as possible at each iteration of the forward loop, maximizing GPU utilization but risking the need to pause requests if the KV cache size limit is reached.

The <mark style="color:yellow;">`GUARANTEED_NO_EVICT`</mark> policy guarantees that a started request is never paused.

If the goal is to maximize throughput, <mark style="color:yellow;">`MAX_UTILIZATION`</mark> should be tried, keeping in mind that it may impact latency if requests have to be paused.

### <mark style="color:blue;">TensorRT Overlap</mark>

When TensorRT overlap is enabled, available requests are partitioned into two micro-batches that can run concurrently, allowing TensorRT-LLM to hide exposed CPU runtime.

It is recommended to enable TensorRT overlap to increase throughput, but it may not provide performance benefits when the model size is not large enough to overlap the host overhead or when the number of requests is too small.

### <mark style="color:blue;">Maximum Attention Window Size</mark>

* Arguments/Python Classes:
  * `max_attention_window_size`: Sets the maximum number of tokens attended to when generating one token
* Explanation: The `max_attention_window_size` flag sets the maximum number of tokens attended to when using techniques like sliding window attention. When set to a smaller value than `max_input_length + max_output_length`, only the KV cache of the last `max_attention_window_size` tokens will be stored, improving runtime performance at the expense of reduced accuracy. Users can modify this value to balance performance and accuracy.

### <mark style="color:blue;">Chunked Context</mark>

* Arguments/Python Classes:
  * `enable_chunked_context`: Enables context chunking
  * `max_num_tokens`: Sets the maximum number of tokens
* Explanation: Enabling context chunking by specifying `enable_chunked_context` increases the chance of batching context-phase and generation-phase requests together, which balances the amount of computation in each iteration and increases throughput. When context chunking is enabled, performance can be tuned further by adjusting `max_num_tokens`. The recommended value for `max_num_tokens` is `N * tokens_per_block`, where `N` is an integer starting from 1 and increased until the best performance is achieved.
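
A simple way to sweep candidate values is to step through multiples of the block size, as sketched below. The block size of 64 tokens is an assumption used purely for illustration; substitute the `tokens_per_block` value your engine was actually built with.

```bash
# Candidate --max_num_tokens values when chunked context is enabled:
# N * tokens_per_block for N = 1, 2, 3, ...
TOKENS_PER_BLOCK=64    # illustrative; use your engine's actual tokens_per_block
for N in $(seq 1 8); do
    echo "Candidate max_num_tokens: $(( N * TOKENS_PER_BLOCK ))"
done
```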

### <mark style="color:blue;">Conclusion</mark>

Tuning the performance of TensorRT-LLM models involves careful consideration of build options and runtime configurations.

By leveraging the optimization techniques and best practices discussed in this tutorial, such as the GPT attention plugin, input padding removal, paged KV cache, in-flight sequence batching, and various runtime options, you can significantly enhance the performance and efficiency of your TensorRT-LLM deployments.

Remember to experiment with different settings and configurations to find the optimal balance between performance, memory usage, and accuracy for your specific use case. The TensorRT-LLM library provides a wide range of options and flexibility to customize and optimize your LLM inference pipeline.

For more detailed information and advanced optimization techniques, refer to the TensorRT-LLM documentation and the NVIDIA NeMo framework.

