# Best Practices for Tuning the Performance of TensorRT-LLM

### <mark style="color:blue;">Introduction</mark>

TensorRT-LLM is a powerful library for optimizing and deploying large language models (LLMs) on NVIDIA GPUs.

To achieve optimal performance, it is essential to understand the various build options and runtime configurations available in TensorRT-LLM.

This tutorial will provide an in-depth explanation of the key optimization techniques and best practices for tuning the performance of TensorRT-LLM models.

### <mark style="color:blue;">Optimization Types and Build Options</mark>

#### <mark style="color:green;">GPT Attention Plugin and Context Fused Multi-Head Attention</mark>

* Arguments/Python Classes:
  * `--gpt_attention_plugin`: Enables the GPT attention plugin
  * `--context_fmha`: Enables fused multi-head attention during the context phase
* Explanation: The GPT attention plugin utilizes efficient kernels and enables in-place updates of the KV cache, reducing memory consumption and eliminating unnecessary memory copy operations. Enabling fused multi-head attention during the context phase triggers a single kernel that performs the MHA/MQA/GQA block, further improving performance.
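
A minimal sketch of how these two options might be passed at engine-build time is shown below. The command name (`trtllm-build`), the checkpoint and output paths, and the exact value syntax are assumptions that depend on your TensorRT-LLM version; only the flag names come from this guide.

```bash
# Hypothetical engine build enabling the GPT attention plugin and context FMHA.
# Paths and value syntax are illustrative; check the build tool's --help for
# the exact spelling in your TensorRT-LLM version.
trtllm-build \
    --checkpoint_dir ./llama-7b-ckpt \
    --output_dir ./llama-7b-engine \
    --gpt_attention_plugin float16 \
    --context_fmha enable
```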

#### <mark style="color:green;">Remove Input Padding</mark>

* Arguments/Python Classes:
  * `--remove_input_padding`: Enables the removal of input padding
* Explanation: Removing input padding packs different tokens together, reducing computations and memory consumption. When input padding is removed, the maximum number of tokens can be set to a lower value, allowing for more efficient memory allocation and execution of requests.

#### <mark style="color:green;">Maximum Number of Tokens</mark>

* Arguments/Python Classes:
  * `--max_num_tokens`: Sets the maximum number of tokens
* Explanation: Tuning the `--max_num_tokens` parameter is crucial for optimal performance. It should be estimated from the maximum batch size, the input length, and a rough estimate of the number of requests in the context phase. Increasing `--max_num_tokens` appropriately improves performance by letting TensorRT-LLM schedule more tokens per iteration and execute more requests together.
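
One rough way to pick a starting value, following the estimation described above, is sketched below. The 20% context-phase share and the other numbers are purely illustrative assumptions, not TensorRT-LLM defaults; treat the result as a starting point to refine by benchmarking.

```bash
# Rough starting estimate for --max_num_tokens (illustrative heuristic only).
MAX_BATCH_SIZE=64      # maximum number of concurrent requests
MAX_INPUT_LEN=1024     # maximum input length per request
CTX_SHARE_PCT=20       # assume ~20% of scheduled requests are in the context phase

# Context-phase requests contribute up to their full input length in tokens;
# generation-phase requests contribute roughly one token each per iteration.
CTX_REQUESTS=$(( MAX_BATCH_SIZE * CTX_SHARE_PCT / 100 ))
GEN_REQUESTS=$(( MAX_BATCH_SIZE - CTX_REQUESTS ))
MAX_NUM_TOKENS=$(( CTX_REQUESTS * MAX_INPUT_LEN + GEN_REQUESTS ))
echo "Starting point for --max_num_tokens: ${MAX_NUM_TOKENS}"
```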

#### <mark style="color:green;">Paged KV Cache</mark>

* Arguments/Python Classes:
  * `--paged_kv_cache`: Enables the paged KV cache
* Explanation: The paged KV cache efficiently manages memory for the KV cache, leading to increased batch sizes and improved efficiency. It is enabled by default and can be controlled using the `--paged_kv_cache` argument.

#### <mark style="color:green;">In-flight Sequence Batching</mark>

* Explanation: In-flight sequence batching schedules sequences in the context phase together with sequences in the generation phase, enhancing efficiency and reducing latency. It is enabled by default when the GPT attention plugin, input padding removal, and paged KV cache are all enabled.
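
Because in-flight batching depends on all three of those options, a build sketch might enable them together. As before, the command name, paths, and value syntax are assumptions; the flag names are the ones discussed in this guide.

```bash
# Hypothetical build that satisfies the prerequisites for in-flight batching:
# GPT attention plugin + input padding removal + paged KV cache.
trtllm-build \
    --checkpoint_dir ./llama-7b-ckpt \
    --output_dir ./llama-7b-engine \
    --gpt_attention_plugin float16 \
    --remove_input_padding enable \
    --paged_kv_cache enable
```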

#### <mark style="color:green;">Multi-Block Mode</mark>

* Arguments/Python Classes:
  * `--multi_block_mode`: Enables multi-block mode
* Explanation: Multi-block mode can be beneficial when the batch size and number of attention heads are not large enough to fully utilize the GPU. It is recommended to enable multi-block mode using the `--multi_block_mode` argument when the input sequence length is greater than 1024 and the product of sequence count and number of heads is less than half the number of streaming multiprocessors.
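
The rule of thumb above can be written as a quick check. The numbers below are illustrative, and the SM count depends on the GPU (108 is the A100 figure).

```bash
# Quick check of the multi-block-mode rule of thumb (illustrative values).
SEQ_LEN=2048    # input sequence length
NUM_SEQS=1      # number of sequences processed together
NUM_HEADS=32    # attention heads
NUM_SMS=108     # streaming multiprocessors (e.g. 108 on an A100)

if [ "$SEQ_LEN" -gt 1024 ] && [ $(( NUM_SEQS * NUM_HEADS )) -lt $(( NUM_SMS / 2 )) ]; then
    echo "Consider building with --multi_block_mode enabled"
fi
```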

#### <mark style="color:green;">Custom AllReduce Plugin</mark>

* Arguments/Python Classes:
  * `--use_custom_all_reduce`: Enables the custom AllReduce plugin
* Explanation: On NVLink-based nodes, enabling the custom AllReduce plugin activates a latency-optimized algorithm for the AllReduce operation. It is recommended to use the `--use_custom_all_reduce` argument on NVLink-based systems, but not on PCIe-based systems.

#### <mark style="color:green;">Embedding Parallelism, Embedding Sharing, and Look-Up Plugin</mark>

* Arguments/Python Classes:
  * `--use_parallel_embedding`: Enables embedding parallelism
  * `--use_embedding_sharing`: Enables embedding sharing
  * `--use_lookup_plugin`: Enables the look-up plugin
  * `--use_gemm_plugin`: Enables the GEMM plugin
  * `--embedding_sharding_dim`: Sets the sharding dimension of the embedding lookup table
* Explanation: Embedding parallelism enables sharding of the embedding table across multiple GPUs, reducing memory usage and improving throughput. Embedding sharing allows the sharing of the embedding table between the `look_up` and `lm_head` layers. To enable these features, the model must share the embedding table, and both the look-up and GEMM plugins must be enabled. The sharding dimension of the embedding lookup table should be set correctly using the `--embedding_sharding_dim` argument.
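
A hedged build sketch combining these options is shown below; as with the earlier examples, the command name, paths, and value syntax are assumptions, while the flag names follow the list above.

```bash
# Hypothetical build enabling embedding parallelism and sharing, together with
# the look-up and GEMM plugins they depend on.
trtllm-build \
    --checkpoint_dir ./gpt-ckpt \
    --output_dir ./gpt-engine \
    --use_parallel_embedding \
    --use_embedding_sharing \
    --use_lookup_plugin float16 \
    --use_gemm_plugin float16 \
    --embedding_sharding_dim 0
```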

#### <mark style="color:green;">Horizontal Fusion in Gated-MLP</mark>

* Arguments/Python Classes:
  * `--use_fused_mlp`: Enables horizontal fusion in Gated-MLP
* Explanation: Horizontal fusion in Gated-MLP combines two Matmul operations into a single one followed by a separate SwiGLU kernel. It is recommended to enable this feature using the `--use_fused_mlp` argument when both the model and batch sizes are large. However, for FP8 post-training quantization (PTQ), enabling horizontal fusion may slightly reduce accuracy.

#### <mark style="color:green;">GEMM Plugin</mark>

* Arguments/Python Classes:
  * `--use_gemm_plugin`: Enables the GEMM plugin
* Explanation: The GEMM plugin utilizes NVIDIA cuBLASLt to perform GEMM operations. It is recommended to enable the GEMM plugin for better performance and smaller GPU memory usage when using FP16 and BF16 precision. However, for FP8 precision, it is recommended to disable the GEMM plugin.
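
The precision-dependent advice for the GEMM plugin and the Gated-MLP fusion can be summarized as two hedged build variants; the flag names come from this guide, while the command name, paths, and value syntax are assumptions.

```bash
# FP16 / BF16 engines: enable the GEMM plugin (cuBLASLt-backed GEMMs) and, for
# large models and batch sizes, horizontal fusion in the Gated-MLP.
trtllm-build --checkpoint_dir ./ckpt-fp16 --output_dir ./engine-fp16 \
    --use_gemm_plugin float16 \
    --use_fused_mlp

# FP8 engines: leave the GEMM plugin disabled (omit the flag), and weigh
# --use_fused_mlp against the slight accuracy impact noted above for FP8 PTQ.
trtllm-build --checkpoint_dir ./ckpt-fp8 --output_dir ./engine-fp8
```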

#### <mark style="color:green;">BERT Attention Plugin and Context Fused Multi-Head Attention</mark>

* Arguments/Python Classes:
  * `--bert_attention_plugin`: Enables the BERT attention plugin
  * `--context_fmha`: Enables fused multi-head attention during the context phase
* Explanation: The BERT attention plugin and context fused multi-head attention are recommended for the BERT model. They are enabled by default and can be controlled with the `--bert_attention_plugin` and `--context_fmha` arguments.

### <mark style="color:blue;">Runtime Options</mark>

#### <mark style="color:green;">GPT Model Type</mark>

* Explanation: The GPT model type can be set to `V1`, `inflight_batching`, or `inflight_fused_batching`. It is recommended to use `inflight_fused_batching` to increase throughput and reduce latency.

#### <mark style="color:green;">Max Tokens in Paged KV Cache and KV Cache Free GPU Memory Fraction</mark>

* Arguments/Python Classes:
  * `max_tokens_in_paged_kv_cache`: Sets the maximum number of tokens in the KV cache manager
  * `kv_cache_free_gpu_mem_fraction`: Sets the maximum fraction of GPU memory used for the KV cache
* Explanation: These parameters control the maximum number of tokens handled by the KV cache manager. Setting them properly helps manage the available memory for the KV cache manager during inference. Increasing the memory available to the KV cache manager tends to improve achievable throughput. It is recommended to leave `max_tokens_in_paged_kv_cache` unset and test with a high value (e.g., 0.95) for `kv_cache_free_gpu_mem_fraction` to target high throughput.
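
As a back-of-the-envelope illustration of what the fraction controls, assuming it applies to the GPU memory left over once the engine is loaded (all numbers below are made up for the example):

```bash
# Illustrative estimate of the memory budget handed to the KV cache manager.
TOTAL_GPU_MEM_GB=80    # e.g. an 80 GB GPU
ENGINE_MEM_GB=16       # memory already taken by weights and activations (example)
FRACTION=0.95          # kv_cache_free_gpu_mem_fraction

awk -v t="$TOTAL_GPU_MEM_GB" -v e="$ENGINE_MEM_GB" -v f="$FRACTION" \
    'BEGIN { printf "KV cache budget: about %.1f GB\n", f * (t - e) }'
```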

### <mark style="color:blue;">Batch Scheduler Policy</mark>

There are two batch scheduler policies: <mark style="color:yellow;">`MAX_UTILIZATION`</mark> and <mark style="color:yellow;">`GUARANTEED_NO_EVICT`</mark>.

The <mark style="color:yellow;">`MAX_UTILIZATION`</mark> policy packs as many requests as possible at each iteration of the forward loop, maximizing GPU utilization but risking the need to pause requests if the KV cache size limit is reached.

The <mark style="color:yellow;">`GUARANTEED_NO_EVICT`</mark> policy guarantees that a started request is never paused.

If the goal is to maximize throughput, <mark style="color:yellow;">`MAX_UTILIZATION`</mark> should be tried, keeping in mind that it may impact latency if requests have to be paused.

### <mark style="color:blue;">TensorRT Overlap</mark>

When TensorRT overlap is enabled, available requests are partitioned into two micro-batches that can run concurrently, allowing TensorRT-LLM to hide exposed CPU runtime.

It is recommended to enable TensorRT overlap to increase throughput, but it may not provide performance benefits when the model size is not large enough to overlap the host overhead or when the number of requests is too small.

### <mark style="color:blue;">Maximum Attention Window Size</mark>

* Arguments/Python Classes:
  * `max_attention_window_size`: Sets the maximum number of tokens attended to when generating one token
* Explanation: The `max_attention_window_size` flag sets the maximum number of tokens attended to when using techniques like sliding window attention. When set to a smaller value than `max_input_length + max_output_length`, only the KV cache of the last `max_attention_window_size` tokens will be stored, improving runtime performance at the expense of reduced accuracy. Users can modify this value to balance performance and accuracy.

### <mark style="color:blue;">Chunked Context</mark>

* Arguments/Python Classes:
  * `enable_chunked_context`: Enables context chunking
  * `max_num_tokens`: Sets the maximum number of tokens
* Explanation: Enabling context chunking by specifying `enable_chunked_context` increases the chance of batching context-phase and generation-phase requests together, which balances the amount of computation in each iteration and increases throughput. When context chunking is enabled, performance can be tuned further by adjusting `max_num_tokens`. The recommended value for `max_num_tokens` is `N * tokens_per_block`, where `N` is an integer starting from 1 and increased until the best performance is achieved.
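
A simple way to sweep candidate values is to step through multiples of the block size, as sketched below. The block size of 64 tokens is an assumption used purely for illustration; substitute the `tokens_per_block` value your engine was actually built with.

```bash
# Candidate --max_num_tokens values when chunked context is enabled:
# N * tokens_per_block for N = 1, 2, 3, ...
TOKENS_PER_BLOCK=64    # illustrative; use your engine's actual tokens_per_block
for N in $(seq 1 8); do
    echo "Candidate max_num_tokens: $(( N * TOKENS_PER_BLOCK ))"
done
```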

### <mark style="color:blue;">Conclusion</mark>

Tuning the performance of TensorRT-LLM models involves careful consideration of build options and runtime configurations.

By leveraging the optimization techniques and best practices discussed in this tutorial, such as the GPT attention plugin, input padding removal, paged KV cache, in-flight sequence batching, and various runtime options, you can significantly enhance the performance and efficiency of your TensorRT-LLM deployments.

Remember to experiment with different settings and configurations to find the optimal balance between performance, memory usage, and accuracy for your specific use case. The TensorRT-LLM library provides a wide range of options and flexibility to customize and optimize your LLM inference pipeline.

For more detailed information and advanced optimization techniques, refer to the TensorRT-LLM documentation and the NVIDIA NeMo framework.

