Best Practices for Tuning the Performance of TensorRT-LLM

Introduction

TensorRT-LLM is a powerful library for optimising and deploying large language models (LLMs) on NVIDIA GPUs.

To achieve optimal performance, it is essential to understand the various build options and runtime configurations available in TensorRT-LLM.

This tutorial will provide an in-depth explanation of the key optimization techniques and best practices for tuning the performance of TensorRT-LLM models.

Optimisation Types and Build Options

GPT Attention Plugin and Context Fused Multi-Head Attention

  • Arguments/Python Classes:

    • --gpt_attention_plugin: Enables the GPT attention plugin

    • --context_fmha: Enables fused multi-head attention during the context phase

  • Explanation: The GPT attention plugin utilises efficient kernels and enables in-place updates of the KV cache, reducing memory consumption and eliminating unnecessary memory copy operations. Enabling fused multi-head attention during the context phase triggers a single kernel that performs the MHA/MQA/GQA block, further optimizing performance.
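As a concrete illustration, the sketch below assembles a trtllm-build command with these two options enabled. The checkpoint and output directories are hypothetical placeholders, and the value syntax for the plugin flags (a dtype such as float16, or enable/disable) varies between TensorRT-LLM releases, so treat this as a template rather than an exact invocation.

```python
# Minimal sketch: assemble a trtllm-build command with the GPT attention plugin
# and context-phase fused multi-head attention enabled. Paths are hypothetical.
build_cmd = [
    "trtllm-build",
    "--checkpoint_dir", "./llama2-7b-ckpt",   # hypothetical converted checkpoint
    "--output_dir", "./llama2-7b-engine",     # hypothetical engine output directory
    "--gpt_attention_plugin", "float16",      # GPT attention plugin (dtype syntax assumed)
    "--context_fmha", "enable",               # fused MHA in the context phase
]
# The following subsections append further flags to build_cmd; once complete,
# the command can be executed with, e.g., subprocess.run(build_cmd, check=True).
```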

Remove Input Padding

  • Arguments/Python Classes:

    • --remove_input_padding: Enables the removal of input padding

  • Explanation: Removing input padding packs different tokens together, reducing computations and memory consumption. When input padding is removed, the maximum number of tokens can be set to a lower value, allowing for more efficient memory allocation and execution of requests.

Maximum Number of Tokens

  • Arguments/Python Classes:

    • --max_num_tokens: Sets the maximum number of tokens

  • Explanation: Tuning the --max_num_tokens parameter is crucial for optimal performance. It should be estimated based on the maximum batch size, input length, and a rough estimation of the number of requests in the context phase. Increasing --max_num_tokens appropriately improves performance by allowing TensorRT-LLM to allocate more memory for the KV cache and execute more requests together.
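Putting the previous two options together, the sketch below continues the build_cmd list from above: it removes input padding and derives a rough --max_num_tokens value from a hypothetical batch size, input length, and an assumed share of requests in the context phase. The estimation formula is only an illustration of the guidance above, not an official recipe.

```python
# Continuing build_cmd: remove input padding and set a rough max_num_tokens.
# All numbers are hypothetical; tune them against your own workload.
max_batch_size = 64        # maximum number of concurrent requests
max_input_len = 2048       # maximum prompt length in tokens
context_fraction = 0.15    # assumed share of requests in the context phase at any time

# With padding removed, generation-phase requests contribute roughly one token each,
# while context-phase requests contribute their full prompt length.
max_num_tokens = int(max_batch_size * context_fraction * max_input_len
                     + max_batch_size * (1 - context_fraction))

build_cmd += [
    "--remove_input_padding", "enable",
    "--max_num_tokens", str(max_num_tokens),
]
```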

Paged KV Cache

  • Arguments/Python Classes:

    • --paged_kv_cache: Enables the paged KV cache

  • Explanation: The paged KV cache efficiently manages memory for the KV cache, leading to increased batch sizes and improved efficiency. It is enabled by default and can be controlled using the --paged_kv_cache argument.

In-flight Sequence Batching

  • Explanation: In-flight sequence batching schedules sequences in the context phase together with sequences in the generation phase, enhancing efficiency and reducing latency. It is enabled by default when the GPT attention plugin, input padding removal, and paged KV cache are all enabled.
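Continuing the same sketch, the paged KV cache is made explicit below; no additional build flag is needed for in-flight sequence batching, since it becomes available at runtime once the three prerequisites above are enabled.

```python
# Continuing build_cmd: the paged KV cache is on by default, but stating it
# explicitly documents the prerequisite for in-flight sequence batching
# (GPT attention plugin + removed padding + paged KV cache, all set above).
build_cmd += ["--paged_kv_cache", "enable"]
```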

Multi-Block Mode

  • Arguments/Python Classes:

    • --multi_block_mode: Enables multi-block mode

  • Explanation: Multi-block mode can be beneficial when the batch size and number of attention heads are not large enough to fully utilize the GPU. It is recommended to enable multi-block mode using the --multi_block_mode argument when the input sequence length is greater than 1024 and the product of sequence count and number of heads is less than half the number of streaming multiprocessors.
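The heuristic above can be written down directly, as sketched below. The SM count is a hypothetical value for the target GPU, and whether --multi_block_mode is a build-time or runtime switch (and its exact value syntax) depends on the TensorRT-LLM release.

```python
# Continuing build_cmd: enable multi-block mode only when the heuristic above holds.
input_seq_len = 4096        # longest expected input sequence (hypothetical)
num_sequences = 1           # sequences processed together (hypothetical)
num_attention_heads = 32    # attention heads in the model (hypothetical)
num_sms = 132               # streaming multiprocessors on the target GPU (e.g. H100 SXM)

if input_seq_len > 1024 and num_sequences * num_attention_heads < num_sms / 2:
    build_cmd += ["--multi_block_mode"]   # flag syntax may differ between releases
```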

Custom AllReduce Plugin

  • Arguments/Python Classes:

    • --use_custom_all_reduce: Enables the custom AllReduce plugin

  • Explanation: On NVLink-based nodes, enabling the custom AllReduce plugin activates a latency-optimized algorithm for the AllReduce operation. It is recommended to use the --use_custom_all_reduce argument on NVLink-based systems but not on PCIE-based systems.
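A hedged sketch of that recommendation: probe for NVLink with nvidia-smi and only add the flag when links are reported. The nvidia-smi nvlink subcommand and its output format are assumptions about the local environment; on PCIe-only systems the flag is simply left out.

```python
import subprocess

# Continuing build_cmd: add the custom AllReduce plugin only on NVLink-based nodes.
# The probe below is an assumption; adapt it to however you detect NVLink.
probe = subprocess.run(["nvidia-smi", "nvlink", "--status"],
                       capture_output=True, text=True)
has_nvlink = probe.returncode == 0 and probe.stdout.strip() != ""

if has_nvlink:
    build_cmd += ["--use_custom_all_reduce"]
```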

Embedding Parallelism, Embedding Sharing, and Look-Up Plugin

  • Arguments/Python Classes:

    • --use_parallel_embedding: Enables embedding parallelism

    • --use_embedding_sharing: Enables embedding sharing

    • --use_lookup_plugin: Enables the look-up plugin

    • --use_gemm_plugin: Enables the GEMM plugin

    • --embedding_sharding_dim: Sets the sharding dimension of the embedding lookup table

  • Explanation: Embedding parallelism enables sharding of the embedding table across multiple GPUs, reducing memory usage and improving throughput. Embedding sharing allows the sharing of the embedding table between the look_up and lm_head layers. To enable these features, the model must share the embedding table, and both the look-up and GEMM plugins must be enabled. The sharding dimension of the embedding lookup table should be set correctly using the --embedding_sharding_dim argument.
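The sketch below groups the flags discussed above. Note that in recent TensorRT-LLM releases some of these switches (for example --use_parallel_embedding and --embedding_sharding_dim) belong to the checkpoint-conversion step rather than to trtllm-build, and the look-up plugin takes a dtype argument; the grouping and values here are illustrative assumptions. The GEMM plugin that these features also require is added in the GEMM Plugin sketch further below.

```python
# Continuing build_cmd: embedding parallelism and sharing, with the look-up plugin.
# Some of these switches may live in the checkpoint-conversion step in newer releases.
build_cmd += [
    "--use_parallel_embedding",            # shard the embedding table across GPUs
    "--use_embedding_sharing",             # share the table between look_up and lm_head
    "--use_lookup_plugin", "float16",      # look-up plugin (dtype assumed to match the model)
    "--embedding_sharding_dim", "0",       # shard along the vocabulary dimension
]
```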

Horizontal Fusion in Gated-MLP

  • Arguments/Python Classes:

    • --use_fused_mlp: Enables horizontal fusion in Gated-MLP

  • Explanation: Horizontal fusion in Gated-MLP combines two Matmul operations into a single one followed by a separate SwiGLU kernel. It is recommended to enable this feature using the --use_fused_mlp argument when both the model and batch sizes are large. However, for FP8 post-training quantization (PTQ), enabling horizontal fusion may slightly reduce accuracy.
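As a small illustration of that rule of thumb, the snippet below gates the flag on two hypothetical booleans; what counts as a "large" model or batch size is workload-dependent and not defined by TensorRT-LLM itself.

```python
# Continuing build_cmd: fuse the two Gated-MLP Matmuls when the workload is large
# and FP8 PTQ accuracy is not a concern. Both conditions are hypothetical knobs.
large_model_and_batch = True    # e.g. a 70B model served at high batch sizes
fp8_ptq_in_use = False          # FP8 post-training quantisation?

if large_model_and_batch and not fp8_ptq_in_use:
    build_cmd += ["--use_fused_mlp"]
```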

GEMM Plugin

  • Arguments/Python Classes:

    • --use_gemm_plugin: Enables the GEMM plugin

  • Explanation: The GEMM plugin utilizes NVIDIA cuBLASLt to perform GEMM operations. It is recommended to enable the GEMM plugin for better performance and smaller GPU memory usage when using FP16 and BF16 precision. However, for FP8 precision, it is recommended to disable the GEMM plugin.
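The precision-dependent recommendation can be expressed as a simple branch; the dtype strings and the assumption that --use_gemm_plugin takes the model dtype follow the flag listing above and may differ slightly between releases.

```python
# Continuing build_cmd: enable the cuBLASLt-backed GEMM plugin for FP16/BF16,
# and leave it disabled for FP8, as recommended above.
model_dtype = "float16"     # hypothetical; one of "float16", "bfloat16", "fp8"

if model_dtype in ("float16", "bfloat16"):
    build_cmd += ["--use_gemm_plugin", model_dtype]
# For FP8, no GEMM plugin flag is added.
```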

BERT Attention Plugin and Context Fused Multi-Head Attention

  • Arguments/Python Classes:

    • --bert_attention_plugin: Enables the BERT attention plugin

    • --context_fmha: Enables fused multi-head attention during the context phase

  • Explanation: The BERT attention plugin and context fused multi-head attention are recommended for the BERT model. They are enabled by default using the --bert_attention_plugin and --context_fmha arguments.

Runtime Options

GPT Model Type

  • Explanation: The GPT model type can be set to V1, inflight_batching, or inflight_fused_batching. It is recommended to use inflight_fused_batching to increase throughput and reduce latency.

Max Tokens in Paged KV Cache and KV Cache Free GPU Memory Fraction

  • Arguments/Python Classes:

    • max_tokens_in_paged_kv_cache: Sets the maximum number of tokens in the KV cache manager

    • kv_cache_free_gpu_mem_fraction: Sets the maximum fraction of GPU memory used for the KV cache

  • Explanation: These parameters control the maximum number of tokens handled by the KV cache manager. Setting them properly helps manage the available memory for the KV cache manager during inference. Increasing the memory available to the KV cache manager tends to improve achievable throughput. It is recommended to leave max_tokens_in_paged_kv_cache unset and test with a high value (e.g., 0.95) for kv_cache_free_gpu_mem_fraction to target high throughput.

Batch Scheduler Policy

There are two batch scheduler policies: MAX_UTILIZATION and GUARANTEED_NO_EVICT.

The MAX_UTILIZATION policy packs as many requests as possible at each iteration of the forward loop, maximizing GPU utilization but risking the need to pause requests if the KV cache size limit is reached.

The GUARANTEED_NO_EVICT policy guarantees that a started request is never paused.

If the goal is to maximize throughput, MAX_UTILIZATION should be tried, keeping in mind that it may impact latency if requests have to be paused.

TensorRT Overlap

When TensorRT overlap is enabled, available requests are partitioned into two micro-batches that can run concurrently, allowing TensorRT-LLM to hide exposed CPU runtime.

It is recommended to enable TensorRT overlap to increase throughput, but it may not provide performance benefits when the model size is not large enough to overlap the host overhead or when the number of requests is too small.
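Pulling the runtime options above together, the sketch below collects them in a plain Python dict. The key names mirror the options discussed in this section; how they are actually supplied (for example through the Triton TensorRT-LLM backend's configuration or the batch manager's API) depends on the deployment path and release, so treat the spelling of keys and values as assumptions.

```python
# Illustrative runtime configuration for high throughput. The exact key/value
# spelling depends on how the runtime is launched (Triton backend config,
# batch-manager API, ...), so treat this as a summary of the guidance above.
runtime_config = {
    "gpt_model_type": "inflight_fused_batching",   # recommended model type
    "batch_scheduler_policy": "max_utilization",   # or "guaranteed_no_evict" to never pause requests
    "kv_cache_free_gpu_mem_fraction": 0.95,        # high value; leave max_tokens_in_paged_kv_cache unset
    "enable_trt_overlap": True,                    # two micro-batches to hide CPU overhead
}
```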

Maximum Attention Window Size

  • Arguments/Python Classes:

    • max_attention_window_size: Sets the maximum number of tokens attended to when generating one token

  • Explanation: The max_attention_window_size flag sets the maximum number of tokens attended to when using techniques like sliding window attention. When set to a smaller value than max_input_length + max_output_length, only the KV cache of the last max_attention_window_size tokens will be stored, improving runtime performance at the expense of reduced accuracy. Users can modify this value to balance performance and accuracy.
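A back-of-the-envelope calculation shows why capping the attention window saves memory. The model dimensions below are hypothetical, and the formula assumes FP16 keys and values cached for every layer and every token in the window.

```python
# Rough KV-cache size per sequence as a function of the attention window.
# All model dimensions are hypothetical; the factor of 2 accounts for K and V.
num_layers, num_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                              # FP16
max_input_len, max_output_len = 4096, 1024

def kv_bytes_per_sequence(window_tokens: int) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * window_tokens

full = kv_bytes_per_sequence(max_input_len + max_output_len)   # no window cap
capped = kv_bytes_per_sequence(2048)                           # max_attention_window_size = 2048

print(f"{full / 2**20:.0f} MiB vs {capped / 2**20:.0f} MiB per sequence")
```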

Chunked Context

  • Arguments/Python Classes:

    • enable_chunked_context: Enables context chunking

    • max_num_tokens: Sets the maximum number of tokens

  • Explanation: Enabling context chunking by specifying enable_chunked_context increases the chance of batch processing between the context and generation phases, balancing the calculation amount of each iteration and increasing throughput. When context chunking is enabled, different performance can be obtained by adjusting max_num_tokens. The recommended value for max_num_tokens is N * tokens_per_block, where N is an integer starting from 1 and increased until the best performance is achieved.
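The tuning loop implied by that recommendation is sketched below. The tokens_per_block value and the benchmark helper are hypothetical placeholders for the engine's KV-cache block size and your own load test.

```python
# Sweep max_num_tokens over multiples of tokens_per_block with chunked context
# enabled, and keep the best-performing value, as suggested above.
tokens_per_block = 64    # assumed KV-cache block size of the engine

def benchmark(enable_chunked_context: bool, max_num_tokens: int) -> float:
    """Hypothetical stand-in for a real load test; should return e.g. tokens/s."""
    return 0.0  # replace with an actual measurement

best_tokens, best_throughput = None, float("-inf")
for n in range(1, 9):                        # N = 1, 2, ..., 8
    candidate = n * tokens_per_block
    throughput = benchmark(enable_chunked_context=True, max_num_tokens=candidate)
    if throughput > best_throughput:
        best_tokens, best_throughput = candidate, throughput

print("best max_num_tokens:", best_tokens)
```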

Conclusion

Tuning the performance of TensorRT-LLM models involves careful consideration of build options and runtime configurations.

By leveraging the optimization techniques and best practices discussed in this tutorial, such as the GPT attention plugin, input padding removal, paged KV cache, in-flight sequence batching, and various runtime options, you can significantly enhance the performance and efficiency of your TensorRT-LLM deployments.

Remember to experiment with different settings and configurations to find the optimal balance between performance, memory usage, and accuracy for your specific use case. The TensorRT-LLM library provides a wide range of options and flexibility to customize and optimize your LLM inference pipeline.

For more detailed information and advanced optimization techniques, refer to the TensorRT-LLM documentation and the NVIDIA NeMo framework.
