Optimisation Techniques

Short Message Generation

  • Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) to accelerate attention computation and reduce memory consumption.

  • Set a lower value for --max_num_tokens to allocate memory efficiently for shorter sequences.

  • Enable input padding removal (--remove_input_padding) to reduce computations and memory usage.

  • Use a smaller --embedding_dim to reduce the size of the embedding table, as short messages may not require high-dimensional embeddings.

  • Experiment with different --hidden_size values to find the optimal balance between model capacity and inference speed.

  • Enable the paged KV cache (--paged_kv_cache) to efficiently manage memory for the key-value cache.

  • Set a higher value for kv_cache_free_gpu_mem_fraction (a runtime batch-manager setting rather than a build flag) to allocate more memory to the KV cache, improving throughput. A combined build example follows this list.
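
Taken together, a short-message build might look like the sketch below. The checkpoint and output paths are placeholders, the embedding and hidden dimensions are properties of the checkpoint you convert rather than build-time switches, and flag names and accepted values differ between TensorRT-LLM releases, so check trtllm-build --help for your version.

# Hypothetical build for a chat-style model serving short prompts and replies.
# Paths are placeholders; flag syntax follows 0.9-era trtllm-build and may differ elsewhere.
trtllm-build \
    --checkpoint_dir ./llama2_ckpt_fp16 \
    --output_dir ./engines/llama2_short_msg \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --max_batch_size 64 \
    --max_input_len 256 \
    --max_output_len 256 \
    --max_num_tokens 4096   # small token budget, sized for short sequences

At serving time, kv_cache_free_gpu_mem_fraction (for example 0.9) is passed to the batch manager, the benchmark scripts, or the server configuration rather than to trtllm-build.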

Long Text Summarization

  • Enable multi-block mode (--multi_block_mode) to split the attention computation over long sequences across multiple thread blocks, keeping the GPU busy when generating from long inputs.

  • Increase the --max_num_tokens value to accommodate longer input and output sequences.

  • Use a larger --hidden_size and --ffn_hidden_size to increase the model's capacity for capturing long-range dependencies.

  • Enable chunked context (enable_chunked_context, a runtime setting) to split long prompts into chunks, which increases the chance of batching context-phase and generation-phase requests together and improves throughput.

  • Adjust the max_num_tokens value in chunked context mode to find the optimal balance between performance and memory usage.

  • Experiment with different values for max_attention_window_size to control the trade-off between runtime performance and accuracy when using sliding window attention.

  • Enable TensorRT overlap (a serving-side option, e.g. enable_trt_overlap in the batch manager or Triton backend) to hide CPU runtime overhead and improve throughput. A combined example follows this list.
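
A long-context summarisation build might be sketched as follows. Paths are placeholders; enable_chunked_context, max_attention_window_size and TensorRT overlap are configured where the engine is served rather than at build time, and hidden_size / ffn_hidden_size are fixed by the model checkpoint. Flag spellings vary across TensorRT-LLM releases.

# Hypothetical long-input summarisation engine: long context window, modest batch size.
trtllm-build \
    --checkpoint_dir ./llama2_ckpt_fp16 \
    --output_dir ./engines/llama2_summarise \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --multi_block_mode enable \
    --max_batch_size 8 \
    --max_input_len 8192 \
    --max_output_len 1024 \
    --max_num_tokens 16384   # larger budget so a few long prompts fill each batch

Keeping max_batch_size low while raising max_num_tokens reflects the workload: a handful of long requests already consumes the token budget, so a very large batch size mainly wastes activation memory.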

Code Generation and Completion

  • Use a larger --embedding_dim to capture the complexity and structure of code snippets.

  • Increase the --hidden_size and --num_attention_heads to enhance the model's ability to understand code semantics and generate accurate completions.

  • Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) for efficient attention computation.

  • Experiment with different activation functions (--hidden_act) to find the one that works best for code generation tasks (e.g., gelu or relu).

  • Use a smaller --max_num_tokens value to focus on generating concise and relevant code completions.

  • Enable horizontal fusion in Gated-MLP (--use_fused_mlp) to combine the gate and up-projection matrix multiplications into a single kernel, improving computational efficiency. An example build command follows this list.
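
A code-completion build could combine those options as sketched below. The embedding dimension, hidden size, head count and activation function come from the base checkpoint (they are architecture choices made when the model is trained and converted, not build-time switches), so only build-time flags appear here. The --use_fused_mlp syntax differs across releases (a bare flag in some, enable/disable in others), and the paths are placeholders.

# Hypothetical code-completion engine: moderate prompt length, short completions,
# fused Gated-MLP enabled. Check `trtllm-build --help` for your release.
trtllm-build \
    --checkpoint_dir ./codellama_ckpt_fp16 \
    --output_dir ./engines/codellama_completion \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --use_fused_mlp \
    --max_batch_size 32 \
    --max_input_len 2048 \
    --max_output_len 256 \
    --max_num_tokens 8192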

Language Translation

  • Use a larger --hidden_size and --ffn_hidden_size to capture the complexities of translation between languages.

  • Increase the --num_attention_heads to enable the model to attend to different aspects of the input and generate accurate translations.

  • Enable embedding sharing (--use_embedding_sharing) to share the embedding table between the input embedding layer and the output projection (lm_head), reducing memory usage.

  • Experiment with different values for --max_num_tokens to find the optimal balance between translation quality and inference speed.

  • Enable the custom AllReduce plugin (--use_custom_all_reduce) on NVLink-based systems to optimise communication between GPUs during tensor-parallel inference.

  • Use a suitable --dtype (e.g., float16 or bfloat16) to reduce memory consumption and improve computational efficiency without sacrificing translation quality. A combined multi-GPU example follows this list.
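
For a tensor-parallel translation deployment on an NVLink pair, the workflow might be sketched as below. Tensor parallelism (tp_size) and the dtype are chosen at checkpoint conversion, the engine is then built with the custom AllReduce plugin enabled, and inference is launched with mpirun. All paths are placeholders and exact flag names vary by TensorRT-LLM release.

# Step 1: hypothetical conversion, sharding the Hugging Face checkpoint two ways in bfloat16.
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama2-hf \
    --output_dir ./llama2_ckpt_tp2 \
    --dtype bfloat16 \
    --tp_size 2

# Step 2: build one engine per rank with the custom AllReduce plugin.
trtllm-build \
    --checkpoint_dir ./llama2_ckpt_tp2 \
    --output_dir ./engines/llama2_translate_tp2 \
    --gpt_attention_plugin bfloat16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --use_custom_all_reduce enable \
    --workers 2 \
    --max_num_tokens 8192

# Step 3: run both ranks under MPI for a quick smoke test.
mpirun -n 2 python3 examples/run.py \
    --engine_dir ./engines/llama2_translate_tp2 \
    --tokenizer_dir ./llama2-hf \
    --max_output_len 200 \
    --input_text "Translate to French: The weather is lovely today."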

These are just a few examples of how Llama2 can be optimised for different use cases with TensorRT-LLM's optimisation techniques.

The optimal configuration depends on your specific requirements, hardware setup, and the characteristics of your dataset.

Experiment with different combinations of optimisation flags and hyperparameters to find the sweet spot that maximises performance while maintaining the desired level of accuracy for each use case.

Keep an eye on the TensorRT-LLM documentation and the NVIDIA NeMo framework for updates and new optimisation techniques introduced in future releases.

Finally, measure the performance and quality of your optimised models using metrics and benchmarks relevant to each use case. This validates the effectiveness of your optimisations and supports informed decisions about the trade-offs between speed, memory usage, and accuracy.
