Optimisation Techniques

Short Message Generation

  • Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) to accelerate attention computation and reduce memory consumption.

  • Set a lower value for --max_num_tokens to allocate memory efficiently for shorter sequences.

  • Enable input padding removal (--remove_input_padding) to reduce computations and memory usage.

  • Use a smaller --embedding_dim to reduce the size of the embedding table, as short messages may not require high-dimensional embeddings.

  • Experiment with different --hidden_size values to find the optimal balance between model capacity and inference speed.

  • Enable the paged KV cache (--paged_kv_cache) to efficiently manage memory for the key-value cache.

  • Set a higher value for kv_cache_free_gpu_mem_fraction (a runtime batch-manager setting rather than a build flag) to allocate more memory to the KV cache, improving throughput. A combined build example follows this list.
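
Taken together, a short-message build might look like the sketch below. The checkpoint and output paths are placeholders, the embedding and hidden dimensions are properties of the checkpoint you convert rather than build-time switches, and flag names and accepted values differ between TensorRT-LLM releases, so check trtllm-build --help for your version.

# Hypothetical build for a chat-style model serving short prompts and replies.
# Paths are placeholders; flag syntax follows 0.9-era trtllm-build and may differ elsewhere.
trtllm-build \
    --checkpoint_dir ./llama2_ckpt_fp16 \
    --output_dir ./engines/llama2_short_msg \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --max_batch_size 64 \
    --max_input_len 256 \
    --max_output_len 256 \
    --max_num_tokens 4096   # small token budget, sized for short sequences

At serving time, kv_cache_free_gpu_mem_fraction (for example 0.9) is passed to the batch manager, the benchmark scripts, or the server configuration rather than to trtllm-build.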

Long Text Summarization

  • Enable multi-block mode (--multi_block_mode) to split the attention computation over long sequences across multiple thread blocks, keeping the GPU busy when generating from long inputs.

  • Increase the --max_num_tokens value to accommodate longer input and output sequences.

  • Use a larger --hidden_size and --ffn_hidden_size to increase the model's capacity for capturing long-range dependencies.

  • Enable chunked context (enable_chunked_context, a runtime setting) to split long prompts into chunks, which increases the chance of batching context-phase and generation-phase requests together and improves throughput.

  • Adjust the max_num_tokens value in chunked context mode to find the optimal balance between performance and memory usage.

  • Experiment with different values for max_attention_window_size to control the trade-off between runtime performance and accuracy when using sliding window attention.

  • Enable TensorRT overlap (a serving-side option, e.g. enable_trt_overlap in the batch manager or Triton backend) to hide CPU runtime overhead and improve throughput. A combined example follows this list.
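
A long-context summarisation build might be sketched as follows. Paths are placeholders; enable_chunked_context, max_attention_window_size and TensorRT overlap are configured where the engine is served rather than at build time, and hidden_size / ffn_hidden_size are fixed by the model checkpoint. Flag spellings vary across TensorRT-LLM releases.

# Hypothetical long-input summarisation engine: long context window, modest batch size.
trtllm-build \
    --checkpoint_dir ./llama2_ckpt_fp16 \
    --output_dir ./engines/llama2_summarise \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --multi_block_mode enable \
    --max_batch_size 8 \
    --max_input_len 8192 \
    --max_output_len 1024 \
    --max_num_tokens 16384   # larger budget so a few long prompts fill each batch

Keeping max_batch_size low while raising max_num_tokens reflects the workload: a handful of long requests already consumes the token budget, so a very large batch size mainly wastes activation memory.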

Code Generation and Completion

  • Use a larger --embedding_dim to capture the complexity and structure of code snippets.

  • Increase the --hidden_size and --num_attention_heads to enhance the model's ability to understand code semantics and generate accurate completions.

  • Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) for efficient attention computation.

  • Experiment with different activation functions (--hidden_act) to find the one that works best for code generation tasks (e.g., gelu or relu).

  • Use a smaller --max_num_tokens value to focus on generating concise and relevant code completions.

  • Enable horizontal fusion in Gated-MLP (--use_fused_mlp) to combine the gate and up-projection matrix multiplications into a single kernel, improving computational efficiency. An example build command follows this list.
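
A code-completion build could combine those options as sketched below. The embedding dimension, hidden size, head count and activation function come from the base checkpoint (they are architecture choices made when the model is trained and converted, not build-time switches), so only build-time flags appear here. The --use_fused_mlp syntax differs across releases (a bare flag in some, enable/disable in others), and the paths are placeholders.

# Hypothetical code-completion engine: moderate prompt length, short completions,
# fused Gated-MLP enabled. Check `trtllm-build --help` for your release.
trtllm-build \
    --checkpoint_dir ./codellama_ckpt_fp16 \
    --output_dir ./engines/codellama_completion \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --use_fused_mlp \
    --max_batch_size 32 \
    --max_input_len 2048 \
    --max_output_len 256 \
    --max_num_tokens 8192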

Language Translation

  • Use a larger --hidden_size and --ffn_hidden_size to capture the complexities of translation between languages.

  • Increase the --num_attention_heads to enable the model to attend to different aspects of the input and generate accurate translations.

  • Enable embedding sharing (--use_embedding_sharing) to share the embedding table between the input embedding layer and the output projection (lm_head), reducing memory usage.

  • Experiment with different values for --max_num_tokens to find the optimal balance between translation quality and inference speed.

  • Enable the custom AllReduce plugin (--use_custom_all_reduce) on NVLink-based systems to optimise communication between GPUs during tensor-parallel inference.

  • Use a suitable --dtype (e.g., float16 or bfloat16) to reduce memory consumption and improve computational efficiency without sacrificing translation quality. A combined multi-GPU example follows this list.
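
For a tensor-parallel translation deployment on an NVLink pair, the workflow might be sketched as below. Tensor parallelism (tp_size) and the dtype are chosen at checkpoint conversion, the engine is then built with the custom AllReduce plugin enabled, and inference is launched with mpirun. All paths are placeholders and exact flag names vary by TensorRT-LLM release.

# Step 1: hypothetical conversion, sharding the Hugging Face checkpoint two ways in bfloat16.
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama2-hf \
    --output_dir ./llama2_ckpt_tp2 \
    --dtype bfloat16 \
    --tp_size 2

# Step 2: build one engine per rank with the custom AllReduce plugin.
trtllm-build \
    --checkpoint_dir ./llama2_ckpt_tp2 \
    --output_dir ./engines/llama2_translate_tp2 \
    --gpt_attention_plugin bfloat16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --use_custom_all_reduce enable \
    --workers 2 \
    --max_num_tokens 8192

# Step 3: run both ranks under MPI for a quick smoke test.
mpirun -n 2 python3 examples/run.py \
    --engine_dir ./engines/llama2_translate_tp2 \
    --tokenizer_dir ./llama2-hf \
    --max_output_len 200 \
    --input_text "Translate to French: The weather is lovely today."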

These are just a few examples of how Llama2 can be optimised for different use cases with TensorRT-LLM's optimisation techniques.

The optimal configuration depends on your specific requirements, hardware setup, and the characteristics of your dataset.

Experiment with different combinations of optimisation flags and hyperparameters to find the sweet spot that maximises performance while maintaining the desired level of accuracy for each use case.

Keep an eye on the TensorRT-LLM documentation and the NVIDIA NeMo framework for updates and new optimisation techniques introduced in future releases.

Finally, measure the performance and quality of your optimised models using metrics and benchmarks relevant to each use case. This validates the effectiveness of your optimisations and supports informed decisions about the trade-offs between speed, memory usage, and accuracy.
