Optimisation Techniques
Short Message Generation
- Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) to accelerate attention computation and reduce memory consumption.
- Set a lower value for --max_num_tokens to allocate memory efficiently for shorter sequences.
- Enable input padding removal (--remove_input_padding) to reduce computations and memory usage.
- Use a smaller --embedding_dim to reduce the size of the embedding table, as short messages may not require high-dimensional embeddings.
- Experiment with different --hidden_size values to find the optimal balance between model capacity and inference speed.
- Enable the paged KV cache (--paged_kv_cache) to efficiently manage memory for the key-value cache.
- Set a higher value for kv_cache_free_gpu_mem_fraction to allocate more memory for the KV cache, improving throughput.
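As a rough illustration, the flags above can be combined into a single engine-build invocation. The checkpoint and output paths, the builder entry point (trtllm-build here), and the concrete flag values are assumptions that vary across TensorRT-LLM versions, so treat this as a sketch rather than a copy-paste command; kv_cache_free_gpu_mem_fraction is a runtime setting rather than a build flag.

```python
import subprocess

# Hypothetical paths -- substitute your own converted Llama2 checkpoint and output dir.
CHECKPOINT_DIR = "./llama2-7b-ckpt"
ENGINE_DIR = "./engines/llama2-short-msg"

# Flag names come from the list above; exact spellings/values can differ by
# TensorRT-LLM version, so verify against `trtllm-build --help` before running.
build_cmd = [
    "trtllm-build",
    "--checkpoint_dir", CHECKPOINT_DIR,
    "--output_dir", ENGINE_DIR,
    "--gpt_attention_plugin",        # GPT attention plugin
    "--context_fmha",                # fused multi-head attention
    "--remove_input_padding",        # drop padding tokens from batched inputs
    "--paged_kv_cache",              # paged KV-cache memory management
    "--max_num_tokens", "2048",      # keep this low for short messages (assumed value)
]
subprocess.run(build_cmd, check=True)

# kv_cache_free_gpu_mem_fraction is applied at runtime (e.g. in your serving
# config), not at build time; a higher value leaves more GPU memory for the KV cache.
runtime_kv_cache = {"kv_cache_free_gpu_mem_fraction": 0.9}
print(runtime_kv_cache)
```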
Long Text Summarization
- Enable multi-block mode (--multi_block_mode) to efficiently process longer input sequences.
- Increase the --max_num_tokens value to accommodate longer input and output sequences.
- Use a larger --hidden_size and --ffn_hidden_size to increase the model's capacity for capturing long-range dependencies.
- Enable chunked context (enable_chunked_context) to increase the chance of batch processing between the context and generation phases, improving throughput.
- Adjust the max_num_tokens value in chunked context mode to find the optimal balance between performance and memory usage.
- Experiment with different values for max_attention_window_size to control the trade-off between runtime performance and accuracy when using sliding window attention.
- Enable TensorRT overlap to hide CPU runtime overhead and improve throughput.
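One way to explore the max_num_tokens trade-off is to build a few engine variants and benchmark each. The sketch below reuses the same hypothetical trtllm-build entry point and paths as before; enable_chunked_context and max_attention_window_size are shown as runtime-side knobs, and all concrete values are assumptions to tune for your workload.

```python
import subprocess

CHECKPOINT_DIR = "./llama2-7b-ckpt"          # hypothetical converted checkpoint
BASE_FLAGS = [
    "--multi_block_mode",                     # multi-block attention for long inputs
    "--gpt_attention_plugin",
    "--paged_kv_cache",
]

# Candidate token budgets for the chunked-context trade-off; benchmark each variant.
for max_num_tokens in (4096, 8192, 16384):
    cmd = [
        "trtllm-build",
        "--checkpoint_dir", CHECKPOINT_DIR,
        "--output_dir", f"./engines/llama2-summarize-{max_num_tokens}",
        "--max_num_tokens", str(max_num_tokens),
        *BASE_FLAGS,
    ]
    subprocess.run(cmd, check=True)

# Runtime-side settings to experiment with when serving the engines (names follow
# the list above; where and how they are set depends on your serving stack).
runtime_settings = {
    "enable_chunked_context": True,       # overlap context and generation batches
    "max_attention_window_size": 4096,    # sliding-window size: speed vs. accuracy (assumed value)
}
print(runtime_settings)
```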
Code Generation and Completion
- Use a larger --embedding_dim to capture the complexity and structure of code snippets.
- Increase the --hidden_size and --num_attention_heads to enhance the model's ability to understand code semantics and generate accurate completions.
- Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) for efficient attention computation.
- Experiment with different activation functions (--hidden_act) to find the one that works best for code generation tasks (e.g., gelu or relu).
- Use a smaller --max_num_tokens value to focus on generating concise and relevant code completions.
- Enable horizontal fusion in Gated-MLP (--use_fused_mlp) to combine matrix multiplications and activations, improving computational efficiency.
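A simple way to run the activation-function experiment is to build one engine per candidate --hidden_act and compare completion quality and latency offline. Only the flag names come from the list above; the builder entry point, paths, and default values are assumptions.

```python
import subprocess

CHECKPOINT_DIR = "./llama2-code-ckpt"    # hypothetical code-tuned checkpoint

COMMON_FLAGS = [
    "--gpt_attention_plugin",            # GPT attention plugin
    "--context_fmha",                    # fused multi-head attention
    "--use_fused_mlp",                   # horizontal fusion in Gated-MLP
    "--max_num_tokens", "1024",          # small budget for concise completions (assumed value)
]

# Build one engine per activation function, then benchmark them against each other.
for act in ("gelu", "relu"):
    cmd = [
        "trtllm-build",
        "--checkpoint_dir", CHECKPOINT_DIR,
        "--output_dir", f"./engines/llama2-code-{act}",
        "--hidden_act", act,
        *COMMON_FLAGS,
    ]
    subprocess.run(cmd, check=True)
```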
Language Translation
- Use a larger --hidden_size and --ffn_hidden_size to capture the complexities of translation between languages.
- Increase the --num_attention_heads to enable the model to attend to different aspects of the input and generate accurate translations.
- Enable embedding sharing (--use_embedding_sharing) to share the embedding table between the source and target languages, reducing memory usage.
- Experiment with different values for --max_num_tokens to find the optimal balance between translation quality and inference speed.
- Enable the custom AllReduce plugin (--use_custom_all_reduce) on NVLink-based systems to optimize communication between GPUs during multi-GPU (tensor-parallel) inference.
- Use a suitable --dtype (e.g., float16 or bfloat16) to reduce memory consumption and improve computational efficiency without sacrificing translation quality.
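For translation, the --dtype choice is the easiest knob to A/B test. The helper below simply assembles the flag list described above for a given precision; the builder entry point, paths, and token budget are assumptions, and --use_custom_all_reduce only pays off on multi-GPU, NVLink-connected systems.

```python
import subprocess

def translation_build_flags(dtype: str, nvlink: bool = False) -> list[str]:
    """Assemble the build flags discussed above for a translation engine.

    `dtype` is typically "float16" or "bfloat16"; flag spellings may differ
    across TensorRT-LLM versions, so verify before use.
    """
    flags = [
        "--dtype", dtype,
        "--use_embedding_sharing",   # share the embedding table
        "--gpt_attention_plugin",
        "--max_num_tokens", "4096",  # tune for quality vs. speed (assumed value)
    ]
    if nvlink:
        flags.append("--use_custom_all_reduce")  # custom AllReduce on NVLink systems
    return flags

# Example: build a bfloat16 variant (paths are hypothetical).
cmd = [
    "trtllm-build",
    "--checkpoint_dir", "./llama2-translate-ckpt",
    "--output_dir", "./engines/llama2-translate-bf16",
    *translation_build_flags("bfloat16", nvlink=True),
]
subprocess.run(cmd, check=True)
```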
These are just a few examples of how you can optimise Llama2 for different use cases with TensorRT-LLM's optimisation techniques.
The optimal configuration may vary depending on your specific requirements, hardware setup, and the characteristics of your dataset.
It's important to experiment with different combinations of optimization flags and hyperparameters to find the sweet spot that maximises performance while maintaining the desired level of accuracy for each use case.
Additionally, keep an eye on the TensorRT-LLM documentation and the NVIDIA NeMo framework for updates and new optimization techniques that may be introduced in the future.
Remember to measure the performance and quality of your optimized models using appropriate metrics and benchmarks relevant to each use case.
This will help you validate the effectiveness of your optimizations and make informed decisions about the trade-offs between speed, memory usage, and accuracy.
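A lightweight way to quantify the speed side of those trade-offs is to time a fixed prompt set and report latency and rough throughput. The generate_fn below is a placeholder for whatever client you use to call the engine (Triton, the TensorRT-LLM Python runtime, etc.), not a real API; pair it with task-specific quality metrics for the full picture.

```python
import time
from statistics import mean

def benchmark(generate_fn, prompts):
    """Time each request and report mean latency and approximate throughput.

    `generate_fn(prompt)` is a placeholder for your own client call; it is
    expected to return the generated text for a single prompt.
    """
    latencies, total_tokens = [], 0
    for prompt in prompts:
        start = time.perf_counter()
        output = generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output.split())   # crude token proxy; use a real tokenizer if available
    return {
        "mean_latency_s": mean(latencies),
        "approx_tokens_per_s": total_tokens / sum(latencies),
    }

# Example with a stand-in generator (replace with a real engine client).
print(benchmark(lambda p: p[::-1], ["Translate this sentence.", "Summarise this article."]))
```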