Optimisation Techniques
Short Message Generation
- Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) to accelerate attention computation and reduce memory consumption.
- Set a lower value for --max_num_tokens to allocate memory efficiently for shorter sequences.
- Enable input padding removal (--remove_input_padding) to reduce computation and memory usage.
- Use a smaller --embedding_dim to reduce the size of the embedding table, as short messages may not require high-dimensional embeddings.
- Experiment with different --hidden_size values to find the optimal balance between model capacity and inference speed.
- Enable the paged KV cache (--paged_kv_cache) to manage key-value cache memory efficiently.
- Set a higher value for kv_cache_free_gpu_mem_fraction to allocate more memory to the KV cache and improve throughput, as sketched in the example below.
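As a rough illustration, the build-time flags above might be combined in a single trtllm-build invocation along the following lines. This is a minimal sketch: the checkpoint and output paths are placeholders, and flag spellings and accepted values (for example, whether a plugin takes a dtype or enable/disable) vary between TensorRT-LLM releases, so check trtllm-build --help for your version.

```bash
# Minimal sketch for a short-message engine; paths are placeholders and flag
# syntax may differ in your TensorRT-LLM release.
trtllm-build \
  --checkpoint_dir ./llama2_7b_checkpoint \
  --output_dir ./engines/llama2_short_messages \
  --gpt_attention_plugin float16 \
  --context_fmha enable \
  --remove_input_padding enable \
  --paged_kv_cache enable \
  --max_num_tokens 2048

# kv_cache_free_gpu_mem_fraction is a runtime setting rather than a build flag;
# it is typically raised (e.g. towards 0.9) in the runtime/server configuration
# to give the paged KV cache more GPU memory.
```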
Long Text Summarization
- Enable multi-block mode (--multi_block_mode) to process longer input sequences efficiently.
- Increase the --max_num_tokens value to accommodate longer input and output sequences.
- Use a larger --hidden_size and --ffn_hidden_size to increase the model's capacity for capturing long-range dependencies.
- Enable chunked context (enable_chunked_context) to increase the chance of batching the context and generation phases together, improving throughput.
- Adjust the max_num_tokens value in chunked-context mode to find the optimal balance between performance and memory usage.
- Experiment with different values for max_attention_window_size to control the trade-off between runtime performance and accuracy when using sliding window attention.
- Enable TensorRT overlap to hide CPU runtime overhead and improve throughput. A sketch combining these settings follows below.
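A comparable sketch for a summarisation engine, again purely illustrative: the larger --max_num_tokens value is an assumption for long inputs, and enable_chunked_context, max_attention_window_size, and TensorRT overlap are typically runtime or server-side options rather than build flags.

```bash
# Minimal sketch for a long-context summarisation engine; paths and values
# are assumptions, and flag syntax varies across TensorRT-LLM releases.
trtllm-build \
  --checkpoint_dir ./llama2_13b_checkpoint \
  --output_dir ./engines/llama2_summarization \
  --gpt_attention_plugin float16 \
  --context_fmha enable \
  --multi_block_mode enable \
  --paged_kv_cache enable \
  --max_num_tokens 16384

# enable_chunked_context, max_attention_window_size, and TensorRT overlap are
# usually configured at runtime (executor / Triton backend settings), not here.
```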
Code Generation and Completion
- Use a larger --embedding_dim to capture the complexity and structure of code snippets.
- Increase the --hidden_size and --num_attention_heads to enhance the model's ability to understand code semantics and generate accurate completions.
- Enable the GPT attention plugin (--gpt_attention_plugin) and fused multi-head attention (--context_fmha) for efficient attention computation.
- Experiment with different activation functions (--hidden_act) to find the one that works best for code generation tasks (e.g., gelu or relu).
- Use a smaller --max_num_tokens value to focus on generating concise and relevant code completions.
- Enable horizontal fusion in Gated-MLP (--use_fused_mlp) to combine matrix multiplications and activations, improving computational efficiency. See the sketch after this list.
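As before, a hedged sketch of how the build-time flags above could be combined; note that model-architecture settings such as --embedding_dim, --hidden_size, --num_attention_heads, and --hidden_act are normally fixed when the checkpoint is created or converted rather than at engine-build time.

```bash
# Minimal sketch for a code-completion engine; paths and values are placeholders
# and flag syntax may differ in your TensorRT-LLM release.
trtllm-build \
  --checkpoint_dir ./code_llama_checkpoint \
  --output_dir ./engines/code_completion \
  --gpt_attention_plugin float16 \
  --context_fmha enable \
  --use_fused_mlp enable \
  --max_num_tokens 4096
```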
Language Translation
- Use a larger --hidden_size and --ffn_hidden_size to capture the complexities of translation between languages.
- Increase --num_attention_heads so the model can attend to different aspects of the input and generate accurate translations.
- Enable embedding sharing (--use_embedding_sharing) to share the embedding table between the input embedding and the output projection, reducing memory usage.
- Experiment with different values for --max_num_tokens to find the optimal balance between translation quality and inference speed.
- Enable the custom AllReduce plugin (--use_custom_all_reduce) on NVLink-based systems to optimize communication between GPUs during multi-GPU inference.
- Use a suitable --dtype (e.g., float16 or bfloat16) to reduce memory consumption and improve computational efficiency without sacrificing translation quality. A combined sketch follows below.
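Finally, a sketch for a multi-GPU translation setup. The two-way tensor parallelism, the bfloat16 choice, and the flag placement are assumptions for illustration; --dtype and --use_embedding_sharing typically belong to the checkpoint-conversion step, and script names and paths are placeholders.

```bash
# Minimal sketch: convert the checkpoint with bf16 weights and a shared
# embedding table, then build a 2-GPU engine. Flag syntax varies by release.
python convert_checkpoint.py \
  --model_dir ./llama2_7b_hf \
  --output_dir ./llama2_7b_tp2_bf16 \
  --dtype bfloat16 \
  --tp_size 2 \
  --use_embedding_sharing

trtllm-build \
  --checkpoint_dir ./llama2_7b_tp2_bf16 \
  --output_dir ./engines/llama2_translation \
  --gpt_attention_plugin bfloat16 \
  --use_custom_all_reduce enable \
  --max_num_tokens 8192
```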
These are just a few examples of how you can optimise Llama2 for different use cases using TensorRT-LLM's optimisation techniques.
The optimal configuration may vary depending on your specific requirements, hardware setup, and the characteristics of your dataset.
It's important to experiment with different combinations of optimization flags and hyperparameters to find the sweet spot that maximises performance while maintaining the desired level of accuracy for each use case.
Additionally, keep an eye on the TensorRT-LLM documentation and the NVIDIA NeMo framework for updates and new optimization techniques that may be introduced in the future.
Remember to measure the performance and quality of your optimized models using appropriate metrics and benchmarks relevant to each use case.
This will help you validate the effectiveness of your optimizations and make informed decisions about the trade-offs between speed, memory usage, and accuracy.