Build arguments
The build script supports the following arguments:
Configuration | Explanation | Suggested Values
---|---|---
max_input_len | The maximum length of the input sequence. | 512 - 1024 (depending on the model architecture and available GPU memory) |
max_output_len | The maximum length of the output sequence. | 256 - 512 (depending on the desired output length and available GPU memory) |
max_batch_size | The maximum batch size for the engine. | 1 - 32 (depending on the available GPU memory and desired throughput) |
max_beam_width | The maximum beam width for beam search during generation. | 1 - 8 (higher values can improve output quality but increase computational cost) |
max_num_tokens | The maximum number of batched input tokens per engine step, after padding is removed. | Depends on max_batch_size and max_input_len; larger values increase activation memory but allow more tokens to be processed per step
opt_num_tokens | The number of batched tokens the engine is optimized for. | Set near the expected per-step token count (a trade-off between optimizing for typical load and peak load)
max_prompt_embedding_table_size | The maximum size of the prompt embedding table (for prompt tuning). | 0 - 10000 (depending on the number of prompt templates used for prompt tuning) |
gather_context_logits | Whether to gather context logits during generation. | False (set to True for debugging or analysis purposes) |
gather_generation_logits | Whether to gather generation logits during generation. | False (set to True for debugging or analysis purposes) |
strongly_typed | Whether to use strongly typed TensorRT networks. | True (enables additional optimizations and error checking) |
builder_opt | The optimization level for the TensorRT builder. | 3 (default value, higher values may result in longer build times but potentially better performance) |
profiling_verbosity | The verbosity level for TensorRT profiling. | "layer_names_only" (provides a good balance between profiling information and readability) |
enable_debug_output | Whether to enable debug output for the TensorRT network. | False (set to True for debugging purposes) |
max_draft_len | The maximum length of the draft sequence (for Medusa models). | 0 - 200 (depending on the desired draft length for Medusa models) |
use_refit | Whether to use the refit feature for multi-GPU building. | False (set to True for multi-GPU builds to reduce build time) |
input_timing_cache | The path to the input timing cache file. | "timing_cache.bin" (provide a path to a previously generated timing cache file to speed up the build process) |
output_timing_cache | The path to the output timing cache file. | "output_cache.bin" (provide a path to store the generated timing cache for future builds) |
lora_config | A LoRA configuration specifying the LoRA modules and weights to apply. | Depends on the specific LoRA adaptation requirements and available pre-trained LoRA weights
auto_parallel_config | An auto-parallelism configuration for sharding the model across GPUs. | Depends on the number of available GPUs and the desired trade-off between build time and inference performance
weight_sparsity | Whether to enable weight sparsity for the engine. | False (set to True if the model weights are sparse and you want to optimize for storage and computation) |
plugin_config | A plugin configuration specifying which TensorRT plugins to enable. | Depends on the specific requirements and available custom TensorRT plugins
max_encoder_input_len | The maximum length of the encoder input sequence (for encoder-decoder models). | 512 - 1024 (depending on the encoder architecture and available GPU memory) |
use_fused_mlp | Whether to use fused MLP layers for optimization. | True (can improve performance by reducing memory accesses and kernel launches) |
dry_run | Whether to perform a dry run without actually building the engine. | False (set to True for testing purposes or to validate the build configuration without spending time on the actual build) |
visualize_network | Whether to visualize the TensorRT network as a DOT graph. | False (set to True for debugging or understanding the network structure) |
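As a concrete illustration, the arguments above are typically passed as flags to the `trtllm-build` command line. The sketch below is a minimal, unverified example: the checkpoint and output paths are placeholders, and flag names and defaults vary between TensorRT-LLM releases, so check the version you have installed.

```shell
# A sketch, not a verified command. ./ckpt and ./engine are placeholder
# paths for a converted model checkpoint and the engine output directory.
trtllm-build \
  --checkpoint_dir ./ckpt \
  --output_dir ./engine \
  --max_input_len 1024 \
  --max_batch_size 8 \
  --max_beam_width 1 \
  --input_timing_cache timing_cache.bin \
  --output_timing_cache output_cache.bin
```

Reusing the timing cache written to `output_cache.bin` as `--input_timing_cache` on subsequent builds can noticeably shorten build time, since TensorRT skips re-profiling kernels it has already timed.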