Build arguments
In this script, each argument is listed with its description and a suggested value or range:
max_input_len
The maximum length of the input sequence, in tokens.
512 - 1024 (depending on the model architecture and available GPU memory)
max_output_len
The maximum length of the output sequence, in tokens.
256 - 512 (depending on the desired output length and available GPU memory)
max_batch_size
The maximum batch size for the engine.
1 - 32 (depending on the available GPU memory and desired throughput)
max_beam_width
The maximum beam width for beam search during generation.
1 - 8 (higher values can improve output quality but increase computational cost)
max_num_tokens
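The maximum number of batched input tokens the engine processes in one pass, after padding is removed.
Sized to the total number of input tokens expected across one batch and bounded by available GPU memory, rather than to the desired output length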
opt_num_tokens
The number of batched tokens the engine is tuned for.
Set as close as possible to the typical number of tokens per batch in the actual workload; it should not exceed max_num_tokens
max_prompt_embedding_table_size
The maximum size of the prompt embedding table (for prompt tuning).
0 - 10000 (depending on the number of prompt templates used for prompt tuning)
gather_context_logits
Whether to return the logits for every input (context) token rather than only the last one.
False (set to True for debugging or analysis purposes)
gather_generation_logits
Whether to return the logits for each generated token.
False (set to True for debugging or analysis purposes)
strongly_typed
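Whether to build a strongly typed TensorRT network, in which tensor data types are taken from the network definition instead of being chosen by the builder.
True (precision follows the data types declared in the model, which can also shorten the build)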
builder_opt
The optimization level for the TensorRT builder, from 0 to 5.
3 (the default; higher values lengthen the build but may give better performance)
profiling_verbosity
The amount of layer information TensorRT keeps in the engine for profiling and inspection.
"layer_names_only" (a good balance between profiling information and readability; "detailed" and "none" are the other options)
enable_debug_output
Whether to enable debug output for the TensorRT network.
False (set to True for debugging purposes)
max_draft_len
The maximum length of the draft sequence (for Medusa models).
0 - 200 (depending on the desired draft length for Medusa models)
use_refit
Whether to use the TensorRT refit feature (replacing the weights of an already built engine instead of rebuilding it) for multi-GPU building.
False (set to True for multi-GPU builds to reduce build time)
input_timing_cache
The path to an existing timing cache file that the builder reads at the start of the build.
"timing_cache.bin" (point this at a timing cache written by an earlier build to speed up the build process)
output_timing_cache
The path where the builder writes the timing cache at the end of the build.
"output_cache.bin" (the stored cache can be passed as input_timing_cache in future builds)
lora_config
A LoraBuildConfig object specifying LoRA configuration for the model.
Depends on the specific LoRA adaptation requirements and available pre-trained LoRA weights
auto_parallel_config
An AutoParallelConfig object specifying auto-parallel configuration for the model.
Depends on the number of available GPUs and the desired trade-off between build time and inference performance
weight_sparsity
Whether to enable weight sparsity for the engine.
False (set to True only if the model weights follow a structured sparsity pattern that TensorRT can exploit, such as 2:4, and you want to optimize for storage and computation)
plugin_config
A PluginConfig object specifying plugin configuration for the engine.
Depends on the specific requirements and available custom TensorRT plugins
max_encoder_input_len
The maximum length of the encoder input sequence (for encoder-decoder models).
512 - 1024 (depending on the encoder architecture and available GPU memory)
use_fused_mlp
Whether to use fused MLP layers for optimization.
True (can improve performance by reducing memory accesses and kernel launches)
dry_run
Whether to perform a dry run without actually building the engine.
False (set to True for testing purposes or to validate the build configuration without spending time on the actual build)
visualize_network
Whether to visualize the TensorRT network as a DOT graph.
False (set to True for debugging or understanding the network structure)
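To make the suggested values above easier to check before starting a long build, the arguments can be collected in one place and sanity-checked. The following is a minimal sketch using only the Python standard library. The class BuildArgsSketch and its validate method are hypothetical names introduced for illustration and are not the library's own configuration class. Only a subset of the arguments is mirrored, the defaults are picked from the suggested ranges above, and the accepted builder_opt range (0 to 5) and profiling_verbosity values are assumptions about the underlying builder.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

# Assumed set of accepted profiling verbosity values.
_VERBOSITY_LEVELS = ("layer_names_only", "detailed", "none")


@dataclass
class BuildArgsSketch:
    """Hypothetical stand-in that mirrors some of the build arguments above.

    This is NOT the library's own class; it only illustrates how the
    suggested values could be gathered and checked before a build.
    """
    max_input_len: int = 1024
    max_output_len: int = 512
    max_batch_size: int = 8
    max_beam_width: int = 1
    max_num_tokens: Optional[int] = None
    opt_num_tokens: Optional[int] = None
    max_prompt_embedding_table_size: int = 0
    gather_context_logits: bool = False
    gather_generation_logits: bool = False
    strongly_typed: bool = True
    builder_opt: int = 3
    profiling_verbosity: str = "layer_names_only"
    max_draft_len: int = 0
    input_timing_cache: Optional[str] = None
    output_timing_cache: Optional[str] = "timing_cache.bin"
    dry_run: bool = False

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none found)."""
        problems = []
        # Shape limits must be positive.
        for name in ("max_input_len", "max_output_len",
                     "max_batch_size", "max_beam_width"):
            if getattr(self, name) < 1:
                problems.append(f"{name} must be >= 1")
        # Sizes that may legitimately be zero but never negative.
        for name in ("max_prompt_embedding_table_size", "max_draft_len"):
            if getattr(self, name) < 0:
                problems.append(f"{name} must be >= 0")
        # Assumed builder optimization level range.
        if not 0 <= self.builder_opt <= 5:
            problems.append("builder_opt must be between 0 and 5")
        if self.profiling_verbosity not in _VERBOSITY_LEVELS:
            problems.append(
                f"profiling_verbosity must be one of {_VERBOSITY_LEVELS}")
        # When both token counts are given, the tuning target cannot
        # exceed the maximum.
        if (self.max_num_tokens is not None
                and self.opt_num_tokens is not None
                and self.opt_num_tokens > self.max_num_tokens):
            problems.append("opt_num_tokens must not exceed max_num_tokens")
        return problems


if __name__ == "__main__":
    args = BuildArgsSketch(max_batch_size=16, max_beam_width=4)
    issues = args.validate()
    if issues:
        print("Invalid build arguments:", issues)
    else:
        # asdict() yields a plain dict that could be logged or serialized.
        print(asdict(args))
```

Returning a list of problems instead of raising on the first one lets all misconfigured arguments be reported at once, which is convenient when a single build can take a long time.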