Build arguments

The script accepts the following build arguments:

| Configuration | Explanation | Suggested Values |
| --- | --- | --- |
| max_input_len | The maximum length of the input sequence, in tokens. | 512 - 1024 (depending on the model architecture and available GPU memory) |
| max_output_len | The maximum length of the output sequence, in tokens. | 256 - 512 (depending on the desired output length and available GPU memory) |
| max_batch_size | The maximum batch size the engine supports. | 1 - 32 (depending on the available GPU memory and desired throughput) |
| max_beam_width | The maximum beam width for beam search during generation. | 1 - 8 (higher values can improve output quality but increase computational cost) |
| max_num_tokens | The maximum number of batched input tokens in each batch, counted after padding is removed. This is a per-batch budget, not a limit on how many tokens are generated. | Depends on max_batch_size, max_input_len, and available GPU memory |
| opt_num_tokens | The number of batched input tokens (after padding is removed) that the engine is tuned for. | As close as possible to the typical token count per batch in your workload |
| max_prompt_embedding_table_size | The maximum size of the prompt embedding table (for prompt tuning). | 0 - 10000 (depending on the number of prompt templates used for prompt tuning) |
| gather_context_logits | Whether to gather the logits computed for the context (prompt) tokens. | False (set to True for debugging or analysis purposes) |
| gather_generation_logits | Whether to gather the logits computed for the generated tokens. | False (set to True for debugging or analysis purposes) |
| strongly_typed | Whether to build a strongly typed TensorRT network. | True (TensorRT then follows the tensor types declared in the network instead of selecting precisions itself) |
| builder_opt | The optimization level for the TensorRT builder. | 3 (default value; higher values may result in longer build times but potentially better performance) |
| profiling_verbosity | The verbosity level for TensorRT profiling. | "layer_names_only" (provides a good balance between profiling information and readability) |
| enable_debug_output | Whether to enable debug output for the TensorRT network. | False (set to True for debugging purposes) |
| max_draft_len | The maximum length of the draft sequence (for Medusa models). | 0 - 200 (depending on the desired draft length for Medusa models) |
| use_refit | Whether to build the engine with TensorRT's refit feature, which allows weights to be replaced without a full rebuild. | False (set to True for multi-GPU builds to reduce build time) |
| input_timing_cache | The path to the timing cache file to read. | "timing_cache.bin" (provide a path to a previously generated timing cache file to speed up the build process) |
| output_timing_cache | The path where the timing cache file is written. | "output_cache.bin" (provide a path to store the generated timing cache for future builds) |
| lora_config | A LoraBuildConfig object specifying the LoRA configuration for the model. | Depends on the specific LoRA adaptation requirements and available pre-trained LoRA weights |
| auto_parallel_config | An AutoParallelConfig object specifying the auto-parallel configuration for the model. | Depends on the number of available GPUs and the desired trade-off between build time and inference performance |
| weight_sparsity | Whether to enable weight sparsity for the engine. | False (set to True if the model weights are sparse and you want to optimize for storage and computation) |
| plugin_config | A PluginConfig object specifying the plugin configuration for the engine. | Depends on the specific requirements and available custom TensorRT plugins |
| max_encoder_input_len | The maximum length of the encoder input sequence (for encoder-decoder models). | 512 - 1024 (depending on the encoder architecture and available GPU memory) |
| use_fused_mlp | Whether to use fused MLP layers for optimization. | True (can improve performance by reducing memory accesses and kernel launches) |
| dry_run | Whether to perform a dry run without actually building the engine. | False (set to True to test or validate the build configuration without spending time on the actual build) |
| visualize_network | Whether to visualize the TensorRT network as a DOT graph. | False (set to True for debugging or understanding the network structure) |
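
Several of the size arguments listed above constrain one another, so it can help to check them together before starting a long build. The following is a minimal sketch, not the library's own code: `BuildArgsSketch` and `resolve_token_budget` are hypothetical names introduced here for illustration, and the two default rules (a padded upper bound for `max_num_tokens`, and `max_batch_size * max_beam_width` for `opt_num_tokens`) are assumptions you should verify against the version you are building with.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class BuildArgsSketch:
    """Hypothetical stand-in for a subset of the build arguments above."""
    max_input_len: int = 512
    max_output_len: int = 256
    max_batch_size: int = 8
    max_beam_width: int = 1
    max_num_tokens: Optional[int] = None  # None means "derive a default"
    opt_num_tokens: Optional[int] = None  # None means "derive a default"


def resolve_token_budget(args: BuildArgsSketch) -> BuildArgsSketch:
    """Fill in unset token budgets and reject inconsistent combinations."""
    for name in ("max_input_len", "max_output_len",
                 "max_batch_size", "max_beam_width"):
        if getattr(args, name) < 1:
            raise ValueError(f"{name} must be at least 1")

    # Assumption: a fully padded batch holds at most this many input tokens,
    # so it is a safe (if memory-hungry) default for max_num_tokens.
    max_num_tokens = args.max_num_tokens
    if max_num_tokens is None:
        max_num_tokens = args.max_batch_size * args.max_input_len

    # Assumption: default the tuning point to one token per sequence per beam.
    opt_num_tokens = args.opt_num_tokens
    if opt_num_tokens is None:
        opt_num_tokens = args.max_batch_size * args.max_beam_width

    if opt_num_tokens > max_num_tokens:
        raise ValueError("opt_num_tokens must not exceed max_num_tokens")

    return replace(args, max_num_tokens=max_num_tokens,
                   opt_num_tokens=opt_num_tokens)


if __name__ == "__main__":
    resolved = resolve_token_budget(
        BuildArgsSketch(max_input_len=1024, max_batch_size=4, max_beam_width=2)
    )
    print(resolved)
```

Running the check first is cheap compared with a failed engine build. In practice you would likely set `max_num_tokens` below the padded upper bound, since real batches rarely consist entirely of maximum-length inputs.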
