trtllm-build CLI configurations
The trtllm-build command-line tool is part of the TensorRT-LLM framework. It is used to build TensorRT engines for large language models (LLMs) with the TensorRT-LLM library.
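Before the option-by-option reference, a minimal sketch of a typical invocation may help. The checkpoint path, output path, and size limits below are illustrative assumptions, not defaults; each flag is documented in the list that follows.

```
# A minimal sketch of a typical build, assuming a checkpoint already
# converted to the TensorRT-LLM checkpoint format.
# Paths and size limits are illustrative assumptions.
trtllm-build \
  --checkpoint_dir ./llama-7b-ckpt \
  --output_dir ./engines/llama-7b \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_output_len 512
```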
Here are the configuration options:
Options
--checkpoint_dir: Specifies the directory containing the model checkpoint files.
--model_config: Specifies the path to the model configuration file.
--build_config: Specifies the path to the build configuration file.
--model_cls_file: Specifies the path to the Python file containing the model class definition.
--model_cls_name: Specifies the name of the model class within the specified model class file.
--input_timing_cache: Specifies the path to read the timing cache file. It is ignored if the file does not exist.
--output_timing_cache: Specifies the path to write the timing cache file.
--log_level: Sets the logging level for the tool.
--profiling_verbosity: Specifies the profiling verbosity for the generated TensorRT engine. Options are "layer_names_only", "detailed", or "none".
--enable_debug_output: Enables debug output during the build process.
--output_dir: Specifies the path to save the serialized engine files and model configurations.
--workers: Specifies the number of workers for building engines in parallel.
--max_batch_size: Sets the maximum batch size for the model.
--max_input_len: Sets the maximum input sequence length.
--max_output_len: Sets the maximum output sequence length.
--max_beam_width: Sets the maximum beam width for beam search decoding.
--max_num_tokens: Sets the maximum total number of batched input tokens the engine processes per forward pass after padding is removed.
--opt_num_tokens: Specifies the optimized number of tokens, which should be set as close as possible to the actual number of tokens in the workload.
--tp_size: Specifies the tensor parallelism size (see the multi-GPU example after this list).
--pp_size: Specifies the pipeline parallelism size.
--max_prompt_embedding_table_size or --max_multimodal_len: Enables support for prompt tuning or multimodal input when set to a value greater than 0.
--use_fused_mlp: Enables horizontal fusion in GatedMLP to reduce layer input traffic and potentially improve performance.
--gather_all_token_logits: Enables both gather_context_logits and gather_generation_logits.
--gather_context_logits: Enables gathering of context logits.
--gather_generation_logits: Enables gathering of generation logits.
--strongly_typed: Enables strongly typed optimization to reduce engine build time. This option requires TensorRT 9.1.0.1 or later.
--builder_opt: Specifies the builder optimization level.
--logits_dtype: Specifies the data type for logits. Options are "float16" or "float32".
--weight_only_precision: Specifies the precision for weight-only quantization. Options are "int8" or "int4".
--weight_sparsity: Enables weight sparsity optimization.
--max_draft_len: Specifies the maximum length of draft tokens for speculative decoding in the target model.
--lora_dir: Specifies one or more directories containing LoRA (Low-Rank Adaptation) weights. If multiple directories are provided, the configuration from the first directory is used.
--lora_ckpt_source: Specifies the source of the LoRA checkpoint. Options are "hf" (Hugging Face) or "nemo".
--lora_target_modules: Specifies the modules to which LoRA adaptation is applied. Options include various attention and MLP modules.
--max_lora_rank: Specifies the maximum LoRA rank across LoRA modules. It is used to compute the workspace size of the LoRA plugin.
--auto_parallel: Specifies the MPI world size for auto-parallel execution.
--gpus_per_node: Specifies the number of GPUs each node has in a multi-node setup. This is a cluster specification and can be greater or smaller than the world size.
--cluster_key: Specifies the unique name for the target GPU type. It is inferred from the current GPU type if not specified. Options include various NVIDIA GPU models.
--max_encoder_input_len: Specifies the maximum encoder input length when using encoder-decoder models. Setting max_input_len to 1 starts generation from a decoder_start_token_id of length 1.
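As a sketch of how the parallelism options combine, the following assumes a checkpoint that was already converted with a matching 2-way tensor-parallel, 2-way pipeline-parallel split (4 ranks total); the paths and size limits are illustrative assumptions:

```
# Hypothetical multi-GPU build: tp_size x pp_size = 4 engine ranks.
# Assumes the checkpoint was converted with the same tp/pp split.
trtllm-build \
  --checkpoint_dir ./llama-70b-tp2-pp2-ckpt \
  --output_dir ./engines/llama-70b \
  --tp_size 2 \
  --pp_size 2 \
  --workers 4 \
  --max_batch_size 16 \
  --max_input_len 4096 \
  --max_output_len 1024
```

Here --workers 4 builds the four per-rank engines in parallel rather than one after another.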
Plugin Configuration Options
The help message also includes a section for plugin configuration options.
Each option corresponds to a specific plugin and allows enabling or disabling the plugin and specifying its data type (float16, float32, or bfloat16).
Some notable plugin options include:
--bert_attention_plugin: Configures the BERT attention plugin.
--gpt_attention_plugin: Configures the GPT attention plugin.
--gemm_plugin: Configures the GEMM (General Matrix Multiplication) plugin.
--nccl_plugin: Configures the NCCL (NVIDIA Collective Communications Library) plugin.
--lookup_plugin: Configures the lookup table plugin.
--lora_plugin: Configures the LoRA plugin.
--moe_plugin: Configures the MoE (Mixture of Experts) plugin.
--mamba_conv1d_plugin: Configures the Conv1D plugin used by Mamba (state-space) model architectures.
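For example, a float16 model would typically enable the attention and GEMM plugins in the matching data type. The paths below are placeholders:

```
# Illustrative plugin configuration for a float16 model;
# checkpoint and output paths are placeholders.
trtllm-build \
  --checkpoint_dir ./model-ckpt \
  --output_dir ./engines/model \
  --gpt_attention_plugin float16 \
  --gemm_plugin float16
```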
Other plugin options enable or disable features such as context FMHA (fused multi-head attention), paged key-value cache, input padding removal, custom all-reduce, multi-block mode, XQA kernels (optimized generation-phase attention for multi-query and grouped-query attention), half-precision attention QK accumulation, paged context FMHA, FP8 context FMHA, context FMHA for generation, multiple profiles, paged state, and streaming LLM.
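In recent TensorRT-LLM releases these toggles generally take enable or disable values. The exact flag names vary by version, so treat the following as an assumption and verify it against trtllm-build --help:

```
# Assumed toggle flag names (verify with `trtllm-build --help`
# for your TensorRT-LLM version); paths are placeholders.
trtllm-build \
  --checkpoint_dir ./model-ckpt \
  --output_dir ./engines/model \
  --gpt_attention_plugin float16 \
  --remove_input_padding enable \
  --context_fmha enable \
  --paged_kv_cache enable
```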
The trtllm-build tool provides a wide range of options and configurations for building TensorRT engines for large language models. It allows customization of model parameters, parallelism settings, quantization, LoRA adaptation, plugin configurations, and various optimization techniques. The help message serves as a comprehensive reference for understanding and using the available options when building TensorRT engines for specific LLM use cases.