trtllm-build CLI configurations

The trtllm-build command-line tool is part of the TensorRT-LLM framework.

It compiles converted model checkpoints into serialized TensorRT engines for large language models (LLMs).
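
As a minimal sketch of a typical invocation (the checkpoint and engine paths are hypothetical, and the exact flags available depend on the installed TensorRT-LLM version):

trtllm-build --checkpoint_dir ./llama-2-7b-checkpoint \
             --output_dir ./llama-2-7b-engine \
             --gemm_plugin float16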

Here are the configuration options (a fuller example combining several of them follows the list):

Options

  • --checkpoint_dir: Specifies the directory containing the model checkpoint files.

  • --model_config: Specifies the path to the model configuration file.

  • --build_config: Specifies the path to the build configuration file.

  • --model_cls_file: Specifies the path to the Python file containing the model class definition.

  • --model_cls_name: Specifies the name of the model class within the specified model class file.

  • --input_timing_cache: Specifies the path to read the timing cache file. It will be ignored if the file does not exist.

  • --output_timing_cache: Specifies the path to write the timing cache file.

  • --log_level: Sets the logging level for the tool.

  • --profiling_verbosity: Specifies the profiling verbosity for the generated TensorRT engine. Options are "layer_names_only", "detailed", or "none".

  • --enable_debug_output: Enables debug output during the build process.

  • --output_dir: Specifies the path to save the serialized engine files and model configurations.

  • --workers: Specifies the number of workers for building engines in parallel.

  • --max_batch_size: Sets the maximum batch size for the model.

  • --max_input_len: Sets the maximum input sequence length.

  • --max_output_len: Sets the maximum output sequence length.

  • --max_beam_width: Sets the maximum beam width for beam search decoding.

  • --max_num_tokens: Sets the maximum total number of tokens (across all sequences in a batch, after padding removal) that the engine can process in a single iteration.

  • --opt_num_tokens: Specifies the optimised number of tokens, which should be set as close as possible to the actual number of tokens in the workload.

  • --tp_size: Specifies the tensor parallelism size.

  • --pp_size: Specifies the pipeline parallelism size.

  • --max_prompt_embedding_table_size or --max_multimodal_len: Enables support for prompt tuning or multimodal input when set to a value greater than 0.

  • --use_fused_mlp: Enables horizontal fusion in GatedMLP to reduce layer input traffic and potentially improve performance.

  • --gather_all_token_logits: Enables both gather_context_logits and gather_generation_logits.

  • --gather_context_logits: Enables gathering of context logits.

  • --gather_generation_logits: Enables gathering of generation logits.

  • --strongly_typed: Enables the strongly typed network definition, which can reduce engine build time. This option requires TensorRT 9.1.0.1 or later.

  • --builder_opt: Specifies the builder optimization level.

  • --logits_dtype: Specifies the data type for logits. Options are "float16" or "float32".

  • --weight_only_precision: Specifies the precision for weight-only quantization. Options are "int8" or "int4".

  • --weight_sparsity: Enables weight sparsity optimization.

  • --max_draft_len: Specifies the maximum length of draft tokens for speculative decoding in the target model.

  • --lora_dir: Specifies the directory(ies) containing LoRA (Low-Rank Adaptation) weights. If multiple directories are provided, the configuration from the first directory will be used.

  • --lora_ckpt_source: Specifies the source of the LoRA checkpoint. Options are "hf" (Hugging Face) or "nemo".

  • --lora_target_modules: Specifies the modules to which LoRA adaptation is applied. Options include various attention and MLP modules.

  • --max_lora_rank: Specifies the maximum LoRA rank for different LoRA modules. It is used to compute the workspace size of the LoRA plugin.

  • --auto_parallel: Specifies the MPI world size for auto-parallel execution.

  • --gpus_per_node: Specifies the number of GPUs each node has in a multi-node setup. This is a cluster specification and can be greater or smaller than the world size.

  • --cluster_key: Specifies the unique name for the target GPU type. It is inferred from the current GPU type if not specified. Options include various NVIDIA GPU models.

  • --max_encoder_input_len: Specifies the maximum encoder input length for encoder-decoder models. When using this option, set --max_input_len to 1 so that generation starts from a decoder_start_token_id of length 1.
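
The sketch below combines several of the options above into a single command for a hypothetical two-GPU, tensor-parallel build; the paths and numeric values are illustrative only:

trtllm-build --checkpoint_dir ./llama-2-13b-tp2-checkpoint \
             --output_dir ./llama-2-13b-tp2-engine \
             --max_batch_size 8 \
             --max_input_len 2048 \
             --max_output_len 512 \
             --max_beam_width 1 \
             --workers 2 \
             --logits_dtype float16

In most workflows the parallel mapping is already fixed when the checkpoint is converted, so values such as --tp_size and --pp_size, where used, should match the conversion settings.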

Plugin Configuration Options

  • The help message also includes a section for plugin configuration options (an example invocation using a few of them follows this list).

  • Each option corresponds to a specific plugin and allows enabling or disabling the plugin and specifying its data type (float16, float32, or bfloat16).

  • Some notable plugin options include:

    • --bert_attention_plugin: Configures the BERT attention plugin.

    • --gpt_attention_plugin: Configures the GPT attention plugin.

    • --gemm_plugin: Configures the GEMM (General Matrix Multiplication) plugin.

    • --nccl_plugin: Configures the NCCL (NVIDIA Collective Communications Library) plugin.

    • --lookup_plugin: Configures the lookup table plugin.

    • --lora_plugin: Configures the LoRA plugin.

    • --moe_plugin: Configures the MoE (Mixture of Experts) plugin.

    • --mamba_conv1d_plugin: Configures the conv1d plugin used by Mamba (state-space model) layers.

  • Other plugin options include enabling or disabling features such as context FMHA (Fused Multi-Head Attention), paged key-value cache, input padding removal, custom all-reduce, multi-block mode, XQA kernels (an optimised attention path for multi-query and grouped-query attention during generation), half-precision attention QK accumulation, paged context FMHA, FP8 context FMHA, context FMHA for generation, multiple profiles, paged state, and streaming LLM.
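
As a sketch of how plugin options are passed, each takes a data type or an enable/disable keyword as its value; the flag spellings below follow the list above but should be confirmed against trtllm-build --help for your installed version, and the paths are hypothetical:

trtllm-build --checkpoint_dir ./llama-2-7b-checkpoint \
             --output_dir ./llama-2-7b-engine \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --remove_input_padding enable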

The trtllm-build tool provides a wide range of options and configurations for building TensorRT engines for large language models.

It allows customisation of model parameters, parallelism settings, quantization, LoRA adaptation, plugin configurations, and various optimization techniques.

The help message serves as a comprehensive reference for users to understand and utilize the available options when building TensorRT engines for their specific LLM use cases.
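
Because the available flags vary between releases, the authoritative list for a given installation can be printed directly:

trtllm-build --help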
