trtllm build configuration file

To create a script that reads the buildconfig.yaml file and turns its entries into arguments for the trtllm-build command line, you can use Python's argparse and yaml (PyYAML) modules. Here's an example script that accomplishes this:

import argparse
import subprocess
import yaml

def parse_buildconfig(config_file):
    with open(config_file, 'r') as f:
        config = yaml.safe_load(f)

    args = []

    # Model Configuration
    if 'model' in config:
        model_config = config['model']
        if 'model_dir' in model_config:
            args.extend(['--checkpoint_dir', model_config['model_dir']])
        if 'output_dir' in model_config:
            args.extend(['--output_dir', model_config['output_dir']])
        if 'dtype' in model_config:
            args.extend(['--logits_dtype', model_config['dtype']])

    # Checkpoint Configuration
    if 'checkpoint' in config:
        checkpoint_config = config['checkpoint']
        if 'checkpoint_dir' in checkpoint_config:
            args.extend(['--checkpoint_dir', checkpoint_config['checkpoint_dir']])
        if 'tp_size' in checkpoint_config:
            args.extend(['--tp_size', str(checkpoint_config['tp_size'])])
        if 'pp_size' in checkpoint_config:
            args.extend(['--pp_size', str(checkpoint_config['pp_size'])])
        if 'vocab_size' in checkpoint_config:
            args.extend(['--vocab_size', str(checkpoint_config['vocab_size'])])
        if 'n_positions' in checkpoint_config:
            args.extend(['--n_positions', str(checkpoint_config['n_positions'])])
        if 'n_layer' in checkpoint_config:
            args.extend(['--n_layer', str(checkpoint_config['n_layer'])])
        if 'n_head' in checkpoint_config:
            args.extend(['--n_head', str(checkpoint_config['n_head'])])
        if 'n_embd' in checkpoint_config:
            args.extend(['--n_embd', str(checkpoint_config['n_embd'])])
        if 'inter_size' in checkpoint_config:
            args.extend(['--inter_size', str(checkpoint_config['inter_size'])])
        if 'n_kv_head' in checkpoint_config:
            args.extend(['--n_kv_head', str(checkpoint_config['n_kv_head'])])
        if 'rms_norm_eps' in checkpoint_config:
            args.extend(['--rms_norm_eps', str(checkpoint_config['rms_norm_eps'])])
        if 'bos_token_id' in checkpoint_config:
            args.extend(['--bos_token_id', str(checkpoint_config['bos_token_id'])])
        if 'eos_token_id' in checkpoint_config:
            args.extend(['--eos_token_id', str(checkpoint_config['eos_token_id'])])
        if 'tie_word_embeddings' in checkpoint_config:
            args.extend(['--tie_word_embeddings', str(checkpoint_config['tie_word_embeddings'])])
        if 'use_cache' in checkpoint_config:
            args.extend(['--use_cache', str(checkpoint_config['use_cache'])])
        if 'torch_dtype' in checkpoint_config:
            args.extend(['--torch_dtype', checkpoint_config['torch_dtype']])
        if 'hidden_act' in checkpoint_config:
            args.extend(['--hidden_act', checkpoint_config['hidden_act']])

    # Build Configuration
    if 'build' in config:
        build_config = config['build']
        if 'max_input_len' in build_config:
            args.extend(['--max_input_len', str(build_config['max_input_len'])])
        if 'max_output_len' in build_config:
            args.extend(['--max_output_len', str(build_config['max_output_len'])])
        if 'max_batch_size' in build_config:
            args.extend(['--max_batch_size', str(build_config['max_batch_size'])])
        if 'max_beam_width' in build_config:
            args.extend(['--max_beam_width', str(build_config['max_beam_width'])])
        if 'max_prompt_embedding_table_size' in build_config:
            args.extend(['--max_prompt_embedding_table_size', str(build_config['max_prompt_embedding_table_size'])])
        if 'profiling_verbosity' in build_config:
            args.extend(['--profiling_verbosity', build_config['profiling_verbosity']])
        if 'max_draft_len' in build_config:
            args.extend(['--max_draft_len', str(build_config['max_draft_len'])])
        if 'max_encoder_input_len' in build_config:
            args.extend(['--max_encoder_input_len', str(build_config['max_encoder_input_len'])])

        # Boolean options: trtllm-build treats these as on/off switches, so only
        # the bare flag is appended when the value is true.
        bool_flags = [
            'gather_context_logits', 'gather_generation_logits', 'strongly_typed',
            'enable_debug_output', 'use_refit', 'weight_sparsity',
            'use_fused_mlp', 'dry_run', 'visualize_network',
        ]
        for flag in bool_flags:
            if build_config.get(flag):
                args.append(f'--{flag}')

    return args

def main():
    parser = argparse.ArgumentParser(description='Parse buildconfig.yaml and run trtllm-build')
    parser.add_argument('--config', type=str, required=True, help='Path to the buildconfig.yaml file')
    args = parser.parse_args()

    buildconfig_args = parse_buildconfig(args.config)
    command = ['trtllm-build'] + buildconfig_args

    subprocess.run(command, check=True)

if __name__ == '__main__':
    main()

Below is the buildconfig.yaml file that the script consumes:

# TensorRT-LLM Build Configuration File

model:
  model_dir: ./llama-2-7b-chat-hf  # Path to the pretrained model directory
  output_dir: ./llama-2-7b-chat-engine  # Path to save the built engine
  dtype: float16  # Data type for the model (choices: float32, float16, bfloat16)

checkpoint:
  checkpoint_dir: ../llama-2-7b-chat-hf-output  # Path to the TensorRT-LLM checkpoint directory
  tp_size: 1  # Tensor parallelism size, increase for multi-GPU tensor parallelism
  pp_size: 1  # Pipeline parallelism size, increase for multi-GPU pipeline parallelism
  vocab_size: 32000  # Vocabulary size of the model
  n_positions: 2048  # Maximum number of positions (sequence length)
  n_layer: 32  # Number of layers in the model
  n_head: 32  # Number of attention heads
  n_embd: 4096  # Hidden size of the model
  inter_size: 11008  # Intermediate size of the model's feed-forward layers
  #meta_ckpt_dir:  # Path to the meta checkpoint directory
  #n_kv_head:  # Number of key-value heads (defaults to n_head if not specified)
  #rms_norm_eps: 1e-6  # Epsilon value for RMS normalization
  #use_weight_only: false  # Enable weight-only quantization
  #weight_only_precision: int8  # Precision for weight-only quantization (choices: int8, int4)
  #smoothquant: 0.5  # Smoothquant parameter for quantization
  #per_channel: false  # Enable per-channel quantization
  #per_token: false  # Enable per-token quantization
  #int8_kv_cache: false  # Enable int8 quantization for key-value cache
  #ammo_quant_ckpt_path:  # Path to the quantized checkpoint file in .npz format
  #per_group: false  # Enable per-group quantization for GPTQ/AWQ quantization
  #load_by_shard: false  # Load the pretrained model shard-by-shard
  #hidden_act: silu  # Activation function used in the model (default: silu)
  #rotary_base: 10000.0  # Base value for rotary positional embeddings
  #group_size: 128  # Group size used in GPTQ quantization
  #dataset_cache_dir:  # Path to the dataset cache directory
  #load_model_on_cpu: false  # Load the model on CPU
  #use_parallel_embedding: false  # Enable embedding parallelism
  #embedding_sharding_dim: 0  # Dimension for embedding sharding (choices: 0, 1)
  #use_embedding_sharing: false  # Enable embedding sharing to reduce engine size
  #workers: 1  # Number of workers for parallel checkpoint conversion
  #moe_num_experts: 0  # Number of experts for Mixture of Experts (MoE) layers
  #moe_top_k: 0  # Top-k value for MoE layers (defaults to 1 if moe_num_experts is set)
  #moe_tp_mode: 0  # Parallelism mode for distributing MoE experts in tensor parallelism
  #moe_renorm_mode: 1  # Renormalization mode for MoE gate logits
  #save_config_only: false  # Only save the model configuration without building the engine
  #disable_weight_only_quant_plugin: false  # Disable the weight-only quantization plugin

build:
  max_input_len: 256  # Maximum input sequence length
  max_output_len: 256  # Maximum output sequence length
  max_batch_size: 8  # Maximum batch size
  max_beam_width: 1  # Maximum beam width for beam search
  #max_num_tokens:  # Maximum number of tokens to generate
  #opt_num_tokens:  # Optimal number of tokens to generate
  max_prompt_embedding_table_size: 0  # Maximum size of the prompt embedding table
  gather_context_logits: false  # Gather context logits during generation
  gather_generation_logits: false  # Gather generation logits during generation
  strongly_typed: false  # Enable strongly typed network definition
  #builder_opt:  # Builder optimization level
  profiling_verbosity: layer_names_only  # Profiling verbosity level (choices: layer_names_only, detailed, none)
  enable_debug_output: false  # Enable debug output
  max_draft_len: 0  # Maximum draft length for Medusa-style generation
  use_refit: false  # Enable engine refitting
  #input_timing_cache:  # Path to the input timing cache file
  #output_timing_cache:  # Path to save the output timing cache file
  lora_config:  # Configuration for LoRA (Low-Rank Adaptation)
    #lora_dir:  # Path to the LoRA checkpoint directory
    #lora_target_modules:  # Target modules for LoRA adaptation
    #lora_ckpt_source: hf  # Source of LoRA checkpoints (choices: hf, nemo)
    #max_lora_rank: 4  # Maximum rank for LoRA adaptation
  auto_parallel_config:  # Configuration for automatic parallelization
    #enabled: false  # Enable automatic parallelization
    #tp_size: 1  # Tensor parallelism size for automatic parallelization
    #pp_size: 1  # Pipeline parallelism size for automatic parallelization
    #max_memory_MB: 80000  # Maximum memory in MB for automatic parallelization
    #max_dram_memory_MB: 30000  # Maximum DRAM memory in MB for automatic parallelization
    #compile_max_memory_MB: 17000  # Maximum memory in MB for compilation during automatic parallelization
    #compile_max_dram_memory_MB: 8000  # Maximum DRAM memory in MB for compilation during automatic parallelization
    #debug_mode: false  # Enable debug mode for automatic parallelization
  weight_sparsity: false  # Enable weight sparsity
  plugin_config:  # Configuration for plugins
    #use_custom_all_reduce: false  # Use custom all-reduce plugin
    #use_fp8_all_reduce: false  # Use FP8 all-reduce plugin
    #use_fp8_cast_plugin: false  # Use FP8 cast plugin
    #use_async_malloc: false  # Use asynchronous memory allocation plugin
    #use_paged_context_fmha: false  # Use paged context fused multi-head attention plugin
    #use_fp8_context_fmha: false  # Use FP8 context fused multi-head attention plugin
    #lora_plugin:  # Configuration for LoRA plugin
      #type:  # Type of LoRA plugin
  max_encoder_input_len: 1024  # Maximum encoder input sequence length for encoder-decoder models
  use_fused_mlp: false  # Use fused MLP layers
  dry_run: false  # Perform a dry run without building the engine
  visualize_network: false  # Visualize the network graph

The script above covers all of the relevant settings from the buildconfig.yaml file.

It parses the model, checkpoint, and build configurations and constructs the corresponding command-line arguments for the trtllm-build command.
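
Before launching a build, it can help to print the exact command the script would run. A minimal sketch, assuming the script above is saved as build_trtllm.py alongside buildconfig.yaml:

import shlex

from build_trtllm import parse_buildconfig  # the script above, saved as build_trtllm.py

# Build the argument list from the YAML file and print the full command
# instead of executing it, so the flag mapping can be checked by eye.
cmd = ['trtllm-build'] + parse_buildconfig('buildconfig.yaml')
print(shlex.join(cmd))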

Then run the script against the configuration file:

python3 build_trtllm.py --config buildconfig.yaml

This configuration file is divided into three main sections:

Model Configuration

Specifies the paths to the pretrained model directory and the output directory where the built engine will be saved. It also allows you to set the data type for the model (float32, float16, or bfloat16).

Checkpoint Configuration

Defines the settings related to the TensorRT-LLM checkpoint, such as the checkpoint directory, tensor parallelism size, pipeline parallelism size, and various model-specific parameters like vocabulary size, number of layers, attention heads, hidden size, etc.

Many of these settings are optional and can be uncommented and adjusted based on the specific model requirements.
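
If you are unsure where the architecture values in the checkpoint section come from, they can be read straight from the config.json shipped with the Hugging Face model. The snippet below is a minimal sketch that cross-checks them, assuming the standard Hugging Face LlamaConfig field names and the paths used in the YAML above; the key mapping is illustrative and should be adjusted for other model families.

import json
import yaml

# Illustrative mapping from buildconfig.yaml keys to Hugging Face LlamaConfig fields
HF_KEY_MAP = {
    'n_layer': 'num_hidden_layers',
    'n_head': 'num_attention_heads',
    'n_embd': 'hidden_size',
    'inter_size': 'intermediate_size',
    'vocab_size': 'vocab_size',
}

with open('buildconfig.yaml') as f:
    ckpt = yaml.safe_load(f)['checkpoint']
with open('./llama-2-7b-chat-hf/config.json') as f:
    hf_cfg = json.load(f)

# Report any values in the checkpoint section that disagree with the model's own config
for yaml_key, hf_key in HF_KEY_MAP.items():
    if yaml_key in ckpt and hf_key in hf_cfg and ckpt[yaml_key] != hf_cfg[hf_key]:
        print(f"Mismatch: {yaml_key}={ckpt[yaml_key]} vs {hf_key}={hf_cfg[hf_key]}")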

Build Configuration

Contains the parameters for building the TensorRT engine, including maximum input and output sequence lengths, maximum batch size, beam width, prompt embedding table size, and various optimization and debugging options.

It also allows you to configure LoRA (Low-Rank Adaptation), automatic parallelization, weight sparsity, and plugin-specific settings.
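
Note that the parse_buildconfig function shown earlier only walks flat keys, so the nested lora_config, auto_parallel_config, and plugin_config blocks are not translated into arguments. The sketch below shows one way to flatten them, assuming each nested key maps one-to-one onto a trtllm-build option of the same name (for example --lora_dir or --max_lora_rank); check trtllm-build --help for the exact options supported by your version.

def parse_nested_sections(build_config):
    """Flatten nested build sub-sections into extra CLI arguments.

    Assumes each nested key corresponds to a trtllm-build flag with the
    same name; verify against `trtllm-build --help` before relying on it.
    """
    args = []
    for section in ('lora_config', 'auto_parallel_config', 'plugin_config'):
        nested = build_config.get(section) or {}  # sections may be absent or empty
        for key, value in nested.items():
            if isinstance(value, bool):
                if value:                  # boolean switches become bare flags
                    args.append(f'--{key}')
            elif isinstance(value, list):  # e.g. lora_target_modules
                args.extend([f'--{key}', *map(str, value)])
            elif value is not None:
                args.extend([f'--{key}', str(value)])
    return args

The returned list can simply be appended to the arguments produced by parse_buildconfig before the subprocess.run call.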
