Proposed build config file for Llama 3 8B

Here's the updated buildconfig.yaml for the Llama 3 8B model, with highlights of the changes and any remaining areas of concern:

# TensorRT-LLM Build Configuration File

# Model Configuration
model:
  model_dir: ./path/to/llama3b/model  # Path to the pretrained Llama 3 8B model directory
  output_dir: ./path/to/llama3b/output  # Path to save the built Llama 3 8B engine
  dtype: bfloat16  # Data type for the Llama 3 8B model (updated to bfloat16)

# Checkpoint Configuration
checkpoint:
  checkpoint_dir: ./path/to/llama3b/checkpoint  # Path to the Llama 3 8B TensorRT-LLM checkpoint directory
  tp_size: 1  # Tensor parallelism size, increase for multi-GPU tensor parallelism
  pp_size: 1  # Pipeline parallelism size, increase for multi-GPU pipeline parallelism
  vocab_size: 128256  # Vocabulary size of the Llama 3 8B model (updated to 128256)
  n_positions: 8192  # Maximum number of positions (sequence length) for Llama 3 8B (updated to 8192)
  n_layer: 32  # Number of layers in the Llama 3 8B model
  n_head: 32  # Number of attention heads in the Llama 3 8B model
  n_embd: 4096  # Hidden size of the Llama 3 8B model
  inter_size: 14336  # Intermediate size of the Llama 3 8B model's feed-forward layers (updated to 14336)
  #meta_ckpt_dir:  # Path to the meta checkpoint directory
  n_kv_head: 8  # Number of key-value heads for Llama 3 8B (updated to 8)
  rms_norm_eps: 1e-5  # Epsilon value for RMS normalization (updated to 1e-5)
  #use_weight_only: false  # Enable weight-only quantization
  #weight_only_precision: int8  # Precision for weight-only quantization (choices: int8, int4)
  #smoothquant: 0.5  # Smoothquant parameter for quantization
  #per_channel: false  # Enable per-channel quantization
  #per_token: false  # Enable per-token quantization
  #int8_kv_cache: false  # Enable int8 quantization for key-value cache
  #ammo_quant_ckpt_path:  # Path to the quantized checkpoint file in .npz format
  #per_group: false  # Enable per-group quantization for GPTQ/AWQ quantization
  #load_by_shard: false  # Load the pretrained model shard-by-shard
  hidden_act: silu  # Activation function used in the Llama 3 8B model
  rope_theta: 500000.0  # Rotary position embedding theta value for Llama 3 8B (updated from rotary_base)
  #group_size: 128  # Group size used in GPTQ quantization
  #dataset_cache_dir:  # Path to the dataset cache directory
  #load_model_on_cpu: false  # Load the model on CPU
  #use_parallel_embedding: false  # Enable embedding parallelism
  #embedding_sharding_dim: 0  # Dimension for embedding sharding (choices: 0, 1)
  #use_embedding_sharing: false  # Enable embedding sharing to reduce engine size
  #workers: 1  # Number of workers for parallel checkpoint conversion
  #moe_num_experts: 0  # Number of experts for Mixture of Experts (MoE) layers
  #moe_top_k: 0  # Top-k value for MoE layers (defaults to 1 if moe_num_experts is set)
  #moe_tp_mode: 0  # Parallelism mode for distributing MoE experts in tensor parallelism
  #moe_renorm_mode: 1  # Renormalization mode for MoE gate logits
  #save_config_only: false  # Only save the model configuration without building the engine
  #disable_weight_only_quant_plugin: false  # Disable the weight-only quantization plugin

# Build Configuration
build:
  max_input_len: 256  # Maximum input sequence length
  max_output_len: 256  # Maximum output sequence length
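  # Note: max_input_len + max_output_len (512 total here) stays well below n_positions (8192); raise these limits if longer prompts or generations are needed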
  max_batch_size: 8  # Maximum batch size
  max_beam_width: 1  # Maximum beam width for beam search
  #max_num_tokens:  # Maximum number of tokens to generate
  #opt_num_tokens:  # Optimal number of tokens to generate
  max_prompt_embedding_table_size: 0  # Maximum size of the prompt embedding table
  gather_context_logits: false  # Gather context logits during generation
  gather_generation_logits: false  # Gather generation logits during generation
  strongly_typed: false  # Enable strongly typed network definition
  #builder_opt:  # Builder optimization level
  profiling_verbosity: layer_names_only  # Profiling verbosity level (choices: layer_names_only, detailed, none)
  enable_debug_output: false  # Enable debug output
  max_draft_len: 0  # Maximum draft length for Medusa-style generation
  use_refit: false  # Enable engine refitting
  #input_timing_cache:  # Path to the input timing cache file
  #output_timing_cache:  # Path to save the output timing cache file
  lora_config:  # Configuration for LoRA (Low-Rank Adaptation)
    #lora_dir:  # Path to the LoRA checkpoint directory
    #lora_target_modules:  # Target modules for LoRA adaptation
    #lora_ckpt_source: hf  # Source of LoRA checkpoints (choices: hf, nemo)
    #max_lora_rank: 4  # Maximum rank for LoRA adaptation
  auto_parallel_config:  # Configuration for automatic parallelization
    #enabled: false  # Enable automatic parallelization
    #tp_size: 1  # Tensor parallelism size for automatic parallelization
    #pp_size: 1  # Pipeline parallelism size for automatic parallelization
    #max_memory_MB: 80000  # Maximum memory in MB for automatic parallelization
    #max_dram_memory_MB: 30000  # Maximum DRAM memory in MB for automatic parallelization
    #compile_max_memory_MB: 17000  # Maximum memory in MB for compilation during automatic parallelization
    #compile_max_dram_memory_MB: 8000  # Maximum DRAM memory in MB for compilation during automatic parallelization
    #debug_mode: false  # Enable debug mode for automatic parallelization
  weight_sparsity: false  # Enable weight sparsity
  plugin_config:  # Configuration for plugins
    #use_custom_all_reduce: false  # Use custom all-reduce plugin
    #use_fp8_all_reduce: false  # Use FP8 all-reduce plugin
    #use_fp8_cast_plugin: false  # Use FP8 cast plugin
    #use_async_malloc: false  # Use asynchronous memory allocation plugin
    #use_paged_context_fmha: false  # Use paged context fused multi-head attention plugin
    #use_fp8_context_fmha: false  # Use FP8 context fused multi-head attention plugin
    #lora_plugin:  # Configuration for LoRA plugin
      #type:  # Type of LoRA plugin
  max_encoder_input_len: 1024  # Maximum encoder input sequence length for encoder-decoder models
  use_fused_mlp: false  # Use fused MLP layers
  dry_run: false  # Perform a dry run without building the engine
  visualize_network: false  # Visualize the network graph
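
Note on usage: TensorRT-LLM's command-line tooling does not read a single YAML file like this one; it is a convenience file that a small driver script can translate into the usual two-step workflow of checkpoint conversion followed by an engine build. The Python sketch below (a hypothetical build_llama.py) shows one way that translation could look. The convert_checkpoint.py path assumes you are running from a TensorRT-LLM repository checkout, and exact flag names (for example --max_output_len) vary between TensorRT-LLM releases, so verify them against your installed version.

# build_llama.py -- hypothetical driver that maps buildconfig.yaml onto the
# standard TensorRT-LLM two-step workflow (convert checkpoint, then build engine).
import subprocess

import yaml  # pip install pyyaml

with open("buildconfig.yaml") as f:
    cfg = yaml.safe_load(f)

model, ckpt, build = cfg["model"], cfg["checkpoint"], cfg["build"]

# Step 1: convert the pretrained weights into a TensorRT-LLM checkpoint.
convert_cmd = [
    "python", "examples/llama/convert_checkpoint.py",  # path inside the TensorRT-LLM repo
    "--model_dir", model["model_dir"],
    "--output_dir", ckpt["checkpoint_dir"],
    "--dtype", model["dtype"],
    "--tp_size", str(ckpt["tp_size"]),
    "--pp_size", str(ckpt["pp_size"]),
]

# Step 2: build the serving engine from the converted checkpoint.
build_cmd = [
    "trtllm-build",
    "--checkpoint_dir", ckpt["checkpoint_dir"],
    "--output_dir", model["output_dir"],
    "--max_input_len", str(build["max_input_len"]),
    "--max_output_len", str(build["max_output_len"]),
    "--max_batch_size", str(build["max_batch_size"]),
    "--max_beam_width", str(build["max_beam_width"]),
]

for cmd in (convert_cmd, build_cmd):
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)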

Highlights:

  1. The model_dir and output_dir paths should be updated to point to the directories for the Llama 3 8B model on your system.

  2. The dtype has been updated to bfloat16 to match the precision the Llama 3 8B weights are distributed in.

  3. The vocab_size has been updated to 128256 to match the Llama 3 tokenizer's vocabulary size.

  4. The n_positions has been updated to 8192 to match Llama 3 8B's maximum context length.

  5. The inter_size has been updated to 14336 to match Llama 3 8B's feed-forward intermediate size.

  6. The n_kv_head has been updated to 8 to match Llama 3 8B's number of key-value heads (grouped-query attention); the arithmetic this implies is worked out in the sketch after this list.

  7. The rms_norm_eps has been updated to 1e-5 to match Llama 3 8B's RMS normalization epsilon.

  8. The rotary_base parameter has been replaced with rope_theta, set to 500000.0 to match Llama 3 8B's rotary position embedding base (also illustrated in the sketch after this list).

  9. The bos_token_id, eos_token_id, tie_word_embeddings, use_cache, and torch_dtype fields from the Hugging Face configuration are not represented here. Consider adding them if they are relevant to your build process.
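
To make highlights 6 and 8 concrete, the short sketch below works out what n_kv_head: 8 and rope_theta: 500000.0 imply, using only values already present in the checkpoint section above. It is purely illustrative and plays no part in the build itself.

# Illustrative arithmetic for the grouped-query attention (GQA) and rotary embedding settings.
n_head, n_kv_head, n_embd = 32, 8, 4096
rope_theta = 500000.0

head_dim = n_embd // n_head           # 128 dimensions per attention head
queries_per_kv = n_head // n_kv_head  # 4 query heads share each key/value head
print(f"head_dim={head_dim}, query heads per KV head={queries_per_kv}")

# With 8 KV heads instead of 32, the KV cache per token is roughly 4x smaller
# than it would be with full multi-head attention.

# Standard rotary-embedding inverse frequencies for Llama-style models:
# inv_freq[i] = 1 / rope_theta**(2*i / head_dim), for i = 0 .. head_dim/2 - 1.
inv_freq = [1.0 / rope_theta ** (2 * i / head_dim) for i in range(head_dim // 2)]
print(f"lowest rotary frequency: {inv_freq[-1]:.2e}")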

Please update the paths and directories to match your specific setup, and review and uncomment any additional options that your use case requires.
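
Since most of the highlighted values come straight from the model's Hugging Face config.json, a quick cross-check of that file against this YAML catches transcription mistakes before committing to a potentially long engine build. The sketch below assumes model_dir points at a Hugging Face checkout that contains config.json and that the file uses the usual Llama-style key names (max_position_embeddings, num_key_value_heads, and so on); it is an optional sanity check, not part of TensorRT-LLM itself.

# check_config.py -- hypothetical cross-check of buildconfig.yaml against the
# Hugging Face config.json shipped with the model weights.
import json
import os

import yaml  # pip install pyyaml

with open("buildconfig.yaml") as f:
    cfg = yaml.safe_load(f)
ckpt = cfg["checkpoint"]

with open(os.path.join(cfg["model"]["model_dir"], "config.json")) as f:
    hf = json.load(f)

# Mapping from this file's checkpoint keys to the usual Llama config.json keys.
key_map = {
    "vocab_size": "vocab_size",
    "n_positions": "max_position_embeddings",
    "n_layer": "num_hidden_layers",
    "n_head": "num_attention_heads",
    "n_embd": "hidden_size",
    "inter_size": "intermediate_size",
    "n_kv_head": "num_key_value_heads",
    "rms_norm_eps": "rms_norm_eps",
    "rope_theta": "rope_theta",
    "hidden_act": "hidden_act",
}

def same(a, b):
    """Compare values, tolerating numbers that YAML loads as strings (e.g. 1e-5)."""
    try:
        return float(a) == float(b)
    except (TypeError, ValueError):
        return a == b

for yaml_key, hf_key in key_map.items():
    ours, theirs = ckpt.get(yaml_key), hf.get(hf_key)
    status = "OK" if same(ours, theirs) else "MISMATCH"
    print(f"{status:8} {yaml_key}: yaml={ours}  config.json={theirs}")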
