Here's the updated buildconfig.yaml for the LLaMA-3B model, with the changed settings and remaining areas of concern highlighted below:
```yaml
# TensorRT-LLM Build Configuration File

# Model Configuration
model:
  model_dir: ./path/to/llama3b/model    # Path to the pretrained LLaMA-3B model directory
  output_dir: ./path/to/llama3b/output  # Path to save the built LLaMA-3B engine
  dtype: bfloat16                       # Data type for the LLaMA-3B model (updated to bfloat16)

# Checkpoint Configuration
checkpoint:
  checkpoint_dir: ./path/to/llama3b/checkpoint  # Path to the LLaMA-3B TensorRT-LLM checkpoint directory
  tp_size: 1            # Tensor parallelism size, increase for multi-GPU tensor parallelism
  pp_size: 1            # Pipeline parallelism size, increase for multi-GPU pipeline parallelism
  vocab_size: 128256    # Vocabulary size of the LLaMA-3B model (updated to 128256)
  n_positions: 8192     # Maximum number of positions (sequence length) for LLaMA-3B (updated to 8192)
  n_layer: 32           # Number of layers in the LLaMA-3B model
  n_head: 32            # Number of attention heads in the LLaMA-3B model
  n_embd: 4096          # Hidden size of the LLaMA-3B model
  inter_size: 14336     # Intermediate size of the LLaMA-3B model's feed-forward layers (updated to 14336)
  #meta_ckpt_dir:       # Path to the meta checkpoint directory
  n_kv_head: 8          # Number of key-value heads for LLaMA-3B (updated to 8)
  rms_norm_eps: 1e-5    # Epsilon value for RMS normalization (updated to 1e-5)
  #use_weight_only: false           # Enable weight-only quantization
  #weight_only_precision: int8      # Precision for weight-only quantization (choices: int8, int4)
  #smoothquant: 0.5                 # Smoothquant parameter for quantization
  #per_channel: false               # Enable per-channel quantization
  #per_token: false                 # Enable per-token quantization
  #int8_kv_cache: false             # Enable int8 quantization for key-value cache
  #ammo_quant_ckpt_path:            # Path to the quantized checkpoint file in .npz format
  #per_group: false                 # Enable per-group quantization for GPTQ/AWQ quantization
  #load_by_shard: false             # Load the pretrained model shard-by-shard
  hidden_act: silu                  # Activation function used in the LLaMA-3B model
  rope_theta: 500000.0              # Rotary position embedding theta value for LLaMA-3B (updated from rotary_base)
  #group_size: 128                  # Group size used in GPTQ quantization
  #dataset_cache_dir:               # Path to the dataset cache directory
  #load_model_on_cpu: false         # Load the model on CPU
  #use_parallel_embedding: false    # Enable embedding parallelism
  #embedding_sharding_dim: 0        # Dimension for embedding sharding (choices: 0, 1)
  #use_embedding_sharing: false     # Enable embedding sharing to reduce engine size
  #workers: 1                       # Number of workers for parallel checkpoint conversion
  #moe_num_experts: 0               # Number of experts for Mixture of Experts (MoE) layers
  #moe_top_k: 0                     # Top-k value for MoE layers (defaults to 1 if moe_num_experts is set)
  #moe_tp_mode: 0                   # Parallelism mode for distributing MoE experts in tensor parallelism
  #moe_renorm_mode: 1               # Renormalization mode for MoE gate logits
  #save_config_only: false          # Only save the model configuration without building the engine
  #disable_weight_only_quant_plugin: false  # Disable the weight-only quantization plugin

# Build Configuration
build:
  max_input_len: 256      # Maximum input sequence length
  max_output_len: 256     # Maximum output sequence length
  max_batch_size: 8       # Maximum batch size
  max_beam_width: 1       # Maximum beam width for beam search
  #max_num_tokens:        # Maximum number of tokens to generate
  #opt_num_tokens:        # Optimal number of tokens to generate
  max_prompt_embedding_table_size: 0  # Maximum size of the prompt embedding table
  gather_context_logits: false        # Gather context logits during generation
  gather_generation_logits: false     # Gather generation logits during generation
  strongly_typed: false               # Enable strongly typed network definition
  #builder_opt:                       # Builder optimization level
  profiling_verbosity: layer_names_only  # Profiling verbosity level (choices: layer_names_only, detailed, none)
  enable_debug_output: false          # Enable debug output
  max_draft_len: 0                    # Maximum draft length for Medusa-style generation
  use_refit: false                    # Enable engine refitting
  #input_timing_cache:                # Path to the input timing cache file
  #output_timing_cache:               # Path to save the output timing cache file
  lora_config:                        # Configuration for LoRA (Low-Rank Adaptation)
    #lora_dir:                        # Path to the LoRA checkpoint directory
    #lora_target_modules:             # Target modules for LoRA adaptation
    #lora_ckpt_source: hf             # Source of LoRA checkpoints (choices: hf, nemo)
    #max_lora_rank: 4                 # Maximum rank for LoRA adaptation
  auto_parallel_config:               # Configuration for automatic parallelization
    #enabled: false                   # Enable automatic parallelization
    #tp_size: 1                       # Tensor parallelism size for automatic parallelization
    #pp_size: 1                       # Pipeline parallelism size for automatic parallelization
    #max_memory_MB: 80000             # Maximum memory in MB for automatic parallelization
    #max_dram_memory_MB: 30000        # Maximum DRAM memory in MB for automatic parallelization
    #compile_max_memory_MB: 17000     # Maximum memory in MB for compilation during automatic parallelization
    #compile_max_dram_memory_MB: 8000 # Maximum DRAM memory in MB for compilation during automatic parallelization
    #debug_mode: false                # Enable debug mode for automatic parallelization
  weight_sparsity: false              # Enable weight sparsity
  plugin_config:                      # Configuration for plugins
    #use_custom_all_reduce: false     # Use custom all-reduce plugin
    #use_fp8_all_reduce: false        # Use FP8 all-reduce plugin
    #use_fp8_cast_plugin: false       # Use FP8 cast plugin
    #use_async_malloc: false          # Use asynchronous memory allocation plugin
    #use_paged_context_fmha: false    # Use paged context fused multi-head attention plugin
    #use_fp8_context_fmha: false      # Use FP8 context fused multi-head attention plugin
    #lora_plugin:                     # Configuration for LoRA plugin
    #type:                            # Type of LoRA plugin
  max_encoder_input_len: 1024  # Maximum encoder input sequence length for encoder-decoder models
  use_fused_mlp: false         # Use fused MLP layers
  dry_run: false               # Perform a dry run without building the engine
  visualize_network: false     # Visualize the network graph
```
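Before running the conversion and build steps, it can help to sanity-check that the architecture fields in the `checkpoint` section are mutually consistent (hidden size divisible by the head count, query heads divisible by KV heads, sequence budget within the position limit). The sketch below is illustrative only: the file name `buildconfig.yaml` and the specific checks are assumptions, not part of any official TensorRT-LLM tooling.

```python
# check_build_config.py -- illustrative sanity checks for the YAML above.
# Requires PyYAML; the checks are plain arithmetic on the config fields.
import yaml

with open("buildconfig.yaml") as f:
    cfg = yaml.safe_load(f)

ckpt = cfg["checkpoint"]
build = cfg["build"]

# Hidden size must split evenly across attention heads.
assert ckpt["n_embd"] % ckpt["n_head"] == 0, "n_embd must be divisible by n_head"

# Grouped-query attention: query heads are shared evenly across KV heads.
assert ckpt["n_head"] % ckpt["n_kv_head"] == 0, "n_head must be divisible by n_kv_head"

# The engine cannot serve sequences longer than the model's position limit.
assert build["max_input_len"] + build["max_output_len"] <= ckpt["n_positions"], \
    "max_input_len + max_output_len exceeds n_positions"

head_dim = ckpt["n_embd"] // ckpt["n_head"]
print(f"head_dim={head_dim}, query heads per KV head={ckpt['n_head'] // ckpt['n_kv_head']}")
```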
Highlights:
- The `model_dir` and `output_dir` paths should be updated to point to the actual directories for your LLaMA-3B model and engine output.
- `dtype` has been updated to `bfloat16` to match the LLaMA-3B model's data type.
- `vocab_size` has been updated to 128256 to match the LLaMA-3B model's vocabulary size.
- `n_positions` has been updated to 8192 to match the LLaMA-3B model's maximum sequence length.
- `inter_size` has been updated to 14336 to match the LLaMA-3B model's feed-forward intermediate size.
- `n_kv_head` has been updated to 8 to match the LLaMA-3B model's number of key-value heads (grouped-query attention).
- `rms_norm_eps` has been updated to 1e-5 to match the LLaMA-3B model's RMS normalization epsilon.
- The `rotary_base` parameter has been replaced with `rope_theta`, set to 500000.0 to match the LLaMA-3B model's rotary position embedding base; how these last three hyperparameters are used is illustrated in the sketch after this list.
- The `bos_token_id`, `eos_token_id`, `tie_word_embeddings`, `use_cache`, and `torch_dtype` settings are missing. Consider adding them to match the LLaMA-3B model's configuration if they are relevant for your build process (a sketch for copying them from the Hugging Face `config.json` appears at the end of this section).
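To make the hyperparameter items concrete, the sketch below shows where `n_kv_head`, `rms_norm_eps`, and `rope_theta` enter the computation: grouped-query attention shares each KV head across several query heads, RMSNorm adds the epsilon inside the root-mean-square, and `rope_theta` is the base of the rotary-embedding frequency schedule. This is a plain NumPy illustration of the standard LLaMA-style formulas, not TensorRT-LLM internals.

```python
# Illustration only: how the three hyperparameters are used in the standard
# LLaMA-style formulas, written against NumPy rather than TensorRT-LLM.
import numpy as np

n_head, n_kv_head, n_embd = 32, 8, 4096
head_dim = n_embd // n_head                # 128
queries_per_kv = n_head // n_kv_head       # 4 query heads share one KV head (GQA)

# RMSNorm: x / sqrt(mean(x^2) + eps) * weight, with eps = rms_norm_eps.
def rms_norm(x, weight, eps=1e-5):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

# Rotary embeddings: per-dimension inverse frequencies derived from rope_theta.
def rope_inv_freq(dim=head_dim, theta=500000.0):
    return 1.0 / (theta ** (np.arange(0, dim, 2) / dim))

x = np.random.randn(4, n_embd).astype(np.float32)
print(rms_norm(x, np.ones(n_embd)).shape)  # (4, 4096)
print(rope_inv_freq()[:3])                 # first few (highest) rotary frequencies
```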
Please update the paths and directories to match your setup, and review and uncomment any of the optional settings that your use case requires.
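For the missing token and embedding settings noted above, one low-risk option is to read them straight from the model's Hugging Face `config.json` rather than typing them in by hand. The sketch below is a minimal example under a few assumptions: the `config.json` sits in the placeholder model directory from the YAML, the keys listed are the standard Hugging Face LLaMA fields, and whether your conversion flow actually consumes them from this file is up to you to verify.

```python
# Sketch: copy token/embedding settings from the HF config.json into the
# checkpoint section of the YAML above. Paths and the assumption that your
# conversion flow reads these keys are illustrative, not guaranteed.
import json
import yaml

with open("./path/to/llama3b/model/config.json") as f:
    hf_cfg = json.load(f)

with open("buildconfig.yaml") as f:
    build_cfg = yaml.safe_load(f)

for key in ("bos_token_id", "eos_token_id", "tie_word_embeddings", "use_cache", "torch_dtype"):
    if key in hf_cfg:
        build_cfg["checkpoint"][key] = hf_cfg[key]

with open("buildconfig.yaml", "w") as f:
    yaml.safe_dump(build_cfg, f, sort_keys=False)
```

Note that round-tripping the file through PyYAML drops the inline comments, so you may prefer to merge these values by hand or with a comment-preserving loader such as ruamel.yaml.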