Checkpoint List - Arguments
Here's a table summarising the arguments that you can pass to the
script, along with their default values:
Argument | Default Value | Description |
--model_dir | None | Path to the Hugging Face model directory |
--meta_ckpt_dir | None | Path to the meta checkpoint directory |
--tp_size | 1 | N-way tensor parallelism size |
--pp_size | 1 | N-way pipeline parallelism size |
--dtype | 'float16' | Data type ('float32', 'bfloat16', 'float16') |
--vocab_size | 32000 | Vocabulary size |
--n_positions | 2048 | Number of positions |
--n_layer | 32 | Number of layers |
--n_head | 32 | Number of attention heads |
--n_kv_head | None | Number of key-value heads (defaults to n_head if not specified) |
--n_embd | 4096 | Hidden size |
--inter_size | 11008 | Intermediate size |
--rms_norm_eps | 1e-06 | RMS normalization epsilon |
--use_weight_only | False | Quantize weights for the various GEMMs to INT4/INT8 |
--disable_weight_only_quant_plugin | False | Use the out-of-the-box (OOTB) implementation instead of the plugin for weight-only quantization |
--weight_only_precision | 'int8' | Precision for weight-only quantization ('int8', 'int4', 'int4_gptq') |
--smoothquant | None | Set the α parameter for SmoothQuant quantization (float value) |
--per_channel | False | Use per-channel static scaling factor for GEMM's result |
--per_token | False | Use per-token dynamic scaling factor for activations |
--int8_kv_cache | False | Use INT8 quantization for KV cache |
--ammo_quant_ckpt_path | None | Path to a quantized model checkpoint in .npz format |
--per_group | False | Use per-group dynamic scaling factor for weights in INT4 range (for GPTQ/AWQ quantization) |
--load_by_shard | False | Load a pretrained model shard-by-shard |
--hidden_act | 'silu' | Hidden activation function |
--rotary_base | 10000 | Rotary base value |
--group_size | 128 | Group size used in GPTQ quantization |
--dataset-cache-dir | None | Cache directory to load the Hugging Face dataset |
--load_model_on_cpu | False | Load the model on CPU |
--use_parallel_embedding | False | Enable embedding parallelism |
--embedding_sharding_dim | 0 | Dimension for sharding the embedding lookup table (0: vocab dimension, 1: hidden dimension) |
--use_embedding_sharing | False | Try to reduce the engine size by sharing the embedding lookup table between two layers |
--output_dir | 'tllm_checkpoint' | Path to save the TensorRT-LLM checkpoint |
--workers | 1 | Number of workers for converting checkpoint in parallel |
--moe_num_experts | 0 | Number of experts to use for MOE layers |
--moe_top_k | 0 | Top_k value to use for MOE layers (defaults to 1 if --moe_num_experts is set) |
--moe_tp_mode | MoeConfig.ParallelismMode.TENSOR_PARALLEL | Controls how to distribute experts in TP (check layers/ for accepted values) |
--moe_renorm_mode | MoeConfig.ExpertScaleNormalizationMode.RENORM | Controls renormalization after gate logits (check layers/ for accepted values) |
--save_config_only | False | Only save the model config without reading and converting weights (for debugging) |
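To make the table more concrete, here is a minimal sketch of how a handful of these options could be declared with Python's `argparse`, including the two fallback rules the table mentions (`--n_kv_head` falling back to `--n_head`, and `--moe_top_k` falling back to 1 when `--moe_num_experts` is set). The function names `build_parser` and `resolve_defaults` are hypothetical, and this is an illustration of the documented defaults rather than the script's actual source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a small subset of the options from the table (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str, default=None)
    parser.add_argument("--tp_size", type=int, default=1)
    parser.add_argument("--pp_size", type=int, default=1)
    parser.add_argument("--dtype", type=str, default="float16",
                        choices=["float32", "bfloat16", "float16"])
    parser.add_argument("--n_head", type=int, default=32)
    parser.add_argument("--n_kv_head", type=int, default=None)
    # Assumption: boolean options are bare switches (no value on the command line).
    parser.add_argument("--use_weight_only", action="store_true", default=False)
    parser.add_argument("--weight_only_precision", type=str, default="int8",
                        choices=["int8", "int4", "int4_gptq"])
    parser.add_argument("--moe_num_experts", type=int, default=0)
    parser.add_argument("--moe_top_k", type=int, default=0)
    parser.add_argument("--output_dir", type=str, default="tllm_checkpoint")
    return parser


def resolve_defaults(args: argparse.Namespace) -> argparse.Namespace:
    """Apply the two fallback rules described in the table."""
    if args.n_kv_head is None:
        # --n_kv_head defaults to --n_head when not specified.
        args.n_kv_head = args.n_head
    if args.moe_num_experts and args.moe_top_k == 0:
        # --moe_top_k defaults to 1 once --moe_num_experts is set.
        args.moe_top_k = 1
    return args


args = resolve_defaults(build_parser().parse_args(
    ["--n_head", "40", "--moe_num_experts", "8", "--use_weight_only"]))
print(args.n_kv_head, args.moe_top_k, args.dtype, args.use_weight_only)
# prints: 40 1 float16 True
```

Anything not given on the command line keeps the default listed in the table, which is why `dtype` is still `float16` in the output above.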
These arguments let you customise the behavior of the
script. Pass the values you need on the command line when running it; any argument you omit keeps the default shown in the table.
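As a rough illustration of passing these values, the sketch below assembles a command line from a dictionary of options. The helper `build_command`, the script name `your_script.py`, and the paths are hypothetical placeholders, and the sketch assumes that boolean options are bare switches that are simply left out when false.

```python
import shlex
import sys


def build_command(script: str, options: dict) -> list:
    """Turn an options dict into an argv list.

    Assumes boolean options are bare switches (present when True, omitted
    when False) and that None means "keep the default".
    """
    cmd = [sys.executable, script]
    for name, value in options.items():
        if value is None or value is False:
            continue  # omitted options keep their defaults
        cmd.append(f"--{name}")
        if value is not True:
            cmd.append(str(value))
    return cmd


cmd = build_command(
    "your_script.py",  # hypothetical placeholder for the script's real name
    {
        "model_dir": "./my_hf_model",        # hypothetical path
        "output_dir": "./tllm_checkpoint",
        "dtype": "float16",
        "tp_size": 2,
        "use_weight_only": True,
        "weight_only_precision": "int4",
        "int8_kv_cache": False,              # False, so the switch is omitted
    },
)
print(shlex.join(cmd[1:]))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the conversion with exactly these settings.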