Checkpoint List - Arguments
Here's a table summarising the arguments that you can pass to the convert_checkpoint.py
script, along with their default values:
Argument | Default Value | Description
---|---|---
--model_dir | None | Path to the Hugging Face model directory |
--meta_ckpt_dir | None | Path to the meta checkpoint directory |
--tp_size | 1 | N-way tensor parallelism size |
--pp_size | 1 | N-way pipeline parallelism size |
--dtype | 'float16' | Data type ('float32', 'bfloat16', 'float16') |
--vocab_size | 32000 | Vocabulary size |
--n_positions | 2048 | Number of positions |
--n_layer | 32 | Number of layers |
--n_head | 32 | Number of attention heads |
--n_kv_head | None | Number of key-value heads (defaults to n_head if not specified) |
--n_embd | 4096 | Hidden size |
--inter_size | 11008 | Intermediate size |
--rms_norm_eps | 1e-06 | RMS normalization epsilon |
--use_weight_only | False | Quantize weights for the various GEMMs to INT4/INT8 |
--disable_weight_only_quant_plugin | False | Use OOTB implementation instead of plugin for weight quantization |
--weight_only_precision | 'int8' | Precision for weight-only quantization ('int8', 'int4', 'int4_gptq') |
--smoothquant | None | Set the α parameter for SmoothQuant quantization (float value)
--per_channel | False | Use a per-channel static scaling factor for GEMM results
--per_token | False | Use a per-token dynamic scaling factor for activations
--int8_kv_cache | False | Use INT8 quantization for KV cache |
--ammo_quant_ckpt_path | None | Path to a quantized model checkpoint in .npz format |
--per_group | False | Use per-group dynamic scaling factor for weights in INT4 range (for GPTQ/AWQ quantization) |
--load_by_shard | False | Load a pretrained model shard-by-shard |
--hidden_act | 'silu' | Hidden activation function |
--rotary_base | 10000 | Rotary base value
--group_size | 128 | Group size used in GPTQ quantization |
--dataset-cache-dir | None | Cache directory to load the Hugging Face dataset |
--load_model_on_cpu | False | Load the model on CPU |
--use_parallel_embedding | False | Enable embedding parallelism |
--embedding_sharding_dim | 0 | Dimension for sharding the embedding lookup table (0: vocab dimension, 1: hidden dimension) |
--use_embedding_sharing | False | Try to reduce the engine size by sharing the embedding lookup table between two layers |
--output_dir | 'tllm_checkpoint' | Path to save the TensorRT-LLM checkpoint |
--workers | 1 | Number of workers for converting checkpoint in parallel |
--moe_num_experts | 0 | Number of experts to use for MOE layers |
--moe_top_k | 0 | Top_k value to use for MOE layers (defaults to 1 if --moe_num_experts is set) |
--moe_tp_mode | MoeConfig.ParallelismMode.TENSOR_PARALLEL | Controls how to distribute experts in TP (check layers/moe.py for accepted values) |
--moe_renorm_mode | MoeConfig.ExpertScaleNormalizationMode.RENORM | Controls renormalization after gate logits (check layers/moe.py for accepted values) |
--save_config_only | False | Only save the model config without reading and converting weights (for debugging) |
These arguments allow you to customise the behaviour of the convert_checkpoint.py
script to suit your requirements; provide the desired values on the command line when running the script.
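For reference, a hypothetical invocation might look like the sketch below. The model and output directories are placeholders, and the location of convert_checkpoint.py depends on your TensorRT-LLM installation; the flags themselves correspond to the entries in the table above.

```bash
# Convert a Hugging Face checkpoint to a TensorRT-LLM checkpoint with
# 2-way tensor parallelism in float16 (paths are placeholders).
python convert_checkpoint.py \
    --model_dir ./llama-7b-hf \
    --output_dir ./tllm_checkpoint_2gpu \
    --dtype float16 \
    --tp_size 2

# The same conversion with INT8 weight-only quantization enabled.
python convert_checkpoint.py \
    --model_dir ./llama-7b-hf \
    --output_dir ./tllm_checkpoint_2gpu_int8 \
    --dtype float16 \
    --tp_size 2 \
    --use_weight_only \
    --weight_only_precision int8
```

Assuming the boolean options (for example --use_weight_only, --per_channel, --load_model_on_cpu) are standard argparse store_true switches, as their False defaults suggest, passing the flag by itself enables the corresponding behaviour.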