Checkpoint List - Arguments
Here's a table summarising the arguments that you can pass to the `convert_checkpoint.py` script, along with their default values:
| Argument | Default | Description |
| --- | --- | --- |
| `--model_dir` | `None` | Path to the Hugging Face model directory |
| `--meta_ckpt_dir` | `None` | Path to the Meta checkpoint directory |
| `--tp_size` | `1` | N-way tensor parallelism size (see the conversion example after this table) |
| `--pp_size` | `1` | N-way pipeline parallelism size |
| `--dtype` | `'float16'` | Data type (`'float32'`, `'bfloat16'`, `'float16'`) |
| `--vocab_size` | `32000` | Vocabulary size |
| `--n_positions` | `2048` | Number of positions |
| `--n_layer` | `32` | Number of layers |
| `--n_head` | `32` | Number of attention heads |
| `--n_kv_head` | `None` | Number of key-value heads (defaults to `n_head` if not specified) |
| `--n_embd` | `4096` | Hidden size |
| `--inter_size` | `11008` | Intermediate size |
| `--rms_norm_eps` | `1e-06` | RMS normalization epsilon |
| `--use_weight_only` | `False` | Quantize the weights of the various GEMMs to INT4/INT8 (see the quantization example below) |
| `--disable_weight_only_quant_plugin` | `False` | Use the OOTB implementation instead of the plugin for weight quantization |
| `--weight_only_precision` | `'int8'` | Precision for weight-only quantization (`'int8'`, `'int4'`, `'int4_gptq'`) |
| `--smoothquant` | `None` | Set the α parameter for SmoothQuant quantization (float value) |
| `--per_channel` | `False` | Use a per-channel static scaling factor for the GEMM results |
| `--per_token` | `False` | Use a per-token dynamic scaling factor for activations |
| `--int8_kv_cache` | `False` | Use INT8 quantization for the KV cache |
| `--ammo_quant_ckpt_path` | `None` | Path to a quantized model checkpoint in `.npz` format |
| `--per_group` | `False` | Use a per-group dynamic scaling factor for weights in the INT4 range (for GPTQ/AWQ quantization) |
| `--load_by_shard` | `False` | Load a pretrained model shard-by-shard |
| `--hidden_act` | `'silu'` | Hidden activation function |
| `--rotary_base` | `10000` | Rotary base value |
| `--group_size` | `128` | Group size used in GPTQ quantization |
| `--dataset-cache-dir` | `None` | Cache directory for loading the Hugging Face dataset |
| `--load_model_on_cpu` | `False` | Load the model on the CPU |
| `--use_parallel_embedding` | `False` | Enable embedding parallelism |
| `--embedding_sharding_dim` | `0` | Dimension along which to shard the embedding lookup table (`0`: vocab dimension, `1`: hidden dimension) |
| `--use_embedding_sharing` | `False` | Try to reduce the engine size by sharing the embedding lookup table between two layers |
| `--output_dir` | `'tllm_checkpoint'` | Path to save the TensorRT-LLM checkpoint |
| `--workers` | `1` | Number of workers for converting the checkpoint in parallel |
| `--moe_num_experts` | `0` | Number of experts to use for MoE layers |
| `--moe_top_k` | `0` | Top-k value to use for MoE layers (defaults to `1` if `--moe_num_experts` is set) |
| `--moe_tp_mode` | `MoeConfig.ParallelismMode.TENSOR_PARALLEL` | Controls how experts are distributed under tensor parallelism (check `layers/moe.py` for accepted values) |
| `--moe_renorm_mode` | `MoeConfig.ExpertScaleNormalizationMode.RENORM` | Controls renormalization after gate logits (check `layers/moe.py` for accepted values) |
| `--save_config_only` | `False` | Only save the model config, without reading and converting weights (for debugging) |
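For example, a basic invocation that converts a Hugging Face checkpoint into a two-way tensor-parallel TensorRT-LLM checkpoint might look like the following sketch (the directory paths are illustrative, not part of the script):

```bash
# Convert a Hugging Face model into a TensorRT-LLM checkpoint,
# split two ways for tensor parallelism and stored in bfloat16.
# ./llama-2-7b-hf and ./tllm_checkpoint_2gpu are illustrative paths.
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_2gpu \
    --dtype bfloat16 \
    --tp_size 2
```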
These arguments allow you to customise the behaviour of the `convert_checkpoint.py` script according to your specific requirements. You can provide the desired values for these arguments when running the script.
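As a further illustration, the quantization flags from the table can be combined in a single invocation. Below is a minimal sketch of an INT8 weight-only conversion, followed by an INT4-GPTQ conversion that assumes a pre-quantized `.npz` checkpoint already exists (both file paths are illustrative):

```bash
# INT8 weight-only quantization of the GEMM weights.
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

# INT4-GPTQ: load pre-quantized weights from an .npz checkpoint,
# using per-group scaling factors and the default group size of 128.
# ./llama-7b-4bit-gs128.npz is an illustrative file name.
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_int4_gptq \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --ammo_quant_ckpt_path ./llama-7b-4bit-gs128.npz
```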