Checkpoint List - Arguments

Here is a table summarising the arguments that you can pass to the convert_checkpoint.py script, along with their default values:

| Argument | Default Value | Description |
| --- | --- | --- |
| --model_dir | None | Path to the Hugging Face model directory |
| --meta_ckpt_dir | None | Path to the Meta checkpoint directory |
| --tp_size | 1 | N-way tensor parallelism size |
| --pp_size | 1 | N-way pipeline parallelism size |
| --dtype | 'float16' | Data type ('float32', 'bfloat16', 'float16') |
| --vocab_size | 32000 | Vocabulary size |
| --n_positions | 2048 | Number of positions |
| --n_layer | 32 | Number of layers |
| --n_head | 32 | Number of attention heads |
| --n_kv_head | None | Number of key-value heads (defaults to n_head if not specified) |
| --n_embd | 4096 | Hidden size |
| --inter_size | 11008 | Intermediate size |
| --rms_norm_eps | 1e-06 | RMS normalization epsilon |
| --use_weight_only | False | Quantize weights for the various GEMMs to INT4/INT8 |
| --disable_weight_only_quant_plugin | False | Use the OOTB implementation instead of the plugin for weight-only quantization |
| --weight_only_precision | 'int8' | Precision for weight-only quantization ('int8', 'int4', 'int4_gptq') |
| --smoothquant | None | Set the α parameter for SmoothQuant quantization (float value) |
| --per_channel | False | Use a per-channel static scaling factor for the GEMM results |
| --per_token | False | Use a per-token dynamic scaling factor for activations |
| --int8_kv_cache | False | Use INT8 quantization for the KV cache |
| --ammo_quant_ckpt_path | None | Path to a quantized model checkpoint in .npz format |
| --per_group | False | Use a per-group dynamic scaling factor for weights in the INT4 range (for GPTQ/AWQ quantization) |
| --load_by_shard | False | Load the pretrained model shard-by-shard |
| --hidden_act | 'silu' | Hidden activation function |
| --rotary_base | 10000 | Rotary base value |
| --group_size | 128 | Group size used in GPTQ quantization |
| --dataset-cache-dir | None | Cache directory for loading the Hugging Face dataset |
| --load_model_on_cpu | False | Load the model on CPU |
| --use_parallel_embedding | False | Enable embedding parallelism |
| --embedding_sharding_dim | 0 | Dimension along which to shard the embedding lookup table (0: vocab dimension, 1: hidden dimension) |
| --use_embedding_sharing | False | Try to reduce the engine size by sharing the embedding lookup table between two layers |
| --output_dir | 'tllm_checkpoint' | Path to save the TensorRT-LLM checkpoint |
| --workers | 1 | Number of workers for converting the checkpoint in parallel |
| --moe_num_experts | 0 | Number of experts to use for MoE layers |
| --moe_top_k | 0 | Top-k value to use for MoE layers (defaults to 1 if --moe_num_experts is set) |
| --moe_tp_mode | MoeConfig.ParallelismMode.TENSOR_PARALLEL | Controls how experts are distributed under tensor parallelism (see layers/moe.py for accepted values) |
| --moe_renorm_mode | MoeConfig.ExpertScaleNormalizationMode.RENORM | Controls renormalization after gate logits (see layers/moe.py for accepted values) |
| --save_config_only | False | Only save the model config without reading and converting weights (for debugging) |

These arguments let you customise the behaviour of the convert_checkpoint.py script to your specific requirements; supply the desired values on the command line when you run it, as in the sketch below.
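As a minimal sketch, the two invocations below combine a handful of the arguments from the table: the first converts a local Hugging Face Llama 2 7B checkpoint to a single-GPU FP16 TensorRT-LLM checkpoint, the second adds INT8 weight-only quantization and 2-way tensor parallelism. The directory names are placeholders for illustration, not paths assumed by the script.

```bash
# Basic FP16 conversion (placeholder paths - substitute your own).
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

# INT8 weight-only quantization with 2-way tensor parallelism.
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_2gpu_int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 2
```

The resulting --output_dir is what you later point trtllm-build at; the following pages walk through further examples of running the script.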
