Compiling LLaMA Models
Here is a list of the supported arguments; an example invocation is shown after the list.
--model_dir: Specifies the directory where the pre-trained model is stored.
--meta_ckpt_dir: Specifies the directory where the Meta checkpoint is stored.
--tp_size: Sets the N-way tensor parallelism size.
--pp_size: Sets the N-way pipeline parallelism size.
--dtype: Determines the data type for the model weights, with options including float32, bfloat16, and float16.
--vocab_size: Specifies the vocabulary size of the model.
--n_positions: Sets the number of positions for embeddings.
--n_layer: Specifies the number of layers in the model.
--n_head: Sets the number of attention heads in the model.
--n_kv_head: Specifies the number of key-value heads, if different from n_head.
--n_embd: Sets the dimensionality of embeddings.
--inter_size: Specifies the size of the intermediate layer in the transformer.
--rms_norm_eps: Sets the epsilon value for RMS normalization.
--use_weight_only: Enables quantization of weights only, without affecting activations.
--disable_weight_only_quant_plugin: Disables the plugin implementation for weight-only quantization, using the out-of-the-box implementation instead.
--weight_only_precision: Defines the precision for weight-only quantization, with options including int8, int4, int4_awq, and int4_gptq.
--smoothquant: Activates SmoothQuant quantization with a specified alpha parameter.
--per_channel: Uses a different static scaling factor for each channel of the GEMM result during quantization.
--per_token: Chooses a custom scaling factor for each token at runtime during quantization.
--int8_kv_cache: Enables INT8 quantization for the key-value cache.
--ammo_quant_ckpt_path: Path to a quantized model checkpoint in .npz format.
--per_group: Chooses a custom scaling factor for each group at runtime, specifically for GPTQ/AWQ quantization.
--quantize_lm_head: Quantizes the language model head weights as well when using int4_awq.
--enable_fp8: Uses FP8 linear layers for the attention QKV/dense projections and the MLP.
--fp8_kv_cache: Chooses FP8 quantization for the key-value cache.
--load_by_shard: Enables loading a pre-trained model shard-by-shard.
--hidden_act: Specifies the hidden activation function.
--rotary_base: Sets the base value for rotary embeddings.
--rotary_scaling: Specifies the type and factor for rotary scaling.
--group_size: Sets the group size used in GPTQ/AWQ quantization.
--storage-type: Specifies the storage type, with options including fp32 and fp16.
--dataset-cache-dir: Sets the cache directory used to load the Hugging Face dataset.
--load-model-on-cpu: Forces the model to load on the CPU.
--convert-model-on-cpu: Forces the model conversion to occur on the CPU.
--use_parallel_embedding: Enables embedding parallelism.
--embedding_sharding_dim: Specifies the dimension along which to shard the embedding lookup table.
--use_embedding_sharing: Attempts to reduce the engine size by sharing the embedding lookup table between layers.
--use_prompt_tuning: Enables prompt tuning.
--output_dir: Specifies the directory to save the converted model checkpoint.
--workers: Sets the number of workers for converting the checkpoint in parallel.
--moe_num_experts: Specifies the number of experts for MoE layers.
--moe_top_k: Sets the top_k value for MoE layers.
--moe_tp_mode: Determines how to distribute experts in tensor parallelism.
--moe_renorm_mode: Controls renormalization after gate logits for MoE.
--use_fused_mlp: Enables horizontal fusion in GatedMLP.
--enable_pos_shift: Enables position shift for the streaming LLM method.
--dense_context_fmha: Enables dense FMHA in the context phase, as opposed to sliding window attention.
--hf_lora_dir: Specifies the directory for a LoRA model.
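To show how these arguments fit together, here is a minimal sketch of two invocations, assuming the flags above belong to a LLaMA checkpoint-conversion script such as TensorRT-LLM's examples/llama/convert_checkpoint.py; the script name, model paths, and output directories are illustrative assumptions, not values taken from this page.

    # Sketch 1 (assumed paths): convert a Hugging Face LLaMA model to an FP16
    # checkpoint split for 2-way tensor parallelism.
    python convert_checkpoint.py \
        --model_dir ./llama-7b-hf \
        --dtype float16 \
        --tp_size 2 \
        --output_dir ./trt_ckpt/llama-7b/fp16/2-gpu

    # Sketch 2 (assumed paths): the same conversion with weight-only INT8
    # quantization enabled via --use_weight_only and --weight_only_precision.
    python convert_checkpoint.py \
        --model_dir ./llama-7b-hf \
        --dtype float16 \
        --use_weight_only \
        --weight_only_precision int8 \
        --output_dir ./trt_ckpt/llama-7b/int8-wo/1-gpu

Only flags documented in the list above are used in these sketches; every other detail (model size, directory layout) is illustrative.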