Compiling LLama Models

Below is a list of the command-line arguments used when converting Llama checkpoints for compilation into TensorRT-LLM engines (example invocations follow the list):

  1. --model_dir: Specifies the directory where the pre-trained model is stored.

  2. --meta_ckpt_dir: Specifies the directory where the meta checkpoint is stored.

  3. --tp_size: Sets the N-way tensor parallelism size.

  4. --pp_size: Sets the N-way pipeline parallelism size.

  5. --dtype: Determines the data type for the model weights, with options including float32, bfloat16, and float16.

  6. --vocab_size: Specifies the vocabulary size of the model.

  7. --n_positions: Sets the number of positions for embeddings.

  8. --n_layer: Specifies the number of layers in the model.

  9. --n_head: Sets the number of attention heads in the model.

  10. --n_kv_head: Specifies the number of key-value heads, if different from n_head.

  11. --n_embd: Sets the dimensionality of embeddings.

  12. --inter_size: Specifies the size of the intermediate layer in the transformer.

  13. --rms_norm_eps: Sets the epsilon value for RMS normalization.

  14. --use_weight_only: Enables quantization of weights only, without affecting activations.

  15. --disable_weight_only_quant_plugin: Disables the plugin implementation for weight-only quantization, using the out-of-the-box implementation instead.

  16. --weight_only_precision: Defines the precision for weight-only quantization, with options including int8, int4, int4_awq, and int4_gptq.

  17. --smoothquant: Activates SmoothQuant quantization with a specified alpha parameter.

  18. --per_channel: Uses a different static scaling factor for each channel of the GEMM result during quantization.

  19. --per_token: Chooses a custom scaling factor for each token at runtime during quantization.

  20. --int8_kv_cache: Enables INT8 quantization for the key-value cache.

  21. --ammo_quant_ckpt_path: Path to a quantized model checkpoint in .npz format.

  22. --per_group: Chooses a custom scaling factor for each group at runtime, specifically for GPTQ/AWQ quantization.

  23. --quantize_lm_head: Quantizes the language model head weights as well when using INT4_AWQ.

  24. --enable_fp8: Uses FP8 linear layers for the attention QKV/dense projections and the MLP.

  25. --fp8_kv_cache: Chooses FP8 quantization for the key-value cache.

  26. --load_by_shard: Enables loading a pre-trained model shard-by-shard.

  27. --hidden_act: Specifies the hidden activation function.

  28. --rotary_base: Sets the base value for rotary embeddings.

  29. --rotary_scaling: Specifies the type and factor for rotary scaling.

  30. --group_size: Sets the group size used in GPTQ/AWQ quantization.

  31. --storage-type: Specifies the storage type, with options including fp32 and fp16.

  32. --dataset-cache-dir: Sets the cache directory to load the Hugging Face dataset.

  33. --load-model-on-cpu: Forces the model to load on the CPU.

  34. --convert-model-on-cpu: Forces the model conversion to occur on the CPU.

  35. --use_parallel_embedding: Enables embedding parallelism.

  36. --embedding_sharding_dim: Specifies the dimension along which to shard the embedding lookup table.

  37. --use_embedding_sharing: Attempts to reduce the engine size by sharing the embedding lookup table between layers.

  38. --use_prompt_tuning: Enables prompt tuning.

  39. --output_dir: Specifies the directory to save the converted model checkpoint.

  40. --workers: Sets the number of workers for converting the checkpoint in parallel.

  41. --moe_num_experts: Specifies the number of experts for MOE layers.

  42. --moe_top_k: Sets the top_k value for MOE layers.

  43. --moe_tp_mode: Determines how to distribute experts in tensor parallelism.

  44. --moe_renorm_mode: Controls renormalization after gate logits for MOE.

  45. --use_fused_mlp: Enables horizontal fusion in GatedMLP.

  46. --enable_pos_shift: Enables position shift for the streaming LLM method.

  47. --dense_context_fmha: Enables dense FMHA in the context phase instead of sliding-window attention.

  48. --hf_lora_dir: Specifies the directory for a LoRA model.
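
To make these flags concrete, here is a minimal sketch of two example invocations. The script name (convert_checkpoint.py from the Llama example folder covered earlier), the model paths and the output directories are illustrative placeholders; only the flags themselves come from the list above.

```bash
# Hypothetical paths - replace with your own checkpoint and output locations.

# FP16 conversion of a Hugging Face Llama 2 7B checkpoint with 2-way tensor parallelism
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_2gpu_fp16 \
    --dtype float16 \
    --tp_size 2

# Single-GPU conversion with INT8 weight-only quantization and an INT8 KV cache
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_1gpu_int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --int8_kv_cache
```

The converted checkpoint written to --output_dir is then passed to trtllm-build (see the earlier build sections) to compile the final TensorRT engine.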
