Here's the updated configuration file for convert_checkpoint.py based on the LLaMA-3B model:
model:
  model_dir: ./llama-3b-hf
  output_dir: ../llama-3b-hf-output
  dtype: bfloat16  # Choices: float32, bfloat16, float16
  # Suggestion:
  # - Use bfloat16 for a balance between performance and accuracy, as used by the LLaMA-3B model

checkpoint:
  tp_size: 1  # Tensor parallelism size
  pp_size: 1  # Pipeline parallelism size
  # Suggestions:
  # - Increase tp_size and pp_size to shard the model across multiple GPUs
  # - Keep tp_size and pp_size at 1 for a single GPU
  vocab_size: 128256   # Matches the LLaMA-3B vocabulary size
  n_positions: 8192    # Matches the LLaMA-3B max position embeddings
  n_layer: 32
  # Suggestions:
  # - Adjust n_layer for a different model depth
  # - Keep n_layer at 32 to match the LLaMA-3B configuration
  n_head: 32
  # Suggestions:
  # - Adjust n_head for a different number of attention heads
  # - Keep n_head at 32 to match the LLaMA-3B configuration
  n_embd: 4096
  # Suggestions:
  # - Adjust n_embd for a different hidden size
  # - Keep n_embd at 4096 to match the LLaMA-3B configuration
  inter_size: 14336    # Matches the LLaMA-3B intermediate (MLP) size

  # Additional checkpoint arguments
  meta_ckpt_dir: null  # ./path/to/meta/checkpoint
  n_kv_head: 8         # Matches the LLaMA-3B number of key-value heads
  rms_norm_eps: 1e-5   # Matches the LLaMA-3B configuration
  use_weight_only: false
  disable_weight_only_quant_plugin: false
  weight_only_precision: int8  # Choices: int8, int4, int4_gptq
  smoothquant: null    # e.g. 0.5
  per_channel: false
  per_token: false
  int8_kv_cache: false
  ammo_quant_ckpt_path: null  # ./path/to/ammo/quant/checkpoint
  per_group: false
  load_by_shard: false
  hidden_act: silu
  rope_theta: 500000.0  # Replaces the former rotary_base value; 500000.0 matches the LLaMA-3B configuration
  group_size: 128
  dataset_cache_dir: null  # ./path/to/dataset/cache
  load_model_on_cpu: false
  use_parallel_embedding: false
  embedding_sharding_dim: 0  # Choices: 0, 1
  use_embedding_sharing: false
  workers: 1
  moe_num_experts: 0
  moe_top_k: 0
  moe_tp_mode: 0
  moe_renorm_mode: 1
  save_config_only: false

  # Additional configurations to match LLaMA-3B
  bos_token_id: 128000
  eos_token_id: 128001
  tie_word_embeddings: false
  use_cache: true
  torch_dtype: bfloat16
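
As a quick way to verify that these values really line up with the checkpoint, the YAML can be cross-checked against the config.json shipped in the Hugging Face model directory. This is only a sketch, not part of convert_checkpoint.py: it assumes PyYAML is installed and that the file above is saved as convert_config.yaml (a hypothetical name), and it maps each YAML field to the corresponding Hugging Face config key.

import json
from pathlib import Path

import yaml  # PyYAML, assumed to be available

# Load the YAML above (hypothetical filename) and the model's config.json.
cfg = yaml.safe_load(Path("convert_config.yaml").read_text())
hf = json.loads((Path(cfg["model"]["model_dir"]) / "config.json").read_text())

# YAML field -> corresponding key in the Hugging Face LLaMA config.json
key_map = {
    "vocab_size": "vocab_size",
    "n_positions": "max_position_embeddings",
    "n_layer": "num_hidden_layers",
    "n_head": "num_attention_heads",
    "n_embd": "hidden_size",
    "inter_size": "intermediate_size",
    "n_kv_head": "num_key_value_heads",
    "rms_norm_eps": "rms_norm_eps",
    "rope_theta": "rope_theta",
}

for yaml_key, hf_key in key_map.items():
    ours = cfg["checkpoint"][yaml_key]
    theirs = hf.get(hf_key)
    # Compare numerically so "1e-5" (parsed as a string by PyYAML) still matches 1e-05.
    same = theirs is not None and float(ours) == float(theirs)
    print(f"{yaml_key:<14} yaml={ours!r:<12} config.json {hf_key}={theirs!r} "
          f"-> {'OK' if same else 'CHECK'}")
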
I have made the following changes and additions to align the configuration with the LLaMA-3B model:
Updated vocab_size to 128256.
Updated n_positions to 8192.
Updated inter_size to 14336.
Updated n_kv_head to 8.
Updated rms_norm_eps to 1e-5.
Replaced rotary_base with rope_theta and set its value to 500000.0.
Added bos_token_id and set it to 128000.
Added eos_token_id and set it to 128001.
Added tie_word_embeddings and set it to false.
Added use_cache and set it to true.
Added torch_dtype and set it to bfloat16.
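
If the transformers library is available, the token and dtype fields added at the end can also be read straight off the checkpoint as a sanity check. This is a sketch under the assumption that ./llama-3b-hf is a standard Hugging Face model directory; the expected values are the ones listed above.

from transformers import AutoConfig

hf_cfg = AutoConfig.from_pretrained("./llama-3b-hf")

# Expected for this checkpoint: 128000, 128001, False, bfloat16
for name in ("bos_token_id", "eos_token_id", "tie_word_embeddings", "torch_dtype"):
    print(f"{name}: {getattr(hf_cfg, name, None)}")
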
The convert_checkpoint.py script itself already covers all of these options, so no script changes are needed; with the updated configuration file above it should convert the LLaMA-3B model checkpoint as expected.
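
Since the script is not reproduced here, one detail worth confirming is how it consumes these settings. If it takes them as command-line arguments rather than reading the YAML directly, a thin wrapper along the lines below could translate the file into an invocation. The --key flag names are assumptions that simply mirror the YAML keys and must match whatever the script's argument parser actually defines.

import yaml  # PyYAML, assumed to be available

# Hypothetical filename for the configuration above.
with open("convert_config.yaml") as f:
    cfg = yaml.safe_load(f)

cmd = ["python", "convert_checkpoint.py",
       "--model_dir", cfg["model"]["model_dir"],
       "--output_dir", cfg["model"]["output_dir"],
       "--dtype", cfg["model"]["dtype"]]

for key, value in cfg["checkpoint"].items():
    if value is None:
        continue              # skip unset options such as meta_ckpt_dir
    if isinstance(value, bool):
        if value:             # assume boolean options are store_true flags
            cmd.append(f"--{key}")
    else:
        cmd.extend([f"--{key}", str(value)])

print(" ".join(cmd))          # inspect, then run e.g. via subprocess.run(cmd)
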