Checkpoint Script Arguments
With the model downloaded, the next step is to convert the checkpoints.
The convert_checkpoint.py script is part of the TensorRT-LLM workflow for converting a pre-trained language model, such as LLaMA, into an optimised format suitable for inference on GPUs using TensorRT.
The script takes a pre-trained model checkpoint and converts it into a format that can be loaded and optimised by TensorRT-LLM.
The script accepts various command-line arguments to configure the conversion process, and the TensorRT-LLM library documents the configurations you can make for checkpoint conversion. We will review the convert_checkpoint.py file here, but note that we have created our own configuration files for greater transparency and usability.
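As a point of reference, a minimal conversion command might look like the sketch below. The model and output paths are illustrative assumptions, and the exact set of flags depends on your TensorRT-LLM version.
```bash
# Minimal sketch: convert a Hugging Face LLaMA checkpoint to a TensorRT-LLM
# checkpoint in FP16. The paths are placeholders for your own directories.
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint \
    --dtype float16
```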
To use the convert_checkpoint.py script effectively, follow these guidelines:
Prepare the model
We assume that you have already downloaded the necessary model files from Hugging Face.
The script supports loading models from different sources, such as a Hugging Face model directory (--model_dir) or a meta checkpoint directory (--meta_ckpt_dir). It uses the Hugging Face Transformers library to load the pre-trained model and its configuration. The preload_model function is responsible for loading the model based on the specified directory and device (CPU or GPU).
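For example, loading from a Hugging Face directory versus a meta checkpoint directory is just a matter of which source flag you pass; the paths below are placeholders.
```bash
# Sketch: load from a Hugging Face model directory...
python convert_checkpoint.py --model_dir ./llama-hf --output_dir ./ckpt_hf --dtype float16

# ...or from a meta checkpoint directory.
python convert_checkpoint.py --meta_ckpt_dir ./llama-meta --output_dir ./ckpt_meta --dtype float16
```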
Set parallelism
Specify the desired level of parallelism using the --tp_size and --pp_size arguments. --tp_size determines the tensor parallelism size, which splits the model's tensor computations across multiple GPUs. --pp_size sets the pipeline parallelism size, which divides the model into stages for parallel execution.
Adjust these values based on your available hardware and performance requirements.
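For instance, the sketch below converts a checkpoint for four GPUs, split as two-way tensor parallelism and two-way pipeline parallelism; the sizes are illustrative and should match your hardware.
```bash
# Sketch: 2-way tensor parallelism x 2-way pipeline parallelism (4 GPUs total).
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint_tp2_pp2 \
    --dtype float16 \
    --tp_size 2 \
    --pp_size 2
```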
Choose the data type
Use the --dtype argument to specify the data type for the model weights. Available options include float32, bfloat16, and float16. Consider the trade-off between precision and performance when selecting the data type.
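For example, a bfloat16 conversion might look like this (paths are placeholders):
```bash
# Sketch: convert the weights in bfloat16 rather than float16 or float32.
python convert_checkpoint.py --model_dir ./llama-hf --output_dir ./ckpt_bf16 --dtype bfloat16
```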
Configure model-specific parameters
Set the appropriate values for model-specific parameters such as --vocab_size, --n_positions, --n_layer, --n_head, --n_embd, --inter_size, and others.
These parameters define the model architecture and should match the specifications of the original model.
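As an illustration only, the values below correspond to a 7B-class LLaMA configuration; when you load from --model_dir these values are normally read from the Hugging Face config, so treat this as a sketch of how the flags are passed rather than values you must supply.
```bash
# Sketch: explicitly setting architecture parameters (assumed 7B-class values).
python convert_checkpoint.py \
    --meta_ckpt_dir ./llama-meta \
    --output_dir ./ckpt_custom \
    --dtype float16 \
    --vocab_size 32000 \
    --n_positions 4096 \
    --n_layer 32 \
    --n_head 32 \
    --n_embd 4096 \
    --inter_size 11008
```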
Apply quantization and optimization
The script supports various quantization options to reduce the model size and improve inference performance.
Experiment with different combinations to achieve the desired balance between performance and accuracy.
Quantization settings are determined by command-line arguments such as --use_weight_only, --weight_only_precision, --smoothquant, --per_channel, and --per_token.
The args_to_quantization function maps the command-line arguments to the corresponding quantization configuration (QuantConfig). Quantization algorithms such as QuantAlgo.W8A16, QuantAlgo.W4A16, and QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN are selected based on the specified settings.
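The two sketches below show common starting points: INT8 weight-only quantization, and SmoothQuant with per-channel and per-token scaling. The smoothing factor of 0.5 is an assumed example value.
```bash
# Sketch: INT8 weight-only quantization (maps to QuantAlgo.W8A16).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_int8_wo \
    --dtype float16 --use_weight_only --weight_only_precision int8

# Sketch: SmoothQuant with per-channel and per-token scaling
# (maps to QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_sq \
    --dtype float16 --smoothquant 0.5 --per_channel --per_token
```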
Configure parallelism
The script supports tensor parallelism and pipeline parallelism to distribute the model across multiple GPUs.
The --tp_size and --pp_size arguments control the parallelism settings. The Mapping class is used to define how the model is mapped across GPUs based on the parallelism settings.
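As another sketch, a world size of eight GPUs could be expressed as four-way tensor parallelism and two-way pipeline parallelism; the Mapping derived from these flags then determines where each model shard is placed.
```bash
# Sketch: 4-way tensor parallelism x 2-way pipeline parallelism (world size 8).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_tp4_pp2 \
    --dtype float16 --tp_size 4 --pp_size 2
```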
Configure sharding
Use flags like --load_by_shard, --use_parallel_embedding, --embedding_sharding_dim, and --use_embedding_sharing to control how the model is loaded and how embeddings are handled in parallel computing environments. Adjust these settings based on your hardware setup and performance goals.
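For example, the sketch below loads the Hugging Face weights shard by shard and shards the embedding table across GPUs; the sharding dimension value shown follows the assumed convention (0 for the vocabulary dimension, 1 for the hidden dimension), so check your version's help text.
```bash
# Sketch: shard-by-shard loading with parallel embedding, sharded on the
# vocabulary dimension (0 = vocab, 1 = hidden, per the assumed convention).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_sharded \
    --dtype float16 --tp_size 2 \
    --load_by_shard \
    --use_parallel_embedding \
    --embedding_sharding_dim 0
```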
Customize model architecture
If needed, customise the model architecture using flags such as --hidden_act, --rotary_base, --group_size, and others.
These flags allow you to fine-tune the model's activation functions, quantization settings, and other architectural aspects.
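For instance, the following sketch overrides the activation function, the rotary embedding base, and the group size used for group-wise weight-only quantization; the values shown are illustrative defaults rather than recommendations.
```bash
# Sketch: overriding activation, rotary base, and quantization group size
# (assumed example values).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_arch \
    --dtype float16 \
    --hidden_act silu \
    --rotary_base 10000 \
    --group_size 128
```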
Set up Mixture of Experts (MoE) and LoRA
For advanced use cases, configure MoE layers and LoRA (Low-Rank Adaptation) using flags like --moe_num_experts, --moe_top_k, --moe_tp_mode, --moe_renorm_mode, --lora_target_modules, and --max_lora_rank.
These settings enable sophisticated customisation for performance and functionality.
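The sketch below shows the general shape of such a command: an eight-expert, top-2 MoE configuration together with LoRA targeting the attention query and value projections. The expert counts, module names, rank, and paths are assumed example values, not settings taken from this guide.
```bash
# Sketch: MoE (8 experts, top-2 routing) plus LoRA on attention q/v projections.
python convert_checkpoint.py \
    --model_dir ./mixtral-hf --output_dir ./ckpt_moe_lora \
    --dtype float16 \
    --moe_num_experts 8 \
    --moe_top_k 2 \
    --lora_target_modules attn_q attn_v \
    --max_lora_rank 64
```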
Optimise TensorRT engines
Use flags like --use_fused_mlp, --enable_pos_shift, --dense_context_fmha, and --hf_lora_dir to control specific optimisations and features in the TensorRT engines.
These optimisations can improve performance and enable advanced techniques like fused MLP layers and position shift for streaming large language models.
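As a sketch only: depending on the TensorRT-LLM version, some of these flags are accepted by convert_checkpoint.py and others by the engine build step, so verify with --help before relying on them. The LoRA directory below is a placeholder.
```bash
# Sketch: enabling fused MLP, position shift (for streaming), dense context FMHA,
# and pointing at a Hugging Face LoRA adapter directory (placeholder path).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_opt \
    --dtype float16 \
    --use_fused_mlp \
    --enable_pos_shift \
    --dense_context_fmha \
    --hf_lora_dir ./lora-adapter
```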
Specify output settings
After the model conversion and quantization, the script saves the converted checkpoint to the specified output directory (--output_dir). The save_checkpoint method of LLaMAForCausalLM is used to save the converted model weights and configuration.
Specify runtime settings
Adjust the number of workers for parallel conversion using the --workers argument based on your available resources.
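For example, the sketch below writes the converted checkpoint to a chosen directory and uses two workers to convert the tensor-parallel ranks in parallel; the worker count is an assumed value that should match your CPU and memory headroom.
```bash
# Sketch: choose the output directory and run the conversion with 2 workers.
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint_tp2 \
    --dtype float16 --tp_size 2 \
    --workers 2
```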
Remember to carefully consider the default values for each argument and override them as needed based on your specific requirements.
Experiment with different combinations of arguments to find the optimal configuration for your use case.
By following these guidelines and leveraging the various arguments provided by the convert_checkpoint.py script, you can effectively convert and optimise your model checkpoint for deployment and inference.
Key considerations when using this script
Ensure that the pre-trained model is compatible with the LLaMA architecture and can be loaded using the Hugging Face Transformers library.
Choose the appropriate data type (--dtype) based on the desired precision and performance trade-off. Float16 (FP16) is commonly used for faster inference with minimal accuracy loss.
Consider the available GPU memory and select the appropriate parallelism settings (--tp_size and --pp_size) to distribute the model across multiple GPUs if necessary.
Experiment with different quantization settings to achieve the desired balance between model size, inference speed, and accuracy. Weight-only quantization (--use_weight_only) and SmoothQuant (--smoothquant) are popular options.
Ensure that the output directory (--output_dir) has sufficient space to store the converted checkpoint.
If using multi-threading (--workers), ensure that the system has enough resources to handle the parallel execution (a combined sketch follows below).
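The combined sketch below pulls several of these considerations into a single command: FP16 weights, two-way tensor parallelism, INT8 weight-only quantization, an explicit output directory, and two conversion workers. All paths and sizes are illustrative assumptions.
```bash
# Combined sketch: FP16, 2-way tensor parallelism, INT8 weight-only quantization,
# explicit output directory, and 2 parallel conversion workers.
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint_tp2_int8 \
    --dtype float16 \
    --tp_size 2 \
    --use_weight_only --weight_only_precision int8 \
    --workers 2
```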
Overall, the convert_checkpoint.py script plays a vital role in the TensorRT-LLM workflow by converting pre-trained language models into a format optimised for inference on GPUs using TensorRT. It provides flexibility in model loading, quantization, and parallelism, and saves the converted checkpoint for further optimisation and deployment.
We have put together a configuration script to make the process more transparent and easier to use.