Checkpoint Script Arguments
With the model downloaded, the next step is to convert the checkpoints.
The convert_checkpoint.py script is part of the TensorRT-LLM workflow for converting a pre-trained language model, such as LLaMA, into an optimised format suitable for inference on GPUs using TensorRT.
The script takes a pre-trained model checkpoint and converts it into a format that can be loaded and optimised by TensorRT-LLM.
The script accepts various command-line arguments to configure the conversion process, and the TensorRT-LLM library documents the configurations you can make for checkpoint conversion. We will review the convert_checkpoint.py file here, but note that we have created our own configuration files for greater transparency and usability.
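As a point of reference, a minimal conversion command might look like the sketch below. The model and output paths are illustrative assumptions, and the exact set of flags depends on your TensorRT-LLM version.
```bash
# Minimal sketch: convert a Hugging Face LLaMA checkpoint to a TensorRT-LLM
# checkpoint in FP16. The paths are placeholders for your own directories.
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint \
    --dtype float16
```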
To use the convert_checkpoint.py script effectively, follow these guidelines:
Prepare the model
We assume that you have already downloaded the necessary model files from Hugging Face.
The script supports loading models from different sources, such as a Hugging Face model directory (--model_dir) or a meta checkpoint directory (--meta_ckpt_dir). It uses the Hugging Face Transformers library to load the pre-trained model and its configuration. The preload_model function is responsible for loading the model based on the specified directory and device (CPU or GPU).
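For example, loading from a Hugging Face directory versus a meta checkpoint directory is just a matter of which source flag you pass; the paths below are placeholders.
```bash
# Sketch: load from a Hugging Face model directory...
python convert_checkpoint.py --model_dir ./llama-hf --output_dir ./ckpt_hf --dtype float16

# ...or from a meta checkpoint directory.
python convert_checkpoint.py --meta_ckpt_dir ./llama-meta --output_dir ./ckpt_meta --dtype float16
```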
Set parallelism
Specify the desired level of parallelism using the --tp_size and --pp_size arguments. --tp_size determines the tensor parallelism size, which splits the model's tensor computations across multiple GPUs. --pp_size sets the pipeline parallelism size, which divides the model into stages for parallel execution.
Adjust these values based on your available hardware and performance requirements.
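For instance, the sketch below converts a checkpoint for four GPUs, split as two-way tensor parallelism and two-way pipeline parallelism; the sizes are illustrative and should match your hardware.
```bash
# Sketch: 2-way tensor parallelism x 2-way pipeline parallelism (4 GPUs total).
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint_tp2_pp2 \
    --dtype float16 \
    --tp_size 2 \
    --pp_size 2
```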
Choose the data type
Use the --dtype argument to specify the data type for the model weights. Available options include float32, bfloat16, and float16. Consider the trade-off between precision and performance when selecting the data type.
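For example, a bfloat16 conversion might look like this (paths are placeholders):
```bash
# Sketch: convert the weights in bfloat16 rather than float16 or float32.
python convert_checkpoint.py --model_dir ./llama-hf --output_dir ./ckpt_bf16 --dtype bfloat16
```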
Configure model-specific parameters
Set the appropriate values for model-specific parameters such as --vocab_size, --n_positions, --n_layer, --n_head, --n_embd, --inter_size, and others.
These parameters define the model architecture and should match the specifications of the original model.
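As an illustration only, the values below correspond to a 7B-class LLaMA configuration; when you load from --model_dir these values are normally read from the Hugging Face config, so treat this as a sketch of how the flags are passed rather than values you must supply.
```bash
# Sketch: explicitly setting architecture parameters (assumed 7B-class values).
python convert_checkpoint.py \
    --meta_ckpt_dir ./llama-meta \
    --output_dir ./ckpt_custom \
    --dtype float16 \
    --vocab_size 32000 \
    --n_positions 4096 \
    --n_layer 32 \
    --n_head 32 \
    --n_embd 4096 \
    --inter_size 11008
```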
Apply quantization and optimization
The script supports various quantization options to reduce the model size and improve inference performance.
Experiment with different combinations to achieve the desired balance between performance and accuracy.
Quantization settings are determined by command-line arguments such as --use_weight_only, --weight_only_precision, --smoothquant, --per_channel, and --per_token.
The args_to_quantization function maps the command-line arguments to the corresponding quantization configuration (QuantConfig). Quantization algorithms such as QuantAlgo.W8A16, QuantAlgo.W4A16, and QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN are selected based on the specified settings.
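The two sketches below show common starting points: INT8 weight-only quantization, and SmoothQuant with per-channel and per-token scaling. The smoothing factor of 0.5 is an assumed example value.
```bash
# Sketch: INT8 weight-only quantization (maps to QuantAlgo.W8A16).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_int8_wo \
    --dtype float16 --use_weight_only --weight_only_precision int8

# Sketch: SmoothQuant with per-channel and per-token scaling
# (maps to QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_sq \
    --dtype float16 --smoothquant 0.5 --per_channel --per_token
```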
Configure parallelism
The script supports tensor parallelism and pipeline parallelism to distribute the model across multiple GPUs.
The --tp_size and --pp_size arguments control the parallelism settings. The Mapping class is used to define how the model is mapped across GPUs based on the parallelism settings.
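As another sketch, a world size of eight GPUs could be expressed as four-way tensor parallelism and two-way pipeline parallelism; the Mapping derived from these flags then determines where each model shard is placed.
```bash
# Sketch: 4-way tensor parallelism x 2-way pipeline parallelism (world size 8).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_tp4_pp2 \
    --dtype float16 --tp_size 4 --pp_size 2
```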
Configure sharding
Use flags like --load_by_shard, --use_parallel_embedding, --embedding_sharding_dim, and --use_embedding_sharing to control how the model is loaded and how embeddings are handled in parallel computing environments. Adjust these settings based on your hardware setup and performance goals.
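For example, the sketch below loads the Hugging Face weights shard by shard and shards the embedding table across GPUs; the sharding dimension value shown follows the assumed convention (0 for the vocabulary dimension, 1 for the hidden dimension), so check your version's help text.
```bash
# Sketch: shard-by-shard loading with parallel embedding, sharded on the
# vocabulary dimension (0 = vocab, 1 = hidden, per the assumed convention).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_sharded \
    --dtype float16 --tp_size 2 \
    --load_by_shard \
    --use_parallel_embedding \
    --embedding_sharding_dim 0
```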
Customize model architecture
If needed, customise the model architecture using flags such as --hidden_act, --rotary_base, --group_size, and others.
These flags allow you to fine-tune the model's activation functions, quantization settings, and other architectural aspects.
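For instance, the following sketch overrides the activation function, the rotary embedding base, and the group size used for group-wise weight-only quantization; the values shown are illustrative defaults rather than recommendations.
```bash
# Sketch: overriding activation, rotary base, and quantization group size
# (assumed example values).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_arch \
    --dtype float16 \
    --hidden_act silu \
    --rotary_base 10000 \
    --group_size 128
```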
Set up Mixture of Experts (MoE) and LoRA
For advanced use cases, configure MoE layers and LoRA (Low-Rank Adaptation) using flags like --moe_num_experts, --moe_top_k, --moe_tp_mode, --moe_renorm_mode, --lora_target_modules, and --max_lora_rank.
These settings enable sophisticated customisation for performance and functionality.
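The sketch below shows the general shape of such a command: an eight-expert, top-2 MoE configuration together with LoRA targeting the attention query and value projections. The expert counts, module names, rank, and paths are assumed example values, not settings taken from this guide.
```bash
# Sketch: MoE (8 experts, top-2 routing) plus LoRA on attention q/v projections.
python convert_checkpoint.py \
    --model_dir ./mixtral-hf --output_dir ./ckpt_moe_lora \
    --dtype float16 \
    --moe_num_experts 8 \
    --moe_top_k 2 \
    --lora_target_modules attn_q attn_v \
    --max_lora_rank 64
```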
Optimise TensorRT engines
Use flags like --use_fused_mlp, --enable_pos_shift, --dense_context_fmha, and --hf_lora_dir to control specific optimisations and features in the TensorRT engines.
These optimisations can improve performance and enable advanced techniques like fused MLP layers and position shift for streaming large language models.
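As a sketch only: depending on the TensorRT-LLM version, some of these flags are accepted by convert_checkpoint.py and others by the engine build step, so verify with --help before relying on them. The LoRA directory below is a placeholder.
```bash
# Sketch: enabling fused MLP, position shift (for streaming), dense context FMHA,
# and pointing at a Hugging Face LoRA adapter directory (placeholder path).
python convert_checkpoint.py \
    --model_dir ./llama-hf --output_dir ./ckpt_opt \
    --dtype float16 \
    --use_fused_mlp \
    --enable_pos_shift \
    --dense_context_fmha \
    --hf_lora_dir ./lora-adapter
```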
Specify output settings
After the model conversion and quantization, the script saves the converted checkpoint to the specified output directory (--output_dir). The save_checkpoint method of LLaMAForCausalLM is used to save the converted model weights and configuration.
Specify runtime settings
Adjust the number of workers for parallel conversion using the --workers argument based on your available resources.
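For example, the sketch below writes the converted checkpoint to a chosen directory and uses two workers to convert the tensor-parallel ranks in parallel; the worker count is an assumed value that should match your CPU and memory headroom.
```bash
# Sketch: choose the output directory and run the conversion with 2 workers.
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint_tp2 \
    --dtype float16 --tp_size 2 \
    --workers 2
```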
Remember to carefully consider the default values for each argument and override them as needed based on your specific requirements.
Experiment with different combinations of arguments to find the optimal configuration for your use case.
By following these guidelines and leveraging the various arguments provided by the convert_checkpoint.py script, you can effectively convert and optimise your model checkpoint for deployment and inference.
Key considerations when using this script
Ensure that the pre-trained model is compatible with the LLaMA architecture and can be loaded using the Hugging Face Transformers library.
Choose the appropriate data type (--dtype) based on the desired precision and performance trade-off. Float16 (FP16) is commonly used for faster inference with minimal accuracy loss.
Consider the available GPU memory and select the appropriate parallelism settings (--tp_size and --pp_size) to distribute the model across multiple GPUs if necessary.
Experiment with different quantization settings to achieve the desired balance between model size, inference speed, and accuracy. Weight-only quantization (--use_weight_only) and SmoothQuant (--smoothquant) are popular options.
Ensure that the output directory (--output_dir) has sufficient space to store the converted checkpoint.
If using multi-threading (--workers), ensure that the system has enough resources to handle the parallel execution (a combined sketch follows below).
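The combined sketch below pulls several of these considerations into a single command: FP16 weights, two-way tensor parallelism, INT8 weight-only quantization, an explicit output directory, and two conversion workers. All paths and sizes are illustrative assumptions.
```bash
# Combined sketch: FP16, 2-way tensor parallelism, INT8 weight-only quantization,
# explicit output directory, and 2 parallel conversion workers.
python convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./tllm_checkpoint_tp2_int8 \
    --dtype float16 \
    --tp_size 2 \
    --use_weight_only --weight_only_precision int8 \
    --workers 2
```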
Overall, the convert_checkpoint.py script plays a vital role in the TensorRT-LLM workflow by converting pre-trained language models into a format optimised for inference on GPUs using TensorRT. It provides flexibility in model loading, quantization, and parallelism, and saves the converted checkpoint for further optimisation and deployment.
We have put together a configuration script to make the process more transparent and easier to use.