llama/convert.py
The `convert.py` file in the `llama` folder of the TensorRT-LLM library plays a crucial role in converting a pre-trained LLaMA model from the Hugging Face format to the TensorRT-LLM format. It provides functions to load the weights from the Hugging Face model, apply quantization, and save the converted model as a TensorRT-LLM checkpoint. Let's analyze the key functions in this file and discuss how they fit into the TensorRT-LLM workflow.
`from_hugging_face` function:
This function creates a `LLaMAForCausalLM` object from the given parameters. It takes the model directory, data type, mapping, quantization configuration, and other optional parameters as input. It builds a configuration dictionary by calling the `create_config_from_hugging_face` function, which extracts the relevant configuration parameters from the Hugging Face model. It then creates a `PretrainedConfig` object from that dictionary, sets the rank based on the mapping, and creates an instance of the `LLaMAForCausalLM` class using the `from_config` method. If `skip_loading_weights` is not set, it loads the weights from the Hugging Face model using either the `load_from_hf_checkpoint` function (if `load_by_shard` is set) or the `load_weights_from_hf` function. Finally, it loads the weights into the `LLaMAForCausalLM` instance and returns it.
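The control flow described above can be sketched in plain Python. The function names mirror those mentioned in the text, but the bodies here are stand-in stubs so the sketch runs standalone; they are not the real TensorRT-LLM implementations.

```python
# Hedged sketch of the from_hugging_face control flow. All bodies are
# illustrative stubs, not the actual library code.

def create_config_from_hugging_face(model_dir, dtype, mapping, **kwargs):
    # Stand-in: the real function reads the HF config and maps its fields
    # (hidden size, layer count, ...) into a TensorRT-LLM config dict.
    return {"architecture": "LlamaForCausalLM", "dtype": dtype,
            "mapping": mapping, "model_dir": model_dir}

def load_weights_from_hf(config, mapping, hf_model):
    # Stand-in: the real function converts HF tensors to TensorRT-LLM layout.
    return {"transformer.layers.0.attention.qkv.weight": "..."}

class LLaMAForCausalLM:
    @classmethod
    def from_config(cls, config):
        obj = cls()
        obj.config = config
        obj.weights = None
        return obj

    def load(self, weights):
        self.weights = weights

def from_hugging_face(model_dir, dtype="float16", mapping=None,
                      hf_model=None, skip_loading_weights=False):
    config = create_config_from_hugging_face(model_dir, dtype, mapping or {})
    model = LLaMAForCausalLM.from_config(config)
    if not skip_loading_weights:
        # The real code picks load_from_hf_checkpoint when load_by_shard is set.
        weights = load_weights_from_hf(config, mapping or {}, hf_model)
        model.load(weights)
    return model

model = from_hugging_face("./llama-7b-hf")
```

The two-phase structure (build the model object from config, then attach weights) is what lets `skip_loading_weights` return a weight-free model cheaply.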
`quantize` function:
This function quantizes the model and saves it as a TensorRT-LLM checkpoint in the output directory. It builds a configuration dictionary by calling the `create_config_from_hugging_face` function with the provided quantization configuration and saves that configuration as a JSON file in the output directory. It loads the Hugging Face model using the `AutoModelForCausalLM` class. If smooth quantization is enabled, it calls the `smooth_quant` function to capture the activation ranges and compute the smoothing parameters. It then iterates over each rank in the mapping, loads the weights from the Hugging Face model using the `load_weights_from_hf` function (passing the captured activation ranges and smoothing parameters if applicable), and saves the loaded weights as a SafeTensors file in the output directory for each rank.
`load_weights_from_hf` function:
This function loads the weights from the Hugging Face model and converts them to the TensorRT-LLM format. It takes the configuration, mapping, Hugging Face model, and optional quantization-related parameters as input. It determines the quantization algorithm and sets the appropriate quantization type, then calls the `convert_hf_llama` function to convert the weights from the Hugging Face format to the TensorRT-LLM format, applying the quantization and parallelization settings from the configuration. It returns the converted weights.
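The "determines the quantization algorithm" step amounts to dispatching on the `quant_algo` string in the configuration. The mapping below is an illustrative assumption, not an exhaustive reproduction of the library's dispatch logic, though the algorithm names follow TensorRT-LLM's conventions.

```python
# Hedged sketch of quantization dispatch: a quant_algo string selects which
# conversion path convert_hf_llama will take. Illustrative, not exhaustive.

def select_quant_mode(quant_algo):
    if quant_algo is None:
        # No quantization: plain dtype conversion only.
        return {"use_weight_only": False, "use_smooth_quant": False}
    if quant_algo in ("W8A16", "W4A16"):
        # Weight-only quantization: int8/int4 weights, fp16 activations.
        return {"use_weight_only": True, "use_smooth_quant": False,
                "plugin_dtype": "int8" if quant_algo == "W8A16" else "int4"}
    if quant_algo.startswith("W8A8_SQ"):
        # SmoothQuant variants: int8 weights and int8 activations.
        return {"use_weight_only": False, "use_smooth_quant": True}
    raise ValueError(f"unsupported quant_algo: {quant_algo}")
```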
`convert_hf_llama` function:
This function converts the weights of the Hugging Face LLaMA model to the TensorRT-LLM format. It takes the Hugging Face model, mapping, vocabulary size, data type, and various quantization and parallelization settings as input. It iterates over each layer in the model and calls the `convert_layer` function to convert that layer's weights. It handles the embedding and language model head weights separately, applying parallelization and quantization as specified, and returns the converted weights as a dictionary.
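The parallelization applied per layer is tensor-parallel sharding: column-parallel layers (such as the QKV projection) split the weight along the output dimension, while row-parallel layers (such as the attention output projection) split along the input dimension. The sketch below uses plain nested lists instead of torch tensors so it stays self-contained; `split_matrix` is an illustrative stand-in for the library's split helpers.

```python
# Hedged sketch of tensor-parallel weight splitting as applied during
# conversion. `dim` chooses column-parallel (0) vs row-parallel (1) sharding.

def split_matrix(weight, tp_size, rank, dim):
    """Return this rank's shard of `weight`, split evenly along `dim`."""
    if dim == 0:
        # Column-parallel: each rank keeps a contiguous slice of output rows.
        chunk = len(weight) // tp_size
        return weight[rank * chunk:(rank + 1) * chunk]
    # Row-parallel: each rank keeps a contiguous slice of input columns.
    chunk = len(weight[0]) // tp_size
    return [row[rank * chunk:(rank + 1) * chunk] for row in weight]

w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]  # toy 2x4 weight matrix
```

With `tp_size=2`, splitting `w` on `dim=0` gives each rank one full row, while `dim=1` gives each rank the left or right half of every row; which dimension is used depends on whether the layer's matmul is followed by an all-gather or an all-reduce.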
In summary, the `convert.py` file provides the functions needed to convert a pre-trained LLaMA model from the Hugging Face format to the TensorRT-LLM format, handling the configuration extraction, weight loading, quantization, and parallelization aspects of the conversion. The converted model, along with its configuration and quantization settings, is then used by the TensorRT-LLM compiler to generate an optimized TensorRT engine for efficient inference. In this way, `convert.py` acts as a bridge between the pre-trained Hugging Face model and the TensorRT-LLM framework, enabling seamless integration and optimization of the LLaMA model for high-performance inference.