llama/convert.py
The `convert.py` file in the `llama` folder of the TensorRT-LLM library plays a crucial role in converting a pre-trained LLaMA model from the Hugging Face format to the TensorRT-LLM format. It provides functions to load the weights from the Hugging Face model, apply quantization, and save the converted model as a TensorRT-LLM checkpoint. Let's analyze the key functions in this file and discuss how they fit into the TensorRT-LLM workflow.
`from_hugging_face` function:
This function creates a `LLaMAForCausalLM` object from the given parameters. It takes the model directory, data type, mapping, quantization configuration, and other optional parameters as input. It builds a configuration dictionary by calling the `create_config_from_hugging_face` function, which extracts the relevant configuration parameters from the Hugging Face model. It then creates a `PretrainedConfig` object from that dictionary, sets the rank based on the mapping, and creates an instance of the `LLaMAForCausalLM` class using the `from_config` method. If `skip_loading_weights` is not set, it loads the weights from the Hugging Face model using either the `load_from_hf_checkpoint` function (if `load_by_shard` is set) or the `load_weights_from_hf` function. Finally, it loads the weights into the `LLaMAForCausalLM` instance and returns it.
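The control flow described above can be sketched in plain Python. The function names mirror those mentioned in the text, but the bodies here are stand-in stubs so the sketch runs standalone; they are not the real TensorRT-LLM implementations.

```python
# Hedged sketch of the from_hugging_face control flow. All bodies are
# illustrative stubs, not the actual library code.

def create_config_from_hugging_face(model_dir, dtype, mapping, **kwargs):
    # Stand-in: the real function reads the HF config and maps its fields
    # (hidden size, layer count, ...) into a TensorRT-LLM config dict.
    return {"architecture": "LlamaForCausalLM", "dtype": dtype,
            "mapping": mapping, "model_dir": model_dir}

def load_weights_from_hf(config, mapping, hf_model):
    # Stand-in: the real function converts HF tensors to TensorRT-LLM layout.
    return {"transformer.layers.0.attention.qkv.weight": "..."}

class LLaMAForCausalLM:
    @classmethod
    def from_config(cls, config):
        obj = cls()
        obj.config = config
        obj.weights = None
        return obj

    def load(self, weights):
        self.weights = weights

def from_hugging_face(model_dir, dtype="float16", mapping=None,
                      hf_model=None, skip_loading_weights=False):
    config = create_config_from_hugging_face(model_dir, dtype, mapping or {})
    model = LLaMAForCausalLM.from_config(config)
    if not skip_loading_weights:
        # The real code picks load_from_hf_checkpoint when load_by_shard is set.
        weights = load_weights_from_hf(config, mapping or {}, hf_model)
        model.load(weights)
    return model

model = from_hugging_face("./llama-7b-hf")
```

The two-phase structure (build the model object from config, then attach weights) is what lets `skip_loading_weights` return a weight-free model cheaply.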
`quantize` function:
This function quantizes the model and saves it as a TensorRT-LLM checkpoint in the output directory. It builds a configuration dictionary by calling the `create_config_from_hugging_face` function with the provided quantization configuration and saves that configuration as a JSON file in the output directory. It loads the Hugging Face model using the `AutoModelForCausalLM` class. If smooth quantization is enabled, it calls the `smooth_quant` function to capture the activation ranges and compute the smoothing parameters. It then iterates over each rank in the mapping, loads the weights from the Hugging Face model using the `load_weights_from_hf` function (passing the captured activation ranges and smoothing parameters if applicable), and saves the loaded weights as a SafeTensors file in the output directory for each rank.
`load_weights_from_hf` function:
This function loads the weights from the Hugging Face model and converts them to the TensorRT-LLM format. It takes the configuration, mapping, Hugging Face model, and optional quantization-related parameters as input. It determines the quantization algorithm and sets the appropriate quantization type, then calls the `convert_hf_llama` function to convert the weights from the Hugging Face format to the TensorRT-LLM format, applying the quantization and parallelization settings from the configuration. It returns the converted weights.
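The "determines the quantization algorithm" step amounts to dispatching on the `quant_algo` string in the configuration. The mapping below is an illustrative assumption, not an exhaustive reproduction of the library's dispatch logic, though the algorithm names follow TensorRT-LLM's conventions.

```python
# Hedged sketch of quantization dispatch: a quant_algo string selects which
# conversion path convert_hf_llama will take. Illustrative, not exhaustive.

def select_quant_mode(quant_algo):
    if quant_algo is None:
        # No quantization: plain dtype conversion only.
        return {"use_weight_only": False, "use_smooth_quant": False}
    if quant_algo in ("W8A16", "W4A16"):
        # Weight-only quantization: int8/int4 weights, fp16 activations.
        return {"use_weight_only": True, "use_smooth_quant": False,
                "plugin_dtype": "int8" if quant_algo == "W8A16" else "int4"}
    if quant_algo.startswith("W8A8_SQ"):
        # SmoothQuant variants: int8 weights and int8 activations.
        return {"use_weight_only": False, "use_smooth_quant": True}
    raise ValueError(f"unsupported quant_algo: {quant_algo}")
```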
`convert_hf_llama` function:
This function converts the weights of the Hugging Face LLaMA model to the TensorRT-LLM format. It takes the Hugging Face model, mapping, vocabulary size, data type, and various quantization and parallelization settings as input. It iterates over each layer in the model and calls the `convert_layer` function to convert that layer's weights. It handles the embedding and language model head weights separately, applying parallelization and quantization as specified, and returns the converted weights as a dictionary.
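The parallelization applied per layer is tensor-parallel sharding: column-parallel layers (such as the QKV projection) split the weight along the output dimension, while row-parallel layers (such as the attention output projection) split along the input dimension. The sketch below uses plain nested lists instead of torch tensors so it stays self-contained; `split_matrix` is an illustrative stand-in for the library's split helpers.

```python
# Hedged sketch of tensor-parallel weight splitting as applied during
# conversion. `dim` chooses column-parallel (0) vs row-parallel (1) sharding.

def split_matrix(weight, tp_size, rank, dim):
    """Return this rank's shard of `weight`, split evenly along `dim`."""
    if dim == 0:
        # Column-parallel: each rank keeps a contiguous slice of output rows.
        chunk = len(weight) // tp_size
        return weight[rank * chunk:(rank + 1) * chunk]
    # Row-parallel: each rank keeps a contiguous slice of input columns.
    chunk = len(weight[0]) // tp_size
    return [row[rank * chunk:(rank + 1) * chunk] for row in weight]

w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]  # toy 2x4 weight matrix
```

With `tp_size=2`, splitting `w` on `dim=0` gives each rank one full row, while `dim=1` gives each rank the left or right half of every row; which dimension is used depends on whether the layer's matmul is followed by an all-gather or an all-reduce.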
In summary, the `convert.py` file provides the functions needed to convert a pre-trained LLaMA model from the Hugging Face format to the TensorRT-LLM format, handling the configuration extraction, weight loading, quantization, and parallelization aspects of the conversion. The converted model, along with its configuration and quantization settings, is then used by the TensorRT-LLM compiler to generate an optimized TensorRT engine for efficient inference. In this way, `convert.py` acts as a bridge between the pre-trained Hugging Face model and the TensorRT-LLM framework, enabling seamless integration and optimization of the LLaMA model for high-performance inference.