TensorRT-LLM Build Workflow
The conversion APIs in the TensorRT-LLM module provide a standardised way to convert weights from various formats, such as Hugging Face checkpoints, into the format expected by TensorRT-LLM models.
These APIs aim to simplify the process of loading pre-trained weights into TensorRT-LLM models, making it easier to use and adapt models from different sources.
The key components of the conversion APIs are:

- TopModelMixin class: a mixin class that declares the from_hugging_face() interface. It serves as a blueprint for model classes to implement the weight conversion functionality from Hugging Face checkpoints.
- from_hugging_face() method: a class method declared in the TopModelMixin class. It takes the Hugging Face model directory, data type, and other optional parameters as input and returns an instance of the TensorRT-LLM model with the loaded weights. The actual implementation of this method is left to the subclasses.
- LLaMAForCausalLM class: the class representing the LLaMA model for causal language modelling in TensorRT-LLM. It inherits from DecoderModelForCausalLM and includes TopModelMixin in its base class hierarchy. The class implements the from_hugging_face() method to convert weights from a Hugging Face checkpoint to the TensorRT-LLM expected format and load them into the model.
By using the conversion APIs, the process of converting and loading weights becomes more streamlined. For example, in the convert_checkpoint.py script, the code can be simplified to just creating an instance of LLaMAForCausalLM using the from_hugging_face() method and then saving the checkpoint using the save_checkpoint() method, as sketched below.
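A minimal sketch of that simplified flow follows; the checkpoint paths and keyword arguments are illustrative and may differ between TensorRT-LLM releases.

```python
# Minimal sketch of a simplified convert_checkpoint.py flow.
# Paths and keyword arguments are illustrative and may vary by release.
from tensorrt_llm.models import LLaMAForCausalLM

# Convert Hugging Face weights into an in-memory TensorRT-LLM model object.
llama = LLaMAForCausalLM.from_hugging_face(
    "./Llama-2-7b-hf",   # local Hugging Face model directory
    dtype="float16",
)

# Persist the converted checkpoint to disk only when it needs to be reused.
llama.save_checkpoint("./tllm_checkpoint")
```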
The from_hugging_face() method intentionally returns an in-memory object instead of saving the checkpoint to disk. This allows for flexibility and faster conversion and building of the model in a single process. The save_checkpoint() method can be called separately to save the model to disk when needed.
In addition to the from_hugging_face() method, the LLaMAForCausalLM class also provides a from_meta_ckpt() method specifically for loading weights from Meta checkpoints. This method is not part of the TopModelMixin class as it is specific to the LLaMA model.
The conversion APIs are designed to be extensible. In future releases, more factory methods like from_jax(), from_nemo(), or from_keras() can be added to support different training checkpoints. Model developers can choose to implement any subset of these factory methods for the models they contribute to TensorRT-LLM.
For unsupported formats, users still have the flexibility to implement their own weight conversion outside the core library. They can create a weights dictionary and load it into the model using the load() method, or directly assign the model parameters internally, as in the sketch below.
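A hedged sketch of that external conversion path follows; the parameter name and shape in the dictionary are placeholders rather than the exact TensorRT-LLM naming scheme.

```python
# Hedged sketch of a custom, out-of-library weight conversion.
# Parameter names and shapes below are placeholders, not the real scheme.
import torch

from tensorrt_llm.models import LLaMAForCausalLM


def load_custom_weights(model: LLaMAForCausalLM) -> None:
    """Load externally converted weights into an already-constructed model."""
    weights = {
        # One entry per model parameter, converted from the source format.
        "transformer.vocab_embedding.weight": torch.zeros(
            32000, 4096, dtype=torch.float16
        ),
        # ...
    }
    # load() assigns each tensor to the parameter with the matching name.
    model.load(weights)
```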
Overall, the conversion APIs in TensorRT-LLM provide a standardised and flexible way to convert and load weights from different sources into TensorRT-LLM models, making it easier to work with and adapt models in various scenarios.
Key Components
Model Conversion
TensorRT-LLM provides APIs for converting existing model checkpoints from various training frameworks to the TensorRT-LLM format.
- The TopModelMixin class defines a common interface for model conversion, with the from_hugging_face() method being the most commonly used.
- Model-specific conversion logic is implemented in the respective model classes, such as LLaMAForCausalLM, which inherits from TopModelMixin.
- The conversion APIs allow for flexibility in supporting different checkpoint formats, such as Hugging Face and Meta checkpoints for LLaMA models.
Best Practice: Use the provided conversion APIs whenever possible to ensure compatibility and maintainability. If a custom checkpoint format is used, implement the conversion logic within the TensorRT-LLM core library to keep the model definition and conversion code together.
Quantization
TensorRT-LLM supports various quantization techniques to reduce the model size and improve inference performance.
The NVIDIA AMMO toolkit is used for quantization algorithms like FP8, W4A16_AWQ, and W4A8_AWQ, while TensorRT-LLM also provides its own implementations for Smooth Quant, INT8 KV cache, and INT4/INT8 weight-only quantization.
- The quantize() method in the PretrainedModel class provides a unified interface for quantization, with the default implementation handling AMMO-supported quantization.
- Model-specific quantization logic can be implemented in the respective model classes, such as LLaMAForCausalLM, by overriding the quantize() method.

Best Practice: Use the quantize() method to perform quantization consistently across different models. When using the API in an MPI program, ensure that only rank 0 calls the quantize() method to avoid resource contention, as in the sketch below.
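A hedged sketch of the rank-0-only pattern follows; the import paths, QuantConfig fields, and the exact quantize() signature are assumptions and may differ between releases.

```python
# Hedged sketch of rank-0-only quantization in an MPI program.
# Import paths, QuantConfig fields and the quantize() signature are assumptions.
from mpi4py import MPI

from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # Only rank 0 runs calibration and writes the quantized checkpoint.
    LLaMAForCausalLM.quantize(
        "./Llama-2-7b-hf",                   # source Hugging Face checkpoint
        output_dir="./tllm_checkpoint_fp8",  # quantized TensorRT-LLM checkpoint
        quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    )
comm.Barrier()  # all ranks wait, then read the quantized checkpoint
```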
Engine Building
- The tensorrt_llm.build API is used to build the TensorRT-LLM model object into a TensorRT engine. This API simplifies the process of creating a builder, creating a network object, tracing the model to the network, and building the TensorRT engine.
- The BuildConfig class is used to specify the build configuration options, such as the maximum batch size. The built engine can be saved to disk for later use.

Best Practice: Use the tensorrt_llm.build API to build the engine consistently across different models. Experiment with different build configurations to find the optimal balance between performance and memory usage, as sketched below.
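A minimal, hedged sketch of the build call; BuildConfig accepts more options than shown here, and the field names are assumptions that may shift between releases.

```python
# Hedged sketch of building a TensorRT engine from a converted model.
# BuildConfig fields shown here are assumptions and may vary by release.
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM

llama = LLaMAForCausalLM.from_hugging_face("./Llama-2-7b-hf", dtype="float16")

build_config = BuildConfig(
    max_batch_size=8,    # upper bound on concurrent requests
    max_input_len=1024,  # longest prompt the engine will accept
)

# Trace the model into a TensorRT network and build the engine in one call.
engine = tensorrt_llm.build(llama, build_config)
engine.save("./llama_engine")  # serialise the engine for later deployment
```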
Tips and Tricks
Checkpoint Deserialization
TensorRT-LLM provides the from_checkpoint() method in the PretrainedModel class to deserialise a saved checkpoint into a model object. This allows for faster iteration and avoids the need to convert the checkpoint every time the model is built.

Tip: Save the converted model checkpoint to disk and use the from_checkpoint() method to load it when building the engine to speed up the development process.
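For example, under the same assumptions as the earlier sketches, a saved checkpoint can be reloaded and built without reconverting from the original format.

```python
# Hedged sketch: reload a previously saved checkpoint instead of reconverting.
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM

# Deserialise the checkpoint written earlier by save_checkpoint().
llama = LLaMAForCausalLM.from_checkpoint("./tllm_checkpoint")

engine = tensorrt_llm.build(llama, BuildConfig(max_batch_size=8))
engine.save("./llama_engine")
```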
Model-Specific Optimization
While TensorRT-LLM provides a general build workflow, there may be model-specific optimisations that can be applied.
Tip: Explore model-specific configuration options and experiment with different quantization techniques to find the optimal balance between accuracy and performance for your specific use case.
Monitoring and Profiling
Use profiling tools like NVIDIA Nsight Systems to analyse the performance of the built engine and identify potential bottlenecks.
Tip: Monitor the GPU utilisation, memory usage, and inference latency to ensure that the deployed model meets the performance requirements.
Versioning and Compatibility
TensorRT-LLM is actively developed, and new features and improvements are added regularly.
Tip: Always use the same TensorRT-LLM version specified in the requirements.txt file of the model example to avoid compatibility issues.
Scalability and Distribution
TensorRT-LLM supports multi-GPU and multi-node configurations for scalable deployment.
Tip: Use the provided multi-GPU and multi-node APIs and configurations to distribute the workload and improve the overall performance of the deployed model.
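As an illustration, a hedged sketch of converting one tensor-parallel shard follows; the Mapping fields shown are assumptions based on common TensorRT-LLM usage and may differ across versions.

```python
# Hedged sketch of converting one shard of a 2-way tensor-parallel model.
# Mapping field names are assumptions and may differ across versions.
from tensorrt_llm import Mapping
from tensorrt_llm.models import LLaMAForCausalLM

# One rank per GPU; each rank converts, saves, and builds its own shard.
mapping = Mapping(world_size=2, rank=0, tp_size=2, pp_size=1)

llama = LLaMAForCausalLM.from_hugging_face(
    "./Llama-2-7b-hf",
    dtype="float16",
    mapping=mapping,
)
llama.save_checkpoint("./tllm_checkpoint_tp2")  # writes this rank's shard
```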
It's crucial to understand the TensorRT-LLM build workflow and leverage the provided APIs and best practices to optimise the deployment of large language models.
Experiment with different configurations, quantization techniques, and model-specific optimisations to find the optimal balance between performance, memory usage, and accuracy for your specific use case.
Stay updated with the latest TensorRT-LLM releases and documentation to take advantage of new features and improvements.
Continuously monitor and profile the deployed model to ensure that it meets the performance requirements and scales well in production environments.
By following these best practices and tips, you can effectively use the TensorRT-LLM build workflow to deploy large language models efficiently and achieve optimal performance in various applications and scenarios.