TensorRT-LLM Build Workflow
The conversion APIs in the TensorRT-LLM module provide a standardised way to convert weights from various formats, such as Hugging Face checkpoints, into the format expected by TensorRT-LLM models.
These APIs aim to simplify the process of loading pre-trained weights into TensorRT-LLM models, making it easier to use and adapt models from different sources.
The key components of the conversion APIs are:

- TopModelMixin class: a mixin class that declares the from_hugging_face() interface. It serves as a blueprint for model classes to implement the weight conversion functionality from Hugging Face checkpoints.
- from_hugging_face() method: a class method declared in the TopModelMixin class. It takes the Hugging Face model directory, data type, and other optional parameters as input and returns an instance of the TensorRT-LLM model with the loaded weights. The actual implementation of this method is left to the subclasses.
- LLaMAForCausalLM class: the class representing the LLaMA model for causal language modelling in TensorRT-LLM. It inherits from DecoderModelForCausalLM and includes TopModelMixin in its base class hierarchy. The class implements the from_hugging_face() method to convert weights from a Hugging Face checkpoint to the TensorRT-LLM expected format and load them into the model.
By using the conversion APIs, the process of converting and loading weights becomes more streamlined. For example, in the convert_checkpoint.py script, the code can be simplified to just creating an instance of LLaMAForCausalLM using the from_hugging_face() method and then saving the checkpoint using the save_checkpoint() method, as sketched below.
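A minimal sketch of that simplified flow follows; the checkpoint paths and keyword arguments are illustrative and may differ between TensorRT-LLM releases.

```python
# Minimal sketch of a simplified convert_checkpoint.py flow.
# Paths and keyword arguments are illustrative and may vary by release.
from tensorrt_llm.models import LLaMAForCausalLM

# Convert Hugging Face weights into an in-memory TensorRT-LLM model object.
llama = LLaMAForCausalLM.from_hugging_face(
    "./Llama-2-7b-hf",   # local Hugging Face model directory
    dtype="float16",
)

# Persist the converted checkpoint to disk only when it needs to be reused.
llama.save_checkpoint("./tllm_checkpoint")
```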
The from_hugging_face() method intentionally returns an in-memory object instead of saving the checkpoint to disk. This allows for flexibility and faster conversion and building of the model in a single process. The save_checkpoint() method can be called separately to save the model to disk when needed.
In addition to the from_hugging_face() method, the LLaMAForCausalLM class also provides a from_meta_ckpt() method specifically for loading weights from Meta checkpoints. This method is not part of the TopModelMixin class as it is specific to the LLaMA model.
The conversion APIs are designed to be extensible. In future releases, more factory methods like from_jax(), from_nemo(), or from_keras() can be added to support different training checkpoints. Model developers can choose to implement any subset of these factory methods for the models they contribute to TensorRT-LLM.
For unsupported formats, users still have the flexibility to implement their own weight conversion outside the core library. They can create a weights dictionary and load it into the model using the load() method, or directly assign the model parameters internally, as in the sketch below.
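A hedged sketch of that external conversion path follows; the parameter name and shape in the dictionary are placeholders rather than the exact TensorRT-LLM naming scheme.

```python
# Hedged sketch of a custom, out-of-library weight conversion.
# Parameter names and shapes below are placeholders, not the real scheme.
import torch

from tensorrt_llm.models import LLaMAForCausalLM


def load_custom_weights(model: LLaMAForCausalLM) -> None:
    """Load externally converted weights into an already-constructed model."""
    weights = {
        # One entry per model parameter, converted from the source format.
        "transformer.vocab_embedding.weight": torch.zeros(
            32000, 4096, dtype=torch.float16
        ),
        # ...
    }
    # load() assigns each tensor to the parameter with the matching name.
    model.load(weights)
```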
Overall, the conversion APIs in TensorRT-LLM provide a standardised and flexible way to convert and load weights from different sources into TensorRT-LLM models, making it easier to work with and adapt models in various scenarios.
Key Components
Model Conversion
TensorRT-LLM provides APIs for converting existing model checkpoints from various training frameworks to the TensorRT-LLM format.
- The TopModelMixin class defines a common interface for model conversion, with the from_hugging_face() method being the most commonly used.
- Model-specific conversion logic is implemented in the respective model classes, such as LLaMAForCausalLM, which inherits from TopModelMixin.
- The conversion APIs allow for flexibility in supporting different checkpoint formats, such as Hugging Face and Meta checkpoints for LLaMA models.
Best Practice: Use the provided conversion APIs whenever possible to ensure compatibility and maintainability. If a custom checkpoint format is used, implement the conversion logic within the TensorRT-LLM core library to keep the model definition and conversion code together.
Quantization
TensorRT-LLM supports various quantization techniques to reduce the model size and improve inference performance.
The NVIDIA AMMO toolkit is used for quantization algorithms like FP8, W4A16_AWQ, and W4A8_AWQ, while TensorRT-LLM also provides its own implementations for Smooth Quant, INT8 KV cache, and INT4/INT8 weight-only quantization.
- The quantize() method in the PretrainedModel class provides a unified interface for quantization, with the default implementation handling AMMO-supported quantization.
- Model-specific quantization logic can be implemented in the respective model classes, such as LLaMAForCausalLM, by overriding the quantize() method.

Best Practice: Use the quantize() method to perform quantization consistently across different models. When using the API in an MPI program, ensure that only rank 0 calls the quantize() method to avoid resource contention, as in the sketch below.
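A hedged sketch of the rank-0-only pattern follows; the import paths, QuantConfig fields, and the exact quantize() signature are assumptions and may differ between releases.

```python
# Hedged sketch of rank-0-only quantization in an MPI program.
# Import paths, QuantConfig fields and the quantize() signature are assumptions.
from mpi4py import MPI

from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # Only rank 0 runs calibration and writes the quantized checkpoint.
    LLaMAForCausalLM.quantize(
        "./Llama-2-7b-hf",                   # source Hugging Face checkpoint
        output_dir="./tllm_checkpoint_fp8",  # quantized TensorRT-LLM checkpoint
        quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    )
comm.Barrier()  # all ranks wait, then read the quantized checkpoint
```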
Engine Building
- The tensorrt_llm.build API is used to build the TensorRT-LLM model object into a TensorRT engine. This API simplifies the process of creating a builder, creating a network object, tracing the model to the network, and building the TensorRT engine.
- The BuildConfig class is used to specify the build configuration options, such as the maximum batch size. The built engine can be saved to disk for later use.

Best Practice: Use the tensorrt_llm.build API to build the engine consistently across different models. Experiment with different build configurations to find the optimal balance between performance and memory usage, as sketched below.
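A minimal, hedged sketch of the build call; BuildConfig accepts more options than shown here, and the field names are assumptions that may shift between releases.

```python
# Hedged sketch of building a TensorRT engine from a converted model.
# BuildConfig fields shown here are assumptions and may vary by release.
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM

llama = LLaMAForCausalLM.from_hugging_face("./Llama-2-7b-hf", dtype="float16")

build_config = BuildConfig(
    max_batch_size=8,    # upper bound on concurrent requests
    max_input_len=1024,  # longest prompt the engine will accept
)

# Trace the model into a TensorRT network and build the engine in one call.
engine = tensorrt_llm.build(llama, build_config)
engine.save("./llama_engine")  # serialise the engine for later deployment
```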
Tips and Tricks
Checkpoint Deserialization
TensorRT-LLM provides the from_checkpoint() method in the PretrainedModel class to deserialise a saved checkpoint into a model object. This allows for faster iteration and avoids the need to convert the checkpoint every time the model is built.

Tip: Save the converted model checkpoint to disk and use the from_checkpoint() method to load it when building the engine to speed up the development process.
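For example, under the same assumptions as the earlier sketches, a saved checkpoint can be reloaded and built without reconverting from the original format.

```python
# Hedged sketch: reload a previously saved checkpoint instead of reconverting.
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM

# Deserialise the checkpoint written earlier by save_checkpoint().
llama = LLaMAForCausalLM.from_checkpoint("./tllm_checkpoint")

engine = tensorrt_llm.build(llama, BuildConfig(max_batch_size=8))
engine.save("./llama_engine")
```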
Model-Specific Optimization
While TensorRT-LLM provides a general build workflow, there may be model-specific optimisations that can be applied.
Tip: Explore model-specific configuration options and experiment with different quantization techniques to find the optimal balance between accuracy and performance for your specific use case.
Monitoring and Profiling
Use profiling tools like NVIDIA Nsight Systems to analyse the performance of the built engine and identify potential bottlenecks.
Tip: Monitor the GPU utilisation, memory usage, and inference latency to ensure that the deployed model meets the performance requirements.
Versioning and Compatibility
TensorRT-LLM is actively developed, and new features and improvements are added regularly.
Tip: Always use the same TensorRT-LLM version specified in the requirements.txt file of the model example to avoid compatibility issues.
Scalability and Distribution
TensorRT-LLM supports multi-GPU and multi-node configurations for scalable deployment.
Tip: Use the provided multi-GPU and multi-node APIs and configurations to distribute the workload and improve the overall performance of the deployed model.
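As an illustration, a hedged sketch of converting one tensor-parallel shard follows; the Mapping fields shown are assumptions based on common TensorRT-LLM usage and may differ across versions.

```python
# Hedged sketch of converting one shard of a 2-way tensor-parallel model.
# Mapping field names are assumptions and may differ across versions.
from tensorrt_llm import Mapping
from tensorrt_llm.models import LLaMAForCausalLM

# One rank per GPU; each rank converts, saves, and builds its own shard.
mapping = Mapping(world_size=2, rank=0, tp_size=2, pp_size=1)

llama = LLaMAForCausalLM.from_hugging_face(
    "./Llama-2-7b-hf",
    dtype="float16",
    mapping=mapping,
)
llama.save_checkpoint("./tllm_checkpoint_tp2")  # writes this rank's shard
```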
It's crucial to understand the TensorRT-LLM build workflow and leverage the provided APIs and best practices to optimise the deployment of large language models.
Experiment with different configurations, quantization techniques, and model-specific optimisations to find the optimal balance between performance, memory usage, and accuracy for your specific use case.
Stay updated with the latest TensorRT-LLM releases and documentation to take advantage of new features and improvements.
Continuously monitor and profile the deployed model to ensure that it meets the performance requirements and scales well in production environments.
By following these best practices and tips, you can effectively use the TensorRT-LLM build workflow to deploy large language models efficiently and achieve optimal performance in various applications and scenarios.