# TensorRT-LLM build workflow

The <mark style="color:blue;">**conversion APIs**</mark> in the TensorRT-LLM module provide a standardised way to convert weights from various formats, such as Hugging Face checkpoints, into the format expected by TensorRT-LLM models.&#x20;

These APIs aim to simplify the process of loading pre-trained weights into TensorRT-LLM models, making it easier to use and adapt models from different sources.

### <mark style="color:blue;">Key components of the conversion APIs</mark>

<mark style="color:yellow;">**`TopModelMixin`**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**class**</mark>

This is a mixin class that declares the <mark style="color:yellow;">**`from_hugging_face()`**</mark> interface. It serves as a blueprint for model classes to implement the weight conversion functionality from Hugging Face checkpoints.

<details>

<summary><mark style="color:green;"><strong>What is a mixin class?</strong></mark></summary>

A mixin class is a type of class that's designed to provide additional functionality to other classes through inheritance.&#x20;

Mixins allow for the sharing of functionality across various classes without needing to express a parent-child relationship in the class hierarchy.&#x20;

This is particularly useful in Python: although Python supports multiple inheritance, mixins provide a controlled, conventional way to reap its benefits without building deep class hierarchies.

#### <mark style="color:green;">Key Characteristics of Mixin Classes</mark>

<mark style="color:purple;">**Provide Methods, Not State:**</mark> Mixin classes typically offer methods and functionality but do not maintain state. They don't define attributes that keep data specific to a particular instance.

<mark style="color:purple;">**Intended for Inheritance:**</mark> Mixins are meant to be inherited and used by other classes, not instantiated on their own. They provide additional functionality to classes in a modular way.

<mark style="color:purple;">**No Independent Functionality**</mark><mark style="color:purple;">:</mark> A mixin is not useful on its own; it only adds behaviour to the classes it is mixed into.

<mark style="color:purple;">**Reusability**</mark><mark style="color:purple;">:</mark> They are designed to be reused across multiple classes. This makes them an excellent tool for sharing functionality without repeating code.
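A minimal, self-contained illustration of the pattern (plain Python, independent of TensorRT-LLM; the class names are made up for this example):

```python
import json

class JsonMixin:
    """Provides behaviour, not state: no __init__, no instance attributes."""

    def to_json(self):
        # Serialise whatever attributes the host class happens to define.
        return json.dumps(self.__dict__, sort_keys=True)

class BuildOptions(JsonMixin):
    """A hypothetical host class that gains to_json() by mixing in JsonMixin."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size

print(BuildOptions(8).to_json())  # {"max_batch_size": 8}
```

Note that `JsonMixin` is never instantiated directly; it only contributes a method to any class that inherits from it.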

</details>

<mark style="color:yellow;">**`from_hugging_face()`**</mark> method

This is a class method declared in the <mark style="color:yellow;">**`TopModelMixin`**</mark> class.&#x20;

It takes the Hugging Face model directory, data type, and other optional parameters as input and returns an instance of the TensorRT-LLM model with the loaded weights. The actual implementation of this method is left to the subclasses.
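Schematically, the declaration looks something like this (a sketch, not the actual TensorRT-LLM source; the parameter names are illustrative):

```python
class TopModelMixin:
    """Declares conversion entry points that concrete model classes may implement."""

    @classmethod
    def from_hugging_face(cls, hf_model_dir, dtype="float16", **kwargs):
        # Subclasses such as LLaMAForCausalLM override this to convert the
        # Hugging Face weights and return a ready-to-use model instance.
        raise NotImplementedError(
            f"{cls.__name__} does not implement from_hugging_face()")
```

A model class that does not override the method simply raises, which makes the unsupported conversion path explicit.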

<mark style="color:yellow;">**`LLaMAForCausalLM`**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**class**</mark>

This class represents the LLaMA model for causal language modelling in TensorRT-LLM.&#x20;

It inherits from <mark style="color:yellow;">**`DecoderModelForCausalLM`**</mark> and includes <mark style="color:yellow;">**`TopModelMixin`**</mark> in its base class hierarchy.&#x20;

The class implements the <mark style="color:yellow;">**`from_hugging_face()`**</mark> method to convert weights from a Hugging Face checkpoint to the TensorRT-LLM expected format and load them into the model.

By using the conversion APIs, the process of converting and loading weights becomes more streamlined.&#x20;

For example, in the <mark style="color:yellow;">**`convert_checkpoint.py`**</mark> script, the code can be simplified to just creating an instance of <mark style="color:yellow;">**`LLaMAForCausalLM`**</mark> using the <mark style="color:yellow;">**`from_hugging_face()`**</mark> method and then saving the checkpoint using the <mark style="color:yellow;">**`save_checkpoint()`**</mark> method.

The <mark style="color:yellow;">**`from_hugging_face()`**</mark> method intentionally returns an in-memory object instead of saving the checkpoint to disk.&#x20;

This allows for faster conversion and building of the model within a single process. The <mark style="color:yellow;">**`save_checkpoint()`**</mark> method can be called separately to persist the model to disk when needed.
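Putting the two calls together, a simplified conversion flow looks like the following. The stand-in class below mimics the interface named in the text so the sketch is self-contained and runnable; in real code you would import `LLaMAForCausalLM` from TensorRT-LLM instead.

```python
import json
import os
import tempfile

class LLaMAForCausalLM:
    """Stand-in mirroring the two methods used by convert_checkpoint.py."""

    def __init__(self, config):
        self.config = config

    @classmethod
    def from_hugging_face(cls, hf_model_dir, dtype="float16", **kwargs):
        # The real method converts HF weights in memory and returns the model;
        # nothing is written to disk at this point.
        return cls({"source": hf_model_dir, "dtype": dtype})

    def save_checkpoint(self, output_dir):
        # Persisting to disk is a separate, optional step.
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "config.json"), "w") as f:
            json.dump(self.config, f)

output_dir = tempfile.mkdtemp()
model = LLaMAForCausalLM.from_hugging_face("path/to/llama-hf", dtype="float16")
model.save_checkpoint(output_dir)
print(os.listdir(output_dir))  # ['config.json']
```

The separation of the two calls is the point: conversion produces an in-memory object, and serialisation is opt-in.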

In addition to the <mark style="color:yellow;">**`from_hugging_face()`**</mark> method, the <mark style="color:yellow;">**`LLaMAForCausalLM`**</mark> class also provides a <mark style="color:yellow;">**`from_meta_ckpt()`**</mark> method specifically for loading weights from Meta checkpoints. This method is not part of the <mark style="color:yellow;">**`TopModelMixin`**</mark> class as it is specific to the <mark style="color:yellow;">**LLaMA**</mark> model.

The conversion APIs are designed to be extensible. In future releases, more factory methods like <mark style="color:yellow;">**`from_jax()`**</mark><mark style="color:yellow;">**,**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`from_nemo()`**</mark><mark style="color:yellow;">**, or**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`from_keras()`**</mark> can be added to support different training checkpoints. Model developers can choose to implement any subset of these factory methods for the models they contribute to TensorRT-LLM.

For unsupported formats, users still have the flexibility to implement their own weight conversion outside the core library. They can create a weights dictionary and load it into the model using the <mark style="color:yellow;">**`load()`**</mark> method or directly assign the model parameters internally.
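The fallback path can be sketched like this: build a plain weights dictionary yourself and hand it to the model. The tiny stand-in model and the parameter names below are illustrative only, not the real TensorRT-LLM parameter layout.

```python
class TinyModel:
    """Stand-in with named parameters, mimicking the load() pattern."""

    def __init__(self):
        # Parameter names here are made up for illustration.
        self.params = {"lm_head.weight": None, "embedding.weight": None}

    def load(self, weights):
        # Assign each externally converted tensor to its named parameter,
        # rejecting names the model does not know about.
        for name, value in weights.items():
            if name not in self.params:
                raise KeyError(f"unexpected weight: {name}")
            self.params[name] = value

model = TinyModel()
model.load({
    "lm_head.weight": [[0.0] * 4] * 4,      # converted outside the library
    "embedding.weight": [[0.0] * 4] * 8,
})
print(all(v is not None for v in model.params.values()))  # True
```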

Overall, the conversion APIs in TensorRT-LLM provide a standardised and flexible way to convert and load weights from different sources into TensorRT-LLM models, making it easier to work with and adapt models in various scenarios.

### <mark style="color:blue;">Key Components</mark>

#### <mark style="color:green;">Model Conversion</mark>

* TensorRT-LLM provides APIs for converting existing model checkpoints from various training frameworks to the TensorRT-LLM format.
* The <mark style="color:yellow;">**`TopModelMixin`**</mark> class defines a common interface for model conversion, with the <mark style="color:yellow;">**`from_hugging_face()`**</mark> method being the most commonly used.
* Model-specific conversion logic is implemented in the respective model classes, such as <mark style="color:yellow;">**`LLaMAForCausalLM`**</mark>, which inherits from <mark style="color:yellow;">**`TopModelMixin`**</mark>.
* The conversion APIs allow for flexibility in supporting different checkpoint formats, such as Hugging Face and Meta checkpoints for LLaMA models.
* <mark style="color:purple;">**Best Practice:**</mark> Use the provided conversion APIs whenever possible to ensure compatibility and maintainability. If a custom checkpoint format is used, implement the conversion logic within the TensorRT-LLM core library to keep the model definition and conversion code together.

#### <mark style="color:green;">Quantization</mark>

* TensorRT-LLM supports various quantization techniques to reduce the model size and improve inference performance.
* The <mark style="color:blue;">**NVIDIA AMMO**</mark> toolkit is used for quantization algorithms like FP8, W4A16\_AWQ, and W4A8\_AWQ, while TensorRT-LLM also provides its own implementations for Smooth Quant, INT8 KV cache, and INT4/INT8 weight-only quantization.
* The <mark style="color:yellow;">**`quantize()`**</mark> method in the <mark style="color:yellow;">**`PretrainedModel`**</mark> class provides a unified interface for quantization, with the default implementation handling AMMO-supported quantization.
* Model-specific quantization logic can be implemented in the respective model classes, such as <mark style="color:yellow;">**`LLaMAForCausalLM`**</mark>, by overriding the <mark style="color:yellow;">**`quantize()`**</mark> method.
* <mark style="color:purple;">**Best Practice:**</mark> Use the <mark style="color:yellow;">**`quantize()`**</mark> method to perform quantization consistently across different models. When using the API in an MPI program, ensure that only rank 0 calls the <mark style="color:yellow;">**`quantize()`**</mark> method to avoid resource contention.
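The rank-0 guard mentioned in the best practice can be sketched as follows. `mpi_rank()` here reads a launcher environment variable and is an assumption; use whatever your MPI binding actually provides (for example `mpi4py`'s `COMM_WORLD.Get_rank()`).

```python
import os

def mpi_rank():
    # Open MPI exports OMPI_COMM_WORLD_RANK; other launchers differ.
    return int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))

def quantize_on_rank0(model_dir, output_dir):
    # Only rank 0 performs quantization to avoid resource contention;
    # the other ranks should wait (e.g. on a barrier) for the result.
    if mpi_rank() == 0:
        # In real code: LLaMAForCausalLM.quantize(model_dir, output_dir, ...)
        return f"quantized {model_dir} -> {output_dir}"
    return "waiting for rank 0"

print(quantize_on_rank0("llama-7b-hf", "llama-7b-fp8"))
```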

#### <mark style="color:green;">Engine Building</mark>

* The <mark style="color:yellow;">**`tensorrt_llm.build`**</mark> API is used to build the TensorRT-LLM model object into a TensorRT engine.
* This API simplifies the process of creating a builder, creating a network object, tracing the model to the network, and building the TensorRT engine.
* The <mark style="color:yellow;">**`BuildConfig`**</mark> class is used to specify the build configuration options, such as the maximum batch size.
* The built engine can be saved to disk for later use.
* <mark style="color:purple;">**Best Practice:**</mark> Use the <mark style="color:yellow;">**`tensorrt_llm.build`**</mark> API to build the engine consistently across different models. Experiment with different build configurations to find the optimal balance between performance and memory usage.
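The build step can be sketched like this. The stand-ins mirror the names used above (`tensorrt_llm.build`, `BuildConfig`) so the example runs on its own; real field names and signatures vary by release, so treat this as a shape, not a reference.

```python
from dataclasses import dataclass

@dataclass
class BuildConfig:
    """Stand-in for tensorrt_llm.BuildConfig with a couple of common knobs."""
    max_batch_size: int = 1
    max_input_len: int = 1024

def build(model, build_config):
    # The real tensorrt_llm.build creates a builder, traces the model into a
    # TensorRT network, and compiles an engine; here we just echo the plan.
    return {
        "model": model,
        "max_batch_size": build_config.max_batch_size,
        "max_input_len": build_config.max_input_len,
    }

engine = build("llama-checkpoint", BuildConfig(max_batch_size=8))
print(engine["max_batch_size"])  # 8
```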

### <mark style="color:blue;">Tips and Tricks</mark>

#### <mark style="color:green;">Checkpoint Deserialization</mark>

* TensorRT-LLM provides the <mark style="color:yellow;">**`from_checkpoint()`**</mark> method in the <mark style="color:yellow;">**`PretrainedModel`**</mark> class to deserialise a saved checkpoint into a model object.
* This allows for faster iteration and avoids the need to convert the checkpoint every time the model is built.
* Tip: Save the converted model checkpoint to disk and use the <mark style="color:yellow;">**`from_checkpoint()`**</mark> method to load it when building the engine to speed up the development process.
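The tip above amounts to a simple caching pattern: convert once, then reuse the serialised checkpoint on subsequent builds. A sketch, with `convert` and `deserialise` as hypothetical stand-ins for `from_hugging_face()` and `from_checkpoint()`:

```python
import os

def get_model(ckpt_dir, hf_dir, convert, deserialise):
    """Load from the converted checkpoint when present; convert otherwise."""
    if os.path.isdir(ckpt_dir):
        return deserialise(ckpt_dir)          # fast path: from_checkpoint()
    model = convert(hf_dir)                   # slow path: from_hugging_face()
    # In real code you would call model.save_checkpoint(ckpt_dir) here so the
    # next build takes the fast path.
    return model

model = get_model("missing-dir", "llama-hf",
                  convert=lambda d: f"converted from {d}",
                  deserialise=lambda d: f"loaded from {d}")
print(model)  # converted from llama-hf
```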

#### <mark style="color:green;">Model-Specific Optimization</mark>

* While TensorRT-LLM provides a general build workflow, there may be model-specific optimisations that can be applied.
* Tip: Explore model-specific configuration options and experiment with different quantization techniques to find the optimal balance between accuracy and performance for your specific use case.

#### <mark style="color:green;">Monitoring and Profiling</mark>

* Use profiling tools like <mark style="color:blue;">**NVIDIA Nsight Systems**</mark> to analyse the performance of the built engine and identify potential bottlenecks.
* Tip: Monitor the GPU utilisation, memory usage, and inference latency to ensure that the deployed model meets the performance requirements.

#### <mark style="color:green;">Versioning and Compatibility</mark>

* TensorRT-LLM is actively developed, and new features and improvements are added regularly.
* Tip: Always use the same TensorRT-LLM version specified in the <mark style="color:yellow;">**`requirements.txt`**</mark> file of the model example to avoid compatibility issues.

#### <mark style="color:green;">Scalability and Distribution</mark>

* TensorRT-LLM supports multi-GPU and multi-node configurations for scalable deployment.
* Tip: Use the provided multi-GPU and multi-node APIs and configurations to distribute the workload and improve the overall performance of the deployed model.

It's crucial to understand the TensorRT-LLM build workflow and leverage the provided APIs and best practices to optimise the deployment of large language models.&#x20;

Experiment with different configurations, quantization techniques, and model-specific optimisations to find the optimal balance between performance, memory usage, and accuracy for your specific use case.

Stay updated with the latest TensorRT-LLM releases and documentation to take advantage of new features and improvements.&#x20;

Continuously monitor and profile the deployed model to ensure that it meets the performance requirements and scales well in production environments.

By following these best practices and tips, you can effectively use the TensorRT-LLM build workflow to deploy large language models efficiently and achieve optimal performance in various applications and scenarios.
