tensorrt_llm/builder.py

The `Builder` class

The Builder class in TensorRT-LLM is responsible for building the TensorRT engine from a given network definition.

It provides methods to create and configure the network, set builder configurations, and build the optimised engine for inference.

Here are the key components and functionalities of the Builder class:

Initialization

The Builder class is initialized with a trt.Builder object from the TensorRT library.
It has a strongly_typed attribute that determines whether the network should be created with strongly typed tensors. By default, it is set to False, but it can be enabled for TensorRT versions that support it.

Creating a Network

The create_network() method creates an empty Network object using the TensorRT builder.
It sets the appropriate flags for the network creation, such as EXPLICIT_BATCH and STRONGLY_TYPED, based on the TensorRT version and the strongly_typed attribute.
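
As a rough illustration of how such creation flags combine, the sketch below builds the bitmask from a stand-in enum. `CreationFlag`, its numeric values, and `network_creation_flags` are hypothetical names for this example, not the library's code; in TensorRT the corresponding enum is `trt.NetworkDefinitionCreationFlag`.

```python
from enum import IntEnum


class CreationFlag(IntEnum):
    """Stand-in for trt.NetworkDefinitionCreationFlag (values are illustrative)."""
    EXPLICIT_BATCH = 0
    STRONGLY_TYPED = 1


def network_creation_flags(strongly_typed: bool, needs_explicit_batch: bool) -> int:
    """Combine creation flags into the bitmask a network constructor would take."""
    flags = 0
    if needs_explicit_batch:
        # Assumption: only some TensorRT versions need this flag set explicitly.
        flags |= 1 << int(CreationFlag.EXPLICIT_BATCH)
    if strongly_typed:
        flags |= 1 << int(CreationFlag.STRONGLY_TYPED)
    return flags
```

Each flag occupies one bit, so the two options can be switched on independently of each other.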

Creating a Builder Configuration

The create_builder_config() method creates a BuilderConfig object with specified configurations.
It takes parameters such as precision, timing cache, tensor-parallel size, use_refit, int8, strongly_typed, optimization level, profiling verbosity, and other optional arguments.
The method sets the appropriate flags and configurations in the TensorRT builder config based on the provided parameters.
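
The mapping from parameters to builder flags might look roughly like the sketch below. It is a simplified model under stated assumptions: `BuildFlag` is a stand-in for `trt.BuilderFlag`, `select_flags` is a hypothetical helper, and the rule that precision flags are skipped for strongly typed networks is an assumption based on tensor types carrying the precision in that mode.

```python
from enum import Flag, auto


class BuildFlag(Flag):
    """Stand-in for trt.BuilderFlag; only a few members are modeled."""
    NONE = 0
    FP16 = auto()
    BF16 = auto()
    INT8 = auto()
    REFIT = auto()
    SPARSE_WEIGHTS = auto()


def select_flags(precision: str,
                 int8: bool = False,
                 use_refit: bool = False,
                 weight_sparsity: bool = False,
                 strongly_typed: bool = False) -> BuildFlag:
    """Translate high-level options into a set of builder flags."""
    if precision not in ("float32", "float16", "bfloat16"):
        raise ValueError(f"unsupported precision: {precision!r}")
    flags = BuildFlag.NONE
    if not strongly_typed:
        # Assumption: precision flags only matter for weakly typed networks.
        if precision == "float16":
            flags |= BuildFlag.FP16
        elif precision == "bfloat16":
            flags |= BuildFlag.BF16
        if int8:
            flags |= BuildFlag.INT8
    if use_refit:
        flags |= BuildFlag.REFIT
    if weight_sparsity:
        flags |= BuildFlag.SPARSE_WEIGHTS
    return flags
```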

Adding Optimization Profiles

The _add_optimization_profile() method adds optimization profiles to the builder configuration.
It iterates over the input tensors of the network and sets the minimum, optimum, and maximum shapes for each input tensor in the optimization profile.
It also handles the sharding of input tensors based on the auto-parallel configuration.
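
To make the minimum/optimum/maximum idea concrete, here is a small sketch that turns per-dimension ranges into the three shapes an optimization profile needs for each input. `DimRange` and `profile_shapes` are hypothetical names for illustration; in TensorRT itself the three shapes would be passed to `IOptimizationProfile.set_shape`. Sharding for auto-parallel is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class DimRange:
    """Allowed values for one dynamic dimension."""
    min: int
    opt: int
    max: int


def profile_shapes(
        inputs: Dict[str, List[DimRange]]) -> Dict[str, Tuple[Shape, Shape, Shape]]:
    """Return (min_shape, opt_shape, max_shape) for every input tensor."""
    profile = {}
    for name, dims in inputs.items():
        for dim in dims:
            if not dim.min <= dim.opt <= dim.max:
                raise ValueError(f"{name}: expected min <= opt <= max, got {dim}")
        profile[name] = (
            tuple(d.min for d in dims),
            tuple(d.opt for d in dims),
            tuple(d.max for d in dims),
        )
    return profile
```

For example, an `input_ids` tensor with a batch dimension of 1 to 16 and a sequence dimension of 1 to 1024 yields one shape triple covering both dimensions.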

Validating Named Dimensions

The _validate_named_dimensions() method validates that the named dimensions of different input tensors in each optimization profile have the same range.
This catches modeling errors in TensorRT-LLM early, such as two inputs that share a named dimension but declare different ranges for it, and reports them with user-friendly error messages.
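
The check can be sketched as follows, for a single optimization profile. The data layout (`NamedInputs`) and the function name are assumptions made for this example and do not mirror the library's internal structures.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int, int]  # (min, opt, max)
# Hypothetical layout: input name -> list of (dimension name, range).
NamedInputs = Dict[str, List[Tuple[str, Range]]]


def validate_named_dimensions(inputs: NamedInputs) -> None:
    """Raise ValueError if a named dimension has different ranges across inputs."""
    seen: Dict[str, Tuple[str, Range]] = {}
    for input_name, dims in inputs.items():
        for dim_name, dim_range in dims:
            if dim_name not in seen:
                # First time this dimension name appears: remember its range.
                seen[dim_name] = (input_name, dim_range)
                continue
            first_input, first_range = seen[dim_name]
            if dim_range != first_range:
                raise ValueError(
                    f"dimension '{dim_name}' has range {first_range} in "
                    f"'{first_input}' but {dim_range} in '{input_name}'")
```

The error message names both tensors and both ranges, which is the kind of detail that makes a mismatch easy to trace back to the model definition.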

Refitting an Engine

The refit_engine() method refits an existing TensorRT engine using the weights from a given network.
It requires that the engine was built with the REFIT flag and the network has the same structure as the engine.
It uses the trt.Refitter class to set the named weights and refit the engine.

Building an Engine

The build_engine() method builds a TensorRT engine from a given network and builder configuration.
It sets the plugin configuration, auto-parallel configuration, and optimization profiles in the builder configuration.
It renames the weights in the network based on the named parameters.
It then builds the serialized engine using the TensorRT builder and the configured network.

Saving Timing Cache and Configuration

The save_timing_cache() method serializes the timing cache of a given builder configuration to a file.
The save_config() method saves the builder configuration to a JSON file.

The BuildConfig class is a dataclass that represents the configuration options for building the TensorRT engine.

It includes parameters such as maximum input length, maximum output length, maximum batch size, beam width, number of tokens, prompt embedding table size, gather context/generation logits, strongly typed, builder optimization level, profiling verbosity, debug output, draft length, use_refit, timing cache paths, LoRA configuration, auto-parallel configuration, weight sparsity, and plugin configuration.

Overall, the Builder class in TensorRT-LLM provides a high-level interface for building optimised TensorRT engines from network definitions, allowing customisation of various configurations and optimisations for efficient inference.

Explanation of the builder.py classes and methods

Imagine you're building a powerful race car (the TensorRT engine) from various components (the model, build configuration, etc.).

The builder.py file acts as the assembly manual, guiding you through the process of constructing and fine-tuning your race car.

The Builder class is like the master mechanic who oversees the entire building process.

It provides the tools and expertise needed to create the race car's blueprint (the TensorRT network) and configure its settings (the builder configuration).

The master mechanic can create a new blueprint (create_network()) and customize the configuration (create_builder_config()) based on your preferences, such as the desired precision, performance optimizations, and memory usage.

The BuilderConfig class represents the configuration settings for your race car.

It's like a checklist of options you can choose from to optimise your car's performance.

You can specify the precision (e.g., float32, float16), enable build options (e.g., refit, sparse weights), and set profiling verbosity to analyze your car's performance.

The master mechanic uses this configuration to fine-tune the building process.

The BuildConfig class is like a detailed specification sheet for your race car.

It contains all the important parameters that define your car's capabilities, such as the maximum input and output lengths, batch size, beam width, and token limits.

You can think of it as a custom order form where you specify your desired features, such as gathering context and generation logits, enabling debug output, and configuring LoRA (Low-Rank Adaptation) and auto-parallel settings.

The EngineConfig class is like a comprehensive blueprint that combines the race car's specification sheet (BuildConfig) with its underlying architecture (PretrainedConfig) and version information. It provides a complete overview of your race car's configuration.

The Engine class represents the final product: your fully assembled and optimized race car.

It encapsulates the TensorRT engine and its associated configuration. You can save your race car (save()) for future use or load a previously built one (from_dir()) to take it for a spin.

The build() function is like the main assembly line where all the components come together to create your race car.

It takes the model (the engine block) and the build configuration (the custom order form) and goes through a series of optimisation steps to ensure peak performance.

It optimizes the model (no retraining is involved), creates the TensorRT network, applies network-level optimizations, and finally builds the engine using the builder configuration.

Throughout the building process, the script relies on utility functions and classes to streamline the workflow, such as optimize_model() for model-level optimizations, auto_parallel() for distributing the network across multiple GPUs, and net_guard(), a context manager that scopes which network is currently under construction.

By following the steps outlined in the builder.py file and leveraging the different classes and functions, you can create a highly optimized and customized race car (TensorRT engine) tailored to your specific needs.

In summary, the builder.py file provides a structured and modular approach to building TensorRT engines, allowing you to fine-tune and optimize your model's performance.

By understanding the roles and relationships between the classes and functions, you can effectively navigate the building process and create powerful and efficient engines for your specific use case.

UML Analysis

+--------------------------------------------------+
|                    Builder                      |
+--------------------------------------------------+
| - _trt_builder: trt.Builder                     |
| - strongly_typed: bool                          |
+--------------------------------------------------+
| + __init__()                                    |
| + trt_builder: trt.Builder                      |
| + create_network() -> Network                   |
| + create_builder_config(...) -> BuilderConfig   |
| + _add_optimization_profile(...)                |
| + _validate_named_dimensions(...)               |
| + refit_engine(...) -> trt.IHostMemory          |
| + build_engine(...) -> trt.IHostMemory          |
| + save_timing_cache(...) -> bool                |
| + save_config(...)                              |
+--------------------------------------------------+

The Builder class is responsible for creating and configuring the TensorRT builder and building the engine. It has the following methods:

__init__(): Initializes the Builder object and creates a trt.Builder instance.
trt_builder: Returns the trt.Builder instance.
create_network(): Creates and returns a Network object.
create_builder_config(...): Creates and returns a BuilderConfig object with the specified precision, timing cache, and other configurations.
_add_optimization_profile(...): Adds optimization profiles to the builder config based on the input tensors and auto-parallel configuration.
_validate_named_dimensions(...): Validates the named dimensions of different input tensors in each optimization profile.
refit_engine(...): Refits a TensorRT engine using weights from the network.
build_engine(...): Builds a TensorRT engine from the network and builder config.
save_timing_cache(...): Serializes the timing cache of a given builder config to a file.
save_config(...): Saves the builder config to a JSON file.

+--------------------------------------------------+
|                  BuilderConfig                  |
+--------------------------------------------------+
| - _trt_builder_config: trt.IBuilderConfig       |
| - precision: str                                |
| - tensor_parallel: int                          |
| - use_refit: bool                               |
| - int8: bool                                    |
| - strongly_typed: bool                          |
| - plugin_config: PluginConfig                   |
| - auto_parallel_config: dict                    |
+--------------------------------------------------+
| + _init(...)                                    |
| + trt_builder_config: trt.IBuilderConfig        |
| + to_dict() -> dict                             |
+--------------------------------------------------+

The BuilderConfig class represents the configuration for the TensorRT builder. It has the following properties and methods:

_trt_builder_config: The underlying trt.IBuilderConfig instance.
precision: The precision used for building the engine (e.g., 'float32', 'float16', 'bfloat16').
tensor_parallel: The number of GPUs used for tensor parallelism.
use_refit: Flag indicating whether to use engine refitting.
int8: Flag indicating whether to enable INT8 precision.
strongly_typed: Flag indicating whether to use strongly typed networks.
plugin_config: The PluginConfig object associated with the builder config.
auto_parallel_config: The auto-parallel configuration dictionary.
_init(...): Initializes the BuilderConfig object with the provided arguments.
trt_builder_config: Returns the underlying trt.IBuilderConfig instance.
to_dict(): Returns a dictionary representation of the builder config.
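
The pattern of an empty constructor plus a separate `_init(...)` that stores arbitrary keyword arguments can be sketched as below. `BuilderConfigSketch` is a hypothetical stand-in, and its `to_dict()` is a simplification that does not reproduce the layout of the real method's output.

```python
class BuilderConfigSketch:
    """Minimal model of a config object populated through _init(...)."""

    def __init__(self):
        # Intentionally empty: instances are meant to be created by the builder.
        pass

    def _init(self, trt_builder_config, **kwargs):
        """Store the underlying config and expose each option as an attribute."""
        self._trt_builder_config = trt_builder_config
        for key, value in kwargs.items():
            setattr(self, key, value)
        return self

    @property
    def trt_builder_config(self):
        return self._trt_builder_config

    def to_dict(self) -> dict:
        """Return the public options, skipping private attributes."""
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}
```

Returning `self` from `_init` lets a factory method create and populate the object in one expression.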

+--------------------------------------------------+
|                   BuildConfig                   |
+--------------------------------------------------+
| - max_input_len: int                            |
| - max_output_len: int                           |
| - max_batch_size: int                           |
| - max_beam_width: int                           |
| - max_num_tokens: Optional[int]                 |
| - opt_num_tokens: Optional[int]                 |
| - max_prompt_embedding_table_size: int          |
| - gather_context_logits: int                    |
| - gather_generation_logits: int                 |
| - strongly_typed: bool                          |
| - builder_opt: Optional[int]                    |
| - profiling_verbosity: str                      |
| - enable_debug_output: bool                     |
| - max_draft_len: int                            |
| - use_refit: bool                               |
| - input_timing_cache: str                       |
| - output_timing_cache: str                      |
| - lora_config: LoraBuildConfig                  |
| - auto_parallel_config: AutoParallelConfig       |
| - weight_sparsity: bool                         |
| - plugin_config: PluginConfig                   |
| - use_fused_mlp: bool                           |
+--------------------------------------------------+
| + from_dict(...) -> BuildConfig                 |
| + from_json_file(...) -> BuildConfig            |
| + to_dict() -> dict                             |
+--------------------------------------------------+

The BuildConfig class represents the configuration for building the engine.

It has various properties to control the build process, such as input and output lengths, batch size, beam width, token limits, prompt embedding table size, gather logits flags, strong typing, builder optimization level, profiling verbosity, debug output, draft length, refitting, timing cache, LoRA configuration, auto-parallel configuration, weight sparsity, plugin configuration, and fused MLP usage.

It also provides methods to create BuildConfig instances from dictionaries or JSON files and to convert the configuration to a dictionary.
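
A dataclass with this kind of dictionary and JSON round trip can be sketched as follows. `BuildConfigSketch` models only a handful of the fields listed above, the default values are illustrative rather than the library's, and rejecting unknown keys is a design choice of this example.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class BuildConfigSketch:
    """Subset of the build options; defaults are illustrative only."""
    max_input_len: int = 256
    max_output_len: int = 256
    max_batch_size: int = 8
    max_beam_width: int = 1
    max_num_tokens: Optional[int] = None
    strongly_typed: bool = False

    @classmethod
    def from_dict(cls, config: dict) -> "BuildConfigSketch":
        """Build a config from a dict, rejecting keys that are not fields."""
        known = {f.name for f in fields(cls)}
        unknown = set(config) - known
        if unknown:
            raise KeyError(f"unknown build options: {sorted(unknown)}")
        return cls(**config)

    @classmethod
    def from_json_file(cls, path: str) -> "BuildConfigSketch":
        """Load a config from a JSON file containing a flat object."""
        with open(path) as f:
            return cls.from_dict(json.load(f))

    def to_dict(self) -> dict:
        return asdict(self)
```

Because `to_dict()` and `from_dict()` are inverses for known fields, a configuration saved next to an engine can be reloaded later to see how that engine was built.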

+--------------------------------------------------+
|                  EngineConfig                   |
+--------------------------------------------------+
| - pretrained_config: PretrainedConfig           |
| - build_config: BuildConfig                     |
| - version: str                                  |
+--------------------------------------------------+
| + from_json_file(...) -> EngineConfig           |
| + to_dict() -> dict                             |
+--------------------------------------------------+

The EngineConfig class represents the configuration for an engine.

It combines the pretrained model configuration (PretrainedConfig), build configuration (BuildConfig), and version information. It provides methods to create EngineConfig instances from JSON files and to convert the configuration to a dictionary.

+--------------------------------------------------+
|                     Engine                      |
+--------------------------------------------------+
| - config: EngineConfig                          |
| - engine: trt.IHostMemory                       |
+--------------------------------------------------+
| + __init__(config, engine)                      |
| + save(engine_dir: str)                         |
| + from_dir(engine_dir: str, rank: int) -> Engine|
+--------------------------------------------------+

The Engine class represents a TensorRT engine. It has the following properties and methods:

config: The EngineConfig object associated with the engine.
engine: The serialized TensorRT engine (trt.IHostMemory).
__init__(...): Initializes the Engine object with the provided EngineConfig and serialized engine.
save(...): Saves the engine and its configuration to the specified directory.
from_dir(...): Creates an Engine instance from the specified directory and rank.
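
The save and load cycle can be sketched as below. `EngineSketch` is a hypothetical stand-in that uses a plain dict for the configuration and raw bytes for the serialized engine. The file names `config.json` and `rank{N}.engine`, and the `rank` parameter on `save`, are assumptions of this example.

```python
import json
from pathlib import Path


class EngineSketch:
    """Pairs a configuration with a serialized engine blob."""

    def __init__(self, config: dict, engine: bytes):
        self.config = config
        self.engine = engine

    def save(self, engine_dir: str, rank: int = 0) -> None:
        """Write the config (rank 0 only) and this rank's engine file."""
        directory = Path(engine_dir)
        directory.mkdir(parents=True, exist_ok=True)
        if rank == 0:
            # One shared config file is enough for all ranks.
            (directory / "config.json").write_text(json.dumps(self.config, indent=2))
        (directory / f"rank{rank}.engine").write_bytes(self.engine)

    @classmethod
    def from_dir(cls, engine_dir: str, rank: int = 0) -> "EngineSketch":
        """Load the shared config and the engine file for the given rank."""
        directory = Path(engine_dir)
        config = json.loads((directory / "config.json").read_text())
        engine = (directory / f"rank{rank}.engine").read_bytes()
        return cls(config, engine)
```

Keeping one engine file per rank alongside a single configuration file lets each process in a multi-GPU deployment load only the part it needs.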

The get_engine_version(...) function retrieves the version of the engine from the configuration file in the specified directory.

The build(...) function is the main entry point for building an engine.

It takes a pretrained model (PretrainedModel) and a build configuration (BuildConfig) as input and returns an Engine object.

The function performs various optimizations on the model based on the build configuration, creates a TensorRT network, and builds the engine using the Builder class.

This breakdown provides an overview of the classes and functions in the builder.py file and their responsibilities.

The classes are designed to encapsulate different aspects of the engine building process, such as builder configuration, build configuration, engine configuration, and the engine itself. The build(...) function orchestrates the entire process of optimizing the model, creating the network, and building the engine using the provided configurations.


Last updated 1 month ago
