Model Definition
The model definition process in TensorRT-LLM involves using its Python API to create a structured representation of neural networks, specifically optimised for execution on NVIDIA GPUs.
Core Concepts
Graph-Based Representation
TensorRT-LLM uses a graph-based approach to represent neural networks.
In this context, a graph is a collection of nodes (representing operations such as layers and activations) and edges (representing the tensors that flow between operations).
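As a rough illustration of the idea (toy plain Python, not the TensorRT API), a compute graph can be modelled as operation nodes connected by tensor edges:

```python
# Toy illustration of a compute graph: nodes are operations,
# edges are the tensors flowing between them. This is NOT the
# TensorRT API, just a sketch of the underlying concept.

class Node:
    def __init__(self, op, inputs):
        self.op = op          # operation name, e.g. "matmul" or "relu"
        self.inputs = inputs  # upstream nodes whose outputs feed this one

# Build a tiny graph: input -> matmul -> relu
x = Node("input", [])
h = Node("matmul", [x])
y = Node("relu", [h])

# Walk the edges back from the output to list the operations in order.
def ops_from(node):
    order = []
    for parent in node.inputs:
        order.extend(ops_from(parent))
    order.append(node.op)
    return order

print(ops_from(y))  # ['input', 'matmul', 'relu']
```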
TensorRT Python API
The TensorRT Python API is used to create the graph representation of the network. It allows the network architecture to be constructed and manipulated in a format that TensorRT can optimise and execute.
Key Components of TensorRT-LLM Model Definition
tensorrt_llm.Builder Class
Contains a tensorrt.Builder object. Used to initialise the process of defining a neural network model.
Network Creation
The tensorrt_llm.Builder.create_network method creates an instance of tensorrt.INetworkDefinition. The INetworkDefinition object is essentially the graph of the model, where you define the network architecture.
Functional API
TensorRT-LLM provides tensorrt_llm.functional, a module containing functions that add layers and operations to the network. These are 'free functions', meaning they are not bound to a class instance but operate directly on network elements such as tensors.
Example of Activation Layer
A simple activation layer can be added using tensorrt_llm.functional.activation. This function inserts a tensorrt.IActivationLayer into the model's graph. It uses default_trtnet() to get the current INetworkDefinition and adds the activation layer to it. Standard activation functions like ReLU and Sigmoid are conveniently wrapped using this pattern.
Specialised Functions
Advanced functions, like Swish (SiLU), are constructed using these basic building blocks.
For example, SiLU is implemented as input * sigmoid(input), combining a basic elementwise operation (*) with a standard activation (sigmoid).
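Numerically, the same composition can be checked with plain Python floats:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # SiLU / Swish: the input multiplied by its own sigmoid,
    # built by composing the two simpler operations.
    return x * sigmoid(x)

print(silu(0.0))  # 0.0, since sigmoid(0) = 0.5 and 0 * 0.5 = 0
# For large positive inputs sigmoid(x) approaches 1, so silu(x) ~ x.
```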
Graph Traversal and Transformation
Once the model is defined using the API, you can traverse or transform the graph using the API provided by tensorrt.ILayer. This capability is crucial for optimising the network, modifying its structure, or performing analysis.
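Traversal and transformation can be illustrated with a toy layer list standing in for iterating over ILayer objects (this is not the TensorRT API):

```python
# Toy traversal/transformation pass over a model graph, standing in
# for iterating tensorrt.ILayer objects. Here the graph is just an
# ordered list of (name, layer_type) pairs.

network = [
    ("embed", "embedding"),
    ("attn0", "attention"),
    ("act0", "relu"),
    ("head", "linear"),
]

# Analysis pass: count layers by type.
counts = {}
for _, layer_type in network:
    counts[layer_type] = counts.get(layer_type, 0) + 1
print(counts)

# Transformation pass: rewrite every relu activation to gelu.
transformed = [
    (name, "gelu" if t == "relu" else t) for name, t in network
]
print(transformed)
```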
Optimisation During Engine Compilation
After defining the model, TensorRT optimises this graph during the engine compilation phase.
Optimisation includes layer fusion, precision calibration, kernel auto-tuning, etc., to ensure efficient execution on GPUs.
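The idea behind layer fusion can be sketched with a toy pass that merges an elementwise activation into the preceding layer. This is purely illustrative; TensorRT's real fusion happens inside the engine compiler, not in user code:

```python
# Toy layer-fusion pass: merge a relu that immediately follows a
# linear layer into a single fused op, the way engine compilation
# fuses compatible adjacent layers into one kernel. Illustrative only.

def fuse(layers):
    fused = []
    for layer in layers:
        if layer == "relu" and fused and fused[-1] == "linear":
            fused[-1] = "linear+relu"  # one kernel launch instead of two
        else:
            fused.append(layer)
    return fused

print(fuse(["linear", "relu", "linear", "sigmoid"]))
# ['linear+relu', 'linear', 'sigmoid']
```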
Practical Implications
This approach provides a way to define large language models, offering the ability to easily customise and optimise the network architecture.
TensorRT-LLM facilitates the creation of highly efficient models suitable for GPU-accelerated inference, especially important for computationally intensive tasks like those encountered in large language models.
In summary, the model definition in TensorRT-LLM is a process of constructing a graph-based representation of a neural network using a specialised Python API.