Model Definition
The model definition process in TensorRT-LLM involves using its Python API to create a structured representation of neural networks, specifically optimised for execution on NVIDIA GPUs.
Core Concepts
Graph-Based Representation
TensorRT-LLM uses a graph-based approach to represent neural networks.
In this context, a graph is a collection of nodes (representing operations such as layers and activations) and edges (representing the tensors that flow between operations).
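As a rough illustration of the idea (toy plain Python, not the TensorRT API), a compute graph can be modelled as operation nodes connected by tensor edges:

```python
# Toy illustration of a compute graph: nodes are operations,
# edges are the tensors flowing between them. This is NOT the
# TensorRT API, just a sketch of the underlying concept.

class Node:
    def __init__(self, op, inputs):
        self.op = op          # operation name, e.g. "matmul" or "relu"
        self.inputs = inputs  # upstream nodes whose outputs feed this one

# Build a tiny graph: input -> matmul -> relu
x = Node("input", [])
h = Node("matmul", [x])
y = Node("relu", [h])

# Walk the edges back from the output to list the operations in order.
def ops_from(node):
    order = []
    for parent in node.inputs:
        order.extend(ops_from(parent))
    order.append(node.op)
    return order

print(ops_from(y))  # ['input', 'matmul', 'relu']
```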
TensorRT Python API
The TensorRT Python API is used to create the graph representation of the network. It allows the network architecture to be constructed and manipulated in a format that TensorRT can optimise and execute.
Key Components of TensorRT-LLM Model Definition
tensorrt_llm.Builder Class
Contains a tensorrt.Builder object. Used to initialise the process of defining a neural network model.
Network Creation
The tensorrt_llm.Builder.create_network method creates an instance of tensorrt.INetworkDefinition. The INetworkDefinition object is essentially the graph of the model, where you define the network architecture.
Functional API
TensorRT-LLM provides tensorrt_llm.functional, a module containing functions that add layers and operations to the network. These are 'free functions', meaning they are not bound to a class instance but operate directly on network elements such as tensors.
Example of Activation Layer
A simple activation layer can be added using tensorrt_llm.functional.activation. This function inserts a tensorrt.IActivationLayer into the model's graph. It uses default_trtnet() to get the current INetworkDefinition and adds the activation layer to it. Standard activation functions like ReLU and Sigmoid are conveniently wrapped using this pattern.
Specialised Functions
Advanced functions, like Swish (SiLU), are constructed using these basic building blocks.
For example, SiLU is implemented as input * sigmoid(input), combining a basic elementwise operation (*) with a standard activation (sigmoid).
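Numerically, the same composition can be checked with plain Python floats:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # SiLU / Swish: the input multiplied by its own sigmoid,
    # built by composing the two simpler operations.
    return x * sigmoid(x)

print(silu(0.0))  # 0.0, since sigmoid(0) = 0.5 and 0 * 0.5 = 0
# For large positive inputs sigmoid(x) approaches 1, so silu(x) ~ x.
```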
Graph Traversal and Transformation
Once the model is defined using the API, you can traverse or transform the graph using the API provided by tensorrt.ILayer. This capability is crucial for optimising the network, modifying its structure, or performing analysis.
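Traversal and transformation can be illustrated with a toy layer list standing in for iterating over ILayer objects (this is not the TensorRT API):

```python
# Toy traversal/transformation pass over a model graph, standing in
# for iterating tensorrt.ILayer objects. Here the graph is just an
# ordered list of (name, layer_type) pairs.

network = [
    ("embed", "embedding"),
    ("attn0", "attention"),
    ("act0", "relu"),
    ("head", "linear"),
]

# Analysis pass: count layers by type.
counts = {}
for _, layer_type in network:
    counts[layer_type] = counts.get(layer_type, 0) + 1
print(counts)

# Transformation pass: rewrite every relu activation to gelu.
transformed = [
    (name, "gelu" if t == "relu" else t) for name, t in network
]
print(transformed)
```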
Optimisation During Engine Compilation
After defining the model, TensorRT optimises this graph during the engine compilation phase.
Optimisation includes layer fusion, precision calibration, kernel auto-tuning, etc., to ensure efficient execution on GPUs.
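The idea behind layer fusion can be sketched with a toy pass that merges an elementwise activation into the preceding layer. This is purely illustrative; TensorRT's real fusion happens inside the engine compiler, not in user code:

```python
# Toy layer-fusion pass: merge a relu that immediately follows a
# linear layer into a single fused op, the way engine compilation
# fuses compatible adjacent layers into one kernel. Illustrative only.

def fuse(layers):
    fused = []
    for layer in layers:
        if layer == "relu" and fused and fused[-1] == "linear":
            fused[-1] = "linear+relu"  # one kernel launch instead of two
        else:
            fused.append(layer)
    return fused

print(fuse(["linear", "relu", "linear", "sigmoid"]))
# ['linear+relu', 'linear', 'sigmoid']
```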
Practical Implications
This approach provides a way to define large language models, offering the ability to easily customise and optimise the network architecture.
TensorRT-LLM facilitates the creation of highly efficient models suitable for GPU-accelerated inference, especially important for computationally intensive tasks like those encountered in large language models.
In summary, the model definition in TensorRT-LLM is a process of constructing a graph-based representation of a neural network using a specialised Python API.