The TensorRT-LLM Process
TensorRT-LLM is a toolkit that streamlines the process of deploying large language models (LLMs) for efficient inference.
It provides a Python API that abstracts away the complexities of working with the low-level TensorRT API while still leveraging its powerful optimisation capabilities.
Model Definition
The TensorRT-LLM Python API serves as an interface for defining the architecture of your neural language model.
You can think of it as a high-level description of your neural network, specifying the layers, activations, and connections between them.
Under the hood, this API translates your model definition into a graph representation using the TensorRT API.
Compilation
Once you have defined your model, the next step is to compile it into an optimised inference engine.
This is where the power of TensorRT comes into play.
The compilation process takes your model definition and applies various optimisations to generate an efficient execution plan. Think of it as a way to transform your high-level model description into a highly optimised form that can run efficiently on the target GPU hardware.
Weight Bindings
During compilation, TensorRT needs to know the values of the model's weights. This is where weight bindings come in.
You assign the trained weights to the corresponding parameters in your model definition. These weights are then embedded into the compiled TensorRT engine, allowing it to perform inference with the trained parameters.
Pattern-Matching and Fusion
One of the key optimisations performed by TensorRT during compilation is operation fusion.
It analyses the computational graph of your model and identifies patterns that can be fused together into a single, more efficient operation.
For example, a matrix multiplication followed by an activation function can be fused into a single kernel. This reduces memory transfers and kernel launch overhead, leading to faster execution.
Plugins
TensorRT-LLM introduces the concept of plugins, which are user-defined kernels that can be seamlessly integrated into the model graph.
Plugins allow you to extend the functionality of TensorRT by implementing custom operations that may not be natively supported. This flexibility is particularly useful for handling advanced or domain-specific operations in LLMs.
Runtime
Once your model is compiled into a TensorRT engine, you need a runtime environment to execute it.
TensorRT-LLM provides a runtime API in both Python and C++ that facilitates loading the engine and running inference. The runtime handles the execution flow, including feeding inputs, running the model, and retrieving outputs. It abstracts away the low-level details, making it easier to integrate the engine into your application.
Multi-GPU and Multi-Node Support
TensorRT-LLM goes beyond single-GPU execution by enabling multi-GPU and multi-node support.
It leverages plugins that wrap communication primitives from the NCCL library to facilitate efficient data exchange between multiple GPUs or nodes. This allows you to distribute the workload and scale up the inference performance of your LLM.
In-flight Batching
To further optimise throughput, TensorRT-LLM introduces the concept of in-flight batching.
It allows new requests to join a running batch and completed requests to leave it while generation is still in progress, rather than waiting for every sequence in the batch to finish, which keeps the GPU busy and makes more efficient use of its resources. The Batch Manager component handles this functionality, transparently batching requests and dispatching them to the engine for execution.
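To make the idea concrete, here is a toy scheduler that mimics the scheduling behaviour described above. It is a conceptual illustration only: the Request class, step function, and serve loop are hypothetical and do not correspond to the TensorRT-LLM Batch Manager API.

```python
# Conceptual illustration of in-flight batching, NOT the Batch Manager API.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def is_finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens

def step(batch):
    """Pretend to run one decoding step for every active request."""
    for req in batch:
        req.generated.append("<token>")

def serve(incoming: deque, max_batch_size: int = 8):
    active = []
    while incoming or active:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the whole batch to finish (the essence of in-flight batching).
        while incoming and len(active) < max_batch_size:
            active.append(incoming.popleft())
        step(active)
        # Evict completed requests immediately so their slots can be reused.
        active = [r for r in active if not r.is_finished()]

serve(deque([Request("hello", 4), Request("world", 2)]))
```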
Summary
By understanding these components and their interactions, you can appreciate how TensorRT-LLM simplifies the process of deploying LLMs for efficient inference.
It provides a high-level API for model definition, leverages TensorRT's optimisations during compilation, offers flexibility through plugins, and delivers a runtime environment for seamless execution.
This architecture empowers you to focus on the high-level aspects of your LLM while benefiting from the performance optimisations provided by TensorRT under the hood.
Technical Summary of Each Component
Model Definition
TensorRT-LLM provides a Python API to define Large Language Models (LLMs).
The TensorRT Python API creates graph representations of deep neural networks.
The tensorrt_llm.Builder class contains a tensorrt.Builder object used to create an instance of tensorrt.INetworkDefinition.
The INetworkDefinition object is populated using free functions from tensorrt_llm.functional.
These functions, like activation, relu, sigmoid, etc., insert nodes into the model's graph.
Higher-level functions can be composed from these basic building blocks, such as the silu activation.
The resulting graph represents the network and can be traversed or transformed using the tensorrt.ILayer class.
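The following is a minimal sketch of how a higher-level activation can be composed from the basic graph-building functions in tensorrt_llm.functional. The name my_silu is illustrative (the library ships its own silu helper, whose implementation may differ), and the function only makes sense while a network is being built, i.e. when x is a TensorRT-LLM tensor.

```python
# Minimal sketch: composing SiLU from the basic building blocks.
from tensorrt_llm.functional import sigmoid

def my_silu(x):
    # SiLU(x) = x * sigmoid(x): a sigmoid node and an element-wise multiply
    # node are inserted into the underlying tensorrt.INetworkDefinition.
    return x * sigmoid(x)
```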
Compilation
Once the model graph is defined, it needs to be compiled into an optimised TensorRT engine.
The tensorrt_llm.Builder class provides the build_engine method, which calls the build_serialized_network method of the tensorrt.Builder object.
During compilation, TensorRT performs several optimisations on the model graph:
It chooses the best kernel for each operation based on the available GPU.
It identifies patterns in the graph where multiple operations can be fused into a single kernel, reducing memory movement and kernel launch overhead.
It compiles the graph of operations into a single CUDA Graph that can be launched efficiently.
Complex layer fusions, like FlashAttention, cannot be automatically discovered by TensorRT. In such cases, explicit plugins can be used to replace parts of the graph with custom kernels.
The compilation process produces an instance of the tensorrt.IHostMemory class, which represents the optimised TensorRT engine.
The compiled engine can be stored as a binary file for later use.
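Below is a hedged sketch of the compilation step using the classes named above. The keyword arguments of create_builder_config vary between TensorRT-LLM releases, so treat them as placeholders rather than a stable signature.

```python
# Sketch of compiling a defined network into a serialised TensorRT engine.
import tensorrt_llm

builder = tensorrt_llm.Builder()
config = builder.create_builder_config(name='my_model', precision='float16')
network = builder.create_network()

# ... populate `network` with layers via tensorrt_llm.functional ...

# build_engine calls tensorrt.Builder.build_serialized_network under the hood
# and returns a tensorrt.IHostMemory buffer holding the optimised engine.
engine = builder.build_engine(network, config)

# The serialised engine can be written to disk for later use by the runtime.
with open('my_model.engine', 'wb') as f:
    f.write(engine)
```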
Weight Bindings
TensorRT engines embed the network weights, which must be known during compilation.
Before calling tensorrt_llm.Builder.build_engine, the weights must be bound to parameters in the model definition.
This is done by assigning values to the Parameter objects exposed by the model's layers.
TensorRT also supports refitting engines to update weights after compilation using the refit_engine method in tensorrt_llm.Builder.
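As a small sketch of weight binding, the helper below copies trained numpy arrays into a model's Parameter objects. The attribute path model.fc.weight and the checkpoint key are illustrative; real TensorRT-LLM models expose their own layer and parameter names.

```python
import numpy as np

def bind_weights(model, checkpoint: dict) -> None:
    """Copy trained numpy weights into the model's Parameter objects.

    `model.fc.weight` is an illustrative path, not the actual module layout.
    """
    # Assigning to Parameter.value binds the array so it is embedded into the
    # engine when tensorrt_llm.Builder.build_engine is called.
    model.fc.weight.value = checkpoint['fc.weight'].astype(np.float32)
```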
Pattern-Matching and Fusion
TensorRT performs pattern-matching and fusion during the compilation process to optimise the model execution.
Fusion helps reduce data transfer between memory and compute cores and removes kernel launch overhead.
TensorRT identifies sequences of operations that can be fused and automatically generates efficient GPU kernels for them.
For example, a sequence of matmul followed by relu can be fused into a single kernel, avoiding intermediate memory writes and reads.
TensorRT's pattern-matching algorithm is powerful but may not identify all possible fusions, especially for uncommon or advanced patterns.
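The sketch below records exactly that pattern with the free functions from tensorrt_llm.functional. Note that the fusion itself happens automatically during engine compilation, not in this Python code; fused_candidate is an illustrative name.

```python
# A matmul immediately followed by relu: a pattern TensorRT can typically fuse.
from tensorrt_llm.functional import matmul, relu

def fused_candidate(x, w):
    # Two graph nodes are recorded here; TensorRT's pattern-matcher may emit a
    # single fused GEMM+ReLU kernel for them in the compiled engine.
    return relu(matmul(x, w))
```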
Plugins
Plugins are a mechanism in TensorRT to extend its functionality with custom GPU kernels.
They are inserted into the network graph definition and map to user-defined kernels written in C++.
Plugins follow a well-defined interface described in the TensorRT Developer Guide.
TensorRT-LLM uses several plugins, located in the cpp/tensorrt_llm/plugins directory.
Plugins are useful for implementing complex operations or fusions that cannot be automatically discovered by TensorRT, such as the GPT Attention operator.
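Here is a hedged sketch of enabling a plugin from the Python side. The plugin_config helper shown appears in older TensorRT-LLM example scripts, and its exact name and signature have changed across releases, so treat it as illustrative rather than a stable API.

```python
# Sketch: asking the builder to use the GPT Attention plugin (a custom C++
# kernel from cpp/tensorrt_llm/plugins) instead of relying on automatic fusion.
import tensorrt_llm

builder = tensorrt_llm.Builder()
network = builder.create_network()

# Plugin configuration API as used in older example scripts; may differ today.
network.plugin_config.set_gpt_attention_plugin(dtype='float16')
```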
Runtime
TensorRT-LLM includes an API to implement Python and C++ runtimes.
The runtime components load the TensorRT engines and drive their execution.
For auto-regressive models like GPT, the runtime loads the engine that processes the input sequence and handles the generation loop.
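A hedged sketch of the Python runtime follows, assuming the ModelRunner helper from tensorrt_llm.runtime; argument names may differ between releases, and a real application would tokenise text with the model's own tokenizer rather than using raw token ids.

```python
# Sketch: loading a compiled engine and running the generation loop.
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir='path/to/engine_dir')

# batch_input_ids: one tensor of token ids per request in the batch.
batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]

# The runner feeds inputs to the engine, drives the auto-regressive generation
# loop, and returns the generated token ids.
outputs = runner.generate(batch_input_ids, max_new_tokens=32, end_id=2, pad_id=2)
print(outputs)
```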
Multi-GPU and Multi-Node Support
TensorRT-LLM extends TensorRT's single-GPU design to support multiple GPUs and nodes.
It uses TensorRT plugins that wrap communication primitives from the NCCL library and a custom All-Reduce plugin.
The communication plugins are found in cpp/tensorrt_llm/plugins/ncclPlugin.
Multi-GPU functions like allreduce, allgather, send, and recv are exposed in the TensorRT-LLM Python API.
Two modes of model parallelism are supported: Tensor Parallelism and Pipeline Parallelism.
Tensor Parallelism splits each individual layer across GPUs; every GPU runs the entire network on its shard of the weights and synchronises with the others whenever a layer needs the full result.
Pipeline Parallelism assigns contiguous groups of layers to different GPUs; each GPU runs only its subset of the model and communicates with the next stage at the boundaries between those groups.
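The sketch below shows how these two modes are typically expressed on the Python side via a Mapping object; the exact constructor arguments may vary between TensorRT-LLM releases.

```python
# Sketch: describing an 8-GPU layout with 4-way tensor parallelism inside
# a 2-stage pipeline.
from tensorrt_llm import Mapping

mapping = Mapping(world_size=8, rank=0, tp_size=4, pp_size=2)

# The mapping is passed to the model/builder so that each rank builds and runs
# only its shard, with NCCL-based plugins (allreduce, allgather, send, recv)
# handling the communication between ranks.
print(mapping.tp_rank, mapping.pp_rank)
```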
Summary
The TensorRT-LLM architecture provides a framework for defining, compiling, and executing LLMs efficiently using TensorRT.
It is important to note that the effectiveness of the TensorRT-LLM architecture depends on several factors:
The quality of the model definition and the choice of appropriate layers and operations
The ability to leverage TensorRT's pattern-matching and fusion capabilities effectively
The use of plugins for complex operations or fusions that cannot be automatically discovered
The optimisation of the C++ runtime for the specific LLM architecture and deployment scenario
The careful consideration of multi-GPU and multi-node configurations based on the model size, available resources, and performance requirements
Additionally, it's crucial to benchmark and profile the model performance using TensorRT-LLM to identify potential bottlenecks and optimise accordingly.
Experimenting with different optimisation techniques, such as quantisation or lower numerical precisions, can further improve the model's efficiency.
By leveraging the capabilities of TensorRT and extending it with custom plugins and runtime optimisations, TensorRT-LLM enables the deployment of large-scale language models in various scenarios, from local execution to multi-GPU and multi-node configurations.