# The TensorRT-LLM process

TensorRT-LLM is a toolkit that streamlines the process of deploying large language models (LLMs) for efficient inference.&#x20;

It provides a <mark style="color:yellow;">**Python API**</mark> that abstracts away the complexities of working with the low-level TensorRT API while still leveraging its powerful optimisation capabilities.

### <mark style="color:blue;">Model Definition</mark>

The TensorRT-LLM Python API serves as an interface for defining the architecture of your neural language model.&#x20;

You can think of it as a high-level description of your neural network, specifying the layers, activations, and connections between them. &#x20;

Under the hood, this API *<mark style="color:yellow;">**translates your model definition into a graph representation**</mark>* using the TensorRT API.
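
The idea of translating a model definition into a graph can be sketched in plain Python. The `Node` and `Graph` classes below are hypothetical stand-ins, not the TensorRT API — the real graph is a `tensorrt.INetworkDefinition` — but they show how "defining a model" amounts to inserting operation nodes into a graph:

```python
# Toy illustration of a model definition becoming a graph representation.
# These classes are hypothetical; TensorRT builds an INetworkDefinition.

class Node:
    def __init__(self, op, inputs):
        self.op = op          # operation name, e.g. "matmul" or "relu"
        self.inputs = inputs  # upstream nodes feeding this one

class Graph:
    def __init__(self):
        self.nodes = []

    def add(self, op, *inputs):
        node = Node(op, list(inputs))
        self.nodes.append(node)
        return node

# "Defining a model" is just inserting nodes into the graph:
g = Graph()
x = g.add("input")
h = g.add("matmul", x)   # linear layer
y = g.add("relu", h)     # activation

print([n.op for n in g.nodes])  # → ['input', 'matmul', 'relu']
```

Each high-level API call appends a node and returns a handle that later calls can consume, which is how layer connections are expressed.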

### <mark style="color:blue;">Compilation</mark>

Once you have defined your model, the next step is to <mark style="color:yellow;">**compile it into an optimised inference engine.**</mark>&#x20;

This is where the power of TensorRT comes into play. &#x20;

The compilation process takes your model definition and applies various optimisations to generate an efficient execution plan.

Think of it as a way to transform your high-level model description into a highly optimised form that can run efficiently on the target GPU hardware.

### <mark style="color:blue;">Weight Bindings</mark>

During compilation, TensorRT needs to know the values of the model's weights. This is where weight bindings come in.&#x20;

You assign the trained weights to the corresponding parameters in your model definition.  These *<mark style="color:yellow;">**weights are then embedded into the compiled TensorRT engine**</mark>*, allowing it to perform inference with the trained parameters.
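
A minimal sketch of this contract, using hypothetical stand-in classes rather than the real `tensorrt_llm` ones: parameters declared in the model definition must receive concrete values before compilation, because the compiled engine embeds them.

```python
# Toy sketch of weight binding. Parameter and compile_engine are
# hypothetical stand-ins, not the TensorRT-LLM API.

class Parameter:
    def __init__(self, name):
        self.name = name
        self.value = None  # unbound until trained weights are assigned

def compile_engine(params):
    # Compilation refuses to proceed while any parameter is unbound.
    unbound = [p.name for p in params if p.value is None]
    if unbound:
        raise ValueError(f"unbound parameters: {unbound}")
    # The "engine" freezes the weights it was compiled with.
    return {p.name: p.value for p in params}

w = Parameter("linear.weight")
b = Parameter("linear.bias")

w.value = [[0.5, -0.5]]   # bind trained weights...
b.value = [0.1]           # ...before building the engine

engine = compile_engine([w, b])
print(engine["linear.bias"])  # → [0.1]
```

Once built, the engine carries its own copy of the weights, which is why binding has to happen before the build step.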

### <mark style="color:blue;">Pattern-Matching and Fusion</mark>

One of the key optimisations performed by TensorRT during compilation is *<mark style="color:yellow;">**operation fusion.**</mark>*&#x20;

It analyses the computational graph of your model and identifies patterns that can be fused together into a single, more efficient operation.&#x20;

For example, *<mark style="color:yellow;">**a matrix multiplication followed by an activation function can be fused into a single kernel.**</mark>* This reduces memory transfers and kernel launch overhead, leading to faster execution.
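
The equivalence behind that fusion can be shown in plain Python. Real fusion happens inside GPU kernels; this sketch only demonstrates that the fused form computes the same result while never materialising the intermediate matmul output:

```python
# Pure-Python sketch of matmul + ReLU fusion: the unfused version writes
# out an intermediate result that the fused version never produces.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]

def fused_matmul_relu(a, b):
    # One pass: apply the activation as each output element is produced,
    # instead of writing the matmul result to memory first.
    return [[max(sum(x * y for x, y in zip(row, col)), 0.0)
             for col in zip(*b)]
            for row in a]

a = [[1.0, -2.0]]
b = [[3.0], [4.0]]

unfused = relu(matmul(a, b))      # two "kernels", one intermediate buffer
fused = fused_matmul_relu(a, b)   # one "kernel", no intermediate
print(unfused == fused)  # → True
```

On a GPU, skipping that intermediate buffer is exactly what saves the memory transfers and the second kernel launch.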

### <mark style="color:blue;">Plugins</mark>

TensorRT-LLM introduces the concept of plugins, which are user-defined kernels that can be seamlessly integrated into the model graph.

Plugins allow you to extend the functionality of TensorRT by *<mark style="color:yellow;">**implementing custom operations that may not be natively supported.**</mark>*  This flexibility is particularly useful for handling advanced or domain-specific operations in LLMs.

### <mark style="color:blue;">Runtime</mark>

Once your model is compiled into a TensorRT engine, you need a runtime environment to execute it.&#x20;

TensorRT-LLM provides a runtime API in both Python and C++ that facilitates loading the engine and running inference.  The runtime *<mark style="color:yellow;">**handles the execution flow, including feeding inputs, running the model, and retrieving outputs**</mark>*. It abstracts away the low-level details, making it easier to integrate the engine into your application.
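
The runtime's role can be sketched with a toy class. The names below are illustrative, not the actual TensorRT-LLM runtime API, and the "engine" is just a stored scale factor standing in for a compiled binary:

```python
# Hypothetical sketch of the runtime's execution flow: load a serialized
# engine, feed inputs, run, retrieve outputs.

import json

class ToyRuntime:
    def load(self, serialized_engine):
        # A real runtime deserializes a compiled binary; here the "engine"
        # is just a stored scale factor for a stand-in computation.
        self.engine = json.loads(serialized_engine)

    def infer(self, inputs):
        # Execution flow: feed inputs, run the model, return outputs.
        scale = self.engine["scale"]
        return [x * scale for x in inputs]

runtime = ToyRuntime()
runtime.load('{"scale": 2}')
print(runtime.infer([1, 2, 3]))  # → [2, 4, 6]
```

The point is the separation of concerns: the application only loads an engine and calls an inference entry point, while the runtime owns the execution details.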

### <mark style="color:blue;">Multi-GPU and Multi-Node Support</mark>

TensorRT-LLM goes beyond single-GPU execution by enabling multi-GPU and multi-node support.

It leverages plugins that wrap communication primitives from the <mark style="color:blue;">**NCCL library**</mark> to facilitate *<mark style="color:yellow;">**efficient data exchange between multiple GPUs or nodes**</mark>*. This allows you to distribute the workload and scale up the inference performance of your LLM.
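
The central primitive here, all-reduce, can be simulated in a single process. Every rank contributes a partial result and every rank ends up with the same summed result; real NCCL does this across GPUs and nodes:

```python
# Single-process sketch of the all-reduce collective that the NCCL
# plugins wrap: each rank holds partial sums, and after the operation
# every rank holds the element-wise total.

def all_reduce(per_rank_values):
    total = [sum(vals) for vals in zip(*per_rank_values)]
    return [list(total) for _ in per_rank_values]

# Two "GPUs", each with a partial activation vector:
partials = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce(partials))  # → [[4.0, 6.0], [4.0, 6.0]]
```

This is the synchronisation step that lets each GPU continue with a complete result after computing only its share of the work.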

### <mark style="color:blue;">In-flight Batching</mark>

To further optimise throughput, TensorRT-LLM introduces the concept of in-flight batching.&#x20;

It enables the runtime to *<mark style="color:yellow;">**batch multiple inference requests together**</mark>*, allowing for more efficient utilisation of GPU resources. The Batch Manager component handles this functionality, transparently batching requests and dispatching them to the engine for execution.
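
A toy sketch of the grouping idea, with a hypothetical manager class: requests that arrive independently are queued and dispatched to the engine as one batch. The real Batch Manager goes further and interleaves requests that are at different generation steps ("in-flight"); this only shows the basic batching:

```python
# Toy batch manager: queue independent requests, dispatch them together.
# ToyBatchManager is a hypothetical stand-in, not the TRT-LLM component.

from collections import deque

class ToyBatchManager:
    def __init__(self, engine, max_batch_size):
        self.engine = engine
        self.max_batch_size = max_batch_size
        self.queue = deque()

    def submit(self, request):
        self.queue.append(request)

    def step(self):
        # Pull up to max_batch_size pending requests and run them together.
        batch = [self.queue.popleft()
                 for _ in range(min(self.max_batch_size, len(self.queue)))]
        return self.engine(batch) if batch else []

# Stand-in "engine" that processes a whole batch in one call:
def batch_engine(batch):
    return [req.upper() for req in batch]

mgr = ToyBatchManager(batch_engine, max_batch_size=2)
mgr.submit("hello")
mgr.submit("world")
mgr.submit("again")
print(mgr.step())  # → ['HELLO', 'WORLD']
print(mgr.step())  # → ['AGAIN']
```

Batching amortises each engine invocation over several requests, which is where the throughput gain comes from.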

### <mark style="color:blue;">Summary</mark>

By understanding these components and their interactions, you can appreciate how TensorRT-LLM simplifies the process of deploying LLMs for efficient inference.&#x20;

It provides a high-level API for model definition, leverages TensorRT's optimisations during compilation, offers flexibility through plugins, and delivers a runtime environment for seamless execution.&#x20;

This architecture empowers you to focus on the high-level aspects of your LLM while benefiting from the performance optimisations provided by TensorRT under the hood.

### <mark style="color:green;">Technical Summary of Each Component</mark>

### <mark style="color:blue;">Model Definition</mark>

* TensorRT-LLM provides a Python API to define Large Language Models (LLMs).
* The <mark style="color:blue;">**TensorRT Python API**</mark> <mark style="color:yellow;">creates graph representations of deep neural networks.</mark>
* The <mark style="color:yellow;">**`tensorrt_llm.Builder`**</mark> class contains a <mark style="color:yellow;">**`tensorrt.Builder`**</mark> object used to create an instance of <mark style="color:yellow;">**`tensorrt.INetworkDefinition`**</mark><mark style="color:yellow;">**.**</mark>
* The <mark style="color:yellow;">**`INetworkDefinition`**</mark> object is populated using free functions from <mark style="color:yellow;">**`tensorrt_llm.functional`**</mark>.
* These functions, like <mark style="color:yellow;">**`activation`**</mark><mark style="color:yellow;">**,**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`relu`**</mark><mark style="color:yellow;">**,**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`sigmoid`**</mark>, etc., insert nodes into the model's graph.
* Higher-level functions can be composed from these basic building blocks, such as the <mark style="color:yellow;">**`silu`**</mark> activation.
* The resulting graph represents the network and can be traversed or transformed using the <mark style="color:yellow;">**`tensorrt.ILayer`**</mark> class.
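
The composition point can be illustrated with ordinary Python: SiLU is just a multiply composed with the sigmoid primitive, `silu(x) = x * sigmoid(x)`. In `tensorrt_llm.functional` the same composition inserts nodes into the TensorRT graph rather than computing numbers directly:

```python
# Numeric sketch of composing silu from the sigmoid primitive, mirroring
# how higher-level functional ops are built from basic graph operations.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # Built from the sigmoid building block.
    return x * sigmoid(x)

print(silu(0.0))           # → 0.0
print(round(silu(1.0), 4))  # sigmoid(1) ≈ 0.7311, so ≈ 0.7311
```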

### <mark style="color:blue;">Compilation</mark>

Once the model graph is defined, it needs to be *<mark style="color:yellow;">**compiled into an optimised TensorRT engine**</mark>*.

The <mark style="color:yellow;">**`tensorrt_llm.Builder`**</mark> class provides the <mark style="color:yellow;">**`build_engine`**</mark> method, which calls the <mark style="color:yellow;">**`build_serialized_network`**</mark> method of the <mark style="color:yellow;">**`tensorrt.Builder`**</mark> object.

During compilation, TensorRT performs several optimisations on the model graph:

* It chooses the *<mark style="color:yellow;">**best kernel**</mark>* for each operation based on the available GPU.
* It identifies patterns in the graph where *<mark style="color:yellow;">**multiple operations can be fused**</mark>* into a single kernel, reducing memory movement and kernel launch overhead.
* It compiles the graph of operations into a *<mark style="color:yellow;">**single CUDA Graph**</mark>* that can be launched efficiently.
* Complex layer fusions, like <mark style="color:blue;">**FlashAttention**</mark>, *<mark style="color:yellow;">**cannot be automatically discovered**</mark>* by TensorRT.  In such cases, <mark style="color:blue;">**explicit plugins**</mark> can be used to replace parts of the graph with custom kernels.
* The compilation process produces an instance of the <mark style="color:yellow;">**`tensorrt.IHostMemory`**</mark> class, which represents the optimised TensorRT engine.
* The compiled engine can be stored as a binary file for later use.

### <mark style="color:blue;">Weight Bindings</mark>

* TensorRT engines embed the network weights, which must be known during compilation.
* Before calling <mark style="color:yellow;">**`tensorrt_llm.Builder.build_engine`**</mark>, the weights must be bound to parameters in the model definition.
* This is done by assigning values to the <mark style="color:yellow;">**`Parameter`**</mark> objects exposed by the model's layers.
* TensorRT also supports refitting engines to update weights after compilation using the <mark style="color:yellow;">**`refit_engine`**</mark> method in <mark style="color:yellow;">**`tensorrt_llm.Builder`**</mark><mark style="color:yellow;">**.**</mark>

### <mark style="color:blue;">Pattern-Matching and Fusion</mark>

* TensorRT performs pattern-matching and fusion during the compilation process to optimise the model execution.
* Fusion helps *<mark style="color:yellow;">**reduce data transfer between memory and compute cores**</mark>* and removes kernel launch overhead.
* TensorRT identifies sequences of operations that can be fused and automatically generates efficient GPU kernels for them.
* For example, a sequence of <mark style="color:yellow;">**`matmul`**</mark> followed by <mark style="color:yellow;">**`relu`**</mark> can be fused into a single kernel, avoiding intermediate memory writes and reads.
* TensorRT's pattern-matching algorithm is powerful but may not identify all possible fusions, especially for uncommon or advanced patterns.

### <mark style="color:blue;">Plugins</mark>

* Plugins are a mechanism in TensorRT to extend its functionality with custom GPU kernels.
* They are *<mark style="color:yellow;">**inserted into the network graph definition**</mark>* and map to user-defined kernels written in C++.
* Plugins follow a well-defined interface described in the TensorRT Developer Guide.
* TensorRT-LLM uses several plugins, located in the <mark style="color:yellow;">**`cpp/tensorrt_llm/plugins`**</mark> directory.
* Plugins are useful for implementing complex operations or fusions that cannot be automatically discovered by TensorRT, such as the GPT Attention operator.
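
The dispatch pattern behind plugins can be mirrored in a toy form. Real TensorRT plugins are C++ classes implementing a defined interface; this hypothetical sketch only shows the idea of routing unsupported ops to user-registered kernels:

```python
# Toy sketch of plugin dispatch: built-in ops go to native kernels,
# unknown ops to user-registered plugins. All names are illustrative.

BUILTIN_KERNELS = {
    "relu": lambda xs: [max(x, 0.0) for x in xs],
}

PLUGINS = {}

def register_plugin(op_name, kernel):
    # Make a user-defined kernel visible to the executor.
    PLUGINS[op_name] = kernel

def run_op(op_name, xs):
    if op_name in BUILTIN_KERNELS:
        return BUILTIN_KERNELS[op_name](xs)
    if op_name in PLUGINS:
        return PLUGINS[op_name](xs)
    raise KeyError(f"no kernel for op: {op_name}")

# An op the "engine" does not support natively, supplied as a plugin:
register_plugin("squared_relu", lambda xs: [max(x, 0.0) ** 2 for x in xs])

print(run_op("relu", [-1.0, 2.0]))          # → [0.0, 2.0]
print(run_op("squared_relu", [-1.0, 2.0]))  # → [0.0, 4.0]
```

From the graph's point of view, a plugin node looks like any other operation; only its kernel comes from the user.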

### <mark style="color:blue;">Runtime</mark>

* TensorRT-LLM includes Python and C++ runtime components.
* The runtime components *<mark style="color:yellow;">**load the TensorRT engines and drive their execution**</mark>*.
* For auto-regressive models like GPT, the runtime loads the engine that processes the input sequence and handles the generation loop.
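
That generation loop can be sketched with a stand-in engine function (not a real TensorRT engine): run the engine on the current sequence, append the new token, and repeat until an end token or a length limit:

```python
# Toy sketch of the runtime's auto-regressive generation loop.
# `engine` here is any callable mapping a token sequence to the next token.

def generate(engine, prompt_tokens, max_new_tokens, end_token):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = engine(tokens)  # one engine invocation per step
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens

# Stand-in engine: emits increasing token ids, then the end token.
def toy_engine(tokens):
    return tokens[-1] + 1 if tokens[-1] < 4 else 0

print(generate(toy_engine, [1], max_new_tokens=10, end_token=0))
# → [1, 2, 3, 4, 0]
```

The real runtime adds much more at each step (KV-cache management, batching, sampling), but the loop structure is the same.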

### <mark style="color:blue;">Multi-GPU and Multi-Node Support</mark>

* TensorRT-LLM extends TensorRT's single-GPU design to support multiple GPUs and nodes.
* It uses TensorRT plugins that wrap communication primitives from the [<mark style="color:blue;">**NCCL library**</mark>](https://tensorrt-llm.continuumlabs.ai/nccl) and a custom All-Reduce plugin.
* The communication plugins are found in <mark style="color:yellow;">**`cpp/tensorrt_llm/plugins/ncclPlugin`**</mark>.
* Multi-GPU functions like <mark style="color:yellow;">**`allreduce`**</mark><mark style="color:yellow;">**,**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`allgather`**</mark><mark style="color:yellow;">**,**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`send`**</mark>, and <mark style="color:yellow;">**`recv`**</mark> are exposed in the TensorRT-LLM Python API.
* Two modes of model parallelism are supported: <mark style="color:blue;">**Tensor Parallelism**</mark> and <mark style="color:blue;">**Pipeline Parallelism**</mark>.
* <mark style="color:blue;">**Tensor Parallelism**</mark> splits layers across GPUs, with each GPU running the entire network and synchronising as needed.
* <mark style="color:blue;">**Pipeline Parallelism**</mark> distributes layers to GPUs, with each GPU running a subset of the model and communicating at layer boundaries.
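
The Tensor Parallelism idea can be simulated in a single process: split a layer's weight matrix column-wise across "GPUs", let each rank compute its slice, then concatenate the slices (in practice via an all-gather). Pure-Python stand-in; real TP runs each shard on a different device:

```python
# Single-process sketch of tensor-parallel column splitting of a weight
# matrix. All helper names are illustrative.

def vecmat(vec, matrix):
    # y[j] = sum_i vec[i] * matrix[i][j]
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*matrix)]

def split_columns(matrix, n_ranks):
    # Each rank gets a contiguous slice of every row (a column block).
    width = len(matrix[0]) // n_ranks
    return [[row[r * width:(r + 1) * width] for row in matrix]
            for r in range(n_ranks)]

W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 1.0]

shards = split_columns(W, n_ranks=2)
partial = [vecmat(x, shard) for shard in shards]  # each "GPU" computes its block
gathered = [v for part in partial for v in part]  # all-gather concatenates

print(gathered == vecmat(x, W))  # → True
```

Each rank only stores and multiplies its block of the weights, which is how TP fits a model that no single GPU could hold.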

### <mark style="color:blue;">Summary</mark>

The TensorRT-LLM architecture provides a framework for defining, compiling, and executing LLMs efficiently using TensorRT.&#x20;

It is important to note that the effectiveness of the TensorRT-LLM architecture depends on several factors:

* The <mark style="color:yellow;">**quality of the model definition**</mark> and the choice of appropriate layers and operations
* The ability to leverage TensorRT's <mark style="color:yellow;">**pattern-matching and fusion capabilities**</mark> effectively
* The <mark style="color:yellow;">**use of plugins**</mark> for complex operations or fusions that cannot be automatically discovered
* The <mark style="color:yellow;">**optimisation of the C++ runtime**</mark> for the specific LLM architecture and deployment scenario
* The careful <mark style="color:yellow;">**consideration of multi-GPU and multi-node configurations**</mark> based on the model size, available resources, and performance requirements

Additionally, it's crucial to benchmark and profile the model performance using TensorRT-LLM to identify potential bottlenecks and optimise accordingly.&#x20;

Experimenting with different optimisation techniques, such as quantisation or different precisions, can further improve the model's efficiency.

By leveraging the capabilities of TensorRT and extending it with custom plugins and runtime optimisations, TensorRT-LLM enables the deployment of large-scale language models in various scenarios, from local execution to multi-GPU and multi-node configurations.
