Compilation
Compilation Process in TensorRT-LLM
The compilation process in TensorRT-LLM transforms a populated tensorrt.INetworkDefinition instance into an optimised TensorRT engine. This process is orchestrated by the tensorrt_llm.Builder class, which provides a high-level interface for building and optimising the engine.
At the core of the compilation process is the build_engine member function of the tensorrt_llm.Builder class. This function takes the populated INetworkDefinition instance and various build configurations as input, then invokes the build_serialized_network method of the underlying tensorrt.Builder object to compile the network into an efficient engine.
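As a concrete illustration, the end-to-end build flow looks roughly like the sketch below. It assumes the TensorRT-LLM Python API; the model name and configuration parameters are illustrative, and exact signatures vary between releases.

```python
from tensorrt_llm import Builder

builder = Builder()

# Build settings are illustrative; create_builder_config forwards most of
# them to the underlying tensorrt.IBuilderConfig.
builder_config = builder.create_builder_config(
    name='example_model',
    precision='float16',
)

network = builder.create_network()
# ... populate `network` with the model's layers, e.g. through the
# tensorrt_llm.functional API ...

# build_engine invokes tensorrt.Builder.build_serialized_network internally
# and returns a tensorrt.IHostMemory holding the serialized engine.
engine = builder.build_engine(network, builder_config)
```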
During compilation, the TensorRT compiler applies several optimisations to improve engine performance. It analyses the graph of operations and, for each operation, selects the most suitable kernel for the target GPU architecture.
Furthermore, the compiler identifies patterns in the graph where multiple operations can be fused into a single kernel. This fusion process reduces memory movement and minimises the overhead of launching multiple GPU kernels, resulting in improved execution speed.
One of the key advantages of the TensorRT compiler is its ability to compile the graph of operations into a single CUDA Graph. The whole sequence of kernels can then be launched with one call, further reducing launch overhead and maximising performance.
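TensorRT-LLM's runtime handles this capture and replay internally, but the underlying mechanism can be sketched with the cuda-python bindings. In the hypothetical snippet below, `context` is assumed to be an already-prepared tensorrt.IExecutionContext with its input and output buffers bound.

```python
from cuda import cudart

_, stream = cudart.cudaStreamCreate()

# Warm-up run outside capture (new allocations are not allowed mid-capture).
context.execute_async_v3(int(stream))
cudart.cudaStreamSynchronize(stream)

# Record the sequence of kernel launches into a graph ...
cudart.cudaStreamBeginCapture(
    stream, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
context.execute_async_v3(int(stream))
_, graph = cudart.cudaStreamEndCapture(stream)

# ... then instantiate it and replay the entire graph with a single launch.
_, graph_exec = cudart.cudaGraphInstantiateWithFlags(graph, 0)
cudart.cudaGraphLaunch(graph_exec, stream)
cudart.cudaStreamSynchronize(stream)
```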
However, there are certain complex layer fusions, such as FlashAttention, that involve intricate interleaving of operations and cannot be automatically discovered by the TensorRT compiler.
In such cases, TensorRT-LLM provides the flexibility to explicitly replace parts of the graph with plugins at compile time. These plugins are pre-built and optimised implementations of specific operations that can be seamlessly integrated into the compilation process.
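For example, the attention subgraph can be routed to a fused plugin through the network's plugin configuration. The snippet below is a sketch based on the TensorRT-LLM Python API; the exact plugin_config method names vary between releases.

```python
# Assuming `builder` is a tensorrt_llm.Builder instance as above.
network = builder.create_network()

# Replace the attention subgraph with the fused GPT attention plugin,
# a FlashAttention-style kernel TensorRT cannot derive on its own.
network.plugin_config.set_gpt_attention_plugin(dtype='float16')

# GEMMs can likewise be routed to a plugin implementation.
network.plugin_config.set_gemm_plugin(dtype='float16')
```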
If the compilation process completes successfully, the build_engine function returns an instance of the tensorrt.IHostMemory class. This object represents the optimised TensorRT engine, ready for execution. The engine can be serialised and stored as a binary file for later use, enabling efficient deployment and inference.
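Writing the engine to disk and reloading it later is straightforward, since tensorrt.IHostMemory supports the Python buffer protocol. A minimal sketch, with an illustrative file name:

```python
import tensorrt as trt

# `engine` is the tensorrt.IHostMemory returned by build_engine; it can be
# written to disk directly.
with open('model.engine', 'wb') as f:
    f.write(engine)

# Later, deserialize the stored engine for inference.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open('model.engine', 'rb') as f:
    cuda_engine = runtime.deserialize_cuda_engine(f.read())
```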
It's important to note that the compilation process in TensorRT-LLM is highly configurable. The tensorrt_llm.Builder class provides various options to customise the build settings, such as precision, quantization, and optimisation level. These settings can be adjusted based on the specific requirements of the LLM task and the target hardware.
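A sketch of such customisation, with parameter names assumed from common TensorRT-LLM example scripts (they differ between releases):

```python
builder_config = builder.create_builder_config(
    name='example_model',
    precision='float16',  # compute precision for the engine
    int8=True,            # enable INT8 quantization paths
    opt_level=3,          # forwarded to TensorRT's builder optimisation level
)
```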
In summary, the compilation process in TensorRT-LLM leverages the powerful TensorRT compiler to optimise the graph of operations and generate an efficient engine.
Through layer fusion, kernel selection, and CUDA Graph compilation, TensorRT-LLM achieves significant performance improvements for Large Language Model inference.
The flexibility to incorporate plugins for complex layer fusions further enhances the capabilities of the compilation process.