TensorRT-LLM Architecture and Process

The TensorRT-LLM process

TensorRT-LLM is a toolkit that streamlines the deployment of large language models (LLMs) for efficient inference.

It provides a Python API that abstracts away the complexities of working with the low-level TensorRT API while still leveraging its powerful optimisation capabilities.

Model Definition

The TensorRT-LLM Python API serves as an interface for defining the architecture of your neural language model.

You can think of it as a high-level description of your neural network, specifying the layers, activations, and connections between them.

Under the hood, this API translates your model definition into a graph representation using the TensorRT API.
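
As a rough illustration, the sketch below defines a toy two-tensor graph with the Python API. It is loosely modelled on the older TensorRT-LLM example scripts; exact module paths and signatures (for instance where net_guard lives) vary between releases, so treat it as a sketch rather than a copy-paste recipe.

```python
# Sketch: defining a tiny graph with the TensorRT-LLM Python API.
import tensorrt as trt
from tensorrt_llm import Builder
from tensorrt_llm.network import net_guard
from tensorrt_llm.functional import Tensor, matmul, relu

builder = Builder()                    # wraps a tensorrt.Builder
network = builder.create_network()     # wraps a tensorrt.INetworkDefinition

with net_guard(network):               # make this network the active graph
    # Declaring Tensors registers network inputs; the free functions below
    # insert layers (nodes) into the underlying INetworkDefinition.
    x = Tensor(name="x", dtype=trt.float16, shape=[1, 16, 64])
    w = Tensor(name="w", dtype=trt.float16, shape=[1, 64, 64])

    y = relu(matmul(x, w))             # each call adds graph nodes
    y.mark_output("y", trt.float16)    # expose the result as a network output
```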

Compilation

Once you have defined your model, the next step is to compile it into an optimised inference engine.

This is where the power of TensorRT comes into play.

The compilation process takes your model definition and applies various optimisations to generate an efficient execution plan. Think of it as a way to transform your high-level model description into a highly optimised form that can run efficiently on the target GPU hardware.
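
A hedged sketch of that compilation step is shown below, using the tensorrt_llm.Builder API this page describes; the keyword arguments accepted by create_builder_config differ between TensorRT-LLM releases.

```python
# Sketch: compiling a populated network into an optimised TensorRT engine.
from tensorrt_llm import Builder

builder = Builder()
network = builder.create_network()
# ... populate `network` as in the model-definition sketch above ...

builder_config = builder.create_builder_config(
    name="toy_model",       # stored as engine metadata
    precision="float16",    # target precision for kernel selection
)

# build_engine drives TensorRT's optimiser (kernel selection, layer fusion,
# memory planning) and returns the serialised engine buffer.
engine = builder.build_engine(network, builder_config)
```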

Weight Bindings

During compilation, TensorRT needs to know the values of the model's weights. This is where weight bindings come in.

You assign the trained weights to the corresponding parameters in your model definition. These weights are then embedded into the compiled TensorRT engine, allowing it to perform inference with the trained parameters.
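
In the Python API this is simply an assignment to each layer's Parameter objects before building. The sketch below is illustrative only: `model` stands in for a TensorRT-LLM model object, and the weight names are hypothetical placeholders for whatever your checkpoint exports.

```python
import numpy as np

# Hypothetical weights, normally exported from a training-framework checkpoint.
checkpoint = {
    "embedding.weight": np.zeros((32000, 4096), dtype=np.float16),
    "mlp.fc.weight":    np.zeros((11008, 4096), dtype=np.float16),
}

# `model` is a placeholder for a TensorRT-LLM model built from its layer classes.
# Assigning to a Parameter's .value records the data so that build_engine can
# embed it into the compiled engine.
model.vocab_embedding.weight.value = np.ascontiguousarray(checkpoint["embedding.weight"])
model.layers[0].mlp.fc.weight.value = np.ascontiguousarray(checkpoint["mlp.fc.weight"])
```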

Pattern-Matching and Fusion

One of the key optimisations performed by TensorRT during compilation is operation fusion.

It analyses the computational graph of your model and identifies patterns that can be fused together into a single, more efficient operation.

For example, a matrix multiplication followed by an activation function can be fused into a single kernel. This reduces memory transfers and kernel launch overhead, leading to faster execution.
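
From the API side there is nothing special to write: you express the operations back-to-back and the compiler decides whether to fuse them, as in this small sketch.

```python
# Sketch: a matmul -> relu sequence written with tensorrt_llm.functional.
# During engine build, TensorRT can pattern-match this pair and emit a single
# fused kernel, avoiding the round trip of the intermediate result through
# global memory and one extra kernel launch.
from tensorrt_llm.functional import matmul, relu

def projection_with_relu(x, w):
    return relu(matmul(x, w))
```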

Plugins

TensorRT-LLM introduces the concept of plugins, which are user-defined kernels that can be seamlessly integrated into the model graph.

Plugins allow you to extend the functionality of TensorRT by implementing custom operations that may not be natively supported. This flexibility is particularly useful for handling advanced or domain-specific operations in LLMs.
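
Plugins are typically switched on through the network's plugin configuration before building. The sketch below follows the pattern used by older TensorRT-LLM example build scripts; newer releases expose the same options as plain attributes or as trtllm-build flags, so the exact method names are an assumption to verify against your version.

```python
# Sketch: enabling TensorRT-LLM plugins on a network before compilation.
# `network` is the object returned by builder.create_network() in the
# model-definition sketch. Older releases expose set_* helpers on
# plugin_config; newer ones use attributes such as
# plugin_config.gpt_attention_plugin = "float16".
network.plugin_config.set_gpt_attention_plugin(dtype="float16")  # fused attention kernel
network.plugin_config.set_gemm_plugin(dtype="float16")           # custom GEMM kernel
```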

Runtime

Once your model is compiled into a TensorRT engine, you need a runtime environment to execute it.

TensorRT-LLM provides a runtime API in both Python and C++ that facilitates loading the engine and running inference. The runtime handles the execution flow, including feeding inputs, running the model, and retrieving outputs. It abstracts away the low-level details, making it easier to integrate the engine into your application.
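
The sketch below shows the Python side of that flow with tensorrt_llm.runtime.ModelRunner, roughly as used by the examples/run.py script; the engine directory and model name are assumptions, and the generate keyword arguments vary slightly by release.

```python
# Sketch: loading a built engine directory and running generation from Python.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model
runner = ModelRunner.from_dir(engine_dir="./llama2_engine")            # assumed path

ids = tokenizer.encode("The capital of France is", return_tensors="pt").int().squeeze(0)
outputs = runner.generate(batch_input_ids=[ids],
                          max_new_tokens=32,
                          end_id=tokenizer.eos_token_id,
                          pad_id=tokenizer.eos_token_id)

# outputs is shaped [batch, beams, tokens]; decode the first beam of the first request.
print(tokenizer.decode(outputs[0][0].tolist()))
```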

Multi-GPU and Multi-Node Support

TensorRT-LLM goes beyond single-GPU execution by enabling multi-GPU and multi-node support.

It leverages TensorRT plugins that wrap communication primitives from the NCCL library, together with a custom All-Reduce plugin, to facilitate efficient data exchange between multiple GPUs or nodes. This allows you to distribute the workload and scale up the inference performance of your LLM.
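
A minimal sketch of how the parallel layout is described on the Python side is shown below; tensorrt_llm.Mapping holds the world, tensor-parallel, and pipeline-parallel sizes, and one build/run process is launched per rank (typically under mpirun).

```python
# Sketch: describing a 4-GPU layout with 2-way tensor parallelism and
# 2-way pipeline parallelism. One process per rank is launched (e.g. with
# mpirun), and each rank builds and loads its own engine shard.
from tensorrt_llm import Mapping

mapping = Mapping(world_size=4, rank=0, tp_size=2, pp_size=2)
# Rank 0 holds one tensor-parallel shard of the first pipeline stage; the NCCL
# plugins (allreduce, allgather, send, recv) perform the exchanges at runtime.
```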

In-flight Batching

To further optimise throughput, TensorRT-LLM introduces the concept of in-flight batching.

It enables the runtime to batch multiple inference requests together, allowing for more efficient utilisation of GPU resources. The Batch Manager component handles this functionality, transparently batching requests and dispatching them to the engine for execution.
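
The real Batch Manager lives in the C++ runtime (and the Triton Inference Server backend); the toy loop below is purely conceptual, but it illustrates the scheduling idea: finished requests release their slots immediately, and queued requests join the running batch mid-generation.

```python
# Conceptual illustration of in-flight (continuous) batching -- not the real
# Batch Manager API.
from collections import deque
import random

MAX_SLOTS = 4
pending = deque(f"req-{i}" for i in range(10))   # requests waiting to start
active = {}                                      # request -> tokens left to generate

while pending or active:
    # Fill free slots immediately: new requests join mid-generation instead of
    # waiting for the whole previous batch to drain.
    while pending and len(active) < MAX_SLOTS:
        active[pending.popleft()] = random.randint(1, 8)  # fake output length

    # One batched decoding step for every active request (done on the GPU by the engine).
    for request in list(active):
        active[request] -= 1
        if active[request] == 0:   # finished: its slot frees up for the next request
            del active[request]
```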

Summary

By understanding these components and their interactions, you can appreciate how TensorRT-LLM simplifies the process of deploying LLMs for efficient inference.

It provides a high-level API for model definition, leverages TensorRT's optimisations during compilation, offers flexibility through plugins, and delivers a runtime environment for seamless execution.

This architecture empowers you to focus on the high-level aspects of your LLM while benefiting from the performance optimisations provided by TensorRT under the hood.

Technical Summary of Each Component

Model Definition

  • TensorRT-LLM provides a Python API to define Large Language Models (LLMs).

  • The TensorRT Python API creates graph representations of deep neural networks.

  • The tensorrt_llm.Builder class contains a tensorrt.Builder object used to create an instance of tensorrt.INetworkDefinition.

  • The INetworkDefinition object is populated using free functions from tensorrt_llm.functional.

  • These functions, like activation, relu, sigmoid, etc., insert nodes into the model's graph.

  • Higher-level functions can be composed from these basic building blocks, such as the silu activation (see the sketch after this list).

  • The resulting graph represents the network and can be traversed or transformed using the tensorrt.ILayer class.
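
For instance, silu in tensorrt_llm.functional is essentially the composition of sigmoid and an elementwise multiply, along the lines of this sketch.

```python
# Sketch: composing a higher-level activation from the basic free functions.
from tensorrt_llm.functional import sigmoid

def silu(x):
    # x * sigmoid(x); the '*' operator on graph Tensors inserts an
    # elementwise-multiply node into the INetworkDefinition.
    return x * sigmoid(x)
```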

Compilation

Once the model graph is defined, it needs to be compiled into an optimised TensorRT engine.

The tensorrt_llm.Builder class provides the build_engine method, which calls the build_serialized_network method of the tensorrt.Builder object.

During compilation, TensorRT performs several optimisations on the model graph:

  • It chooses the best kernel for each operation based on the available GPU.

  • It identifies patterns in the graph where multiple operations can be fused into a single kernel, reducing memory movement and kernel launch overhead.

  • It compiles the graph of operations into a single CUDA Graph that can be launched efficiently.

  • Complex layer fusions, like FlashAttention, cannot be automatically discovered by TensorRT. In such cases, explicit plugins can be used to replace parts of the graph with custom kernels.

  • The compilation process produces an instance of the tensorrt.IHostMemory class, which represents the optimised TensorRT engine.

  • The compiled engine can be stored as a binary file for later use (sketched below).
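
Continuing the compilation sketch above, persisting the serialised engine can be as simple as writing the buffer to disk; the output path here is an arbitrary example.

```python
# Sketch: saving the serialised engine (a tensorrt.IHostMemory-backed buffer)
# returned by build_engine so the runtime can reload it later.
import os

engine = builder.build_engine(network, builder_config)   # from the compilation sketch

os.makedirs("engines", exist_ok=True)
with open("engines/toy_model.engine", "wb") as f:
    f.write(engine)   # IHostMemory supports the buffer protocol
```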

Weight Bindings

  • TensorRT engines embed the network weights, which must be known during compilation.

  • Before calling tensorrt_llm.Builder.build_engine, the weights must be bound to parameters in the model definition.

  • This is done by assigning values to the Parameter objects exposed by the model's layers.

  • TensorRT also supports refitting engines to update weights after compilation using the refit_engine method in tensorrt_llm.Builder.

Pattern-Matching and Fusion

  • TensorRT performs pattern-matching and fusion during the compilation process to optimise the model execution.

  • Fusion helps reduce data transfer between memory and compute cores and removes kernel launch overhead.

  • TensorRT identifies sequences of operations that can be fused and automatically generates efficient GPU kernels for them.

  • For example, a sequence of matmul followed by relu can be fused into a single kernel, avoiding intermediate memory writes and reads.

  • TensorRT's pattern-matching algorithm is powerful but may not identify all possible fusions, especially for uncommon or advanced patterns.

Plugins

  • Plugins are a mechanism in TensorRT to extend its functionality with custom GPU kernels.

  • They are inserted into the network graph definition and map to user-defined kernels written in C++.

  • Plugins follow a well-defined interface described in the TensorRT Developer Guide.

  • TensorRT-LLM uses several plugins, located in the cpp/tensorrt_llm/plugins directory.

  • Plugins are useful for implementing complex operations or fusions that cannot be automatically discovered by TensorRT, such as the GPT Attention operator.

Runtime

  • TensorRT-LLM includes an API to implement Python and C++ runtimes.

  • The runtime components load the TensorRT engines and drive their execution.

  • For auto-regressive models like GPT, the runtime loads the engine that processes the input sequence and handles the generation loop.

Multi-GPU and Multi-Node Support

  • TensorRT-LLM extends TensorRT's single-GPU design to support multiple GPUs and nodes.

  • The communication plugins are found in cpp/tensorrt_llm/plugins/ncclPlugin.

  • Multi-GPU functions like allreduce, allgather, send, and recv are exposed in the TensorRT-LLM Python API.

  • Two modes of model parallelism are supported: Tensor Parallelism and Pipeline Parallelism.

  • Tensor Parallelism splits each layer across GPUs: every GPU runs the whole network but holds only its shard of each layer's weights, synchronising as needed.

  • Pipeline Parallelism distributes whole layers to different GPUs, with each GPU running a subset of the model and communicating at stage boundaries.

Summary

The TensorRT-LLM architecture provides a framework for defining, compiling, and executing LLMs efficiently using TensorRT.

It is important to note that the effectiveness of the TensorRT-LLM architecture depends on several factors:

  • The quality of the model definition and the choice of appropriate layers and operations.

  • The ability to leverage TensorRT's pattern-matching and fusion capabilities effectively.

  • The use of plugins for complex operations or fusions that cannot be automatically discovered.

  • The optimisation of the C++ runtime for the specific LLM architecture and deployment scenario.

  • The careful consideration of multi-GPU and multi-node configurations based on the model size, available resources, and performance requirements.

Additionally, it's crucial to benchmark and profile the model performance using TensorRT-LLM to identify potential bottlenecks and optimise accordingly.

Experimenting with different optimisation techniques, such as quantisation or different numerical precisions, can further improve the model's efficiency.

By leveraging the capabilities of TensorRT and extending it with custom plugins and runtime optimisations, TensorRT-LLM enables the deployment of large-scale language models in various scenarios, from local execution to multi-GPU and multi-node configurations.
