Compilation

Compilation Process in TensorRT-LLM

The compilation process in TensorRT-LLM involves transforming a populated tensorrt.INetworkDefinition instance into an optimised TensorRT engine.

This process is orchestrated by the tensorrt_llm.Builder class, which provides a high-level interface for building and optimising the engine.

At the core of the compilation process is the build_engine member function of the tensorrt_llm.Builder class.

This function takes the populated INetworkDefinition instance and various build configurations as input.

It then invokes the build_serialized_network method of the underlying tensorrt.Builder object to compile the network into an efficient engine.
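As a rough sketch of how these pieces fit together (the keyword arguments of create_builder_config and the surrounding helper code vary between TensorRT-LLM releases, so treat the names below as assumptions rather than a definitive API reference):

```python
import tensorrt_llm
from tensorrt_llm.builder import Builder

# Hedged sketch: build an engine from an already-populated network.
builder = Builder()

# Build settings; the keyword names here are assumptions and differ
# across TensorRT-LLM versions.
builder_config = builder.create_builder_config(
    name="llama",
    precision="float16",
)

# `network` wraps the tensorrt.INetworkDefinition populated during the
# model definition step described in the previous section.
network = builder.create_network()
# ... populate `network` with the model's layers, inputs and outputs ...

# build_engine invokes tensorrt.Builder.build_serialized_network internally
# and returns a tensorrt.IHostMemory holding the compiled engine.
engine = builder.build_engine(network, builder_config)
```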

During the compilation process, the TensorRT compiler performs several optimisations to enhance the performance of the engine.

It analyses the graph of operations and selects the most suitable kernel implementation for each operation based on the target GPU architecture.

Furthermore, the compiler identifies patterns in the graph where multiple operations can be fused into a single kernel. This fusion process reduces memory movement and minimises the overhead of launching multiple GPU kernels, resulting in improved execution speed.

One of the key advantages of the TensorRT compiler is its ability to compile the graph of operations into a single CUDA Graph.

This allows the entire graph to be launched in a single operation, further reducing the kernel launch overhead and maximising performance.
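To illustrate the idea behind CUDA Graphs independently of TensorRT (this is a conceptual PyTorch sketch, not TensorRT-LLM code), a sequence of kernels is captured once and then replayed with a single launch call:

```python
import torch

# Conceptual CUDA Graph capture/replay; TensorRT performs the equivalent
# step internally when it compiles the engine.
static_x = torch.randn(1024, 1024, device="cuda")

# Warm-up on a side stream so capture starts from a clean state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_y = torch.relu(static_x @ static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole kernel sequence into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = torch.relu(static_x @ static_x)

# Replay launches every captured kernel with one call,
# avoiding per-kernel launch overhead.
static_x.copy_(torch.randn(1024, 1024, device="cuda"))
g.replay()
```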

However, there are certain complex layer fusions, such as FlashAttention, that involve intricate interleaving of operations and cannot be automatically discovered by the TensorRT compiler.

In such cases, TensorRT-LLM provides the flexibility to explicitly replace parts of the graph with plugins at compile time. These plugins are pre-built and optimised implementations of specific operations that can be seamlessly integrated into the compilation process.
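As an illustrative sketch, plugins are typically enabled on the network's plugin configuration before building. The method names below follow older TensorRT-LLM releases; newer versions expose the same switches as attributes or as trtllm-build CLI flags such as --gpt_attention_plugin, so treat them as assumptions:

```python
# Hedged sketch: swap in the attention and GEMM plugins at compile time.
network = builder.create_network()
network.plugin_config.set_gpt_attention_plugin(dtype="float16")
network.plugin_config.set_gemm_plugin(dtype="float16")
```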

If the compilation process completes successfully, the build_engine function returns an instance of the tensorrt.IHostMemory class. This object represents the optimised TensorRT engine, ready for execution. The engine can be serialised and stored as a binary file for later use, enabling efficient deployment and inference.
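Since the returned IHostMemory object exposes the Python buffer protocol, writing the engine to disk is a one-liner; the file name below is only a hypothetical example:

```python
# Hedged sketch: persist the compiled engine for later deployment.
# `engine` is the tensorrt.IHostMemory returned by build_engine.
with open("llama_float16_tp1_rank0.engine", "wb") as f:
    f.write(bytearray(engine))
```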

It's important to note that the compilation process in TensorRT-LLM is highly configurable.

The tensorrt_llm.Builder class provides various options to customise the build settings, such as precision, quantization, and optimisation level. These settings can be adjusted based on the specific requirements of the LLM task and the target hardware.
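For example, something like the following (the keyword names are assumptions and vary across releases) selects half precision, enables INT8 paths and reuses a kernel timing cache between builds:

```python
# Hedged sketch of customising build settings.
builder_config = builder.create_builder_config(
    name="llama",
    precision="float16",         # compute precision of the generated kernels
    int8=True,                   # enable INT8 (e.g. weight-only) code paths
    timing_cache="model.cache",  # reuse kernel auto-tuning results across builds
)
```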

In summary, the compilation process in TensorRT-LLM leverages the powerful TensorRT compiler to optimise the graph of operations and generate an efficient engine.

Through layer fusion, kernel selection, and CUDA Graph compilation, TensorRT-LLM achieves significant performance improvements for Large Language Model inference.

The flexibility to incorporate plugins for complex layer fusions further enhances the capabilities of the compilation process.
