Weight Bindings

Weight bindings refer to the process of assigning trained model weights to the corresponding parameters in the TensorRT-LLM model definition before compiling the TensorRT engine.

This is necessary because TensorRT engines embed the network weights, which must be known at the time of compilation.

The process is as follows:

Model Definition

When defining a model using the TensorRT-LLM Python API, you create instances of various layers and modules, such as the Linear layer. These layers and modules have parameters that represent the learnable weights of the model.

Parameter Definition

In the model definition, you define the parameters for each layer or module.

For example, in the Linear layer, you define the weight and bias parameters using the Parameter class. You specify the shape and data type of these parameters based on the layer's configuration.
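
As an illustration, the sketch below shows how a Linear-style layer might declare its weight and bias using the Parameter class. The import paths and constructor arguments follow the tensorrt_llm Python API, but they can differ between versions, so treat this as a sketch rather than a definitive listing.

# Sketch: declaring learnable parameters in a custom TensorRT-LLM module.
# Import paths and constructor arguments may vary between versions.
from tensorrt_llm.module import Module
from tensorrt_llm.parameter import Parameter

class MyLinear(Module):
    def __init__(self, in_features, out_features, dtype='float16'):
        super().__init__()
        # Parameters are declared with a shape and dtype only; their values
        # are bound later, before the engine is compiled.
        self.weight = Parameter(shape=(out_features, in_features), dtype=dtype)
        self.bias = Parameter(shape=(out_features,), dtype=dtype)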

Weight Loading

After defining the model architecture, you need to load the trained weights from a checkpoint or a pre-trained model. These weights are typically stored in a file format specific to the training framework, such as PyTorch or TensorFlow.
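
For example, assuming the checkpoint was saved with PyTorch, the weights can be read into NumPy arrays ready for binding. The file name and target precision below are illustrative.

# Sketch: loading a PyTorch checkpoint and converting the tensors to NumPy
# arrays in the precision the engine will be built with (float16 here).
# The checkpoint path and tensor names are illustrative.
import torch

state_dict = torch.load('model_checkpoint.pt', map_location='cpu')
weights = {name: t.to(torch.float16).numpy() for name, t in state_dict.items()}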

Weight Binding

To bind the loaded weights to the model parameters, you assign the weight values to the corresponding parameter attributes in the model definition.

This is done by accessing the value attribute of each parameter and assigning the loaded weight data to it. For example, in the following snippet:

tensorrt_llm_gpt.layers[i].mlp.fc.weight.value = fromfile(...)
tensorrt_llm_gpt.layers[i].mlp.fc.bias.value   = fromfile(...)

The fromfile function is used to load the weight data from a file, and the loaded data is assigned to the value attribute of the weight and bias parameters of the fully connected (FC) layer in the MLP module of the GPT model.
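
If the weights are already in memory, for example in the NumPy dictionary loaded in the previous sketch, the same binding can be expressed without fromfile. The checkpoint key names used below are illustrative and depend on how the checkpoint was saved.

# Sketch: binding in-memory NumPy arrays to the model's parameters.
# The checkpoint key names are illustrative.
for i, layer in enumerate(tensorrt_llm_gpt.layers):
    layer.mlp.fc.weight.value = weights[f'transformer.h.{i}.mlp.fc.weight']
    layer.mlp.fc.bias.value = weights[f'transformer.h.{i}.mlp.fc.bias']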

Engine Compilation

After binding the weights to the model parameters, you can proceed with building the TensorRT engine using the tensorrt_llm.Builder.build_engine function.

During the engine compilation process, TensorRT takes the model definition along with the bound weights and optimises the computation graph for efficient execution on the target GPU.
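
A minimal sketch of this step is shown below. The builder-config arguments are illustrative; the full set depends on the model and the TensorRT-LLM version.

# Sketch: compiling the network into a TensorRT engine once weights are bound.
# The builder-config arguments shown are illustrative.
import tensorrt_llm

builder = tensorrt_llm.Builder()
builder_config = builder.create_builder_config(name='gpt', precision='float16')

# `network` is the populated network definition with its bound weights.
engine = builder.build_engine(network, builder_config)

# The serialized engine can then be written to disk for the runtime to load.
with open('gpt_float16.engine', 'wb') as f:
    f.write(engine)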

Weight Refitting (Optional)

TensorRT also supports refitting engines with updated weights after compilation.

This feature is available in TensorRT-LLM through the refit_engine method in the tensorrt_llm.Builder class.

Refitting allows you to update the weights of an existing engine without the need to recompile the entire engine from scratch, which can save time in certain scenarios.
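
Continuing from the compilation sketch above, refitting might look like the following. The refit_engine call shape here is an assumption and should be checked against the Builder class in your TensorRT-LLM version.

# Sketch: refitting an already-built engine with updated weights.
# The refit_engine call shape is an assumption; check your version's API.
tensorrt_llm_gpt.layers[0].mlp.fc.weight.value = updated_fc_weight  # re-bind new values
refitted_engine = builder.refit_engine(network, engine)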

By binding the weights to the model parameters before compiling the TensorRT engine, you ensure that the engine has access to the trained weights and can perform inference accurately.

The weight binding process bridges the gap between the model definition and the trained weights, allowing TensorRT to optimise the computation graph and generate an efficient engine for execution.
