INetworkDefinition

The INetworkDefinition is a component in the TensorRT workflow that defines the structure and layers of a neural network.

It serves as a high-level representation of the network and allows you to specify the input and output tensors, as well as the various operations and layers that make up the network.

Network Creation

  • You can create an empty INetworkDefinition using the TensorRT Builder.

  • The INetworkDefinition is populated either by using a parser (e.g., parsing a pre-trained model) or by manually adding layers and operations using the TensorRT Network API.
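
For example, a minimal sketch of creating an empty network with the TensorRT Python API (the logger severity and the explicit-batch flag are illustrative choices, not requirements):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
# The network starts out empty and is populated by the calls shown in the
# following sections (explicit-batch mode is used here).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
```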

Adding Inputs and Outputs

  • You can add input tensors to the network using the add_input() method, specifying the name, data type, and dimensions of the input tensor.

  • Output tensors can be marked using the mark_output() method, indicating which tensors should be considered as outputs of the network.
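
Continuing the sketch above (the tensor name and shape are illustrative placeholders):

```python
# Declare a float32 input tensor for the network.
input_tensor = network.add_input(
    name="input", dtype=trt.float32, shape=(1, 3, 224, 224)
)
```

The matching mark_output() call appears at the end of the tensor-manipulation sketch below, once the final tensor of the toy network exists.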

Adding Layers and Operations

  • INetworkDefinition provides methods to add various layers and operations to the network.

  • Examples include add_convolution(), add_pooling(), add_activation(), add_fully_connected(), etc.

  • Each layer takes input tensors and produces output tensors, forming the network structure.
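
A hedged example, extending the toy network above with a convolution followed by a ReLU (the _nd layer variant and the random weights are illustrative; the non-_nd methods listed above behave analogously):

```python
import numpy as np

# Random weights for a 3x3 convolution with 16 output maps (illustrative only).
w = np.random.randn(16, 3, 3, 3).astype(np.float32)
b = np.zeros(16, dtype=np.float32)

conv = network.add_convolution_nd(
    input=input_tensor,
    num_output_maps=16,
    kernel_shape=(3, 3),
    kernel=trt.Weights(w),
    bias=trt.Weights(b),
)
relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
```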

Tensor Manipulation

  • INetworkDefinition allows you to manipulate tensors within the network.

  • You can add operations like concatenation, element-wise operations, reshaping, slicing, etc., to transform and combine tensors.
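
For instance, concatenation and reshaping applied to the toy network (the axis and reshape dimensions are illustrative):

```python
# Concatenate the ReLU output with itself along the channel axis.
concat = network.add_concatenation([relu.get_output(0), relu.get_output(0)])
concat.axis = 1

# Flatten everything after the batch dimension with a shuffle (reshape) layer.
shuffle = network.add_shuffle(concat.get_output(0))
shuffle.reshape_dims = (1, -1)

# Mark the final tensor as the network output (see "Adding Inputs and Outputs").
shuffle.get_output(0).name = "output"
network.mark_output(shuffle.get_output(0))
```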

Network Optimisation

  • The INetworkDefinition is used by the TensorRT Builder to optimise the network for inference.

  • The Builder analyses the network structure, applies optimisations, and generates an optimised runtime engine (ICudaEngine).
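
A minimal sketch of handing the populated network to the Builder (the workspace size is an arbitrary example value):

```python
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

# Build a serialized engine (plan), then deserialize it into an ICudaEngine.
serialized_engine = builder.build_serialized_network(network, config)

runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(serialized_engine)
```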

Debugging and Inspection

  • INetworkDefinition provides methods to inspect the network structure and debug the network.

  • You can retrieve information about layers, tensors, and their connections using methods like get_layer(), get_input(), get_output(), etc.

  • You can also mark tensors as debug tensors using mark_debug() to enable additional debugging information.
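
For example, a small loop that walks the network and prints each layer together with its input and output tensor names:

```python
for i in range(network.num_layers):
    layer = network.get_layer(i)
    inputs = [layer.get_input(j).name
              for j in range(layer.num_inputs) if layer.get_input(j) is not None]
    outputs = [layer.get_output(j).name for j in range(layer.num_outputs)]
    print(f"{i:3d} {layer.name}: {inputs} -> {outputs}")
```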

Relationship with the TensorRT-LLM process

Model Definition

  • The language model architecture, such as a transformer-based model like BERT or GPT, is defined using the INetworkDefinition.

  • The layers and operations specific to the language model, such as self-attention, feedforward layers, and embedding layers, are added to the INetworkDefinition.

Input and Output Tensors

  • The input tensors for the language model, such as the input tokens or token embeddings, are specified using add_input().

  • The output tensors, such as the language model predictions or hidden states, are marked using mark_output().

Optimisation for Inference

  • The INetworkDefinition representing the language model is passed to the TensorRT Builder for optimisation.

  • The Builder applies various optimisation techniques, such as layer fusion, precision calibration, and kernel auto-tuning, to generate an optimised runtime engine (ICudaEngine) tailored specifically to the language model.
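
At the plain TensorRT level, these precision choices surface as builder-config flags; TensorRT-LLM drives the equivalents through its own build configuration. A hedged sketch (my_calibrator is a hypothetical placeholder for an INT8 calibrator object):

```python
# Allow FP16 kernels; INT8 additionally requires a calibrator or explicit ranges.
config.set_flag(trt.BuilderFlag.FP16)
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # hypothetical calibrator, for illustration
```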

Inference

  • The optimised ICudaEngine is used to create an IExecutionContext, which allows for efficient inference on the language model.

  • The IExecutionContext takes input data, such as text tokens, and produces the language model outputs, such as predicted tokens or language embeddings.
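
A minimal sketch of that execution path using the toy engine built above (buffer handling via PyCUDA; a real TensorRT-LLM deployment drives the engine through the TensorRT-LLM runtime rather than raw bindings):

```python
import numpy as np
import pycuda.autoinit  # creates a CUDA context for the allocations below
import pycuda.driver as cuda

context = engine.create_execution_context()  # IExecutionContext

h_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
h_output = np.empty(tuple(engine.get_binding_shape(1)), dtype=np.float32)

d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

cuda.memcpy_htod(d_input, h_input)
context.execute_v2(bindings=[int(d_input), int(d_output)])
cuda.memcpy_dtoh(h_output, d_output)
```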

By defining the language model architecture using the INetworkDefinition and leveraging TensorRT's optimisation capabilities, the TensorRT-LLM process enables fast and efficient inference on large language models, making them suitable for real-time applications and resource-constrained environments.
