Graph Rewriting

Graph Rewriting in TensorRT-LLM manipulates the structure of a neural network at a low level, specifically at the ILayer/INetworkDefinition level in TensorRT.

The technique is used to optimize and transform network models for efficient execution with NVIDIA TensorRT, a high-performance deep learning inference optimizer and runtime library.

When to Use Graph Rewriting?

Graph Rewriting is used in situations where fine-grained control and manipulation of the network at the layer level are required, particularly after the network has been defined. It is different from Module Rewriting, which operates at a higher level (before the network is converted into TensorRT's graph format). Graph Rewriting is useful:

  1. When only ILayer/INetworkDefinition is available: when you are working directly with these lower-level constructs rather than with Python-level modules.

  2. For complex manipulations that cannot be done efficiently, or at all, at the Module level, such as layer fusion or transformations involving nested control flow.

Key Concepts in Graph Rewriting

  1. Tensor-Related Methods: These allow manipulation of tensors, including retrieving the layer that produces a tensor, listing a tensor's consumer layers, and replacing a tensor in its consumer layers.

  2. FLayerInfo and FLayerInfoMemo: These store and retrieve high-level information about layers and preserve their original inputs and attributes. They are especially useful for layers implemented as TensorRT plugins, which otherwise appear as opaque plugin layers in the TensorRT graph.

  3. Pattern and Pattern Manager:

    • PatternRewriter: For defining a rewriting pattern that alters the network.

    • PatternAnalyzer: For defining an analysis pattern that collects information from the network.

    • RewritePatternManager: Manages multiple rewriting patterns.

    • AnalysisPatternManager: Manages multiple analysis patterns.

  4. @record_signature Decorator: Used to record the high-level signature (FLayerInfo) for functionals, which is crucial for Graph Rewriting when analyzing or rewriting certain functionals.

  5. Workflow for Defining a Graph Rewriting Pattern: Typically involves defining a class that inherits from PatternRewriter or PatternAnalyzer and implementing the match and rewrite (or analyze) methods to specify how the network should be transformed or analyzed; a minimal sketch follows this list.
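
The sketch below illustrates this workflow with a pattern that replaces an elementwise SUM layer with an elementwise SUB layer, in the spirit of the layer-replacement example in the next section. It is a minimal sketch, not a drop-in implementation: the constructor arguments (root_layer, seperate_match_rewrite), the Layer helpers (as_layer, get_inputs, get_outputs, mark_as_removed) and the net_guard import are assumptions modelled on tensorrt_llm/graph_rewriting.py and tensorrt_llm/network.py, and may differ between TensorRT-LLM versions.

```python
import tensorrt as trt

# Assumed module paths; verify against your TensorRT-LLM version
# (tensorrt_llm/graph_rewriting.py and tensorrt_llm/network.py).
from tensorrt_llm.graph_rewriting import PatternRewriter
from tensorrt_llm.network import net_guard


class ReplaceAddWithSub(PatternRewriter):
    """Rewrites every elementwise SUM layer into an elementwise SUB layer."""

    def __init__(self):
        # root_layer restricts traversal to elementwise layers;
        # seperate_match_rewrite requests separate match()/rewrite() calls.
        # Both keyword names are assumptions taken from upstream examples.
        super().__init__('replace_add_with_sub',
                         root_layer={trt.LayerType.ELEMENTWISE},
                         seperate_match_rewrite=True)

    def match(self, layer) -> bool:
        # The wrapped layer exposes the underlying TensorRT ILayer via as_layer().
        return layer.as_layer().op == trt.ElementWiseOperation.SUM

    def rewrite(self, layer) -> None:
        # net_guard makes the layer's owning network the active one so that
        # functional-style operators (a - b) insert new ILayers into it.
        with net_guard(layer.network):
            # 1. Take the inputs and the output of the subgraph being replaced.
            a, b = layer.get_inputs(0, 1)
            old_out = layer.get_outputs(0)[0]

            # 2. Build the replacement subgraph (an elementwise SUB).
            new_out = a - b

            # 3. Redirect every consumer of the old output to the new output;
            #    the now-dangling SUM layer is pruned when the engine is built.
            old_out.replace_all_uses_with(new_out)

            # 4. Mark the old layer as removed so later patterns skip it.
            layer.mark_as_removed()
```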

Practical Examples of Graph Rewriting

  1. Layer Replacement: Replacing one type of layer with another (e.g., replacing a sum layer with a subtract layer) while maintaining the connections in the network; the registration sketch after this list shows how such a rewrite is applied to a network.

  2. Layer Fusion: Combining multiple layers into a single, more efficient layer, often used to reduce computational overhead.

  3. Optimization for Specific Hardware: Tailoring the network structure to leverage specific features of hardware accelerators, like GPUs, for more efficient execution.

  4. Enabling Advanced Features: Modifying layers to enable advanced features like dynamic batch sizes, mixed precision, or remove-padding modes in specific layers.
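
As a usage note for the layer-replacement example above, the hedged sketch below registers the ReplaceAddWithSub pattern from the previous snippet with a RewritePatternManager and applies it to a tensorrt_llm Network before the engine is built. The add()/rewrite() call shapes and keyword names are assumptions to check against tensorrt_llm/graph_rewriting.py.

```python
from tensorrt_llm.graph_rewriting import RewritePatternManager


def apply_graph_rewrites(net):
    """Apply rewriting patterns to a tensorrt_llm Network,
    e.g. one returned by Builder.create_network()."""
    manager = RewritePatternManager()
    # The label and keyword names here are illustrative assumptions.
    manager.add(label='replace_add_with_sub',
                pattern=ReplaceAddWithSub())
    # Walks the underlying ILayer/INetworkDefinition graph and applies every
    # pattern whose match() returns True.
    manager.rewrite(net)
    return net
```

A pass like this runs on the Network produced during model definition, before it is compiled into a TensorRT engine, which is why graph rewriting can influence build-time behaviour such as the remove-padding mode mentioned above.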

Importance in Neural Network Optimization

Graph Rewriting is a crucial step in optimizing neural networks for deployment, particularly in scenarios where high throughput and low latency are critical, such as real-time applications or edge computing. By transforming and optimizing the network at a granular level, it's possible to achieve significant improvements in performance on specific hardware architectures.

In summary, Graph Rewriting in TensorRT-LLM offers a powerful toolset for deep manipulation and optimization of neural networks, allowing for fine-grained control and customization to achieve high-performance inference.
