Graph Rewriting (GW) module

The Graph Rewriting (GW) module in TensorRT-LLM is a powerful tool for manipulating and optimizing the underlying graph of a neural network.

It allows you to modify the network structure at the ILayer/INetworkDefinition level, which is a lower-level representation compared to the high-level Module abstraction.

Let's dive into the details of the Graph Rewriting module and explore its usage and best practices.

When to Use Graph Rewriting

Graph Rewriting is particularly useful in the following scenarios:

  1. When you only have access to the ILayer/INetworkDefinition representation of the network and want to perform optimizations at that level.

  2. When modifying the network using Module Rewriting would lead to complex control flow or scattered functionality across multiple Module instances.

Graph Rewriting APIs

TensorRT-LLM provides several core APIs for Graph Rewriting:

Tensor-Related Methods

  • Tensor.get_parent: Retrieves the ILayer that produces a given tensor.

  • Tensor.get_users: Retrieves the consumer ILayers of a given tensor.

  • Tensor.replace_all_uses_with: Replaces a tensor with another tensor in all its consumer ILayers.
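
A minimal sketch of these three calls in use. Here t and new_t are placeholder tensors inside an already-built network, and exact method availability may vary between TensorRT-LLM versions:

```python
# `t` is a placeholder tensorrt_llm Tensor; `new_t` is a replacement tensor
# produced elsewhere in the network.
producer = t.get_parent()        # the ILayer that produces `t`
readers = t.get_users()          # every consumer ILayer that reads `t`

# After this call, every reader of `t` consumes `new_t` instead.
t.replace_all_uses_with(new_t)
```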

FLayerInfo

  • FLayerInfo is a high-level signature that holds original input information for layers defined in functional.py. It provides a mapping between ILayers and their corresponding high-level information.

  • FLayerInfo.replace_input_with: Replaces an input tensor with another tensor.

  • FLayerInfo.replace_output_uses_with: Redirects the usage of original output tensors to a set of new tensors.

  • FLayerInfoMemo.instance(): Retrieves the singleton instance of FLayerInfoMemo.

  • FLayerInfoMemo.get: Retrieves the corresponding FLayerInfo for an ILayer.
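
A minimal sketch of looking up the FLayerInfo for a plugin layer during a pass. The lookup-by-name pattern follows the style of TensorRT-LLM's own rewriting passes; the 'qkv' input name is specific to the GPTAttention plugin and is used here purely as an illustration, and the import path is an assumption:

```python
from tensorrt_llm.graph_rewriting import FLayerInfoMemo  # path is an assumption

# `layer` is a plugin ILayer wrapper encountered while walking the network.
flayer = FLayerInfoMemo.instance().get(layer.name)
if flayer is not None:
    # Recover the original high-level input tensor by name, rather than
    # digging through the plugin's low-level ILayer inputs.
    qkv = flayer.get_input('qkv')
```

From here, FLayerInfo.replace_input_with and FLayerInfo.replace_output_uses_with can redirect the plugin's inputs and outputs to new tensors.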

Pattern and Pattern Manager

  • TensorRT-LLM defines two types of patterns: PatternRewriter and PatternAnalyzer.

  • PatternRewriter is used for defining rewriting patterns that actually alter the network structure. It provides methods like match, rewrite, and match_and_rewrite.

  • PatternAnalyzer is used for defining analysis patterns that collect information from the network. It provides methods like match and analyze.

  • RewritePatternManager and AnalysisPatternManager are used to manage multiple PatternRewriter or PatternAnalyzer instances, respectively.
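
As a small end-to-end sketch, closely following the style of the example in the TensorRT-LLM documentation, here is a PatternRewriter that replaces every elementwise addition with a subtraction. Import paths are assumptions and may differ across versions:

```python
import tensorrt as trt

from tensorrt_llm.functional import sub
from tensorrt_llm.graph_rewriting import PatternRewriter
from tensorrt_llm.network import net_guard


class ReplaceAddWithSub(PatternRewriter):
    def __init__(self):
        super().__init__('replace_add_with_sub')

    def match(self, layer):
        # Fire only on elementwise-SUM layers.
        return layer.as_layer().type == trt.LayerType.ELEMENTWISE and \
               layer.as_layer().op == trt.ElementWiseOperation.SUM

    def rewrite(self, layer):
        with net_guard(layer.network):
            # Stage 1: retrieve the old subgraph's inputs and output.
            a, b = layer.get_inputs(0, 1)
            o = layer.get_outputs(0)[0]
            # Stage 2: build the replacement subgraph from the same inputs.
            c = sub(a, b)
            # Stage 3: redirect all consumers of the old output.
            o.replace_all_uses_with(c)
            # Stage 4: mark the old layer as removed.
            layer.mark_as_removed()
```

The stage comments in the rewrite body correspond to the four-stage rewriting process described under the best practices below.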

Best Practices for Using Graph Rewriting

Understand the Network Structure

  • Before applying Graph Rewriting, familiarize yourself with the structure of the network and the layers you want to manipulate.

  • Identify the specific subgraphs or patterns you want to optimize or modify.

Follow the Four-Stage Rewriting Process

  • When rewriting a layer or subgraph, follow the four-stage process:

    1. Retrieve the input and output tensors of the subgraph to be replaced.

    2. Create a new subgraph that takes the old subgraph's inputs.

    3. Redirect the layers depending on the outputs of the old subgraph to the new subgraph.

    4. Mark the layers in the old subgraph as removed.

  • Avoid directly rewriting layers; instead, create new layers and redirect the usage of the original outputs to the new layers.
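
The rewrite method in the earlier sketch follows exactly these four stages. Applying a rewriter to a network then looks roughly like this (a sketch; RewritePatternManager's exact signature may differ across versions):

```python
from tensorrt_llm.graph_rewriting import RewritePatternManager  # path is an assumption

patterns = RewritePatternManager()
patterns.add(label='replace_add_with_sub', pattern=ReplaceAddWithSub())
patterns.rewrite(net)  # `net` is the tensorrt_llm.Network to transform
```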

Leverage FLayerInfo for Plugin Layers

  • When working with TensorRT plugin layers, use FLayerInfo to access the original input information.

  • FLayerInfo provides a high-level abstraction for plugin layers, allowing you to retrieve and modify their inputs and outputs.

Use the @record_signature Decorator

  • If you are adding new Graph Rewriting patterns that involve functionals, ensure that the functionals are decorated with the @record_signature decorator.

  • This decorator records the FLayerInfo for a functional, making it available for analysis and rewriting.
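
A sketch of what this looks like for a hypothetical functional. my_fused_norm is invented for illustration, and the import path for the decorator is an assumption:

```python
from tensorrt_llm import graph_rewriting as gw  # import path is an assumption
from tensorrt_llm.functional import Tensor


# Each call during network construction records an FLayerInfo entry, so later
# analysis and rewriting passes can look up this functional's high-level
# inputs by name instead of inspecting raw ILayers.
@gw.record_signature
def my_fused_norm(x: Tensor, eps: float = 1e-5) -> Tensor:
    ...  # build TensorRT layers here and return the output Tensor
```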

Test and Validate Rewritten Networks

  • After applying Graph Rewriting, thoroughly test and validate the rewritten network to ensure its correctness and performance.

  • Compare the results of the original and rewritten networks to verify that the desired optimizations or modifications have been achieved.
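
A placeholder validation harness might look like the following; run_engine, the two engines, and sample_inputs are all assumptions standing in for your actual inference setup:

```python
import numpy as np

# `run_engine` is assumed to execute a built engine on a batch of sample
# inputs and return a numpy array of outputs (e.g. logits).
ref = run_engine(original_engine, sample_inputs)
out = run_engine(rewritten_engine, sample_inputs)

# Loose tolerances, since fused or restructured kernels rarely match
# the original bit-for-bit.
assert np.allclose(ref, out, rtol=1e-3, atol=1e-3), "rewrite changed numerics"
```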

Consider the Impact on Performance

  • While Graph Rewriting can lead to optimizations and improved performance, be mindful of the potential impact on inference speed and memory usage.

  • Profile and benchmark the rewritten network to assess its performance characteristics and ensure that the optimizations are beneficial for your specific use case.

Use Graph Rewriting Judiciously

  • Graph Rewriting is a powerful tool, but it should be used judiciously and only when necessary.

  • Overusing Graph Rewriting or applying complex rewriting patterns may lead to reduced readability and maintainability of the network definition.

By following these best practices and leveraging the Graph Rewriting APIs provided by TensorRT-LLM, you can effectively optimize and manipulate the underlying graph of your neural network.

Graph Rewriting allows you to fine-tune the network structure, fuse layers, and apply custom optimizations to improve performance and efficiency.

Remember to test and validate the rewritten network thoroughly to ensure that the desired optimizations are achieved without introducing any unintended side effects.

Additionally, keep in mind that Graph Rewriting operates at a lower level compared to Module Rewriting, so it may require a deeper understanding of the network structure and the TensorRT APIs.

Overall, the Graph Rewriting module in TensorRT-LLM provides a flexible and powerful way to optimize and customize your neural network graphs, enabling you to achieve better performance and efficiency in your TensorRT-based applications.
