TensorRT-LLM build workflow - process

Step 1

Model Conversion

Before you can build your large language model with TensorRT-LLM, you need to convert your existing model checkpoint into the TensorRT-LLM format.

Here's what you'll need:

Checklist:

  • Your pre-trained model checkpoint from a supported training framework (e.g., Hugging Face, Meta)

  • TensorRT-LLM installed in your development environment

  • Familiarity with the TensorRT-LLM conversion APIs

The TensorRT-LLM library provides a handy TopModelMixin class that defines a common interface for model conversion.

The most frequently used method is from_hugging_face(), which allows you to convert models from the popular Hugging Face format.

Now, here's the cool part:

TensorRT-LLM has model-specific conversion logic implemented in dedicated model classes.

For example, if you're working with an LLaMA model, you'll use the LLaMAForCausalLM class, which inherits from TopModelMixin. This class takes care of all the nitty-gritty details of converting your LLaMA model checkpoint.
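To make this concrete, here is a minimal sketch of what a conversion call can look like, assuming the LLaMAForCausalLM class from tensorrt_llm.models and a local Hugging Face checkpoint directory (the path and keyword arguments are placeholders and may differ between TensorRT-LLM releases):

```python
# Sketch only: convert a Hugging Face LLaMA checkpoint into a TensorRT-LLM
# model object. The path is a placeholder; argument names may vary by release.
from tensorrt_llm.models import LLaMAForCausalLM

# from_hugging_face() reads the Hugging Face weights and returns a
# TensorRT-LLM model object ready for engine building.
llama = LLaMAForCausalLM.from_hugging_face(
    "/path/to/Llama-2-7b-hf",   # local Hugging Face checkpoint directory
    dtype="float16",            # target precision for the converted weights
)
```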

One of the great things about TensorRT-LLM's conversion APIs is their flexibility.

They support various checkpoint formats, such as Hugging Face and Meta checkpoints, making it easy to work with different model types.

Best Practice Tip

Whenever possible, use the provided conversion APIs to ensure compatibility and maintainability.

If you have a custom checkpoint format, consider implementing the conversion logic within the TensorRT-LLM core library. This keeps your model definition and conversion code together, making it easier to manage and update.

Step 2

Quantization (Optional)

Quantization is a technique that can help reduce your model's size and improve its inference performance.

TensorRT-LLM supports several quantization methods.

Checklist:

  • Decide on the quantization technique you want to use (e.g., FP8, W4A16_AWQ, INT8)

  • Install the NVIDIA AMMO toolkit if you plan to use AMMO-supported quantization algorithms

  • Familiarise yourself with the quantize() method in the PretrainedModel class

TensorRT-LLM leverages the NVIDIA AMMO toolkit for certain quantization algorithms like FP8, W4A16_AWQ, and W4A8_AWQ.

However, it also offers its own implementations for techniques like SmoothQuant, INT8 KV cache, and INT4/INT8 weight-only quantization.

The PretrainedModel class provides a unified interface for quantization through the quantize() method.

The default implementation handles AMMO-supported quantization, making it easy to apply quantization consistently across different models.

If you need model-specific quantization logic, you can implement it in the respective model classes. For example, the LLaMAForCausalLM class overrides the quantize() method to customise the quantization process for LLaMA models.

Best Practice Tip

When using the quantize() method in an MPI program, make sure that only rank 0 calls the method to avoid resource contention. This ensures that the quantization process runs smoothly and efficiently.
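A minimal sketch of that rank-0 guard, assuming mpi4py for the rank check; run_quantization() is a hypothetical placeholder for your actual call to the quantize() method:

```python
# Sketch only: run quantization on rank 0 of an MPI job and make the other
# ranks wait until the quantized checkpoint has been written.
from mpi4py import MPI

def run_quantization():
    # Placeholder: call e.g. LLaMAForCausalLM.quantize(...) here,
    # writing the quantized TensorRT-LLM checkpoint to disk.
    pass

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    run_quantization()
comm.Barrier()  # all other ranks block here until rank 0 has finished
```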

Step 3

Engine Building

Once your model is converted and optionally quantized, it's time to build the TensorRT engine.

Checklist:

  • Your converted TensorRT-LLM model object

  • The tensorrt_llm.build API

  • A BuildConfig object specifying your desired build configuration options

Building the engine is where the magic happens.

The tensorrt_llm.build API simplifies the entire process: it creates the builder and network objects, traces the model into the network, and builds the TensorRT engine.

To customise your build, you can use the BuildConfig class to specify options like the maximum batch size. This allows you to fine-tune the engine based on your specific requirements.

Once the engine is built, you can save it to disk for later use. This is particularly handy when you want to deploy your model in a production environment.
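As a rough sketch, the build-and-save step can look like the following, reusing the converted model object from Step 1 (paths are placeholders, and BuildConfig field names such as max_input_len can vary between releases):

```python
# Sketch only: build a TensorRT engine from a converted model and save it.
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM

llama = LLaMAForCausalLM.from_hugging_face(
    "/path/to/Llama-2-7b-hf", dtype="float16")  # converted model (Step 1)

build_config = BuildConfig(
    max_batch_size=8,     # largest batch the engine should accept
    max_input_len=1024,   # longest prompt, in tokens
)

# tensorrt_llm.build() creates the builder and network, traces the model
# into the network, and compiles the TensorRT engine.
engine = tensorrt_llm.build(llama, build_config)

# Serialise the engine and its configuration for later deployment.
engine.save("/path/to/llama2-7b-engine")
```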

Best Practice Tip

Consistently use the tensorrt_llm.build API across different models to maintain a standardised build process.

Don't be afraid to experiment with various build configurations to find the optimal balance between performance and memory usage for your specific use case.

Step 4

Deployment and Optimisation

Congratulations! You've successfully built your TensorRT engine. Now it's time to deploy and optimise your model.

Checklist:

  • Choose the appropriate deployment strategy (e.g., single-GPU, multi-GPU, multi-node)

  • Use profiling tools like NVIDIA Nsight Systems to analyse performance and identify bottlenecks

  • Continuously monitor and profile your deployed model to ensure optimal performance

TensorRT-LLM provides several deployment options to suit your needs.

For smaller models, a single-GPU deployment might suffice. However, for larger models or high-performance requirements, you can leverage multi-GPU and multi-node configurations.
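For the single-GPU case, a minimal inference sketch with the Python runtime might look like this; it follows the pattern used in examples/run.py, but paths, the prompt, and some argument names are placeholders that can differ between releases:

```python
# Sketch only: single-GPU inference against a built engine.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("/path/to/Llama-2-7b-hf")
runner = ModelRunner.from_dir("/path/to/llama2-7b-engine")  # engine from Step 3

# The runner expects a list of 1-D int32 tensors, one per request.
input_ids = tokenizer("The capital of France is",
                      return_tensors="pt").input_ids[0].to(torch.int32)

outputs = runner.generate([input_ids],
                          max_new_tokens=32,
                          end_id=tokenizer.eos_token_id,
                          pad_id=tokenizer.eos_token_id)

# outputs is indexed by [batch, beam, token]; decode the first beam.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```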

To optimise your model's performance, it's crucial to profile and monitor its behaviour in production.

Tools like NVIDIA Nsight Systems can help you identify bottlenecks and areas for improvement. Keep an eye on GPU utilisation, memory usage, and inference latency to ensure your model meets the desired performance targets.
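One lightweight way to make an Nsight Systems capture easier to read is to mark the phases you care about with NVTX ranges, which nsys picks up as named regions on its timeline. A minimal sketch, assuming you profile the script with something like "nsys profile -o report python infer.py"; generate_step() is a hypothetical placeholder for your real generation call:

```python
# Sketch only: annotate an inference phase with an NVTX range so it shows up
# as a named region in an Nsight Systems timeline.
import torch

def generate_step():
    pass  # placeholder: call runner.generate(...) here in a real script

torch.cuda.nvtx.range_push("generate")  # open a named timeline region
generate_step()
torch.cuda.nvtx.range_pop()             # close the region
```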

Remember to stay updated with the latest TensorRT-LLM releases and documentation. As new features and improvements are introduced, you might discover additional optimisation opportunities for your model.

Best Practice Tip

Regularly review and update your model's configuration, quantization techniques, and deployment strategy to maintain optimal performance as your requirements evolve.

Wrapping Up

And there you have it!

A comprehensive process document for building large language models using the TensorRT-LLM workflow.

By following this checklist and leveraging the provided APIs and best practices, you'll be well on your way to deploying efficient and high-performance models.

Remember, the key to success is experimentation and continuous optimisation. Don't be afraid to try different configurations, quantization techniques, and deployment strategies to find the perfect balance for your specific use case.

If you encounter any challenges or have questions along the way, don't hesitate to reach out to the TensorRT-LLM community or consult the documentation. With the right approach and a bit of persistence, you'll be able to unlock the full potential of your large language models using TensorRT-LLM.
