The TensorRT-LLM Process


TensorRT-LLM is a toolkit designed to help users create optimised solutions for Large Language Model (LLM) inference.

It provides a Python API that allows users to define models and compile efficient TensorRT engines for NVIDIA GPUs.

The toolkit also includes Python and C++ components for building runtimes to execute these engines, as well as backends for the Triton Inference Server, making it easy to create web-based services for LLMs.

TensorRT-LLM supports multi-GPU and multi-node configurations through MPI (Message Passing Interface).

The process of creating an inference solution with TensorRT-LLM involves the following steps:

Model Definition

Users can either define their own model or choose from a list of pre-defined network architectures supported by TensorRT-LLM.
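
The pre-defined architectures are exposed through the tensorrt_llm.models package. As a rough, hedged illustration (class names vary between releases), the available causal-LM architectures can be inspected directly:

# Hedged sketch: list the pre-defined causal-LM architectures shipped with
# TensorRT-LLM. Class names differ slightly between releases, so treat the
# output as illustrative rather than exhaustive.
from tensorrt_llm import models

predefined = [name for name in dir(models) if name.endswith("ForCausalLM")]
print(predefined)  # e.g. ['BloomForCausalLM', 'FalconForCausalLM', 'LLaMAForCausalLM', ...]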

Model Training

If using a custom model, users must train the model using a training framework (training is not part of TensorRT-LLM). For pre-defined models, users can download checkpoints from various providers, such as the Hugging Face Hub, which offers models trained using NVIDIA NeMo or PyTorch.
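
For example, a pretrained checkpoint can be pulled from the Hugging Face Hub with the huggingface_hub client before it is handed to TensorRT-LLM; the repository id and local directory below are illustrative only.

# Download an example pretrained checkpoint from the Hugging Face Hub.
# The repo_id is illustrative; gated models such as Llama 2 also require
# an access token that has been accepted on the model page.
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",  # example repository
    local_dir="./llama-2-7b-hf",         # where the weights are stored
)
print(f"Checkpoint downloaded to {checkpoint_dir}")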

Model Recreation

With the model definition and weights, users utilise TensorRT-LLM's Python API to recreate the model in a format that can be compiled by TensorRT into an efficient engine.

TensorRT-LLM supports several standard models out-of-the-box for ease of use.
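
A minimal sketch of this step is shown below. It assumes a recent TensorRT-LLM release in which LLaMAForCausalLM.from_hugging_face, BuildConfig and tensorrt_llm.build are available; older releases reach the same result with the examples/llama/convert_checkpoint.py script followed by the trtllm-build command. The shapes used here are illustrative.

# Hedged sketch: recreate a Hugging Face Llama checkpoint with the
# TensorRT-LLM Python API and compile it into a TensorRT engine.
# Exact API names and arguments vary between releases.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

# Recreate the model from the downloaded Hugging Face weights.
model = LLaMAForCausalLM.from_hugging_face("./llama-2-7b-hf", dtype="float16")

# Describe the engine to build (illustrative shapes).
build_config = BuildConfig(max_batch_size=8, max_input_len=1024)

# Compile to an optimised TensorRT engine and serialise it to disk.
engine = build(model, build_config)
engine.save("./llama-2-7b-engine")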

Runtime Creation

TensorRT-LLM provides users with components to create a runtime that executes the efficient TensorRT engine. The runtime components offer features such as beam-search and extensive sampling functionalities (e.g., top-K and top-P sampling). The C++ runtime is the recommended choice.
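
As a hedged illustration of the Python side, the runtime's ModelRunner can load a built engine and generate with the usual sampling controls; the keyword arguments below mirror those used by the examples/run.py script and may differ between releases.

# Hedged sketch: run a compiled engine with the Python runtime (ModelRunner).
# Argument names follow examples/run.py and may differ between releases.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-hf")
runner = ModelRunner.from_dir(engine_dir="./llama-2-7b-engine")

prompt = "TensorRT-LLM is"
input_ids = torch.tensor(tokenizer.encode(prompt), dtype=torch.int32)

with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids=[input_ids],   # a batch with a single request
        max_new_tokens=64,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
        top_k=1,                       # greedy; raise top_k/top_p to sample
        temperature=1.0,
    )

# outputs is [batch, beams, tokens]; decode the first beam of the first request.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))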

Triton Inference Server Integration

TensorRT-LLM includes Python and C++ backends for NVIDIA Triton Inference Server, allowing users to assemble solutions for LLM online serving. The C++ backend is recommended as it implements in-flight batching for optimised performance.
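
A hedged sketch of a client request is shown below. It assumes the default "ensemble" model from NVIDIA's tensorrtllm_backend repository and Triton's HTTP generate endpoint; the endpoint path and field names (text_input, max_tokens, text_output) come from that backend's example configuration and may need adjusting for a particular deployment.

# Hedged sketch: query a Triton Inference Server running the TensorRT-LLM
# backend. Assumes the example "ensemble" model and Triton's HTTP generate
# endpoint; field names depend on the deployed model configuration.
import requests

url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "What is TensorRT-LLM?",
    "max_tokens": 64,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["text_output"])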

To use TensorRT-LLM, users need to supply a set of trained weights.

These weights can be obtained from the user's own model trained in a framework like NVIDIA NeMo or pulled from repositories such as the Hugging Face Hub, which offers pretrained weights.

In summary, TensorRT-LLM simplifies the process of creating optimised LLM inference solutions by providing a Python API for model definition, components for runtime creation, and backends for the Triton Inference Server.

Users can leverage pre-defined models and pretrained weights or use their own custom models to build efficient and scalable LLM inference solutions.
