
TensorRT-LLM Architecture and Process

TensorRT-LLM is a framework for optimising and deploying large language models on NVIDIA GPUs.

It covers the full workflow, from model definition through compilation to efficient execution on hardware.

Purpose and Scope

Optimised Inference for LLMs

TensorRT-LLM is tailored for efficient inference of large language models, utilising NVIDIA's TensorRT for GPU optimisation.

Multi-GPU and Multi-Node Support

It supports both multi-GPU and multi-node configurations, which are crucial for handling the computational demands of large language models at scale.

Model Definition and Training

Users can define their own models or choose from pre-defined architectures supported by TensorRT-LLM.

While TensorRT-LLM focuses on inference, the models themselves need to be trained with other frameworks such as NVIDIA NeMo or PyTorch. Pre-trained model checkpoints can also be sourced from providers such as Hugging Face.
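
For illustration, below is a minimal sketch of selecting one of the pre-defined architectures through the Python API. It assumes the LLaMA classes under tensorrt_llm.models and a checkpoint already converted into the TensorRT-LLM format; class and method names vary between releases.

```python
# Hedged sketch: instantiate a pre-defined architecture from tensorrt_llm.models.
# Assumes the checkpoint has already been converted to the TensorRT-LLM format
# (see the convert_checkpoint pages); the path is a placeholder.
from tensorrt_llm.models import LLaMAForCausalLM

# from_checkpoint reads the converted weights and the accompanying config.json
# that describes the architecture (layer count, hidden size, dtype, ...).
model = LLaMAForCausalLM.from_checkpoint("./trtllm_ckpt")
```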

Compilation with TensorRT

The framework provides a Python API to recreate models in a format that can be compiled by TensorRT into an efficient engine. This step involves translating the high-level model architecture into a representation optimised for GPU execution.

The model is then compiled into a TensorRT engine, which is an optimised version of the model specifically designed for fast inference on NVIDIA GPUs.
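
As a concrete illustration, the sketch below runs the two-step flow used by the Llama example elsewhere in this guide: convert a Hugging Face checkpoint into the TensorRT-LLM checkpoint format, then compile it with trtllm-build. The paths and flag values are placeholders rather than a definitive recipe.

```python
# Hedged sketch of the convert-then-build flow; paths and flags are
# illustrative and depend on the model and the TensorRT-LLM release.
import subprocess

# Step 1: convert a Hugging Face checkpoint into the TensorRT-LLM
# checkpoint format (sharded weights plus a config.json).
subprocess.run([
    "python", "examples/llama/convert_checkpoint.py",
    "--model_dir", "./llama-2-7b-hf",   # local Hugging Face checkpoint (assumed path)
    "--output_dir", "./trtllm_ckpt",
    "--dtype", "float16",
], check=True)

# Step 2: compile the converted checkpoint into an optimised TensorRT engine.
subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "./trtllm_ckpt",
    "--output_dir", "./trtllm_engine",
    "--gemm_plugin", "float16",         # enable the FP16 GEMM plugin (assumed setting)
], check=True)
```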

Runtime Execution

TensorRT-LLM includes components to create a runtime environment that can execute the optimised TensorRT engine.

Decoding strategies such as beam search, top-K, and top-P sampling are supported. These matter for text generation, where different strategies for selecting the next token in a sequence are needed.

C++ Runtime: While there's Python support, the C++ runtime is recommended for performance reasons.
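
A minimal Python-runtime sketch is shown below. It uses the ModelRunner helper that examples/run.py is built around; the engine and tokenizer paths are placeholders, and the exact generate() arguments vary between releases.

```python
# Hedged sketch of running a built engine with the Python runtime
# (ModelRunner, as used by examples/run.py); paths are placeholders.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-hf")
runner = ModelRunner.from_dir(engine_dir="./trtllm_engine")

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.int()

with torch.no_grad():
    output_ids = runner.generate(
        batch_input_ids=[prompt_ids[0]],  # list of 1-D token-id tensors
        max_new_tokens=32,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
        top_k=50,      # sample from the 50 most likely tokens
        top_p=0.9,     # nucleus (top-P) sampling threshold
        num_beams=1,   # set >1 to switch to beam search
    )

# output_ids has shape [batch, num_beams, seq_len]; decode beam 0 of sample 0.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```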

Integration with Triton Inference Server

Backends for Triton: The toolkit includes Python and C++ backends for integration with the NVIDIA Triton Inference Server, facilitating the deployment of LLMs as web-based services.

In-Flight Batching: Particularly in the C++ backend, TensorRT-LLM implements in-flight batching, which adds new requests to the running batch and evicts completed ones at each generation step rather than waiting for a whole batch to finish, improving throughput and GPU utilisation.
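
As an illustrative client-side example, the sketch below sends a request to a Triton deployment that uses the TensorRT-LLM backend. It assumes the commonly used "ensemble" model and the text_input / text_output fields from the backend's examples; the model name, endpoint, and request schema depend on how the server was configured.

```python
# Hedged sketch of a client request to Triton running the TensorRT-LLM backend.
# The endpoint, model name, and field names are assumptions based on the
# backend's example configuration and may differ in your deployment.
import requests

payload = {
    "text_input": "What is TensorRT-LLM?",
    "max_tokens": 64,
}

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",  # assumed endpoint
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("text_output"))
```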

Usage and Practical Implications

Deployment of LLMs: TensorRT-LLM streamlines the process of deploying large language models, particularly in web-based or cloud environments.

Optimisation for NVIDIA Hardware: The toolkit is specifically designed to leverage NVIDIA GPUs, making it suitable for environments where such hardware is available.

Flexibility and Advanced Features: It offers flexibility in terms of model choice and advanced features for runtime execution, catering to a range of use cases from simple language understanding to complex text generation.

Conclusion

TensorRT-LLM is a robust framework that bridges the gap between the development of large language models and their efficient deployment on NVIDIA GPUs.

It addresses the end-to-end workflow from model definition and GPU optimisation to runtime execution and web service deployment, with a focus on performance and scalability.
