TensorRT-LLM

This software library addresses the computational efficiency and cost-effectiveness of deploying large language models.

TensorRT-LLM is a framework for executing Large Language Model (LLM) inference on NVIDIA GPUs.

It provides a Python API for defining models and compiling them into efficient TensorRT engines, and includes both Python and C++ components for runtime execution.

Additionally, it offers a backend for the Triton Inference Server, facilitating the deployment of web-based large language model services.

The toolkit is compatible with multi-GPU and multi-node setups through MPI.

TensorRT-LLM integrates with the TensorRT deep learning compiler and includes optimised kernels, as well as pre- and post-processing steps.

It also incorporates multi-GPU/multi-node communication primitives.

The software aims to deliver high performance without requiring users to have deep knowledge of C++ or CUDA.

Python API

TensorRT-LLM offers a modular Python API designed for ease of use and quick customisation. It enables you to define, optimise, and execute new language model architectures as they evolve.
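
As a minimal sketch of what this looks like, the example below uses the high-level LLM entry point available in recent TensorRT-LLM releases; the Hugging Face model name and sampling values are illustrative placeholders, and older releases instead expose the model-definition and builder APIs covered later in this documentation.

```python
# Minimal sketch using the high-level Python API shipped with recent
# TensorRT-LLM releases. The model name and sampling values are placeholders.
from tensorrt_llm import LLM, SamplingParams

# On first use this builds a TensorRT engine for the model, then runs inference on it.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

sampling = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(
    ["What is in-flight batching?", "Summarise TensorRT-LLM in one sentence."],
    sampling,
)

for output in outputs:
    print(output.outputs[0].text)
```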

Features and Optimisations

  • Token streaming: Returns output tokens as they are generated instead of waiting for the full sequence to complete.

  • In-flight batching: Schedules requests at iteration granularity so new requests can join a running batch, improving handling of dynamic loads.

  • Paged attention: Stores the attention key-value cache in fixed-size blocks, reducing memory fragmentation for large models.

  • Quantization: Supports reduced-precision inference (e.g. INT8, FP8) for better performance.

Performance Improvements

TensorRT-LLM, when used with the NVIDIA Hopper architecture, significantly accelerates LLM inference.

For example, it can deliver up to 8x higher throughput on H100 GPUs compared with an A100 baseline, and a 4.6x speedup for Meta's Llama 2 model.

TCO and Energy Efficiency

The software not only improves computational efficiency but also substantially reduces the total cost of ownership (TCO) and energy consumption.

An 8x performance speedup results in a 5.3x reduction in TCO and a 5.6x reduction in energy costs compared to the A100 baseline.
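
The gap between the 8x speedup and the 5.3x TCO figure comes from the higher per-server cost of the newer hardware. The arithmetic below is purely illustrative: the relative server cost is a hypothetical assumption, not a published NVIDIA number.

```python
# Illustrative arithmetic only. The relative server cost is a hypothetical
# assumption chosen to show why the TCO gain is smaller than the raw speedup.
throughput_speedup = 8.0      # H100 + TensorRT-LLM vs. A100 baseline (from the text)
relative_server_cost = 1.5    # hypothetical: H100 node ~1.5x the cost of an A100 node

# Cost per token scales with (cost per hour) / (tokens per hour).
tco_reduction = throughput_speedup / relative_server_cost
print(f"Approximate TCO reduction: {tco_reduction:.1f}x")  # ~5.3x
```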

Advanced Scheduling Technique: In-flight Batching

TensorRT-LLM includes an optimised scheduling feature called "in-flight batching," which lets the runtime begin executing newly arrived requests before the current batch has finished. This enables better utilisation of GPU resources.
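
The sketch below illustrates the scheduling idea in plain Python. It is not TensorRT-LLM's batch manager; it only shows how, with iteration-level scheduling, finished requests leave the batch and queued requests join it between decoding steps instead of the runtime waiting for an entire batch to drain.

```python
# Simplified sketch of in-flight (iteration-level) batching. This is NOT the
# TensorRT-LLM batch manager, only an illustration of the scheduling idea:
# the batch is re-formed after every decoding step instead of waiting for the
# whole batch to finish.
from collections import deque

def serve(requests, max_batch_size, decode_step, is_finished):
    waiting = deque(requests)   # requests that have not started yet
    active = []                 # requests currently being decoded
    completed = []

    while waiting or active:
        # Admit new requests as soon as slots free up, without draining the batch.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        # One decoding iteration over the current batch (one token per request).
        for request in active:
            decode_step(request)

        # Retire finished requests immediately; their slots are reused next step.
        completed.extend(r for r in active if is_finished(r))
        active = [r for r in active if not is_finished(r)]

    return completed
```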

Quantization and FP8 Support

NVIDIA H100 GPUs with TensorRT-LLM support a new 8-bit floating-point format (FP8) that allows for more efficient memory usage during inference without sacrificing accuracy.

This is done using NVIDIA's Hopper Transformer Engine technology.
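
As a conceptual illustration of how FP8 inference preserves accuracy, the snippet below computes a per-tensor scale that maps values into the E4M3 range (maximum representable magnitude 448) before casting; it is a sketch of the general scaling idea, not the Transformer Engine implementation.

```python
# Conceptual sketch of per-tensor scaling for FP8 (E4M3), whose largest
# representable magnitude is 448. Not the Hopper Transformer Engine implementation.
import numpy as np

E4M3_MAX = 448.0

def fp8_scale(tensor: np.ndarray) -> float:
    """Scale factor that maps the tensor's largest magnitude onto the FP8 range."""
    amax = float(np.abs(tensor).max())
    return E4M3_MAX / amax

activations = np.random.randn(1024).astype(np.float32) * 3.0  # arbitrary example data
scale = fp8_scale(activations)

# Values are scaled into FP8 range before casting to FP8, and the matmul output
# is rescaled by 1/scale afterwards to recover the original magnitude.
scaled = np.clip(activations * scale, -E4M3_MAX, E4M3_MAX)
print(f"scale = {scale:.2f}, max |scaled value| = {np.abs(scaled).max():.1f}")
```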

Conclusion and Future Implications

The growing ecosystem of LLMs requires efficient solutions for deployment and scaling, and TensorRT-LLM aims to meet this need.

The software provides a robust, scalable, and cost-effective solution for businesses looking to deploy large language models.

In summary, TensorRT-LLM is a significant leap forward for anyone working with large language models, offering a range of features and optimisations to streamline deployment, improve performance, and reduce costs.

H100 with FP8 increases maximum throughput, decreases first-token latency, and reduces memory consumption. At peak, TensorRT-LLM on H100 can achieve more than 10,000 tokens/s, or under 10 ms to first token.