Numerical Precision

TensorRT-LLM supports a range of numerical precision formats and quantization methods, catering to different computational requirements and accuracy needs. Here's a summary of these methods:

FP32, FP16, and BF16

  • FP32 (32-bit IEEE floating-point): The standard precision used by most models.

  • FP16 (16-bit IEEE floating-point): Provides a balance between performance and precision; used when FP16 checkpoints are available.

  • BF16 (Bfloat16): Also 16 bits wide, but it trades mantissa precision for the wider exponent range of FP32, making it less prone to overflow than FP16. The sketch below compares the two 16-bit formats.
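
The short sketch below (a minimal example assuming PyTorch is available; not TensorRT-LLM code) compares the two 16-bit formats: BF16 keeps FP32's exponent range, so values that overflow FP16 remain finite, at the cost of fewer mantissa bits.

```python
# Compare numeric properties of FP32, FP16 and BF16 (requires PyTorch).
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  eps={info.eps:.3e}")

# 70,000 overflows FP16 (max ~65,504) but stays finite in BF16,
# where it is merely rounded to the nearest representable value.
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # ~70144.0
```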

Quantization and Dequantization (Q/DQ)

  • INT8 Quantization: Converts floating-point values to 8-bit integers, reducing model size and accelerating inference while maintaining acceptable accuracy.

  • Three scaling modes, illustrated in the sketch after this list:

    • Per-tensor: Single scaling factor for the entire tensor.

    • Per-token: Different scaling factors for each token.

    • Per-channel: Different scaling factors for each channel.
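
The following is a minimal sketch of symmetric INT8 quantization and dequantization, assuming PyTorch; the helper functions are illustrative, not TensorRT-LLM APIs. It shows how moving from a single per-tensor scale to per-channel scales typically reduces reconstruction error; per-token scaling applies the same idea along the token axis of activations.

```python
# Symmetric INT8 quantize/dequantize (Q/DQ) with different scaling granularities.
import torch

def quantize(x, scale):
    # Map floats onto the INT8 range [-127, 127] using the given scale.
    return torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)  # a small weight matrix [out_channels, in_features]

# Per-tensor: a single scaling factor for the whole tensor.
scale_tensor = w.abs().max() / 127.0
w_dq_tensor = dequantize(quantize(w, scale_tensor), scale_tensor)

# Per-channel: one scaling factor per output channel (row).
scale_channel = w.abs().amax(dim=1, keepdim=True) / 127.0
w_dq_channel = dequantize(quantize(w, scale_channel), scale_channel)

print("per-tensor  MSE:", torch.mean((w - w_dq_tensor) ** 2).item())
print("per-channel MSE:", torch.mean((w - w_dq_channel) ** 2).item())  # usually lower
```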

INT8 SmoothQuant (W8A8)

  • SmoothQuant: Enables INT8 for both activations and weights (W8A8) without a significant loss of network accuracy. It requires an offline preprocessing step that rescales the weights, migrating quantization difficulty from the activations to the weights (see the sketch below).
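
A minimal sketch of the core SmoothQuant idea, following the formulation in the SmoothQuant paper (alpha = 0.5 is the paper's default migration strength); the variable names are illustrative. Per-channel factors move scale from the activations onto the weights without changing the layer output, so both tensors become easier to quantize to INT8.

```python
# SmoothQuant-style rescaling: X @ W == (X / s) @ (W * s), per input channel j.
import torch

alpha = 0.5                         # migration strength (paper default)
X = torch.randn(16, 8).abs() * 10   # activations with large channels
W = torch.randn(8, 4)               # weights of the following linear layer

# s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one factor per input channel.
s = X.abs().amax(dim=0) ** alpha / W.abs().amax(dim=1) ** (1 - alpha)

X_smooth = X / s            # activations shrink toward the weight scale
W_smooth = W * s[:, None]   # weights absorb the factors
print(torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3))  # output unchanged
```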

INT4 and INT8 Weight-Only (W4A16 and W8A16)

  • Quantizing Weights Only: In these techniques, only the weights of the model are quantized to INT4 or INT8; the activations remain in a higher-precision format such as FP16 or BF16, and the weights are dequantized on the fly before the matrix multiplications.

GPTQ and AWQ (W4A16)

  • Advanced Quantization Methods: GPTQ and AWQ are weight-only techniques that use per-group scaling factors and zero-offsetting in linear layers, as described in their respective papers. The sketch below shows the per-group storage scheme they rely on.
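
A minimal sketch, assuming PyTorch, of per-group quantization with a zero offset, i.e. the storage scheme that GPTQ/AWQ-style weight-only methods rely on. The group size of 128 is a common choice; the actual GPTQ and AWQ algorithms additionally pick the quantized values (GPTQ) or rescale salient channels (AWQ) to minimise layer output error, which is omitted here.

```python
# Per-group asymmetric INT4 quantization: one scale and one zero offset per group.
import torch

def quantize_per_group(w, group_size=128, n_bits=4):
    out_features, in_features = w.shape
    qmax = 2 ** n_bits - 1                       # INT4 codes: 0..15
    wg = w.reshape(out_features, in_features // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min) / qmax               # one scale per group
    zero = torch.round(-w_min / scale)           # one zero offset per group
    q = torch.clamp(torch.round(wg / scale) + zero, 0, qmax)
    w_dq = (q - zero) * scale                    # dequantized weights
    return q.to(torch.uint8), scale, zero, w_dq.reshape_as(w)

w = torch.randn(32, 256)
q, scale, zero, w_dq = quantize_per_group(w)
print("max abs error:", (w - w_dq).abs().max().item())
```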

FP8 (Hopper)

  • 8-bit Floating-Point Precision: TensorRT-LLM includes FP8 implementations for certain GPT models on Hopper GPUs, offering a middle ground between FP16 and INT8 quantization (see the sketch below).
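
A quick look at the E4M3 FP8 format, assuming a recent PyTorch (2.1 or later, which exposes torch.float8_e4m3fn): FP8 keeps a floating-point representation, so it tolerates a wider spread of magnitudes within a tensor than an INT8 grid, but with far less range and precision than FP16.

```python
# Inspect FP8 (E4M3) numeric properties and rounding behaviour (PyTorch >= 2.1).
import torch

for dtype in (torch.float16, torch.float8_e4m3fn):
    info = torch.finfo(dtype)
    print(f"{str(dtype):24s} max={info.max:8.1f}  eps={info.eps}")

# Casting to FP8 rounds values onto a much coarser floating-point grid.
x = torch.tensor([0.1234, 3.5, 300.0])
print(x.to(torch.float8_e4m3fn).to(torch.float32))
```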

Support Matrix

  • Broad Model Support: The current release supports a wide range of models in a variety of precision formats, including FP32, FP16, BF16, FP8, and the quantized formats described above.

Technical Detail: The QuantMode Flags

  • Control Flags: The quantization method and related settings are controlled by QuantMode flags, allowing precise configuration of the numerical precision and quantization methods used. The toy example below illustrates the general flag-combination pattern.
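
For illustration only: TensorRT-LLM defines the real QuantMode inside the tensorrt_llm package, and none of the names below are taken from it. The toy IntFlag simply shows the general pattern of encoding a quantization configuration as a combination of bit flags.

```python
# Hypothetical flag enum illustrating how a quantization mode can be composed.
from enum import IntFlag, auto

class ToyQuantMode(IntFlag):
    INT8_WEIGHTS = auto()
    INT4_WEIGHTS = auto()
    INT8_ACTIVATIONS = auto()
    PER_CHANNEL = auto()
    PER_TOKEN = auto()

# e.g. a SmoothQuant-like W8A8 setup with per-channel and per-token scales.
mode = (ToyQuantMode.INT8_WEIGHTS | ToyQuantMode.INT8_ACTIVATIONS
        | ToyQuantMode.PER_CHANNEL | ToyQuantMode.PER_TOKEN)
print(ToyQuantMode.INT8_ACTIVATIONS in mode)  # True: flag is set
print(ToyQuantMode.INT4_WEIGHTS in mode)      # False: flag is not set
```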

In summary, TensorRT-LLM provides a comprehensive suite of numerical precision options, enabling users to tailor the precision of their models according to their specific performance and accuracy requirements.
