Numerical Precision

TensorRT-LLM's numerical precision documentation details the numerical formats and quantization methods the library supports and how they are implemented.

Let's break down these methods and their specifics:

FP32, FP16, and BF16

  • Description: Models in TensorRT-LLM support IEEE 32-bit floating-point numbers (FP32), 16-bit floating-point numbers (FP16), and 16-bit brain floating-point numbers (BF16).

  • Use: Models can be trained or run using checkpoints stored in any of these formats, when available.
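
To make these trade-offs concrete, the short sketch below uses PyTorch's torch.finfo to compare the range and precision of the three formats; PyTorch is used here purely for illustration.

```python
import torch

# Compare the dynamic range and precision of the three floating-point
# formats that TensorRT-LLM checkpoints commonly use.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: bits={info.bits:2d}  max={info.max:.3e}  "
          f"smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# FP16 trades range for precision (max ~6.5e4, eps ~9.8e-4), while BF16 keeps
# FP32's exponent range (max ~3.4e38) at the cost of coarser precision (eps ~7.8e-3).
```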

Quantization and Dequantization (Q/DQ)

  • Functionality:

    • Quantization (Q): Converts a floating-point number into an 8-bit integer representation using a scaling factor.

    • Dequantization (DQ): Converts an 8-bit integer back into a floating-point number.

  • Modes:

    • Per-tensor: A single scaling factor for all elements.

    • Per-token: Individual scaling factors for each token.

    • Per-channel: Individual scaling factors for each channel.

  • Implementation: Multiplies the original floating-point value by the scaling factor, then rounds and saturates the result to the INT8 range [-128, 127].
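
As a minimal illustration of per-tensor Q/DQ, the NumPy sketch below picks a single absmax-based scaling factor, multiplies, rounds and saturates to INT8, and then reverses the mapping. The exact scaling conventions inside TensorRT-LLM's kernels may differ in detail.

```python
import numpy as np

def quantize_per_tensor(x):
    # One scaling factor for the whole tensor: map the largest magnitude to 127.
    scale = 127.0 / np.abs(x).max()
    # Multiply by the scaling factor, round, then saturate to the INT8 range.
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_per_tensor(q, scale):
    # Map the 8-bit integers back to approximate floating-point values.
    return q.astype(np.float32) / scale

x = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_per_tensor(x)
x_hat = dequantize_per_tensor(q, scale)
print("max abs reconstruction error:", np.abs(x - x_hat).max())
```

Per-token and per-channel modes follow the same pattern, but compute one scaling factor per row or per column of the tensor instead of a single global value.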

INT8 SmoothQuant (W8A8)

  • Concept: A method to maintain network accuracy when running inference using INT8 for both activations and weights.

  • Preprocessing: Requires an offline step that rescales the model weights (smoothing activation outliers into them) before quantization.

  • Support: Provided for models like GPT, GPT-J, and LLaMA with examples in the examples/quantization folder.
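
The core SmoothQuant idea can be sketched in a few lines: activation outliers are migrated into the weights with a per-channel smoothing factor so that both tensors quantize well to INT8. This is a simplified illustration of the published method, not the preprocessing code shipped in examples/quantization.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate activation outliers into the weights (SmoothQuant-style).

    X: activations, shape (tokens, in_features)
    W: weights, shape (in_features, out_features)
    """
    act_max = np.abs(X).max(axis=0)       # per input channel of the activations
    w_max = np.abs(W).max(axis=1)         # per input channel of the weights
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    # X @ W == (X / s) @ (s[:, None] * W): the product is unchanged, but the
    # smoothed activations have a much flatter per-channel range.
    return X / s, W * s[:, None]

X = np.random.randn(16, 64).astype(np.float32)
X[:, 3] *= 50.0                           # simulate an outlier channel
W = np.random.randn(64, 32).astype(np.float32)
X_s, W_s = smooth(X, W)
print("max abs difference:", np.abs(X @ W - X_s @ W_s).max())   # floating-point noise
```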

INT4 and INT8 Weight-Only (W4A16 and W8A16)

  • Description: Quantizes only the weights to INT4 or INT8; activations remain in FP16/BF16 and the weights are dequantized on the fly in the matrix multiplications.

  • Application: User must determine appropriate scaling factors for the model weights.

  • Support: Examples included for GPT and LLaMA models.
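
A simplified W8A16-style sketch of the weight-only approach is shown below: weights are stored in INT8 with per-channel scaling factors and dequantized back to FP16 at matmul time, while activations are never quantized. The real TensorRT-LLM kernels fuse the dequantization into the GEMM instead of materialising an FP16 weight matrix.

```python
import numpy as np

def quantize_weights_per_channel(W):
    # Per-output-channel scaling factors (here chosen by absmax).
    scales = np.abs(W).max(axis=0) / 127.0                 # shape (out_features,)
    Wq = np.clip(np.round(W / scales), -128, 127).astype(np.int8)
    return Wq, scales.astype(np.float16)

def weight_only_matmul(x_fp16, Wq, scales):
    # Activations stay in FP16; the INT8 weights are dequantized on the fly.
    W_deq = Wq.astype(np.float16) * scales
    return x_fp16 @ W_deq

W = np.random.randn(64, 32).astype(np.float32)
x = np.random.randn(4, 64).astype(np.float16)
Wq, scales = quantize_weights_per_channel(W)
y = weight_only_matmul(x, Wq, scales)
print(y.dtype, y.shape)                                    # float16 (4, 32)
```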

GPTQ and AWQ (W4A16)

  • Techniques:

    • GPTQ (Frantar et al.): Post-training quantization that uses per-group scaling factors and zero-offsetting in linear layers.

    • AWQ (Activation-aware Weight Quantization, Lin et al.): A similar group-wise W4A16 scheme that derives its scaling factors from activation statistics to protect the most salient weights.

  • Implementation: Supported via the WeightOnlyGroupwiseQuantMatmulPlugin and weight_only_groupwise_quant_matmul Python function.

  • Support: Experimental implementations for GPT-NeoX, LLaMA-v2, and GPT-J.
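
The storage format both methods rely on can be illustrated with a toy asymmetric per-group 4-bit quantizer (a per-group scaling factor plus a zero offset). The actual GPTQ and AWQ algorithms additionally decide how to adjust or select the weights, which this sketch does not attempt.

```python
import numpy as np

def quantize_int4_groupwise(w_col, group_size=128):
    # Quantize one weight column to unsigned 4-bit values, one scale/zero per group.
    w = w_col.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                 # 4-bit range: 0..15
    zero = np.round(-w_min / scale)                # per-group zero offset
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4_groupwise(q, scale, zero):
    return ((q.astype(np.float32) - zero) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, scale, zero = quantize_int4_groupwise(w)
print("mean abs error:", np.abs(w - dequantize_int4_groupwise(q, scale, zero)).mean())
```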

FP8 (Hopper)

  • Description: Implementations of 8-bit floating-point (FP8) for specific models.

  • Support: Examples available for GPT-NeMo, GPT-J, and LLaMA in the examples/quantization folder.
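
Conceptually, FP8 inference still relies on per-tensor scaling factors: values are scaled into the representable range of the FP8 format before the cast and rescaled afterwards. The sketch below only illustrates that scaling step, assuming the E4M3 variant (largest finite value 448); the cast itself is performed by Hopper hardware.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value in the FP8 E4M3 format

def fp8_per_tensor_scale(amax):
    # Scaling factor that maps the observed tensor range into the FP8 range.
    return E4M3_MAX / amax

x = np.random.randn(4096).astype(np.float32) * 10.0
scale = fp8_per_tensor_scale(np.abs(x).max())
x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)   # the hardware then casts to FP8
# After the FP8 GEMM, dividing by `scale` restores the original magnitude.
print(f"scale={scale:.4f}, scaled range=({x_scaled.min():.1f}, {x_scaled.max():.1f})")
```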

Support Matrix

  • Matrix Content: Lists the support status of various numerical precision methods across different models.

  • Categories: Includes FP32, FP16, BF16, FP8, W8A8 SQ (SmoothQuant), W8A16, W4A16, W4A16 AWQ, and W4A16 GPTQ.

Technical Detail: The QuantMode Flags

  • Purpose: Controls the quantization method used.

  • Flags:

    • INT4_WEIGHTS: Weights are quantized to 4 bits.

    • INT8_WEIGHTS: Weights are quantized to 8 bits.

    • ACTIVATIONS: Activations are quantized to 8 bits.

    • PER_CHANNEL: Scaling factors defined per channel.

    • PER_TOKEN: Scaling factors defined per token.

    • PER_GROUP: Scaling factors defined per group.

    • Additional flags control how the K/V cache is stored (for example in INT8 or FP8) and the fusion of Q/DQ nodes.
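
These flags combine as bits on a single mode value. The sketch below re-creates that combination logic with Python's IntFlag purely for illustration; it mirrors the flag names listed above but is not the tensorrt_llm.quantization.QuantMode class itself.

```python
from enum import IntFlag, auto

class QuantModeSketch(IntFlag):
    # Illustrative re-creation of the flags described above (not the library class).
    INT4_WEIGHTS = auto()    # weights quantized to 4 bits
    INT8_WEIGHTS = auto()    # weights quantized to 8 bits
    ACTIVATIONS = auto()     # activations quantized to 8 bits
    PER_CHANNEL = auto()     # per-channel scaling factors
    PER_TOKEN = auto()       # per-token scaling factors
    PER_GROUP = auto()       # per-group scaling factors

# SmoothQuant-style W8A8 with per-token and per-channel scaling:
smoothquant_mode = (QuantModeSketch.INT8_WEIGHTS | QuantModeSketch.ACTIVATIONS
                    | QuantModeSketch.PER_TOKEN | QuantModeSketch.PER_CHANNEL)

# GPTQ/AWQ-style W4A16 group-wise weight-only quantization:
weight_only_mode = QuantModeSketch.INT4_WEIGHTS | QuantModeSketch.PER_GROUP

print(bool(smoothquant_mode & QuantModeSketch.ACTIVATIONS))   # True
print(bool(weight_only_mode & QuantModeSketch.ACTIVATIONS))   # False
```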
