Numerical Precision

TensorRT-LLM's numerical precision documentation details the numerical formats and quantization methods the library supports and how they are implemented.

Let's break down these methods and their specifics:

FP32, FP16, and BF16

  • Description: Models in TensorRT-LLM support IEEE 32-bit floating-point numbers (FP32), 16-bit floating-point numbers (FP16), and 16-bit brain floating-point numbers (BF16).

  • Use: Models can be trained or run using checkpoints stored in any of these formats, when available.
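
To make these trade-offs concrete, the short sketch below uses PyTorch's torch.finfo to compare the range and precision of the three formats; PyTorch is used here purely for illustration.

```python
import torch

# Compare the dynamic range and precision of the three floating-point
# formats that TensorRT-LLM checkpoints commonly use.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: bits={info.bits:2d}  max={info.max:.3e}  "
          f"smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# FP16 trades range for precision (max ~6.5e4, eps ~9.8e-4), while BF16 keeps
# FP32's exponent range (max ~3.4e38) at the cost of coarser precision (eps ~7.8e-3).
```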

Quantization and Dequantization (Q/DQ)

  • Functionality:

    • Quantization (Q): Converts a floating-point number into an 8-bit integer representation using a scaling factor.

    • Dequantization (DQ): Converts an 8-bit integer back into a floating-point number.

  • Modes:

    • Per-tensor: A single scaling factor for all elements.

    • Per-token: Individual scaling factors for each token.

    • Per-channel: Individual scaling factors for each channel.

  • Implementation: Multiplies the original floating-point value by the scaling factor, then rounds and saturates the result to the INT8 range [-128, 127].
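
As a minimal illustration of per-tensor Q/DQ, the NumPy sketch below picks a single absmax-based scaling factor, multiplies, rounds and saturates to INT8, and then reverses the mapping. The exact scaling conventions inside TensorRT-LLM's kernels may differ in detail.

```python
import numpy as np

def quantize_per_tensor(x):
    # One scaling factor for the whole tensor: map the largest magnitude to 127.
    scale = 127.0 / np.abs(x).max()
    # Multiply by the scaling factor, round, then saturate to the INT8 range.
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_per_tensor(q, scale):
    # Map the 8-bit integers back to approximate floating-point values.
    return q.astype(np.float32) / scale

x = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_per_tensor(x)
x_hat = dequantize_per_tensor(q, scale)
print("max abs reconstruction error:", np.abs(x - x_hat).max())
```

Per-token and per-channel modes follow the same pattern, but compute one scaling factor per row or per column of the tensor instead of a single global value.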

INT8 SmoothQuant (W8A8)

  • Concept: A method to maintain network accuracy when running inference using INT8 for both activations and weights.

  • Preprocessing: Requires an offline step that rescales the model weights (smoothing activation outliers into them) before quantization.

  • Support: Provided for models like GPT, GPT-J, and LLaMA with examples in the examples/quantization folder.
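
The core SmoothQuant idea can be sketched in a few lines: activation outliers are migrated into the weights with a per-channel smoothing factor so that both tensors quantize well to INT8. This is a simplified illustration of the published method, not the preprocessing code shipped in examples/quantization.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate activation outliers into the weights (SmoothQuant-style).

    X: activations, shape (tokens, in_features)
    W: weights, shape (in_features, out_features)
    """
    act_max = np.abs(X).max(axis=0)       # per input channel of the activations
    w_max = np.abs(W).max(axis=1)         # per input channel of the weights
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    # X @ W == (X / s) @ (s[:, None] * W): the product is unchanged, but the
    # smoothed activations have a much flatter per-channel range.
    return X / s, W * s[:, None]

X = np.random.randn(16, 64).astype(np.float32)
X[:, 3] *= 50.0                           # simulate an outlier channel
W = np.random.randn(64, 32).astype(np.float32)
X_s, W_s = smooth(X, W)
print("max abs difference:", np.abs(X @ W - X_s @ W_s).max())   # floating-point noise
```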

INT4 and INT8 Weight-Only (W4A16 and W8A16)

  • Description: Quantizes only the weights to INT4 or INT8; activations remain in FP16/BF16 and the weights are dequantized on the fly in the matrix multiplications.

  • Application: User must determine appropriate scaling factors for the model weights.

  • Support: Examples included for GPT and LLaMA models.
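
A simplified W8A16-style sketch of the weight-only approach is shown below: weights are stored in INT8 with per-channel scaling factors and dequantized back to FP16 at matmul time, while activations are never quantized. The real TensorRT-LLM kernels fuse the dequantization into the GEMM instead of materialising an FP16 weight matrix.

```python
import numpy as np

def quantize_weights_per_channel(W):
    # Per-output-channel scaling factors (here chosen by absmax).
    scales = np.abs(W).max(axis=0) / 127.0                 # shape (out_features,)
    Wq = np.clip(np.round(W / scales), -128, 127).astype(np.int8)
    return Wq, scales.astype(np.float16)

def weight_only_matmul(x_fp16, Wq, scales):
    # Activations stay in FP16; the INT8 weights are dequantized on the fly.
    W_deq = Wq.astype(np.float16) * scales
    return x_fp16 @ W_deq

W = np.random.randn(64, 32).astype(np.float32)
x = np.random.randn(4, 64).astype(np.float16)
Wq, scales = quantize_weights_per_channel(W)
y = weight_only_matmul(x, Wq, scales)
print(y.dtype, y.shape)                                    # float16 (4, 32)
```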

GPTQ and AWQ (W4A16)

  • Techniques:

    • GPTQ (Frantar et al.): Post-training quantization that uses per-group scaling factors and zero-offsetting in linear layers.

    • AWQ (Activation-aware Weight Quantization, Lin et al.): A similar group-wise W4A16 scheme that derives its scaling factors from activation statistics to protect the most salient weights.

  • Implementation: Supported via the WeightOnlyGroupwiseQuantMatmulPlugin and weight_only_groupwise_quant_matmul Python function.

  • Support: Experimental implementations for GPT-NeoX, LLaMA-v2, and GPT-J.
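
The storage format both methods rely on can be illustrated with a toy asymmetric per-group 4-bit quantizer (a per-group scaling factor plus a zero offset). The actual GPTQ and AWQ algorithms additionally decide how to adjust or select the weights, which this sketch does not attempt.

```python
import numpy as np

def quantize_int4_groupwise(w_col, group_size=128):
    # Quantize one weight column to unsigned 4-bit values, one scale/zero per group.
    w = w_col.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                 # 4-bit range: 0..15
    zero = np.round(-w_min / scale)                # per-group zero offset
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4_groupwise(q, scale, zero):
    return ((q.astype(np.float32) - zero) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, scale, zero = quantize_int4_groupwise(w)
print("mean abs error:", np.abs(w - dequantize_int4_groupwise(q, scale, zero)).mean())
```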

FP8 (Hopper)

  • Description: Implementations of 8-bit floating-point (FP8) for specific models.

  • Support: Examples available for GPT-NeMo, GPT-J, and LLaMA in the examples/quantization folder.
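
Conceptually, FP8 inference still relies on per-tensor scaling factors: values are scaled into the representable range of the FP8 format before the cast and rescaled afterwards. The sketch below only illustrates that scaling step, assuming the E4M3 variant (largest finite value 448); the cast itself is performed by Hopper hardware.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value in the FP8 E4M3 format

def fp8_per_tensor_scale(amax):
    # Scaling factor that maps the observed tensor range into the FP8 range.
    return E4M3_MAX / amax

x = np.random.randn(4096).astype(np.float32) * 10.0
scale = fp8_per_tensor_scale(np.abs(x).max())
x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)   # the hardware then casts to FP8
# After the FP8 GEMM, dividing by `scale` restores the original magnitude.
print(f"scale={scale:.4f}, scaled range=({x_scaled.min():.1f}, {x_scaled.max():.1f})")
```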

Support Matrix

  • Matrix Content: Lists the support status of various numerical precision methods across different models.

  • Categories: Includes FP32, FP16, BF16, FP8, W8A8 SQ (SmoothQuant), W8A16, W4A16, W4A16 AWQ, and W4A16 GPTQ.

Technical Detail: The QuantMode Flags

  • Purpose: Controls the quantization method used.

  • Flags:

    • INT4_WEIGHTS: Weights are quantized to 4 bits.

    • INT8_WEIGHTS: Weights are quantized to 8 bits.

    • ACTIVATIONS: Activations are quantized to 8 bits.

    • PER_CHANNEL: Scaling factors defined per channel.

    • PER_TOKEN: Scaling factors defined per token.

    • PER_GROUP: Scaling factors defined per group.

    • Additional flags control how the K/V cache is stored (for example in INT8 or FP8) and the fusion of Q/DQ nodes.
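
These flags combine as bits on a single mode value. The sketch below re-creates that combination logic with Python's IntFlag purely for illustration; it mirrors the flag names listed above but is not the tensorrt_llm.quantization.QuantMode class itself.

```python
from enum import IntFlag, auto

class QuantModeSketch(IntFlag):
    # Illustrative re-creation of the flags described above (not the library class).
    INT4_WEIGHTS = auto()    # weights quantized to 4 bits
    INT8_WEIGHTS = auto()    # weights quantized to 8 bits
    ACTIVATIONS = auto()     # activations quantized to 8 bits
    PER_CHANNEL = auto()     # per-channel scaling factors
    PER_TOKEN = auto()       # per-token scaling factors
    PER_GROUP = auto()       # per-group scaling factors

# SmoothQuant-style W8A8 with per-token and per-channel scaling:
smoothquant_mode = (QuantModeSketch.INT8_WEIGHTS | QuantModeSketch.ACTIVATIONS
                    | QuantModeSketch.PER_TOKEN | QuantModeSketch.PER_CHANNEL)

# GPTQ/AWQ-style W4A16 group-wise weight-only quantization:
weight_only_mode = QuantModeSketch.INT4_WEIGHTS | QuantModeSketch.PER_GROUP

print(bool(smoothquant_mode & QuantModeSketch.ACTIVATIONS))   # True
print(bool(weight_only_mode & QuantModeSketch.ACTIVATIONS))   # False
```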
