Numerical Precision

TensorRT-LLM supports a range of numerical precision formats and quantization methods, catering to different computational requirements and accuracy needs. Here's a summary of these methods:

FP32, FP16, and BF16

  • FP32 (32-bit IEEE floating-point): The standard precision used by most models.

  • FP16 (16-bit IEEE floating-point): Provides a balance between performance and precision; used when FP16 checkpoints are available.

  • BF16 (Bfloat16): Also 16 bits wide, but it trades mantissa precision for the wider exponent range of FP32, making it less prone to overflow than FP16. The sketch below compares the two 16-bit formats.
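
The short sketch below (a minimal example assuming PyTorch is available; not TensorRT-LLM code) compares the two 16-bit formats: BF16 keeps FP32's exponent range, so values that overflow FP16 remain finite, at the cost of fewer mantissa bits.

```python
# Compare numeric properties of FP32, FP16 and BF16 (requires PyTorch).
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  eps={info.eps:.3e}")

# 70,000 overflows FP16 (max ~65,504) but stays finite in BF16,
# where it is merely rounded to the nearest representable value.
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # ~70144.0
```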

Quantization and Dequantization (Q/DQ)

  • INT8 Quantization: Converts floating-point values to 8-bit integers, reducing model size and accelerating inference while maintaining acceptable accuracy.

  • Three scaling modes, illustrated in the sketch after this list:

    • Per-tensor: Single scaling factor for the entire tensor.

    • Per-token: Different scaling factors for each token.

    • Per-channel: Different scaling factors for each channel.
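
The following is a minimal sketch of symmetric INT8 quantization and dequantization, assuming PyTorch; the helper functions are illustrative, not TensorRT-LLM APIs. It shows how moving from a single per-tensor scale to per-channel scales typically reduces reconstruction error; per-token scaling applies the same idea along the token axis of activations.

```python
# Symmetric INT8 quantize/dequantize (Q/DQ) with different scaling granularities.
import torch

def quantize(x, scale):
    # Map floats onto the INT8 range [-127, 127] using the given scale.
    return torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)  # a small weight matrix [out_channels, in_features]

# Per-tensor: a single scaling factor for the whole tensor.
scale_tensor = w.abs().max() / 127.0
w_dq_tensor = dequantize(quantize(w, scale_tensor), scale_tensor)

# Per-channel: one scaling factor per output channel (row).
scale_channel = w.abs().amax(dim=1, keepdim=True) / 127.0
w_dq_channel = dequantize(quantize(w, scale_channel), scale_channel)

print("per-tensor  MSE:", torch.mean((w - w_dq_tensor) ** 2).item())
print("per-channel MSE:", torch.mean((w - w_dq_channel) ** 2).item())  # usually lower
```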

INT8 SmoothQuant (W8A8)

  • SmoothQuant: Enables INT8 for both activations and weights (W8A8) without a significant loss of network accuracy. It requires an offline preprocessing step that rescales the weights, migrating quantization difficulty from the activations to the weights (see the sketch below).
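
A minimal sketch of the core SmoothQuant idea, following the formulation in the SmoothQuant paper (alpha = 0.5 is the paper's default migration strength); the variable names are illustrative. Per-channel factors move scale from the activations onto the weights without changing the layer output, so both tensors become easier to quantize to INT8.

```python
# SmoothQuant-style rescaling: X @ W == (X / s) @ (W * s), per input channel j.
import torch

alpha = 0.5                         # migration strength (paper default)
X = torch.randn(16, 8).abs() * 10   # activations with large channels
W = torch.randn(8, 4)               # weights of the following linear layer

# s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one factor per input channel.
s = X.abs().amax(dim=0) ** alpha / W.abs().amax(dim=1) ** (1 - alpha)

X_smooth = X / s            # activations shrink toward the weight scale
W_smooth = W * s[:, None]   # weights absorb the factors
print(torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3))  # output unchanged
```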

INT4 and INT8 Weight-Only (W4A16 and W8A16)

  • Quantizing Weights Only: In these techniques, only the weights of the model are quantized to INT4 or INT8; the activations remain in a higher-precision format such as FP16 or BF16, and the weights are dequantized on the fly before the matrix multiplications.

GPTQ and AWQ (W4A16)

  • Advanced Quantization Methods: GPTQ and AWQ are weight-only techniques that use per-group scaling factors and zero-offsetting in linear layers, as described in their respective papers. The sketch below shows the per-group storage scheme they rely on.
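
A minimal sketch, assuming PyTorch, of per-group quantization with a zero offset, i.e. the storage scheme that GPTQ/AWQ-style weight-only methods rely on. The group size of 128 is a common choice; the actual GPTQ and AWQ algorithms additionally pick the quantized values (GPTQ) or rescale salient channels (AWQ) to minimise layer output error, which is omitted here.

```python
# Per-group asymmetric INT4 quantization: one scale and one zero offset per group.
import torch

def quantize_per_group(w, group_size=128, n_bits=4):
    out_features, in_features = w.shape
    qmax = 2 ** n_bits - 1                       # INT4 codes: 0..15
    wg = w.reshape(out_features, in_features // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min) / qmax               # one scale per group
    zero = torch.round(-w_min / scale)           # one zero offset per group
    q = torch.clamp(torch.round(wg / scale) + zero, 0, qmax)
    w_dq = (q - zero) * scale                    # dequantized weights
    return q.to(torch.uint8), scale, zero, w_dq.reshape_as(w)

w = torch.randn(32, 256)
q, scale, zero, w_dq = quantize_per_group(w)
print("max abs error:", (w - w_dq).abs().max().item())
```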

FP8 (Hopper)

  • 8-bit Floating-Point Precision: TensorRT-LLM includes FP8 implementations for certain GPT models on Hopper GPUs, offering a middle ground between FP16 and INT8 quantization (see the sketch below).
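
A quick look at the E4M3 FP8 format, assuming a recent PyTorch (2.1 or later, which exposes torch.float8_e4m3fn): FP8 keeps a floating-point representation, so it tolerates a wider spread of magnitudes within a tensor than an INT8 grid, but with far less range and precision than FP16.

```python
# Inspect FP8 (E4M3) numeric properties and rounding behaviour (PyTorch >= 2.1).
import torch

for dtype in (torch.float16, torch.float8_e4m3fn):
    info = torch.finfo(dtype)
    print(f"{str(dtype):24s} max={info.max:8.1f}  eps={info.eps}")

# Casting to FP8 rounds values onto a much coarser floating-point grid.
x = torch.tensor([0.1234, 3.5, 300.0])
print(x.to(torch.float8_e4m3fn).to(torch.float32))
```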

Support Matrix

  • Broad Model Support: The current release supports a wide range of models in a variety of precision formats, including FP32, FP16, BF16, FP8, and the quantized formats described above.

Technical Detail: The QuantMode Flags

  • Control Flags: The quantization method and related settings are controlled by QuantMode flags, allowing precise configuration of the numerical precision and quantization methods used. The toy example below illustrates the general flag-combination pattern.
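
For illustration only: TensorRT-LLM defines the real QuantMode inside the tensorrt_llm package, and none of the names below are taken from it. The toy IntFlag simply shows the general pattern of encoding a quantization configuration as a combination of bit flags.

```python
# Hypothetical flag enum illustrating how a quantization mode can be composed.
from enum import IntFlag, auto

class ToyQuantMode(IntFlag):
    INT8_WEIGHTS = auto()
    INT4_WEIGHTS = auto()
    INT8_ACTIVATIONS = auto()
    PER_CHANNEL = auto()
    PER_TOKEN = auto()

# e.g. a SmoothQuant-like W8A8 setup with per-channel and per-token scales.
mode = (ToyQuantMode.INT8_WEIGHTS | ToyQuantMode.INT8_ACTIVATIONS
        | ToyQuantMode.PER_CHANNEL | ToyQuantMode.PER_TOKEN)
print(ToyQuantMode.INT8_ACTIVATIONS in mode)  # True: flag is set
print(ToyQuantMode.INT4_WEIGHTS in mode)      # False: flag is not set
```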

In summary, TensorRT-LLM provides a comprehensive suite of numerical precision options, enabling users to tailor the precision of their models according to their specific performance and accuracy requirements.
