Numerical Precision

TensorRT-LLM provides various methods for numerical precision in the implementation of its models, catering to different computational requirements and accuracy needs. Here's a summary of these methods:

FP32, FP16, and BF16

  • FP32 (32-bit IEEE floating-point): This is the standard precision used in most models.

  • FP16 (16-bit IEEE floating-point): Provides a balance between performance and precision. Supported for models that have checkpoints available in this format.

  • BF16 (16-bit Bfloat16): The same width as FP16, but with 8 exponent bits and 7 mantissa bits instead of FP16's 5 and 10. It keeps the dynamic range of FP32 at the cost of fewer significant digits.
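The difference between the two 16-bit formats can be seen by rounding the same value into each. The sketch below uses only the Python standard library; the helper names `to_fp16` and `to_bf16` are made up for illustration, and the BF16 conversion simply truncates the low 16 bits of the FP32 encoding, which is an assumption made for brevity (real hardware usually rounds to nearest).

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE FP16 value (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Convert x to BF16 (8 exponent bits, 7 mantissa bits) by truncation.

    BF16 is the upper half of the FP32 bit pattern, so we encode x as FP32
    and zero out the lower 16 bits. Truncation is a simplification.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# FP16 keeps more significant digits for small values ...
print(to_fp16(0.1))   # 0.0999755859375
print(to_bf16(0.1))   # 0.099609375

# ... but BF16 covers a much wider range: 1e5 exceeds the FP16 maximum (65504).
print(to_bf16(1e5))   # 99840.0
try:
    to_fp16(1e5)
except OverflowError:
    print("1e5 does not fit in FP16")
```

In short, FP16 is the more precise format within its range, while BF16 trades precision for the ability to represent very large and very small magnitudes.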

Quantization and Dequantization (Q/DQ)

  • INT8 Quantization: Converts floating-point numbers to 8-bit integers to reduce model size and accelerate inference.

  • Three Modes:

    • Per-tensor: A single scaling factor for the entire tensor.

    • Per-token: A separate scaling factor for each token.

    • Per-channel: A separate scaling factor for each channel.
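To make the three modes concrete, here is a minimal sketch of symmetric INT8 quantization on a small matrix. It assumes rows are tokens and columns are channels, and it derives each scale from the absolute maximum of the values it covers; both choices are assumptions for illustration and not necessarily how TensorRT-LLM computes its scales.

```python
def absmax_scale(values):
    """Symmetric INT8 scale: the largest magnitude maps to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127.0 if peak > 0 else 1.0

def quantize(value, scale):
    """Divide by the scale, round to the nearest integer, clamp to [-127, 127]."""
    return max(-127, min(127, round(value / scale)))

# Assumed layout: rows are tokens, columns are channels.
x = [[0.3, -1.2, 2.0],
     [10.0, 25.0, -40.0]]

# Per-tensor: one scale shared by every element.
per_tensor = absmax_scale([v for row in x for v in row])
q_tensor = [[quantize(v, per_tensor) for v in row] for row in x]

# Per-token: one scale per row.
per_token = [absmax_scale(row) for row in x]
q_token = [[quantize(v, s) for v in row] for row, s in zip(x, per_token)]

# Per-channel: one scale per column.
per_channel = [absmax_scale(col) for col in zip(*x)]
q_channel = [[quantize(v, s) for v, s in zip(row, per_channel)] for row in x]

print(q_tensor[0])  # [1, -4, 6]      small token is squeezed into a few levels
print(q_token[0])   # [19, -76, 127]  same token with its own scale
```

With a single per-tensor scale, the large second token dictates the scale and the first token keeps very little resolution. Finer-grained scales avoid this at the cost of storing more scaling factors.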

INT8 SmoothQuant (W8A8)

  • SmoothQuant: Preserves network accuracy while using INT8 for both activations and weights. Requires preprocessing of weights.
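The weight preprocessing can be sketched as follows. In the SmoothQuant paper, each input channel j gets a smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by s_j and the matching weight rows are multiplied by it, so the product is unchanged while activation outliers shrink. The code below is a pure-Python illustration with made-up helper names, and it takes the activation maxima from the sample input itself, whereas in practice they would come from calibration data.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def smooth(x, w, alpha=0.5):
    """Move quantization difficulty from activations x to weights w.

    x has shape [tokens][in_channels], w has shape [in_channels][out_channels].
    """
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    x_smooth = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_smooth = [[v * s[j] for v in w[j]] for j in range(n_in)]
    return x_smooth, w_smooth, s

# Channel 1 of the activations carries outliers (64, 32).
x = [[1.0, 64.0], [-2.0, 32.0]]
w = [[0.5, -1.0], [0.25, 0.5]]

x_s, w_s, s = smooth(x, w)
print(max(abs(row[1]) for row in x_s))  # outlier channel is now far below 64
print(matmul(x, w))
print(matmul(x_s, w_s))                 # same result up to rounding error
```

After smoothing, both tensors have a more even range and are easier to quantize to INT8; the scaled weights can be computed once, offline.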

The sections below cover each of these options in more detail, from standard floating-point precision to more compact and faster quantized formats.

FP32, FP16, and BF16

  • Standard Precision: Models in TensorRT-LLM typically use 32-bit IEEE floating-point (FP32) precision.

  • Reduced Precision: Support for 16-bit IEEE floating-point (FP16) and Bfloat16 (BF16) is available, offering a balance between computational efficiency and numerical accuracy.

Quantization and Dequantization (Q/DQ)

  • INT8 Quantization: Involves converting floating-point numbers to 8-bit integers, a process that reduces model size and speeds up inference while maintaining acceptable levels of accuracy.

  • Scaling Factors: Quantization can be applied per-tensor, per-token, or per-channel, with each method employing different scaling factors for the conversion process.

INT8 SmoothQuant (W8A8)

  • Technique for Accuracy Preservation: This method enables inference using INT8 for both activations and weights without significant loss in accuracy, as described in the SmoothQuant paper.

INT4 and INT8 Weight-Only (W4A16 and W8A16)

  • Quantizing Weights Only: In these techniques, only the weights of the model are quantized, with the activations remaining in higher precision formats like FP16 or BF16.
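A weight-only scheme can be sketched as: quantize the weights once, store them compactly, and dequantize them back to floating point when they are used, while activations are never quantized. The example below shows this for INT4 with a single scale and two values packed per byte. The packing order (low nibble first) and the helper names are assumptions for illustration, not the layout TensorRT-LLM uses.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization with one scale; values land in [-8, 7]."""
    peak = max(abs(w) for w in weights)
    scale = peak / 7.0 if peak > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def pack_int4(q):
    """Pack pairs of signed 4-bit values into bytes, low nibble first."""
    padded = q + [0] if len(q) % 2 else q
    return bytes(((hi & 0xF) << 4) | (lo & 0xF)
                 for lo, hi in zip(padded[0::2], padded[1::2]))

def unpack_int4(packed, count):
    """Recover the signed 4-bit values from their packed form."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]

weights = [0.7, -0.3, 0.1, -0.7]
q, scale = quantize_int4(weights)
packed = pack_int4(q)

# At inference time the weights are expanded back to floating point.
restored = [v * scale for v in unpack_int4(packed, len(weights))]

print(q)            # [7, -3, 1, -7]
print(len(packed))  # 2 bytes for 4 weights
```

The storage saving is the main benefit: four weights occupy two bytes here instead of eight bytes in FP16.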

GPTQ and AWQ (W4A16)

  • Advanced Quantization Methods: GPTQ and AWQ are techniques that use per-group scaling factors and zero-point offsets in linear layers, as described in their respective papers.
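The storage scheme that both methods share, a scale and a zero point for each small group of weights, can be illustrated with plain round-to-nearest quantization. This sketch does not implement GPTQ or AWQ themselves, which differ in how they choose the quantized values and scales; the function names and the group size are assumptions for illustration.

```python
def quantize_group(group, bits=4):
    """Asymmetric quantization of one group to unsigned integers in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    lo = min(min(group), 0.0)   # keep 0.0 representable
    hi = max(max(group), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)   # integer that represents the real value 0.0
    q = [max(0, min(qmax, round(v / scale) + zero)) for v in group]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    return [(v - zero) * scale for v in q]

def roundtrip(weights, group_size):
    """Quantize in groups of group_size, then dequantize back to floats."""
    out = []
    for i in range(0, len(weights), group_size):
        q, scale, zero = quantize_group(weights[i:i + group_size])
        out.extend(dequantize_group(q, scale, zero))
    return out

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

weights = [0.0, 0.1, 0.2, 0.3, -4.0, 0.0, 4.0, 8.0]

err_grouped = max_error(weights, roundtrip(weights, group_size=4))
err_single = max_error(weights, roundtrip(weights, group_size=len(weights)))

print(err_grouped < err_single)  # True
```

With one scale for the whole vector, the small weights in the first half collapse to zero. Giving each group its own scale and zero point lets each group use the full 4-bit range.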

FP8 (Hopper)

  • 8-bit Floating-Point Precision: This release of TensorRT-LLM includes implementations of FP8 precision for certain GPT models. FP8 uses 8 bits per value like INT8 but keeps a floating-point exponent, placing it between FP16 and INT8 quantization.
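To give a feel for how coarse an 8-bit floating-point grid is, the sketch below rounds ordinary Python floats to the nearest value representable in the E4M3 variant of FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448). Choosing E4M3 and saturating on overflow are assumptions made for this illustration; the function is a software simulation and not part of TensorRT-LLM.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value
E4M3_MIN_EXP = -6    # exponent of the smallest normal value
MANTISSA_BITS = 3

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, E4M3_MIN_EXP)         # below 2**-6 the spacing stays fixed
    step = 2.0 ** (exp - MANTISSA_BITS)    # gap between neighbouring values
    return sign * min(round(mag / step) * step, E4M3_MAX)

print(round_to_e4m3(0.1))     # 0.1015625
print(round_to_e4m3(-300.0))  # -288.0
print(round_to_e4m3(1000.0))  # 448.0 (saturated)
```

Unlike INT8, the spacing between representable values grows with magnitude, so small values keep relative precision without a separately tuned scale for each range.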

Support Matrix

  • Broad Model Support: The current release supports a wide range of models in various precision formats, including FP32, FP16, BF16, FP8, and various quantized formats.

Technical Detail: The QuantMode Flags

  • Control Flags: The quantization method and other settings are controlled by QuantMode flags, allowing for precise configuration of the numerical precision and quantization methods used.
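The idea of describing a quantization configuration as a combination of bit flags can be sketched with Python's `enum.IntFlag`. The class `PrecisionFlags`, its members, and the `describe` helper below are hypothetical and exist only to illustrate the pattern; they are not the definition of `QuantMode` in TensorRT-LLM.

```python
from enum import IntFlag, auto

class PrecisionFlags(IntFlag):
    """Hypothetical flag set; each member occupies one bit."""
    INT4_WEIGHTS = auto()
    INT8_WEIGHTS = auto()
    ACTIVATIONS = auto()    # activations are quantized as well
    PER_CHANNEL = auto()
    PER_TOKEN = auto()
    PER_GROUP = auto()

def describe(flags: PrecisionFlags) -> str:
    """Map a flag combination to the shorthand used in this document."""
    if flags & PrecisionFlags.INT8_WEIGHTS and flags & PrecisionFlags.ACTIVATIONS:
        return "W8A8"
    if flags & PrecisionFlags.INT4_WEIGHTS:
        return "W4A16"
    if flags & PrecisionFlags.INT8_WEIGHTS:
        return "W8A16"
    return "unquantized"

smoothquant = (PrecisionFlags.INT8_WEIGHTS | PrecisionFlags.ACTIVATIONS
               | PrecisionFlags.PER_TOKEN | PrecisionFlags.PER_CHANNEL)
grouped_int4 = PrecisionFlags.INT4_WEIGHTS | PrecisionFlags.PER_GROUP

print(describe(smoothquant))        # W8A8
print(describe(grouped_int4))       # W4A16
print(describe(PrecisionFlags(0)))  # unquantized
```

Because the flags are independent bits, one integer can carry the weight format, whether activations are quantized, and the scaling granularity at the same time.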

In summary, TensorRT-LLM provides a comprehensive suite of numerical precision options, enabling users to tailor the precision of their models according to their specific performance and accuracy requirements.
