Numerical Precision
TensorRT-LLM supports several numerical precisions in the implementation of its models, catering to different computational requirements and accuracy needs. Here's a summary of these options:
FP32, FP16, and BF16
FP32 (32-bit IEEE floating-point): This is the standard precision used in most models.
FP16 (16-bit IEEE floating-point): Provides a balance between performance and precision; used when model checkpoints are available in that data type.
BF16 (16-bit Bfloat16): Same storage size as FP16, but it trades mantissa precision for the wider dynamic range of an FP32-style 8-bit exponent (see the comparison below).
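As a rough illustration of the trade-off, here is a minimal sketch using PyTorch's torch.finfo (not part of TensorRT-LLM itself):

```python
import torch

# Compare the three floating-point formats mentioned above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} max={info.max:.3e}  eps={info.eps:.3e}")

# Typical output:
#   torch.float32   max=3.403e+38  eps=1.192e-07
#   torch.float16   max=6.550e+04  eps=9.766e-04
#   torch.bfloat16  max=3.390e+38  eps=7.812e-03
# BF16 keeps almost the full FP32 range but with coarser precision (larger eps);
# FP16 is more precise than BF16 but overflows above ~65,000.
```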
Quantization and Dequantization (Q/DQ)
INT8 Quantization: Converts floating-point numbers to 8-bit integers to reduce model size and accelerate inference.
Three modes of scaling granularity (see the sketch after this list):
Per-tensor: Single scaling factor for the entire tensor.
Per-token: Different scaling factors for each token.
Per-channel: Different scaling factors for each channel.
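A minimal NumPy sketch of the three granularities (illustrative only; TensorRT-LLM performs these conversions inside fused kernels, and the helper names here are made up for the example):

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric INT8 quantization: q = round(x / scale), clipped to [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)   # e.g. [tokens, channels]

# Per-tensor: one scaling factor for the whole tensor.
s_tensor = np.abs(x).max() / 127.0
q_tensor = quantize_int8(x, s_tensor)

# Per-token: one scaling factor per row (token).
s_token = np.abs(x).max(axis=1, keepdims=True) / 127.0
q_token = quantize_int8(x, s_token)

# Per-channel: one scaling factor per column (channel).
s_channel = np.abs(x).max(axis=0, keepdims=True) / 127.0
q_channel = quantize_int8(x, s_channel)

# Finer granularity generally gives a lower dequantization error.
for name, q, s in [("per-tensor", q_tensor, s_tensor),
                   ("per-token", q_token, s_token),
                   ("per-channel", q_channel, s_channel)]:
    err = np.abs(dequantize(q, s) - x).max()
    print(f"{name:12} max abs error = {err:.5f}")
```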
INT8 SmoothQuant (W8A8)
SmoothQuant: Preserves network accuracy while using INT8 for both activations and weights (W8A8). The technique migrates quantization difficulty from the activations to the weights, which requires an offline preprocessing (rescaling) of the weights, as sketched below.
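The core idea, sketched with NumPy under hypothetical shapes (this is not TensorRT-LLM code): a per-input-channel smoothing factor s divides the activations and is folded into the corresponding weight rows, so the matrix product is unchanged while activation outliers are damped.

```python
import numpy as np

tokens, d_in, d_out = 4, 8, 16
X = np.random.randn(tokens, d_in).astype(np.float32) * 10.0   # activations with large channels
W = np.random.randn(d_in, d_out).astype(np.float32)

# Per-input-channel smoothing factor (alpha = 0.5 is the common default in the paper).
alpha = 0.5
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1.0 - alpha))

X_smooth = X / s           # activations become easier to quantize (outliers damped)
W_smooth = W * s[:, None]  # the scale is folded into the weights offline

# The mathematical result of the matmul is unchanged.
assert np.allclose(X @ W, X_smooth @ W_smooth, rtol=1e-4, atol=1e-4)
```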
INT4 and INT8 Weight-Only (W4A16 and W8A16)
Quantizing Weights Only: In these techniques, only the weights of the model are quantized to INT4 or INT8; the activations remain in a higher-precision format such as FP16 or BF16, and the weights are dequantized on the fly inside the GEMM.
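A rough NumPy sketch of W8A16 with hypothetical per-output-channel scales (the real kernels dequantize inside a fused GEMM rather than materializing the FP16 weights):

```python
import numpy as np

d_in, d_out, tokens = 8, 16, 4
W = np.random.randn(d_in, d_out).astype(np.float32)
X = np.random.randn(tokens, d_in).astype(np.float16)       # activations stay in FP16

# Offline: quantize only the weights, one scale per output channel.
w_scale = np.abs(W).max(axis=0) / 127.0                     # shape [d_out]
W_int8 = np.clip(np.round(W / w_scale), -127, 127).astype(np.int8)

# At inference time: dequantize the weights back to FP16 and run the GEMM in FP16.
# W4A16 follows the same idea with a smaller integer range and two codes packed per byte.
W_fp16 = (W_int8.astype(np.float32) * w_scale).astype(np.float16)
Y = X @ W_fp16                                              # activations are never quantized

print("max abs error vs. FP32 reference:",
      np.abs(Y.astype(np.float32) - X.astype(np.float32) @ W).max())
```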
GPTQ and AWQ (W4A16)
Advanced Quantization Methods: GPTQ and AWQ quantize weights using per-group scaling factors and zero offsets in linear layers, as described in their respective papers.
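A sketch of the per-group, zero-offset (asymmetric) INT4 storage format these methods produce; the group size and variable names here are illustrative, not the actual TensorRT-LLM layout (real deployments typically use group sizes of 64 or 128):

```python
import numpy as np

group_size = 4
w = np.random.randn(2, 8).astype(np.float32)           # one weight row per output channel

# Split each row into groups and derive one (scale, zero) pair per group.
groups = w.reshape(w.shape[0], -1, group_size)          # [rows, n_groups, group_size]
w_min = groups.min(axis=-1, keepdims=True)
w_max = groups.max(axis=-1, keepdims=True)
scale = (w_max - w_min) / 15.0                          # 16 levels for unsigned INT4
zero = np.round(-w_min / scale)                         # per-group zero offset

q = np.clip(np.round(groups / scale) + zero, 0, 15).astype(np.uint8)  # INT4 codes (stored packed)
deq = (q.astype(np.float32) - zero) * scale             # dequantized at inference time

print("max abs error:", np.abs(deq - groups).max())
```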
FP8 (Hopper)
8-bit Floating-Point Precision: This release of TensorRT-LLM includes an FP8 implementation for some GPT models on Hopper GPUs, offering a middle ground between FP16 and INT8 quantization.
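A minimal sketch of per-tensor FP8 (E4M3) quantization using PyTorch's float8 dtype. This is illustrative only: it assumes a PyTorch build that provides torch.float8_e4m3fn, and it is not the TensorRT-LLM kernel path.

```python
import torch

x = torch.randn(4, 8) * 3.0

# FP8 E4M3 has a sign bit, 4 exponent bits and 3 mantissa bits; its largest finite value is 448.
fp8_max = torch.finfo(torch.float8_e4m3fn).max           # 448.0

# Per-tensor scale so the largest activation maps near the FP8 maximum.
scale = x.abs().max() / fp8_max
x_fp8 = (x / scale).to(torch.float8_e4m3fn)              # quantize (cast) to FP8
x_deq = x_fp8.to(torch.float32) * scale                  # dequantize back for comparison

print("max abs error:", (x_deq - x).abs().max().item())
```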
Support Matrix
Broad Model Support: The current release supports a wide range of models across these precision formats, including FP32, FP16, BF16, FP8, and the INT8/INT4 quantized variants described above.
Technical Detail: The QuantMode Flags
Control Flags: The quantization method is controlled by a set of QuantMode flags that record, for example, whether weights and/or activations are quantized and at which granularity, allowing precise configuration of a model's numerical behavior.
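Conceptually, the flags behave like a bit mask. The sketch below is a simplified stand-in written for illustration; it is not the actual tensorrt_llm QuantMode class, and the flag and function names are hypothetical.

```python
from enum import IntFlag, auto

class QuantModeSketch(IntFlag):
    """Illustrative stand-in for the idea behind QuantMode flags (names are hypothetical)."""
    INT4_WEIGHTS = auto()        # weights stored as INT4
    INT8_WEIGHTS = auto()        # weights stored as INT8
    ACTIVATIONS = auto()         # activations quantized too (e.g. SmoothQuant)
    PER_TOKEN = auto()           # per-token activation scaling
    PER_CHANNEL = auto()         # per-channel weight scaling
    FP8_QDQ = auto()             # FP8 quantize/dequantize

# Example: W4A16 weight-only quantization.
weight_only_int4 = QuantModeSketch.INT4_WEIGHTS

# Example: SmoothQuant (W8A8) with per-token and per-channel scaling.
smooth_quant = (QuantModeSketch.INT8_WEIGHTS
                | QuantModeSketch.ACTIVATIONS
                | QuantModeSketch.PER_TOKEN
                | QuantModeSketch.PER_CHANNEL)

def is_weight_only(mode: QuantModeSketch) -> bool:
    """Weights are quantized but activations are not."""
    has_w = bool(mode & (QuantModeSketch.INT4_WEIGHTS | QuantModeSketch.INT8_WEIGHTS))
    return has_w and not (mode & QuantModeSketch.ACTIVATIONS)

print(is_weight_only(weight_only_int4))   # True
print(is_weight_only(smooth_quant))       # False
```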
In summary, TensorRT-LLM provides a comprehensive suite of numerical precision options, enabling users to tailor the precision of their models according to their specific performance and accuracy requirements.