FP8 Formats for Deep Learning
This paper proposes an 8-bit floating point (FP8) binary interchange format for accelerating deep learning training and inference.
The authors introduce two FP8 encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa).
They demonstrate that using FP8 can match the accuracy achieved by 16-bit training (FP16 or bfloat16) across a wide range of tasks and model architectures, without changing hyperparameters.
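To make the two encodings concrete, here is a minimal Python sketch (not from the paper; function and variable names are illustrative) that decodes an 8-bit pattern into the value it represents under the paper's E4M3 and E5M2 definitions, including E4M3's non-IEEE handling of special values.

```python
def decode_fp8(bits: int, fmt: str = "E4M3") -> float:
    """Interpret `bits` (0..255) as an FP8 value in the given format."""
    exp_bits, man_bits, bias = (4, 3, 7) if fmt == "E4M3" else (5, 2, 15)
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)

    if fmt == "E5M2" and exponent == (1 << exp_bits) - 1:
        # E5M2 keeps IEEE-754 conventions: all-ones exponent encodes Inf/NaN.
        return sign * float("inf") if mantissa == 0 else float("nan")
    if fmt == "E4M3" and exponent == (1 << exp_bits) - 1 and mantissa == (1 << man_bits) - 1:
        # E4M3 gives up Inf and all but one NaN mantissa pattern to extend range.
        return float("nan")

    if exponent == 0:
        # Subnormal: no implicit leading 1.
        return sign * (mantissa / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)


print(decode_fp8(0b0_1111_110, "E4M3"))   # 448.0, the E4M3 maximum
print(decode_fp8(0b0_11110_11, "E5M2"))   # 57344.0, the E5M2 maximum
print(decode_fp8(0b0_11111_00, "E5M2"))   # inf (IEEE-style special value)
```

Enumerating all 256 bit patterns with this function illustrates the trade-off the paper describes: E4M3 spends its bits on extra precision and a wider usable range, while E5M2 keeps the full IEEE-754 treatment of Inf and NaN.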
Key points
FP8 is a natural progression from 16-bit formats, reducing compute and memory bandwidth requirements for training and inference.
E4M3 is used for weights and activations, while E5M2 is used for gradients. E4M3 deviates from IEEE-754 conventions, giving up infinities and all but one NaN mantissa pattern to extend its dynamic range (maximum finite magnitude 448), while E5M2 follows IEEE-754 conventions for special values (maximum finite magnitude 57344).
Scaling factors are used to move tensor values into FP8's representable range. Per-tensor scaling factors are required for some networks, because FP8's dynamic range is too narrow to cover the important values of every tensor with a single scale; a sketch of this scaling flow follows the key points below.
FP8 training is evaluated on a variety of tasks: image classification (CNNs and Transformers), language translation (RNNs and Transformers), and language modeling (Transformers). Results show that FP8 training matches 16-bit baselines without changing hyperparameters, even for very large models (e.g., 175B parameters).
FP8 inference is simpler than int8 inference: models trained in FP8 can be deployed directly, with no post-training quantization (PTQ) or quantization-aware training (QAT) step. For models trained in 16-bit, FP8 PTQ also preserves accuracy better than int8 PTQ.
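As referenced above, the following is a minimal NumPy sketch of the per-tensor scaling flow, assuming a simple amax-based scale (the paper only requires some per-tensor scale; this particular recipe and all names below are illustrative, not the paper's code). The FP8 cast is simulated in software by rounding to 3 mantissa bits and saturating at the E4M3 maximum, since the real conversion is performed by hardware or library kernels.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def fake_cast_e4m3(x: np.ndarray) -> np.ndarray:
    """Round to the nearest E4M3-representable value (NaN handling omitted)."""
    m, e = np.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)         # below this, spacing is fixed (subnormal range)
    step = 2.0 ** (e - 4)         # gap between adjacent E4M3 values in this binade
    return np.clip(np.round(x / step) * step, -E4M3_MAX, E4M3_MAX)

def quantize_dequantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Scale `x` so its largest magnitude maps to E4M3_MAX, cast, then unscale."""
    amax = float(np.abs(x).max())
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return fake_cast_e4m3(x * scale) / scale, scale

# Activations with small magnitudes: without scaling they would sit at the very
# bottom of the E4M3 range (or flush to zero) and lose most of their precision.
activations = np.random.normal(0.0, 1e-4, size=(4, 8)).astype(np.float32)
deq, scale = quantize_dequantize(activations)
print(f"scale={scale:.1f}  max abs error={np.abs(activations - deq).max():.2e}")
```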
Ramifications
Faster and more efficient training and inference: FP8 reduces compute and memory requirements, enabling faster processing and lower power consumption.
Easier deployment: Using the same datatype (FP8) for both training and inference simplifies the deployment process compared to int8 inference.
Large model training: FP8 enables training very large models (e.g., 175B parameters) with reduced resources, making such models more accessible.
Potential hardware support: The proposed FP8 format could drive hardware implementations in future AI accelerators, providing native support for efficient deep learning.
In summary, this paper presents a compelling case for using FP8 as a standard for deep learning, demonstrating its effectiveness across a wide range of tasks and model sizes. The proposed FP8 format has the potential to significantly accelerate AI research and deployment by reducing the computational and memory requirements for training and inference.