FP8 Formats for Deep Learning

This paper proposes an 8-bit floating point (FP8) binary interchange format for accelerating deep learning training and inference.

The authors introduce two FP8 encodings: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa).

They demonstrate that using FP8 can match the accuracy achieved by 16-bit training (FP16 or bfloat16) across a wide range of tasks and model architectures, without changing hyperparameters.
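
The two encodings trade precision against dynamic range in opposite directions: E4M3 keeps one extra mantissa bit but tops out at 448, while E5M2 reaches 57344 with coarser precision, which is why it suits gradients. As a minimal sketch (assuming PyTorch 2.1 or newer, which exposes these formats as torch.float8_e4m3fn and torch.float8_e5m2; the "fn" suffix marks the finite-only E4M3 variant), the snippet below prints the numeric limits of both formats:

```python
# Inspect the two FP8 encodings from the paper via PyTorch's float8 dtypes
# (available in PyTorch >= 2.1). E4M3 is finite-only ("fn"): it gives up
# infinities in exchange for extra dynamic range.
import torch

for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max}, smallest normal={info.smallest_normal}, eps={info.eps}")

# Expected values, which follow directly from the bit layouts:
#   E4M3: max=448.0,   smallest normal=0.015625,  eps=0.125
#   E5M2: max=57344.0, smallest normal=~6.1e-05,  eps=0.25
```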

Key points

  1. FP8 is a natural progression from 16-bit formats, reducing compute and memory bandwidth requirements for training and inference.

  2. E4M3 is used for weights and activations, while E5M2 is used for gradients. E4M3 deviates from IEEE-754 conventions to extend its dynamic range: it drops infinities and keeps only a single mantissa bit pattern for NaN, freeing those encodings for ordinary values and pushing its maximum magnitude to 448. E5M2 follows IEEE-754 conventions and can be viewed as a truncated IEEE half-precision format.

  3. Scaling factors are used to move values into FP8's representable range. Per-tensor scaling factors are required for some networks because FP8's dynamic range is too narrow to cover the union of all tensors' important values (a minimal scaling sketch follows this list).

  4. FP8 training is evaluated on a variety of tasks: image classification (CNNs and Transformers), language translation (RNNs and Transformers), and language modeling (Transformers). Results show that FP8 training matches 16-bit baselines without changing hyperparameters, even for very large models (e.g., 175B parameters).

  5. FP8 simplifies inference deployment compared to int8: a model trained in FP8 can be served in FP8 directly, without the post-training quantization (PTQ) or quantization-aware training (QAT) steps that int8 typically requires. For models trained in 16-bit, FP8 PTQ also preserves accuracy better than int8 PTQ.
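
To illustrate point 3, here is a rough per-tensor scaling sketch (again assuming PyTorch 2.1 or newer; this is not the paper's or TensorRT-LLM's exact recipe, just the basic idea): the tensor's absolute maximum is mapped near the top of the E4M3 range before casting, and the scale is divided back out afterwards.

```python
# A minimal sketch of per-tensor scaling into E4M3 (illustrative only, not
# the paper's or TensorRT-LLM's exact recipe): scale so the tensor's
# absolute maximum maps to the FP8 maximum, cast, then undo the scale.
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def fp8_quant_dequant(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a tensor to E4M3 with a per-tensor scale, then dequantize."""
    amax = x.abs().max().clamp(min=1e-12)          # per-tensor absolute max
    scale = FP8_E4M3_MAX / amax                    # maps amax -> FP8 max
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)    # values now fit the FP8 range
    x_deq = x_fp8.to(torch.float32) / scale        # dequantize for comparison
    return x_deq, scale

x = torch.randn(1024) * 1e-3                       # small-magnitude tensor
x_deq, scale = fp8_quant_dequant(x)
rel_err = ((x - x_deq).abs() / x.abs().clamp(min=1e-12)).mean()
print(f"scale={scale.item():.1f}, mean relative error={rel_err.item():.4f}")
```

Without the scale, a tensor whose values sit around 1e-3 would land at the very bottom of E4M3's range (its smallest subnormal is 2^-9, roughly 0.002), so most elements would be flushed to zero or lose nearly all of their mantissa precision.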

Ramifications

  1. Faster and more efficient training and inference: FP8 reduces compute and memory requirements, enabling faster processing and lower power consumption.

  2. Easier deployment: Using the same datatype (FP8) for both training and inference simplifies the deployment process compared to int8 inference.

  3. Large model training: FP8 enables training very large models (e.g., 175B parameters) with reduced resources, making such models more accessible.

  4. Hardware support: the proposed formats give hardware vendors a common FP8 interchange target; NVIDIA's Hopper-generation Tensor Cores already implement E4M3 and E5M2 natively, and future AI accelerators can follow the same specification.

In summary, this paper presents a compelling case for using FP8 as a standard for deep learning, demonstrating its effectiveness across a wide range of tasks and model sizes. The proposed FP8 format has the potential to significantly accelerate AI research and deployment by reducing the computational and memory requirements for training and inference.

Paper: FP8 Formats for Deep Learning (arXiv.org)