
generate_int8 function

The generate_int8 function quantizes neural network weights from floating-point precision (e.g. FP32 or FP16) to INT8 and computes the scaling factors needed for the quantization process and for subsequent inference.

The function is particularly tailored for General Matrix Multiply (GEMM) operations, which are core to deep learning computations, especially in transformer-based models. Here is a detailed explanation of its components and functionality:

Purpose and Process

The function serves two main purposes:

  1. Quantizing Weights: It converts model weights to INT8 format, which reduces memory footprint and can speed up inference on compatible hardware. The function supports either per-tensor or per-column (per-channel) quantization.

  2. Computing Scaling Factors: It calculates several scaling factors needed to adjust the quantized weights and activations during inference, ensuring that the quantization process minimizes loss of accuracy.

Parameters

  • weights: The original floating-point weights of the model or a layer that need to be quantized.

  • act_range: A dictionary containing the ranges (maximum absolute values) of activations ("x"), weights ("w"), and possibly outputs ("y"), used to determine scaling factors.

  • is_qkv: A boolean indicating if the weights belong to a QKV (Query, Key, Value) projection layer, common in transformer models. QKV layers have specific quantization considerations.

  • multi_query_mode: A boolean that, when combined with is_qkv, indicates a special mode of handling multiple queries simultaneously, affecting how quantization is applied.
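As a rough illustration of how these parameters fit together, the sketch below fabricates a weight matrix and an activation-range dictionary and passes them to generate_int8, assuming the four parameters listed above map directly onto the call signature. The tensor shapes, the scalar ranges, and the idea that act_range holds simple max-absolute values are illustrative assumptions; in the real conversion scripts these values come from the checkpoint and from a calibration pass.

```python
import torch

# Illustrative stand-ins: a single linear-layer weight matrix and calibration
# statistics recording the maximum absolute values seen for activations ("x"),
# weights ("w") and outputs ("y").
weights = torch.randn(4096, 4096, dtype=torch.float16)
act_range = {
    "x": torch.tensor(6.2),
    "w": weights.abs().max().float(),
    "y": torch.tensor(8.9),
}

# Standard layer: per-tensor and per-column scales for one matrix.
int8_params = generate_int8(weights, act_range, is_qkv=False, multi_query_mode=False)

# Fused QKV projection: the combined matrix is treated as Q, K and V sub-matrices,
# each with its own scaling factors.
# qkv_params = generate_int8(qkv_weights, qkv_act_range, is_qkv=True)
```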

Quantization and Scaling Logic

  • The function initially detaches and moves the weights to CPU as NumPy arrays for processing.

  • It differentiates between standard layers and QKV projection layers, applying a specific quantization strategy for each. For QKV layers, it treats the combined QKV matrix as three separate matrices, each with potentially different scaling factors.

  • For non-QKV or standard QKV layers, it computes a global (per-tensor) or local (per-column) scaling factor based on the provided activation range. For multi-query QKV layers, it computes separate scaling factors for Q, K, and V projections based on their respective activation ranges.

  • Scaling factors are computed for both directions: floating-point to INT8 (scale_w_orig_quant) and INT8 back to floating-point (scale_w_quant_orig), along with specific scaling factors needed for GEMM operations using either CUTLASS or CUBLAS APIs. CUTLASS requires separate scaling for activations and weights, while CUBLAS uses a combined scaling factor since it does not support per-row scaling.

  • The function accounts for different scaling requirements when using tensor or pipeline parallelism and adjusts scaling factors accordingly to ensure consistent model behavior across different numbers of GPUs.
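At its core, the scale computation is symmetric INT8 quantization: the INT8 range (±127) divided by the largest observed absolute value, with a matching inverse scale to map back. The snippet below is a simplified NumPy sketch of the per-tensor versus per-column logic and of the forward (scale_w_orig_quant) and inverse (scale_w_quant_orig) factors described above; it is not the actual TensorRT-LLM implementation and omits the QKV splitting, the CUTLASS/CUBLAS variants, and the parallelism adjustments.

```python
import numpy as np

def int8_weight_scales_sketch(weights: np.ndarray):
    """Simplified illustration of per-tensor and per-column INT8 scaling."""
    # Per-tensor: one scale for the whole matrix.
    scale_w_orig_quant_t = 127.0 / np.abs(weights).max()        # float -> INT8
    scale_w_quant_orig_t = 1.0 / scale_w_orig_quant_t           # INT8 -> float

    # Per-column (per output channel): one scale per column.
    scale_w_orig_quant_c = 127.0 / np.abs(weights).max(axis=0)
    scale_w_quant_orig_c = 1.0 / scale_w_orig_quant_c

    # Quantize with either granularity, clipping to the signed INT8 range.
    weight_int8_t = np.clip(np.round(weights * scale_w_orig_quant_t), -127, 127).astype(np.int8)
    weight_int8_c = np.clip(np.round(weights * scale_w_orig_quant_c), -127, 127).astype(np.int8)

    return weight_int8_t, weight_int8_c, scale_w_quant_orig_t, scale_w_quant_orig_c
```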

Outputs

The function returns a dictionary containing:

  • Quantized weights in INT8 format ("weight.int8") for both global and column-specific quantization.

  • The computed scaling factors necessary for adjusting inputs, weights, and outputs during inference with quantized models. These factors are crucial for maintaining the accuracy of the model after quantization.
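Conceptually, the returned dictionary pairs each quantized tensor with the scales needed to use it at inference time. The text above confirms the "weight.int8" key; the other key names in this sketch are assumptions used only to illustrate the per-column variant and the scale entries described above.

```python
int8_params = generate_int8(weights, act_range)

print(int8_params["weight.int8"].dtype)   # int8, per-tensor quantized weights

# The following keys are assumed names, for illustration only:
# int8_params["weight.int8.col"]      -> per-column quantized weights
# int8_params["scale_w_quant_orig"]   -> INT8 -> float weight scale
# int8_params["scale_x_orig_quant"]   -> float -> INT8 activation scale
```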

Importance of Quantization

Quantizing model weights and activations to INT8:

  • Reduces Model Size: Lower precision weights significantly reduce the memory footprint, making deployment on edge devices more feasible.

  • Increases Inference Speed: Many modern CPUs and GPUs have specialized instructions for INT8 arithmetic, leading to faster computations compared to higher precision formats.

  • Requires Careful Scaling: To preserve model accuracy, precise scaling factors must be applied during inference. This function automates the calculation of these factors based on the ranges of weights and activations.
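To make the scaling concrete, here is a tiny worked example of symmetric INT8 quantization with illustrative numbers: the forward scale maps the observed weight range onto ±127, and the inverse scale recovers a close approximation of the original value at inference time.

```python
w = 0.42                                   # original FP32/FP16 weight value
w_max = 0.5                                # max |weight| observed for this tensor

scale_w_orig_quant = 127.0 / w_max         # 254.0 (float -> INT8)
w_int8 = round(w * scale_w_orig_quant)     # round(106.68) = 107

scale_w_quant_orig = 1.0 / scale_w_orig_quant
w_dequant = w_int8 * scale_w_quant_orig    # ~0.4213, close to the original 0.42
```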
