Converting Checkpoints

The convert_checkpoint.py script is a key component of the TensorRT-LLM workflow: it converts a pre-trained language model, such as LLaMA, into an optimised checkpoint format suitable for inference on GPUs using TensorRT.

The script takes a pre-trained model checkpoint and converts it into a format that can be loaded and optimised by TensorRT-LLM.

Here's a detailed analysis of how the script works and its role in the TensorRT-LLM process:

Command-line Arguments

  • The script accepts various command-line arguments to configure the conversion process.

  • Key arguments include --model_dir (path to the pre-trained model directory), --output_dir (path to save the converted checkpoint), and --dtype (data type for the converted model, e.g., float16).

  • Other arguments control parallelism, quantization, and model-specific settings; a sketch of the key argument definitions follows below.
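
A minimal sketch of how these arguments might be declared with argparse is shown below. Only the flags discussed on this page are included, and the defaults are assumptions; run python convert_checkpoint.py --help against your installed version for the authoritative list.

```python
# Illustrative sketch of convert_checkpoint.py's key command-line arguments.
# Only the flags discussed above are shown; defaults here are assumptions.
import argparse


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Convert a LLaMA checkpoint to the TensorRT-LLM format")
    parser.add_argument("--model_dir", type=str, default=None,
                        help="Path to the Hugging Face model directory")
    parser.add_argument("--meta_ckpt_dir", type=str, default=None,
                        help="Path to a Meta (original LLaMA) checkpoint directory")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Where to write the converted TensorRT-LLM checkpoint")
    parser.add_argument("--dtype", type=str, default="float16",
                        choices=["float32", "float16", "bfloat16"],
                        help="Data type of the converted weights")
    parser.add_argument("--tp_size", type=int, default=1, help="Tensor parallelism size")
    parser.add_argument("--pp_size", type=int, default=1, help="Pipeline parallelism size")
    parser.add_argument("--use_weight_only", action="store_true",
                        help="Enable weight-only quantization")
    parser.add_argument("--weight_only_precision", type=str, default="int8",
                        choices=["int8", "int4"],
                        help="Precision used for weight-only quantization")
    parser.add_argument("--smoothquant", type=float, default=None,
                        help="SmoothQuant alpha; enables SmoothQuant when set")
    parser.add_argument("--per_channel", action="store_true",
                        help="Use per-channel scaling factors")
    parser.add_argument("--per_token", action="store_true",
                        help="Use per-token scaling factors")
    parser.add_argument("--workers", type=int, default=1,
                        help="Number of worker threads for multi-rank conversion")
    return parser.parse_args()
```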

Model Loading

  • The script supports loading models from different sources, such as a Hugging Face model directory (--model_dir) or a meta checkpoint directory (--meta_ckpt_dir).

  • It uses the Hugging Face Transformers library to load the pre-trained model and its configuration.

  • The preload_model function is responsible for loading the model based on the specified directory and device (CPU or GPU), as sketched below.
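
The sketch below shows, in simplified form, what preload_model does: a single Hugging Face load whose weights are then converted rank by rank. The function body is an assumption based on the behaviour described above; the real helper also handles Meta checkpoints and other placement options.

```python
# Illustrative, simplified analogue of preload_model: load the pre-trained
# model once with Hugging Face Transformers so its weights can be converted.
from transformers import AutoConfig, AutoModelForCausalLM


def preload_model(model_dir: str, load_model_on_cpu: bool = True):
    config = AutoConfig.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        config=config,
        torch_dtype="auto",                                 # keep the checkpoint's native dtype
        device_map=None if load_model_on_cpu else "auto",   # CPU by default, GPUs if requested
    )
    return model
```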

Model Conversion

  • The script converts the pre-trained model into a format compatible with TensorRT-LLM.

  • It creates an instance of the LLaMAForCausalLM class, which represents the LLaMA model architecture in TensorRT-LLM.

  • The conversion process involves initializing the model with the specified data type, mapping (tensor parallelism and pipeline parallelism), and quantization settings.

  • The from_hugging_face method of LLaMAForCausalLM is used to convert the Hugging Face model to TensorRT-LLM format (see the sketch below).
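
A hedged sketch of this conversion step is shown below. The exact keyword arguments of from_hugging_face differ between TensorRT-LLM releases, so treat the call as illustrative rather than definitive.

```python
# Illustrative: convert a Hugging Face LLaMA checkpoint into a TensorRT-LLM
# model object for a single rank. Exact signatures vary across releases.
from tensorrt_llm import Mapping
from tensorrt_llm.models import LLaMAForCausalLM

tp_size, pp_size = 2, 1
mapping = Mapping(world_size=tp_size * pp_size, rank=0,
                  tp_size=tp_size, pp_size=pp_size)

trtllm_llama = LLaMAForCausalLM.from_hugging_face(
    "meta-llama/Llama-2-7b-hf",   # or a local --model_dir path
    dtype="float16",
    mapping=mapping,
)
```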

Quantization

  • The script supports various quantization options to reduce the model size and improve inference performance.

  • Quantization settings are determined based on the command-line arguments, such as --use_weight_only, --weight_only_precision, --smoothquant, --per_channel, and --per_token.

  • The args_to_quantization function maps the command-line arguments to the corresponding quantization configuration (QuantConfig).

  • Quantization algorithms like QuantAlgo.W8A16, QuantAlgo.W4A16, and QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN are used based on the specified settings, as sketched below.
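
The sketch below shows how these flags might map onto a QuantConfig, in the spirit of args_to_quantization. The import paths and field names are assumptions and vary between TensorRT-LLM releases.

```python
# Illustrative mapping from command-line flags to a QuantConfig, in the spirit
# of args_to_quantization. Import paths and field names may differ by release.
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo


def args_to_quant_config(args) -> QuantConfig:
    quant_config = QuantConfig()
    if args.use_weight_only and args.weight_only_precision == "int8":
        quant_config.quant_algo = QuantAlgo.W8A16
    elif args.use_weight_only and args.weight_only_precision == "int4":
        quant_config.quant_algo = QuantAlgo.W4A16
    elif args.smoothquant is not None:
        # SmoothQuant: INT8 weights and activations with per-channel / per-token scales
        quant_config.quant_algo = QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN
        quant_config.smoothquant_val = args.smoothquant
    return quant_config
```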

Parallelism

  • The script supports tensor parallelism and pipeline parallelism to distribute the model across multiple GPUs.

  • The --tp_size and --pp_size arguments control the parallelism settings.

  • The Mapping class is used to define the mapping of the model across GPUs based on the parallelism settings (see the sketch below).
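
A short sketch of how --tp_size and --pp_size translate into per-rank Mapping objects follows; the constructor arguments follow the public tensorrt_llm.Mapping API, but verify them against your installed version.

```python
# Illustrative: --tp_size and --pp_size define a world of tp_size * pp_size
# ranks; each rank gets a Mapping describing its slice of the model.
from tensorrt_llm import Mapping

tp_size, pp_size = 4, 2
world_size = tp_size * pp_size

mappings = [
    Mapping(world_size=world_size, rank=rank, tp_size=tp_size, pp_size=pp_size)
    for rank in range(world_size)
]
# Each Mapping records the tensor-parallel group and pipeline stage its rank
# belongs to, which drives how the weights are sharded during conversion.
```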

Saving the Converted Checkpoint

  • After the model conversion and quantization, the script saves the converted checkpoint to the specified output directory (--output_dir).

  • The save_checkpoint method of LLaMAForCausalLM is used to save the converted model weights and configuration, as sketched below.
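
A minimal sketch of the save step, continuing the single-rank conversion example above (the save_config keyword is an assumption; check the installed version's API):

```python
# Illustrative: convert, then persist the weights and configuration so the
# engine build step (trtllm-build) can consume them later.
from tensorrt_llm.models import LLaMAForCausalLM

model = LLaMAForCausalLM.from_hugging_face("meta-llama/Llama-2-7b-hf", dtype="float16")
model.save_checkpoint("./tllm_checkpoint_1gpu_fp16", save_config=True)
# The output directory then typically holds a config.json plus one weight
# file per rank.
```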

Multi-threading

  • The script supports multi-threaded execution to speed up the conversion process when using multiple GPUs.

  • The execute function is used to distribute the conversion tasks across multiple threads based on the specified number of workers (--workers); a simplified analogue is sketched below.
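
A simplified, standard-library analogue of the execute helper is sketched below; it is illustrative only and omits the real per-rank conversion logic.

```python
# Simplified analogue of the script's execute() helper: convert each rank's
# shard in parallel using a pool of --workers threads. Illustrative only.
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_rank(rank: int) -> None:
    # Placeholder for the real per-rank work: build the Mapping for this rank,
    # slice and quantize the weights, and write the rank's shard to --output_dir.
    ...


def execute(workers: int, world_size: int) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(convert_rank, rank) for rank in range(world_size)]
        for future in as_completed(futures):
            future.result()   # re-raise any conversion error


execute(workers=2, world_size=4)
```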

Key considerations when using this script

  1. Ensure that the pre-trained model is compatible with the LLaMA architecture and can be loaded using the Hugging Face Transformers library.

  2. Choose the appropriate data type (--dtype) based on the desired precision and performance trade-off. Float16 (FP16) is commonly used for faster inference with minimal accuracy loss.

  3. Consider the available GPU memory and select the appropriate parallelism settings (--tp_size and --pp_size) to distribute the model across multiple GPUs if necessary.

  4. Experiment with different quantization settings to achieve the desired balance between model size, inference speed, and accuracy. Weight-only quantization (--use_weight_only) and SmoothQuant (--smoothquant) are popular options.

  5. Ensure that the output directory (--output_dir) has sufficient space to store the converted checkpoint.

  6. If using multi-threading (--workers), ensure that the system has enough resources to handle the parallel execution.
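
After conversion, it is worth sanity-checking the written checkpoint before building an engine. The sketch below assumes the checkpoint directory contains a config.json alongside per-rank safetensors weight files, which is the layout used by recent TensorRT-LLM releases; adjust the path to your own --output_dir.

```python
# Illustrative sanity check of a converted checkpoint directory before the
# engine build step. Assumes a config.json plus per-rank .safetensors files.
import json
from pathlib import Path

ckpt_dir = Path("./tllm_checkpoint_2gpu_fp16")   # hypothetical --output_dir value

config = json.loads((ckpt_dir / "config.json").read_text())
print("dtype:     ", config.get("dtype"))
print("tp_size:   ", config.get("mapping", {}).get("tp_size"))
print("pp_size:   ", config.get("mapping", {}).get("pp_size"))
print("quant_algo:", config.get("quantization", {}).get("quant_algo"))

weight_files = sorted(ckpt_dir.glob("rank*.safetensors"))
print(f"{len(weight_files)} rank weight file(s):", [f.name for f in weight_files])
```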

Overall, the convert_checkpoint.py script plays a vital role in the TensorRT-LLM process by converting pre-trained language models into a format optimised for inference on GPUs using TensorRT.

It provides flexibility in model loading, quantization, and parallelism, and it saves the converted checkpoint for the subsequent engine build and deployment.
