convert_checkpoint examples

Here are some example commands for converting checkpoints and building engines under different configurations. They all use the llama-2-7b-chat-hf directory as the source Hugging Face checkpoint; the same flags apply when pointing --model_dir at a LLaMA 70B checkpoint. After the simplest single-GPU setup, the examples vary the degree of tensor and pipeline parallelism and the quantization scheme.

The simplest way to compile the LLaMA 7B model is on a single GPU, with no tensor or pipeline parallelism. This keeps the setup minimal and avoids the extra moving parts of a distributed computing environment.

Here's how you could do it with minimal configuration:

Convert the LLaMA 7B model to TensorRT-LLM checkpoint format using a single GPU:

python3 convert_checkpoint.py --model_dir llama-2-7b-chat-hf \
                              --output_dir ./llama-2-7b-chat-hf-output \
                              --dtype float16
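
Before building, it is worth confirming that the conversion produced what you expect. In the TensorRT-LLM checkpoint format the output directory normally contains a config.json plus one rank*.safetensors file per rank (a single file here, since tensor parallelism is not used); the exact file names below are indicative rather than guaranteed:

ls ./llama-2-7b-chat-hf-output
# config.json  rank0.safetensors

# Pretty-print the generated checkpoint configuration
python3 -c "import json; print(json.dumps(json.load(open('./llama-2-7b-chat-hf-output/config.json')), indent=2))"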

Build the TensorRT engine(s) for the LLaMA 7B model using a single GPU:

trtllm-build --checkpoint_dir ./llama-2-7b-chat-hf-output \
             --output_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16

Neither command sets --tp_size, so it is left at its default value of 1: the model is compiled for a single GPU, avoiding the additional complexity of managing multiple GPUs or of tensor and pipeline parallelism.

It assumes the llama-2-7b-chat-hf directory contains the Hugging Face checkpoint for the LLaMA 2 7B chat model.

While this approach keeps compilation simple, it cannot exploit a multi-GPU setup, which matters for very large models such as LLaMA 70B. It is, however, a good starting point for initial testing, or for environments where only a single GPU is available.
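
As a quick sanity check before moving on to the summarization script, the generic examples/run.py script can be pointed at the engine directory built above. The prompt and output length are arbitrary, and the relative path assumes you are running from the examples/llama directory, as with the other commands on this page:

python3 ../run.py --engine_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
                  --tokenizer_dir llama-2-7b-chat-hf \
                  --max_output_len 64 \
                  --input_text "Explain what TensorRT-LLM does in one paragraph."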

Test the engine with summarize_long.py

The summarize_long.py script runs a summarization task against both the TensorRT-LLM engine (--test_trt_llm) and the original Hugging Face model (--test_hf), so their outputs can be compared:

python3 ../summarize_long.py --test_trt_llm \
                       --hf_model_dir ./llama-models/llama-7b-hf \
                       --data_type fp16 \
                       --engine_dir ./tmp/llama/7B-chat/trt_engines/fp16/1-gpu \
                       --test_hf \
                       --tokenizer_dir ./llama-models/llama-7b-hf \
                       --output_dir ./results/llama/7B-chat \
                       --eval_task summarize

More advanced techniques for fun

Build LLaMA 70B with INT8 Quantization and 8-way Tensor Parallelism

This command applies INT8 weight-only quantization, which roughly halves the weight memory footprint and can improve throughput on hardware with INT8 support, and uses 8-way tensor parallelism to distribute the model across 8 GPUs.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_int8 \
                            --dtype float16 \
                            --tp_size 8 \
                            --use_weight_only \
                            --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_int8 \
            --output_dir ./tmp/llama/70B/trt_engines/int8/8-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
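
Because this engine is sharded across 8 GPUs, inference has to be launched with one MPI rank per GPU. A minimal sketch, assuming mpirun is available (as it is in the TensorRT-LLM container) and reusing the same run.py script as above:

mpirun -n 8 --allow-run-as-root \
    python3 ../run.py --engine_dir ./tmp/llama/70B/trt_engines/int8/8-gpu/ \
                      --tokenizer_dir ./llama-2-7b-chat-hf/ \
                      --max_output_len 64 \
                      --input_text "Summarize the benefits of INT8 weight-only quantization."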

Build LLaMA 70B with FP8 Precision and 4-way Tensor Parallelism

This setup converts the model with FP8 quantization enabled, trading a small amount of accuracy for lower memory use and higher throughput, and uses 4-way tensor parallelism. Note that FP8 execution requires a GPU with hardware FP8 support, such as Hopper-class devices (H100).

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp4_fp8 \
                            --dtype float16 \
                            --tp_size 4 \
                            --enable_fp8

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp4_fp8 \
            --output_dir ./tmp/llama/70B/trt_engines/fp8/4-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16

Build LLaMA 70B with 16-way Tensor Parallelism for Maximum GPU Utilization

This command is designed for setups with a high number of GPUs, utilizing 16-way tensor parallelism to maximize GPU utilization across a large cluster.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp16 \
                            --dtype float16 \
                            --tp_size 16

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp16 \
            --output_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
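
The introduction above also mentions pipeline parallelism, which none of these commands actually use. If you wanted to split the same 16 GPUs between the two schemes instead of relying on pure tensor parallelism, a hedged variant would combine --tp_size with convert_checkpoint.py's --pp_size argument; the directory names here are illustrative:

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_pp2 \
                            --dtype float16 \
                            --tp_size 8 \
                            --pp_size 2

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_pp2 \
            --output_dir ./tmp/llama/70B/trt_engines/fp16/tp8-pp2/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16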

Using SmoothQuant for Enhanced Model Precision

This setup applies SmoothQuant with alpha = 0.5 (the value passed to --smoothquant), which shifts quantization difficulty from activations to weights so that both can be quantized to INT8 with little accuracy loss. It uses 8-way tensor parallelism.

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_sq \
                            --dtype float16 \
                            --tp_size 8 \
                            --smoothquant 0.5

trtllm-build --checkpoint_dir ./tllm_checkpoint_70B_tp8_sq \
            --output_dir ./tmp/llama/70B/trt_engines/sq/8-gpu/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16
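
SmoothQuant is often combined with per-channel weight scales and per-token activation scales, which usually recover more accuracy than static per-tensor scaling. A hedged variant of the conversion command above, assuming convert_checkpoint.py exposes the --per_channel and --per_token flags:

python convert_checkpoint.py --model_dir ./llama-2-7b-chat-hf/ \
                            --output_dir ./tllm_checkpoint_70B_tp8_sq_pc_pt \
                            --dtype float16 \
                            --tp_size 8 \
                            --smoothquant 0.5 \
                            --per_channel \
                            --per_token

The build step is unchanged apart from pointing --checkpoint_dir at the new output directory.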

Each command specifies a unique combination of precision, quantization, and parallelism settings to suit different hardware capabilities and performance goals.
