checkpoint configuration file

Clone the following GitHub repository into your container. It contains the configuration file and the execution script for the convert_checkpoint function:

git clone https://github.com/Continuum-Labs-HQ/tensorrt-continuum.git

Cloning the repository downloads the following files:

A checkpoint conversion configuration YAML file

A run_convert_checkpoint.py script that you execute once you are comfortable with the configuration - a sketch of what such a driver script might look like is shown below
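
The repository's actual run_convert_checkpoint.py is not reproduced here, so the snippet below is only a minimal sketch of the idea: read the YAML file and translate it into command-line arguments for TensorRT-LLM's convert_checkpoint.py. The file name checkpoint_config.yaml, the path to convert_checkpoint.py and the flag mapping are assumptions for illustration; the script shipped in the repository may differ.

# Minimal illustrative sketch, not the repository's actual run_convert_checkpoint.py.
# The file names, paths and flag mapping below are assumptions.
import subprocess
import yaml

with open("checkpoint_config.yaml") as f:   # assumed name of the YAML file shown below
    cfg = yaml.safe_load(f)

model = cfg["model"]
ckpt = cfg["checkpoint"]

# Translate the YAML into arguments for TensorRT-LLM's Llama convert_checkpoint.py
cmd = [
    "python", "convert_checkpoint.py",      # assumed to be the examples/llama script
    "--model_dir", model["model_dir"],
    "--output_dir", model["output_dir"],
    "--dtype", model["dtype"],
    "--tp_size", str(ckpt["tp_size"]),
    "--pp_size", str(ckpt["pp_size"]),
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)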

YAML Configuration File

Set the arguments to match the values in the model's Hugging Face configuration (config.json) - the two must be consistent. A sketch showing one way to cross-check them is given after the configuration file below.

model:
  model_dir: ./llama-2-7b-chat-hf
  output_dir: ../llama-2-7b-chat-hf-output
  dtype: float16  # Choices: float32, bfloat16, float16
  # Suggestions:
  # - Use float16 for better performance with minimal accuracy loss
  # - Use bfloat16 for a balance between performance and accuracy
  # - Use float32 for maximum accuracy but slower performance

checkpoint:
  tp_size: 1  # Tensor parallelism size
  pp_size: 1  # Pipeline parallelism size
  # Suggestions:
  # - Increase tp_size and pp_size to shard the model across multiple GPUs
  # - Keep tp_size and pp_size as 1 for single-GPU inference
  vocab_size: 32000
  # Suggestions:
  # - vocab_size must match the tokenizer of the model being converted (32000 for Llama 2)
  n_positions: 2048
  # Suggestions:
  # - Set n_positions to the model's maximum context length (Llama 2 supports up to 4096)
  # - A smaller value reduces memory use but caps the supported sequence length
  n_layer: 32
  # Suggestions:
  # - n_layer must match the depth of the pretrained model (32 for Llama-2-7B)
  n_head: 32
  # Suggestions:
  # - n_head must match the number of attention heads of the pretrained model (32 for Llama-2-7B)
  n_embd: 4096
  # Suggestions:
  # - n_embd must match the hidden size of the pretrained model (4096 for Llama-2-7B)
  inter_size: 11008
  # Suggestions:
  # - inter_size must match the feed-forward intermediate size of the pretrained model (11008 for Llama-2-7B)
  
  # Additional checkpoint arguments
  meta_ckpt_dir: null  # ./path/to/meta/checkpoint
  n_kv_head: null  # 32
  rms_norm_eps: 1e-6
  use_weight_only: false
  disable_weight_only_quant_plugin: false
  weight_only_precision: int8  # Choices: int8, int4, int4_gptq
  smoothquant: null  # 0.5
  per_channel: false
  per_token: false
  int8_kv_cache: false
  ammo_quant_ckpt_path: null  # ./path/to/ammo/quant/checkpoint
  per_group: false
  load_by_shard: false
  hidden_act: silu
  rotary_base: 10000.0
  group_size: 128
  dataset_cache_dir: null  # ./path/to/dataset/cache
  load_model_on_cpu: false
  use_parallel_embedding: false
  embedding_sharding_dim: 0  # Choices: 0, 1
  use_embedding_sharing: false
  workers: 1
  moe_num_experts: 0
  moe_top_k: 0
  moe_tp_mode: 0
  moe_renorm_mode: 1
  save_config_only: false
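
Because the architecture fields above must line up with the model's Hugging Face config.json, a quick consistency check can catch typos before conversion. The snippet below is a hedged sketch: checkpoint_config.yaml is an assumed file name, and the field mapping reflects the standard Hugging Face Llama config.json keys.

# Illustrative consistency check between the YAML file and the Hugging Face config.json.
# checkpoint_config.yaml is an assumed file name for the configuration shown above.
import json
import yaml

with open("checkpoint_config.yaml") as f:
    ckpt = yaml.safe_load(f)["checkpoint"]

with open("./llama-2-7b-chat-hf/config.json") as f:
    hf = json.load(f)

# YAML field -> Hugging Face config.json key
mapping = {
    "vocab_size": "vocab_size",
    "n_positions": "max_position_embeddings",
    "n_layer": "num_hidden_layers",
    "n_head": "num_attention_heads",
    "n_embd": "hidden_size",
    "inter_size": "intermediate_size",
}

for yaml_key, hf_key in mapping.items():
    if ckpt[yaml_key] != hf[hf_key]:
        print(f"Mismatch: {yaml_key}={ckpt[yaml_key]} vs config.json {hf_key}={hf[hf_key]}")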