Checkpoint List - Arguments

Here is a table summarising the arguments that you can pass to the convert_checkpoint.py script, along with their default values:

| Argument | Default Value | Description |
| --- | --- | --- |
| --model_dir | None | Path to the Hugging Face model directory |
| --meta_ckpt_dir | None | Path to the Meta checkpoint directory |
| --tp_size | 1 | N-way tensor parallelism size |
| --pp_size | 1 | N-way pipeline parallelism size |
| --dtype | 'float16' | Data type ('float32', 'bfloat16', 'float16') |
| --vocab_size | 32000 | Vocabulary size |
| --n_positions | 2048 | Number of positions |
| --n_layer | 32 | Number of layers |
| --n_head | 32 | Number of attention heads |
| --n_kv_head | None | Number of key-value heads (defaults to n_head if not specified) |
| --n_embd | 4096 | Hidden size |
| --inter_size | 11008 | Intermediate size |
| --rms_norm_eps | 1e-06 | RMS normalization epsilon |
| --use_weight_only | False | Quantize weights for the various GEMMs to INT4/INT8 |
| --disable_weight_only_quant_plugin | False | Use the OOTB implementation instead of the plugin for weight-only quantization |
| --weight_only_precision | 'int8' | Precision for weight-only quantization ('int8', 'int4', 'int4_gptq') |
| --smoothquant | None | Set the α parameter for SmoothQuant quantization (float value) |
| --per_channel | False | Use a per-channel static scaling factor for the GEMM results |
| --per_token | False | Use a per-token dynamic scaling factor for activations |
| --int8_kv_cache | False | Use INT8 quantization for the KV cache |
| --ammo_quant_ckpt_path | None | Path to a quantized model checkpoint in .npz format |
| --per_group | False | Use a per-group dynamic scaling factor for weights in the INT4 range (for GPTQ/AWQ quantization) |
| --load_by_shard | False | Load the pretrained model shard-by-shard |
| --hidden_act | 'silu' | Hidden activation function |
| --rotary_base | 10000 | Rotary base value |
| --group_size | 128 | Group size used in GPTQ quantization |
| --dataset-cache-dir | None | Cache directory for loading the Hugging Face dataset |
| --load_model_on_cpu | False | Load the model on CPU |
| --use_parallel_embedding | False | Enable embedding parallelism |
| --embedding_sharding_dim | 0 | Dimension along which to shard the embedding lookup table (0: vocab dimension, 1: hidden dimension) |
| --use_embedding_sharing | False | Try to reduce the engine size by sharing the embedding lookup table between two layers |
| --output_dir | 'tllm_checkpoint' | Path to save the TensorRT-LLM checkpoint |
| --workers | 1 | Number of workers for converting the checkpoint in parallel |
| --moe_num_experts | 0 | Number of experts to use for MoE layers |
| --moe_top_k | 0 | Top-k value to use for MoE layers (defaults to 1 if --moe_num_experts is set) |
| --moe_tp_mode | MoeConfig.ParallelismMode.TENSOR_PARALLEL | Controls how experts are distributed under tensor parallelism (see layers/moe.py for accepted values) |
| --moe_renorm_mode | MoeConfig.ExpertScaleNormalizationMode.RENORM | Controls renormalization after gate logits (see layers/moe.py for accepted values) |
| --save_config_only | False | Only save the model config without reading and converting weights (for debugging) |

These arguments let you customise the behaviour of the convert_checkpoint.py script to your specific requirements; supply the desired values on the command line when you run it, as in the sketch below.
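As a minimal sketch, the two invocations below combine a handful of the arguments from the table: the first converts a local Hugging Face Llama 2 7B checkpoint to a single-GPU FP16 TensorRT-LLM checkpoint, the second adds INT8 weight-only quantization and 2-way tensor parallelism. The directory names are placeholders for illustration, not paths assumed by the script.

```bash
# Basic FP16 conversion (placeholder paths - substitute your own).
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

# INT8 weight-only quantization with 2-way tensor parallelism.
python convert_checkpoint.py \
    --model_dir ./llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_2gpu_int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 2
```

The resulting --output_dir is what you later point trtllm-build at; the following pages walk through further examples of running the script.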
