Proposed checkpoint config file for LLama3

Here is the updated configuration file for convert_checkpoint.py, based on the Llama 3 8B model:

model:
  model_dir: ./llama-3b-hf
  output_dir: ../llama-3b-hf-output
  dtype: bfloat16  # Choices: float32, bfloat16, float16
  # Suggestion:
  # - Use bfloat16 for a balance between performance and accuracy, as used in the Llama 3 8B model

checkpoint:
  tp_size: 1  # Tensor parallelism size
  pp_size: 1  # Pipeline parallelism size
  # Suggestions:
  # - Increase tp_size and pp_size to shard the model across multiple GPUs for inference
  # - Keep tp_size and pp_size as 1 for single-GPU inference

  vocab_size: 128256
  # Suggestion:
  # - Update vocab_size to match the Llama 3 8B model's vocabulary size

  n_positions: 8192
  # Suggestion:
  # - Update n_positions to match the Llama 3 8B model's max position embeddings

  n_layer: 32
  # Suggestions:
  # - Adjust n_layer based on the desired model depth
  # - Keep n_layer as 32 to match the Llama 3 8B model's configuration

  n_head: 32
  # Suggestions:
  # - Adjust n_head based on the desired number of attention heads
  # - Keep n_head as 32 to match the Llama 3 8B model's configuration

  n_embd: 4096
  # Suggestions:
  # - Adjust n_embd based on the desired hidden size
  # - Keep n_embd as 4096 to match the Llama 3 8B model's configuration

  inter_size: 14336
  # Suggestion:
  # - Update inter_size to match the Llama 3 8B model's intermediate size

  # Additional checkpoint arguments
  meta_ckpt_dir: null  # ./path/to/meta/checkpoint
  n_kv_head: 8
  # Suggestion:
  # - Update n_kv_head to match the Llama 3 8B model's number of key-value heads

  rms_norm_eps: 1e-5
  # Suggestion:
  # - Update rms_norm_eps to match the Llama 3 8B model's configuration

  use_weight_only: false
  disable_weight_only_quant_plugin: false
  weight_only_precision: int8  # Choices: int8, int4, int4_gptq
  smoothquant: null  # 0.5
  per_channel: false
  per_token: false
  int8_kv_cache: false
  ammo_quant_ckpt_path: null  # ./path/to/ammo/quant/checkpoint
  per_group: false
  load_by_shard: false
  hidden_act: silu
  rope_theta: 500000.0
  # Suggestion:
  # - Renamed rotary_base to rope_theta and set it to 500000.0 to match the Llama 3 8B model's configuration

  group_size: 128
  dataset_cache_dir: null  # ./path/to/dataset/cache
  load_model_on_cpu: false
  use_parallel_embedding: false
  embedding_sharding_dim: 0  # Choices: 0, 1
  use_embedding_sharing: false
  workers: 1
  moe_num_experts: 0
  moe_top_k: 0
  moe_tp_mode: 0
  moe_renorm_mode: 1
  save_config_only: false

  # Additional configurations to match Llama 3 8B
  bos_token_id: 128000
  eos_token_id: 128001
  tie_word_embeddings: false
  use_cache: true
  torch_dtype: bfloat16
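
Before walking through the individual changes, it is worth cross-checking these values against the config.json that ships with the Hugging Face checkpoint. The snippet below is a minimal sketch, not part of TensorRT-LLM: it assumes the model_dir from the YAML above contains a standard Hugging Face config.json, and the right-hand names are the Hugging Face field names that correspond to the checkpoint keys used here.

import json
from pathlib import Path

# Mapping from the checkpoint keys used above to the field names found in a
# standard Hugging Face config.json for Llama-style models.
KEY_MAP = {
    "vocab_size": "vocab_size",
    "n_positions": "max_position_embeddings",
    "n_layer": "num_hidden_layers",
    "n_head": "num_attention_heads",
    "n_kv_head": "num_key_value_heads",
    "n_embd": "hidden_size",
    "inter_size": "intermediate_size",
    "rms_norm_eps": "rms_norm_eps",
    "rope_theta": "rope_theta",
}

# Values proposed in the YAML configuration above.
proposed = {
    "vocab_size": 128256,
    "n_positions": 8192,
    "n_layer": 32,
    "n_head": 32,
    "n_kv_head": 8,
    "n_embd": 4096,
    "inter_size": 14336,
    "rms_norm_eps": 1e-5,
    "rope_theta": 500000.0,
}

hf_config = json.loads(Path("./llama-3b-hf/config.json").read_text())

for ours, theirs in KEY_MAP.items():
    actual = hf_config.get(theirs)
    status = "OK" if actual == proposed[ours] else "MISMATCH"
    print(f"{status:8} {ours}: proposed={proposed[ours]}  config.json={actual}")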

I have made the following changes and additions to align the configuration with the Llama 3 8B model:

  1. Updated vocab_size to 128256.

  2. Updated n_positions to 8192.

  3. Updated inter_size to 14336.

  4. Updated n_kv_head to 8.

  5. Updated rms_norm_eps to 1e-5.

  6. Replaced rotary_base with rope_theta and set its value to 500000.0.

  7. Added bos_token_id and set it to 128000.

  8. Added eos_token_id and set it to 128001.

  9. Added tie_word_embeddings and set it to false.

  10. Added use_cache and set it to true.

  11. Added torch_dtype and set it to bfloat16.

The convert_checkpoint.py script already exposes all of the arguments used here, so this updated configuration file can be used with it directly to convert the Llama 3 8B checkpoint.
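
As with the run_convert_checkpoint.py script covered earlier, a small wrapper can read this YAML and assemble the convert_checkpoint.py command line. The sketch below is illustrative only: the YAML filename is hypothetical, only the core flags are passed, and the exact flag names accepted by convert_checkpoint.py vary between TensorRT-LLM versions, so confirm them with python convert_checkpoint.py --help.

import subprocess
import yaml  # requires PyYAML

# Hypothetical filename for the configuration file shown above.
with open("llama3_checkpoint_config.yaml") as f:
    cfg = yaml.safe_load(f)

model = cfg["model"]
ckpt = cfg["checkpoint"]

# Only the core flags are assembled here; verify the flag names for your
# TensorRT-LLM version before relying on this wrapper.
cmd = [
    "python", "convert_checkpoint.py",
    "--model_dir", model["model_dir"],
    "--output_dir", model["output_dir"],
    "--dtype", model["dtype"],
    "--tp_size", str(ckpt["tp_size"]),
    "--pp_size", str(ckpt["pp_size"]),
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)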
