
LLama3 configurations

After comparing the configuration of the LLaMA 2 7B model built using TensorRT-LLM with the configuration file of the LLaMA 3 8B model, I noticed a few differences and inconsistencies.
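
The comparison itself is easy to reproduce: both models ship a Hugging Face config.json, and diffing the two files surfaces every parameter discussed below. A minimal sketch (the local paths are placeholders for wherever your checkpoints live):

```python
import json

# Placeholder paths - point these at the config.json files of your local
# LLaMA 2 7B and LLaMA 3 8B checkpoints.
with open("llama-2-7b-hf/config.json") as f:
    llama2_cfg = json.load(f)
with open("meta-llama-3-8b/config.json") as f:
    llama3_cfg = json.load(f)

# Print every key whose value differs, or that only one config defines.
for key in sorted(set(llama2_cfg) | set(llama3_cfg)):
    v2, v3 = llama2_cfg.get(key), llama3_cfg.get(key)
    if v2 != v3:
        print(f"{key}: LLaMA 2 7B = {v2!r}, LLaMA 3 8B = {v3!r}")
```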

Here are my findings and recommendations:

Vocabulary Size

  • The LLaMA 2 7B model built with TensorRT-LLM has a vocabulary size of 32,000, while the LLaMA 3 8B model has a vocabulary size of 128,256.

  • Recommendation: Update the vocab_size parameter in the convert_checkpoint.py and build.py scripts to match the LLaMA 3 8B model's vocabulary size of 128,256.
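
The larger vocabulary is not just a token-ID change; it roughly quadruples the embedding and LM-head weights. A back-of-the-envelope estimate, assuming the 4,096 hidden size and 2-byte (fp16/bf16) weights:

```python
hidden_size = 4096
bytes_per_param = 2  # fp16 / bf16 weights

for name, vocab_size in [("LLaMA 2 7B", 32_000), ("LLaMA 3 8B", 128_256)]:
    params = vocab_size * hidden_size
    print(f"{name}: {params / 1e6:.0f}M parameters "
          f"(~{params * bytes_per_param / 2**30:.2f} GiB) "
          f"each for the input embedding table and the LM head")
```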

Hidden Size and Intermediate Size

  • Both models have a hidden size of 4,096, but the LLaMA 2 7B model has an intermediate size of 11,008 whereas the LLaMA 3 8B model has an intermediate size of 14,336.

  • Recommendation: Adjust the intermediate_size parameter in the configuration files for the convert_checkpoint.py and build.py scripts to match the LLaMA 3 8B model's value of 14,336 (hidden_size already matches at 4,096).
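
The wider intermediate size grows every feed-forward block. Assuming the gated SwiGLU MLP that LLaMA models use (gate, up and down projections, no biases), the per-block parameter count works out as follows:

```python
hidden_size = 4096

for name, intermediate_size in [("LLaMA 2 7B", 11_008), ("LLaMA 3 8B", 14_336)]:
    # gate_proj and up_proj map hidden -> intermediate, down_proj maps back
    params = 3 * hidden_size * intermediate_size
    print(f"{name}: {params / 1e6:.0f}M parameters per MLP block")
```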

Max Position Embeddings

  • The LLaMA 2 7B model has a maximum position embedding size of 4,096, while the LLaMA 3 8B model has a maximum position embedding size of 8,192.

  • Recommendation: Update the max_position_embeddings parameter in the configuration files to match the LLaMA 3 8B model's value of 8,192.

Number of Key-Value Heads

  • The LLaMA 2 7B model configuration specifies 32 key-value heads, while the LLaMA 3 8B model configuration specifies 8 key-value heads.

  • Recommendation: Modify the num_key_value_heads parameter in the configuration files to match the LLaMA 3 8B model's value of 8.
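
Dropping from 32 to 8 key-value heads means LLaMA 3 8B uses grouped-query attention (four query heads share each KV head), and the KV cache shrinks by the same factor of four. A rough estimate, assuming the 32 decoder layers and 128-dimensional heads both models use and a 2-byte cache element:

```python
num_layers = 32
head_dim = 128        # hidden_size 4096 / 32 attention heads
bytes_per_elem = 2    # fp16 / bf16 KV cache

def kv_cache_bytes_per_token(num_key_value_heads):
    # One K and one V vector per layer, each num_key_value_heads * head_dim wide
    return 2 * num_layers * num_key_value_heads * head_dim * bytes_per_elem

for name, kv_heads in [("LLaMA 2 7B", 32), ("LLaMA 3 8B", 8)]:
    print(f"{name}: {kv_cache_bytes_per_token(kv_heads) / 1024:.0f} KiB "
          f"of KV cache per token per sequence")
```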

RoPE (Rotary Position Embedding) Parameters

  • The LLaMA 2 7B model configuration uses rotary_base with a value of 10,000.0, while the LLaMA 3 8B model configuration uses rope_theta with a value of 500,000.0.

  • Recommendation: Update the RoPE-related parameters in the configuration files to match the LLaMA 3 8B model's values. Replace rotary_base with rope_theta and set its value to 500,000.0.
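
rope_theta is the base of the rotary frequency schedule; raising it from 10,000 to 500,000 stretches the wavelengths of the slow-rotating dimensions, which is what keeps positions distinguishable over the longer 8,192-token context. A small sketch of the standard RoPE schedule, assuming a 128-dimensional head:

```python
import math

head_dim = 128  # hidden_size 4096 / 32 attention heads

def rope_wavelengths(theta, dim=head_dim):
    # Standard RoPE inverse frequencies: theta^(-2i/dim) for each rotated pair
    inv_freqs = [theta ** (-2 * i / dim) for i in range(dim // 2)]
    return [2 * math.pi / f for f in inv_freqs]

for name, theta in [("LLaMA 2 7B (rotary_base)", 10_000.0),
                    ("LLaMA 3 8B (rope_theta)", 500_000.0)]:
    longest = rope_wavelengths(theta)[-1]
    print(f"{name}: slowest dimension repeats every ~{longest:,.0f} positions")
```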

Data Type

  • The LLaMA 2 7B model uses float16 as the data type, while the LLaMA 3 8B model uses bfloat16.

  • Recommendation: Consider updating the data type in the configuration files to match the LLaMA 3 8B model's data type of bfloat16. Modify the dtype parameter in the convert_checkpoint.py and build.py scripts accordingly.
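
bfloat16 trades mantissa precision for the full float32 exponent range, and since the LLaMA 3 checkpoint is stored in bfloat16, casting it to float16 can overflow large activation values. A quick comparison of the two formats (assumes PyTorch is installed):

```python
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, "
          f"smallest normal={info.tiny:.3e}, eps={info.eps:.1e}")
```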

Token IDs

  • The LLaMA 3 8B model configuration specifies bos_token_id as 128,000 and eos_token_id as 128,001, while the LLaMA 2 7B model configuration doesn't mention these token IDs.

  • Recommendation: Add the bos_token_id and eos_token_id parameters to the configuration files for the convert_checkpoint.py and build.py scripts, and set their values to match the LLaMA 3 8B model's values.
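
Pulling the recommendations together, the overrides below collect the LLaMA 3 8B values discussed above. This is a sketch only: the exact key names accepted by convert_checkpoint.py and trtllm-build vary between TensorRT-LLM releases, so map these values onto whatever configuration format your version expects.

```python
# Hypothetical override set - the values come from the LLaMA 3 8B config,
# but verify the key names against your TensorRT-LLM version's scripts.
llama3_8b_overrides = {
    "vocab_size": 128256,
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "max_position_embeddings": 8192,
    "num_key_value_heads": 8,
    "rope_theta": 500000.0,   # replaces rotary_base=10000.0 from the LLaMA 2 build
    "dtype": "bfloat16",
    "bos_token_id": 128000,
    "eos_token_id": 128001,
}
```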

By making these adjustments to the configuration files for the convert_checkpoint.py and build.py scripts, you can align the LLaMA 2 7B build configuration with the LLaMA 3 8B model configuration.

This will ensure consistency and compatibility between the models when building and running them using TensorRT-LLM.

Please note that some of these changes (the larger vocabulary and longer context window in particular) affect the model's performance and resource requirements, so consider the available hardware and adjust the parameters accordingly.
