
LLama2 Files Analysis

config.json

  • "_name_or_path": Specifies the name or path of the pretrained model, in this case, "meta-llama/Llama-2-7b-chat-hf".

  • "architectures": Indicates the model architecture, which is ["LlamaForCausalLM"].

  • "bos_token_id" and "eos_token_id": Specify the IDs for the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens, respectively.

  • "hidden_act": Defines the activation function used in the model, which is "silu" (Sigmoid Linear Unit).

  • "hidden_size": Represents the dimensionality of the hidden states in the model (4096).

  • "initializer_range": Specifies the range for initializing the model's weights (0.02).

  • "intermediate_size": Indicates the dimensionality of the intermediate layer in the model (11008).

  • "max_position_embeddings": Defines the maximum sequence length that the model can handle (4096).

  • "model_type": Specifies the type of the model, which is "llama".

  • "num_attention_heads" and "num_key_value_heads": Represent the number of attention heads and key-value heads in the model (32).

  • "num_hidden_layers": Indicates the number of hidden layers in the model (32).

  • "pretraining_tp": Specifies the tensor parallelism used during pretraining (1).

  • "rms_norm_eps": Defines the epsilon value for RMS normalization (1e-05).

  • "rope_scaling": Indicates the scaling factor for RoPE (Rotary Position Embedding), which is set to null.

  • "tie_word_embeddings": Specifies whether to tie the word embeddings (false).

  • "torch_dtype": Indicates the data type used for the model's weights ("float16").

  • "transformers_version": Specifies the version of the transformers library used (4.32.0.dev0).

  • "use_cache": Indicates whether to use caching during inference (true).

  • "vocab_size": Represents the size of the model's vocabulary (32000).

generation_config.json

  • "bos_token_id" and "eos_token_id": Specify the IDs for the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens, respectively.

  • "do_sample": Indicates whether to use sampling during generation (true).

  • "max_length": Defines the maximum length of the generated sequence (4096).

  • "pad_token_id": Specifies the ID for the padding token (0).

  • "temperature": Controls the randomness of the generated output (0.6).

  • "top_p": Specifies the cumulative probability threshold for top-p sampling (0.9).

  • "transformers_version": Indicates the version of the transformers library used (4.32.0.dev0).

model.safetensors.index.json

This file contains the weight map, which maps each of the model's parameter names to the safetensors shard file that stores it, so that loaders know which file to open when assembling the full set of weights.
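
The weight map is a plain JSON object, so it is easy to inspect. The sketch below (path assumed) counts how many tensors live in each shard, which is a quick sanity check that all shards are present before converting the checkpoint.

```python
import json
from collections import Counter

# Minimal sketch, assuming the checkpoint was downloaded to ./Llama-2-7b-chat-hf
with open("Llama-2-7b-chat-hf/model.safetensors.index.json") as f:
    index = json.load(f)

# "weight_map" maps each parameter name to the shard file that stores it
shard_counts = Counter(index["weight_map"].values())
for shard, n_tensors in sorted(shard_counts.items()):
    print(f"{shard}: {n_tensors} tensors")
```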

special_tokens_map.json

  • "bos_token", "eos_token", and "unk_token": Define the special tokens used in the model, such as the beginning-of-sequence (BOS), end-of-sequence (EOS), and unknown (UNK) tokens. Each token is represented as an object with properties like "content", "lstrip", "normalized", "rstrip", and "single_word".

tokenizer_config.json

  • "add_bos_token" and "add_eos_token": Specify whether to add the BOS and EOS tokens during tokenization (true and false, respectively).

  • "bos_token" and "eos_token": Define the BOS and EOS tokens as added tokens with properties similar to the special tokens.

  • "chat_template": Provides a template for generating chat-based responses. It includes instructions for handling system messages, user messages, and assistant messages, as well as special tokens like <<SYS>>, <</SYS>>, [INST], [/INST].

  • "clean_up_tokenization_spaces": Indicates whether to clean up tokenization spaces (false).

  • "legacy": Specifies whether to use legacy tokenization (false).

  • "model_max_length": Defines the maximum length of the model (a very large value).

  • "pad_token": Specifies the padding token (null).
