
Model Configuration

Configuration and execution process of the C++ GPT Runtime in TensorRT-LLM

Model Configuration and World Configuration

The C++ GPT Runtime is driven by two configuration objects: the model configuration, which describes the network itself, and the world configuration, which describes how execution is distributed across GPUs and nodes.

Model Configuration

This configuration is defined by the GptModelConfig class, which encapsulates the following parameters (a short illustrative sketch follows the list):

Vocabulary Size (vocabSize): The total number of unique words or tokens that the model can recognise.

Number of Layers (numLayers): The depth of the model, indicated by its layer count.

Number of Attention Heads (numHeads): In the attention block, this is the count of distinct 'heads' used for parallel processing of the attention mechanism.

Number of K/V Heads (numKvHeads): This specifies the number of heads for the Key (K) and Value (V) components in the attention mechanism. It defines the type of attention (Multi-head, Multi-query, or Group-query).

Hidden Size (hiddenSize): The dimensionality of the hidden layers.

Data Type (dataType): The data type used during model training and inference.

GPT Attention Plugin Usage (useGptAttentionPlugin): Indicates if a specialised GPT Attention plugin was used.

Input Packing (inputPacked): Determines if the input should be packed or padded.

Paged K/V Cache (pagedKvCache): Indicates if the Key/Value cache uses paging.

Tokens Per Block (tokensPerBlock): Relevant for paged K/V cache, indicating the number of tokens in each cache block.

Quantization Mode (quantMode): Controls the model's quantization method.

Max Batch Size (maxBatchSize) and Max Input/Output Lengths: Define the limits for batch size and sequence lengths.
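The relationship between numHeads and numKvHeads is what selects the attention variant. The snippet below is a hypothetical illustration of that rule only; the enum and function are invented for clarity and are not part of the TensorRT-LLM API.

```cpp
// Hypothetical helper (not TensorRT-LLM code): classify the attention variant
// implied by the numHeads / numKvHeads fields of the model configuration.
#include <stdexcept>

enum class AttentionVariant { MultiHead, MultiQuery, GroupedQuery };

AttentionVariant attentionVariant(int numHeads, int numKvHeads)
{
    if (numHeads <= 0 || numKvHeads <= 0 || numHeads % numKvHeads != 0)
        throw std::invalid_argument("numKvHeads must evenly divide numHeads");
    if (numKvHeads == numHeads) return AttentionVariant::MultiHead;   // one K/V head per query head
    if (numKvHeads == 1)        return AttentionVariant::MultiQuery;  // a single K/V head shared by all query heads
    return AttentionVariant::GroupedQuery;                            // each K/V head serves a group of query heads
}
```

For example, a Llama 2 70B-style configuration with numHeads = 64 and numKvHeads = 8 falls into the grouped-query case.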

World Configuration

Defined by the WorldConfig class, this configuration describes how the model executes in a distributed environment (across multiple GPUs, possibly spanning multiple nodes):

Tensor Parallelism (tensorParallelism): The number of ranks (processes) that cooperate in tensor parallelism, suited to environments with high inter-GPU bandwidth such as NVLink.

Pipeline Parallelism (pipelineParallelism): The number of ranks for Pipeline Parallelism, ideal for setups with lower inter-GPU bandwidth.

Rank (rank): The unique identifier for each process in the distributed setup.

GPUs Per Node (gpusPerNode): The number of GPUs on each node, which lets the runtime optimise communication between GPUs that share a node.
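Taken together, the world size must equal tensorParallelism × pipelineParallelism, and each rank can derive its tensor-parallel position, pipeline stage, and local GPU from its rank id. The sketch below shows one common decomposition using hypothetical names; it is an assumption for illustration, not the exact mapping TensorRT-LLM's WorldConfig uses internally.

```cpp
// Hypothetical illustration of how a rank id might map onto tensor-parallel /
// pipeline-parallel coordinates and a local GPU. TensorRT-LLM's actual
// WorldConfig may use a different convention.
struct RankCoordinates
{
    int tpRank;      // position inside the tensor-parallel group
    int ppRank;      // pipeline stage this rank belongs to
    int localDevice; // GPU index on the local node
};

RankCoordinates decomposeRank(int rank, int tensorParallelism, int gpusPerNode)
{
    return RankCoordinates{
        rank % tensorParallelism, // adjacent ranks form a tensor-parallel group
        rank / tensorParallelism, // consecutive groups form pipeline stages
        rank % gpusPerNode        // ranks are packed onto nodes in order
    };
}
```

With tensorParallelism = 2 and pipelineParallelism = 2, four ranks are needed; rank 3 would sit at tensor-parallel position 1 in pipeline stage 1.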

Usage Example

A typical multi-process launch follows these steps (sketched in code after the list):

  • Initialise MPI (Message Passing Interface) for distributed processing.

  • Obtain the rank and size of the MPI world.

  • Configure the WorldConfig for each process (rank).

  • Create a GptSession for each process.
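A minimal per-rank driver for these steps is sketched below. The MPI calls are standard; the TensorRT-LLM object construction is left as placeholder comments because the exact GptSession and WorldConfig constructor signatures depend on the release you build against.

```cpp
// Sketch of the per-rank driver. Only the MPI calls are concrete; the
// TensorRT-LLM construction is indicated by placeholder comments.
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    int worldSize = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // unique id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize); // total number of ranks

    // Placeholder: build the WorldConfig for this rank (or use the simplified
    // MPI-aware factory described in the next section), create a GptSession
    // from the model configuration, world configuration and serialised engine,
    // then run generation.

    MPI_Finalize();
    return 0;
}
```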

Simplified API

TensorRT-LLM offers a simplified API to create a WorldConfig using MPI settings.
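A hedged sketch of that simplified path is below. The WorldConfig::mpi factory name and its argument list are assumptions here; verify them against worldConfig.h in the TensorRT-LLM version you are using.

```cpp
// Hedged sketch: build the per-rank WorldConfig from the MPI environment.
// The factory name and arguments are assumptions -- check
// cpp/include/tensorrt_llm/runtime/worldConfig.h for the exact signature.
#include <tensorrt_llm/runtime/worldConfig.h>

using tensorrt_llm::runtime::WorldConfig;

WorldConfig makeWorldConfig()
{
    int const gpusPerNode         = 8; // GPUs available on each node
    int const tensorParallelism   = 2; // ranks in each tensor-parallel group
    int const pipelineParallelism = 1; // number of pipeline stages

    // Rank and world size are taken from the MPI environment, so there is no
    // need to call MPI_Comm_rank / MPI_Comm_size explicitly here.
    return WorldConfig::mpi(gpusPerNode, tensorParallelism, pipelineParallelism);
}
```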

Execution

The compiled C++ binary should be launched with the mpirun command, specifying the number of processes (ranks); for example, mpirun -n 2 ./gpt_runtime_example starts two ranks (the binary name here is a placeholder).

Summary

The C++ GPT Runtime in TensorRT-LLM allows the execution of large language models like GPT in a highly efficient, distributed manner.

Model configuration sets up the model's parameters, and world configuration manages its distributed execution across multiple GPUs and nodes.

Using MPI together with settings such as tensor and pipeline parallelism ensures that GPU resources are used efficiently for high-performance computing tasks.
