Model Configuration
Configuration and execution process of the C++ GPT Runtime in TensorRT-LLM: Model Configuration and World Configuration.
Model Configuration
This configuration is defined by the GptModelConfig class, which encapsulates several parameters (see the construction sketch after the list):
Vocabulary Size (vocabSize): The total number of unique words or tokens that the model can recognise.
Number of Layers (numLayers): The depth of the model, indicated by its layer count.
Number of Attention Heads (numHeads): The count of distinct 'heads' used for parallel processing in the attention block.
Number of K/V Heads (numKvHeads): The number of heads for the Key (K) and Value (V) components of the attention mechanism. It determines the attention variant: Multi-head (numKvHeads equals numHeads), Multi-query (numKvHeads is 1), or Group-query (numKvHeads lies between 1 and numHeads).
Hidden Size (hiddenSize): The dimensionality of the hidden layers.
Data Type (dataType): The data type used during model training and inference.
GPT Attention Plugin Usage (useGptAttentionPlugin): Indicates whether the specialised GPT Attention plugin was used when building the engine.
Input Packing (inputPacked): Determines whether the input is packed or padded.
Paged K/V Cache (pagedKvCache): Indicates whether the Key/Value cache uses paging.
Tokens Per Block (tokensPerBlock): For the paged K/V cache, the number of tokens stored in each cache block.
Quantization Mode (quantMode): Controls the model's quantization method.
Max Batch Size (maxBatchSize) and Max Input/Output Lengths: Define the limits for batch size and sequence lengths.
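As a rough illustration, the core dimensions are typically supplied when the GptModelConfig is constructed, and the remaining properties are set afterwards. The constructor shape and setter names below are assumptions derived from the parameter names above and should be checked against the installed TensorRT-LLM headers:

```cpp
#include <NvInfer.h>                              // nvinfer1::DataType
#include "tensorrt_llm/runtime/gptModelConfig.h"  // GptModelConfig

using tensorrt_llm::runtime::GptModelConfig;

int main()
{
    // Core dimensions (GPT-2-small-like values, for illustration only):
    // vocabSize, numLayers, numHeads, hiddenSize, dataType.
    GptModelConfig modelConfig(50257, 12, 12, 768, nvinfer1::DataType::kHALF);

    // Optional properties; the setter names are assumed from the member
    // names and may differ between TensorRT-LLM releases.
    modelConfig.setNbKvHeads(12);             // == numHeads -> multi-head attention
    modelConfig.useGptAttentionPlugin(true);  // engine built with the plugin
    modelConfig.usePackedInput(true);         // packed (un-padded) inputs
    modelConfig.usePagedKvCache(true);        // paged K/V cache enabled
    modelConfig.setTokensPerBlock(64);        // tokens per K/V cache block
    return 0;
}
```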
World Configuration
This configuration, captured by the WorldConfig class, governs execution of the model in a distributed environment, using multiple GPUs and possibly multiple nodes (see the sketch after the list):
Tensor Parallelism (tensorParallelism): The number of ranks (processes) working together in Tensor Parallelism, suitable for environments with high inter-GPU bandwidth like NVLINK.
Pipeline Parallelism (pipelineParallelism): The number of ranks for Pipeline Parallelism, ideal for setups with lower inter-GPU bandwidth.
Rank (rank): The unique identifier for each process in the distributed setup.
GPUs Per Node (gpusPerNode): The number of GPUs on each node; helps optimise communications between GPUs on the same node.
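For instance, a hypothetical 8-GPU job spread over two nodes could dedicate four ranks to tensor parallelism and two pipeline stages. The constructor argument order below follows the parameter list above and is an assumption to verify against the WorldConfig header:

```cpp
#include "tensorrt_llm/runtime/worldConfig.h"

using tensorrt_llm::runtime::WorldConfig;

// 8 ranks in total: tensorParallelism (4) * pipelineParallelism (2),
// with 4 GPUs on each of the two nodes. The argument order
// (tensorParallelism, pipelineParallelism, rank, gpusPerNode) is assumed.
WorldConfig makeWorldConfig(int rank)
{
    return WorldConfig(/*tensorParallelism=*/4,
                       /*pipelineParallelism=*/2,
                       /*rank=*/rank,
                       /*gpusPerNode=*/4);
}
```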
Usage Example
Initialise MPI (Message Passing Interface) for distributed processing.
Obtain the rank and size of the MPI world.
Configure the WorldConfig for each process (rank).
Create a GptSession for each process.
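Put together, the four steps might look like the sketch below. The MPI calls are standard, while the TensorRT-LLM session construction is abbreviated to a comment because its arguments (engine buffer, session options) depend on the release in use:

```cpp
#include <mpi.h>
#include "tensorrt_llm/runtime/worldConfig.h"

using tensorrt_llm::runtime::WorldConfig;

int main(int argc, char* argv[])
{
    // 1. Initialise MPI for distributed processing.
    MPI_Init(&argc, &argv);

    // 2. Obtain the rank and size of the MPI world.
    int rank = 0, worldSize = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    // 3. Configure the WorldConfig for this process (rank); here every
    //    rank is assigned to tensor parallelism.
    WorldConfig worldConfig(/*tensorParallelism=*/worldSize,
                            /*pipelineParallelism=*/1,
                            /*rank=*/rank);

    // 4. Create a GptSession for this rank from the model configuration,
    //    worldConfig, and the serialised engine (details omitted; see the
    //    GptSession header for the exact constructor).

    MPI_Finalize();
    return 0;
}
```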
Simplified API
TensorRT-LLM offers a simplified API to create a WorldConfig using MPI settings.
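With that factory, step 3 above collapses to a single call. The exact overload varies by release (some versions also accept a logger or GPU counts), so confirm against the headers:

```cpp
// Reads the rank and world size from the ambient MPI environment instead
// of spelling them out by hand; parallelism sizes can typically be passed
// as optional arguments.
auto worldConfig = tensorrt_llm::runtime::WorldConfig::mpi();
```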
Execution
The compiled C++ code should be executed using the mpirun command, specifying the number of processes (ranks).
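For example, a configuration with two ranks (tensorParallelism * pipelineParallelism = 2) could be launched as follows, where the executable name is a placeholder:

```
mpirun -n 2 ./gptSessionExample
```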
Summary
The C++ GPT Runtime in TensorRT-LLM allows the execution of large language models like GPT in a highly efficient, distributed manner.
Model configuration sets up the model's parameters, and world configuration manages its distributed execution across multiple GPUs and nodes.
The use of MPI and specific settings like Tensor and Pipeline Parallelism ensures optimised utilisation of resources for high-performance computing tasks.