Model
This module contains a collection of pre-defined model architectures and utilities for building and customizing large language models (LLMs).
Key Model Classes
PretrainedModel
A base class for all pre-trained models in TensorRT-LLM. It provides common functionality such as loading weights, saving checkpoints, and preparing inputs for the model.
DecoderModel
A base class for decoder-only models used in tasks like language modeling and generation. It inherits from PretrainedModel and adds specific methods for causal language modeling.
EncoderModel
A base class for encoder-only models used in tasks like sequence classification and question answering. It inherits from PretrainedModel and provides methods for encoding input sequences.
Model-specific classes
TensorRT-LLM provides several pre-defined model architectures, each with its own class. Some notable examples include:
GPTModel and GPTForCausalLM: Implement the GPT (Generative Pre-trained Transformer) architecture for language modeling and generation.
LLaMAModel and LLaMAForCausalLM: Implement the LLaMA architecture from Meta AI, a powerful and efficient family of open LLMs.
BertModel, BertForSequenceClassification, and BertForQuestionAnswering: Implement the BERT (Bidirectional Encoder Representations from Transformers) architecture for sequence classification and question answering tasks.
PretrainedConfig
A configuration class that stores hyperparameters and settings for a pre-trained model. It can be loaded from a JSON file or a dictionary.
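The sketch below is illustrative: the checkpoint path is a placeholder, and the exact set of fields required by from_dict depends on the TensorRT-LLM release and the target architecture.

```python
from tensorrt_llm.models import PretrainedConfig

# Load the configuration that ships with a converted TensorRT-LLM checkpoint ...
config = PretrainedConfig.from_json_file("/path/to/tllm_checkpoint/config.json")

# ... or build one directly from a dictionary of hyperparameters.
# (Additional fields may be required depending on the release and architecture.)
config = PretrainedConfig.from_dict({
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "vocab_size": 32000,
})

print(config.architecture, config.dtype)
```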
Key Features and Functionalities
Model initialization: The from_config and from_checkpoint class methods allow initializing a model from a PretrainedConfig object or a checkpoint directory, respectively. This makes it easy to load pre-trained weights and configure the model architecture.
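A minimal sketch, assuming a LLaMA model that has already been converted to the TensorRT-LLM checkpoint format (the directory path is a placeholder):

```python
from tensorrt_llm.models import LLaMAForCausalLM, PretrainedConfig

# Initialize the architecture from a configuration object (weights are not loaded) ...
config = PretrainedConfig.from_json_file("/path/to/tllm_checkpoint/config.json")
model = LLaMAForCausalLM.from_config(config)

# ... or load the architecture and weights together from a checkpoint directory.
model = LLaMAForCausalLM.from_checkpoint("/path/to/tllm_checkpoint")
```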
Quantization: The quantize class method enables quantizing a pre-trained model for reduced memory footprint and faster inference. It supports various quantization configurations through the QuantConfig class.
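A hedged sketch of the quantization flow; the exact quantize() signature, the QuantConfig import path, and the directories below vary between releases and are illustrative:

```python
from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.models.modeling_utils import QuantConfig  # import path may differ by release
from tensorrt_llm.quantization import QuantAlgo

# Request 4-bit AWQ weight quantization with 16-bit activations.
quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

# Quantize a Hugging Face checkpoint and write a quantized TensorRT-LLM checkpoint.
LLaMAForCausalLM.quantize(
    "/path/to/hf_model",        # source Hugging Face model directory
    "/path/to/quantized_ckpt",  # output TensorRT-LLM checkpoint directory
    dtype="float16",
    quant_config=quant_config,
)
```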
Dynamic input shapes: The prepare_inputs method in PretrainedModel and its subclasses allows specifying the maximum input sizes for dynamic shape inference. This enables efficient memory allocation and optimization when using TensorRT.
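In practice, prepare_inputs is usually invoked on your behalf by the engine-build flow; the keyword names below are illustrative and differ somewhat between releases:

```python
from tensorrt_llm.models import LLaMAForCausalLM

model = LLaMAForCausalLM.from_checkpoint("/path/to/tllm_checkpoint")  # placeholder path

# Declare the largest shapes the engine must serve; TensorRT derives its
# dynamic-shape optimization profiles from these limits.
inputs = model.prepare_inputs(
    max_batch_size=8,
    max_input_len=1024,
    max_seq_len=2048,   # prompt plus generated tokens
    max_beam_width=1,
    use_cache=True,
)
```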
Multi-GPU support: TensorRT-LLM models can be distributed across multiple GPUs using the Mapping class, which specifies the parallel strategy for tensor sharding and model parallelism.
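A minimal sketch describing a 4-GPU layout with 2-way tensor parallelism and 2-way pipeline parallelism; in practice the rank comes from the MPI launcher, and the mapping is passed to the checkpoint-conversion or model-loading APIs:

```python
from tensorrt_llm import Mapping

# 4 GPUs total: 2-way tensor parallelism x 2-way pipeline parallelism.
# Each rank builds and runs only its own shard of the model.
mapping = Mapping(world_size=4, rank=0, tp_size=2, pp_size=2)

print(mapping.tp_rank, mapping.pp_rank)  # this rank's position in the TP and PP groups
```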
LoRA (Low-Rank Adaptation): Some models, like LLaMAForCausalLM, support LoRA for efficient fine-tuning and adaptation. The use_lora method allows loading LoRA weights from a LoraBuildConfig object.
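A hedged sketch; in newer releases LoraBuildConfig has been renamed LoraConfig and field names may differ, and the adapter path and target modules below are illustrative:

```python
from tensorrt_llm.lora_manager import LoraBuildConfig  # LoraConfig in newer releases
from tensorrt_llm.models import LLaMAForCausalLM

model = LLaMAForCausalLM.from_checkpoint("/path/to/tllm_checkpoint")  # placeholder path

lora_config = LoraBuildConfig(
    lora_dir=["/path/to/hf_lora_adapter"],              # Hugging Face-format LoRA adapter
    lora_target_modules=["attn_q", "attn_k", "attn_v"],
    max_lora_rank=8,
)

model.use_lora(lora_config)  # attach the LoRA weights before building the engine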
Customization: The modular design of TensorRT-LLM allows users to easily customize and extend the provided model architectures. Users can subclass the base model classes and override methods to incorporate new features or modify model behavior, as in the sketch below.
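The subclass below is purely illustrative (it is not part of the TensorRT-LLM API); it shows the general pattern of extending a provided architecture and overriding one of its methods:

```python
from tensorrt_llm.models import LLaMAForCausalLM


class MyLLaMAForCausalLM(LLaMAForCausalLM):
    """Hypothetical LLaMA variant that customizes how weights are loaded."""

    def load(self, weights, *args, **kwargs):
        # Example hook: filter or rename weights before the standard loader runs.
        weights = {name: value for name, value in weights.items()
                   if not name.endswith(".debug")}
        return super().load(weights, *args, **kwargs)
```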
Inference and generation: Decoder models mix in GenerationMixin, which provides the helpers for preparing autoregressive-generation inputs (KV-cache tensors, attention masks, position ids, and so on). Text generation itself is performed by the TensorRT-LLM runtime (for example, ModelRunner or GenerationSession) against an engine built from the model, leveraging the optimized kernels and plugins provided by TensorRT for efficient inference, as in the sketch below.
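A hedged end-to-end sketch: the engine directory, prompt token ids, and sampling arguments are placeholders, and the ModelRunner API differs slightly between releases:

```python
import torch
from tensorrt_llm.runtime import ModelRunner

# Load a TensorRT engine previously built from a PretrainedModel.
runner = ModelRunner.from_dir("/path/to/engine_dir")

# Pre-tokenized prompt (one sequence in the batch); decode outputs with the matching tokenizer.
input_ids = [torch.tensor([1, 15043, 29892], dtype=torch.int32)]

outputs = runner.generate(input_ids, max_new_tokens=32, end_id=2, pad_id=2)
print(outputs[0])  # generated token ids
```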
The TensorRT-LLM models module offers a wide range of pre-trained models and a flexible API for building and customizing LLMs.
It integrates closely with the underlying TensorRT engine to leverage optimizations like kernel fusion, mixed precision, and dynamic shape inference.
By using the provided model classes and configuration options, users can easily load pre-trained weights, quantize models, distribute across multiple GPUs, and perform efficient inference and generation.
The modular design allows for extensibility and customization, enabling users to adapt the models to their specific use cases and requirements.
Overall, the TensorRT-LLM models module provides a powerful and user-friendly interface for working with state-of-the-art LLMs while leveraging the performance optimizations offered by TensorRT.