Runtime
The TensorRT-LLM Runtime API provides a set of classes and functions for efficient execution and management of large language models (LLMs) using TensorRT.
It offers a high-level interface for loading models, performing inference, and generating sequences. Let's dive into the key components and how to use them.
GenerationSession
The GenerationSession class is the core component of the runtime API. It encapsulates the TensorRT execution engine, handles memory allocation, and provides methods for sequence generation. To use the GenerationSession, you create an instance by providing the model configuration, the engine buffer, and the mapping information.

The setup method configures the session with parameters such as the batch size, maximum context length, and beam width. The decode method is the main entry point for sequence generation: it takes input IDs, context lengths, and a sampling configuration, and generates output sequences. The GenerationSession also provides methods for specific generation scenarios, such as regular decoding (decode_regular) and streaming decoding (decode_stream).
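A minimal sketch of this flow, assuming an engine that has already been built and serialized. The exact ModelConfig fields and the setup/decode signatures vary across TensorRT-LLM versions, so treat the file path, token IDs, and argument names below as illustrative rather than definitive.

```python
import torch
from tensorrt_llm import Mapping
from tensorrt_llm.runtime import GenerationSession, SamplingConfig

# Read a previously built, serialized engine (path is illustrative).
with open("llama_float16_tp1_rank0.engine", "rb") as f:
    engine_buffer = f.read()

model_config = ...  # a ModelConfig instance; see the sketch in the next section
mapping = Mapping(world_size=1, rank=0, tp_size=1)  # single-GPU layout

session = GenerationSession(model_config, engine_buffer, mapping)

# Configure batch and sequence limits before decoding.
session.setup(batch_size=1, max_context_length=128, max_new_tokens=64)

input_ids = torch.tensor([[1, 15043, 3186]], dtype=torch.int32, device="cuda")
context_lengths = torch.tensor([3], dtype=torch.int32, device="cuda")
sampling_config = SamplingConfig(end_id=2, pad_id=2)  # IDs are illustrative

output_ids = session.decode(input_ids, context_lengths, sampling_config)
```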
ModelConfig
The ModelConfig class stores the configuration parameters of the LLM, such as the maximum batch size, beam width, vocabulary size, number of layers, and number of attention heads. It is used to initialize the GenerationSession and provides information about the model architecture and capabilities.
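A sketch of constructing a ModelConfig by hand. The field set differs between TensorRT-LLM versions and model architectures, and in practice these values are usually read from the config.json written next to the engine at build time; the names and numbers below are illustrative placeholders for a LLaMA-7B-style model.

```python
from tensorrt_llm.runtime import ModelConfig

# Illustrative values; in practice, parse them from the config.json
# produced alongside the engine at build time.
model_config = ModelConfig(
    max_batch_size=8,
    max_beam_width=1,
    vocab_size=32000,
    num_layers=32,
    num_heads=32,
    num_kv_heads=32,
    hidden_size=4096,
    gpt_attention_plugin=True,   # must match how the engine was built
    remove_input_padding=False,
    dtype="float16",
)
```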
ModelRunner
The ModelRunner class is a high-level interface that wraps the GenerationSession and provides a user-friendly API for generating sequences. It can be created using the from_dir or from_engine class methods, which load the model from a directory or from a TensorRT engine, respectively.

The generate method is the primary method for generating sequences. It takes a list of input IDs, a sampling configuration, and optional parameters such as prompt tables, LoRA weights, and stopping criteria. The ModelRunner also provides properties to access model information, such as the vocabulary size, hidden size, and number of layers.
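A short sketch of the typical ModelRunner flow, assuming an engine directory produced by the TensorRT-LLM build step; the directory path, token IDs, and sampling values are illustrative.

```python
import torch
from tensorrt_llm.runtime import ModelRunner

# Load the engine and its configuration from a build-output directory.
runner = ModelRunner.from_dir(engine_dir="./llama-7b-engine", rank=0)

# One tensor of token IDs per request in the batch (IDs are illustrative;
# in practice they come from the model's tokenizer).
batch_input_ids = [torch.tensor([1, 15043, 3186], dtype=torch.int32)]

output_ids = runner.generate(
    batch_input_ids,
    max_new_tokens=64,
    end_id=2,            # use your tokenizer's EOS/pad IDs
    pad_id=2,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
)
# output_ids has shape [batch_size, num_beams, sequence_length].
```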
SamplingConfig
The SamplingConfig class represents the configuration that controls the generation process, such as the maximum number of new tokens, beam search parameters, and various sampling techniques (e.g., temperature, top-k, top-p). It is passed to the generate method of the ModelRunner to customize the generation behavior.
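A sketch of a SamplingConfig; the available fields vary somewhat by version, but end-of-sequence and padding IDs plus the usual sampling knobs are the common core.

```python
from tensorrt_llm.runtime import SamplingConfig

# end_id/pad_id come from the tokenizer; the remaining fields tune decoding.
sampling_config = SamplingConfig(
    end_id=2,
    pad_id=2,
    max_new_tokens=128,
    num_beams=1,        # set > 1 to switch to beam search
    temperature=0.7,
    top_k=50,
    top_p=0.9,
)
```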
StoppingCriteria and LogitsProcessor
The StoppingCriteria and LogitsProcessor classes provide extensibility points for custom stopping criteria and logits processing during generation. You can create your own stopping criteria by subclassing StoppingCriteria and implementing the desired logic. Similarly, you can create custom logits processors by subclassing LogitsProcessor to modify the generated logits before sampling, as shown in the sketch below.
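A sketch of both extension points. The exact callback signature differs between TensorRT-LLM versions (it broadly mirrors the HuggingFace interface, with the decoding step passed alongside the token IDs and logits), so check the base-class definitions in your installed version before relying on the signatures assumed here.

```python
import time
import torch
from tensorrt_llm.runtime import LogitsProcessor, StoppingCriteria

class WallClockStopping(StoppingCriteria):
    """Stop generation once a wall-clock budget is exhausted.

    Assumes a __call__(step, input_ids, scores) -> bool interface;
    verify against the base class in your TensorRT-LLM version.
    """
    def __init__(self, max_seconds: float):
        self.deadline = time.monotonic() + max_seconds

    def __call__(self, step, input_ids, scores):
        return time.monotonic() >= self.deadline

class TokenBanProcessor(LogitsProcessor):
    """Prevent specific token IDs from ever being sampled (illustrative)."""
    def __init__(self, banned_ids):
        self.banned_ids = list(banned_ids)

    def __call__(self, step, input_ids, scores):
        scores[..., self.banned_ids] = float("-inf")
        return scores
```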
KVCacheManager
The KVCacheManager class manages the key-value cache for efficient memory utilization during generation. It is used internally by the GenerationSession to allocate and manage the memory blocks that store the key-value pairs.
Session
The Session class represents a managed TensorRT runtime session. It provides methods for creating a session from an existing TensorRT engine or from a serialized engine. The run method executes the TensorRT engine with the given inputs and outputs.
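A minimal sketch of this lower-level API, assuming a standalone FP16 engine whose single input tensor is named "x"; the tensor names, shapes, and dtypes must match your engine's actual bindings.

```python
import tensorrt as trt
import torch
from tensorrt_llm.runtime import Session, TensorInfo

# Deserialize a raw TensorRT engine (path is illustrative).
with open("model.engine", "rb") as f:
    session = Session.from_serialized_engine(f.read())

inputs = {"x": torch.randn(1, 16, dtype=torch.float16, device="cuda")}

# Query the output shapes for these input shapes, then allocate outputs
# (assuming FP16 outputs here, to match the engine's build precision).
out_info = session.infer_shapes(
    [TensorInfo("x", trt.DataType.HALF, tuple(inputs["x"].shape))]
)
outputs = {
    t.name: torch.empty(tuple(t.shape), dtype=torch.float16, device="cuda")
    for t in out_info
}

stream = torch.cuda.current_stream().cuda_stream
ok = session.run(inputs, outputs, stream)  # returns False on failure
torch.cuda.synchronize()
```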
To use the TensorRT-LLM Runtime API, you typically start by creating a ModelRunner instance with the from_dir or from_engine method, specifying the model directory or the TensorRT engine file. Then you call the generate method on the ModelRunner instance, providing the input IDs, a sampling configuration, and any additional parameters.
The runtime API handles the underlying execution details, such as memory management, tensor allocation, and TensorRT engine execution. It abstracts away the complexities of TensorRT and provides a high-level interface for generating sequences efficiently.
It's important to note that the runtime API is designed to work with models that have been optimized and compiled using TensorRT. You need to ensure that the model is properly converted and serialized into a TensorRT engine before using it with the runtime API.
Overall, the TensorRT-LLM Runtime API simplifies the process of deploying and executing large language models in production environments. It leverages the performance optimizations provided by TensorRT while offering a convenient and flexible interface for generating sequences and customizing the generation process.