Runtime
The TensorRT-LLM Runtime API provides a set of classes and functions for efficient execution and management of large language models (LLMs) using TensorRT.
It offers a high-level interface for loading models, performing inference, and generating sequences. Let's dive into the key components and how they should be used.
GenerationSession
The GenerationSession class is the core component of the runtime API. It encapsulates the TensorRT execution engine, handles memory allocation, and provides methods for sequence generation.
To use the GenerationSession, you create an instance by providing the model configuration, engine buffer, and mapping information.
The setup method configures the session with parameters such as batch size, maximum context length, and beam width.
The decode method is the main entry point for sequence generation. It takes input IDs, context lengths, and a sampling configuration as input and generates output sequences.
The GenerationSession also provides methods for handling specific generation scenarios, such as regular decoding (decode_regular) and streaming decoding (decode_stream).
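As a rough illustration, the sketch below drives a GenerationSession directly. Treat it as a minimal sketch, not a definitive recipe: the constructor and the setup/decode keyword names vary across TensorRT-LLM releases, and model_config, engine_buffer, and mapping are assumed to have been loaded elsewhere (for example, from a serialized engine directory).

```python
# Minimal sketch; keyword names are assumptions that vary by TensorRT-LLM version.
import torch
from tensorrt_llm.runtime import GenerationSession, SamplingConfig

# model_config, engine_buffer, and mapping are assumed to be loaded already.
session = GenerationSession(model_config, engine_buffer, mapping)

# Reserve buffers for this workload before decoding.
session.setup(batch_size=1,
              max_context_length=128,
              max_new_tokens=64,
              beam_width=1)

input_ids = torch.tensor([[1, 2, 3, 4]], dtype=torch.int32, device="cuda")
context_lengths = torch.tensor([4], dtype=torch.int32, device="cuda")
sampling_config = SamplingConfig(end_id=2, pad_id=2, temperature=0.8, top_k=50)

# decode runs the full generation loop and returns the output token IDs.
output_ids = session.decode(input_ids, context_lengths, sampling_config)
```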
ModelConfig
The ModelConfig class stores the configuration parameters of the LLM, such as the maximum batch size, beam width, vocabulary size, number of layers, and number of attention heads.
It is used to initialize the GenerationSession and provides information about the model architecture and capabilities.
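For reference, constructing a ModelConfig might look like the sketch below. The field names here are assumptions that differ between TensorRT-LLM releases, so check the ModelConfig definition shipped with your version.

```python
from tensorrt_llm.runtime import ModelConfig

# Hypothetical configuration for a small decoder-only model; verify every
# field name against your installed tensorrt_llm.runtime.ModelConfig.
model_config = ModelConfig(
    max_batch_size=8,
    max_beam_width=1,
    vocab_size=32000,
    num_layers=32,
    num_heads=32,
    num_kv_heads=32,
    hidden_size=4096,
    gpt_attention_plugin=True,
    remove_input_padding=True,
)
```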
ModelRunner
The ModelRunner class is a high-level interface that wraps the GenerationSession and provides a user-friendly API for generating sequences.
It can be created using the from_dir or from_engine class methods, which load the model from a directory or from a TensorRT engine, respectively.
The generate method is the primary method for generating sequences. It takes a list of input IDs, a sampling configuration, and optional parameters such as prompt tables, LoRA weights, and stopping criteria.
The ModelRunner also provides properties for accessing model information, such as the vocabulary size, hidden size, and number of layers.
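A typical end-to-end call might look like the following sketch. The engine directory path is a placeholder, and the generate keyword names (max_new_tokens, end_id, pad_id, and so on) should be verified against the version you have installed.

```python
import torch
from tensorrt_llm.runtime import ModelRunner

# Load a prebuilt engine; the path is a placeholder.
runner = ModelRunner.from_dir(engine_dir="/path/to/engine_dir", rank=0)

# One 1-D tensor of token IDs per request in the batch.
batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]

output_ids = runner.generate(batch_input_ids,
                             max_new_tokens=64,
                             end_id=2,
                             pad_id=2,
                             temperature=0.8,
                             top_k=50)
print(output_ids.shape)  # [batch_size, num_beams, sequence_length]
```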
SamplingConfig
The SamplingConfig class represents the configuration for controlling the generation process, such as the maximum number of new tokens, beam search parameters, and various sampling techniques (e.g., temperature, top-k, top-p).
It is passed to the generate method of the ModelRunner to customize the generation behavior.
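A sketch of building a SamplingConfig is shown below; the available fields and their defaults depend on the installed TensorRT-LLM version, so the specific values here are illustrative assumptions.

```python
from tensorrt_llm.runtime import SamplingConfig

sampling_config = SamplingConfig(
    end_id=2,            # EOS token ID for your tokenizer (assumption)
    pad_id=2,            # padding token ID (assumption)
    max_new_tokens=128,  # generation budget
    num_beams=1,         # 1 = greedy/sampling, >1 = beam search
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
```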
StoppingCriteria and LogitsProcessor
The StoppingCriteria and LogitsProcessor classes provide extensibility points for custom stopping criteria and logits processing during generation.
You can create your own stopping criteria by subclassing StoppingCriteria and implementing the desired logic.
Similarly, you can create custom logits processors by subclassing LogitsProcessor to modify the generated logits before sampling.
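The subclasses below sketch the general pattern. The exact __call__ signatures are assumptions and should be checked against the base classes in your tensorrt_llm.runtime module before use.

```python
# Hedged sketches; verify the __call__ signatures against your installed version.
import torch
from tensorrt_llm.runtime import LogitsProcessor, StoppingCriteria

class MaxStepCriteria(StoppingCriteria):
    """Stop generation after a fixed number of decode steps."""

    def __init__(self, max_steps: int):
        self.max_steps = max_steps

    def __call__(self, step: int, output_ids: torch.Tensor,
                 logits: torch.Tensor) -> bool:
        # Returning True signals the generation loop to stop.
        return step >= self.max_steps

class BanTokenProcessor(LogitsProcessor):
    """Mask out specific token IDs so they can never be sampled."""

    def __init__(self, banned_ids: list[int]):
        self.banned_ids = banned_ids

    def __call__(self, step: int, input_ids: torch.Tensor,
                 scores: torch.Tensor) -> torch.Tensor:
        scores[..., self.banned_ids] = float("-inf")
        return scores
```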
KVCacheManager
The KVCacheManager class manages the key-value cache for efficient memory utilization during generation.
It is used internally by the GenerationSession to allocate and manage memory blocks for storing key-value pairs.
Session
The Session class represents a managed TensorRT runtime session.
It provides methods for creating a session from an existing TensorRT engine or from a serialized engine.
The run method executes the TensorRT engine with the given inputs and outputs.
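A low-level sketch follows. The tensor name "input_ids", the shapes, and the float32 output dtype are placeholders that must match your engine's actual I/O bindings.

```python
import tensorrt as trt
import torch
from tensorrt_llm.runtime import Session, TensorInfo

# Deserialize an engine from disk; the path is a placeholder.
with open("/path/to/model.engine", "rb") as f:
    session = Session.from_serialized_engine(f.read())

stream = torch.cuda.current_stream().cuda_stream
inputs = {"input_ids": torch.ones(1, 16, dtype=torch.int32, device="cuda")}

# Ask the engine which output shapes these input shapes imply, then
# allocate matching buffers (float32 is an assumption here).
out_info = session.infer_shapes(
    [TensorInfo("input_ids", trt.DataType.INT32, (1, 16))])
outputs = {
    t.name: torch.empty(tuple(t.shape), dtype=torch.float32, device="cuda")
    for t in out_info
}

ok = session.run(inputs, outputs, stream)  # returns False on failure
torch.cuda.synchronize()
```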
To use the TensorRT-LLM Runtime API, you typically start by creating a ModelRunner instance using the from_dir or from_engine methods, specifying the model directory or TensorRT engine file.
Then, you can call the generate method on the ModelRunner instance, providing the input IDs, sampling configuration, and any additional parameters.
The runtime API handles the underlying execution details, such as memory management, tensor allocation, and TensorRT engine execution. It abstracts away the complexities of TensorRT and provides a high-level interface for generating sequences efficiently.
It's important to note that the runtime API is designed to work with models that have been optimized and compiled using TensorRT. You need to ensure that the model is properly converted and serialized into a TensorRT engine before using it with the runtime API.
Overall, the TensorRT-LLM Runtime API simplifies the process of deploying and executing large language models in production environments. It leverages the performance optimizations provided by TensorRT while offering a convenient and flexible interface for generating sequences and customizing the generation process.