Runtime

The TensorRT-LLM Runtime API provides a set of classes and functions for efficient execution and management of large language models (LLMs) using TensorRT.

It offers a high-level interface for loading models, performing inference, and generating sequences. Let's dive into the key components and how they should be used.

GenerationSession

  • The GenerationSession class is the core component of the runtime API. It encapsulates the TensorRT execution engine, handles memory allocation, and provides methods for sequence generation.

  • To use the GenerationSession, you need to create an instance by providing the model configuration, engine buffer, and mapping information.

  • The setup method is used to configure the session with parameters such as batch size, maximum context length, and beam width.

  • The decode method is the main entry point for sequence generation. It takes input token IDs, context lengths, and a sampling configuration, and produces output sequences (see the sketch after this list).

  • The GenerationSession also provides methods for handling specific generation scenarios, such as regular decoding (decode_regular) and streaming decoding (decode_stream).
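
For illustration, here is a minimal sketch of the GenerationSession flow. It assumes that model_config (a ModelConfig populated from the engine's build-time configuration; see the next section) and engine_buffer (the bytes of the serialized engine, typically rank0.engine) have already been prepared, and the token IDs and sampling values are placeholders:

```python
import torch
import tensorrt_llm
from tensorrt_llm.runtime import GenerationSession, SamplingConfig

# Assumed to be prepared beforehand:
#   model_config  - ModelConfig built from the engine's configuration
#   engine_buffer - bytes of the serialized engine (e.g. rank0.engine)
mapping = tensorrt_llm.Mapping(world_size=1, rank=0, tp_size=1, pp_size=1)
session = GenerationSession(model_config, engine_buffer, mapping)

# A single-sequence batch; the token IDs stand in for a tokenized prompt.
input_ids = torch.tensor([[1, 15043, 29892]], dtype=torch.int32, device="cuda")
context_lengths = torch.tensor([input_ids.shape[1]], dtype=torch.int32, device="cuda")
sampling_config = SamplingConfig(end_id=2, pad_id=2, temperature=0.8, top_p=0.9)

# Allocate buffers for this batch, then generate.
session.setup(batch_size=1,
              max_context_length=input_ids.shape[1],
              max_new_tokens=32,
              beam_width=1)
output_ids = session.decode(input_ids, context_lengths, sampling_config)
```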

ModelConfig

  • The ModelConfig class stores the configuration parameters of the LLM, such as the maximum batch size, beam width, vocabulary size, number of layers, and attention heads.

  • It is used to initialize the GenerationSession and provides information about the model architecture and capabilities (see the sketch below).
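
The values that ModelConfig carries come from the engine build step; in the TensorRT-LLM examples they are typically parsed out of the config.json written next to the serialized engine. A quick, version-agnostic way to see what is available is to inspect that file directly (the path and key layout here are assumptions and differ between releases):

```python
import json

# Inspect the build-time configuration shipped alongside the engine;
# these are the values a ModelConfig instance is populated from.
with open("./engine_dir/config.json") as f:
    cfg = json.load(f)

print(list(cfg.keys()))
```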

ModelRunner

  • The ModelRunner class is a high-level interface that wraps the GenerationSession and provides a user-friendly API for generating sequences.

  • It can be created using the from_dir or from_engine class methods, which load the model from a directory or a TensorRT engine, respectively.

  • The generate method is the primary method for generating sequences. It takes a batch of input token IDs, a sampling configuration, and optional parameters such as prompt tables, LoRA weights, and stopping criteria (see the sketch after this list).

  • The ModelRunner also provides properties to access model information, such as the vocabulary size, hidden size, and number of layers.
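
A minimal sketch of the high-level path is shown below. It assumes an engine built with trtllm-build lives in ./engine_dir; the token IDs and the end/pad token ID of 2 are placeholders that depend on the model and tokenizer:

```python
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="./engine_dir")
print(runner.vocab_size, runner.hidden_size, runner.num_layers)  # model properties

# generate() takes a batch of token-ID tensors; sampling options can be
# passed as keyword arguments that override the defaults.
input_ids = [torch.tensor([1, 15043, 29892, 590, 1024, 338], dtype=torch.int32)]
output_ids = runner.generate(
    batch_input_ids=input_ids,
    max_new_tokens=32,
    end_id=2,
    pad_id=2,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
)
print(output_ids.shape)  # [batch_size, num_beams, sequence_length]
```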

SamplingConfig

  • The SamplingConfig class represents the configuration for controlling the generation process, such as the maximum number of new tokens, beam search parameters, and various sampling techniques (e.g., temperature, top-k, top-p).

  • It is passed to the decode method of the GenerationSession or the generate method of the ModelRunner to customize the generation behavior (a short example follows this list).
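
A short example follows; the end and pad token IDs of 2 are placeholders that depend on the tokenizer, and the fields shown are the commonly used ones:

```python
from tensorrt_llm.runtime import SamplingConfig

sampling_config = SamplingConfig(
    end_id=2,            # end-of-sequence token ID (model dependent)
    pad_id=2,            # padding token ID
    max_new_tokens=64,   # upper bound on generated tokens
    num_beams=1,         # values > 1 enable beam search
    temperature=0.7,
    top_k=40,
    top_p=0.9,
)
```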

StoppingCriteria and LogitsProcessor

  • The StoppingCriteria and LogitsProcessor classes provide extensibility points for custom stopping criteria and logits processing during generation.

  • You can create your own stopping criteria by subclassing StoppingCriteria and implementing the desired logic.

  • Similarly, you can create custom logits processors by subclassing LogitsProcessor to modify the logits before sampling; sketches of both extension points follow this list.
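
The sketch below illustrates both extension points. Note that the __call__ signature used here (step, ids, logits) is an assumption and may not match the exact interface of your TensorRT-LLM version, so check the base classes before relying on it:

```python
from tensorrt_llm.runtime import LogitsProcessor, StoppingCriteria

class MaxStepsCriteria(StoppingCriteria):
    """Stop generation after a fixed number of decoding steps."""

    def __init__(self, max_steps: int):
        self.max_steps = max_steps

    def __call__(self, step, ids, logits):  # signature is an assumption, see note above
        return step >= self.max_steps

class TemperatureScaler(LogitsProcessor):
    """Divide the logits by a constant factor before sampling."""

    def __init__(self, scale: float):
        self.scale = scale

    def __call__(self, step, ids, logits):  # signature is an assumption, see note above
        return logits / self.scale

# Both can then be passed to ModelRunner.generate via its
# stopping_criteria and logits_processor arguments.
```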

KVCacheManager

  • The KVCacheManager class manages the key-value cache for efficient memory utilization during generation.

  • It is used internally by the GenerationSession to allocate and manage memory blocks for storing the key-value pairs.

Session

  • The Session class represents a managed TensorRT runtime session.

  • It provides methods for creating a session from an existing TensorRT engine or a serialized engine.

  • The run method executes the TensorRT engine on a CUDA stream with the given input and output tensors (see the sketch after this list).
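
A minimal sketch of driving a raw engine through Session is shown below. The tensor names ("input", "output"), shapes, and dtypes are placeholders: they depend entirely on how the engine was built:

```python
import torch
from tensorrt_llm.runtime import Session

# Deserialize an engine file into a managed session.
with open("model.engine", "rb") as f:
    session = Session.from_serialized_engine(f.read())

# Pre-allocated input and output buffers, keyed by tensor name.
inputs = {"input": torch.ones(1, 16, dtype=torch.float16, device="cuda")}
outputs = {"output": torch.empty(1, 16, dtype=torch.float16, device="cuda")}

# run() enqueues the engine on the given CUDA stream and returns a success flag.
stream = torch.cuda.current_stream().cuda_stream
ok = session.run(inputs=inputs, outputs=outputs, stream=stream)
torch.cuda.synchronize()
```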

To use the TensorRT-LLM Runtime API, you typically start by creating a ModelRunner instance using the from_dir or from_engine methods, specifying the model directory or TensorRT engine file.

Then, you can call the generate method on the ModelRunner instance, providing the input IDs, sampling configuration, and any additional parameters.
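
Putting the pieces together, an end-to-end sketch might look like the following. The engine and tokenizer paths are placeholders, and the Hugging Face tokenizer is assumed to match the model the engine was built from:

```python
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner, SamplingConfig

tokenizer = AutoTokenizer.from_pretrained("./hf_model_dir")
runner = ModelRunner.from_dir(engine_dir="./engine_dir")

prompt = "Explain the KV cache in one sentence."
input_ids = [torch.tensor(tokenizer.encode(prompt), dtype=torch.int32)]

sampling_config = SamplingConfig(
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    max_new_tokens=64,
    temperature=0.7,
    top_p=0.9,
)

output_ids = runner.generate(batch_input_ids=input_ids, sampling_config=sampling_config)
print(tokenizer.decode(output_ids[0][0].tolist(), skip_special_tokens=True))
```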

The runtime API handles the underlying execution details, such as memory management, tensor allocation, and TensorRT engine execution. It abstracts away the complexities of TensorRT and provides a high-level interface for generating sequences efficiently.

It's important to note that the runtime API is designed to work with models that have already been optimized and compiled with TensorRT: the model must be converted and serialized into a TensorRT engine before it can be loaded by the runtime API.

Overall, the TensorRT-LLM Runtime API simplifies the process of deploying and executing large language models in production environments. It leverages the performance optimizations provided by TensorRT while offering a convenient and flexible interface for generating sequences and customizing the generation process.
