Runtime
In software development, and in machine learning in particular, a runtime is the environment in which a program or model executes. In TensorRT-LLM, the C++ GPT Runtime is the component designed to execute TensorRT engines built with the provided Python API.
Key Points about C++ GPT Runtime
Purpose and Compatibility
The C++ runtime in TensorRT-LLM is developed to execute TensorRT engines for running models like GPT (Generative Pretrained Transformer) and similar auto-regressive models (such as BLOOM, GPT-J, GPT-NeoX, or LLaMA).
It is not limited to GPT models alone but is applicable to a range of auto-regressive models.
Implementation
The runtime API is composed of classes declared in cpp/include/tensorrt_llm/runtime and implemented in cpp/tensorrt_llm/runtime. Example usage for a GPT-like model is provided in cpp/tests/runtime/gptSessionTest.cpp.
The Session Component
The core of the C++ runtime is the "session", in particular the GptSession class for GPT-like models. The session manages the execution of model inference within the runtime environment.
Creating a Session
To create a session, users specify the model details (via GptModelConfig) and the TensorRT engine (a pointer to the compiled engine and its size). The environment configuration is provided through WorldConfig (named after MPI terminology, MPI being a standard for parallel programming). Optionally, a logger can be included to capture informational, warning, and error messages.
How to Use C++ GPT Runtime
Model Configuration:
Define the model configuration (GptModelConfig) describing the model's structure, parameters, etc.
Load TensorRT Engine:
Load the pre-compiled TensorRT engine, which is the optimised model ready for execution.
Environment Setup:
Configure the execution environment (WorldConfig) to define how the model interacts with the hardware, such as GPU settings.
Instantiate Session:
Create a GptSession instance with the model config, environment config, engine pointer, engine size, and an optional logger. Use the session to run inference tasks with the model, feeding in input data and retrieving output predictions.
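Putting the steps together, session creation and inference might look like the following sketch. The exact class names and constructor signatures vary across TensorRT-LLM versions, so treat this as an illustration of the flow described above (model config, world config, engine pointer and size, optional logger) rather than the definitive API; engineBuffer is assumed to hold the serialized engine bytes.

```cpp
#include <tensorrt_llm/runtime/gptSession.h>
#include <vector>

using namespace tensorrt_llm::runtime;

// Sketch only: argument lists follow the description on this page
// and may differ in your TensorRT-LLM version.
void runInference(const std::vector<char>& engineBuffer)
{
    GptModelConfig modelConfig = /* vocab size, layer count, head count, ... */;
    WorldConfig worldConfig = /* e.g. WorldConfig::mpi(...); GPU/parallelism layout */;

    GptSession session(modelConfig, worldConfig,
                       engineBuffer.data(), engineBuffer.size(),
                       /*logger=*/nullptr);

    // Feed input token IDs and retrieve generated tokens; the concrete
    // input/output setup is shown in cpp/tests/runtime/gptSessionTest.cpp.
}
```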
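The "load the pre-compiled TensorRT engine" step above amounts to reading the serialized engine file into a contiguous byte buffer; the buffer's pointer and size are what the session creation step expects. This step needs only standard C++ (the file name used in the test is a placeholder), as in this sketch:

```cpp
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>
#include <vector>

// Read a serialized TensorRT engine file into a byte buffer.
// The returned vector's data() and size() are the "engine pointer"
// and "engine size" that session creation expects.
std::vector<char> loadEngine(const std::string& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
        throw std::runtime_error("Cannot open engine file: " + path);
    std::streamsize size = file.tellg();   // opened at end: tellg() is the file size
    file.seekg(0, std::ios::beg);          // rewind to the start before reading
    std::vector<char> buffer(static_cast<size_t>(size));
    if (!file.read(buffer.data(), size))
        throw std::runtime_error("Failed to read engine file: " + path);
    return buffer;
}
```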
Logging and Debugging
Use the logging capabilities to monitor the session's execution and troubleshoot any issues.
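The optional logger mentioned above receives messages at informational, warning, and error severities. The following self-contained sketch shows the general shape of such a severity-filtered logger; all names here are illustrative, not the TensorRT-LLM interface:

```cpp
#include <iostream>
#include <string>

// Illustrative severity-filtered logger, loosely modeled on the
// info/warning/error levels the runtime's logger captures.
// Lower enum values are more severe; a message is emitted when it is
// at least as severe as the configured threshold.
class SimpleLogger
{
public:
    enum class Level { kError = 0, kWarning = 1, kInfo = 2 };

    explicit SimpleLogger(Level threshold) : mThreshold(threshold) {}

    void log(Level level, const std::string& msg)
    {
        if (level <= mThreshold)   // severe enough to pass the filter
        {
            std::cerr << prefix(level) << msg << '\n';
            ++mEmitted;
        }
    }

    int emittedCount() const { return mEmitted; }

private:
    static const char* prefix(Level level)
    {
        switch (level)
        {
        case Level::kError:   return "[E] ";
        case Level::kWarning: return "[W] ";
        default:              return "[I] ";
        }
    }

    Level mThreshold;
    int mEmitted = 0;
};
```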
Practical Considerations
Flexibility: While the focus is on GPT-like models, the runtime is designed to be adaptable for other auto-regressive models.
Future Updates: The documentation hints at upcoming support for encoder-decoder models like T5, indicating an ongoing expansion of the runtime's capabilities.
Developer's Perspective: From a software engineering standpoint, using a C++ runtime is beneficial for performance-critical applications, especially when dealing with complex models like GPT on powerful hardware like GPUs.