Runtime

In software development, and in machine learning systems in particular, a runtime is the environment in which a program executes.

In TensorRT-LLM, the C++ GPT Runtime is the component that executes TensorRT engines built with the library's Python API.

Key Points about C++ GPT Runtime

Purpose and Compatibility

  • The C++ runtime in TensorRT-LLM executes TensorRT engines for GPT (Generative Pre-trained Transformer) and similar auto-regressive models, such as BLOOM, GPT-J, GPT-NeoX, and LLaMA.

  • It is not limited to GPT models alone but is applicable to a range of auto-regressive models.

Implementation

  • The runtime API is composed of classes declared in cpp/include/tensorrt_llm/runtime and implemented in cpp/tensorrt_llm/runtime.

  • Example usage for a GPT-like model is provided in cpp/tests/runtime/gptSessionTest.cpp.

The Session Component

  • The core of the C++ runtime is the "session", particularly the GptSession class for GPT-like models.

  • The session manages the execution of the model inference within the runtime environment.

Creating a Session

  • To create a session, users specify the model details (via GptModelConfig) and the TensorRT engine (pointer to the compiled engine and its size).

  • The environment configuration is provided through WorldConfig (the name follows MPI terminology; MPI is a standard for parallel, multi-GPU programming).

  • Optionally, a logger can be included to capture informational, warning, and error messages. A minimal sketch of this setup is shown below.
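
The sketch below illustrates that setup. The class names GptSession, GptModelConfig and WorldConfig come from this page; the GptSession::Config helper, the WorldConfig::mpi() factory and the exact constructor arguments are assumptions modelled on the headers under cpp/include/tensorrt_llm/runtime and may differ between releases.

```cpp
// Sketch only: constructor signatures and helper names (GptSession::Config,
// WorldConfig::mpi, GptModelConfig fields) are assumptions, not a verified API.
#include <fstream>
#include <string>
#include <vector>

#include "tensorrt_llm/runtime/gptModelConfig.h"
#include "tensorrt_llm/runtime/gptSession.h"
#include "tensorrt_llm/runtime/worldConfig.h"

namespace tlr = tensorrt_llm::runtime;

// Read the serialised TensorRT engine produced by the Python build step.
std::vector<char> readEngine(std::string const& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    auto const size = file.tellg();
    std::vector<char> buffer(static_cast<std::size_t>(size));
    file.seekg(0);
    file.read(buffer.data(), size);
    return buffer;
}

int main()
{
    // Model details: vocabulary size, layer/head counts, hidden size, dtype
    // (all values here are placeholders for a hypothetical model).
    tlr::GptModelConfig modelConfig{/*vocabSize=*/32000, /*nbLayers=*/32,
                                    /*nbHeads=*/32, /*hiddenSize=*/4096,
                                    nvinfer1::DataType::kHALF};

    // Environment configuration; the name follows MPI terminology, and the
    // tensor/pipeline-parallel ranks are derived from the MPI environment.
    auto const worldConfig = tlr::WorldConfig::mpi();

    // Session limits: batch size, beam width and sequence length.
    tlr::GptSession::Config sessionConfig{/*maxBatchSize=*/8, /*maxBeamWidth=*/1,
                                          /*maxSequenceLength=*/2048};

    // Pointer to the compiled engine and its size; a logger may be passed as
    // an optional extra argument.
    auto const engine = readEngine("gpt_float16_tp1_rank0.engine");
    tlr::GptSession session(sessionConfig, modelConfig, worldConfig,
                            engine.data(), engine.size());

    return 0;
}
```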

How to Use C++ GPT Runtime

Model Configuration:

  • Define the model configuration (GptModelConfig) describing the model's structure, parameters, etc.

Load TensorRT Engine:

  • Load the pre-compiled TensorRT engine, which is the optimised model ready for execution.

Environment Setup:

  • Configure the execution environment (WorldConfig) to define how the model interacts with the hardware, such as GPU settings.

Instantiate Session:

  • Create a GptSession instance with the model config, environment config, engine pointer, engine size, and an optional logger.

  • Use the session to run inference tasks with the model, feeding in input data and retrieving output predictions (see the inference sketch below).
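
A hedged sketch of this inference step follows. GenerationInput, GenerationOutput, SamplingConfig and the BufferManager helpers mirror the pattern used in cpp/tests/runtime/gptSessionTest.cpp, but the exact constructors, field names and overloads are assumptions and should be checked against the installed headers.

```cpp
// Sketch only: the GenerationInput/GenerationOutput constructors, SamplingConfig
// fields and BufferManager helpers are assumptions modelled on the runtime
// headers. Token ids and the end/pad ids are placeholders.
#include <cstdint>
#include <vector>

#include "tensorrt_llm/runtime/bufferManager.h"
#include "tensorrt_llm/runtime/generationInput.h"
#include "tensorrt_llm/runtime/generationOutput.h"
#include "tensorrt_llm/runtime/gptSession.h"
#include "tensorrt_llm/runtime/samplingConfig.h"

namespace tlr = tensorrt_llm::runtime;

void runInference(tlr::GptSession& session)
{
    // Decoding parameters for auto-regressive generation (beam width 1 here).
    tlr::SamplingConfig samplingConfig{/*beamWidth=*/1};
    samplingConfig.temperature = std::vector<float>{1.0f};
    samplingConfig.topK = std::vector<tlr::SizeType>{1};

    // Input token ids and their lengths are provided as device tensors,
    // allocated through the session's BufferManager.
    auto const& bufferManager = session.getBufferManager();
    std::vector<std::int32_t> const promptIds{1, 15043, 3186};  // placeholder prompt
    auto inputIds = bufferManager.copyFrom(
        promptIds,
        tlr::ITensor::makeShape({1, static_cast<tlr::SizeType>(promptIds.size())}),
        tlr::MemoryType::kGPU);
    auto inputLengths = bufferManager.copyFrom(
        std::vector<tlr::SizeType>{static_cast<tlr::SizeType>(promptIds.size())},
        tlr::ITensor::makeShape({1}), tlr::MemoryType::kGPU);

    // endId / padId are tokenizer specific; the values here are placeholders.
    tlr::GenerationInput generationInput{/*endId=*/2, /*padId=*/0, inputIds, inputLengths};
    tlr::GenerationOutput generationOutput{
        bufferManager.emptyTensor(tlr::MemoryType::kGPU, nvinfer1::DataType::kINT32),
        bufferManager.emptyTensor(tlr::MemoryType::kGPU, nvinfer1::DataType::kINT32)};

    // Feed in the input data and retrieve the output predictions.
    session.generate(generationOutput, generationInput, samplingConfig);
    // generationOutput.ids now holds the generated token ids.
}
```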

Logging and Debugging

  • Use the logging capabilities to monitor the session's execution and troubleshoot issues (a sketch of a custom logger follows below).
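
For illustration, a custom logger can be as small as a subclass of TensorRT's nvinfer1::ILogger that filters messages by severity. Assuming GptSession accepts such a logger as its optional last constructor argument (as described above), a minimal sketch might look like this:

```cpp
#include <iostream>
#include <memory>

#include <NvInferRuntime.h>

// Minimal custom logger; forwards warnings and errors, drops info/verbose noise.
class ConsoleLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, char const* msg) noexcept override
    {
        // kINTERNAL_ERROR, kERROR and kWARNING all compare <= kWARNING.
        if (severity <= Severity::kWARNING)
        {
            std::cerr << "[TRT-LLM] " << msg << std::endl;
        }
    }
};

// Hypothetical usage: keep the logger alive for the lifetime of the session and
// pass it when constructing the GptSession (the exact parameter type is an assumption).
auto logger = std::make_shared<ConsoleLogger>();
```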

Practical Considerations

Flexibility: While the focus is on GPT-like models, the runtime is designed to be adaptable for other auto-regressive models.

Future Updates: The documentation hints at upcoming support for encoder-decoder models like T5, indicating an ongoing expansion of the runtime's capabilities.

Developer's Perspective: From a software engineering standpoint, a C++ runtime is well suited to performance-critical applications, especially when serving large models such as GPT on GPUs.
