Runtime

A runtime in the context of software development, especially in relation to neural networks and machine learning models, refers to the environment in which a program or code executes.

Specifically, a C++ GPT Runtime in TensorRT-LLM is a component designed to execute TensorRT engines that are built using the provided Python API.

Key Points about C++ GPT Runtime

Purpose and Compatibility

  • The C++ runtime in TensorRT-LLM is developed to execute TensorRT engines for running models like GPT (Generative Pre-trained Transformer) and similar auto-regressive models (such as BLOOM, GPT-J, GPT-NeoX, or LLaMA).

  • It is not limited to GPT models alone but is applicable to a range of auto-regressive models.

Implementation

  • The runtime API is composed of classes declared in cpp/include/tensorrt_llm/runtime and implemented in cpp/tensorrt_llm/runtime.

  • Example usage for a GPT-like model is provided in cpp/tests/runtime/gptSessionTest.cpp.

The Session Component

  • The core of the C++ runtime is the "session", particularly the GptSession class for GPT-like models.

  • The session manages the execution of the model inference within the runtime environment.

Creating a Session

  • To create a session, users specify the model details (via GptModelConfig) and the TensorRT engine (pointer to the compiled engine and its size).

  • The environment configuration is provided through WorldConfig, whose terminology mirrors MPI (Message Passing Interface), the de facto standard for distributed parallel programming.

  • Optionally, a logger can be included to capture informational, warning, and error messages.

How to Use C++ GPT Runtime

Model Configuration:

  • Define the model configuration (GptModelConfig) describing the model's structure, parameters, etc.

Load TensorRT Engine:

  • Load the pre-compiled TensorRT engine, which is the optimised model ready for execution.
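Loading the serialized engine is plain file I/O: read the whole file into a contiguous byte buffer, then hand the session the buffer's pointer and size. A minimal sketch (the file path is a placeholder):

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Read a serialized TensorRT engine file into a byte buffer.
// The buffer's data() and size() are what the session needs.
std::vector<char> loadEngine(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open engine file: " + path);
    }
    return std::vector<char>(std::istreambuf_iterator<char>(file),
                             std::istreambuf_iterator<char>());
}
```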

Environment Setup:

  • Configure the execution environment (WorldConfig) to define how the model interacts with the hardware, such as GPU settings.
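WorldConfig follows MPI conventions: each process has a rank within a world of a given size, and ranks on the same node are assigned distinct GPUs. One common mapping (illustrative, not necessarily the library's exact scheme) takes the rank modulo the number of GPUs per node:

```cpp
#include <cassert>

// Map an MPI-style rank to a local GPU index, assuming ranks are
// assigned to nodes in contiguous blocks of gpusPerNode.
int deviceOf(int rank, int gpusPerNode) {
    return rank % gpusPerNode;
}
```

With 4 GPUs per node, ranks 0-3 land on the first node's GPUs 0-3 and rank 5 lands on GPU 1 of the second node.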

Instantiate Session:

  • Create a GptSession instance with the model config, environment config, engine pointer, engine size, and an optional logger.

  • Use the session to run inference tasks with the model, feeding in input data and retrieving output predictions.
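The four steps above can be sketched end to end. The real classes are declared in cpp/include/tensorrt_llm/runtime; the types below are simplified stand-ins, and every field and method name here is an illustrative assumption, not the library's actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for GptModelConfig: a few typical structural fields.
struct GptModelConfig {
    int vocabSize;
    int numLayers;
    int numHeads;
    int hiddenSize;
};

// Stand-in for WorldConfig: MPI-style rank and world size.
struct WorldConfig {
    int rank = 0;
    int worldSize = 1;
};

// Stand-in session: holds the configs plus a pointer/size to the
// serialized engine, and exposes a toy generate() as a placeholder
// for real auto-regressive engine execution.
class GptSession {
public:
    GptSession(GptModelConfig model, WorldConfig world,
               const void* engine, std::size_t engineSize)
        : model_(model), world_(world),
          engine_(static_cast<const char*>(engine)),
          engineSize_(engineSize) {}

    std::vector<int> generate(const std::vector<int>& inputIds,
                              int maxNewTokens) {
        // Placeholder: a real session would run the TensorRT engine
        // token by token; here we just append dummy token ids.
        std::vector<int> out = inputIds;
        for (int i = 0; i < maxNewTokens; ++i) out.push_back(0);
        return out;
    }

private:
    GptModelConfig model_;
    WorldConfig world_;
    const char* engine_;
    std::size_t engineSize_;
};

std::vector<int> runDemo() {
    GptModelConfig model{50257, 12, 12, 768};  // 1. model configuration
    std::vector<char> engine(64, 0);           // 2. engine bytes (placeholder)
    WorldConfig world;                         // 3. environment setup
    GptSession session(model, world,
                       engine.data(), engine.size());  // 4. instantiate
    return session.generate({101, 102}, 3);    // run inference
}
```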

Logging and Debugging

  • Use the logging capabilities to monitor the session's execution and troubleshoot any issues.
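The optional logger can be as simple as a severity-filtered callback. A minimal sketch (the severity levels mirror the informational, warning, and error messages mentioned above; the interface itself is an illustrative stand-in, not the library's):

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Severity levels matching the message kinds the runtime can emit.
enum class Severity { kInfo, kWarning, kError };

// A minimal logger that drops messages below a configured threshold.
class Logger {
public:
    explicit Logger(Severity minLevel) : minLevel_(minLevel) {}

    // Returns true if the message was emitted, false if filtered out.
    bool log(Severity level, const std::string& msg) {
        if (level < minLevel_) return false;
        std::cerr << prefix(level) << msg << '\n';
        return true;
    }

private:
    static const char* prefix(Severity level) {
        switch (level) {
            case Severity::kInfo:    return "[INFO] ";
            case Severity::kWarning: return "[WARN] ";
            default:                 return "[ERROR] ";
        }
    }
    Severity minLevel_;
};
```

Raising the threshold to kWarning during normal operation and lowering it to kInfo while troubleshooting is a typical pattern.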

Practical Considerations

Flexibility: While the focus is on GPT-like models, the runtime is designed to be adaptable for other auto-regressive models.

Future Updates: The documentation hints at upcoming support for encoder-decoder models like T5, indicating an ongoing expansion of the runtime's capabilities.

Developer's Perspective: From a software engineering standpoint, using a C++ runtime is beneficial for performance-critical applications, especially when dealing with complex models like GPT on powerful hardware like GPUs.
