Runtime

In software development, and in machine learning systems in particular, a runtime is the environment in which a program executes.

In TensorRT-LLM, the C++ GPT Runtime is the component that executes TensorRT engines built with the library's Python API.

Key Points about C++ GPT Runtime

Purpose and Compatibility

  • The C++ runtime in TensorRT-LLM executes TensorRT engines for GPT (Generative Pre-trained Transformer) and similar auto-regressive models, such as BLOOM, GPT-J, GPT-NeoX, and LLaMA.

  • It is not limited to GPT models alone but is applicable to a range of auto-regressive models.

Implementation

  • The runtime API is composed of classes declared in cpp/include/tensorrt_llm/runtime and implemented in cpp/tensorrt_llm/runtime.

  • Example usage for a GPT-like model is provided in cpp/tests/runtime/gptSessionTest.cpp.

The Session Component

  • The core of the C++ runtime is the "session", particularly the GptSession class for GPT-like models.

  • The session manages the execution of the model inference within the runtime environment.

Creating a Session

  • To create a session, users specify the model details (via GptModelConfig) and the TensorRT engine (pointer to the compiled engine and its size).

  • The environment configuration is provided through WorldConfig (the name follows MPI terminology; MPI is a standard for parallel, multi-GPU programming).

  • Optionally, a logger can be included to capture informational, warning, and error messages. A minimal sketch of this setup is shown below.
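
The sketch below illustrates that setup. The class names GptSession, GptModelConfig and WorldConfig come from this page; the GptSession::Config helper, the WorldConfig::mpi() factory and the exact constructor arguments are assumptions modelled on the headers under cpp/include/tensorrt_llm/runtime and may differ between releases.

```cpp
// Sketch only: constructor signatures and helper names (GptSession::Config,
// WorldConfig::mpi, GptModelConfig fields) are assumptions, not a verified API.
#include <fstream>
#include <string>
#include <vector>

#include "tensorrt_llm/runtime/gptModelConfig.h"
#include "tensorrt_llm/runtime/gptSession.h"
#include "tensorrt_llm/runtime/worldConfig.h"

namespace tlr = tensorrt_llm::runtime;

// Read the serialised TensorRT engine produced by the Python build step.
std::vector<char> readEngine(std::string const& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    auto const size = file.tellg();
    std::vector<char> buffer(static_cast<std::size_t>(size));
    file.seekg(0);
    file.read(buffer.data(), size);
    return buffer;
}

int main()
{
    // Model details: vocabulary size, layer/head counts, hidden size, dtype
    // (all values here are placeholders for a hypothetical model).
    tlr::GptModelConfig modelConfig{/*vocabSize=*/32000, /*nbLayers=*/32,
                                    /*nbHeads=*/32, /*hiddenSize=*/4096,
                                    nvinfer1::DataType::kHALF};

    // Environment configuration; the name follows MPI terminology, and the
    // tensor/pipeline-parallel ranks are derived from the MPI environment.
    auto const worldConfig = tlr::WorldConfig::mpi();

    // Session limits: batch size, beam width and sequence length.
    tlr::GptSession::Config sessionConfig{/*maxBatchSize=*/8, /*maxBeamWidth=*/1,
                                          /*maxSequenceLength=*/2048};

    // Pointer to the compiled engine and its size; a logger may be passed as
    // an optional extra argument.
    auto const engine = readEngine("gpt_float16_tp1_rank0.engine");
    tlr::GptSession session(sessionConfig, modelConfig, worldConfig,
                            engine.data(), engine.size());

    return 0;
}
```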

How to Use C++ GPT Runtime

Model Configuration:

  • Define the model configuration (GptModelConfig) describing the model's structure, parameters, etc.

Load TensorRT Engine:

  • Load the pre-compiled TensorRT engine, which is the optimised model ready for execution.

Environment Setup:

  • Configure the execution environment (WorldConfig) to define how the model interacts with the hardware, such as GPU settings.

Instantiate Session:

  • Create a GptSession instance with the model config, environment config, engine pointer, engine size, and an optional logger.

  • Use the session to run inference tasks with the model, feeding in input data and retrieving output predictions (see the inference sketch below).
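
A hedged sketch of this inference step follows. GenerationInput, GenerationOutput, SamplingConfig and the BufferManager helpers mirror the pattern used in cpp/tests/runtime/gptSessionTest.cpp, but the exact constructors, field names and overloads are assumptions and should be checked against the installed headers.

```cpp
// Sketch only: the GenerationInput/GenerationOutput constructors, SamplingConfig
// fields and BufferManager helpers are assumptions modelled on the runtime
// headers. Token ids and the end/pad ids are placeholders.
#include <cstdint>
#include <vector>

#include "tensorrt_llm/runtime/bufferManager.h"
#include "tensorrt_llm/runtime/generationInput.h"
#include "tensorrt_llm/runtime/generationOutput.h"
#include "tensorrt_llm/runtime/gptSession.h"
#include "tensorrt_llm/runtime/samplingConfig.h"

namespace tlr = tensorrt_llm::runtime;

void runInference(tlr::GptSession& session)
{
    // Decoding parameters for auto-regressive generation (beam width 1 here).
    tlr::SamplingConfig samplingConfig{/*beamWidth=*/1};
    samplingConfig.temperature = std::vector<float>{1.0f};
    samplingConfig.topK = std::vector<tlr::SizeType>{1};

    // Input token ids and their lengths are provided as device tensors,
    // allocated through the session's BufferManager.
    auto const& bufferManager = session.getBufferManager();
    std::vector<std::int32_t> const promptIds{1, 15043, 3186};  // placeholder prompt
    auto inputIds = bufferManager.copyFrom(
        promptIds,
        tlr::ITensor::makeShape({1, static_cast<tlr::SizeType>(promptIds.size())}),
        tlr::MemoryType::kGPU);
    auto inputLengths = bufferManager.copyFrom(
        std::vector<tlr::SizeType>{static_cast<tlr::SizeType>(promptIds.size())},
        tlr::ITensor::makeShape({1}), tlr::MemoryType::kGPU);

    // endId / padId are tokenizer specific; the values here are placeholders.
    tlr::GenerationInput generationInput{/*endId=*/2, /*padId=*/0, inputIds, inputLengths};
    tlr::GenerationOutput generationOutput{
        bufferManager.emptyTensor(tlr::MemoryType::kGPU, nvinfer1::DataType::kINT32),
        bufferManager.emptyTensor(tlr::MemoryType::kGPU, nvinfer1::DataType::kINT32)};

    // Feed in the input data and retrieve the output predictions.
    session.generate(generationOutput, generationInput, samplingConfig);
    // generationOutput.ids now holds the generated token ids.
}
```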

Logging and Debugging

  • Use the logging capabilities to monitor the session's execution and troubleshoot issues (a sketch of a custom logger follows below).
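
For illustration, a custom logger can be as small as a subclass of TensorRT's nvinfer1::ILogger that filters messages by severity. Assuming GptSession accepts such a logger as its optional last constructor argument (as described above), a minimal sketch might look like this:

```cpp
#include <iostream>
#include <memory>

#include <NvInferRuntime.h>

// Minimal custom logger; forwards warnings and errors, drops info/verbose noise.
class ConsoleLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, char const* msg) noexcept override
    {
        // kINTERNAL_ERROR, kERROR and kWARNING all compare <= kWARNING.
        if (severity <= Severity::kWARNING)
        {
            std::cerr << "[TRT-LLM] " << msg << std::endl;
        }
    }
};

// Hypothetical usage: keep the logger alive for the lifetime of the session and
// pass it when constructing the GptSession (the exact parameter type is an assumption).
auto logger = std::make_shared<ConsoleLogger>();
```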

Practical Considerations

Flexibility: While the focus is on GPT-like models, the runtime is designed to be adaptable for other auto-regressive models.

Future Updates: The documentation hints at upcoming support for encoder-decoder models like T5, indicating an ongoing expansion of the runtime's capabilities.

Developer's Perspective: From a software engineering standpoint, a C++ runtime is well suited to performance-critical applications, especially when serving large models such as GPT on GPUs.
