Runtime

The TensorRT-LLM Runtime API provides a set of classes and functions for efficient execution and management of large language models (LLMs) using TensorRT.

It offers a high-level interface for loading models, performing inference, and generating sequences. Let's dive into the key components and how they should be used.

GenerationSession

  • The GenerationSession class is the core component of the runtime API. It encapsulates the TensorRT execution engine, handles memory allocation, and provides methods for sequence generation.

  • To use the GenerationSession, you need to create an instance by providing the model configuration, engine buffer, and mapping information.

  • The setup method is used to configure the session with parameters such as batch size, maximum context length, and beam width.

  • The decode method is the main entry point for sequence generation. It takes input IDs, context lengths, and a sampling configuration as input and generates output sequences (a sketch of this flow follows this list).

  • The GenerationSession also provides methods for handling specific generation scenarios, such as regular decoding (decode_regular) and streaming decoding (decode_stream).
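
A minimal sketch of driving GenerationSession directly is shown below. The engine file name, the ModelConfig fields, and the exact setup/decode signatures are assumptions that vary between TensorRT-LLM releases; in practice the higher-level ModelRunner described further down is usually the more convenient entry point.

```python
# Hedged sketch: low-level generation with GenerationSession.
# Engine path, ModelConfig fields and the setup()/decode() argument
# names are assumptions -- they differ across TensorRT-LLM releases.
import torch
import tensorrt_llm
from tensorrt_llm.runtime import GenerationSession, ModelConfig, SamplingConfig

# Architecture parameters the engine was built with (Llama-2-7B-like values).
model_config = ModelConfig(
    vocab_size=32000, num_layers=32, num_heads=32, num_kv_heads=32,
    hidden_size=4096, gpt_attention_plugin=True, dtype="float16",
)

with open("rank0.engine", "rb") as f:          # hypothetical serialized engine
    engine_buffer = f.read()

# Single-GPU mapping (no tensor or pipeline parallelism).
mapping = tensorrt_llm.Mapping(world_size=1, rank=0, tp_size=1, pp_size=1)

session = GenerationSession(model_config, engine_buffer, mapping)
session.setup(batch_size=1, max_context_length=128, max_new_tokens=64)

input_ids = torch.tensor([[1, 15043, 3186]], dtype=torch.int32, device="cuda")
context_lengths = torch.tensor([3], dtype=torch.int32, device="cuda")
sampling = SamplingConfig(end_id=2, pad_id=2, temperature=0.8, top_k=50)

output_ids = session.decode(input_ids, context_lengths, sampling)
```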

ModelConfig

  • The ModelConfig class stores the configuration parameters of the LLM, such as the maximum batch size, beam width, vocabulary size, number of layers, and attention heads.

  • It is used to initialize the GenerationSession and provides information about the model architecture and capabilities (see the sketch after this list).
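
The values that populate a ModelConfig normally come from the config.json written next to the engine at build time. Below is a hedged sketch of inspecting that file; the exact key layout (an older "builder_config" block versus the newer "pretrained_config"/"build_config" split) depends on the TensorRT-LLM version that built the engine.

```python
# Hedged sketch: inspecting the build-time configuration that a
# ModelConfig is populated from. Key names are assumptions and
# depend on the TensorRT-LLM version that built the engine.
import json
from pathlib import Path

cfg = json.loads(Path("./llama_engine/config.json").read_text())

# Older engines nest parameters under "builder_config"; newer ones
# split them across "pretrained_config" and "build_config".
build_cfg = cfg.get("builder_config") or cfg.get("build_config", {})
print("max_batch_size:", build_cfg.get("max_batch_size"))
print("max_beam_width:", build_cfg.get("max_beam_width"))
```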

ModelRunner

  • The ModelRunner class is a high-level interface that wraps the GenerationSession and provides a user-friendly API for generating sequences.

  • It can be created using the from_dir or from_engine class methods, which load the model from a directory or a TensorRT engine, respectively.

  • The generate method is the primary method for generating sequences. It takes a list of input IDs, a sampling configuration, and optional parameters such as prompt tables, LoRA weights, and stopping criteria (see the sketch after this list).

  • The ModelRunner also provides properties to access model information, such as the vocabulary size, hidden size, and number of layers.
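
A hedged sketch of this high-level path is shown below. The from_dir call and the generate keyword arguments follow the patterns used in the examples shipped with TensorRT-LLM, but treat the exact names as assumptions for your installed version; the engine directory is a placeholder.

```python
# Hedged sketch: high-level generation with ModelRunner.
# Engine directory and keyword-argument names are assumptions.
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="./llama_engine")   # pre-built engine dir

batch_input_ids = [torch.tensor([1, 15043, 3186], dtype=torch.int32)]
output_ids = runner.generate(
    batch_input_ids,
    max_new_tokens=64,
    end_id=2,
    pad_id=2,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
)

# Output is typically shaped [batch_size, num_beams, sequence_length];
# model information is exposed as properties on the runner.
print(output_ids.shape, runner.vocab_size, runner.hidden_size)
```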

SamplingConfig

  • The SamplingConfig class represents the configuration for controlling the generation process, such as the maximum number of new tokens, beam search parameters, and various sampling techniques (e.g., temperature, top-k, top-p).

  • It is used as an input to the generate method of the ModelRunner to customize the generation behavior (see the sketch after this list).
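
A short sketch of building a SamplingConfig explicitly follows. The field names mirror the parameters listed above (end/pad tokens, new-token budget, beam width, temperature, top-k, top-p) but are assumptions for your installed version; many releases also accept the same values directly as keyword arguments to generate.

```python
# Hedged sketch: an explicit SamplingConfig passed to generate().
# Field names are assumptions; many releases also accept the same
# values as keyword arguments to ModelRunner.generate().
from tensorrt_llm.runtime import SamplingConfig

sampling_config = SamplingConfig(
    end_id=2,                # stop-token id
    pad_id=2,                # padding-token id
    max_new_tokens=128,      # generation budget
    num_beams=1,             # 1 = greedy/sampling, >1 = beam search
    temperature=0.7,
    top_k=40,
    top_p=0.9,
)

# output_ids = runner.generate(batch_input_ids, sampling_config=sampling_config)
```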

StoppingCriteria and LogitsProcessor

  • The StoppingCriteria and LogitsProcessor classes provide extensibility points for custom stopping criteria and logits processing during generation.

  • You can create your own stopping criteria by subclassing StoppingCriteria and implementing the desired logic.

  • Similarly, you can create custom logits processors by subclassing LogitsProcessor to modify the generated logits before sampling (both are sketched after this list).
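
The sketch below shows what such subclasses might look like. The __call__ signatures (a step index, the current input IDs, and the logits tensor) are assumptions modelled on the runtime's extension points; check the base classes in your installed version before relying on them.

```python
# Hedged sketch: custom stopping criteria and logits processing.
# The __call__ signatures are assumptions; verify them against the
# StoppingCriteria / LogitsProcessor base classes in your version.
import torch
from tensorrt_llm.runtime import LogitsProcessor, StoppingCriteria


class StopOnTokenBudget(StoppingCriteria):
    """Stop once a fixed number of new tokens has been produced."""

    def __init__(self, budget: int):
        self.budget = budget
        self.generated = 0

    def __call__(self, step: int, input_ids: torch.Tensor,
                 scores: torch.Tensor) -> bool:
        self.generated += 1
        return self.generated >= self.budget


class TemperatureScaler(LogitsProcessor):
    """Rescale logits before sampling."""

    def __init__(self, temperature: float):
        self.temperature = temperature

    def __call__(self, step: int, input_ids: torch.Tensor,
                 scores: torch.Tensor) -> torch.Tensor:
        return scores / self.temperature
```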

KVCacheManager

  • The KVCacheManager class manages the key-value cache for efficient memory utilization during generation.

  • It is used internally by the GenerationSession to allocate and manage memory blocks for storing the key-value pairs (a rough sizing illustration follows this list).
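
To give a feel for what the cache manager has to budget, the sketch below estimates KV-cache memory from the model shape. The formula and the example numbers are an illustration, not the KVCacheManager API: one key and one value entry per layer, per KV head, per token.

```python
# Hedged sketch: back-of-the-envelope KV-cache sizing, to illustrate
# what the KVCacheManager has to budget for. This is an illustration,
# not the class's API.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bytes_per_elem=2):
    # 2x for keys and values; one entry per layer, KV head and token.
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * batch_size * bytes_per_elem)

# Example: a Llama-2-7B-like model, 4096-token context, batch of 8, FP16.
print(kv_cache_bytes(32, 32, 128, 4096, 8) / 1e9, "GB")   # ~17 GB
```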

Session

  • The Session class represents a managed TensorRT runtime session.

  • It provides methods for creating a session from an existing TensorRT engine or a serialized engine.

  • The run method is used to execute the TensorRT engine with the given inputs and outputs (see the sketch after this list).
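
A hedged sketch of this lower-level path is below. The from_serialized_engine and run calls follow the patterns in the library's own examples, while the tensor names ("input_ids", "logits") and shapes are placeholders that must match the bindings of the engine you actually load.

```python
# Hedged sketch: executing a raw TensorRT engine through Session.
# Tensor names and shapes are placeholders; they must match the
# input/output bindings of the engine being loaded.
import torch
from tensorrt_llm.runtime import Session

with open("model.engine", "rb") as f:                  # hypothetical engine file
    session = Session.from_serialized_engine(f.read())

inputs = {"input_ids": torch.ones(1, 16, dtype=torch.int32, device="cuda")}
outputs = {"logits": torch.empty(1, 16, 32000, dtype=torch.float16, device="cuda")}

stream = torch.cuda.current_stream().cuda_stream
ok = session.run(inputs, outputs, stream)              # enqueue execution
torch.cuda.synchronize()                               # wait for completion
```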

To use the TensorRT-LLM Runtime API, you typically start by creating a ModelRunner instance using the from_dir or from_engine methods, specifying the model directory or TensorRT engine file.

Then, you can call the generate method on the ModelRunner instance, providing the input IDs, sampling configuration, and any additional parameters.

The runtime API handles the underlying execution details, such as memory management, tensor allocation, and TensorRT engine execution. It abstracts away the complexities of TensorRT and provides a high-level interface for generating sequences efficiently.

It's important to note that the runtime API is designed to work with models that have been optimized and compiled using TensorRT. You need to ensure that the model is properly converted and serialized into a TensorRT engine before using it with the runtime API.
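
Putting the pieces together, a hedged end-to-end sketch of this workflow is shown below: tokenize a prompt, generate with ModelRunner, and detokenize the result. The Hugging Face tokenizer, model name and engine path are assumptions, and the examples/run.py script shipped with the repository covers the same flow more completely.

```python
# Hedged end-to-end sketch: tokenize, generate with ModelRunner,
# detokenize. Model name, engine directory and argument names are
# assumptions; examples/run.py in the repo covers this in full.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
runner = ModelRunner.from_dir(engine_dir="./llama_engine")   # pre-built engine

prompt = "Explain what a TensorRT engine is in one sentence."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(torch.int32)

output_ids = runner.generate(
    [input_ids[0]],
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)

# output_ids: [batch, beams, tokens]; decode the single beam of the single request.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```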

Overall, the TensorRT-LLM Runtime API simplifies the process of deploying and executing large language models in production environments. It leverages the performance optimizations provided by TensorRT while offering a convenient and flexible interface for generating sequences and customizing the generation process.
