Model

This module contains a collection of pre-defined model architectures and utilities for building and customizing large language models (LLMs).

Key Model Classes

PretrainedModel

A base class for all pre-trained models in TensorRT-LLM. It provides common functionality such as loading weights, saving checkpoints, and preparing inputs for the model.

DecoderModel

A base class for decoder-only models used in tasks like language modeling and generation. It inherits from PretrainedModel and adds specific methods for causal language modeling.

EncoderModel

A base class for encoder-only models used in tasks like sequence classification and question answering. It inherits from PretrainedModel and provides methods for encoding input sequences.

Model-specific classes

TensorRT-LLM provides several pre-defined model architectures, each with its own class. Some notable examples include:

GPTModel and GPTForCausalLM: Implement the GPT (Generative Pre-trained Transformer) architecture for language modelling and generation.

LLaMAModel and LLaMAForCausalLM: Implement the LLaMA (Large Language Model Meta AI) architecture, an efficient decoder-only LLM family from Meta.

BertModel, BertForSequenceClassification, and BertForQuestionAnswering: Implement the BERT (Bidirectional Encoder Representations from Transformers) architecture for sequence classification and question answering tasks.

PretrainedConfig

A configuration class that stores hyperparameters and settings for a pre-trained model. It can be loaded from a JSON file or a dictionary.
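
A minimal sketch of both loading paths, assuming PretrainedConfig is importable from tensorrt_llm.models (it lives in tensorrt_llm.models.modeling_utils in recent releases) and using illustrative field names from a typical Llama config.json; real configurations carry additional required fields:

```python
from tensorrt_llm.models import PretrainedConfig

# Illustrative subset of a TensorRT-LLM config.json; actual checkpoints
# include further fields (position embedding type, intermediate size, ...).
config_dict = {
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "hidden_size": 4096,
    "vocab_size": 32000,
}
config = PretrainedConfig.from_dict(config_dict)

# Or read the config.json written alongside converted checkpoint weights
# (path is hypothetical).
config = PretrainedConfig.from_json_file("./tllm_checkpoint/config.json")
```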

Key Features and Functionalities

Model initialization: The from_config and from_checkpoint class methods allow initializing a model from a PretrainedConfig object or a checkpoint directory, respectively. This makes it easy to load pre-trained weights and configure the model architecture.
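
A short sketch of the two entry points, using LLaMAForCausalLM as an example; the checkpoint path is hypothetical and would come from a convert_checkpoint.py run as described earlier:

```python
from tensorrt_llm.models import LLaMAForCausalLM, PretrainedConfig

# Build the model architecture from a configuration object; weights are
# loaded or bound in a separate step.
config = PretrainedConfig.from_json_file("./tllm_checkpoint/config.json")
model = LLaMAForCausalLM.from_config(config)

# Or construct the model and load its weights in one call from a
# TensorRT-LLM checkpoint directory (config.json plus rank-*.safetensors).
model = LLaMAForCausalLM.from_checkpoint("./tllm_checkpoint")
```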

Quantization: The quantize class method enables quantizing a pre-trained model for a reduced memory footprint and faster inference. It supports various quantization configurations through the QuantConfig class.
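
A hedged sketch of the quantize flow; the exact positional and keyword arguments (especially around calibration) differ between releases, so treat the call signature below as an assumption:

```python
from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

# Request weight-only INT4 AWQ; QuantAlgo also covers FP8, INT8
# SmoothQuant and other schemes.
quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

# Quantize a Hugging Face checkpoint and write a quantized TensorRT-LLM
# checkpoint that trtllm-build can consume (paths are hypothetical).
LLaMAForCausalLM.quantize(
    "./llama-2-7b-hf",             # source Hugging Face model directory
    "./tllm_checkpoint_int4_awq",  # output checkpoint directory
    dtype="float16",
    quant_config=quant_config,
)
```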

Dynamic input shapes: The prepare_inputs method in PretrainedModel and its subclasses allows specifying the maximum input sizes for dynamic shape inference. This enables efficient memory allocation and optimization when using TensorRT.
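
In normal use prepare_inputs is called for you during engine building, but the sketch below shows the kind of limits it takes; the keyword names are assumptions and vary across versions:

```python
# Hedged sketch: argument names and defaults differ between releases.
inputs = model.prepare_inputs(
    max_batch_size=8,
    max_input_len=1024,
    max_seq_len=2048,     # prompt plus generated tokens
    max_num_tokens=8192,
    use_cache=True,
    max_beam_width=1,
)
# `inputs` describes the network's input tensors (token ids, KV cache,
# sequence lengths, ...) with the dynamic ranges TensorRT uses to build
# its optimisation profiles.
```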

Multi-GPU support: TensorRT-LLM models can be distributed across multiple GPUs using the Mapping class, which specifies the parallel strategy for tensor sharding and model parallelism.
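
A small sketch of two-way tensor parallelism, assuming from_checkpoint accepts a rank argument as in recent releases; in practice the snippet runs once per rank (for example under mpirun) with the rank supplied by the launcher, and the two-rank checkpoint directory is hypothetical:

```python
from tensorrt_llm import Mapping
from tensorrt_llm.models import LLaMAForCausalLM

# Two GPUs, tensor parallelism only: world_size = tp_size * pp_size.
mapping = Mapping(world_size=2, rank=0, tp_size=2, pp_size=1)

# Load only this rank's shard of the converted checkpoint.
model = LLaMAForCausalLM.from_checkpoint(
    "./tllm_checkpoint_2gpu",
    rank=mapping.rank,
)
```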

LoRA (Low-Rank Adaptation): Some models, like LLaMAForCausalLM, support LoRA for efficient fine-tuning and adaptation. The use_lora method allows loading LoRA weights from a LoraBuildConfig object.
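
A heavily hedged sketch of attaching LoRA weights; the import path and field names for LoraBuildConfig have moved between releases, so everything below apart from the use_lora call named above is an assumption:

```python
# Hedged sketch: class location and field names vary across versions
# (newer releases use LoraConfig from tensorrt_llm.lora_manager).
from tensorrt_llm.lora_manager import LoraBuildConfig

lora_config = LoraBuildConfig(
    lora_dir=["./llama-2-7b-lora-adapter"],        # hypothetical adapter dir
    lora_target_modules=["attn_q", "attn_k", "attn_v"],
    max_lora_rank=8,
)

# Adds the LoRA plugin inputs to the network so the built engine can
# apply adapter weights supplied at runtime.
model.use_lora(lora_config)
```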

Customisation: The modular design of TensorRT-LLM allows users to easily customise and extend the provided model architectures. Users can subclass the base model classes and override methods to incorporate new features or modify the model behaviour.

Inference and generation: The PretrainedModel class inherits from the GenerationMixin, which provides methods for text generation and inference, such as generate and greedy_search. These methods leverage the optimized kernels and plugins provided by TensorRT for efficient inference.

The TensorRT-LLM models module offers a wide range of pre-trained models and a flexible API for building and customizing LLMs.

It integrates closely with the underlying TensorRT engine to leverage optimizations like kernel fusion, mixed precision, and dynamic shape inference.

By using the provided model classes and configuration options, users can easily load pre-trained weights, quantize models, distribute across multiple GPUs, and perform efficient inference and generation.

The modular design allows for extensibility and customization, enabling users to adapt the models to their specific use cases and requirements.

Overall, the TensorRT-LLM models module provides a powerful and user-friendly interface for working with state-of-the-art LLMs while leveraging the performance optimizations offered by TensorRT.
