tensorrt_llm.functional.rms_norm

The tensorrt_llm.functional.rms_norm function in TensorRT-LLM applies Root Mean Square (RMS) normalization to a tensor. The operation is similar to layer normalization, but it skips the mean-centering step: the input is scaled only by the root mean square of its elements. RMS normalization is widely used in neural network architectures, including large language models, to help stabilize training and improve the learning process.

Function Purpose

  • RMS Normalization: Normalizes the input tensor by the root mean square of its elements along the specified dimensions. It is an alternative to standard layer normalization that re-scales the input without subtracting the mean, making it cheaper to compute while behaving similarly in practice.
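
For reference, the underlying computation for a feature vector x of length n, with optional scale γ (the weight parameter) and stability constant ε (the eps parameter), is the standard RMS normalization formula (the textbook formulation, not copied from the TensorRT-LLM source):

$$
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 + \epsilon}} \cdot \gamma_i
$$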

Parameters

  1. input (Tensor):

    • The input tensor to be normalized.

    • Typically, this would be the output of a neural network layer.

  2. normalized_shape (int or Tuple[int]):

    • The dimensions over which normalization is performed.

    • In language models, it usually corresponds to the hidden dimension or feature dimension of the tensor.

  3. weight (Tensor, optional):

    • The scale coefficient (often denoted as gamma) for normalization, applied element-wise to the normalized tensor.

    • It should have the same shape as normalized_shape. If omitted, the tensor is normalized without scaling.

  4. eps (float):

    • A small constant (epsilon) added for numerical stability to avoid division by zero when computing the root mean square.

    • Typically a very small value like 1e-6.

How to Use

  • Prepare Input Tensor: Make sure your input tensor is ready and in the correct format (shape and data type).

  • Determine Normalization Shape: Choose the normalized_shape based on the dimensions you want to normalize, typically the feature dimensions.

  • Optional Weight: Provide a weight tensor for scaling if you have specific scaling parameters. If not provided, the operation defaults to RMS normalization without scaling.

  • Set Epsilon: Choose an appropriate eps value (for example 1e-6) to avoid division-by-zero issues. A minimal usage sketch follows this list.
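
The sketch below puts these steps together inside a TensorRT-LLM network definition. It is illustrative only: the hidden size, tensor names, fixed input shape and fp16 data type are assumptions made for the example, not values taken from this page.

```python
import numpy as np
import tensorrt as trt
import tensorrt_llm
from tensorrt_llm.functional import constant, rms_norm

hidden_size = 4096  # assumed feature (hidden) dimension

builder = tensorrt_llm.Builder()
network = builder.create_network()

with tensorrt_llm.net_guard(network):
    # 1. Prepare the input tensor: [batch, seq_len, hidden_size] in fp16.
    hidden_states = tensorrt_llm.Tensor(
        name="hidden_states",
        dtype=trt.float16,
        shape=[2, 8, hidden_size],
    )

    # 2. Optional weight (gamma): one scale per feature in the normalized dimension.
    gamma = constant(np.ones(hidden_size, dtype=np.float16))

    # 3. Normalize over the hidden dimension with a small epsilon.
    output = rms_norm(
        hidden_states,
        normalized_shape=hidden_size,
        weight=gamma,
        eps=1e-6,
    )
    output.mark_output("rms_norm_output", trt.float16)
```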

Returns

  • Tensor: The function returns a tensor that has been RMS normalized. It maintains the same shape as the input tensor.

Example Use Case

In a neural network layer, particularly in the transformer blocks of large language models, RMS normalization is applied around sub-layers such as multi-head attention or the feed-forward block. Keeping the scale of activations under control in this way leads to more stable and efficient training. RMS normalization is attractive because it drops the mean subtraction of layer normalization, so it is cheaper to compute while still providing the re-scaling that stabilizes the network. A schematic pre-norm residual block using rms_norm is sketched below.
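
As a rough illustration of this pattern, the fragment below builds a small pre-norm residual block. The single projection followed by a SiLU activation stands in for a real attention or feed-forward sub-layer, and all shapes and names are assumptions for the example.

```python
import numpy as np
import tensorrt as trt
import tensorrt_llm
from tensorrt_llm.functional import constant, matmul, rms_norm, silu

hidden_size = 64  # small assumed size for illustration

builder = tensorrt_llm.Builder()
network = builder.create_network()

with tensorrt_llm.net_guard(network):
    x = tensorrt_llm.Tensor(name="x", dtype=trt.float32,
                            shape=[2, 8, hidden_size])

    # Pre-norm: normalize the residual stream before the sub-layer.
    gamma = constant(np.ones(hidden_size, dtype=np.float32))
    normed = rms_norm(x, normalized_shape=hidden_size, weight=gamma, eps=1e-6)

    # Stand-in for an attention / feed-forward sub-layer: a single
    # projection followed by a SiLU activation (purely illustrative).
    w = constant(np.ones((hidden_size, hidden_size), dtype=np.float32) * 0.01)
    sublayer_out = silu(matmul(normed, w))

    # Residual connection back onto the un-normalized stream.
    y = x + sublayer_out
    y.mark_output("block_output", trt.float32)
```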
