tensorrt_llm.functional.layer_norm

The tensorrt_llm.functional.layer_norm function in TensorRT-LLM applies layer normalization to a tensor, a common operation in neural networks, particularly in large language models (LLMs). Layer normalization is used to stabilize the learning process and improve convergence. Here's a breakdown of how to use this function and what each parameter means:

Function Purpose

  • Layer Normalization: Applies normalization over a specified axis or axes of the input tensor, subtracting the mean and dividing by the standard deviation computed over those axes, then optionally applying a learned scale (weight) and shift (bias).
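
In standard layer-norm notation, with scale γ (the weight parameter) and shift β (the bias parameter), the computation over the normalized axes is:

$$
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta
$$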

Parameters

  1. input (Tensor):

    • The input tensor that you want to normalize.

    • In neural networks, this is often the output of a linear transformation or activation function.

  2. normalized_shape (int or Tuple[int]):

    • The shape of the sub-tensor to be normalized, typically the feature dimension in LLMs.

    • If the input tensor is 2D, normalized_shape is usually the size of its second (feature) dimension.

  3. weight (Tensor, optional):

    • The scale coefficient (gamma) for the normalization, applied element-wise to the normalized tensor.

    • It should have the same shape as normalized_shape.

  4. bias (Tensor, optional):

    • The shift coefficient (beta) for the normalization, applied element-wise to the normalized tensor.

    • It should have the same shape as normalized_shape.

  5. eps (float):

    • A small constant (epsilon) added to the variance to avoid division by zero.

    • Commonly set to a small value like 1e-5.

  6. use_diff_of_squares (bool):

    • When set to True, the function uses a difference of squares method to compute the variance (Var = Mean(X^2) - Mean(X)^2).

    • This formulation can be computed in a single pass and is often faster, but it can be less numerically robust than the direct two-pass variance computation for some inputs. The call sketch below shows how all of these parameters fit together.
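
Taken together, the parameters map onto a call along these lines. This is a sketch only: x, gamma and beta are hypothetical placeholder Tensors, and the values shown for eps and use_diff_of_squares are illustrative choices, not confirmed library defaults.

```python
from tensorrt_llm.functional import layer_norm

# x: input Tensor of shape [batch, seq_len, hidden_size] (hypothetical placeholder)
# gamma, beta: optional Tensors shaped like normalized_shape (hypothetical placeholders)
y = layer_norm(x, normalized_shape=hidden_size,
               weight=gamma, bias=beta,
               eps=1e-5, use_diff_of_squares=True)
# y has the same shape as x
```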

How to Use

  • Prepare Your Input Tensor: Ensure your input tensor is in the correct shape and data type.

  • Determine Normalization Shape: Set normalized_shape to match the dimensions of the tensor you want to normalize (usually the feature dimension).

  • Optional Weight and Bias: If you have specific scaling and shifting parameters (gamma and beta), provide them as weight and bias. If not, they can be omitted, and the operation will default to standard layer normalization without scaling and shifting.

  • Set Epsilon: Choose an appropriate eps value; the default is typically sufficient.

  • Use Difference of Squares: Decide whether to use the difference-of-squares method based on your model's speed and numerical-stability trade-offs; the network-definition sketch below shows these choices put together.
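
Putting these steps together, below is a minimal sketch of wiring layer_norm into a TensorRT-LLM network definition. It assumes the Builder / net_guard / Tensor pattern used by the TensorRT-LLM Python API; exact imports and argument handling may vary between versions.

```python
import numpy as np
import tensorrt as trt

import tensorrt_llm
from tensorrt_llm import Builder
from tensorrt_llm.functional import Tensor, constant, layer_norm

hidden_size = 768

builder = Builder()
network = builder.create_network()

with tensorrt_llm.net_guard(network):
    # Input tensor in [batch, sequence_length, hidden_size] layout
    x = Tensor(name='x', dtype=trt.float32, shape=[2, 128, hidden_size])

    # Optional scale (gamma) and shift (beta), shaped like normalized_shape
    gamma = constant(np.ones(hidden_size, dtype=np.float32))
    beta = constant(np.zeros(hidden_size, dtype=np.float32))

    # Normalize over the last (feature) dimension
    out = layer_norm(
        input=x,
        normalized_shape=hidden_size,
        weight=gamma,
        bias=beta,
        eps=1e-5,
        use_diff_of_squares=True,
    )
    out.mark_output('output', 'float32')
```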

Returns

  • Tensor: The function returns a normalized tensor with the same shape as the input tensor.

Example Use Case

In a transformer model, layer normalization is typically applied after each sub-block (such as multi-head attention or the feed-forward network), usually together with the residual connection around that sub-block. This keeps the values across the feature dimension at roughly zero mean and unit standard deviation, which helps stabilize training and improve convergence.
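
A sketch of that "add & norm" pattern is shown below. The residual and attn_output tensors are hypothetical placeholders for a sub-block's skip connection and output, and the + operator is assumed to map to the elementwise addition provided by tensorrt_llm.functional.

```python
from tensorrt_llm.functional import layer_norm

# residual: sub-block input carried through the skip connection (hypothetical placeholder)
# attn_output: multi-head attention output with the same shape (hypothetical placeholder)
hidden_states = residual + attn_output      # "Add": residual connection
hidden_states = layer_norm(                 # "Norm": normalize the feature dimension
    hidden_states,
    normalized_shape=hidden_size,
    weight=ln_gamma,                        # learned scale (hypothetical placeholder)
    bias=ln_beta,                           # learned shift (hypothetical placeholder)
    eps=1e-5,
)
```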
