Layer Normalisation

Unlike batch normalization, layer normalization estimates the normalization statistics directly from the summed inputs to the neurons within a hidden layer, so the normalization does not introduce any new dependencies between training cases. It works well for RNNs, improving both the training time and the generalization performance of several existing RNN models, and more recently it has become a standard component of Transformer models.

In the architecture of the Transformer model, layer normalization is typically applied at multiple points to stabilize training and improve convergence. The Transformer architecture generally consists of an encoder and a decoder, each with multiple layers. Each layer usually consists of an attention mechanism and a position-wise feed-forward network.

Here is how layer normalization usually fits into this architecture:

  1. Post-Attention Layer Normalization: After the multi-head attention sub-layer, the output typically goes through a residual connection followed by layer normalization. This is done to stabilize the activations before passing them to the next sub-layer in the encoder or decoder.

  2. Post-Feed-Forward Layer Normalization: Similarly, the output of the position-wise feed-forward sub-layer passes through another residual connection and is then layer-normalized, for the same reasons of stabilization and convergence.

  3. Pre-Layer Normalization: Some variations of the Transformer architecture apply layer normalization before the attention and feed-forward sub-layers instead of after. This approach is known as "Pre-Layer Normalization", an alternative to the original "Post-Layer Normalization" design that is generally found to make training of deep Transformer stacks more stable.

So, in essence, layer normalization is strategically placed after (or before, in some variants) major sub-layers in both the encoder and decoder parts of the Transformer model to assist with stabilizing the training and facilitating faster convergence.
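
To make this placement concrete, here is a minimal PyTorch-style sketch of the same block wired in the post-layer-normalization and pre-layer-normalization styles. It is for illustration only, not TensorRT-LLM code; the attention and feed-forward sub-layers are stand-in placeholders.

```python
import torch
import torch.nn as nn

def make_ffn(d_model):
    # Stand-in for the position-wise feed-forward network
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class PostLNBlock(nn.Module):
    """Original Transformer style: residual add, then LayerNorm."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)   # placeholder for multi-head attention
        self.ffn = make_ffn(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))   # post-attention layer normalization
        x = self.norm2(x + self.ffn(x))    # post-feed-forward layer normalization
        return x

class PreLNBlock(nn.Module):
    """Pre-LN variant: each sub-layer sees a normalized input."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)   # placeholder for multi-head attention
        self.ffn = make_ffn(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # normalize before attention
        x = x + self.ffn(self.norm2(x))    # normalize before feed-forward
        return x

x = torch.randn(2, 16, 512)                                 # (batch, sequence, hidden)
print(PostLNBlock(512)(x).shape, PreLNBlock(512)(x).shape)  # both torch.Size([2, 16, 512])
```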

Normalization

Normalization is a technique used in deep learning to standardize the inputs within a particular layer so that they have a mean of 0 and a standard deviation of 1. This helps accelerate training by addressing the problem of internal covariate shift. Let's break down the key points for understanding RMSNorm as it is implemented in LLaMA:

What is Normalization?

  • Imagine the inputs to a neural network layer as a group of runners. If they start at different positions, they'll reach the finish line at different times. Normalization lines them up at the same starting point, ensuring they move in sync, which helps the network learn more efficiently.
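
As a concrete illustration of what "standardizing the inputs" means, the snippet below (a minimal sketch, not library code) normalizes a single activation vector over its hidden dimension so that it ends up with zero mean and unit standard deviation, which is the statistic layer normalization computes before applying its learned scale and shift.

```python
import torch

x = torch.tensor([2.0, 4.0, 6.0, 8.0])        # activations of one token across the hidden dimension
mean, var = x.mean(), x.var(unbiased=False)   # statistics computed over the layer, not the batch
x_norm = (x - mean) / torch.sqrt(var + 1e-5)  # standardize: mean ~0, std ~1

print(x_norm)                                 # tensor([-1.3416, -0.4472,  0.4472,  1.3416])
print(x_norm.mean().item(), x_norm.std(unbiased=False).item())  # ~0.0, ~1.0
```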

Layer Normalization

  • Commonly used in transformer models, layer normalization is applied after each sub-layer within the block. It's like realigning the runners at every checkpoint.

Root Mean Square Layer Normalization (RMSNorm):

  • RMSNorm is a simplified version of layer normalization, where the root mean square (RMS) is used to scale the inputs. Think of it as a more efficient way to get the runners in line by using a different calculation method.

Pre-Normalization in LLaMA:

  • LLaMA uses a pre-normalization variant, applying RMSNorm before each major sub-layer (attention and feed-forward) in the transformer block. If layer normalization is like aligning runners after each checkpoint, pre-normalization aligns them right before the next race starts. This ensures everything is in place before the computation begins for that layer.

Benefits of RMSNorm:

  • The primary advantages are training stability and generalization comparable to layer normalization, together with a reported 10-50% improvement in computational efficiency. It's akin to a refined alignment method that lets the runners reach the finish line more swiftly and consistently.

In mathematical terms, RMSNorm is typically formulated without subtracting the mean, focusing on scaling by the RMS of the inputs. This simplification can provide comparable performance to layer normalization but with less computational overhead.
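
A minimal sketch of that formulation is shown below (PyTorch-style, for illustration only; it follows the common RMSNorm definition with a learnable gain rather than the exact LLaMA or TensorRT-LLM source): the input is divided by the root mean square of its elements over the hidden dimension, and no mean is ever subtracted.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root mean square of the inputs; no mean subtraction."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # learnable gain
        self.eps = eps

    def forward(self, x):
        # Root mean square over the hidden dimension: sqrt(mean(x^2) + eps)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 4096)     # (batch, sequence, hidden)
print(RMSNorm(4096)(x).shape)    # torch.Size([2, 16, 4096])
```

Compared with the standardization snippet above, the only change is that the mean is never computed or subtracted, which is where the reduction in computational overhead comes from.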

The choice of normalization method, and where it is applied within the neural network, can have a significant effect on training dynamics and model performance. It's like picking the right tune-up for a sports car: the right choices lead to smoother rides and faster race times.
