Relative Attention Bias

Relative Attention Bias (RAB) is a feature of the attention mechanism used in natural language processing models, most notably Transformer architectures.

It's a method for incorporating information about the relative positions of tokens (words or other elements) in a sequence. Here's an in-depth look at how RAB functions and its significance:

Understanding RAB

In Transformer models, the attention mechanism is pivotal.

It computes weights, or 'attention scores', for each token in a sequence, determining how much focus to give to each token when processing any particular token. These scores are typically computed as Q*K^T (where Q and K are the query and key matrices, respectively), usually scaled by the square root of the head dimension.

Adding Positional Information: RAB modifies the standard attention mechanism by adding a bias term that accounts for the relative positions of tokens. In simpler terms, it adds a positional factor to the attention calculation (Q*K^T+bias), allowing the model to consider not just the tokens themselves but also their positions relative to each other.

Lightweight Positional Encoding: Unlike other positional encoding methods that can add substantial complexity, RAB is a lightweight way to include positional information. This makes it a popular choice in models such as T5 (Text-to-Text Transfer Transformer), which relies on learned relative-position biases rather than absolute positional embeddings.
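
To make the formula concrete, here is a minimal NumPy sketch (illustrative only, not TensorRT-LLM code) of attention with an additive relative bias, i.e. softmax(Q*K^T / sqrt(d) + bias) * V. The distance-based bias used in the toy example is an assumption for demonstration; real models learn these values.

```python
import numpy as np

def attention_with_rab(Q, K, V, rel_bias):
    """Scaled dot-product attention with an additive relative attention bias.

    Q, K, V:  (seq_len, d_head) query / key / value matrices
    rel_bias: (seq_len, seq_len) bias, where rel_bias[i, j] depends on (j - i)
    """
    d_head = Q.shape[-1]
    # Standard attention scores plus the positional bias term: Q*K^T / sqrt(d) + bias
    scores = Q @ K.T / np.sqrt(d_head) + rel_bias
    # Softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: 4 tokens, head size 8, bias that decays with token distance
seq_len, d_head = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_head)) for _ in range(3))
distance = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
out = attention_with_rab(Q, K, V, rel_bias=-0.1 * distance)
print(out.shape)  # (4, 8)
```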

Modes of RAB

  1. Regular Mode: In this mode, the relative attention bias is pre-computed before the Multi-Head Attention (MHA) step, and the model reads these pre-computed values during its attention calculations. This mode is straightforward, but it can be memory-intensive for long sequences because the stored bias matrix grows with the square of the sequence length.

  2. Implicit Mode: This mode is useful for long sequences, where storing the entire relative bias matrix becomes impractical. In implicit mode, the relative attention bias is computed on the fly during the MHA step. This dynamic computation is triggered by setting a parameter such as max_distance, which bounds how far apart two tokens can be before their bias stops changing (a rough sketch of this idea follows this list).
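
To illustrate the implicit idea, below is a rough NumPy sketch loosely following T5's bucketed relative-position scheme: nearby offsets get their own bucket, larger offsets share log-spaced buckets, and anything beyond max_distance is clipped, so the learned bias table stays small regardless of sequence length. The function and variable names are illustrative and do not correspond to the TensorRT-LLM API.

```python
import numpy as np

def relative_position_bucket(rel_pos, num_buckets=32, max_distance=128):
    """Map signed relative positions (j - i) to a small set of bucket ids."""
    num_buckets //= 2                               # half the buckets per direction
    bucket = np.where(rel_pos > 0, num_buckets, 0)  # distinguish left/right context
    rel_pos = np.abs(rel_pos)

    max_exact = num_buckets // 2                    # exact buckets for small offsets
    is_small = rel_pos < max_exact

    # Log-spaced buckets for offsets between max_exact and max_distance
    log_ratio = np.log(np.maximum(rel_pos, 1) / max_exact) / np.log(max_distance / max_exact)
    large = max_exact + (log_ratio * (num_buckets - max_exact)).astype(int)
    large = np.minimum(large, num_buckets - 1)      # clip everything past max_distance

    return bucket + np.where(is_small, rel_pos, large)

# The bias is looked up per (head, bucket) instead of stored as a full seq x seq matrix
seq_len, num_heads, num_buckets = 6, 2, 32
pos = np.arange(seq_len)
buckets = relative_position_bucket(pos[None, :] - pos[:, None], num_buckets=num_buckets)
bias_table = np.random.default_rng(0).standard_normal((num_heads, num_buckets))
rel_bias = bias_table[:, buckets]                   # shape: (num_heads, seq_len, seq_len)
print(rel_bias.shape)
```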

Significance of RAB

  1. Enhanced Contextual Understanding: By factoring in the relative positions of tokens, RAB allows models to better understand the context and structure of the input sequence. This is crucial in tasks where the meaning depends significantly on word order and relationships.

  2. Flexibility and Efficiency: The two modes of RAB offer flexibility. The regular mode provides pre-computed efficiency, while the implicit mode offers a more scalable solution for large sequences.

  3. Applicability in Various Models: While RAB is noted for its use in the T5 model, its utility extends to other Transformer-based models, especially those dealing with long sequences where traditional positional encoding methods might falter.

Conclusion

Relative Attention Bias in Transformer models, especially in the context of TensorRT-LLM, provides a nuanced and efficient way to incorporate positional information into the attention mechanism. It enhances the model's ability to process sequences using not just the individual token values but also their relative positions, leading to more accurate and context-aware outputs in language processing tasks.
