ALiBi

ALiBi, which stands for "Attention with Linear Biases," is an approach used in the attention mechanisms of language models like GPT. It is applied within the GPT attention operator in the TensorRT-LLM framework. Here's a breakdown of what ALiBi does and how it functions:

Function of ALiBi

Applied to Attention Calculations: In the context of GPT models, the attention mechanism is a crucial component. It calculates the relevance or weight of each token in a sequence relative to others. Typically, this involves computing the dot product of query (Q) and key (K) matrices (Q*K^T).

Introduces Biases Based on Position: ALiBi modifies this process by introducing a linear bias based on the relative positions of tokens. This bias is added to the result of the Q*K^T product. The idea is to provide the model with an understanding of how far apart different tokens are in the sequence.

Bias Computation: The bias in ALiBi is computed on-the-fly during the attention calculation. This means that the bias is dynamically calculated based on the positions of tokens at each step of the model's processing, rather than being a static, pre-computed value.
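The mechanics are easy to see in a few lines of NumPy. The sketch below is illustrative only: it is not the TensorRT-LLM kernel, and the helper names alibi_slopes and alibi_attention_scores are made up for this example. The per-head slopes follow the geometric sequence from the ALiBi paper (shown for power-of-two head counts), and the bias is simply the slope times the relative distance, added to the Q*K^T logits before the softmax.

```python
import math

import numpy as np


def alibi_slopes(num_heads: int) -> np.ndarray:
    # Per-head slopes from the ALiBi paper: a geometric sequence 2^(-8i/n)
    # for i = 1..n, shown here for power-of-two head counts.
    start = 2.0 ** (-8.0 / num_heads)
    return np.array([start ** (i + 1) for i in range(num_heads)], dtype=np.float32)


def alibi_attention_scores(q: np.ndarray, k: np.ndarray, slopes: np.ndarray) -> np.ndarray:
    # q, k: [num_heads, seq_len, head_dim]; returns biased attention logits
    # of shape [num_heads, seq_len, seq_len].
    num_heads, seq_len, head_dim = q.shape

    # Standard Q*K^T term, scaled by sqrt(head_dim).
    scores = q @ k.transpose(0, 2, 1) / math.sqrt(head_dim)

    # Relative distance (key position - query position). For causal attention
    # the relevant entries (past keys) are <= 0, so the bias penalises keys
    # that are further from the current query.
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]      # [seq_len, seq_len]
    bias = slopes[:, None, None] * distance[None, :, :]     # [num_heads, seq_len, seq_len]

    # The bias is added to the logits before softmax; no positional embedding
    # is added to the token inputs themselves.
    return scores + bias


if __name__ == "__main__":
    heads, seq, dim = 8, 5, 16
    rng = np.random.default_rng(0)
    q = rng.standard_normal((heads, seq, dim)).astype(np.float32)
    k = rng.standard_normal((heads, seq, dim)).astype(np.float32)
    print(alibi_attention_scores(q, k, alibi_slopes(heads)).shape)  # (8, 5, 5)
```

Because the bias depends only on token positions and a per-head constant, it can be regenerated inside the attention kernel at each step, which is what "computed on-the-fly" means in practice.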

Advantages of ALiBi

Enhanced Positional Awareness: By incorporating biases based on token positions, ALiBi allows the model to better understand and incorporate the order and relative distances of words or tokens in a sequence. This is particularly important for language tasks where the meaning is highly dependent on word order and proximity.

Efficiency: Since the biases are calculated on-the-fly within the optimized kernel, ALiBi can be more computationally efficient than methods that rely on separate positional encoding layers or more complex positional encoding schemes.

Simplicity and Effectiveness: ALiBi offers a simpler yet effective alternative to more complex positional encoding mechanisms, maintaining the model's performance while reducing computational overhead.

Application in TensorRT-LLM

In TensorRT-LLM's implementation, ALiBi is an integrated feature of the GPT attention operator.

It gives the model an intrinsic sense of the linear distance between tokens in the input sequence, helping it capture context and generate more coherent, contextually relevant text.
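As a rough illustration of how this surfaces when preparing ALiBi-based models (the exact keys and values below are assumptions for illustration, not a verbatim TensorRT-LLM reference), the positional scheme is recorded in the converted checkpoint's model configuration rather than stored as embedding weights:

```python
# Hypothetical excerpt of a converted-checkpoint configuration for an
# ALiBi-based model family such as BLOOM; treat exact keys/values as assumptions.
alibi_model_config = {
    "architecture": "BloomForCausalLM",
    "dtype": "float16",
    "num_hidden_layers": 30,
    "num_attention_heads": 32,
    "hidden_size": 4096,
    # Selecting ALiBi means the GPT attention operator adds the linear bias
    # on-the-fly; no positional embedding table is stored in the checkpoint.
    "position_embedding_type": "alibi",
}
```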

In summary, ALiBi in TensorRT-LLM introduces linear positional biases into the attention mechanism of GPT-style models, improving their contextual understanding and efficiency in language processing tasks.
