Scaled dot-product attention

Attention Mechanism: Scaled dot-product attention

Scaled dot-product attention is a core part of the attention mechanism within the Transformer architecture. It is one of the main building blocks that allow the model to focus on different parts of the input sequence for different tasks. Here's a brief overview of how it fits into the overall process:

Scaled dot-product attention is a mechanism used in the multi-head self-attention layer of the Transformer model.

It is designed to capture the relationships between different elements in a sequence by computing attention scores that represent the importance of each element with respect to others. This mechanism is particularly useful for tasks that require understanding the context and dependencies between elements, such as language modeling or machine translation.

The scaled dot-product attention mechanism consists of the following steps:

Input

The attention mechanism takes three inputs: Query (Q), Key (K), and Value (V). These are derived from the input sequence through learned linear transformations. In self-attention, Q, K, and V all come from the same input sequence; in encoder-decoder (cross) attention, the Queries come from the decoder while the Keys and Values come from the encoder output.
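
As a rough sketch of this step (the shapes and random weights below are illustrative only, not taken from any particular model), the three matrices can be produced from an input matrix X with three learned projection matrices:

```python
import numpy as np

# Illustrative sizes: 4 tokens, model width 8, head width 8.
seq_len, d_model, d_k = 4, 8, 8

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))   # token embeddings for one sequence

# Projection matrices are learned during training; random here for the sketch.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # Queries, shape (seq_len, d_k)
K = X @ W_k   # Keys,    shape (seq_len, d_k)
V = X @ W_v   # Values,  shape (seq_len, d_k)
```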

Dot Product (Attention Score)

The raw attention scores are computed from the Query (Q) and Key (K) matrices. For each token in the sequence, the dot product between its Query vector and every Key vector is calculated, which amounts to the matrix product of Q with the transpose of K. This dot product measures the similarity or compatibility between the Query and the Keys: the higher the dot product, the more similar the Query is to a particular Key, indicating that the corresponding Value should be given more attention.

Scaling

The dot-product scores are scaled by dividing them by the square root of the Key dimension (d_k). Without this scaling, the dot products grow in magnitude as d_k increases, pushing the softmax into regions where it saturates and produces extremely small gradients during backpropagation.
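
Continuing the sketch above (and reusing the Q, K and d_k defined there), the dot-product and scaling steps amount to a single matrix product followed by a division:

```python
# Raw compatibility scores: one row per Query token, one column per Key token.
scores = Q @ K.T                       # shape (seq_len, seq_len)

# Scale by sqrt(d_k) to keep the magnitudes of the scores moderate.
scaled_scores = scores / np.sqrt(d_k)
```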

Softmax Normalization

Once the scaled dot products (attention scores) are calculated, they are passed through a softmax function. The softmax normalizes the attention scores, converting each row into a probability distribution that sums to 1. This ensures that the attention scores represent a relative weighting of the importance of each token in the sequence.

The resulting softmax output represents the attention weights, which indicate the importance of each Value in the sequence with respect to the Query.
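
A numerically stable row-wise softmax can be written as follows; this is a generic NumPy sketch (not TensorRT-LLM code) applied to the scaled_scores from the previous snippet:

```python
def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

attention_weights = softmax(scaled_scores)   # each row now sums to 1
```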

Weighted Sum of Values

Finally, the attention weights are used to weight the Value (V) matrix. Each Value vector is multiplied by its corresponding attention weight, and the results are summed to produce the output vector for the current token. This output vector is a context-aware representation of the input sequence, where each component of the Value matrix is weighted according to its relevance to the Query.
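
Putting the four steps together gives the standard formulation from the original Transformer paper ("Attention Is All You Need"):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$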

Now, let's simplify the explanation:

The Query matrix helps determine which components of the Value matrix to pay attention to by computing attention scores with the Key matrix.

These attention scores measure the similarity between the Query and the Keys.

The higher the similarity, the more attention the corresponding Value component should receive. The softmax function normalizes the attention scores, ensuring they represent a probability distribution. Finally, the attention weights are used to weight the Value matrix, resulting in a context-aware output.

In summary, scaled dot-product attention is a mechanism for computing the relevance of different elements in a sequence with respect to a given query. By calculating the dot product of the Query and Key vectors, scaling the result, applying the softmax function, and computing the weighted sum of the Value vectors, the mechanism effectively captures the contextual relationships and dependencies between elements in the input sequence.
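
As a final wrap-up, the whole mechanism fits in a few lines of NumPy. The function below is a minimal single-head reference sketch for illustration only; an inference stack such as TensorRT-LLM performs the same computation inside fused GPU attention kernels rather than in Python.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention (reference sketch).

    Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)
    mask: optional boolean array of shape (seq_len_q, seq_len_k),
          True where attention is NOT allowed (e.g. future tokens).
    Returns the attended output and the attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity, scaled

    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # block masked positions

    # Row-wise softmax turns the scores into probability distributions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    output = weights @ V                       # weighted sum of the Values
    return output, weights

# Toy usage with random inputs (illustrative shapes only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)          # (4, 8)
print(w.sum(axis=-1))     # each row of weights sums to 1
```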
