Beam Search

Beam search is a heuristic search algorithm used in sequence modeling, particularly relevant in language models like GPT for tasks such as text generation. It can be viewed as a constrained breadth-first search: at each level of the search tree it expands and examines only a fixed set of the most promising nodes and discards the rest.

Detailed Explanation of Beam Search

  1. Basic Concept: Beam search maintains a fixed number of best candidates (or "beams") at each step of the model's output sequence. These candidates are the most promising sequences according to the model's scoring function.

  2. Selection of Beams: At each step, it expands each candidate and keeps the top N expansions based on their likelihood, where N is the beam width. This process repeats for each step in the sequence generation.

  3. Beam Width: This is a crucial hyperparameter. A larger beam width increases the chance of finding a better output sequence but also increases the computational cost. Conversely, a smaller beam width reduces the computational burden but may miss better sequences.

  4. Sequence Scoring: Sequences are typically scored by their cumulative log-probability. Because longer sequences accumulate more log-probability terms, scores are usually length-normalized so that longer and shorter candidates can be compared fairly (see the sketch after this list).

  5. Termination: The process continues until a termination condition is met, which could be reaching a maximum sequence length or all candidates reaching an end-of-sequence token.
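
The following is a minimal, illustrative Python sketch of the procedure described above; it is not TensorRT-LLM code. The function next_token_logprobs is a hypothetical stand-in for the model, assumed to return (token_id, log_prob) pairs sorted by descending log-probability, and the length penalty uses the common cumulative-log-prob divided by length**alpha form.

```python
# Minimal beam search sketch (illustrative only, not TensorRT-LLM code).
def beam_search(next_token_logprobs, start_token, end_token,
                beam_width=4, max_len=32, length_alpha=0.7):
    def score(seq, logp):
        # Length-normalized cumulative log-probability.
        return logp / (len(seq) ** length_alpha)

    beams = [([start_token], 0.0)]          # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates, expanded_any = [], False
        for seq, logp in beams:
            if seq[-1] == end_token:        # finished beams keep competing
                candidates.append((seq, logp))
                continue
            expanded_any = True
            # Expand this beam with its top `beam_width` next tokens.
            for tok, tok_logp in next_token_logprobs(seq)[:beam_width]:
                candidates.append((seq + [tok], logp + tok_logp))
        # Keep only the `beam_width` best candidates across all expansions.
        candidates.sort(key=lambda c: score(*c), reverse=True)
        beams = candidates[:beam_width]
        if not expanded_any:                # every beam hit end_token
            break
    return max(beams, key=lambda c: score(*c))[0]
```

With beam_width=1 the loop degenerates to greedy decoding, which is the trade-off described in point 3 above: wider beams explore more alternatives at proportionally higher cost.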

Integration with GPT's Attention Operator

  1. Context Phase: In the initial (prompt-processing) phase, a single beam is computed per input sequence; the candidate beams have not yet diverged, so the prompt is processed once regardless of the beam width.

  2. Generation Phase: This is where beam search becomes particularly interesting. The attention mechanism (MHA/MQA/GQA) uses an additional tensor, called cache_indirection, to keep track of the paths in the beam search.

  3. Cache Indirection Tensor: The shape of this tensor is [batch_size, beam_width, max_seqlen]. Each element tells the attention operator from which beam (i.e. which path through the search tree) to take the key (K) and value (V) entries for a given past token. This mechanism lets the model manage multiple candidate sequences in parallel (a small illustrative gather is sketched after this list).
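
To make the indexing concrete, here is a small NumPy sketch of that gather, assuming a per-beam key cache. The real lookup happens inside TensorRT-LLM's fused attention kernel; the array names and shapes here are illustrative assumptions based on the description above.

```python
import numpy as np

batch_size, beam_width, max_seqlen, head_dim = 1, 3, 8, 4

# Hypothetical per-beam key cache: one K vector per (batch, beam, position).
k_cache = np.random.randn(batch_size, beam_width, max_seqlen, head_dim)

# cache_indirection[b, beam, t] = index of the beam whose cache slot holds
# the token at position t along this beam's path through the search tree.
cache_indirection = np.zeros((batch_size, beam_width, max_seqlen), dtype=np.int32)
cache_indirection[0, 1, 5] = 2   # e.g. beam 1 reuses the token beam 2 produced at step 5

def gather_keys(b, beam, cur_len):
    """Collect the K vectors along the path recorded for `beam` up to cur_len."""
    rows = [k_cache[b, cache_indirection[b, beam, t], t] for t in range(cur_len)]
    return np.stack(rows)        # shape [cur_len, head_dim]

keys_for_beam_1 = gather_keys(b=0, beam=1, cur_len=6)
```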

Relevance to Large Language Models (LLMs)

  1. Handling Ambiguity and Complexity: Beam search is crucial in LLMs for tasks like text generation, where there are often many possible valid continuations of a given text. It allows the model to explore multiple paths and find a more coherent and contextually appropriate output.

  2. Quality of Output: By considering multiple paths simultaneously, beam search often produces more accurate and natural results compared to greedy approaches, which only consider the single most likely next step.

  3. Use in Diverse Applications: This method is used in a variety of applications like machine translation, summarization, and conversational AI, where generating coherent and contextually relevant text is key.

In summary, beam search balances exploration of a variety of possible sequences against computational feasibility. In the context of LLMs like GPT, it is a fundamental technique for managing the complexities of natural language generation, ensuring that the output is not just probable but also contextually coherent and diverse.
