tensorrt_llm.functional.embedding

The tensorrt_llm.functional.embedding function in TensorRT-LLM performs an embedding lookup, a common operation in neural network models, particularly in natural language processing.

This function maps discrete objects, such as words in text, to vectors of real numbers. Let's break down how it works and explain its parameters:

Function Purpose

  • Embedding Lookup: It performs the embedding lookup operation: the input tensor contains identifiers (such as word indices), and the weight tensor is the embedding table, in which each row corresponds to an embedding vector.
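
To make the lookup concrete, here is a minimal NumPy sketch (with made-up sizes and variable names, purely for illustration): each index in the input selects one row of the weight table.

    import numpy as np

    # Toy embedding table: vocab_size = 5, embedding_dim = 3 (illustrative values only)
    weight = np.random.rand(5, 3).astype(np.float32)

    # Input of token indices to look up
    input_ids = np.array([2, 0, 4])

    # The lookup selects the rows of the table indexed by the input ids
    embeddings = weight[input_ids]   # shape (3, 3): one row per input index
    print(embeddings)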

Parameters

input (Tensor):

  • Contains the indices for which embeddings are to be fetched.

  • For instance, in a language model, this could be a tensor of word indices.

weight (Tensor):

  • The embedding table where each row represents an embedding vector.

  • Size is typically [vocab_size, embedding_dim] where vocab_size is the total number of unique items (e.g., words) and embedding_dim is the dimensionality of the embeddings.

tp_size (int):

  • Indicates the number of GPUs used for distributed computing (tensor parallelism).

  • If greater than 1, it implies the embedding operation is distributed across multiple GPUs.

tp_group (Optional[List[int]]):

  • The group of ranks (GPUs) participating in the operation, relevant in the case of distributed computing.

sharding_dim (int):

  • Dictates how the embedding table is split among different GPUs.

  • sharding_dim = 0 means sharding by rows (vocab dimension).

  • sharding_dim = 1 means sharding by columns (embedding dimension).

tp_rank (int):

  • The specific rank of the GPU in the tensor parallelism setup.

  • Used to calculate the offset in the embedding table.

workspace (Optional[Tensor]):

  • Used for memory allocation required during the operation, especially in the distributed context.

instance_id (int):

  • An identifier used for synchronization purposes in distributed setups.
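
Putting these parameters together, here is a minimal sketch of how a call might look when the table is split by rows across two GPUs. This is an illustration under stated assumptions, not a complete program: it assumes the call happens inside a TensorRT-LLM network definition, and input_ids and embedding_weight are hypothetical tensorrt_llm Tensors that already exist at that point. Only the parameters documented above are shown; workspace and instance_id are omitted here.

    from tensorrt_llm.functional import embedding

    # Sketch only: assumes we are inside a TensorRT-LLM network definition and
    # that input_ids and embedding_weight are existing tensorrt_llm Tensors.
    hidden_states = embedding(
        input_ids,            # indices to look up (e.g. token ids)
        embedding_weight,     # embedding table, shape [vocab_size, embedding_dim]
        tp_size=2,            # table is distributed across 2 GPUs
        tp_group=[0, 1],      # ranks (GPUs) participating in the operation
        sharding_dim=0,       # shard by rows (vocab dimension)
        tp_rank=0,            # this GPU's rank, used to compute the row offset
    )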

How Parameters are Chosen

  • Choosing input and weight: Based on your model's architecture and the specific task (like word embeddings in an NLP task).

  • Distributed Settings (tp_size, tp_group, tp_rank):

    • Decided based on the computational resources (number of GPUs) and how you want to distribute the computation.

    • In a single GPU setup, tp_size would be 1.

  • sharding_dim:

    • Based on whether you want to shard the embedding table by rows or by columns across multiple GPUs. This is typically a design choice driven by the model architecture and memory constraints (see the sketch after this list).

  • workspace and instance_id:

    • These are lower-level settings, typically dictated by the system architecture and the memory-management and synchronization requirements of the distributed setup rather than by the model itself.
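
To make the sharding_dim choice concrete, the small NumPy sketch below (illustrative shapes and variable names only, not TensorRT-LLM code) shows which slice of the embedding table a single GPU would hold under row sharding versus column sharding, and how tp_rank sets the offset into the table.

    import numpy as np

    vocab_size, embedding_dim = 8, 4
    tp_size, tp_rank = 2, 1                      # 2-way tensor parallelism, this GPU is rank 1
    full_table = np.arange(vocab_size * embedding_dim).reshape(vocab_size, embedding_dim)

    # sharding_dim = 0: shard by rows (vocab dimension); tp_rank sets the row offset
    rows_per_rank = vocab_size // tp_size
    row_shard = full_table[tp_rank * rows_per_rank:(tp_rank + 1) * rows_per_rank, :]

    # sharding_dim = 1: shard by columns (embedding dimension); tp_rank sets the column offset
    cols_per_rank = embedding_dim // tp_size
    col_shard = full_table[:, tp_rank * cols_per_rank:(tp_rank + 1) * cols_per_rank]

    print(row_shard.shape)   # (4, 4): half the vocabulary, full embedding width
    print(col_shard.shape)   # (8, 2): full vocabulary, half the embedding width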

Returns

  • Tensor: The output tensor after performing the embedding lookup.

Use Case

In a typical scenario, you would use this function to convert indices (like word indices) into their corresponding embedding vectors using a pre-trained or dynamically trained embedding table.

This is crucial in models where you need to convert categorical data into a form that can be processed by neural networks.

The distributed computing parameters come into play in large-scale models where the computation is spread across multiple GPUs.
