tensorrt_llm.functional.embedding

The tensorrt_llm.functional.embedding function in TensorRT-LLM performs an embedding lookup, a common operation in neural network models, particularly in natural language processing.

It maps discrete identifiers, such as the indices of words in a vocabulary, to vectors of real numbers. The sections below describe what the function does and what each parameter means.

Function Purpose

  • Embedding Lookup: The input tensor contains identifiers (such as word indices), and the weight tensor is the embedding table, in which each row is one embedding vector. For every identifier in the input, the function returns the corresponding row of the table.
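
The lookup described above is a row gather. The sketch below illustrates the idea with NumPy rather than TensorRT-LLM itself; the table values, shapes, and variable names are made up for illustration and are not taken from the library.

```python
import numpy as np

vocab_size, embedding_dim = 6, 4

# Hypothetical embedding table: row i is the vector for identifier i.
weight = np.arange(vocab_size * embedding_dim, dtype=np.float32).reshape(
    vocab_size, embedding_dim
)

# A batch of 2 sequences with 3 identifiers each.
input_ids = np.array([[0, 2, 5], [1, 1, 3]])

# The lookup selects one table row per identifier:
# output[b, t] == weight[input_ids[b, t]]
output = weight[input_ids]

print(output.shape)  # (2, 3, 4): the input shape plus the embedding dimension
```

The result has the shape of the input with one extra trailing axis of size embedding_dim.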

Parameters

input (Tensor):

  • Contains the indices for which embeddings are to be fetched.

  • For instance, in a language model, this could be a tensor of word indices.

weight (Tensor):

  • The embedding table where each row represents an embedding vector.

  • Size is typically [vocab_size, embedding_dim] where vocab_size is the total number of unique items (e.g., words) and embedding_dim is the dimensionality of the embeddings.

tp_size (int):

  • The number of GPUs that cooperate on the embedding lookup under tensor parallelism.

  • A value greater than 1 means the embedding table is split across that many GPUs.

tp_group (Optional[List[int]]):

  • The group of ranks (GPUs) participating in the operation. It is relevant only when tp_size is greater than 1.

sharding_dim (int):

  • Dictates how the embedding table is split among different GPUs.

  • sharding_dim = 0 means sharding by rows (vocab dimension).

  • sharding_dim = 1 means sharding by columns (embedding dimension).
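
To make the two sharding_dim options concrete, here is a NumPy sketch that splits a table both ways and then reassembles a column-sharded lookup. It simulates the ranks in a single process; np.split and np.concatenate stand in for the real distribution and gather steps, and the sizes are arbitrary assumptions chosen so the table divides evenly.

```python
import numpy as np

vocab_size, embedding_dim, tp_size = 8, 6, 2

rng = np.random.default_rng(0)
weight = rng.standard_normal((vocab_size, embedding_dim)).astype(np.float32)
input_ids = np.array([[1, 7, 3]])

# sharding_dim = 0: each rank holds a contiguous block of rows (vocabulary entries).
row_shards = np.split(weight, tp_size, axis=0)

# sharding_dim = 1: each rank holds every row, but only a slice of each vector.
col_shards = np.split(weight, tp_size, axis=1)

print(row_shards[0].shape)  # (4, 6)
print(col_shards[0].shape)  # (8, 3)

# With column sharding every rank can look up every identifier, but it only
# produces part of each vector. Concatenating the parts along the last axis
# (a stand-in for a gather across ranks) rebuilds the full result.
partials = [shard[input_ids] for shard in col_shards]
gathered = np.concatenate(partials, axis=-1)

assert np.array_equal(gathered, weight[input_ids])
```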

tp_rank (int):

  • The specific rank of the GPU in the tensor parallelism setup.

  • Used to calculate this rank's offset into the embedding table when the table is sharded along the vocabulary dimension.
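
The role of the rank offset can be sketched in NumPy as follows. This is an illustration of the general row-sharding technique, not the library's implementation: row_sharded_lookup is a hypothetical helper, the ranks are simulated in one process, the vocabulary is assumed to divide evenly by tp_size, and the final sum stands in for a reduction across ranks.

```python
import numpy as np

def row_sharded_lookup(input_ids, shard, tp_rank):
    """Partial lookup on one rank when the table is split by rows (hypothetical helper)."""
    rows_per_rank = shard.shape[0]
    offset = tp_rank * rows_per_rank           # first global id owned by this rank
    local_ids = input_ids - offset             # translate global ids to local row numbers
    in_range = (local_ids >= 0) & (local_ids < rows_per_rank)
    safe_ids = np.where(in_range, local_ids, 0)  # avoid out-of-bounds indexing
    # Ids owned by another rank contribute a zero vector here.
    return np.where(in_range[..., None], shard[safe_ids], 0.0)

vocab_size, embedding_dim, tp_size = 8, 4, 2
weight = np.arange(vocab_size * embedding_dim, dtype=np.float32).reshape(
    vocab_size, embedding_dim
)
shards = np.split(weight, tp_size, axis=0)     # rank r holds rows [4r, 4r + 4)

input_ids = np.array([0, 5, 3, 7])
partials = [row_sharded_lookup(input_ids, shards[r], r) for r in range(tp_size)]

# Exactly one rank owns each id, so summing the partial results
# reproduces the lookup against the full table.
combined = np.sum(partials, axis=0)
assert np.array_equal(combined, weight[input_ids])
```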

workspace (Optional[Tensor]):

  • Used for memory allocation required during the operation, especially in the distributed context.

instance_id (int):

  • An identifier used for synchronization purposes in distributed setups.

How Parameters are Chosen

  • Choosing input and weight: Based on your model's architecture and the specific task (like word embeddings in an NLP task).

  • Distributed Settings (tp_size, tp_group, tp_rank):

    • Decided based on the computational resources (number of GPUs) and how you want to distribute the computation.

    • In a single GPU setup, tp_size would be 1.

  • sharding_dim:

    • Based on whether you want to shard the embedding table by rows or columns across multiple GPUs. This is typically a design choice depending on the model architecture and memory constraints.

  • workspace and instance_id:

    • These are more technical and are often determined by the system architecture and memory management requirements.
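
When weighing the sharding choice against memory constraints, it can help to work out the per-GPU shard shape and size. The helpers below are hypothetical and not part of TensorRT-LLM; the vocabulary size, embedding width, and 2-byte element size are example assumptions, and the even-divisibility check is a simplification of this sketch.

```python
def shard_shape(vocab_size, embedding_dim, tp_size, sharding_dim):
    """Shape of one rank's slice of the embedding table (hypothetical helper)."""
    if sharding_dim not in (0, 1):
        raise ValueError("sharding_dim must be 0 (rows) or 1 (columns)")
    dims = [vocab_size, embedding_dim]
    if dims[sharding_dim] % tp_size != 0:
        raise ValueError("this sketch assumes the sharded dimension divides evenly")
    dims[sharding_dim] //= tp_size
    return tuple(dims)

def table_bytes(shape, bytes_per_element=2):
    """Memory for a table of the given shape; 2 bytes per element assumes 16-bit weights."""
    rows, cols = shape
    return rows * cols * bytes_per_element

vocab_size, embedding_dim, tp_size = 32000, 4096, 4

by_rows = shard_shape(vocab_size, embedding_dim, tp_size, sharding_dim=0)
by_cols = shard_shape(vocab_size, embedding_dim, tp_size, sharding_dim=1)

print(by_rows, table_bytes(by_rows))  # (8000, 4096) 65536000
print(by_cols, table_bytes(by_cols))  # (32000, 1024) 65536000
```

In this example both options cut per-GPU table memory by the same factor of tp_size, so the decision rests on other considerations, such as which dimension divides evenly and how the partial results are recombined.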

Returns

  • Tensor: The result of the embedding lookup, containing one embedding vector for each index in input.

Use Case

In a typical scenario, you would use this function to convert indices (like word indices) into their corresponding embedding vectors using a pre-trained or dynamically trained embedding table.

This is crucial in models where you need to convert categorical data into a form that can be processed by neural networks.

The distributed computing parameters come into play in large-scale models where the computation is spread across multiple GPUs.
