tensorrt_llm.functional.embedding
The tensorrt_llm.functional.embedding function in TensorRT-LLM performs an embedding lookup, a common operation in neural network models, particularly in natural language processing. It maps discrete objects, such as words in text, to vectors of real numbers. Let's break down how this function works and explain its parameters:
Function Purpose
Embedding Lookup: It performs the embedding lookup operation where the input tensor contains identifiers (like word indices), and the weight tensor is the embedding table where each row corresponds to an embedding vector.
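At its core, the lookup is simply row indexing into the table. Here is a minimal sketch of the semantics in NumPy; it is illustrative only, not the TensorRT-LLM implementation, which operates on its own Tensor type inside a network definition:

```python
import numpy as np

# Toy embedding table: vocab_size = 6, embedding_dim = 4.
weight = np.arange(24, dtype=np.float32).reshape(6, 4)

# A batch of token ids, shape [batch, seq_len].
input_ids = np.array([[1, 4], [0, 5]])

# The embedding lookup is plain row indexing; the output has
# shape [batch, seq_len, embedding_dim].
output = weight[input_ids]
print(output.shape)  # (2, 2, 4)
```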
Parameters
input (Tensor):
Contains the indices for which embeddings are to be fetched.
For instance, in a language model, this could be a tensor of word indices.
weight (Tensor):
The embedding table where each row represents an embedding vector.
Its size is typically [vocab_size, embedding_dim], where vocab_size is the total number of unique items (e.g., words) and embedding_dim is the dimensionality of the embeddings.
tp_size (int):
Indicates the number of GPUs used for distributed computing (tensor parallelism).
If greater than 1, it implies the embedding operation is distributed across multiple GPUs.
tp_group (Optional[List[int]]):
The group of ranks (GPUs) participating in the operation, relevant in the case of distributed computing.
sharding_dim (int):
Dictates how the embedding table is split among different GPUs.
sharding_dim = 0 means sharding by rows (the vocab dimension); sharding_dim = 1 means sharding by columns (the embedding dimension). See the shape sketch after this parameter list.
tp_rank (int):
The specific rank of the GPU in the tensor parallelism setup.
Used to calculate the offset in the embedding table.
workspace (Optional[Tensor]):
Used for memory allocation required during the operation, especially in the distributed context.
instance_id (int):
An identifier used for synchronization purposes in distributed setups.
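To make the sharding parameters concrete, here is a sketch of the shapes each choice of sharding_dim produces; the sizes are illustrative assumptions, not values from any particular model:

```python
# Assumed sizes for illustration.
vocab_size, embedding_dim, tp_size = 32000, 4096, 4

# sharding_dim = 0: split by rows (the vocab dimension).
# Each rank stores a [vocab_size // tp_size, embedding_dim] slice,
# and tp_rank determines the offset of the vocab range it owns.
rows_per_rank = vocab_size // tp_size        # 8000
for tp_rank in range(tp_size):
    vocab_start = tp_rank * rows_per_rank    # offset into the full table
    vocab_end = vocab_start + rows_per_rank  # rank owns ids [start, end)

# sharding_dim = 1: split by columns (the embedding dimension).
# Each rank stores a [vocab_size, embedding_dim // tp_size] slice.
cols_per_rank = embedding_dim // tp_size     # 1024
```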
How Parameters are Chosen
Choosing input and weight:
Based on your model's architecture and the specific task (like word embeddings in an NLP task).
Distributed Settings (tp_size, tp_group, tp_rank):
Decided based on the computational resources (number of GPUs) and how you want to distribute the computation. In a single-GPU setup, tp_size would be 1 (see the call sketch after this list).
sharding_dim:
Based on whether you want to shard the embedding table by rows or columns across multiple GPUs. This is typically a design choice depending on the model architecture and memory constraints.
workspace and instance_id:
These are more technical and are often determined by the system architecture and memory management requirements.
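Putting the choices together, here is a hedged sketch of how the call might look. Only the parameters documented above are used; the input tensors (input_ids, vocab_table, vocab_table_shard) and the rank variable are placeholders assumed to be created elsewhere in your network definition:

```python
from tensorrt_llm.functional import embedding

# Single-GPU setup: no tensor parallelism, so tp_size is 1.
hidden = embedding(input_ids, vocab_table, tp_size=1)

# 4-GPU tensor parallelism, sharding the table by rows (vocab dimension).
# tp_group lists the participating ranks; tp_rank identifies this GPU so
# the offset into the embedding table can be computed.
hidden = embedding(
    input_ids,
    vocab_table_shard,   # this rank's [vocab_size // 4, embedding_dim] slice
    tp_size=4,
    tp_group=[0, 1, 2, 3],
    sharding_dim=0,
    tp_rank=rank,        # e.g. 0..3, assumed to come from the runtime mapping
)
```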
Returns
Tensor: The output tensor after performing the embedding lookup.
Use Case
In a typical scenario, you would use this function to convert indices (like word indices) into their corresponding embedding vectors using a pre-trained or dynamically trained embedding table.
This is crucial in models where you need to convert categorical data into a form that can be processed by neural networks.
The distributed computing parameters come into play in large-scale models where the computation is spread across multiple GPUs.
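For intuition about what the row-sharded path computes, the standard vocab-parallel technique can be emulated in NumPy: each rank looks up only the ids that fall inside its vocab range, contributes zeros for the rest, and the per-rank partial results are summed, which is the role an all-reduce plays on real GPUs. This is a sketch of the general technique under those assumptions, not TensorRT-LLM's internal code:

```python
import numpy as np

vocab_size, embedding_dim, tp_size = 8, 4, 2
rng = np.random.default_rng(0)
full_table = rng.standard_normal((vocab_size, embedding_dim)).astype(np.float32)
input_ids = np.array([0, 3, 5, 7])

rows_per_rank = vocab_size // tp_size
partials = []
for tp_rank in range(tp_size):
    start = tp_rank * rows_per_rank
    shard = full_table[start:start + rows_per_rank]       # this rank's slice
    in_range = (input_ids >= start) & (input_ids < start + rows_per_rank)
    local_ids = np.where(in_range, input_ids - start, 0)  # rebase ids locally
    out = shard[local_ids]
    out[~in_range] = 0.0                                  # zero foreign ids
    partials.append(out)

# Summing the partials stands in for the cross-GPU all-reduce.
combined = np.sum(partials, axis=0)
assert np.allclose(combined, full_table[input_ids])
```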