tensorrt_llm.functional.layer_norm
The tensorrt_llm.functional.layer_norm function in TensorRT-LLM applies layer normalization to a tensor, a common operation in neural networks, particularly in large language models (LLMs). Layer normalization is used to stabilize the learning process and improve convergence. Here's a breakdown of how to use this function and what each parameter means:
Function Purpose
Layer Normalization: Normalizes the input tensor along a specified axis or axes by subtracting the mean and dividing by the standard deviation of the elements along those axes.
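For intuition, the NumPy sketch below shows the computation being described. The names layer_norm_reference, gamma, and beta are purely illustrative; the actual TensorRT-LLM op runs as a fused kernel, not as this Python code.

```python
import numpy as np

def layer_norm_reference(x, gamma=None, beta=None, eps=1e-5):
    # Normalize over the last axis (the feature dimension).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mean) / np.sqrt(var + eps)
    # Optional element-wise scale (gamma) and shift (beta).
    if gamma is not None:
        y = y * gamma
    if beta is not None:
        y = y + beta
    return y

x = np.random.randn(2, 8).astype(np.float32)  # (batch, features)
out = layer_norm_reference(x)
print(out.mean(axis=-1))  # ~0 per row
print(out.std(axis=-1))   # ~1 per row
```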
Parameters
input (Tensor):
The input tensor that you want to normalize.
In neural networks, this is often the output of a linear transformation or activation function.
normalized_shape (int or Tuple[int]):
The shape of the sub-tensor to be normalized, typically the feature dimension in LLMs.
If the input tensor is 2D, normalized_shape is usually the second dimension of the tensor.
weight (Tensor, optional):
The scale coefficient (gamma) for the normalization, applied element-wise to the normalized tensor.
It should have the same shape as normalized_shape.
bias (Tensor, optional):
The shift coefficient (beta) for the normalization, applied element-wise to the normalized tensor.
It should have the same shape as normalized_shape.
eps (float):
A small constant (epsilon) added to the variance to avoid division by zero.
Commonly set to a small value like 1e-5.
use_diff_of_squares (bool):
When set to True, the function uses a difference-of-squares method to compute the variance (Var = Mean(X^2) - Mean(X)^2). This can be more numerically stable in some cases.
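The identity behind use_diff_of_squares can be checked with a small NumPy snippet. This only illustrates the formula itself, not how the TensorRT kernel actually computes it.

```python
import numpy as np

x = np.random.randn(1024)

# Two-pass variance: subtract the mean first, then average the squared deviations.
var_two_pass = np.mean((x - x.mean()) ** 2)

# Difference-of-squares variance: Var(X) = Mean(X^2) - Mean(X)^2,
# which can be computed in a single pass over the data.
var_diff_of_squares = np.mean(x ** 2) - np.mean(x) ** 2

print(var_two_pass, var_diff_of_squares)  # mathematically equal; agree closely here
```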
How to Use
Prepare Your Input Tensor: Ensure your input tensor is in the correct shape and data type.
Determine Normalization Shape: Set normalized_shape to match the dimensions of the tensor you want to normalize (usually the feature dimension).
Optional Weight and Bias: If you have specific scaling and shifting parameters (gamma and beta), provide them as weight and bias. If not, they can be omitted, and the operation will default to standard layer normalization without scaling and shifting.
Set Epsilon: Choose an appropriate eps value; the default is typically sufficient.
Use Difference of Squares: Decide whether to use the difference-of-squares method based on your model's numerical stability requirements. The sketch below puts these steps together.
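Putting the steps together, a minimal sketch of calling layer_norm while defining a TensorRT-LLM network could look like the following. It assumes the usual graph-definition flow (Builder, net_guard, Tensor, functional.constant, str_dtype_to_trt); exact names, shapes, and the output-marking call may differ between TensorRT-LLM versions, so treat this as illustrative rather than canonical.

```python
import numpy as np
import tensorrt_llm
from tensorrt_llm import Tensor
from tensorrt_llm import functional as F

hidden_size = 64  # hypothetical feature dimension for this sketch

builder = tensorrt_llm.Builder()
network = builder.create_network()

with tensorrt_llm.net_guard(network):
    # Step 1: the input tensor to normalize, shaped (batch, sequence, hidden).
    x = Tensor(name='x',
               dtype=tensorrt_llm.str_dtype_to_trt('float32'),
               shape=(2, 16, hidden_size))

    # Step 3 (optional): gamma/beta constants shaped like normalized_shape.
    gamma = F.constant(np.ones(hidden_size, dtype=np.float32))
    beta = F.constant(np.zeros(hidden_size, dtype=np.float32))

    # Steps 2, 4, 5: normalize over the feature dimension.
    y = F.layer_norm(x,
                     normalized_shape=hidden_size,
                     weight=gamma,
                     bias=beta,
                     eps=1e-5,
                     use_diff_of_squares=True)

    # Mark the result as a network output (exact output-marking API varies by version).
    y.mark_output('output')
```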
Returns
Tensor: The function returns a normalized tensor with the same shape as the input tensor.
Example Use Case
In a transformer model, after each sub-block (like a multi-head attention or a feed-forward network), you often apply layer normalization to the output of these sub-blocks. This ensures that the values across different features have a mean of zero and a standard deviation of one, which helps stabilize training and improve convergence.
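As a sketch of that pattern, the snippet below uses plain NumPy with placeholder sub-block outputs, only to show where the normalization sits in a post-norm transformer block; it is not TensorRT-LLM code.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

hidden_size = 64
hidden = np.random.randn(4, hidden_size).astype(np.float32)

# Placeholder for a sub-block's output (multi-head attention or feed-forward).
sublayer_out = np.random.randn(4, hidden_size).astype(np.float32)

gamma = np.ones(hidden_size, dtype=np.float32)
beta = np.zeros(hidden_size, dtype=np.float32)

# Post-norm pattern: residual add, then layer normalization of the sum.
hidden = layer_norm(hidden + sublayer_out, gamma, beta)
```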