tensorrt_llm.functional.layer_norm

The tensorrt_llm.functional.layer_norm function in TensorRT-LLM applies layer normalization to a tensor, a common operation in neural networks, particularly in large language models (LLMs). Layer normalization is used to stabilize the learning process and improve convergence. Here's a breakdown of how to use this function and what each parameter means:

Function Purpose

  • Layer Normalization: Normalizes the input tensor over the axes covered by normalized_shape. For each group of elements being normalized, it subtracts their mean and divides by their standard deviation, computed as the square root of the variance plus eps.
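
Written out, the operation described above is the standard layer normalization formula. The symbols are chosen here for illustration and are not taken from the TensorRT-LLM documentation: H is the number of elements being normalized, and gamma, beta, and epsilon correspond to the weight, bias, and eps parameters.

```latex
\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad
\sigma^2 = \frac{1}{H}\sum_{i=1}^{H} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
```

When weight and bias are omitted, this reduces to the plain normalized value, as if gamma were 1 and beta were 0.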

Parameters

  1. input (Tensor):

    • The input tensor that you want to normalize.

    • In neural networks, this is often the output of a linear transformation or activation function.

  2. normalized_shape (int or Tuple[int]):

    • The shape of the sub-tensor to be normalized, typically the feature dimension in LLMs.

    • If the input tensor is 2D, normalized_shape is usually the size of its second (last) dimension, for example the hidden size.

  3. weight (Tensor, optional):

    • The scale coefficient (gamma) for the normalization, applied element-wise to the normalized tensor.

    • Its shape should equal normalized_shape.

  4. bias (Tensor, optional):

    • The shift coefficient (beta) for the normalization, applied element-wise to the normalized tensor.

    • Its shape should equal normalized_shape.

  5. eps (float):

    • A small constant (epsilon) added to the variance to avoid division by zero.

    • Commonly set to a small value like 1e-5.

  6. use_diff_of_squares (bool):

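    • When set to True, the function uses a difference of squares method to compute the variance (Var = Mean(X^2) - Mean(X)^2).

    • This form is usually cheaper because both means can be accumulated in one pass over the data, but it can lose precision through cancellation when the mean is large relative to the spread of the values. Set it to False if you need the more robust Var = Mean((X - Mean(X))^2) form.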
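
The two variance formulas can be compared directly in plain Python, which helps to see when the use_diff_of_squares choice matters. This is only an illustrative sketch: the function names and test values are made up here, it runs in double precision on the CPU, and it says nothing about how TensorRT-LLM implements either path internally.

```python
def variance_two_pass(xs):
    """Var = Mean((X - Mean(X))^2): compute the mean first, then average squared deviations."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def variance_diff_of_squares(xs):
    """Var = Mean(X^2) - Mean(X)^2: both means can be accumulated in a single pass."""
    mean = sum(xs) / len(xs)
    mean_sq = sum(x * x for x in xs) / len(xs)
    return mean_sq - mean * mean


# Well-scaled values: both formulas agree closely.
small = [1.0, 2.0, 3.0]
# Same spread, shifted by a large offset: the squares are near 1e18, so
# subtracting two nearly equal numbers discards the low-order digits.
shifted = [1e9 + 1.0, 1e9 + 2.0, 1e9 + 3.0]

for name, xs in (("small", small), ("shifted", shifted)):
    print(name, variance_two_pass(xs), variance_diff_of_squares(xs))
```

The true variance is 2/3 for both inputs. The two-pass form recovers it in both cases, while the difference of squares form does so only for the well-scaled input. Activations in lower-precision types such as FP16 reach this regime at much smaller magnitudes.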

How to Use

  • Prepare Your Input Tensor: Ensure your input tensor is in the correct shape and data type.

  • Determine Normalization Shape: Set normalized_shape to match the dimensions of the tensor you want to normalize (usually the feature dimension).

  • Optional Weight and Bias: If you have specific scaling and shifting parameters (gamma and beta), provide them as weight and bias. If not, they can be omitted, and the operation will default to standard layer normalization without scaling and shifting.

  • Set Epsilon: Choose an appropriate eps value; the default is typically sufficient.

  • Use Difference of Squares: Decide whether to use the difference of squares method by weighing its cost against your model's numerical stability requirements.
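
The steps above can be sketched with a plain-Python stand-in that mirrors the documented parameter names. This is not the TensorRT-LLM implementation and does not call the library: layer_norm_reference is a hypothetical helper that works on a 2D list of rows, accepts only an int for normalized_shape, and clamps the variance at zero, which is an extra guard added here for the sketch.

```python
import math


def layer_norm_reference(rows, normalized_shape, weight=None, bias=None,
                         eps=1e-5, use_diff_of_squares=True):
    """Normalize each row of a 2D list over its last dimension (illustrative only)."""
    out = []
    for row in rows:
        if len(row) != normalized_shape:
            raise ValueError("last dimension must equal normalized_shape")
        mean = sum(row) / normalized_shape
        if use_diff_of_squares:
            var = sum(v * v for v in row) / normalized_shape - mean * mean
        else:
            var = sum((v - mean) ** 2 for v in row) / normalized_shape
        var = max(var, 0.0)  # assumption of this sketch: clamp rounding noise below zero
        inv_std = 1.0 / math.sqrt(var + eps)
        normed = [(v - mean) * inv_std for v in row]
        if weight is not None:  # optional element-wise scale (gamma)
            normed = [n * g for n, g in zip(normed, weight)]
        if bias is not None:  # optional element-wise shift (beta)
            normed = [n + b for n, b in zip(normed, bias)]
        out.append(normed)
    return out


# Two "tokens" with a feature dimension of 4.
hidden = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
gamma = [1.0, 1.0, 2.0, 2.0]
beta = [0.0, 0.5, 0.0, 0.5]

plain = layer_norm_reference(hidden, normalized_shape=4)
scaled = layer_norm_reference(hidden, normalized_shape=4, weight=gamma, bias=beta)
```

Without weight and bias, each row comes out with a mean close to zero. With them, every normalized element is multiplied by its gamma and then shifted by its beta. The constant second row shows why eps matters: its variance is zero, so the division is only safe because eps is added first.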

Returns

  • Tensor: The function returns a normalized tensor with the same shape as the input tensor.

Example Use Case

In a transformer model, layer normalization is often applied to the output of each sub-block (such as multi-head attention or a feed-forward network). Before the optional scale and shift, this gives the features of each token a mean of zero and a standard deviation close to one, which helps stabilize training and improve convergence.
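
As a rough illustration of this use case, the sketch below normalizes the output of a toy sub-block for a single token. Everything in it is hypothetical: toy_feed_forward stands in for a real attention or feed-forward layer, the residual addition before the normalization is one common arrangement assumed here rather than something this page specifies, and layer_norm is a small local helper, not the TensorRT-LLM function.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one hidden vector to roughly zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def toy_feed_forward(vec):
    """Hypothetical stand-in for a real sub-block (attention or feed-forward)."""
    return [3.0 * v + 1.0 for v in vec]


def sub_block_with_norm(vec):
    """Add the sub-block output to its input, then apply layer normalization."""
    summed = [v + f for v, f in zip(vec, toy_feed_forward(vec))]
    return layer_norm(summed)


token = [0.5, -1.0, 2.0, 4.5]
normed = sub_block_with_norm(token)

# Check the statistics of the normalized features.
mean = sum(normed) / len(normed)
std = math.sqrt(sum((v - mean) ** 2 for v in normed) / len(normed))
print(round(mean, 6), round(std, 3))
```

The printed mean is close to zero and the standard deviation is slightly below one, because eps is added to the variance before taking the square root.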
