Layer Normalisation

Unlike batch normalization, layer normalization estimates the normalization statistics directly from the summed inputs to the neurons within a hidden layer, so the normalization does not introduce any new dependencies between training cases. It works well for RNNs, improving both the training time and the generalization performance of several existing RNN models, and more recently it has become a standard component of Transformer models.

In the architecture of the Transformer model, layer normalization is typically applied at multiple points to stabilize training and improve convergence. The Transformer architecture generally consists of an encoder and a decoder, each with multiple layers. Each layer usually consists of an attention mechanism and a position-wise feed-forward network.

Here is how layer normalization usually fits into this architecture:

  1. Post-Attention Layer Normalization: After the multi-head attention sub-layer, the output typically goes through a residual connection followed by layer normalization. This is done to stabilize the activations before passing them to the next sub-layer in the encoder or decoder.

  2. Post-Feed-Forward Layer Normalization: Similarly, the output of the position-wise feed-forward sub-layer passes through another residual connection and is then layer-normalized, for the same reasons of stabilization and convergence.

  3. Pre-Layer Normalization: Some variations of the Transformer architecture apply layer normalization before the attention and feed-forward sub-layers instead of after. This approach is known as "Pre-Layer Normalization", an alternative to the original "Post-Layer Normalization" design that is generally found to make training of deep Transformer stacks more stable.

So, in essence, layer normalization is strategically placed after (or before, in some variants) major sub-layers in both the encoder and decoder parts of the Transformer model to assist with stabilizing the training and facilitating faster convergence.
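
To make this placement concrete, here is a minimal PyTorch-style sketch of the same block wired in the post-layer-normalization and pre-layer-normalization styles. It is for illustration only, not TensorRT-LLM code; the attention and feed-forward sub-layers are stand-in placeholders.

```python
import torch
import torch.nn as nn

def make_ffn(d_model):
    # Stand-in for the position-wise feed-forward network
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class PostLNBlock(nn.Module):
    """Original Transformer style: residual add, then LayerNorm."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)   # placeholder for multi-head attention
        self.ffn = make_ffn(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))   # post-attention layer normalization
        x = self.norm2(x + self.ffn(x))    # post-feed-forward layer normalization
        return x

class PreLNBlock(nn.Module):
    """Pre-LN variant: each sub-layer sees a normalized input."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)   # placeholder for multi-head attention
        self.ffn = make_ffn(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # normalize before attention
        x = x + self.ffn(self.norm2(x))    # normalize before feed-forward
        return x

x = torch.randn(2, 16, 512)                                 # (batch, sequence, hidden)
print(PostLNBlock(512)(x).shape, PreLNBlock(512)(x).shape)  # both torch.Size([2, 16, 512])
```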

Normalization

Normalization is a technique used in deep learning to standardize the inputs within a particular layer so that they have a mean of 0 and a standard deviation of 1. This helps accelerate training by addressing the problem of internal covariate shift. Let's break down the key points for understanding RMSNorm as it is implemented in LLaMA:

What is Normalization?

  • Imagine the inputs to a neural network layer as a group of runners. If they start at different positions, they'll reach the finish line at different times. Normalization lines them up at the same starting point, ensuring they move in sync, which helps the network learn more efficiently.
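
As a concrete illustration of what "standardizing the inputs" means, the snippet below (a minimal sketch, not library code) normalizes a single activation vector over its hidden dimension so that it ends up with zero mean and unit standard deviation, which is the statistic layer normalization computes before applying its learned scale and shift.

```python
import torch

x = torch.tensor([2.0, 4.0, 6.0, 8.0])        # activations of one token across the hidden dimension
mean, var = x.mean(), x.var(unbiased=False)   # statistics computed over the layer, not the batch
x_norm = (x - mean) / torch.sqrt(var + 1e-5)  # standardize: mean ~0, std ~1

print(x_norm)                                 # tensor([-1.3416, -0.4472,  0.4472,  1.3416])
print(x_norm.mean().item(), x_norm.std(unbiased=False).item())  # ~0.0, ~1.0
```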

Layer Normalization

  • Commonly used in transformer models, layer normalization is applied after each sub-layer within the block. It's like realigning the runners at every checkpoint.

Root Mean Square Layer Normalization (RMSNorm):

  • RMSNorm is a simplified version of layer normalization, where the root mean square (RMS) is used to scale the inputs. Think of it as a more efficient way to get the runners in line by using a different calculation method.

Pre-Normalization in LLaMA:

  • LLaMA uses a pre-normalization variant, applying RMSNorm before each major sub-layer (attention and feed-forward) in the transformer block. If layer normalization is like aligning runners after each checkpoint, pre-normalization aligns them right before the next race starts. This ensures everything is in place before the computation begins for that layer.

Benefits of RMSNorm:

  • The primary advantages are training stability and generalization comparable to layer normalization, together with a reported 10-50% improvement in computational efficiency. It's akin to a refined alignment method that lets the runners reach the finish line more swiftly and consistently.

In mathematical terms, RMSNorm is typically formulated without subtracting the mean, focusing on scaling by the RMS of the inputs. This simplification can provide comparable performance to layer normalization but with less computational overhead.
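
A minimal sketch of that formulation is shown below (PyTorch-style, for illustration only; it follows the common RMSNorm definition with a learnable gain rather than the exact LLaMA or TensorRT-LLM source): the input is divided by the root mean square of its elements over the hidden dimension, and no mean is ever subtracted.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root mean square of the inputs; no mean subtraction."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # learnable gain
        self.eps = eps

    def forward(self, x):
        # Root mean square over the hidden dimension: sqrt(mean(x^2) + eps)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 4096)     # (batch, sequence, hidden)
print(RMSNorm(4096)(x).shape)    # torch.Size([2, 16, 4096])
```

Compared with the standardization snippet above, the only change is that the mean is never computed or subtracted, which is where the reduction in computational overhead comes from.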

The choice of normalization method, and where it is applied within the neural network, can have a significant effect on training dynamics and model performance. It's like picking the right tune-up for a sports car: the right choices lead to smoother rides and faster race times.
