Position-Wise Feed-Forward Layer

The position-wise feed-forward layer is the sub-layer that follows the multi-head attention sub-layer in every Transformer encoder and decoder block.

It is a ‘standard multi-layer perceptron’ with one hidden layer.

It consists of two fully connected layers (also known as dense layers) with a ReLU (Rectified Linear Unit) activation function in between. Given an input X, the feed-forward layer performs the following operations:

FFN(X)=ReLU(X⋅W1+b1)⋅W2+b2

Here, W1 and W2 are the weight matrices, and b1 and b2 are the bias terms. This layer is applied to each position separately and identically, i.e., it is position-wise.
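The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes (seq_len, d_model, d_ff), the random initialization, and the function names are assumptions chosen for the example. X holds one row per position, so a single matrix product applies the same W1 and W2 to every position.

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0)
    return np.maximum(x, 0.0)

def ffn(X, W1, b1, W2, b2):
    """FFN(X) = ReLU(X·W1 + b1)·W2 + b2.

    X:  (seq_len, d_model), one row per position
    W1: (d_model, d_ff),    b1: (d_ff,)
    W2: (d_ff, d_model),    b2: (d_model,)
    """
    hidden = relu(X @ W1 + b1)   # (seq_len, d_ff)
    return hidden @ W2 + b2      # (seq_len, d_model)

# Illustrative sizes and random weights (assumptions, not from the text)
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32

X = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

out = ffn(X, W1, b1, W2, b2)
print(out.shape)  # (4, 8): same shape as the input
```

The hidden width d_ff is typically chosen larger than d_model, and the second layer maps back to d_model so the output can be passed on to the next block.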

So, after the self-attention layer finishes processing the sequence, the position-wise feed-forward layer takes in each position in the input sequence and processes it independently.

For each position, a fully connected layer takes in a vector representation of the token (word or subword) at that position. This vector representation is the output from the preceding self-attention layer.
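To make "separately and identically" concrete, the following sketch checks two things under assumed toy sizes and random weights: applying the layer to the whole sequence at once gives the same result as applying it to each position on its own, and changing the vector at one position does not affect the output at any other position. The helper ffn and all variable names are illustrative.

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    # ReLU(X·W1 + b1)·W2 + b2; works for one vector or a stack of vectors
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

# Toy sizes and random parameters (assumptions for the demo)
rng = np.random.default_rng(1)
seq_len, d_model, d_ff = 5, 6, 24
X = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff))
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_ff, d_model))
b2 = rng.normal(size=d_model)

# 1) Whole sequence at once vs. one position at a time
full = ffn(X, W1, b1, W2, b2)
per_position = np.stack([ffn(X[i], W1, b1, W2, b2) for i in range(seq_len)])
same_result = np.allclose(full, per_position)

# 2) Perturb position 2 only; every other position's output is unchanged
X_changed = X.copy()
X_changed[2] += 1.0
changed = ffn(X_changed, W1, b1, W2, b2)
others_unchanged = np.allclose(
    np.delete(full, 2, axis=0), np.delete(changed, 2, axis=0)
)

print(same_result, others_unchanged)
```

Any mixing of information between positions therefore has to happen in the attention layer; the feed-forward layer only transforms each position's vector in isolation.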

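The fully connected layers transform each input vector into a new vector representation: the first layer projects the vector into the hidden dimension, ReLU applies a nonlinearity, and the second layer projects the result back. This gives the model a nonlinear, per-token transformation that complements attention, which is the step that mixes information between words.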

During training, the feed-forward layer's parameters (W1, b1, W2 and b2) are updated repeatedly, together with the rest of the model, to reduce the difference between the predicted output and the actual output. The gradients used for these updates are computed with the backpropagation algorithm, just as for the layers of a traditional neural network.
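As a rough sketch of what such an update looks like, the example below trains the feed-forward layer in isolation with hand-written backpropagation and plain gradient descent. Several things here are assumptions made for the demo: the mean-squared-error loss, the random regression target, the learning rate, and the sizes. In a real Transformer the gradient reaching this layer comes from the model's overall loss, passed back through the later layers, and an optimizer such as Adam is normally used.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16           # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
target = rng.normal(size=(seq_len, d_model))  # made-up target for the demo

W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

lr = 0.5      # learning rate (assumed)
losses = []

for step in range(100):
    # Forward pass: FFN(X) = ReLU(X·W1 + b1)·W2 + b2
    pre = X @ W1 + b1
    hidden = np.maximum(pre, 0.0)
    out = hidden @ W2 + b2
    losses.append(np.mean((out - target) ** 2))

    # Backward pass: chain rule applied from the loss back to each parameter
    d_out = 2.0 * (out - target) / out.size   # dLoss/d_out
    dW2 = hidden.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = d_out @ W2.T
    d_pre = d_hidden * (pre > 0)              # ReLU passes gradient where pre > 0
    dW1 = X.T @ d_pre
    db1 = d_pre.sum(axis=0)

    # Gradient descent step
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

Because the same W1 and W2 are shared by all positions, the gradients for the weights are summed over positions (the matrix products with X.T and hidden.T do this implicitly).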


