Transformer Architecture
Transformers were first developed to solve the problem of sequence transduction, or neural machine translation, which means they are meant to solve any task that transforms an input sequence to an output sequence.
This is why they are called “Transformers”.
This explanation of the Transformer architecture is one of the best we have found. Thanks to DataCamp, a fantastic educational platform:
A transformer model is a neural network that learns the context of sequential data and generates new data out of it.
To put it simply:
A transformer is a type of artificial intelligence model that learns to understand and generate human-like text by analyzing patterns in large amounts of text data.
The encoder is a fundamental component of the Transformer architecture.
The primary function of the encoder is to transform the input tokens into contextualized representations. Unlike earlier recurrent models that processed tokens one at a time, the Transformer encoder captures the context of each token with respect to the entire sequence.
Input Embeddings
The embedding only happens in the bottom-most encoder.
The encoder begins by converting input tokens - words or subwords - into vectors using embedding layers. These embeddings capture the semantic meaning of the tokens and convert them into numerical vectors.
All the encoders receive a list of vectors, each of size 512 (fixed-sized). In the bottom encoder, that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below them.
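As a rough illustration of the embedding step, the sketch below looks up one 512-dimensional vector per token id. The vocabulary size, the token ids, and the random table are assumptions made for the example; in a trained model the table is learned.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10   # hypothetical toy vocabulary size
d_model = 512     # vector size mentioned in the text

# One d_model-sized row per token id (random here, learned in practice).
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 1, 7])          # hypothetical ids for a 3-token input
embeddings = embedding_table[token_ids]  # row lookup: one vector per token
print(embeddings.shape)                  # (3, 512)
```

The resulting list of vectors is what the bottom encoder receives; higher encoders receive a list of the same shape from the layer below.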
Positional Encoding
Positional encoding is a technique used in transformer-based models to inject information about the position of each token in the input sequence.
This is necessary because the self-attention mechanism itself is order-agnostic: reordering the input tokens simply reorders the outputs, so without positional information the model has no notion of the order of the input tokens.
In the standard transformer model, positional encoding vectors are added to the input embeddings before being fed into the transformer layers. These positional encodings have the same dimension as the input embeddings, and they are designed to uniquely represent each position in the sequence.
The most common form of positional encoding uses sine and cosine functions of different frequencies to create unique vectors for each position. These vectors have the same dimension as the input embeddings.
Adding these positional encodings to the input embeddings provides the model with a unique representation of each token's position in the sequence, which the self-attention mechanism can then use to learn position-dependent transformations.
There are also other forms of positional encoding, such as learned positional embeddings, where the embeddings for each position are learned parameters of the model, rather than being defined by a fixed function.
In summary, the positional encoding is added to the input embedding, and this combined embedding is then fed into the transformer layers, allowing the model to use the positional information during the self-attention computation and subsequent transformations.
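The sine and cosine scheme can be sketched as follows. This is a minimal version that assumes an even `d_model` and the base of 10000 used in the original Transformer paper; the function name and the toy sizes are chosen for illustration only.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)

# The encoding has the same shape as the embeddings, so it is simply added.
token_embeddings = np.ones((4, 8))   # placeholder embeddings
model_input = token_embeddings + pe
print(model_input.shape)             # (4, 8)
```

Because every position gets a different row of `pe`, two identical tokens at different positions end up with different input vectors.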
Stack of Encoder Layers
The Transformer encoder consists of a stack of identical layers (6 in the original Transformer model).
The encoder layer serves to transform all input sequences into a continuous, abstract representation that encapsulates the learned information from the entire sequence.
This layer comprises two sub-modules:
A multi-headed attention mechanism
A fully connected network
Additionally, it incorporates residual connections around each sublayer, which are then followed by layer normalization.
Multi-Headed Self-Attention Mechanism
In the encoder, the multi-headed attention uses a specialised attention mechanism known as self-attention.
This approach enables the models to relate each word in the input with other words. For instance, in a given example, the model might learn to connect the word “are” with “you”.
This mechanism allows the encoder to focus on different parts of the input sequence as it processes each token. It computes attention scores based on:
A query is a vector that represents a specific word or token from the input sequence in the attention mechanism.
A key is also a vector in the attention mechanism, corresponding to each word or token in the input sequence
Each value is associated with a key and is used to construct the output of the attention layer. When a query and a key match well, which basically means that they have a high attention score, the corresponding value is emphasised in the output.
This first Self-Attention module enables the model to capture contextual information from the entire sequence.
Instead of performing a single attention function, queries, keys and values are linearly projected h times.
On each of these projected versions of queries, keys and values the attention mechanism is performed in parallel, yielding one output per head. The h outputs are then concatenated and passed through a final linear projection.
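A compact sketch of the multi-head idea is shown below. It assumes a common implementation shortcut: one large projection per role (Q, K, V) whose output is split into `num_heads` slices, which is equivalent to projecting separately for each head. All names, sizes, and random weights are illustrative, and the scaling and softmax inside are the steps described in the following subsections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads   # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ v                          # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                  # final linear layer

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)   # (5, 16)
```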
The detailed architecture goes as follows:
Matrix Multiplication (MatMul) - Dot Product of Query and Key
Once the query, key, and value vectors are passed through a linear layer, a dot product matrix multiplication is performed between the queries and keys, resulting in the creation of a score matrix.
The score matrix establishes the degree of emphasis each word should place on other words. Therefore, each word is assigned a score in relation to other words within the same time step. A higher score indicates greater focus.
This process effectively maps the queries to their corresponding keys.
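To make the score matrix concrete, here is a small sketch with random inputs and random projection weights, all of which are assumptions for the example. Entry `[i, j]` of the result is the dot product between the query of token `i` and the key of token `j`.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8      # toy sizes

x = rng.normal(size=(seq_len, d_model))   # one row per input token
w_q = rng.normal(size=(d_model, d_k))     # linear layer producing queries
w_k = rng.normal(size=(d_model, d_k))     # linear layer producing keys

queries = x @ w_q
keys = x @ w_k

# Every query is compared with every key in one matrix multiplication.
scores = queries @ keys.T
print(scores.shape)   # (4, 4)
```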
Reducing the Magnitude of attention scores
The scores are then scaled down by dividing them by the square root of the dimension of the query and key vectors. This step is implemented to ensure more stable gradients, as the multiplication of values can lead to excessively large effects.
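The effect of the scaling can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is a simplification rather than what a trained model produces. Under that assumption the raw dot products have a spread of about the square root of `d_k`, and dividing by that factor brings the spread back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_pairs = 10000

q = rng.normal(size=(num_pairs, d_k))
k = rng.normal(size=(num_pairs, d_k))

raw = (q * k).sum(axis=1)      # one unscaled dot product per pair
scaled = raw / np.sqrt(d_k)    # scaled as in the attention formula

print(raw.std(), scaled.std())   # roughly 8 and roughly 1
```

Keeping the scores in a moderate range matters because very large inputs push the softmax toward near one-hot outputs, where gradients become very small.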
Applying Softmax to the Adjusted Scores
Subsequently, a softmax function is applied to the adjusted scores to obtain the attention weights.
This results in probability values ranging from 0 to 1. The softmax function emphasises higher scores while diminishing lower scores, thereby enhancing the model's ability to effectively determine which words should receive more attention.
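A minimal softmax sketch follows. Subtracting the row maximum before exponentiating is a standard numerical-stability trick that does not change the result; the example scores are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],    # one clearly dominant score
                   [0.0, 0.0, 0.0]])   # all scores equal
weights = softmax(scores)
print(weights.sum(axis=-1))   # each row sums to 1
```

In the first row the largest score receives the largest weight; in the second row the weights are uniform, since no score stands out.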
Combining Softmax Results with the Value Vector
The following step of the attention mechanism is that weights derived from the softmax function are multiplied by the value vector, resulting in an output vector.
In this process, the values of words with high softmax weights dominate the output, while the values of words with low weights contribute little. Finally, this output vector is fed into a linear layer for further processing.
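Putting the four steps together (MatMul, scale, softmax, weighted sum of values) gives a single-head sketch like the one below. The function name and the random inputs are assumptions for illustration, and the final linear layer is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # MatMul and scaling
    weights = softmax(scores)         # attention weights, rows sum to 1
    return weights @ v, weights       # weighted sum of the value vectors

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 2))
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)   # (3, 2) (3, 3)
```

Each output row is a mixture of all value vectors, so low-weight tokens are damped rather than removed outright.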
Normalization and Residual Connections
Each sub-layer in an encoder layer is followed by a normalization step.
Also, each sub-layer output is added to its input (residual connection) to help mitigate the vanishing gradient problem, allowing deeper models. This process will be repeated after the Feed-Forward Neural Network too.
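The "add, then normalize" pattern can be sketched as below. This simplified layer normalization omits the learned gain and bias parameters that real implementations usually include, and the stand-in sub-layer is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection: the sub-layer output is added to its own input.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
out = add_and_norm(x, lambda t: 0.5 * t)   # hypothetical stand-in sub-layer
print(out.shape)   # (3, 8)
```

The same wrapper would be applied around the attention sub-layer and around the feed-forward sub-layer.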
Feed-Forward Neural Network
The journey of the normalized residual output continues as it navigates through a pointwise feed-forward network, a crucial phase for additional refinement.
Picture this network as a duo of linear layers, with a ReLU activation nestled in between them, acting as a bridge. Once processed, the output embarks on a familiar path: it loops back and merges with the input of the pointwise feed-forward network.
This reunion is followed by another round of normalization, ensuring everything is well-adjusted and in sync for the next steps.
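The pointwise feed-forward network described above (two linear layers with a ReLU in between) might look like this in NumPy. The sizes and random weights are toy assumptions; the original paper uses much larger dimensions.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)   # first linear layer + ReLU
    return hidden @ w2 + b2                 # second linear layer

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 3, 8, 32           # toy sizes
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(x, w1, b1, w2, b2)
print(out.shape)   # (3, 8)
```

"Pointwise" means each position is transformed independently with the same weights, so processing a single row on its own gives the same result as that row of the full output.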
Output of the Encoder
The output of the final encoder layer is a set of vectors, each representing the input sequence with a rich contextual understanding. This output is then used as the input for the decoder in a Transformer model.
This careful encoding paves the way for the decoder, guiding it to pay attention to the right words in the input when it's time to decode.
Think of it like building a tower, where you can stack up N encoder layers. Each layer in this stack gets a chance to explore and learn different facets of attention, much like layers of knowledge. This not only diversifies the understanding but could significantly amplify the predictive capabilities of the transformer network.
The decoder's role centers on crafting text sequences. Mirroring the encoder, the decoder is equipped with a similar set of sub-layers. It boasts two multi-headed attention layers, a pointwise feed-forward layer, and incorporates both residual connections and layer normalization after each sub-layer.
These components function in a way akin to the encoder's layers, yet with a twist: each multi-headed attention layer in the decoder has its unique mission.
The final step of the decoder's process involves a linear layer, serving as a classifier, topped off with a softmax function to calculate the probabilities of different words.
The Transformer decoder has a structure specifically designed to generate this output by decoding the encoded information step by step.
It is important to notice that the decoder operates in an autoregressive manner, kickstarting its process with a start token. It cleverly uses a list of previously generated outputs as its inputs, in tandem with the outputs from the encoder that are rich with attention information from the initial input.
This sequential dance of decoding continues until the decoder reaches a pivotal moment: the generation of a token that signals the end of its output creation.
Output Embeddings
At the decoder's starting line, the process mirrors that of the encoder. Here, the input first passes through an embedding layer
Positional Encoding
Following the embedding, again just like the encoder, the input passes through the positional encoding layer. This step is designed to produce positional embeddings.
These positional embeddings are then channeled into the first multi-head attention layer of the decoder, where the attention scores specific to the decoder’s input are meticulously computed.
Stack of Decoder Layers
The decoder consists of a stack of identical layers (6 in the original Transformer model). Each layer has three main sub-components:
Masked Self-Attention Mechanism
This is similar to the self-attention mechanism in the encoder but with a crucial difference: it prevents positions from attending to subsequent positions, which means that each word in the sequence isn't influenced by future tokens.
For instance, when the attention scores for the word "are" are being computed, it's important that "are" doesn't get a peek at "you", which is a subsequent word in the sequence.
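One common way to implement this masking is to set the scores of future positions to negative infinity before the softmax, so that their weights become exactly zero. The sketch below assumes a three-token sequence "how are you" and uses all-zero scores so that the effect of the mask is easy to read.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["how", "are", "you"]   # hypothetical example sequence
scores = np.zeros((3, 3))        # placeholder attention scores

# True above the diagonal marks "future" positions that must be hidden.
future = np.triu(np.ones((3, 3), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores)
print(weights)
# Row for "are" is [0.5, 0.5, 0.0]: it attends to "how" and itself, never "you".
```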
Encoder-Decoder Multi-Head Attention or Cross Attention
In the second multi-headed attention layer of the decoder, we see a unique interplay between the encoder and decoder's components.
Here, the outputs from the encoder take on the roles of both keys and values, while the outputs from the first multi-headed attention layer of the decoder serve as queries.
This setup effectively aligns the encoder's input with the decoder's, empowering the decoder to identify and emphasize the most relevant parts of the encoder's input.
Following this, the output from this second layer of multi-headed attention is then refined through a pointwise feedforward layer, enhancing the processing further.
In this sub-layer, the queries come from the previous decoder layer, and the keys and values come from the output of the encoder.
This allows every position in the decoder to attend over all positions in the input sequence, effectively integrating information from the encoder with the information in the decoder.
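A single-head sketch of cross attention is shown below, with queries taken from the decoder and keys and values taken from the encoder output. The function name, shapes, and random weights are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    q = decoder_states @ w_q      # queries come from the decoder
    k = encoder_outputs @ w_k     # keys come from the encoder
    v = encoder_outputs @ w_v     # values come from the encoder
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
decoder_states = rng.normal(size=(2, d_model))    # 2 target positions so far
encoder_outputs = rng.normal(size=(5, d_model))   # 5 source positions
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
print(out.shape, weights.shape)   # (2, 8) (2, 5)
```

The weight matrix has one row per decoder position and one column per input position, which is what lets every decoder position attend over the whole input sequence.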
Feed-Forward Neural Network
Similar to the encoder, each decoder layer includes a fully connected feed-forward network, applied to each position separately and identically.
The journey of data through the transformer model culminates in its passage through a final linear layer, which functions as a classifier.
The size of this classifier corresponds to the total number of classes involved (number of words contained in the vocabulary). For instance, in a scenario with 1000 distinct classes representing 1000 different words, the classifier's output will be an array with 1000 elements.
This output is then introduced to a softmax layer, which transforms it into a range of probability scores, each lying between 0 and 1. The highest of these probability scores is key: its corresponding index directly points to the word that the model predicts as the next in the sequence.
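The classifier step can be sketched as a matrix multiplication followed by softmax and an argmax. The 1000-word vocabulary comes from the example above; the hidden vector, the weights, and the 512-dimensional size are random or assumed values used only to show the shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000

hidden = rng.normal(size=(d_model,))   # decoder output for the last position
w_out = rng.normal(size=(d_model, vocab_size)) * 0.01   # classifier weights
b_out = np.zeros(vocab_size)

logits = hidden @ w_out + b_out        # one raw score per vocabulary word
probs = softmax(logits)                # probabilities between 0 and 1
next_token_id = int(np.argmax(probs))  # index of the most likely word
print(probs.shape)                     # (1000,)
```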
Normalization and Residual Connections
Each sub-layer (masked self-attention, encoder-decoder attention, feed-forward network) is followed by a normalization step, and each also includes a residual connection around it.
Output of the Decoder
The final layer's output is transformed into a predicted sequence, typically through a linear layer followed by a softmax to generate probabilities over the vocabulary.
The decoder, in its operational flow, incorporates the freshly generated output into its growing list of inputs, and then proceeds with the decoding process. This cycle repeats until the model predicts a specific token, signaling completion.
At each step, the token predicted with the highest probability is chosen as the output (this is known as greedy decoding). Decoding stops when the chosen token is the end token.
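The autoregressive loop can be sketched as follows. Since a real decoder is out of scope here, `toy_next_token_probs` is a hypothetical stand-in that always favors "last token id plus one"; the token ids, the vocabulary size, and the `max_len` safeguard are likewise assumptions.

```python
import numpy as np

START, END = 0, 4   # hypothetical ids for the start and end tokens
VOCAB_SIZE = 5

def toy_next_token_probs(generated):
    """Hypothetical stand-in for decoder stack + linear layer + softmax."""
    probs = np.full(VOCAB_SIZE, 0.05)
    probs[min(generated[-1] + 1, END)] = 0.8
    return probs

def greedy_decode(next_token_probs, max_len=10):
    generated = [START]                      # decoding begins with the start token
    while len(generated) < max_len:
        probs = next_token_probs(generated)  # conditioned on all previous outputs
        token = int(np.argmax(probs))        # pick the most probable token
        generated.append(token)              # feed it back in on the next step
        if token == END:                     # stop once the end token appears
            break
    return generated

print(greedy_decode(toy_next_token_probs))   # [0, 1, 2, 3, 4]
```

Picking the single most probable token is only one option; other selection strategies exist but are not covered in this text.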
Again remember that the decoder isn't limited to a single layer. It can be structured with N layers, each one building upon the input received from the encoder and its preceding layers. This layered architecture allows the model to diversify its focus and extract varying attention patterns across its attention heads.
Such a multi-layered approach can significantly enhance the model’s ability to predict, as it develops a more nuanced understanding of different attention combinations.
And the final architecture looks something like this (from the original paper):
In the attention mechanism, the network uses an attention matrix to compute a weighted sum of the input embeddings or hidden states, with the weights determined by the similarity between the current decoding state and each of the input states.
The attention matrix contains the similarity scores between the decoding state and each of the input states and is typically computed using a dot product between the decoding state and the input states.
The attention matrix can be visualized as a grid or matrix, with the rows representing the decoding states and the columns representing the input states. Each element of the matrix represents the similarity between a particular decoding state and a particular input state, and the values are typically normalized using a softmax function to produce a set of attention weights that sum to 1.
The self-attention mechanism allows the model to assign a weight to each word in the sequence, depending on how valuable it is for the prediction. This enables the model to capture the relationships between words, regardless of their distance from each other.
To see why averaging the token embeddings might be a good idea, consider what comes to mind when you see the word “flies”. You might think of annoying insects, but if you were given more context, like “time flies like an arrow”, then you would realize that “flies” refers to the verb instead.
Similarly, we can create a representation for “flies” that incorporates this context by combining all the token embeddings in different proportions, perhaps by assigning a larger weight to the token embeddings for “time” and “arrow”.
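As a toy illustration of combining token embeddings "in different proportions", the sketch below builds a contextualized vector for "flies" as a weighted average. The embeddings are random and the weights are hand-picked for the example; in a real model the weights come from the attention computation.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["time", "flies", "like", "an", "arrow"]
embeddings = rng.normal(size=(len(tokens), 4))   # toy 4-dimensional embeddings

# Hand-picked weights for "flies": more weight on "time" and "arrow".
weights = np.array([0.3, 0.3, 0.05, 0.05, 0.3])

contextual_flies = weights @ embeddings   # weighted average of all embeddings
print(contextual_flies.shape)             # (4,)
```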
Embeddings that are generated in this way are called contextualized embeddings and predate the invention of transformers in language models like ELMo.
The three matrices: Q for Query, K for Key, and V for Value
In the context of transformer-based architectures, the self-attention mechanism uses three matrices: Q (Query), K (Key), and V (Value). These matrices are derived from the input embeddings and play a crucial role in determining the relationships between different parts of the input sequence.
Query (Q) matrix: The Query matrix represents the current token or input element that the model is processing. It is used to compare against the Key matrix to determine how much attention should be paid to other tokens in the sequence. Intuitively, it asks: what am I looking for?
Key (K) matrix: The Key matrix represents all the tokens or input elements in the sequence. It is used to compute attention scores with the Query matrix, which helps the model decide how much each token should influence the current token being processed. Intuitively, it answers: what can I offer?
Value (V) matrix: The Value matrix represents the information associated with each token or input element in the sequence. After computing attention scores using the Query and Key matrices, these scores are used to weigh the Value matrix, resulting in a context-aware representation of the input. Intuitively, it is what is actually offered when the attention output is computed.
The philosophy behind using the Q, K, and V matrices is to enable the model to weigh the importance of different parts of the input sequence while considering the contextual information. This allows the model to capture long-range dependencies and understand the relationships between tokens effectively.
The complexity involved in computing self-attention mainly comes from the matrix multiplications and the softmax operation used to calculate attention scores.
For a sequence of length n and embedding dimension d, the complexity is O(n^2 * d). This quadratic complexity can become a bottleneck when dealing with long sequences, which is one of the reasons why transformers sometimes struggle with tasks that require processing very long input sequences.
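To get a feel for the quadratic growth, the sketch below counts the multiply-add operations needed just to form the n by n score matrix. It is a back-of-the-envelope estimate with a hypothetical helper name, not a measurement of any real implementation.

```python
def attention_score_cost(n, d):
    """Multiply-adds to compare n queries with n keys, each of size d."""
    return n * n * d

d = 64
for n in (1000, 2000, 4000):
    print(n, attention_score_cost(n, d))
```

Doubling the sequence length multiplies the cost by four, while doubling the embedding dimension only doubles it.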
The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. Another way to formulate this is to say that given a sequence of token embeddings, self-attention produces a sequence of new embeddings where each is a linear combination of all the