# Transformer Architecture

Transformers were first developed to solve the problem of sequence transduction, with neural machine translation as the leading example. That means they are meant to solve any task that transforms an input sequence into an output sequence.&#x20;

This is why they are called “Transformers”.

This explanation of the Transformer architecture is one of the best we have found. Thanks to DataCamp, a fantastic educational platform:

<figure><img src="/files/QxbSAHJPoOXDQuNz0VJn" alt=""><figcaption><p>A full diagram of the Transformer Architecture</p></figcaption></figure>

{% embed url="<https://www.datacamp.com/blog>" %}

### <mark style="color:purple;">What Are Transformer Models?</mark> <a href="#what-are-transformer-models-atran" id="what-are-transformer-models-atran"></a>

A transformer model is a neural network that learns the context of sequential data and generates new data out of it.

To put it simply:

*A transformer is a type of artificial intelligence model that learns to understand and generate human-like text by analyzing patterns in large amounts of text data.*

### <mark style="color:purple;">The Encoder WorkFlow</mark> <a href="#the-encoder-workflow-theen" id="the-encoder-workflow-theen"></a>

The encoder is a fundamental component of the Transformer architecture.&#x20;

The primary function of the encoder is to transform the input tokens into contextualized representations. Unlike earlier recurrent models, which processed tokens one at a time, the Transformer encoder captures the context of each token with respect to the entire sequence.

<mark style="color:green;">**Input Embeddings**</mark>

The embedding only happens in the <mark style="color:yellow;">bottom-most encoder.</mark>&#x20;

The encoder begins by converting input tokens - words or subwords - into vectors using embedding layers. These embeddings capture the semantic meaning of the tokens and convert them into numerical vectors.

All the encoders receive a list of vectors, each of size 512 (fixed-sized). In the bottom encoder, that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below them.
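As a rough illustration of the embedding step, the sketch below looks up one 512-dimensional vector per token id. The vocabulary size, the token ids and the random table are assumptions made for the example; in a trained model the table holds learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10   # hypothetical toy vocabulary size
d_model = 512     # vector size used in the original Transformer

# Embedding table: one row of d_model numbers per token id.
# A trained model learns these values; here they are random placeholders.
embedding_table = rng.normal(size=(vocab_size, d_model))

def embed(token_ids):
    """Look up the embedding vector for each token id."""
    return embedding_table[np.asarray(token_ids)]

token_ids = [3, 1, 4, 1]   # a hypothetical 4-token input
x = embed(token_ids)
print(x.shape)             # (4, 512)
```

Note that the same token id always maps to the same vector at this stage; context only enters later, through attention.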

<mark style="color:green;">**Positional Encoding**</mark>

Positional encoding is a technique used in transformer-based models to inject information about the position of each token in the input sequence.&#x20;

This is necessary because self-attention on its own is insensitive to token order: without extra information, the model has no notion of the order of the input tokens.

In the standard transformer model, <mark style="color:yellow;">positional encoding vectors are added to the input embeddings</mark> before being fed into the transformer layers.  These positional encodings have the same dimension as the input embeddings, and they are designed to uniquely represent each position in the sequence.

The most <mark style="color:yellow;">common form of positional encoding uses sine and cosine functions</mark> of different frequencies to create unique vectors for each position. These vectors have the same dimension as the input embeddings.

Adding these positional encodings to the input embeddings <mark style="color:yellow;">provides the model with a unique representation of each token's position in the sequence,</mark> which the self-attention mechanism can then use to learn position-dependent transformations.

There are also other forms of positional encoding, such as learned positional embeddings, where the embeddings for each position are learned parameters of the model, rather than being defined by a fixed function.
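The sine and cosine scheme can be sketched in a few lines. This is a minimal version that assumes an even embedding size and uses the 10000 base from the original paper; the function name and the zero-valued stand-in embeddings are made up for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine/cosine position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
embeddings = np.zeros((50, 512))   # stand-in for real token embeddings
x = embeddings + pe                # added element-wise, same shape
print(pe.shape)                    # (50, 512)
```

Because the encoding has the same shape as the embeddings, the two can simply be added, and every position ends up with a distinct vector.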

In the figure below, the positional encoding is added to the input embedding, and this combined embedding is then fed into the transformer layers, allowing the model to use the positional information during the self-attention computation and subsequent transformations.

<figure><img src="/files/aIyzeWyBqoo1YW9qJeF9" alt=""><figcaption><p>Thanks to Xuer Chen</p></figcaption></figure>

<mark style="color:green;">**Stack of Encoder Layers**</mark>

The Transformer encoder <mark style="color:yellow;">consists of a stack of identical layers</mark> (6 in the original Transformer model).

The <mark style="color:yellow;">encoder layer serves to transform all input sequences into a continuous, abstract representation that encapsulates the learned information from the entire sequence</mark>.&#x20;

This layer comprises two sub-modules:

* A multi-headed attention mechanism
* A fully connected network

Additionally, it incorporates residual connections around each sublayer, which are then followed by <mark style="color:blue;">layer normalization.</mark>

<mark style="color:green;">**Multi-Headed Self-Attention Mechanism**</mark>

In the encoder, the multi-headed attention uses a <mark style="color:yellow;">specialised attention mechanism</mark> known as <mark style="color:blue;">self-attention.</mark>&#x20;

This approach enables the model to relate each word in the input to the other words. For instance, the model might learn to connect the word “are” with “you”.

This mechanism allows the encoder to focus on different parts of the input sequence as it processes each token. It computes attention scores from queries and keys, and then uses those scores to combine values:

* A <mark style="color:blue;">query</mark> is a vector that represents the specific word or token that is currently attending to the rest of the sequence.
* A <mark style="color:blue;">key</mark> is a vector corresponding to each word or token in the input sequence; queries are compared against keys.
* A <mark style="color:blue;">value</mark> is a vector associated with each key and is used to construct the output of the attention layer.

*<mark style="color:yellow;">**When a query and a key match well, which means that they have a high attention score, the corresponding value is emphasised in the output.**</mark>*

This *<mark style="color:yellow;">**first**</mark>* Self-Attention module enables the model to capture contextual information from the entire sequence.&#x20;

Instead of performing a single attention function, queries, keys and values are linearly projected <mark style="color:yellow;">h times.</mark>&#x20;

On each of these projected versions of queries, keys and values the attention mechanism is performed in parallel, yielding one output per head. The h head outputs are then concatenated and projected once more to produce the final values.
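To make the "project h times, attend in parallel" idea concrete, here is a minimal NumPy sketch with random weights. It assumes h = 8 heads and a model size of 512, as in the original paper, and it uses one large projection matrix per role that is then split into heads, which is a common way to implement the h separate projections. All names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head runs scaled dot-product attention independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and apply the final output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 512, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)   # (5, 512)
```

With 8 heads and a model size of 512, each head works with 64-dimensional queries, keys and values, so the total cost stays close to that of a single full-size attention.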

The detailed architecture goes as follows:

<figure><img src="/files/ovJXPiJ4zehrSqtR8IvW" alt=""><figcaption><p>Image from DataCamp (thank you)</p></figcaption></figure>

<mark style="color:green;">**Matrix Multiplication (MatMul) - Dot Product of Query and Key**</mark>

Once the query, key, and value vectors are passed through a linear layer, a <mark style="color:yellow;">dot product matrix multiplication is performed between the queries and keys</mark>, resulting in the creation of a score matrix.

The score matrix establishes the degree of emphasis each word should place on other words. Therefore, each word is assigned a score in relation to every other word in the sequence. A higher score indicates greater focus.

This process effectively maps the queries to their corresponding keys.
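A tiny worked example may help here. The query and key vectors below are made up for a 3-token input with 4-dimensional vectors; the score matrix is simply every query dotted with every key.

```python
import numpy as np

# Hypothetical query and key vectors for a 3-token input (d_k = 4).
q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
k = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

# scores[i, j] is the dot product of query i with key j.
scores = q @ k.T
print(scores)
# [[2. 0. 1.]
#  [0. 2. 1.]
#  [1. 1. 0.]]
```

Row 0 shows that the first token's query matches the first key best (score 2), the third key partially (score 1), and the second key not at all (score 0).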

<figure><img src="/files/fHqeTYuqGZVFLAc7VivH" alt=""><figcaption><p>Image from DataCamp (thank you)</p></figcaption></figure>

<mark style="color:green;">**Reducing the Magnitude of Attention Scores**</mark>

The scores are then scaled down by dividing them by the square root of the dimension of the query and key vectors. This step is implemented to ensure more stable gradients: dot products of high-dimensional vectors can grow large, which pushes the softmax into regions where its gradients are very small.
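The effect of the scaling can be checked numerically. The sketch below assumes query and key entries are independent with mean 0 and variance 1 (an idealised assumption, not a property of any real model) and compares the spread of raw and scaled dot products for a dimension of 64.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))   # entries with mean 0 and variance 1
k = rng.normal(size=(10_000, d_k))

raw = (q * k).sum(axis=1)            # 10,000 independent dot products
scaled = raw / np.sqrt(d_k)

print(raw.std())      # roughly sqrt(64) = 8
print(scaled.std())   # roughly 1
```

Under these assumptions the raw scores spread out in proportion to the square root of the dimension, and dividing by that square root brings them back to a spread of about 1 regardless of the vector size.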

<figure><img src="/files/209o9booK6nxUkG4NHxL" alt="" width="563"><figcaption><p>Image from DataCamp (thank you)</p></figcaption></figure>

<mark style="color:green;">**Applying Softmax to the Adjusted Scores**</mark>

Subsequently, a *<mark style="color:yellow;">**softmax function is applied to the adjusted scores to obtain the attention weights.**</mark>*&#x20;

This results in probability values ranging from 0 to 1. The softmax function emphasises higher scores while diminishing lower scores, thereby enhancing the model's ability to effectively determine which words should receive more attention.
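As a small illustration, the snippet below applies a row-wise softmax to a made-up matrix of scaled scores. Subtracting the row maximum first is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stability: largest entry becomes 0
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical scaled scores for two query tokens over three key tokens.
scaled_scores = np.array([[2.0, 0.0, 1.0],
                          [0.0, 2.0, 1.0]])

weights = softmax(scaled_scores)
print(weights.round(3))
# [[0.665 0.09  0.245]
#  [0.09  0.665 0.245]]
```

Each row sums to 1, and the gap between the highest and lowest score is amplified: a score of 2 receives about seven times the weight of a score of 0.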

<mark style="color:green;">**Combining Softmax Results with the Value Vector**</mark>

The following step of the attention mechanism is that <mark style="color:yellow;">weights derived from the softmax function are multiplied by the value vector,</mark> resulting in an output vector.

In this process, words with high softmax scores contribute most to the output, while words with low scores are largely suppressed. Finally, this output vector is fed into a linear layer for further processing.
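Putting the steps together (dot product, scaling, softmax, multiplication by the values), a single attention head can be sketched as follows. The random inputs are placeholders, and the final linear layer mentioned above is left out to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)    # one probability row per query
    return weights @ v, weights           # weighted sum of the value vectors

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))   # 3 tokens, 4-dimensional vectors
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)   # (3, 4) (3, 3)
```

Each output row is a weighted average of the rows of `v`, so values attached to low-weight keys are not discarded outright; they simply contribute very little.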

<figure><img src="/files/R83mmISlyrmsRUbGP6yV" alt="" width="563"><figcaption><p>Image from DataCamp (thank you)</p></figcaption></figure>

<mark style="color:green;">**Normalization and Residual Connections**</mark>

Each sub-layer in an encoder layer is followed by a normalization step.&#x20;

Also, each sub-layer output is added to its input (residual connection) to help mitigate the vanishing gradient problem, allowing deeper models. This process will be repeated after the Feed-Forward Neural Network too.
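The "add, then normalize" pattern can be sketched as below. This simplified layer normalization leaves out the learned gain and bias parameters that real implementations usually include, and the sub-layer is just a random linear map standing in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to mean 0 and variance close to 1."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))             # 4 tokens, d_model = 512
w = rng.normal(size=(512, 512)) * 0.02    # stand-in sub-layer weights

out = add_and_norm(x, lambda t: t @ w)
print(out.shape)   # (4, 512)
```

Because the input `x` is added back before normalizing, the sub-layer only has to learn a correction to its input, which is part of why gradients flow more easily through deep stacks.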

<figure><img src="/files/L5BiVe9XUyln2TCWw5P8" alt="" width="563"><figcaption></figcaption></figure>

<mark style="color:green;">**Feed-Forward Neural Network**</mark>

The journey of the <mark style="color:yellow;">normalized residual output</mark> continues as it navigates through a pointwise feed-forward network, a crucial phase for additional refinement.

Picture this network as a <mark style="color:yellow;">duo of linear layers</mark>, with a <mark style="color:blue;">ReLU activation</mark> nestled in between them, acting as a bridge. Once processed, the output embarks on a familiar path: it loops back and merges with the input of the pointwise feed-forward network.

This reunion is followed by another round of normalization, ensuring everything is well-adjusted and in sync for the next steps.
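Here is a minimal sketch of the pointwise feed-forward network on its own, without the residual connection and normalization that follow it. The inner size of 2048 is the value reported in the original paper; the weights are random placeholders.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to every position."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU activation
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
w1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(4, d_model))   # 4 token vectors
out = feed_forward(x, w1, b1, w2, b2)
print(out.shape)                    # (4, 512)
```

"Pointwise" means the same weights are applied to each token vector separately, so processing one token alone gives the same result as processing it inside the batch.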

<figure><img src="/files/eaIYF2YRFsGGNP8PVwU6" alt="" width="563"><figcaption></figcaption></figure>

<mark style="color:green;">**Output of the Encoder**</mark>

The output of the final encoder layer is a set of vectors, one per input token, each representing that token with a rich contextual understanding of the whole sequence. This output is then used as the <mark style="color:yellow;">input for the decoder</mark> in a Transformer model.

This careful encoding paves the way for the decoder, guiding it to pay attention to the right words in the input when it's time to decode.

Think of it like building a tower, where you can stack up N encoder layers. Each layer in this stack gets a chance to explore and learn different facets of attention, much like layers of knowledge. This not only diversifies the understanding but could significantly amplify the predictive capabilities of the transformer network.

#### <mark style="color:green;">The Decoder WorkFlow</mark> <a href="#the-decoder-workflow-thede" id="the-decoder-workflow-thede"></a>

The decoder's role centers on crafting text sequences. Mirroring the encoder, the decoder is equipped with a similar set of sub-layers. It boasts two multi-headed attention layers, a pointwise feed-forward layer, and incorporates both residual connections and layer normalization after each sub-layer.<br>

<figure><img src="/files/NPP7OzwlvpQVWmmoBqiw" alt="" width="374"><figcaption></figcaption></figure>

These components function in a way akin to the encoder's layers, yet with a twist: <mark style="color:yellow;">each multi-headed attention layer in the decoder has its unique mission.</mark>

The final step of the decoder's process involves a linear layer, serving as a classifier, topped off with a softmax function to calculate the probabilities of different words.

The Transformer decoder has a structure specifically designed to generate this output by decoding the encoded information step by step.

It is important to notice that the <mark style="color:yellow;">decoder operates in an autoregressive manner</mark>, kickstarting its process with a start token. It cleverly uses a list of previously generated outputs as its inputs, in tandem with the outputs from the encoder that are rich with attention information from the initial input.

This sequential dance of decoding continues until the decoder reaches a pivotal moment: the generation of a token that signals the end of its output creation.

<mark style="color:green;">**Output Embeddings**</mark>

At the decoder's starting line, the process mirrors that of the encoder. Here, the input first passes through an embedding layer.

<mark style="color:green;">**Positional Encoding**</mark>

Following the embedding, again just like the encoder, the input passes through the positional encoding layer. This sequence is designed to produce positional embeddings.

These positional embeddings are then channeled into the first multi-head attention layer of the decoder, where the attention scores specific to the decoder’s input are meticulously computed.

<mark style="color:green;">**Stack of Decoder Layers**</mark>

The decoder consists of a stack of identical layers (6 in the original Transformer model). Each layer has three main sub-components:

1. <mark style="color:purple;">**Masked Self-Attention Mechanism**</mark>

This is similar to the self-attention mechanism in the encoder but with a crucial difference: it prevents positions from attending to subsequent positions, which means that each word in the sequence isn't influenced by future tokens.

For instance, when the attention scores for the word "are" are being computed, it's important that "are" doesn't get a peek at "you", which is a subsequent word in the sequence.

<figure><img src="/files/s3U5v7Mc9JuoF9gfw9DD" alt=""><figcaption><p>This masking ensures that the predictions for a particular position can only depend on known outputs at positions before it.</p></figcaption></figure>
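One common way to implement this masking is to set the scores of future positions to negative infinity before the softmax, so that they receive zero weight. The sketch below uses all-zero scores as a stand-in for real scaled scores, which makes the effect of the mask easy to see.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """True where attention is allowed: position i may see positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((4, 4))   # stand-in for scaled query-key scores
masked = np.where(causal_mask(4), scores, -np.inf)
weights = softmax(masked, axis=-1)
print(weights.round(2))
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```

The first token can only attend to itself, the second to the first two tokens, and so on; everything above the diagonal is exactly zero.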

2. <mark style="color:purple;">**Encoder-Decoder Multi-Head Attention or Cross Attention**</mark>

In the second multi-headed attention layer of the decoder, we see a unique interplay between the encoder and decoder's components.&#x20;

Here, the outputs from the encoder take on the roles of both keys and values, while the outputs from the first multi-headed attention layer of the decoder serve as queries.

This setup effectively aligns the encoder's input with the decoder's, empowering the decoder to identify and emphasize the most relevant parts of the encoder's input.

Following this, the output from this second layer of multi-headed attention is then refined through a pointwise feedforward layer, enhancing the processing further.

<figure><img src="/files/MmxfsawKtvXXdFv52uWb" alt="" width="563"><figcaption></figcaption></figure>

In this sub-layer, the <mark style="color:yellow;">queries come from the previous decoder layer,</mark> and the keys and values come from the output of the encoder.&#x20;

This allows every position in the decoder to attend over all positions in the input sequence, effectively integrating information from the encoder with the information in the decoder.
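The division of roles in cross attention can be sketched with a single head. The sequence lengths, the small model size and the random weights are assumptions chosen to keep the example readable.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q    # queries come from the decoder
    k = encoder_output @ w_k    # keys come from the encoder output
    v = encoder_output @ w_v    # values come from the encoder output
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # (target_len, source_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 64                                     # kept small for the example
encoder_output = rng.normal(size=(6, d_model))   # 6 source tokens
decoder_states = rng.normal(size=(3, d_model))   # 3 target tokens so far
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(context.shape, weights.shape)   # (3, 64) (3, 6)
```

The weight matrix has one row per decoder position and one column per encoder position, and no causal mask is needed here because the whole input sequence is already known.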

3. <mark style="color:purple;">**Feed-Forward Neural Network**</mark>

Similar to the encoder, each decoder layer includes a fully connected feed-forward network, applied to each position separately and identically.

### <mark style="color:purple;">**Linear Classifier and Softmax for Generating Output Probabilities**</mark>

The journey of data through the transformer model <mark style="color:yellow;">culminates in its passage through a final linear layer,</mark> which functions as a classifier.

The size of this classifier corresponds to the total number of classes involved (number of words contained in the vocabulary). For instance, in a scenario with 1000 distinct classes representing 1000 different words, the classifier's output will be an array with 1000 elements.

This <mark style="color:yellow;">output is then introduced to a softmax layer,</mark> which transforms it into a range of probability scores, each lying between 0 and 1. The highest of these probability scores is key: its corresponding index directly points to the word that the model predicts as the next in the sequence.
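For the 1000-word scenario described above, the final step might look like the sketch below. The weights and the decoder vector are random placeholders, so the predicted index itself is meaningless; the point is the shapes and the order of operations.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000

# Final linear layer: maps a d_model vector to one score (logit) per word.
w_out = rng.normal(size=(d_model, vocab_size)) * 0.02
b_out = np.zeros(vocab_size)

decoder_output = rng.normal(size=(d_model,))   # vector for the last position
logits = decoder_output @ w_out + b_out        # 1000 raw scores
probs = softmax(logits)                        # 1000 probabilities summing to 1
next_token_id = int(np.argmax(probs))          # index of the most likely word

print(probs.shape)   # (1000,)
```

Since softmax preserves ordering, the index of the highest probability is the same as the index of the highest logit.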

<figure><img src="/files/zHtM3fAIemxJNzos8NHc" alt="" width="375"><figcaption></figcaption></figure>

<mark style="color:green;">**Normalization and Residual Connections**</mark>

Each sub-layer (masked self-attention, encoder-decoder attention, feed-forward network) is followed by a normalization step, and each also includes a residual connection around it.

<mark style="color:green;">**Output of the Decoder**</mark>

The final layer's output is transformed into a predicted sequence, typically through a linear layer followed by a softmax to generate probabilities over the vocabulary.

The decoder, in its operational flow, incorporates the freshly generated output into its growing list of inputs, and then proceeds with the decoding process. This cycle repeats until the model predicts a specific token, signaling completion.

At each step, the token predicted with the highest probability is selected as the output for that position. Generation stops once the selected token is the end token.
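The autoregressive loop can be sketched in plain Python. The token ids, the `toy_next_token_probs` stand-in and the `max_len` safety limit are all hypothetical; a real decoder would compute the probabilities from the encoder output and the tokens generated so far.

```python
START, END = 0, 1   # hypothetical ids for the start and end tokens

def toy_next_token_probs(generated):
    """Stand-in for a real decoder: returns one probability per token id.

    This toy model emits tokens 2, 3, 4 in order and then the end token.
    """
    script = [2, 3, 4, END]
    target = script[min(len(generated) - 1, len(script) - 1)]
    probs = [0.0] * 5
    probs[target] = 1.0
    return probs

def greedy_decode(next_token_probs, max_len=20):
    generated = [START]                      # decoding starts from the start token
    while len(generated) < max_len:
        probs = next_token_probs(generated)  # depends on everything generated so far
        next_id = max(range(len(probs)), key=probs.__getitem__)   # argmax
        generated.append(next_id)            # feed the new token back in
        if next_id == END:
            break
    return generated

result = greedy_decode(toy_next_token_probs)
print(result)   # [0, 2, 3, 4, 1]
```

Always picking the most probable token is known as greedy decoding; it is the simplest strategy, and alternatives such as beam search or sampling plug into the same loop.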

Again remember that the decoder isn't limited to a single layer. It can be structured with N layers, each one building upon the input received from the encoder and its preceding layers. This layered architecture allows the model to diversify its focus and extract varying attention patterns across its attention heads.

Such a multi-layered approach can significantly enhance the model’s ability to predict, as it develops a more nuanced understanding of different attention combinations.

And the final architecture looks something like this (from the original paper):

<figure><img src="/files/cdAmKYmp87cGdKIqbV8x" alt=""><figcaption></figcaption></figure>

### <mark style="color:blue;">Attention Mechanism</mark>

In the attention mechanism, the network uses an attention matrix to compute a weighted sum of the input embeddings or hidden states, with the weights determined by the similarity between the current decoding state and each of the input states.

The attention matrix contains the similarity scores between the decoding state and each of the input states and is typically computed using a dot product between the decoding state and the input states.

The attention matrix can be visualized as a grid or matrix, with the rows representing the decoding states and the columns representing the input states. Each element of the matrix represents the similarity between a particular decoding state and a particular input state, and the values are typically normalized using a softmax function to produce a set of attention weights that sum to 1.

The self-attention mechanism allows the model to assign a weight to each word in the sequence, depending on how valuable it is for the prediction. This enables the model to capture the relationships between words, regardless of their distance from each other.

The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. Another way to formulate this is to say that given a sequence of token embeddings $$x_1, \ldots, x_n$$, self-attention produces a sequence of new embeddings $$x'_1, \ldots, x'_n$$ where each $$x'_i$$ is a linear combination of all the $$x_j$$.

<figure><img src="/files/aSClbzGBbkvrWSkvTYzg" alt=""><figcaption></figcaption></figure>

To see why averaging the token embeddings might be a good idea, consider what comes to mind when you see the word “flies”. You might think of annoying insects, but if you were given more context, like “time flies like an arrow”, then you would realize that “flies” refers to the verb instead.&#x20;

Similarly, we can create a representation for “flies” that incorporates this context by combining all the token embeddings in different proportions, perhaps by assigning a larger weight to the token embeddings for “time” and “arrow”.

Embeddings that are generated in this way are called contextualized embeddings. They predate the invention of transformers, appearing in language models like ELMo.

Scaled dot-product attention is defined in terms of three matrices: Q for query, K for key, and V for value.

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

In the context of transformer-based architectures, the self-attention mechanism uses three matrices: Q (Query), K (Key), and V (Value). These matrices are derived from the input embeddings and play a crucial role in determining the relationships between different parts of the input sequence.

<mark style="color:purple;">**Query (Q) matrix:**</mark> The Query matrix represents the current token or input element that the model is processing. It is used to compare against the Key matrix to determine how much attention should be paid to other tokens in the sequence. *In short: what am I looking for?*

<mark style="color:purple;">**Key (K) matrix:**</mark> The Key matrix represents all the tokens or input elements in the sequence. It is used to compute attention scores with the Query matrix, which helps the model decide how much each token should influence the current token being processed. *In short: what can I offer?*

<mark style="color:purple;">**Value (V) matrix**</mark><mark style="color:purple;">:</mark> The Value matrix represents the information associated with each token or input element in the sequence. After computing attention scores using the Query and Key matrices, these scores are used to weigh the Value matrix, resulting in a context-aware representation of the input. *In short: what is actually being offered when the attention is computed?*

The philosophy behind using the Q, K, and V matrices is to enable the model to weigh the importance of different parts of the input sequence while considering the contextual information. This allows the model to capture long-range dependencies and understand the relationships between tokens effectively.

The complexity involved in computing self-attention mainly comes from the matrix multiplications and the softmax operation used to calculate attention scores.

For a sequence of length n and embedding dimension d, the complexity is O(n^2 \* d). This quadratic complexity can become a bottleneck when dealing with long sequences, which is one of the reasons why transformers sometimes struggle with tasks that require processing very long input sequences.
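To get a feel for the quadratic growth, the short calculation below counts the multiply-add operations needed just to form the n × n score matrix. It ignores constants, the softmax and the multiplication by the values, so treat the numbers as rough orders of magnitude.

```python
def attention_score_multiplications(n, d):
    """Multiply-adds needed to form the n x n matrix of query-key scores."""
    return n * n * d

d = 512
for n in (512, 1024, 2048):
    print(n, attention_score_multiplications(n, d))
# 512 134217728
# 1024 536870912
# 2048 2147483648
```

Doubling the sequence length quadruples the cost, which is why long inputs become expensive so quickly.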

<figure><img src="/files/oDoGBZXWAZHdIrcTzJqO" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/nBT1TgD8VJTkD8HFmtoD" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/YC9jeodSXukL2M7b4MaS" alt=""><figcaption></figcaption></figure>


