Residual Connections
A residual block is a collection of layers in which the data flows both through the layers and around them via residual (or skip) connections.
Transformers use these residual connections (also known as skip connections) to improve the flow of gradients during backpropagation and to enable deeper models. Because these connections add the input of a layer directly to its output, they help mitigate the vanishing gradient problem and enable the model to learn more complex representations.
The primary function of a residual connection is to provide a shortcut around one or more layers, adding the original input to the output of the bypassed layer(s). Mathematically, this is represented as:
Output = F(Input) + Input
where F(Input) is the transformation applied by the layer(s) being bypassed.
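As a minimal sketch in PyTorch (the layer F below is just a placeholder linear layer, not part of any particular model), the formula translates directly into code:

```python
import torch
import torch.nn as nn

# A stand-in for F: any sub-layer whose output has the same shape as its input.
F = nn.Linear(512, 512)

x = torch.randn(8, 512)   # Input
output = F(x) + x         # Output = F(Input) + Input (the skip path adds x back)
```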
In the context of the Transformer model, each sub-layer (e.g., self-attention layer, position-wise feed-forward layer) has a residual connection around it.
After the sub-layer's operation, its output is added to the original input (the residual connection), and then layer normalization is applied.
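A hedged sketch of this post-norm pattern, LayerNorm(x + Sublayer(x)), using a hypothetical wrapper module:

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Sketch of the Transformer sub-layer pattern: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # Residual add first, then layer normalization, as described above.
        return self.norm(x + sublayer(x))

# Usage: wrap a self-attention or position-wise feed-forward sub-layer.
block = SublayerConnection(d_model=512)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
y = block(torch.randn(8, 10, 512), ffn)
```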
This allows gradients to flow from the loss function all the way back to the first layer, which is possible because each module's output has the same shape as its input.
These design choices collectively allow transformers to weigh the importance of different parts of the input without having to maintain an internal state, making them highly effective for a wide range of NLP tasks.
Virtual Weights and the Residual Stream
One of the main features of the high-level architecture of a transformer is that each layer adds its results into what we call the “residual stream.”
The residual stream is simply the sum of the output of all the previous layers and the original embedding. We generally think of the residual stream as a communication channel since it doesn't do any processing itself and all layers communicate through it.
The residual stream has a deeply linear structure. Every layer performs an arbitrary linear transformation to "read in" information from the residual stream at the start, and performs another arbitrary linear transformation before adding to "write" its output back into the residual stream. This linear, additive structure of the residual stream has a lot of important implications. One basic consequence is that the residual stream doesn't have a "privileged basis"; we could rotate it by rotating all the matrices interacting with it, without changing model behavior.
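A schematic sketch of this read/write structure (the matrices, dimensions, and layer count here are illustrative, not taken from any real model):

```python
import torch

d_model, d_inner = 512, 64
stream = torch.randn(1, d_model)              # residual stream starts as the embedding

for _ in range(4):                            # an illustrative stack of 4 layers
    W_in = torch.randn(d_model, d_inner)      # linear "read" from the residual stream
    W_out = torch.randn(d_inner, d_model)     # linear "write" back into the stream
    h = torch.relu(stream @ W_in)             # the layer's own computation
    stream = stream + h @ W_out               # additive write: the stream is a running sum
```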
Subspaces and Residual Stream Bandwidth
The residual stream is a high-dimensional vector space. In small models, it may be hundreds of dimensions; in large models it can go into the tens of thousands. This means that layers can send different information to different layers by storing it in different subspaces. This is especially important in the case of attention heads, since every individual head operates on comparatively small subspaces (often 64 or 128 dimensions), and can very easily write to completely disjoint subspaces and not interact.
Once added, information persists in a subspace unless another layer actively deletes it. From this perspective, dimensions of the residual stream become something like "memory" or "bandwidth".
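To make the subspace picture concrete, here is an illustrative sketch in which two heads write to disjoint 64-dimensional subspaces of a 512-dimensional stream, so a later reader of the first subspace never sees the second head's output:

```python
import torch

d_model, d_head = 512, 64
stream = torch.zeros(d_model)

# Hypothetical output matrices chosen so the two heads occupy disjoint subspaces.
W_out_1 = torch.zeros(d_head, d_model); W_out_1[:, :64] = torch.eye(64)      # dims 0-63
W_out_2 = torch.zeros(d_head, d_model); W_out_2[:, 64:128] = torch.eye(64)   # dims 64-127

stream = stream + torch.randn(d_head) @ W_out_1   # head 1 writes its result
stream = stream + torch.randn(d_head) @ W_out_2   # head 2 writes, without interference

# A later layer that reads only dims 0-63 recovers head 1's output untouched.
W_in_later = torch.zeros(d_model, d_head); W_in_later[:64, :] = torch.eye(64)
read = stream @ W_in_later
```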
The original token embeddings, as well as the unembeddings, mostly interact with a relatively small fraction of the dimensions. This leaves most dimensions "free" for other layers to store information in.
It seems like we should expect residual stream bandwidth to be in very high demand! There are generally far more "computational dimensions" (such as neurons and attention head result dimensions) than the residual stream has dimensions to move information. Just a single MLP layer typically has four times more neurons than the residual stream has dimensions.
So, for example, at layer 25 of a 50 layer transformer, the residual stream has 100 times more neurons before it than it has dimensions, trying to communicate with 100 times as many neurons as it has dimensions after it, somehow communicating in superposition! We call tensors like this "bottleneck activations" and expect them to be unusually challenging to interpret. (This is a major reason why we will try to pull apart the different streams of communication happening through the residual stream in terms of virtual weights, rather than studying it directly.)
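The arithmetic behind the "100 times" figure, as a quick sketch (the width of 1024 is just an assumed example):

```python
d_model = 1024                        # assumed residual stream width
mlp_neurons_per_layer = 4 * d_model   # the typical 4x MLP expansion mentioned above
layers_before = 25                    # layer 25 of a 50-layer model

neurons_before = layers_before * mlp_neurons_per_layer
print(neurons_before // d_model)      # 100 -> 100x more neurons before it than dimensions
```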
Perhaps because of this high demand on residual stream bandwidth, we've seen hints that some MLP neurons and attention heads may perform a kind of "memory management" role, clearing residual stream dimensions set by other layers by reading in information and writing out the negative version.