Efficient Streaming Language Models with Attention Sinks
This September 2023 paper addresses a crucial challenge in deploying Large Language Models (LLMs) for streaming applications that require long interactions.
Key Challenges:
Caching Key and Value (KV) states of previous tokens during decoding consumes significant memory.
LLMs have limited ability to generalize to longer texts than their training sequence length.
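To give a sense of scale for the first challenge, the arithmetic below estimates KV cache size. It is only a rough sketch: the configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values) is an assumed, illustrative shape for a 7B-class decoder and is not taken from the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed, illustrative configuration (not from the paper).
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
at_4k = kv_cache_bytes(4096, num_layers=32, num_kv_heads=32, head_dim=128)

print(per_token / 2**20, "MiB per cached token")   # 0.5
print(at_4k / 2**30, "GiB for a 4096-token cache")  # 2.0
```

Under these assumptions the cache grows linearly with the number of tokens kept, which is why an unbounded stream cannot simply keep everything.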
The authors explore existing approaches like dense attention, window attention, and sliding window with re-computation. They highlight the limitations of these methods in terms of memory usage, latency, and performance degradation when the sequence length exceeds the pre-training attention window size.
The paper introduces an interesting observation called the "attention sink."
They find that LLMs allocate a large amount of attention score to the initial tokens, regardless of their relevance to the language modeling task.
This phenomenon is attributed to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Even when the current query doesn't have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere. The initial tokens become attention sinks because they are visible to almost all subsequent tokens due to the autoregressive nature of language modeling.
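The Softmax constraint described above can be illustrated with a toy example. The logits are made-up values chosen so that the first token acts as a sink; they are not measurements from any model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: the first token has a large score (the "sink"),
# the remaining tokens are all weak matches for the current query.
logits = [4.0, 0.1, -0.2, 0.0, 0.3]

with_sink = softmax(logits)         # weights always sum to one
without_sink = softmax(logits[1:])  # same query after the first token is evicted

print([round(w, 3) for w in with_sink])
print([round(w, 3) for w in without_sink])
```

Because the weights must still sum to one, dropping the sink forces the attention it was absorbing onto the remaining tokens, so the distribution the model sees changes sharply. This is a simplified picture of why evicting initial tokens can destabilise window attention.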
Based on the attention sink insight, the authors propose StreamingLLM, a framework that enables LLMs trained with a finite attention window to work on infinitely long text without fine-tuning.
StreamingLLM keeps the attention sink tokens' KV (just 4 initial tokens) together with the sliding window's KV to anchor the attention computation and stabilise the model's performance.
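The cache policy described above can be sketched as a small data structure. This is a toy illustration, not the authors' implementation: `SinkCache` and its methods are hypothetical names, and integers stand in for the per-token key/value tensors.

```python
from collections import deque


class SinkCache:
    """Toy cache: keeps the first `num_sinks` entries plus the most recent `window` entries."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # attention sink entries, never evicted
        self.recent = deque(maxlen=window)  # rolling window; oldest entry drops automatically

    def append(self, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        return self.sinks + list(self.recent)

    def positions(self):
        # Positions are assigned by slot in the cache, not by index in the original stream.
        return list(range(len(self.sinks) + len(self.recent)))


cache = SinkCache(num_sinks=4, window=4)
for token_index in range(20):  # integers stand in for per-token KV tensors
    cache.append(token_index)

print(cache.entries())    # [0, 1, 2, 3, 16, 17, 18, 19]
print(cache.positions())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The cache size stays constant no matter how long the stream runs. Assigning positions relative to the cache rather than the original text follows the paper's description; everything else here is simplified.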
The paper demonstrates that StreamingLLM enables models like Llama-2, MPT, Falcon, and Pythia to reliably model up to 4 million tokens and potentially more.
Compared to the sliding window with re-computation baseline, StreamingLLM achieves up to 22.2× speedup, making it suitable for streaming applications.
The authors further confirm their attention sink hypothesis by showing that language models can be pre-trained to require only a single attention sink token for streaming deployment.
By adding an extra learnable token at the beginning of all training samples as a designated attention sink, they demonstrate that pre-trained 160-million parameter language models can maintain performance in streaming cases with just this single sink token.
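On the data side, the idea amounts to prepending one reserved token to every training sample. The sketch below assumes samples are lists of token ids; `SINK_TOKEN_ID` and `prepend_sink` are hypothetical names, and the id value is arbitrary.

```python
SINK_TOKEN_ID = 50000  # hypothetical id reserved for the learnable sink token


def prepend_sink(samples, sink_id=SINK_TOKEN_ID):
    """Return new samples, each starting with the dedicated sink token."""
    return [[sink_id] + list(sample) for sample in samples]


batch = prepend_sink([[5, 6, 7], [8, 9]])
print(batch)  # [[50000, 5, 6, 7], [50000, 8, 9]]
```

The token's embedding is then learned during pre-training like any other, so the model has a consistent place to put surplus attention.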
Conclusion:
This paper introduces a novel concept of attention sinks in LLMs and proposes StreamingLLM, an efficient framework for deploying LLMs in streaming applications. The authors provide valuable insights into the behavior of attention in LLMs and demonstrate the effectiveness of their approach on various models. The attention sink phenomenon and the StreamingLLM framework have significant implications for the practical use of LLMs in long-sequence generation tasks.
The paper is well-structured, and the experiments are comprehensive, covering a range of models and showcasing the performance gains achieved by StreamingLLM. The attention sink visualization and the pre-training with a designated attention sink token further strengthen the paper's contributions.
Overall, this work addresses a critical challenge in deploying LLMs for streaming applications and offers a practical solution that can greatly enhance the efficiency and performance of LLMs in long-sequence generation tasks.
The related work section of the "Efficient Streaming Language Models with Attention Sinks" paper focuses on three main areas of research in applying Large Language Models (LLMs) to lengthy texts: Length Extrapolation, Context Window Extension, and Improving LLMs' Utilization of Long Text. The authors highlight that progress in one direction does not necessarily lead to progress in the others.
Length extrapolation aims to enable language models trained on shorter texts to handle longer ones at test time. Work in this area focuses primarily on developing relative position encoding methods for Transformer models.
Rotary Position Embeddings (RoPE) (Su et al., 2021)
RoPE transforms the queries and keys in every attention layer to integrate relative position information.
However, subsequent research (Press et al., 2022; Chen et al., 2023) indicates that RoPE underperforms on text that exceeds the training window.
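The relative-position property of RoPE can be checked numerically with a minimal sketch. It rotates consecutive pairs of vector components by position-dependent angles, using the commonly cited base of 10000. The vectors and positions are arbitrary illustrative values, and `rope_rotate` is a hypothetical helper rather than any library's API.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE-style)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]  # arbitrary query vector
k = [1.1, 0.4, -0.6, 0.9]  # arbitrary key vector

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 10), rope_rotate(k, 7))
score_far = dot(rope_rotate(q, 103), rope_rotate(k, 100))
print(abs(score_near - score_far) < 1e-9)  # True
```

The attention score depends only on the offset between query and key. The difficulty with long inputs is that offsets larger than any seen during training produce rotation angles the model has not learned to interpret.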
ALiBi (Press et al., 2022)
ALiBi biases the query-key attention scores based on their distance, introducing relative positional information.
While ALiBi shows improved extrapolation, tests on MPT models reveal a breakdown when the text length is vastly greater than the training length.
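As a rough sketch of the ALiBi bias, the code below builds the distance penalty added to the query-key scores for one head. The slope schedule is the geometric sequence commonly used for power-of-two head counts; `alibi_slopes` and `alibi_bias` are hypothetical helper names.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ... (assumes a power-of-two head count)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for visible keys j <= i; future keys are masked."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

print(slopes[0], slopes[-1])  # 0.5 0.00390625
print(bias[3])                # [-1.5, -1.0, -0.5, -0.0]
```

The penalty grows linearly with distance, so distant keys are progressively down-weighted without any learned position embedding.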
Despite these efforts, current methodologies have not achieved infinite length extrapolation, making existing LLMs unfit for streaming applications.
Context window extension focuses on expanding the LLMs' context window so that more tokens can be processed in a single forward pass. The main challenge is the quadratic complexity of attention computation during training, which is costly in both compute and memory.
FlashAttention (Dao et al., 2022; Dao, 2023) accelerates attention computation and reduces memory footprint.
Approximate attention methods (Zaheer et al., 2020b; Beltagy et al., 2020; Wang et al., 2020; Kitaev et al., 2020):
These methods trade model quality for efficiency by approximating the attention computation.
Extending pre-trained LLMs with RoPE (Chen et al., 2023; kaiokendev, 2023; bloc97, 2023; Peng et al., 2023):
These approaches involve position interpolation and fine-tuning to extend the context window of pre-trained LLMs.
However, these techniques only extend LLMs' context window to a limited extent, which is insufficient for handling limitless inputs.
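The position interpolation idea mentioned among the RoPE extension approaches can be sketched as a rescaling of position indices so they stay inside the range seen during training. This is a simplified illustration that assumes linear interpolation; `interpolate_positions` is a hypothetical helper, and the lengths are example values.

```python
def interpolate_positions(seq_len, train_len):
    """Linearly squeeze positions 0..seq_len-1 into the trained range [0, train_len)."""
    scale = min(1.0, train_len / seq_len)  # leave short sequences untouched
    return [p * scale for p in range(seq_len)]


positions = interpolate_positions(seq_len=8192, train_len=4096)
print(positions[:4])   # [0.0, 0.5, 1.0, 1.5]
print(max(positions))  # 4095.5
```

The rescaled positions are fed to the rotary embedding in place of the integer indices, usually followed by fine-tuning. The extended window is still finite, which is the limitation noted above.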
Improving LLMs' utilization of long text aims to optimize LLMs to better capture and use the content within the context rather than merely accepting it as input. Liu et al. and Li et al. highlight that success in the two previously mentioned directions does not necessarily translate into competent use of lengthy contexts. Making effective use of long contexts within LLMs remains a challenge.
The StreamingLLM framework primarily falls under the Length Extrapolation category, where LLMs are applied to text significantly exceeding the pre-training window size, potentially even of infinite length. The authors do not focus on expanding the attention window size of LLMs or enhancing the model's memory and usage on long texts. Instead, they concentrate on stably harnessing the most recent tokens, enabling the seamless streaming application of LLMs.
In summary, the related work section provides an overview of the current research landscape in applying LLMs to lengthy texts. It highlights the limitations of existing approaches and positions the StreamingLLM framework as a novel solution for enabling LLMs to handle infinite-length inputs, which is essential for streaming applications.