2024-04-15: Infini-Attention
LLMs with Infinite Memory
🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.
Here’s today at a glance:
⚠️ Infini-Attention
Google designed a memory system for transformers that allows them to store state. This compressive memory system scales better for very long sequences: it maintains a fixed number of parameters to store and recall information with bounded storage and computation costs. New information is added to the memory by changing its parameters, with the objective that this information can be recovered later.
What's happening is that a summary of the information to date (a compressed version of it) is kept in the memory store, so the model does not have to recompute over the full history for each successive token. This potentially breaks transformers out of the quadratic growth of computation with sequence length.
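To make the idea concrete, here is a minimal sketch of a fixed-size associative memory, my own illustration rather than the paper's exact formulation: writing folds a (key, value) pair into a parameter matrix via an outer product, and reading approximately recovers the value, so storage stays constant no matter how many tokens have been seen.

```python
import numpy as np

d_k, d_v = 64, 64
memory = np.zeros((d_k, d_v))   # bounded storage, independent of sequence length
norm = np.zeros(d_k)            # running normalization term

def write(key, value):
    """Fold a new (key, value) pair into the fixed-size memory by changing its parameters."""
    global memory, norm
    memory += np.outer(key, value)   # nothing is appended; the same matrix is updated
    norm += key                      # keys assumed non-negative features (e.g. ELU + 1), as in linear attention

def read(query):
    """Approximately recover the value whose key matches the query."""
    return (query @ memory) / (query @ norm + 1e-6)
```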
This system is similar in some ways to other non-Transformer architectures, such as Mamba, RWKV, etc.
Anyway, we should all be thankful that Google is publishing again!
Paper Title: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Who:
Researchers at Google: Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal.
Why:
Motivation: Traditional Transformer-based Large Language Models (LLMs) have limited context windows due to the quadratic complexity of the attention mechanism, making it challenging to handle long sequences efficiently.
Gaps addressed: The research aimed to address the limitations of existing LLMs in processing long sequences and the high memory and computational costs associated with scaling them.
Impact: This study has the potential to enable LLMs to efficiently handle much longer contexts, opening up possibilities for various applications, including long-form document understanding, summarization, and question answering.
How:
Infini-attention: The researchers introduce a novel attention mechanism called Infini-attention that combines a compressive memory with the standard attention mechanism.
Compressive Memory: This memory efficiently stores key-value pairs from previous segments of the input sequence, allowing the model to access a much larger context window without significantly increasing memory usage.
Figure: Infini-attention has an additional compressive memory with linear attention for processing infinitely long contexts. {KV}_{s−1} and {KV}_s are the attention keys and values for the current and previous input segments, respectively, and Q_s the attention queries. PE denotes position embeddings.
Local and Global Attention: Infini-attention utilizes both local attention within a segment and global attention across segments retrieved from the compressive memory, enabling the model to capture both short and long-range dependencies effectively.
Figure: Infini-Transformer (top) retains the entire context history, whereas Transformer-XL (bottom) discards old contexts since it caches the KV states for the last segment only.
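To illustrate how the local and memory paths might be combined, here is a rough NumPy sketch of the mechanism as described above. It is an assumption-laden reading, not the paper's code: single head, no batching, and the helper names (local_attention, retrieve_from_memory, the scalar gate beta) are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigma(x):
    # ELU + 1 feature map, the non-negative mapping typically used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def local_attention(Q, K, V):
    # standard causal dot-product attention within the current segment
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # causal mask
    return softmax(scores) @ V

def retrieve_from_memory(Q, M, z):
    # linear-attention read from the compressive memory of past segments
    q = sigma(Q)
    return (q @ M) / (q @ z[:, None] + 1e-6)

def infini_attention(Q, K, V, M, z, beta):
    A_local = local_attention(Q, K, V)       # short-range: current segment
    A_mem = retrieve_from_memory(Q, M, z)    # long-range: past segments via memory
    gate = 1.0 / (1.0 + np.exp(-beta))       # learned scalar gate blending the two
    return gate * A_mem + (1.0 - gate) * A_local
```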
Datasets: The researchers evaluated their approach on long-context language modeling benchmarks like PG19 and Arxiv-math, as well as tasks like passkey retrieval and book summarization.
Models: They experimented with different sizes of LLMs (1B and 8B parameters) and compared their approach with baseline models like Transformer-XL and Memorizing Transformers.
What did they find:
Improved Performance: Infini-Transformer models achieved better perplexity scores on long-context language modeling benchmarks compared to baseline models, demonstrating their effectiveness in capturing long-range dependencies.
Memory Efficiency: The compressive memory in Infini-attention enabled significant memory savings, achieving up to a 114x compression ratio compared to models with explicit memory storage.
Length Generalization: Infini-Transformers exhibited impressive length generalization capabilities, successfully solving tasks with input lengths of up to 1 million tokens even when trained on shorter sequences.
State-of-the-art Summarization: An 8B LLM with Infini-attention achieved new state-of-the-art results on the BookSum dataset for long-form book summarization, highlighting the potential for real-world applications.
What are the limitations and what's next:
Limitations: The paper does not provide a detailed analysis of the computational complexity of Infini-attention during training and inference, which could be important for practical deployment.
Future Research: Exploring alternative compressive memory techniques, investigating the trade-off between memory compression and model performance, and applying Infini-attention to other tasks beyond language modeling.
Why it matters:
Significant Contribution: This research proposes a novel and efficient approach to enable LLMs to handle infinitely long contexts with bounded memory and computation, addressing a critical limitation of current models.
Potential Impact: Infini-attention has the potential to revolutionize the field of natural language processing by allowing LLMs to process and understand much longer sequences of text, leading to advancements in various applications such as document summarization, question answering, and machine translation.
Additional notes:
This research paper has been submitted to a conference or journal for publication.
The code and models are not yet publicly available.
TL;DR: Slice the input into segments and train on them one by one. The attention attends to the current segment and to a "memory matrix" updated from previous segments' KVs, forcing the model to store information for future use.
Implementation is easy:
1. A simple dot-product attention is applied within each segment.
2. Before moving to the next segment, that segment's KVs are used to update the memory matrix with a simple linear update.
3. This forces the memory matrix to gradually accumulate information from all segments seen so far.
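Continuing the sketch from above (and reusing its sigma and infini_attention helpers), the segment loop from steps 1-3 might look roughly like this; it shows only the simple linear update, and the function names are again my own, not the paper's.

```python
import numpy as np

def update_memory(M, z, K, V):
    # step 2: fold the just-processed segment's KVs into the memory with a linear update
    k = sigma(K)
    M = M + k.T @ V           # step 3: the matrix gradually accumulates all segments seen so far
    z = z + k.sum(axis=0)     # running normalization term
    return M, z

def process_sequence(segments, beta, d_model):
    M = np.zeros((d_model, d_model))   # fixed-size compressive memory (here d_k = d_v = d_model)
    z = np.zeros(d_model)
    outputs = []
    for Q, K, V in segments:
        # step 1: dot-product attention within the segment, plus a read from memory
        outputs.append(infini_attention(Q, K, V, M, z, beta))
        # update the memory before moving on to the next segment
        M, z = update_memory(M, z, K, V)
    return np.concatenate(outputs, axis=0)
```

The key point is that M and z have fixed shapes, so memory and compute per segment stay constant no matter how long the full sequence gets.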
Some impressive results:
1. 1B model solves the passkey retrieval task at 1M tokens context length.
2. 8B model reached SOTA results on book summarization with 500K-token-long inputs.
Thoughts: (my own opinions, might be wrong)
0. This is basically an improvement of Transformer-XL:
- In Transformer-XL you simply reuse the previous segment's KVs on each segment, hoping they compress the history.
- Here you update a separate special matrix based on all previous segments but attend to it only from the current one.
- Clever & feels like "the natural extension" of self-attention.
1. Unlike many other attempts at infinite context, this one is easy to try or even extend a pretrained model with.
2. The results are very, very impressive: smaller models struggle with longer contexts more than larger models do, so a 1B model extrapolating to 1M context is VERY impressive.
3. BUT. Unpopular opinion: My own gut feeling is that transformers should extrapolate out of the box. There is something we don't understand about the influence of causal masks on transformers in my opinion. (I might be wrong)
4. Another unpopular opinion: We would never be able to fully remove the everything-to-everything multiplication of transformers. You can never know "in advance" what to "remember" (compress) when generating because your number of parameters is limited while the context is not. (But we might be able to approximate it to such an extent that practically the result would be the same)
5. This might be "it". Transformer-XL always felt different but it was never widely used. This might be "the missing piece" for making its ideas reach their full potential.
Really cool paper! I hope someone implements it soon, since I've got no time to try it myself at the moment.
🌠 Enjoying this edition of Emergent Behavior? Send this web link to a friend to help spread the word of technological progress and positive AI to the world!
🖼️ AI Artwork Of The Day
They Won! - U/armand_roulinn
That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world: