2024-04-12: Matrix Minds: RNNs Level Up

🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.

Here’s today at a glance:

💊 Matrix Minds: RNNs Level Up

The Transformer architecture, the bedrock of AI since 2017, is starting to show its limitations. While undeniably powerful, its reliance on self-attention creates a critical bottleneck: compute and memory scale quadratically with input sequence length. Doubling the length of a piece of text, code, or any other sequential input roughly quadruples the attention cost in compute, energy, and time.
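
To make that scaling concrete, here is a minimal NumPy sketch of single-head self-attention (illustrative only, ignoring the causal mask and batching): every token's query is scored against every token's key, so the intermediate score matrix, and with it the compute, grows with the square of the sequence length.

```python
import numpy as np

def naive_self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over n token embeddings.

    x: (n, d) inputs. The score matrix is (n, n), so this step's compute
    and memory grow quadratically with the sequence length n.
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v               # (n, d) each
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (n, n)  <-- quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # (n, d)

# Doubling n roughly quadruples the work (and memory) spent on `scores`.
n, d = 1024, 64
x = np.random.randn(n, d)
W_q, W_k, W_v = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
print(naive_self_attention(x, W_q, W_k, W_v).shape)   # (1024, 64)
```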

Enter Recurrent Neural Networks (RNNs), an older architecture that had fallen out of favor due to challenges in training and handling long sequences. RNNs are now competing for the post-Transformer crown with state space models like Mamba and with attention refinements like FlashAttention and Infini-Attention that extend the Transformer itself. Two notable advancements in RNN architectures are RWKV’s Eagle/Finch and Google’s Hawk/Griffin. Both approaches tackle the efficiency and performance issues that previously hindered RNNs:

  • Tackling the Efficiency Bottleneck: RNNs traditionally suffered from a sequential bottleneck, i.e., the computation for one token had to finish before the next token's could begin, so work could not be parallelized across the sequence. Both new architectures employ clever techniques to parallelize computation and optimize memory usage.

    • Eagle/Finch utilizes custom CUDA implementations and explores parallelization methods like the associative scan (see the sketch after this list).

    • Hawk/Griffin leverages model parallelism and custom kernels for efficient training on TPUs. These optimizations significantly improve training and inference speeds, making RNNs competitive with Transformers.

  • Conquering Long Sequences: The Achilles' heel of traditional RNNs is the vanishing gradient problem: as an RNN works through a long sequence, the influence of early tokens on its hidden state decays, so it effectively forgets the beginning of the text. The new architectures address this with several innovations.

    • Eagle/Finch uses a limited time-step window in its recurrence mechanism, enabling it to handle unbounded sequence lengths.

    • Hawk/Griffin employs fixed-size hidden states and local attention, avoiding the quadratic memory growth of Transformers.

  • Additionally, both architectures demonstrate a remarkable ability to extrapolate to longer sequences than they were trained on.
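
To see why a recurrence can be parallelized at all, here is a minimal sketch of the associative-scan idea mentioned above, applied to a generic diagonal linear recurrence h_t = a_t · h_{t-1} + b_t rather than either paper's exact update: composing two such steps yields another step of the same form, the composition is associative, and a parallel prefix scan can therefore compute every hidden state in O(log n) parallel rounds instead of n strictly sequential ones.

```python
import numpy as np

def combine(step1, step2):
    """Compose two recurrence steps (a1, b1) then (a2, b2).

    Applying them in order maps h -> a2 * (a1 * h + b1) + b2, which is the
    single step (a2 * a1, a2 * b1 + b2). This composition is associative,
    which is what lets a parallel scan regroup the work.
    """
    a1, b1 = step1
    a2, b2 = step2
    return a2 * a1, a2 * b1 + b2

def sequential_scan(a, b, h0=0.0):
    """Reference loop: h_t = a_t * h_{t-1} + b_t, one token at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def parallel_scan(a, b, h0=0.0):
    """Recursive-doubling inclusive scan over (a_t, b_t) pairs.

    The inner loop is written sequentially here, but each `dist` round is
    independent work per position, so on a GPU/TPU the whole scan runs in
    O(log n) rounds rather than n dependent steps.
    """
    a, b = a.astype(float), b.astype(float)
    n, dist = len(a), 1
    while dist < n:
        prev_a, prev_b = a.copy(), b.copy()
        for t in range(dist, n):
            a[t], b[t] = combine((prev_a[t - dist], prev_b[t - dist]),
                                 (prev_a[t], prev_b[t]))
        dist *= 2
    return a * h0 + b   # element t now encodes h_t as a function of h_0

rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, 16), rng.normal(size=16)
print(np.allclose(sequential_scan(a, b), parallel_scan(a, b)))  # True
```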

These advancements have propelled RNNs back into the spotlight, demonstrating their potential for various applications:

  • Long Document Processing: Analyzing lengthy texts, legal documents, or codebases becomes feasible with RNNs' efficient handling of long sequences.

  • Real-Time Applications: RNNs' lower latency and higher throughput during inference make them suitable for real-time tasks like chatbots and live translation (see the sketch after this list).

  • Resource-Constrained Environments: RNNs' lower computational requirements make them attractive for deployment on edge devices or in situations with limited resources.
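
The real-time point above comes down to what must be carried between generated tokens: a recurrent model updates a fixed-size state, while a Transformer appends to a key-value cache that grows with the sequence. A toy contrast (the rnn_step and attention_step functions are illustrative stand-ins, not either paper's code):

```python
import numpy as np

d = 64  # hidden size (illustrative)

def rnn_step(state, x):
    """Toy recurrent update: the carried state stays a fixed (d,) vector."""
    return np.tanh(0.9 * state + x)

def attention_step(kv_cache, x):
    """Toy attention update: the cache gains one (d,) entry per generated token."""
    kv_cache.append(x)
    keys = np.stack(kv_cache)               # (t, d) and growing
    scores = keys @ x
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the whole history
    return weights @ keys                   # (d,)

state, cache = np.zeros(d), []
for _ in range(1000):
    x = np.random.randn(d) * 0.1
    state = rnn_step(state, x)              # memory stays O(d) per token
    attention_step(cache, x)                # memory grows to O(t * d)

print(state.shape, len(cache))              # (64,) 1000
```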

I’ve been working with Recursal AI, the corporate entity founded to champion the RWKV architecture, and helping to shepherd the team through the early stages of product market fit. These posts are my effort to get up to speed on what they’ve been up to.

🗒️ The Eagle/Finch Paper

Title

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Who

A collaborative effort between researchers from the RWKV Project (under Linux Foundation AI & Data), EleutherAI, and various universities, including Ohio State University, the University of California Santa Barbara, and the University of British Columbia.

Why

The research aims to address the limitations of traditional transformer architectures, specifically their quadratic time complexity with respect to input sequence length. The goal is to improve the efficiency and performance of LLMs, making them more suitable for tasks involving long sequences.

How

  • Experiment Design: The researchers developed two new RNN architectures: Eagle (RWKV-5) and Finch (RWKV-6).

  • Key Variables & Models: The study focuses on the impact of multi-headed matrix-valued states and a dynamic recurrence mechanism on LLM performance.

  • Datasets: A new multilingual corpus with 1.12 trillion tokens, "RWKV World v2", was created to train the models.

  • Techniques & Innovations: The research introduces data-dependent functions for the time-mixing and token-shift modules, and uses low-rank adaptation (LoRA)-style projections to compute context-dependent weight adjustments cheaply (sketched below).
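
As a loose illustration of the last bullet (not the exact RWKV-6 equations; the parameter names and sizes below are invented for the example), a data-dependent token shift blends the previous and current token embeddings with per-channel weights that are themselves computed from the data through a cheap LoRA-style low-rank projection:

```python
import numpy as np

d, r = 64, 16  # model width and low-rank bottleneck (illustrative sizes)
rng = np.random.default_rng(0)

lam = rng.normal(scale=0.1, size=d)        # learned base mixing vector
A = rng.normal(scale=0.1, size=(d, r))     # low-rank down-projection
B = rng.normal(scale=0.1, size=(r, d))     # low-rank up-projection

def lora_mix(x):
    """Context-dependent mixing coefficients via a LoRA-style low-rank MLP."""
    return lam + np.tanh(x @ A) @ B        # (d,)

def data_dependent_token_shift(x_prev, x_curr):
    """Blend previous and current token embeddings channel by channel.

    In a plain token shift the blend weights are fixed learned parameters;
    the data-dependent twist sketched here is that the weights come from
    the tokens themselves, via the low-rank projection above.
    """
    mix = 1.0 / (1.0 + np.exp(-lora_mix(x_curr)))   # squash to (0, 1)
    return mix * x_curr + (1.0 - mix) * x_prev

x_prev, x_curr = rng.normal(size=d), rng.normal(size=d)
print(data_dependent_token_shift(x_prev, x_curr).shape)  # (64,)
```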

What did they find

  • Eagle and Finch models achieve competitive performance on various benchmarks, including multilingual and English-focused language tasks, associative recall, music modeling, and vision-language tasks.

  • Finch demonstrates exceptional accuracy on multi-query associative recall (MQAR), a benchmark that tracks in-context learning ability and is therefore a key test for new architectures, surpassing other non-transformer architectures (a toy example of the task format appears below).

  • Both architectures show an improved loss on long sequence tasks compared to RWKV-4.

Eagle and Finch extrapolating to 100k tokens for free! Figure 5: Loss along sequence offset for 3B RWKV-4 World, Eagle and Finch on PG19 dataset. All models were pretrained with context length 4096.
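
For a sense of what MQAR measures: the model is shown a list of key-value bindings and is later queried on several of the keys, so answering correctly requires recalling the matching values from earlier in the context. A toy generator for that kind of prompt (the format is illustrative, not the benchmark's exact tokenization):

```python
import random

def make_mqar_example(num_pairs=6, num_queries=3, seed=0):
    """Build a toy multi-query associative recall (MQAR) prompt.

    The model sees `key value` bindings, then several `key ?` queries, and
    must answer each query with the value bound to that key in context.
    """
    rng = random.Random(seed)
    keys = rng.sample("ABCDEFGHJKLMNP", num_pairs)
    values = rng.sample(range(10, 99), num_pairs)
    bindings = dict(zip(keys, values))
    queried = rng.sample(keys, num_queries)

    prompt = " ".join(f"{k} {v}" for k, v in bindings.items())
    prompt += " | " + " ".join(f"{k} ?" for k in queried)
    answers = [bindings[k] for k in queried]
    return prompt, answers

prompt, answers = make_mqar_example()
print(prompt)   # e.g. "L 73 A 42 ... | A ? L ? ..."
print(answers)  # the values the model is expected to recall
```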

What are the limitations and what's next

  • Limitations: The models struggle with embedding tasks and exhibit some ChatGPT-like behavior due to the training data containing synthetic data from GPT-3.5 and ChatGPT.

  • Future Work: Expanding the training corpus size and diversity, training larger versions of Finch, and exploring Mixture of Experts (MoE) for further efficiency gains are planned.

Why it matters

This research demonstrates the potential of RNNs as a competitive alternative to transformers in LLMs, offering comparable performance while maintaining efficient inference and training. The development of Eagle and Finch architectures, along with the release of pre-trained models and an open-source training pipeline, contributes to the advancement of more efficient and accessible AI models.

Additional Notes

  • The models and training code are available on Hugging Face and GitHub under the Apache 2.0 license.

Disclosure: Entities affiliated with Ate-A-Pi have commercial advisory relationships with entities affiliated with the researchers mentioned above.

🗒️ The Hawk/Griffin Paper

Title

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Who

Researchers from Google DeepMind, led by Soham De and Samuel L. Smith, explore advancements in Recurrent Neural Networks (RNNs) for Language Modeling.

Why

The research aims to address the limitations of Transformer architectures in handling long sequences efficiently due to their quadratic complexity. The goal is to demonstrate that RNNs can achieve comparable or even superior performance while maintaining efficient inference and training.

How

  • Experiment Design: The researchers developed two RNN models: Hawk, a pure RNN with gated linear recurrences, and Griffin, a hybrid model combining gated recurrences with local attention.

  • Key Variables & Models: The study focuses on the Real-Gated Linear Recurrent Unit (RG-LRU) layer and its impact on model performance and efficiency.

  • Datasets: The models were trained on the MassiveText dataset and evaluated on various downstream tasks and synthetic copy/retrieval tasks.

  • Techniques & Innovations: The research introduces the RG-LRU layer, a novel gated linear recurrent layer, and explores its integration with local attention for efficient long-sequence modeling. It also develops a custom Pallas kernel for an efficient TPU implementation (a rough sketch of the recurrence follows this list).
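
As a rough sketch of the kind of update the RG-LRU performs (following the paper's description only loosely; the parameter names, initialization, and constant below are illustrative), each channel keeps a scalar state with a data-dependent decay, gated by the input, with no softmax and no growing cache:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rg_lru_sketch(x, W_input, W_rec, Lambda, c=8.0):
    """Rough sketch of a real-gated linear recurrent unit over a sequence.

    x: (n, d) inputs. Each channel carries a scalar recurrent state with a
    data-dependent decay, so per-token compute and memory stay constant
    regardless of sequence length.
    """
    n, d = x.shape
    a_base = sigmoid(Lambda)                     # per-channel base decay in (0, 1)
    h = np.zeros(d)
    outputs = np.empty_like(x)
    for t in range(n):
        input_gate = sigmoid(x[t] @ W_input)     # which inputs to let in
        rec_gate = sigmoid(x[t] @ W_rec)         # how strongly to decay the state
        a_t = a_base ** (c * rec_gate)           # data-dependent decay per channel
        h = a_t * h + np.sqrt(1.0 - a_t**2) * (input_gate * x[t])
        outputs[t] = h
    return outputs

rng = np.random.default_rng(0)
n, d = 128, 32
x = rng.normal(size=(n, d)) * 0.1
W_input, W_rec = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Lambda = rng.normal(size=d)
print(rg_lru_sketch(x, W_input, W_rec, Lambda).shape)  # (128, 32)
```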

What did they find

  • Hawk and Griffin exhibit power-law scaling between held-out loss and training FLOPs, similar to Transformers, with Griffin achieving lower loss at all model scales.

Figure 1 | a) Hawk, Griffin and our MQA Transformer baseline all show power law scaling between held-out loss and training FLOPs, with Griffin achieving the lowest held-out loss at all FLOPs budgets.

  • Hawk-3B outperforms Mamba-3B on downstream tasks despite using half the training tokens.

  • Griffin-7B and Griffin-14B match the performance of Llama-2 despite being trained on roughly one-seventh as many tokens.

  • Both models exhibit lower latency and higher throughput during inference compared to MQA Transformers, especially on long sequences.

  • Griffin shows better extrapolation capabilities than Transformers on sequences longer than the training set and effectively learns copying and retrieval tasks.

This figure illustrates the models' ability to utilize longer contexts and extrapolate on long sequences compared to Transformer baselines, emphasizing the effectiveness of RNNs for long-range dependencies. Figure 5 | Performance of various 1B parameter models on a held-out evaluation set of books.

What are the limitations and what's next

  • Limitations: Hawk and Griffin perform less well than Transformers on copying and retrieval tasks without fine-tuning, suggesting further research is needed in this area.

  • Future Work: Exploring the use of complex-valued recurrences, investigating different gating mechanisms, and further optimizing the models for different hardware and tasks are potential avenues for future research.

Why it matters

This research provides strong evidence that RNNs, particularly those with gated linear recurrences and local attention, can be a viable and efficient alternative to Transformers for language modeling. Hawk and Griffin demonstrate impressive performance, efficiency, and extrapolation capabilities, offering a promising direction for future LLM development.

🌠 Enjoying this edition of Emergent Behavior? Send this web link with a friend to help spread the word of technological progress and positive AI to the world!

🖼️ AI Artwork Of The Day

Batman Akira Toriyama - u/CALEBr16 from r/midjourney

That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world:
