The Hawk/Griffin Paper

Title

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Who

Researchers from Google DeepMind, led by Soham De and Samuel L. Smith, develop recurrent neural network (RNN) architectures for language modeling.

Why

The research aims to address the limitations of Transformer architectures in handling long sequences efficiently, which stem from global attention's quadratic complexity in sequence length. The goal is to demonstrate that RNNs can match or exceed Transformer performance while remaining efficient to train and fast at inference.

How

  • Experiment Design: The researchers developed two RNN models: Hawk, a pure RNN with gated linear recurrences, and Griffin, a hybrid model combining gated recurrences with local attention.

  • Key Variables & Models: The study focuses on the Real-Gated Linear Recurrent Unit (RG-LRU) layer and its impact on model performance and efficiency.

  • Datasets: The models were trained on the MassiveText dataset and evaluated on various downstream tasks and synthetic copy/retrieval tasks.

  • Techniques & Innovations: The research introduces the RG-LRU layer, a novel gated linear recurrent layer, and explores its integration with local attention for efficient long-sequence modeling (a minimal sketch of the recurrence appears below). It also develops a custom Pallas kernel to run the recurrence efficiently on TPUs.
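
For intuition, here is a minimal, unoptimized sketch of a gated linear recurrence in the spirit of the RG-LRU, written in JAX with `jax.lax.scan`. The paper implements the real layer as a fused Pallas kernel on TPUs; the parameter names, shapes, and structure below are illustrative assumptions, not the authors' code.

```python
import jax
import jax.numpy as jnp

def rg_lru(x, W_r, b_r, W_i, b_i, lam, c=8.0):
    """Sketch of a real-gated linear recurrence over a sequence x of shape (seq_len, dim).

    Assumed shapes: W_r, W_i are (dim, dim); b_r, b_i, lam are (dim,).
    """
    r = jax.nn.sigmoid(x @ W_r + b_r)      # recurrence gate r_t, per channel, in (0, 1)
    i = jax.nn.sigmoid(x @ W_i + b_i)      # input gate i_t, per channel, in (0, 1)
    log_a = -jax.nn.softplus(-lam)         # log of a = sigmoid(lam), so a stays in (0, 1)

    def step(h_prev, inputs):
        x_t, r_t, i_t = inputs
        a_t = jnp.exp(c * r_t * log_a)     # effective per-channel decay, a_t = a ** (c * r_t)
        # Gated linear recurrence; the sqrt term keeps the state's scale bounded.
        h_t = a_t * h_prev + jnp.sqrt(1.0 - a_t**2) * (i_t * x_t)
        return h_t, h_t

    h0 = jnp.zeros(x.shape[-1], dtype=x.dtype)
    _, h = jax.lax.scan(step, h0, (x, r, i))
    return h                               # hidden states, shape (seq_len, dim)
```

In Griffin, blocks built around a recurrence like this are interleaved with local (sliding-window) attention, so the recurrent state carries long-range information while attention handles precise mixing within a window.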

What did they find

  • Hawk and Griffin exhibit power-law scaling between held-out loss and training FLOPs, similar to Transformers, with Griffin achieving lower loss at all model scales.

(Figure 1a: held-out loss vs. training FLOPs for Hawk, Griffin, and the MQA Transformer baseline.)

  • Hawk-3B outperforms Mamba-3B on downstream tasks despite using half the training tokens.

  • Griffin-7B and Griffin-14B match the performance of Llama-2 while being trained on roughly 7 times fewer tokens.

  • Both models exhibit lower latency and higher throughput during inference than MQA Transformers, especially on long sequences (a rough memory comparison appears below).

  • Griffin extrapolates better than Transformers to sequences longer than those seen during training and effectively learns copying and retrieval tasks.

Figure 5 (performance of various 1B-parameter models on a held-out evaluation set of books) illustrates the models' ability to use longer contexts and to extrapolate on long sequences compared to Transformer baselines, underscoring the effectiveness of RNNs for long-range dependencies.
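
To see why the inference gains grow with sequence length, a back-of-the-envelope comparison helps: an MQA Transformer must keep and read a key-value cache that grows linearly with the generated sequence, while a gated linear recurrence keeps a fixed-size state per layer. The sizes below are made-up illustrative assumptions, not configurations from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values: two tensors of shape (seq_len, n_kv_heads, head_dim) per layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def recurrent_state_bytes(n_layers, state_dim, bytes_per_elem=2):
    # One fixed-size recurrent state per layer, independent of sequence length.
    return n_layers * state_dim * bytes_per_elem

# Hypothetical 8k-token decode: MQA cache (1 KV head) vs. a fixed recurrent state.
print(kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=1, head_dim=128) / 1e6)  # ~134 MB
print(recurrent_state_bytes(n_layers=32, state_dim=4096) / 1e6)                     # ~0.26 MB
```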

What are the limitations and what's next

  • Limitations: When evaluated without fine-tuning, Hawk and Griffin perform worse than Transformers on copying and retrieval tasks, suggesting further research is needed in this area.

  • Future Work: Exploring the use of complex-valued recurrences, investigating different gating mechanisms, and further optimizing the models for different hardware and tasks are potential avenues for future research.

Why it matters

This research provides strong evidence that RNNs, particularly those with gated linear recurrences and local attention, can be a viable and efficient alternative to Transformers for language modeling. Hawk and Griffin demonstrate impressive performance, efficiency, and extrapolation capabilities, offering a promising direction for future LLM development.
