The Hawk/Griffin Paper

Title

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Who

Researchers from Google DeepMind, led by Soham De and Samuel L. Smith, develop recurrent neural network (RNN) architectures for language modeling.

Why

The research aims to address the limitations of Transformer architectures in handling long sequences efficiently, which stem from global attention's quadratic complexity in sequence length. The goal is to demonstrate that RNNs can match or exceed Transformer performance while remaining efficient to train and fast at inference.

How

  • Experiment Design: The researchers developed two RNN models: Hawk, a pure RNN with gated linear recurrences, and Griffin, a hybrid model combining gated recurrences with local attention.

  • Key Variables & Models: The study focuses on the Real-Gated Linear Recurrent Unit (RG-LRU) layer and its impact on model performance and efficiency.

  • Datasets: The models were trained on the MassiveText dataset and evaluated on various downstream tasks and synthetic copy/retrieval tasks.

  • Techniques & Innovations: The research introduces the RG-LRU layer, a novel gated linear recurrent layer, and explores its integration with local attention for efficient long-sequence modeling (a minimal sketch of the recurrence appears below). It also develops a custom Pallas kernel to run the recurrence efficiently on TPUs.
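
For intuition, here is a minimal, unoptimized sketch of a gated linear recurrence in the spirit of the RG-LRU, written in JAX with `jax.lax.scan`. The paper implements the real layer as a fused Pallas kernel on TPUs; the parameter names, shapes, and structure below are illustrative assumptions, not the authors' code.

```python
import jax
import jax.numpy as jnp

def rg_lru(x, W_r, b_r, W_i, b_i, lam, c=8.0):
    """Sketch of a real-gated linear recurrence over a sequence x of shape (seq_len, dim).

    Assumed shapes: W_r, W_i are (dim, dim); b_r, b_i, lam are (dim,).
    """
    r = jax.nn.sigmoid(x @ W_r + b_r)      # recurrence gate r_t, per channel, in (0, 1)
    i = jax.nn.sigmoid(x @ W_i + b_i)      # input gate i_t, per channel, in (0, 1)
    log_a = -jax.nn.softplus(-lam)         # log of a = sigmoid(lam), so a stays in (0, 1)

    def step(h_prev, inputs):
        x_t, r_t, i_t = inputs
        a_t = jnp.exp(c * r_t * log_a)     # effective per-channel decay, a_t = a ** (c * r_t)
        # Gated linear recurrence; the sqrt term keeps the state's scale bounded.
        h_t = a_t * h_prev + jnp.sqrt(1.0 - a_t**2) * (i_t * x_t)
        return h_t, h_t

    h0 = jnp.zeros(x.shape[-1], dtype=x.dtype)
    _, h = jax.lax.scan(step, h0, (x, r, i))
    return h                               # hidden states, shape (seq_len, dim)
```

In Griffin, blocks built around a recurrence like this are interleaved with local (sliding-window) attention, so the recurrent state carries long-range information while attention handles precise mixing within a window.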

What did they find

  • Hawk and Griffin exhibit power-law scaling between held-out loss and training FLOPs, similar to Transformers, with Griffin achieving lower loss at all model scales.

(Figure 1a: held-out loss vs. training FLOPs for Hawk, Griffin, and the MQA Transformer baseline.)

  • Hawk-3B outperforms Mamba-3B on downstream tasks despite using half the training tokens.

  • Griffin-7B and Griffin-14B match the performance of Llama-2 while being trained on roughly 7 times fewer tokens.

  • Both models exhibit lower latency and higher throughput during inference than MQA Transformers, especially on long sequences (a rough memory comparison appears below).

  • Griffin extrapolates better than Transformers to sequences longer than those seen during training and effectively learns copying and retrieval tasks.

Figure 5 (performance of various 1B-parameter models on a held-out evaluation set of books) illustrates the models' ability to use longer contexts and to extrapolate on long sequences compared to Transformer baselines, underscoring the effectiveness of RNNs for long-range dependencies.
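
To see why the inference gains grow with sequence length, a back-of-the-envelope comparison helps: an MQA Transformer must keep and read a key-value cache that grows linearly with the generated sequence, while a gated linear recurrence keeps a fixed-size state per layer. The sizes below are made-up illustrative assumptions, not configurations from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values: two tensors of shape (seq_len, n_kv_heads, head_dim) per layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def recurrent_state_bytes(n_layers, state_dim, bytes_per_elem=2):
    # One fixed-size recurrent state per layer, independent of sequence length.
    return n_layers * state_dim * bytes_per_elem

# Hypothetical 8k-token decode: MQA cache (1 KV head) vs. a fixed recurrent state.
print(kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=1, head_dim=128) / 1e6)  # ~134 MB
print(recurrent_state_bytes(n_layers=32, state_dim=4096) / 1e6)                     # ~0.26 MB
```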

What are the limitations and what's next

  • Limitations: When evaluated without fine-tuning, Hawk and Griffin perform worse than Transformers on copying and retrieval tasks, suggesting further research is needed in this area.

  • Future Work: Exploring the use of complex-valued recurrences, investigating different gating mechanisms, and further optimizing the models for different hardware and tasks are potential avenues for future research.

Why it matters

This research provides strong evidence that RNNs, particularly those with gated linear recurrences and local attention, can be a viable and efficient alternative to Transformers for language modeling. Hawk and Griffin demonstrate impressive performance, efficiency, and extrapolation capabilities, offering a promising direction for future LLM development.
