VideoMamba

On post-transformer architectures

🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.

Here’s today at a glance:

🎥 VideoMamba

The Shanghai AI Lab has a paper on VideoMamba. Recall that Mamba is the leading state space model architecture, the best explanation of which is:

Transformers are the workhorse of modern sequence modeling, achieving remarkable performance on a variety of tasks, but they have unavoidable inefficiencies. Specifically, the memory and compute used for generating every output token grow linearly with the input length. This means that generating n tokens requires O(n^2) compute, making training with long sequence lengths practically impossible.
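To make that quadratic cost concrete, here is a minimal NumPy sketch (my illustration, not code from the VideoMamba paper or the Mamba repo): the attention score matrix for n tokens has n × n entries, so doubling the sequence length quadruples the work.

```python
# Minimal single-head self-attention in NumPy (illustrative only).
# The score matrix is n x n, so compute and memory for this step
# grow quadratically with the sequence length n.
import numpy as np

def self_attention(x, rng=np.random.default_rng(0)):
    n, d = x.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                      # shape (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                       # shape (n, d)

for n in (256, 1024, 4096):
    self_attention(np.random.default_rng(1).standard_normal((n, 64)))
    print(f"{n} tokens -> {n * n:,} attention scores")
```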

Recently, State Space Models (SSMs) have emerged as a challenger to the Transformer architecture. These models can be interpreted as a type of recurrent neural network (RNN), which uses a fixed-size memory that does not grow with the sequence length. This makes training and inference on long sequences much more efficient, opening up the possibility of feeding extremely long inputs, such as entire libraries, audio samples, or DNA sequences, directly into the model.

Transformers:

  • Take a sequence of tokens as input and map each token to a vector representation with some hidden dimension d

  • Then alternate between token-level operations (the MLPs) and token-mixing operations (the attention layers)

  • For an input of length n, the output of each block is of size d × n

  • If we generate text auto-regressively, token by token, the memory for storing the activations grows linearly with the number of generated tokens (see the sketch after this list)
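As a toy illustration of that last point (again my sketch, not the authors' code), here is autoregressive decoding with a key/value cache: every generated token adds one more cached key and value, so activation memory grows linearly with the output length.

```python
# Toy autoregressive decode loop with a KV cache (illustrative only).
import numpy as np

d = 64
rng = np.random.default_rng(0)
cached_keys, cached_values = [], []

def decode_step(x):
    """Attend from the new token x over every previously cached token."""
    cached_keys.append(x)                    # cache grows by one entry per token
    cached_values.append(x)
    K, V = np.stack(cached_keys), np.stack(cached_values)
    scores = K @ x / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

for step in range(1, 5):
    decode_step(rng.standard_normal(d))
    print(f"after token {step}: {len(cached_keys)} keys and {len(cached_values)} values cached")
```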

State space models:

  • “Compress” their inputs into a fixed-size latent state, instead of performing operations over all previously observed tokens

  • Pass this latent state from one iteration to the next; importantly, it does not grow in size when generating longer sequences

  • Are therefore much more efficient when processing long inputs (see the sketch after this list)
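For contrast, here is the bare linear state-space recurrence (a simplified sketch, not Mamba's selective, hardware-aware version): the hidden state h keeps a fixed size N no matter how long the input gets.

```python
# Plain linear SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# The state h is always N numbers, regardless of sequence length.
import numpy as np

N, d = 16, 64                                # state size, token dimension
rng = np.random.default_rng(0)
A = 0.9 * np.eye(N)                          # state transition
B = 0.01 * rng.standard_normal((N, d))       # input projection
C = 0.01 * rng.standard_normal((d, N))       # output projection

def ssm_scan(xs):
    h = np.zeros(N)                          # fixed-size latent state
    ys = []
    for x in xs:
        h = A @ h + B @ x                    # state update: still only N numbers
        ys.append(C @ h)
    return np.stack(ys)

for n in (1_000, 50_000):
    ssm_scan(rng.standard_normal((n, d)))
    print(f"{n} tokens processed with a state of only {N} numbers")
```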

Now obviously, lower memory usage is going to be particularly useful when it comes to video. Video understanding is already a very useful skill, given the moderation needs of TikTok and Instagram.

In summary, VideoMamba:

  • is scalable in the visual domain

  • is sensitive to short-term action recognition (for example, opening and closing a door)

  • is state-of-the-art at long-term video understanding: it is 6× faster and demands 40× less GPU memory for 64-frame videos than the leading TimeSformer

The VideoMamba team also states, “Due to resource constraints, we have not yet fully validated the scalability of VideoMamba.” One wonders if the US chip embargo is starting to bite in China, especially given their strong focus on image and video AI over language models (too political!).

🌠 Enjoying this edition of Emergent Behavior? Send this web link to a friend to help spread the word of technological progress and positive AI to the world!


🗞️ Things Happen

  • NVIDIA GTC kicks off in San Jose today. Jensen is going to appear on stage with all the Transformer paper authors.

🖼️ AI Artwork Of The Day

Music artists as professional athletes - u/FrankieGS from r/midjourney

That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world.
