Emergent Behavior

Google Wherefore Art Thou

The first GPT-4-beating model

Prakash Ate-A-Pi
February 21, 2024

Here’s today at a glance:

Google Gemini Pro 1.5 Roundup
Things happen
AI artwork of the day

🧬 Google Gemini Pro 1.5 Roundup

Just to be clear, Gemini Pro 1.5 (hereafter referred to as “Gemini”) is not Bard and is also not Gemini Ultra, the subscription ChatGPT competitor launched last week. Instead, Gemini is currently a developer preview API of a very competent multi-modal language model.

TLDR

Gemini can take in lots of data—text, audio, and video—into its context window
Can retrieve single facts from the data near flawlessly
Can retrieve and make sense of multiple facts from data quite well
Given its strength in long context, you can give it lots of examples of how to do something, and its performance increases correspondingly
It’s better than GPT-4 for many business use cases
It’s slow (up to 90 seconds for inference)
Most of its strengths and weaknesses stem from the long context

Let’s examine Gemini in detail below:

Long Context - Needle in a Haystack

Gemini has near-perfect recall in needle-in-a-haystack tests of up to 10 million tokens.

What does this mean?

You can drop an easter egg, like “Captain Picard loves Klingon marmalade,” into text the length of 120 paperback novels, upload to Gemini, and it will be able to find the egg and respond to queries like “What kind of marmalade does Captain Picard like?”

Now this is not actually a terribly useful skill, given you could use a simple keyword search more efficiently, but it has become a marketing metric for language models, and so we must take into account Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”
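To make the test concrete, here is a minimal sketch of a needle-in-a-haystack setup, along with the plain keyword search that solves it without any model. The filler sentence, the function names, and the haystack size are illustrative assumptions, not the harness Google or anyone else actually used.

```python
from typing import Optional


def build_haystack(filler: str, needle: str, depth: float, repeats: int) -> str:
    """Repeat a filler sentence and bury the needle at a relative depth.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    sentences = [filler] * repeats
    position = int(depth * len(sentences))
    sentences.insert(position, needle)
    return " ".join(sentences)


def keyword_search(haystack: str, keyword: str, window: int = 60) -> Optional[str]:
    """Return the text around the first occurrence of keyword, or None if absent."""
    index = haystack.find(keyword)
    if index == -1:
        return None
    start = max(0, index - window)
    return haystack[start:index + len(keyword) + window]


NEEDLE = "Captain Picard loves Klingon marmalade."

# Hypothetical filler text; a real test would use books or articles.
haystack = build_haystack(
    filler="The quick brown fox jumps over the lazy dog.",
    needle=NEEDLE,
    depth=0.5,
    repeats=10_000,
)

snippet = keyword_search(haystack, "marmalade")
print(snippet)
```

The search finds the egg without a language model, which is the point: single-needle retrieval tells you the model can look things up, not that it can reason over what it found.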

Again, NIAH is a naive evaluation. Recall drops to ~60% when there are 100 "needles".
This is all before long-form question answering that might involve logic and reasoning.
— Han (@HanchungLee)
8:17 AM • Feb 17, 2024
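The multi-needle figure in the tweet is a recall score: needles found divided by needles planted. A rough sketch of that scoring follows, assuming simple substring matching and a mocked model answer; the needle format and the mock answer are made up for illustration and are not real model output.

```python
def needle_recall(needles: list[str], model_answer: str) -> float:
    """Fraction of planted needles that appear verbatim in the model's answer."""
    if not needles:
        raise ValueError("at least one needle is required")
    answer = model_answer.lower()
    found = sum(1 for needle in needles if needle.lower() in answer)
    return found / len(needles)


# 100 hypothetical needles with zero-padded IDs so none is a substring of another.
needles = [f"secret-code-{i:03d}" for i in range(100)]

# Mock answer that recovers only the first 60 needles (not real model output).
mock_answer = ", ".join(needles[:60])

print(needle_recall(needles, mock_answer))  # 0.6
```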

Reasoning Across Long Context

Gemini pulled together a multi-page reasoned proposal from a 350-page rulebook. Now this looks impressive: it is starting to replace the typical analyst’s information-processing role.

What does this mean?

Once upon a time (last 6 months), we built complex methods of loading small 1-page chunks of text into AI context windows and then asked them specific questions. So many developers worked on Retrieval Augmented Generation, optimizing to find and populate exactly the right page(s) of text into that window.

And now, with an increase in long-context intelligence, all of that work... is unnecessary.
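For anyone who skipped the RAG era, the contrast looks roughly like the sketch below: the old way picks the best-matching page and shows the model only that, while the long-context way hands over everything. The word-overlap scoring, the three-page "rulebook", and all function names are simplifying assumptions; real pipelines typically used embedding similarity rather than shared words.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(pages: list[str], question: str, top_k: int = 1) -> list[str]:
    """Rank pages by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        pages,
        key=lambda page: len(question_words & tokenize(page)),
        reverse=True,
    )
    return ranked[:top_k]


def rag_prompt(pages: list[str], question: str, top_k: int = 1) -> str:
    """Retrieval-style prompt: only the best-matching pages go in the context."""
    context = "\n\n".join(retrieve(pages, question, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"


def long_context_prompt(pages: list[str], question: str) -> str:
    """Long-context prompt: every page goes in, no retrieval step."""
    context = "\n\n".join(pages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# A tiny made-up rulebook standing in for the 350-page one.
pages = [
    "Page 1. Players take turns in clockwise order.",
    "Page 2. A player who rolls doubles moves again.",
    "Page 3. The game ends when one player holds every property.",
]
question = "What happens when a player rolls doubles?"

print(rag_prompt(pages, question))
print(long_context_prompt(pages, question))
```

The retrieval version lives or dies by its ranking step. The long-context version trades that fragility for a bigger and slower prompt.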

This performance is equivalent to roughly one standard deviation above human median intelligence, so about 115 IQ on the usual 15-point scale (Ate’s non-scientific dead reckoning).

MBAs, lawyers, doctors, accountants, and junior-level professionals generally are at risk, as is anyone who writes simple reports.

On the bright side, if you are someone who likes consuming reports, or has too much to read and wants someone else to prepare SparkNotes for it, you’re in luck.

MultiModal

Together with its long context, Gemini can accept audio of up to 22 hours and video of up to three hours, and again has near-perfect recall. Now the amazing thing is that this actually seems to work the way it should in practice!

In roughly 90 seconds:

Shit, Google wasn't kidding.
Gemini 1.5 Pro just went straight from a full movie to a summary in seconds.
No transcription, no intermediate steps. Just visual tokens -> summary.
Next up, validating the haystack tests.
— Matt Shumer (@mattshumer_)
12:30 AM • Feb 20, 2024

What does this mean?

You can upload a picture, video, audio, or any number of books and just ask questions, and even if it’s not that smart, Gemini will at least know what you’re talking about.

Long Codebases

Combine the long context with intelligence, and Gemini makes an excellent code navigator.

What does this mean?

At some point, you will just load the documentation into the navigator and use that to tackle the code, instead of spending hours in endless search loops. Someone actually did a proper test of this, and Gemini is very good, superior to GPT-4.

Testing Results

I've put a complex codebase into a single 120K-token prompt, and asked 7 questions to GPT-4 and Gemini 1.5. Here are the results

I'm the author of HVM1, which is currently being updated to HVM2. These are 2 complex codebases that implement a parallel inet runtime; basically, hard compiler stuff. User @SullyOmarr on X, who gained Gemini 1.5 access, kindly offered me a prompt. So, I've concatenated both HVM codebases into a single 120K-token file, and asked 7 questions to both Gemini and GPT-4. Here are the complete results.

Breakdown:

1. Which was based on a term-like calculus, and which was based on raw interaction combinators?

This is basic information, repeated in many places, so it shouldn't be hard. Indeed, both got it right. Tie.

2. How did the syntax of each work? Provide examples.

Gemini got HVM1's syntax perfectly right. It is a familiar, Haskell-like syntax, so, no big deal; but Gemini also understood the logic behind HVM2's raw-inet IR syntax, which is mind-blowing, since it is alien and unlike anything it could've seen during training. The inet sample provided was wrong, though, but that wasn't explicitly demanded (and would be quite AGI level, tbh). GPT-4 got both syntaxes completely wrong and just hallucinated, even though it does well on smaller prompts. I guess the long context overwhelmed it. Regardless, astronomic win for Gemini.

3. How would λf. λx. (f x) be stored in memory, on each? Write an example in hex, with 1 64-bit word per line. Explain what each line does.

Gemini wrote a reasonable HVM1 memdump, which is insane: this means it found the memory-layout tutorial in the comments, learned it, and applied it to a brand new case. The memdump provided IS partially wrong, but, well, it IS partially right! Sadly, Gemini couldn't understand HVM2's memory layout, which would be huge, as there is no tutorial in comments, so that'd require understanding the code. Not there yet. As for GPT-4, it just avoided both questions, and then proceeded to lie about the information not being present (it is). Huge win for Gemini.

4. Which part of the code was responsible for beta-reduction, on both? Cite it.

Gemini nailed the location for HVM1, but hallucinated uglily for HVM2, disappointingly. GPT-4 Turbo avoided answering for HVM1, but provided a surprisingly well-reasoned guess for HVM2. Tie.

5. HVM1 had a garbage collect bug, that isn't present in HVM2. Can you reason about it, and explain why?

Gemini provided a decent response, which means it found, read and understood the comment describing the issue (on HVM1). It didn't provide a deeper reasoning for why it is fixed on HVM2, but that isn't written anywhere and would require deep insight about the system. GPT-4 just bullshitted. Win for Gemini.

6. HVM1 had a concurrency bug, that has been solved on HVM2. How?

Gemini nailed what HVM1's bug was, and how HVM2 solved it. This answer is not written in a single specific location, but can be found in separate places, which means Gemini was capable of connecting information spread far apart in the context. GPT-4 missed the notes completely, and just bullshitted. Win for Gemini.

7. There are many functions on HVM1 that don't have correspondents on HVM2. Name some, and explain why they have been removed.

Gemini answered the question properly, identifying 2 functions that were removed, and providing a good explanation. GPT-4 seems like it was just bullshitting nonsense and got one thing or another right by accident. Also, this was meant to be an easy question (just find a Rust function on HVM1 but not on HVM2), but Gemini answered a "harder interpretation" of the question, and identified an HVM1 primitive that isn't present on HVM2. Clever. Win for Gemini.

Verdict

In the task of understanding HVM's 120K-token codebase, Gemini 1.5 absolutely destroyed GPT-4-Turbo-128K. Most of the questions that GPT-4 got wrong are ones it would get right in smaller prompts, so, the giant context clearly overwhelmed it, while Gemini 1.5 didn't care at all. I'm impressed. I was the first one to complain about how underwhelming Gemini Ultra was, so, credit where credit is due, Gemini 1.5 is really promising. That said, Gemini still can't create a complete mental model of the system, and answer questions that would require its own deeper reasoning, so, no AGI for now; but it is extremely good at locating existing information, making long-range connections and doing some limited reasoning on top of it. This was a quite rushed test too (it is 1am...) so I hope I can make a better one and try it again when I get access to it (Google execs: hint hint)

u/SrPeixinho from Reddit

The above seems to be a pretty big vote of confidence. Gemini isn’t quite there yet, but it seems to have superseded GPT-4 in coding, which is a pretty dramatic development.
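The setup step in that test, packing two codebases into one prompt, is simple enough to sketch. Below is a rough version under stated assumptions: the file names, the `// =====` path headers, and the four-characters-per-token estimate are all illustrative, not what the Reddit author actually ran.

```python
import tempfile
from pathlib import Path


def concatenate_codebase(root: Path, extensions: tuple[str, ...]) -> str:
    """Join every matching source file under root into one string.

    Each file is preceded by a header with its relative path so the model
    can cite locations. The header format is an arbitrary choice.
    """
    parts = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in extensions:
            relative = path.relative_to(root).as_posix()
            source = path.read_text(encoding="utf-8")
            parts.append(f"// ===== {relative} =====\n{source}")
    return "\n\n".join(parts)


def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token)


# Build a throwaway directory with hypothetical files to show the flow.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "hvm1").mkdir()
    (root / "hvm2").mkdir()
    (root / "hvm1" / "main.rs").write_text("fn main() {}\n", encoding="utf-8")
    (root / "hvm2" / "run.rs").write_text("pub fn run() {}\n", encoding="utf-8")
    (root / "notes.txt").write_text("not code\n", encoding="utf-8")
    prompt = concatenate_codebase(root, (".rs",))

print(prompt)
print(rough_token_count(prompt))
```

For a real count you would use the model's own tokenizer; the estimate here is only good for checking that you are in the right ballpark before uploading.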

Many Shot Performance

Gemini’s performance improves with more examples. Now GPT-4 does too, but one of the issues with GPT-4 is the model’s tendency to get confused by long context. In practice, both for this reason and for cost, most GPT-4 API users have tended to keep context short to obtain higher reliability.

Gemini seems to not get confused in long context, meaning that you can pile on examples and get improvement in answers:

I've been testing Gemini 1.5 Pro's reasoning capabilities this morning.
It's near-GPT-4-level with regular prompts.
BUT Gemini's performance improves as I add dozens of examples. There doesn't seem to be an upper limit.
Many-example prompting is the new fine-tuning.
— Matt Shumer (@mattshumer_)
4:40 PM • Feb 20, 2024

Their metrics bear this out: note how they prefer to compare 10-shot or multiple-shot results (each shot being an example in the context window).

What does this mean?

Your instructions to Gemini can contain numerous examples, which Gemini will use to learn, in context, how to respond correctly.
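In practice, a many-shot prompt is just an instruction, a long run of worked examples, and the new input at the end. A minimal sketch, assuming a plain "Input/Output" layout and a made-up sentiment task; the layout is one common convention, not a format Gemini requires.

```python
def build_many_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Assemble an instruction, numbered worked examples, and the final query."""
    lines = [instruction, ""]
    for number, (example_input, example_output) in enumerate(examples, start=1):
        lines.append(f"Example {number}")
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    # The prompt ends on an open "Output:" so the model fills in the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical examples; with a long context you could supply dozens or hundreds.
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
    ("Setup took five minutes and just worked.", "positive"),
]

prompt = build_many_shot_prompt(
    instruction="Classify the sentiment of each review as positive or negative.",
    examples=examples,
    query="The charger stopped working after two days.",
)
print(prompt)
```

With a short context window you ration examples. With a long one, the list can simply keep growing until the answers stop improving.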

Summary

Gemini is finally an OpenAI-beating model. We still don’t have a sense of pricing, but a 90-second inference time indicates it’s going to be expensive. Gemini will remain a proof of concept for a while. But Google is certainly back.

🗞️ Things Happen

Khan Academy’s GPT-4-based tutor, Khanmigo, fails simple arithmetic in a Wall Street Journal test. This is disappointing. As launch partners, they’ve had access to a fine-tuneable GPT-4 instance for more than a year. They also spent an enormous amount of time on UI, constraining inputs and outputs, etc.

🖼️ AI Artwork Of The Day

Preparing for World War(Hammer) III - u/pikeymikey22 in r/MidJourney

That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world: