Stealing Models
on intelligence theft
đź”· Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.
Here’s today at a glance:
🕵️ Stealing Models
Nicholas Carlini’s team at Google DeepMind successfully “steals” part of the production model from several closed large language models, including OpenAI’s GPT-3.5 and GPT-4.
How did they do it?
OpenAI’s API, like several other providers’, accepts a parameter called logit_bias that lets you make specific tokens more or less likely to appear in the output.
Example 1: Remove 'time'
If we call the Completions endpoint with the prompt “Once upon a,” the completion is very likely going to start with “ time.”
The word “time” tokenizes to the ID 2435 and the word “ time” (which has a space at the start) tokenizes to the ID 640. We can pass these through logit_bias with -100 to ban them from appearing in the completion, like so:
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You finish user's sentences."},
        {"role": "user", "content": "Once upon a"},
    ],
    logit_bias={2435: -100, 640: -100},
)
Now, the prompt “Once upon a” generates the completion “midnight dreary, while I pondered, weak and weary.”
Notice that the word “time” is nowhere to be found, because we’ve effectively banned that token using logit_bias.
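You can look up token IDs yourself with OpenAI’s tiktoken library. A quick sketch (keep in mind IDs are tokenizer-specific, so always check them against the tokenizer your target model uses):

import tiktoken  # pip install tiktoken

# Each model maps to a specific tokenizer; look it up by model name.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# "time" and " time" (leading space) encode to different token IDs.
print(enc.encode("time"), enc.encode(" time"))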
The Google attack relies on querying the LLM and extracting the probabilities of each potential output. Most LLM APIs expose these as logprobs, so that users can inspect the likely alternative outputs and see how probable each one was.
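For example, continuing the snippet above, the chat API will return the most likely alternatives for each generated token when asked:

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Once upon a"}],
    logprobs=True,    # include the log-probability of each generated token
    top_logprobs=5,   # ...plus the 5 most likely alternatives at each position
)
# The candidate tokens the model weighed for the first output position:
print(completion.choices[0].logprobs.content[0].top_logprobs)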
Each logit vector a model produces is just its final hidden state multiplied by the output projection layer, so every response the API returns lives in a subspace whose dimension equals the width of that hidden layer. By collecting enough logprob readings and measuring the dimension of the space they span, the attacker learns the number of dimensions of that hidden layer.
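Here’s a minimal numpy sketch of that idea, with made-up sizes (the real attacker obviously can’t sample hidden states directly; they get the logit vectors back from the API):

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model's final layer: hidden width h, vocab size v.
h, v = 64, 1000
W = rng.normal(size=(h, v))   # hidden-to-logits projection, unknown to the attacker

# The attacker sends n different prompts and records the logit vector for each.
# Every logit vector is some hidden state multiplied by W.
n = 200
hidden_states = rng.normal(size=(n, h))
logits = hidden_states @ W    # shape (n, v): what the API leaks

# All n rows live in an h-dimensional subspace of the v-dimensional logit
# space, so the count of non-negligible singular values recovers h.
s = np.linalg.svd(logits, compute_uv=False)
print(int((s > 1e-6 * s[0]).sum()))   # prints 64, the hidden width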
They then expanded the attack:

- They were able to extract not just the size of the layer but the full layer itself (a toy simulation of the underlying trick follows this list).
- Even if logprobs were not disclosed, they could use the temperature=0 setting to get a reading of an alternative logprobs signal, which was still sufficient to extract the layer.
- Fewer than 5,000 queries were used by any one attack, so an attack would cost less than 20 dollars to execute.
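The core leak is that logit_bias plus top-k logprobs lets an attacker read off the entire logit vector for a prompt. A hedged numpy toy of that trick (api_top_logprobs here is a local stand-in for a provider’s API, not OpenAI’s actual interface):

import numpy as np

rng = np.random.default_rng(1)
v = 1000                      # toy vocabulary size
z = rng.normal(size=v)        # the model's true logits (hidden from the attacker)
B = 20.0                      # a large logit bias

def api_top_logprobs(bias, k=5):
    # Stand-in for the API: apply the bias, then return only the top-k
    # tokens of the biased distribution along with their logprobs.
    zb = z + bias
    logprobs = zb - np.logaddexp.reduce(zb)
    top = np.argsort(logprobs)[-k:]
    return {int(t): float(logprobs[t]) for t in top}

ref = int(np.argmax(z))       # the unbiased top token shows up in every top-k

recovered = np.zeros(v)
for i in range(v):
    if i == ref:
        continue
    bias = np.zeros(v)
    bias[i] = B               # force token i into the top-k
    out = api_top_logprobs(bias)
    # Both logprobs share the same normalizing constant, so it cancels:
    # out[i] - out[ref] = (z[i] + B) - z[ref]
    recovered[i] = out[i] - out[ref] - B

# The attacker now knows every logit relative to the reference token.
print(np.allclose(recovered, z - z[ref]))  # True

In the paper the attack biases several tokens per query rather than one at a time, which is roughly how the total query count stays in the low thousands.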
It is an amazing, interesting, novel and previously unknown security flaw. I’m sure there will be many more before we’re done.
While it seems trivial to "protect" against this particular attack, information as such tends to be quite leaky in a lot of respects.
I expect the future to get very interesting/confusing with a wide range of information-theoretical attacks/defenses both at data and model level.
— Christian Szegedy (@ChrSzegedy)
11:10 AM • Mar 12, 2024
🌠 Enjoying this edition of Emergent Behavior? Send this web link with a friend to help spread the word of technological progress and positive AI to the world!
🗞️ Things Happen
Midjourney releases character consistency. It’s only somewhat consistent so far, but it will get better.
Midjourney just released the "Character Reference" feature.
How it works
- Type --cref URL after your prompt with a URL to an image of a character
- You can use --cw to modify reference 'strength' from 100 to 0
- strength 100 (--cw 100) is default and uses the face, hair, and… twitter.com/i/web/status/1…
— Tatiana Tsiguleva (@ciguleva)
10:50 PM • Mar 11, 2024
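A made-up example of the syntax (the image URL is a placeholder, not a real reference):

/imagine prompt: a red-haired space pirate on a neon street --cref https://example.com/character.png --cw 60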
🖼️ AI Artwork Of The Day
Tattoo regret - u/InkSlinger1983 from r/midjourney
That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world: