Stealing Models

on intelligence theft

đź”· Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.

Here’s today at a glance:

🕵️ Stealing Models

Nicholas Carlini’s team at Google DeepMind successfully “steals” part of the production model behind several closed large language models, including OpenAI’s GPT-3.5 and GPT-4.

How did they do it?

OpenAI’s API, like several others, accepts a parameter called logit_bias that lets you adjust how likely individual tokens are to appear in the output.

Example 1: Remove 'time'

If we call the Completions endpoint with the prompt “Once upon a,” the completion is very likely going to start with “ time.”

The word “time” tokenizes to the ID 2435 and the word “ time” (which has a space at the start) tokenizes to the ID 640. We can pass both IDs to logit_bias with a value of -100 to ban them from appearing in the completion, like so:

from openai import OpenAI

client = OpenAI()

# Ban both tokenizations of "time" by giving them the minimum bias of -100.
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": "You finish user's sentences."},
              {"role": "user", "content": "Once upon a"}],
    logit_bias={2435: -100, 640: -100},
)

Now, the prompt “Once upon a” generates the completion “midnight dreary, while I pondered, weak and weary.”

Notice that the word “time” is nowhere to be found, because we’ve effectively banned that token using logit_bias.
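If you want to look up token IDs yourself, here is a minimal sketch using OpenAI’s tiktoken library (note that the IDs depend on which tokenizer a given model uses, so your output may differ from the IDs above):

import tiktoken

# Fetch the tokenizer associated with gpt-3.5-turbo and encode both variants.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.encode("time"))   # ID(s) for "time" with no leading space
print(enc.encode(" time"))  # ID(s) for " time" with a leading space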

The Google attack relies on querying the LLM and extracting the probability of each potential output token. Most LLM APIs expose these as logprobs, so that users can inspect alternative outputs and see how likely each was to be generated.
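For reference, here is a minimal sketch of requesting logprobs from the Chat Completions API (the logprobs and top_logprobs parameters shown are OpenAI’s; other providers differ):

from openai import OpenAI

client = OpenAI()

# Request the log-probabilities of the 5 most likely tokens at each position.
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Once upon a"}],
    logprobs=True,
    top_logprobs=5,
)
# Each generated token comes back with its top-5 alternatives and their logprobs.
print(completion.choices[0].logprobs.content[0].top_logprobs)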

Because every logit vector is produced by multiplying a hidden vector by the model’s final projection layer, all of the collected outputs lie in a subspace whose dimension equals that hidden layer’s width. Measuring the dimension of that subspace therefore reveals the number of dimensions of the final hidden layer.
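To make that concrete, here is a minimal numpy sketch of the idea (the sizes and the random “model” are hypothetical, and real APIs only return a few top logprobs per query, which the paper works around):

import numpy as np

# Hypothetical model: logits are a linear projection of a hidden vector,
# with hidden size h much smaller than vocabulary size v.
v, h, n_queries = 1000, 64, 200
W = np.random.randn(v, h)  # final projection layer (unknown to the attacker)

# The attacker collects one full logit vector per query.
logits = np.stack([W @ np.random.randn(h) for _ in range(n_queries)])

# Every logit vector lies in an h-dimensional subspace, so the stack of
# collected logits has numerical rank h -- revealing the hidden dimension.
singular_values = np.linalg.svd(logits, compute_uv=False)
print("recovered hidden dimension:", int((singular_values > 1e-6).sum()))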

They then expanded the attack:

  • they were able to extract the full layer itself, not just its dimension

  • if logprobs were not disclosed, they recovered them indirectly: at temperature=0, binary-searching the logit_bias value at which a target token becomes the top output reveals that token’s relative logit, which was sufficient to extract the layer anyway (see the sketch after this list)

  • fewer than 5,000 queries were used by any one attack, so an attack would cost less than $20 to execute.
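Here is a minimal sketch of that logprob-free trick. The sample_top_token helper is hypothetical: imagine it queries the API at temperature=0 with logit_bias={target_token: bias} and returns True if the target token was emitted:

# Hypothetical helper assumed: sample_top_token(bias) -> bool, which queries
# the model at temperature=0 with logit_bias={target_token: bias} and reports
# whether target_token became the greedy (top) output.

def estimate_logit_gap(sample_top_token, lo=0.0, hi=100.0, iters=20):
    """Binary-search the smallest bias that makes the target token win.

    That threshold approximates the logit gap between the target token and
    the model's default top token, recovering relative logits without any
    logprobs being disclosed.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sample_top_token(mid):
            hi = mid  # bias sufficed: the true threshold is at or below mid
        else:
            lo = mid  # bias was too small: the threshold lies above mid
    return (lo + hi) / 2

# Toy check with a pretend model whose true logit gap is 7.3:
print(estimate_logit_gap(lambda bias: bias >= 7.3))  # prints ~7.3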

It is an amazing, novel, and previously unknown security flaw. I’m sure there will be many more before we’re done.

🌠 Enjoying this edition of Emergent Behavior? Send this web link to a friend to help spread the word of technological progress and positive AI to the world!

🗞️ Things Happen

  • Midjourney releases character consistency. It’s only somewhat consistent so far, but it will get better.

🖼️ AI Artwork Of The Day

Tattoo regret - u/InkSlinger1983 from r/midjourney

That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world:
