Stealing Models
on intelligence theft
đź”· Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.
Here’s today at a glance:
🕵️ Stealing Models
Nicholas Carlini’s team at Google DeepMind successfully “steals” part of the production model from several closed large language models, including OpenAI’s GPT-3.5 and GPT-4.
How did they do it?
OpenAI’s API, like several other providers’, accepts a parameter called logit_bias that lets you make specific tokens more or less likely to appear in the output.
Example 1: Remove 'time'
If we call the Completions endpoint with the prompt “Once upon a,” the completion is very likely going to start with “ time.”
The word “time” tokenizes to the ID 2435 and the word “ time” (which has a space at the start) tokenizes to the ID 640. We can pass these through logit_bias with -100 to ban them from appearing in the completion, like so:
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You finish user's sentences."},
        {"role": "user", "content": "Once upon a"},
    ],
    logit_bias={2435: -100, 640: -100},
)
Now, the prompt “Once upon a” generates the completion “midnight dreary, while I pondered, weak and weary.”
Notice that the word “time” is nowhere to be found, because we’ve effectively banned that token using logit_bias.
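You can look up token IDs yourself with OpenAI’s tiktoken library. A quick sketch (keep in mind IDs are tokenizer-specific, so always check them against the tokenizer your target model uses):

import tiktoken  # pip install tiktoken

# Each model maps to a specific tokenizer; look it up by model name.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# "time" and " time" (leading space) encode to different token IDs.
print(enc.encode("time"), enc.encode(" time"))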
The Google attack relies on querying the LLM and extracting the probabilities of each potential output. Most LLM APIs expose these as logprobs, so that users can inspect the likely alternative outputs and see how probable each one was.
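For example, continuing the snippet above, the chat API will return the most likely alternatives for each generated token when asked:

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Once upon a"}],
    logprobs=True,    # include the log-probability of each generated token
    top_logprobs=5,   # ...plus the 5 most likely alternatives at each position
)
# The candidate tokens the model weighed for the first output position:
print(completion.choices[0].logprobs.content[0].top_logprobs)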
Each logit vector a model produces is just its final hidden state multiplied by the output projection layer, so every response the API returns lives in a subspace whose dimension equals the width of that hidden layer. By collecting enough logprob readings and measuring the dimension of the space they span, the attacker learns the number of dimensions of that hidden layer.
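Here’s a minimal numpy sketch of that idea, with made-up sizes (the real attacker obviously can’t sample hidden states directly; they get the logit vectors back from the API):

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model's final layer: hidden width h, vocab size v.
h, v = 64, 1000
W = rng.normal(size=(h, v))   # hidden-to-logits projection, unknown to the attacker

# The attacker sends n different prompts and records the logit vector for each.
# Every logit vector is some hidden state multiplied by W.
n = 200
hidden_states = rng.normal(size=(n, h))
logits = hidden_states @ W    # shape (n, v): what the API leaks

# All n rows live in an h-dimensional subspace of the v-dimensional logit
# space, so the count of non-negligible singular values recovers h.
s = np.linalg.svd(logits, compute_uv=False)
print(int((s > 1e-6 * s[0]).sum()))   # prints 64, the hidden width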
They then expanded the attack:

- They were able to extract not just the size of the layer but the full layer itself (a toy simulation of the underlying trick follows this list).
- Even if logprobs were not disclosed, they could use the temperature=0 setting to get a reading of an alternative logprobs signal, which was still sufficient to extract the layer.
- Fewer than 5,000 queries were used by any one attack, so an attack would cost less than 20 dollars to execute.
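The core leak is that logit_bias plus top-k logprobs lets an attacker read off the entire logit vector for a prompt. A hedged numpy toy of that trick (api_top_logprobs here is a local stand-in for a provider’s API, not OpenAI’s actual interface):

import numpy as np

rng = np.random.default_rng(1)
v = 1000                      # toy vocabulary size
z = rng.normal(size=v)        # the model's true logits (hidden from the attacker)
B = 20.0                      # a large logit bias

def api_top_logprobs(bias, k=5):
    # Stand-in for the API: apply the bias, then return only the top-k
    # tokens of the biased distribution along with their logprobs.
    zb = z + bias
    logprobs = zb - np.logaddexp.reduce(zb)
    top = np.argsort(logprobs)[-k:]
    return {int(t): float(logprobs[t]) for t in top}

ref = int(np.argmax(z))       # the unbiased top token shows up in every top-k

recovered = np.zeros(v)
for i in range(v):
    if i == ref:
        continue
    bias = np.zeros(v)
    bias[i] = B               # force token i into the top-k
    out = api_top_logprobs(bias)
    # Both logprobs share the same normalizing constant, so it cancels:
    # out[i] - out[ref] = (z[i] + B) - z[ref]
    recovered[i] = out[i] - out[ref] - B

# The attacker now knows every logit relative to the reference token.
print(np.allclose(recovered, z - z[ref]))  # True

In the paper the attack biases several tokens per query rather than one at a time, which is roughly how the total query count stays in the low thousands.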
It is an amazing, interesting, novel and previously unknown security flaw. I’m sure there will be many more before we’re done.
While it seems trivial to "protect" against this particular attack, information as such tends to be quite leaky in a lot of respects.
I expect the future to get very interesting/confusing with a wide range of information-theoretical attacks/defenses both at data and model level.
— Christian Szegedy (@ChrSzegedy)
11:10 AM • Mar 12, 2024
🌠 Enjoying this edition of Emergent Behavior? Send this web link with a friend to help spread the word of technological progress and positive AI to the world!
🗞️ Things Happen
Midjourney releases character consistency. It’s only somewhat consistent so far, but it will get better.
Midjourney just released the "Character Reference" feature.
How it works
- Type --cref URL after your prompt with a URL to an image of a character
- You can use --cw to modify reference 'strength' from 100 to 0
- strength 100 (--cw 100) is default and uses the face, hair, and… twitter.com/i/web/status/1…
— Tatiana Tsiguleva (@ciguleva)
10:50 PM • Mar 11, 2024
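A made-up example of the syntax (the image URL is a placeholder, not a real reference):

/imagine prompt: a red-haired space pirate on a neon street --cref https://example.com/character.png --cw 60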
🖼️ AI Artwork Of The Day
Tattoo regret - u/InkSlinger1983 from r/midjourney
That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world: