Uncanny Valley Recession
AI Waifus and Husbandos are on the way
🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.
Here’s today at a glance:
📷 The Pictures Learn To Talk
The AI team at Alibaba came up with a milestone innovation: generating realistic talking people from just a single portrait image and audio. Some samples:
And the full paper via video:
Let’s break down what they’ve done:
- You send the model (see the sketch after this list):
  - a single portrait image of a face
  - an audio file of a single person speaking or singing
- The model generates a video of:
  - the face in the image speaking or singing
  - near-perfect lipsync in multiple languages
  - true emotion in the eyes, eyebrows, jaw, cheeks, and facial angles
  - including natural face movement and hair follow-through
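For the code-minded, the whole thing reduces to a tiny input/output contract. Nothing below is a real API (the model has not been released); the class and method names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TalkingHeadRequest:
    portrait_path: str   # single still image of a face
    audio_path: str      # speech or singing from a single speaker

class TalkingHeadModel:
    """Hypothetical wrapper -- the real model and weights are not public."""

    def generate(self, request: TalkingHeadRequest, fps: int = 30) -> str:
        """Return the path of a rendered video in which the portrait's face moves
        in sync with the audio: lips, eyes, brows, jaw, cheeks, and head pose."""
        raise NotImplementedError("no public weights or API yet")

# One image + one audio clip in, one talking-head video out.
request = TalkingHeadRequest("portrait.png", "speech.wav")
# video_path = TalkingHeadModel().generate(request)
```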
It’s an amazing, state-of-the-art piece of technology that will change the world the moment it’s fully released. The implications for YouTube, TikTok, Instagram, customer service videos, and the like are vast. Unreal MetaHuman just got smoked. It is also likely to compete with other, older (lol, like 1 year at most) AI avatar players like HeyGen and Synthesia.
How was it built?
Really an amalgamation of other techniques with some innovative thinking and hacks sprinkled in (like all AI?):
- they used an image diffusion model where the next frame is generated from the last frame plus the audio signal
- but the mapping between audio and image is ambiguous: many different facial configurations can produce the same sound
- that ambiguity creates facial distortions and instability
- they fix this with new hyperparameters (see the sketch after this list):
  - a) a head speed controller
  - b) a facial region controller
- character consistency comes from ReferenceNet (a spatial attention block from their earlier AnimateAnyone work), reused to build a frame-to-frame consistency attention block
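Put together, the generation loop looks roughly like the sketch below. Every helper function here is a made-up stand-in (the real system is a diffusion UNet with attention layers), but the conditioning signals are the point: previous frame, per-frame audio features, reference identity features, and the two stability controllers:

```python
import numpy as np

# Hypothetical stand-ins -- real versions would be neural networks.
def encode_audio(audio, num_frames):
    return [np.zeros(128) for _ in range(num_frames)]  # one feature vector per frame

def encode_reference(portrait):
    return np.zeros((64, 64, 128))                     # identity features from the portrait

def denoise_frame(noise, prev_frame, audio_feat, ref_feats, head_speed, face_mask):
    return prev_frame                                  # placeholder "denoiser"

def generate_video(portrait, audio, num_frames, head_speed=1.0, face_mask=None):
    """Autoregressive, audio-conditioned frame generation (sketch of the idea above)."""
    audio_feats = encode_audio(audio, num_frames)
    ref_feats = encode_reference(portrait)
    frames, prev = [], portrait
    for t in range(num_frames):
        noise = np.random.randn(*portrait.shape)
        # Each new frame is denoised conditioned on: the previous frame (temporal
        # continuity), the audio features at time t (lip sync and emotion), the
        # reference features (keeps the identity stable), and the two weak
        # controllers that damp head motion and constrain where the face can move.
        prev = denoise_frame(noise, prev, audio_feats[t], ref_feats, head_speed, face_mask)
        frames.append(prev)
    return frames

frames = generate_video(portrait=np.zeros((64, 64, 3)), audio=None, num_frames=8)
```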
Older methods used things like blendshapes: dividing the face into hundreds of little polygons and tracking them (this is what the iPhone’s ARKit does). But those methods limited how expressive the face could be: your face is more than a few hundred crude polygons.
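For contrast, the blendshape approach boils down to a weighted sum of pre-authored offsets on a fixed mesh. The mesh and offsets below are toy values (jawOpen and browInnerUp are real ARKit coefficient names; the numbers are not):

```python
import numpy as np

# Toy mesh: 4 vertices instead of thousands.
neutral = np.zeros((4, 3))                                   # resting face mesh (x, y, z per vertex)
blendshapes = {
    "jawOpen":     np.array([[0, 0, 0], [0, -1, 0], [0, -1, 0], [0, 0, 0]], dtype=float),
    "browInnerUp": np.array([[0, 1, 0], [0, 0, 0],  [0, 0, 0], [0, 1, 0]], dtype=float),
}

def apply_blendshapes(neutral, blendshapes, weights):
    """Deformed mesh = neutral + sum_i weight_i * offset_i, with weights in [0, 1]."""
    mesh = neutral.copy()
    for name, w in weights.items():
        mesh = mesh + w * blendshapes[name]
    return mesh

# "Half-open jaw, slightly raised brows" -- expressiveness is capped by whatever
# shapes the rig's authors anticipated, which is the limitation described above.
print(apply_blendshapes(neutral, blendshapes, {"jawOpen": 0.5, "browInnerUp": 0.2}))
```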
Why it’s cool
Besides the actual visual aspects:
- Emotion seems real
- The emotion was captured from the audio
- This implies that the audio of someone speaking carries enough information to infer how the muscles in their face must be moving to produce both the sound and the feeling heard in the audio
That’s literally mind-blowing. Also, it was a relatively small training run:
- 250 hours of video
- 150 million images
This is hardly all the YouTube clips in the world. There is scope to scale this up 1000x.
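A quick sanity check on that claim, assuming the oft-cited figure of roughly 500 hours of video uploaded to YouTube every minute (an outside estimate, not a number from the paper):

```python
# Back-of-the-envelope: is 1000x more data even plausible?
train_hours = 250                                  # video used for this model
scaled_hours = train_hours * 1000                  # the hypothetical 1000x run
youtube_hours_uploaded_per_minute = 500            # widely cited public estimate
minutes_of_uploads = scaled_hours / youtube_hours_uploaded_per_minute
print(f"{scaled_hours:,} hours ≈ {minutes_of_uploads / 60:.1f} hours of YouTube uploads")
# -> 250,000 hours of training video is roughly one working day of YouTube uploads.
```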
Notably
- They probably won’t release this to the public, same as their previous work, AnimateAnyone, which let you make any human image dance or walk.
- AnimateAnyone was never pushed to production, so either it’s a safety issue or the drawbacks (cost, latency, etc.) haven’t been worked out yet.
- Open source will get there eventually, so within 12–18 months this should be widely available.
AI waifus/husbandos here we come.
🌠 Enjoying this edition of Emergent Behavior? Share this web link with a friend to help spread the word of technological progress and positive AI to the world!
Or send them the below subscription link:
🗞️ Things Happen
Ideogram, the image generation startup founded by Google departees, launches its 1.0 model with “state-of-the-art text rendering, unprecedented photorealism, and prompt adherence.” It’s OK?
Google DeepMind published Concordia, “a library for building agents that leverage language models to simulate human behavior with a high degree of detail and realism. The agents can reason, plan, and communicate in natural language, interacting with each other in grounded physical, social, or digital environments.” It’s basically a Role Playing Game with an AI Game Master.
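For a feel of that pattern, here is a generic sketch of the AI-Game-Master loop; it is not Concordia’s actual API, and the `llm` helper is a hypothetical stand-in for any language-model client:

```python
# Agents act in natural language; a game-master model adjudicates what happens,
# with the shared world state kept as plain text.
def llm(prompt: str) -> str:
    """Hypothetical language-model call; swap in any chat/completions client."""
    return "stub response"

def run_round(world_state: str, agents: list[str]) -> str:
    for name in agents:
        action = llm(f"You are {name}. World state:\n{world_state}\nWhat do you do?")
        outcome = llm(
            "You are the game master. Given the state and the action below, "
            f"narrate what actually happens.\nState:\n{world_state}\nAction: {action}"
        )
        world_state += f"\n{name}: {action}\nGM: {outcome}"
    return world_state

state = run_round("A small village market at dusk.", ["Alice", "Bob"])
```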
Sarah Guo at Conviction VC has a call for AI startups, outlining the sectors they’d like to see ideas in. It’s interesting, not least because one wonders about the market size for all of them over a 10-year period.
🖼️ AI Artwork Of The Day
All work and no play makes Jack a dull boy - u/AdolfGomez from r/MidJourney
That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world: