No Examples Necessary

Getting close to the holy grail of social image creation

Left to right, top to bottom: Yann LeCun in The Flintstones, Mortal Kombat, Mario, Peanuts, Junji Ito manga, Family Guy, The Simpsons, One Piece - X:@cocktailpeanut

InstantID is a landmark project that can generate stylized images from a single photo of a subject. The project comes from a Chinese team with members from Peking University and the tech firms Xiaohongshu and Kuaishou, among others.

We are getting ever closer to the holy grail of social image creation: photos of you hanging with your friends at the pyramids in Egypt…without ever having to go there. This would remove the creation bottleneck in social media photography. InstantID is a dramatic improvement over older methods like Low-Rank Adaptation (LoRA), a fine-tuning method that is itself only 2 years old. Even in its latest iteration, LoRA required dozens of images, while InstantID drops the number of required images to just one. Meanwhile, the currently in-vogue IP-Adapter method produces only weakly similar images by identifying features from the original image to preserve at the inference stage.

Notable features of the InstantID model:

  • Strongly preserves identity during image generation

  • Uses a specialized ID embedding for semantic face information

  • This comes from a specialized face encoder that imposes strong semantic conditions (i.e. face characteristics) and weak spatial conditions (looser restrictions on where and how the face is portrayed)

  • Notably, they use the ID embeddings to describe the face, instead of relying on the text description from the user

  • The face embedding model guides the diffusion image generation

  • It works as a plugin to existing Stable Diffusion pipelines

  • No fine-tuning is necessary, yet it outperforms techniques such as LoRA that fine-tune on multiple images
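
To make the "ID embedding guides generation" idea concrete, here is a toy sketch in plain Python. It assumes the face embedding is injected through its own attention branch that sits alongside the text prompt's attention, which is one common way adapters condition a diffusion model. This is an illustration only, not InstantID's actual code: every name here (`attend`, `condition`, `id_scale`) and all of the tiny vectors are made up for the example, and a real model would use learned projections and far larger embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, tokens):
    """Single-head attention of one query over a list of token vectors.

    Simplification: the tokens act as both keys and values
    (no learned projections, unlike a real model).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]


def condition(query, text_tokens, id_tokens, id_scale=1.0):
    """Add a separate, scaled attention branch for the face ID tokens.

    id_scale is a hypothetical knob: 0.0 ignores the face entirely,
    larger values let the identity signal dominate.
    """
    text_out = attend(query, text_tokens)
    id_out = attend(query, id_tokens)
    return [t + id_scale * i for t, i in zip(text_out, id_out)]


# Toy 4-dimensional data, invented for the example.
query = [1.0, 0.0, 0.0, 0.0]            # one image-latent position
text_tokens = [[0.0, 1.0, 0.0, 0.0],    # stand-ins for prompt tokens
               [0.0, 0.0, 1.0, 0.0]]
id_tokens = [[0.0, 0.0, 0.0, 1.0]]      # stand-in for the face ID embedding

without_id = condition(query, text_tokens, id_tokens, id_scale=0.0)
with_id = condition(query, text_tokens, id_tokens, id_scale=1.0)
print(without_id)
print(with_id)
```

The point of the sketch is that the identity signal arrives through its own channel (the last coordinate here) and never has to be spelled out in the text prompt, which is why no per-person fine-tuning step is needed in this setup.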

Comparison of InstantID with other methods conditioned on different characters and styles.

It is curious to see China, which has generally disapproved of social technology, outperform in image generation merely because Large Language Models may be harmful to one’s political health.

Besides that, all I have to say is that they created an embeddings model that knows human faces very well. And this is the worst it’s ever going to be.
