China Pulls Ahead in Image Processing

Language models are too politically dangerous

🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.

Here’s today at a glance:

🆔 No Examples Necessary

Left to Right, Top to Bottom: Yann LeCun in The Flintstones, Mortal Kombat, Mario, Peanuts, Junji Ito Manga, Family Guy, Simpsons, One Piece - X:@cocktailpeanut

InstantID is a landmark project that can generate stylized images from a single photo of a subject. The project comes from a Chinese team spanning Peking University and tech firms including Xiaohongshu and Kuaishou.

We are getting ever closer to the holy grail of social image creation: photos of you hanging with your friends at the pyramids in Egypt…without ever having to go there. This would remove the creation bottleneck in social media photography. InstantID is a dramatic improvement over older methods like Low-Rank Adaptation (LoRA), a fine-tuning method that is itself only about two years old. Even in its latest iteration, LoRA requires dozens of images, while InstantID drops the requirement to just one. Meanwhile, the currently in-vogue IP-Adapter method only produces weakly similar images, since it merely identifies features of the original image to preserve at inference time.

Notable features of the InstantID model:

  • Strongly preserves identity during image generation

  • Uses a specialized ID embedding to capture semantic face information

  • The ID embedding comes from a specialized face encoder with strong semantic constraints (i.e. face characteristics) and weak spatial constraints (loosening restrictions on where and how the face is portrayed)

  • Notably, the ID embedding describes the face instead of relying on the user’s text description

  • The face embedding guides the diffusion-based image generation

  • It works as a plugin to existing Stable Diffusion pipelines (see the sketch below)

  • No fine-tuning necessary, yet it outperforms techniques such as LoRA that fine-tune on multiple images
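
For a sense of what the plugin approach looks like in practice, here is a minimal sketch built on the public diffusers and insightface packages. The repo paths, the custom pipeline import, and the `image_embeds` keyword are assumptions based on the project’s release, not a verified API.

```python
# Hedged sketch: one-shot identity-preserving generation plugged into an SDXL
# pipeline. Repo ids and the pipeline class are assumptions -- check the
# InstantID release for the real entry points.
import cv2
import torch
from diffusers import ControlNetModel
from insightface.app import FaceAnalysis

# 1) Extract a face (ID) embedding from a single reference photo.
face_app = FaceAnalysis(name="antelopev2")           # assumed model pack
face_app.prepare(ctx_id=0, det_size=(640, 640))
ref = cv2.imread("reference_photo.jpg")
face = face_app.get(ref)[0]                          # strongest detected face
id_embedding = torch.tensor(face.normed_embedding)   # semantic face features

# 2) Load the IdentityNet (a ControlNet conditioned on the ID embedding)
#    and plug it into an off-the-shelf SDXL pipeline -- no fine-tuning.
controlnet = ControlNetModel.from_pretrained(
    "InstantX/InstantID", subfolder="ControlNetModel", torch_dtype=torch.float16
)  # repo path is an assumption

# The project ships its own pipeline subclass; treat this import as illustrative.
from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline
pipe = StableDiffusionXLInstantIDPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# 3) Generate: the ID embedding, not the text prompt, carries the facial identity.
image = pipe(
    prompt="watercolor portrait, soft lighting",
    image_embeds=id_embedding,   # assumed keyword for the face embedding
    image=ref,                   # spatial hint (face keypoints in the real repo)
).images[0]
image.save("stylized_me.png")
```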

Comparison of InstantID with other methods conditioned on different characters and styles.

It is curious to see China, which has been generally disapproving of social technology, outperform in image generation merely because large language models may be harmful to one’s political health.

Besides that, all I have to say is that they created an embeddings model that knows human faces very well. And this is the worst it’s ever going to be.

👌 Elon Wuz Right

Take a single photo, and get an understanding of the 3D positioning of the objects in the photo… better than LiDAR. That’s the promise of this work from TikTok, done during Lihe Yang’s PhD internship (!) at the company and, in a tribute to Meta’s Segment Anything, titled Depth Anything.

Results on various datasets the model was not trained on: parking, home automation, gaming, driving, furniture placement, architecture

Elon of course spotted this way early:

While no one believed Elon when he said Tesla didn’t need LiDAR or radar and that images alone would be sufficient, it seems the TikTok team has proven him correct.

Features of this work:

  • The goal was to build a foundation model for depth estimation from a single image

  • They did not use the classical method of training on accurate, ground-truth measured depth maps

  • Instead, they collected a large unlabelled dataset of 62 million images, which would form the training basis for the “student” model

  • They then built an annotation model to label this dataset

  • The annotation model, the “teacher,” was trained on a labeled dataset of 1.5 million images

  • This worked because of scale! They had many failures along the way (a sketch of the teacher/student recipe follows this list)
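
Here is a minimal sketch of that teacher/student recipe in generic PyTorch. The model objects, the L1 loss, and the omitted augmentations are illustrative placeholders, not the authors’ training code.

```python
import torch
import torch.nn.functional as F

def pseudo_label_and_train(teacher, student, labeled_loader, unlabeled_loader,
                           optimizer, device="cuda"):
    """One pass of the semi-supervised recipe: a frozen teacher annotates
    unlabeled images, and the student trains on real plus pseudo labels."""
    teacher.eval()
    student.train()
    for (img_l, depth_l), img_u in zip(labeled_loader, unlabeled_loader):
        img_l, depth_l = img_l.to(device), depth_l.to(device)
        img_u = img_u.to(device)

        # The teacher (trained on ~1.5M labeled images) labels the huge
        # unlabeled pool on the fly.
        with torch.no_grad():
            pseudo_depth = teacher(img_u)

        # The student fits both real labels and pseudo labels. The paper also
        # strongly perturbs the unlabeled images so the student has to work
        # harder than the teacher -- omitted here for brevity.
        loss = (F.l1_loss(student(img_l), depth_l)
                + F.l1_loss(student(img_u), pseudo_depth))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```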

The exciting part of all of this is that it looks like vision alone is enough for a lot of tasks in the physical world.
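
If you want to try it yourself, checkpoints in this family can be run in a few lines with the Hugging Face depth-estimation pipeline; the model id below is an assumption, so check the Hub for the weights the authors actually released.

```python
# Hedged sketch: single-image depth estimation via the transformers pipeline API.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    "depth-estimation",
    model="LiheYoung/depth-anything-small-hf",  # assumed Hub id
)

image = Image.open("street_scene.jpg")
result = depth_estimator(image)

# result["depth"] is a PIL image of per-pixel relative depth; nearer pixels
# are brighter. No LiDAR, no radar -- just the photo.
result["depth"].save("street_scene_depth.png")
```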

🗞️ Things Happen

  • Human players have gotten better at Go since AlphaGo. Are players actually getting better… or have they picked up some bad habits from the chess players?

  • The first AI ultrasound was approved by the FDA… in a limited capacity, for post-liver-cancer-treatment imaging, but still a milestone. One must remember that they are just using 5-year-old image recognition tech… wait till they get something more current in there.

  • Physical processes like nucleation may be performing calculations similar to neural networks. That gives hope to those of us who believe the universe itself is fundamentally intelligent. Onwards!

🖼️ AI Artwork of the Day

Eminem vs M&M from the series Rappers Challenge Their Namesakes To A Rap Battle - u/ShaneKaiGlenn, r/midjourney

That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world:
