China Pulls Ahead in Image Processing
Language models are too politically dangerous
🔷 Subscribe to get breakdowns of the most important developments in AI in your inbox every morning.
Here’s today at a glance:
🆔 No Examples Necessary
Left to Right, Top to Bottom: Yann LeCun in The Flintstones, Mortal Kombat, Mario, Peanuts, Junji Ito Manga, Family Guy, Simpsons, One Piece - X: @cocktailpeanut
InstantID is a landmark project that can generate stylized images from a single photo of a subject. It comes from a Chinese team drawn from Peking University and tech firms including Xiaohongshu and Kuaishou.
We are getting ever closer to the holy grail of social image creation: photos of you hanging with your friends at the pyramids in Egypt… without ever having to go there. This would remove the creation bottleneck from social media photography. InstantID is a dramatic improvement over older methods like Low-Rank Adaptation (LoRA), a fine-tuning method that is itself only two years old. Even in its latest iteration, LoRA requires dozens of images, while InstantID drops the requirement to just one. Meanwhile, the currently in-vogue IP-Adapter method produces only weakly similar images, identifying features of the original image to preserve at the inference stage.
Notable features of the InstantID model:
Strongly preserves identity during image generation
Uses a specialized ID embedding for semantic face information
The face encoder is strongly semantic (capturing face characteristics) but weakly spatial (loosening restrictions on where and how the face is portrayed)
Notably, the ID embeddings describe the face, rather than relying on the user's text description
The face embedding model guides the diffusion image generation
It works as a plugin to existing Stable Diffusion pipelines
No fine-tuning is necessary, yet it outperforms techniques such as LoRA, which fine-tune on multiple images
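Mechanically, this style of identity injection resembles decoupled cross-attention: the image features attend to the text-prompt tokens and to the face-ID tokens through separate attention branches, and the two outputs are summed. Here is a toy numpy sketch of that idea; the shapes, the `id_scale` weight, and the omitted projection matrices are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Plain scaled dot-product attention (learned projections omitted).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def decoupled_cross_attention(img_feats, text_tokens, face_id_tokens, id_scale=1.0):
    # Two attention branches: one over the text prompt, one over the
    # face-identity embedding. Summing them injects identity information
    # without squeezing the face through the text description.
    text_out = cross_attention(img_feats, text_tokens)
    id_out = cross_attention(img_feats, face_id_tokens)
    return text_out + id_scale * id_out

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(16, 8))      # 16 latent "pixels", dim 8
text_tokens = rng.normal(size=(4, 8))     # 4 prompt tokens
face_id_tokens = rng.normal(size=(2, 8))  # 2 identity tokens from a face encoder

out = decoupled_cross_attention(img_feats, text_tokens, face_id_tokens)
print(out.shape)  # (16, 8)
```

Because the identity branch is additive, setting `id_scale` to zero recovers ordinary text-conditioned attention, which is why such adapters can plug into existing Stable Diffusion pipelines without retraining the base model.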
Comparison of InstantID with other methods conditioned on different characters and styles.
It is curious to see China, which has been generally disapproving of social technology, outperforming in image generation merely because large language models may be harmful to one's political health.
Besides that, all I have to say is that they created an embeddings model that knows human faces very well. And this is the worst it’s ever going to be.
👌 Elon Wuz Right
Take a single photo and get an understanding of the 3D positioning of the objects in it… better than LiDAR. That's the promise of this work from TikTok, done during Lihe Yang's PhD internship (!) at the company and, in a tribute to Meta, titled Depth Anything.
Results on various datasets which the model was not trained for: for parking, home automation, gaming, driving, furniture placement, architecture
Elon of course spotted this way early:
@WholeMarsBlog Sensors are a bitstream and cameras have several orders of magnitude more bits/sec than radar (or lidar).
Radar must meaningfully increase signal/noise of bitstream to be worth complexity of integrating it.
As vision processing gets better, it just leaves radar far behind.
— Elon Musk (@elonmusk)
8:23 AM • Apr 10, 2021
While no one believed Elon when he said Tesla didn't need LiDAR or radar and that images alone would be sufficient, it seems the TikTok team has proven him correct.
Features of this work:
Goal was to build a foundation model for depth estimation from a single image
They did not use the classical method of training on accurate, ground-truth measured depth maps
Instead they collected a large unlabeled dataset of 62 million images to train the "student" model
They then built an annotation model, the "teacher," to label this dataset
The teacher was trained on a separate dataset of 1.5 million labeled images
This worked because of scale! They had many failures along the way
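The teacher-student loop above can be sketched in a few lines. In this toy illustration a least-squares linear regressor stands in for the depth networks, and all dataset sizes and values are made up; it shows the pseudo-labeling recipe, not TikTok's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    # Least-squares "model": weights mapping image features to depth.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# A hidden image->depth relation, used only to simulate data.
true_w = np.array([2.0, -1.0, 0.5])
X_labeled = rng.normal(size=(150, 3))                      # small labeled set
y_labeled = X_labeled @ true_w + rng.normal(scale=0.1, size=150)
X_unlabeled = rng.normal(size=(6000, 3))                   # large unlabeled set

# Step 1: train the teacher (annotation model) on the labeled data.
teacher = fit_linear(X_labeled, y_labeled)

# Step 2: the teacher pseudo-labels the unlabeled images.
pseudo_y = X_unlabeled @ teacher

# Step 3: train the student on labeled + pseudo-labeled data combined.
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_y])
student = fit_linear(X_all, y_all)

print(np.round(student, 2))
```

In this linear toy the student can only inherit the teacher's knowledge; the reported gains in the real work came from the sheer scale of the unlabeled data and, reportedly, from making the student's task harder than the teacher's during training.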
The exciting part of all of this is that it looks like vision alone is enough for a lot of tasks in the physical world.
🗞️ Things Happen
Human players get better at Go after AlphaGo. Are players actually getting better… or have they picked up some bad habits from the chess players?
The first AI ultrasound was approved by the FDA… in a limited capacity, for imaging after liver cancer treatment, but a milestone nonetheless. One must remember that they are using five-year-old image recognition tech… wait till they get something more current in there.
Physical processes like nucleation may be performing calculations similar to neural networks. That gives hope to those of us who believe the universe itself is fundamentally intelligent. Onwards!
🖼️ AI Artwork of the Day
Eminem vs M&M from the series Rappers Challenge Their Namesakes To A Rap Battle - u/ShaneKaiGlenn, r/midjourney
That’s it for today! Become a subscriber for daily breakdowns of what’s happening in the AI world: