Elon Wuz Right
TikTok team proves Elon correct - images alone would be sufficient
Take a single photo, and get an understanding of the 3D positioning of the objects in the photo… better than LiDAR. That’s the promise of this work from TikTok, titled Depth Anything in a nod to Meta’s Segment Anything, and done during Lihe Yang’s PhD internship (!) at the company.
Results hold on various datasets the model was never trained on, spanning parking, home automation, gaming, driving, furniture placement, and architecture.
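To make "3D positioning from a single photo" concrete: once a model predicts a per-pixel depth map, each pixel can be back-projected into a 3D point in the camera frame using the pinhole camera model. A minimal sketch, where the depth values and camera intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative and not from the paper:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Convert a per-pixel depth map into an (H, W, 3) array of 3D
    points in the camera frame, via the pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx  # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy  # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1)

# Toy 2x2 depth map: every pixel 2 m away, principal point at (0.5, 0.5).
depth = np.full((2, 2), 2.0)
points = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

This is the step a LiDAR unit does in hardware; with a good monocular depth model, the camera plus this bit of geometry gets you the same point cloud.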
Elon of course spotted this way early:
@WholeMarsBlog Sensors are a bitstream and cameras have several orders of magnitude more bits/sec than radar (or lidar).
Radar must meaningfully increase signal/noise of bitstream to be worth complexity of integrating it.
As vision processing gets better, it just leaves radar far behind.
— Elon Musk (@elonmusk)
8:23 AM • Apr 10, 2021
While no one believed Elon when he said Tesla didn’t need LiDAR or radar and that images alone would be sufficient, the TikTok team seems to have proven him correct.
Features of this work:
- Goal was to build a foundation model for depth estimation from a single image
- Did not use the classical method of training on accurate, ground-truth measured depth maps
- Instead collected a large unlabeled dataset of 62 million images, on which the “student” model would be trained
- Built an annotation model, the “teacher,” to label this dataset
- The teacher was itself trained on a labeled dataset of 1.5 million images
- This worked because of scale! They had many failures along the way
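The teacher-student recipe above can be sketched in a few lines. The model class here is a deliberately tiny placeholder (a linear regressor trained by gradient descent), not the architecture from the paper; the point is the data flow: train the teacher on the small labeled set, pseudo-label the large unlabeled set, then train the student on both:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyDepthModel:
    """Placeholder stand-in for a depth network: a linear map from
    image features to depth, fit by gradient descent."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def fit(self, feats, depths, lr=0.1, steps=200):
        for _ in range(steps):
            grad = feats.T @ (feats @ self.w - depths) / len(depths)
            self.w -= lr * grad

    def predict(self, feats):
        return feats @ self.w

# 1) Train the teacher on the small labeled set (1.5M images in the paper).
labeled_x = rng.normal(size=(100, 4))
labeled_y = labeled_x @ np.array([1.0, -2.0, 0.5, 3.0])  # synthetic "depths"
teacher = TinyDepthModel(4)
teacher.fit(labeled_x, labeled_y)

# 2) Pseudo-label the large unlabeled set (62M images in the paper).
unlabeled_x = rng.normal(size=(1000, 4))
pseudo_y = teacher.predict(unlabeled_x)

# 3) Train the student on labeled + pseudo-labeled data combined.
student = TinyDepthModel(4)
student.fit(np.vstack([labeled_x, unlabeled_x]),
            np.concatenate([labeled_y, pseudo_y]))
```

The leverage comes from step 2: annotation effort is spent once on the small set, and the teacher amplifies it across a dataset roughly 40x larger.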
The exciting part of all of this is that it looks like vision alone is enough for a lot of tasks in the physical world.
Become a subscriber for daily breakdowns of what’s happening in the AI world: