VGGT: Visual Geometry Grounded Transformer

xnx | 190 points

I read the paper yesterday and would recommend it. Kudos to the authors for getting to these results, and for presenting them in a polished way. It's nice to follow the arguments about the alternating attention (global across all tokens vs. frame-wise across only the tokens per camera), the normalization (the scene scale is normalized in the data, vs. DUSt3R, which normalizes in the network), and the tokens (image tokens from DINOv2 + camera tokens + additional register tokens, with the first camera handled differently since it becomes the frame of reference). The results are amazing, and fine-tuning this model should be fun, e.g. for feed-forward 3DGS reconstruction. Looking forward to that.
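A minimal sketch of how I read the alternating attention, in PyTorch (my reading, not the authors' code; the names, dimensions, and single-block structure are made up for illustration, and norms/MLPs are omitted):

```python
# Sketch of alternating frame-wise / global attention, as I understand it.
# Not the authors' code; a hypothetical simplification.
import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, num_frames, tokens_per_frame, dim)
        b, f, t, d = tokens.shape
        # Frame-wise attention: each frame's tokens attend only to themselves.
        x = tokens.reshape(b * f, t, d)
        x = x + self.frame_attn(x, x, x)[0]
        # Global attention: all tokens from all frames attend to each other.
        x = x.reshape(b, f * t, d)
        x = x + self.global_attn(x, x, x)[0]
        return x.reshape(b, f, t, d)
```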

I'm sure getting to this point was quite difficult, and on the project page you can read how it involved discussions with lots and lots of smart and capable people. But there's no big "aha" moment in the paper, so in the end it feels like another win for The Bitter Lesson: they used a giant pile of [data] and roughly a year and a half of GPU time to [train] the final model, and produced a billion-parameter model that outperforms all previous specialized models.

Or in the words of the authors, from the paper:

> We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations.

Fantastic to have this. But it feels... yes, somewhat bitter.

[The Bitter Lesson]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (often discussed on HN)

[data]: "Co3Dv2 [88], BlendMVS [146], DL3DV [69], MegaDepth [64], Kubric [41], WildRGB [135], ScanNet [18], HyperSim [89], Mapillary [71], Habitat [107], Replica [104], MVS-Synth [50], PointOdyssey [159], Virtual KITTI [7], Aria Synthetic Environments [82], Aria Digital Twin [82], and a synthetic dataset of artist-created assets similar to Objaverse [20]."

[train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on lambda labs in case you're wondering

w-m | a month ago

More info and demos:

https://vgg-t.github.io/

Workaccount2 | a month ago

I really wish someone would take this and combine it with true photogrammetry, using it to supplement traditional photogrammetry rather than trying to replace it outright.

This type of thing would be the killer app for phone based 3d scanners. You don't have to have a perfect scan because this will fill in the holes for you.

sgnelson | a month ago

I'd love to hear what the use cases are for this. I was looking at Planet's website yesterday, and although the technology is fascinating, I do sometimes struggle to understand what people actually do (commercially or otherwise) with the data. (Genuinely not snark; this stuff's just not my field!)

davedx | a month ago

I'm a little suspicious of many of the outdoor examples given, though. They are famous places that are likely in the training set:

- Egyptian pyramids

- Roman Colosseum

These are the most iconic and most photographed things in the world.

That said, there are other examples that are more novel. I'm just going to focus on those to judge its quality.

bhouston | a month ago

It is cool to see recent research doing this to reconstruct scenes from fewer images, essentially using a transformer to guess the scene structure. Previously, you needed a ton of images and had to use COLMAP. All the fancy papers like NeRF and Gaussian Splatting used COLMAP in the backend, and while it does a great job in terms of accuracy, it is slow and requires a lot of images with known calibration.
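For contrast, the traditional pipeline looks something like this (a sketch driving COLMAP's CLI from Python; the paths are placeholders, and COLMAP has to be installed separately):

```python
# Sketch of the classic COLMAP SfM pipeline this work sidesteps.
import os
import subprocess

db, images, sparse = "scene.db", "images/", "sparse/"  # placeholder paths
os.makedirs(sparse, exist_ok=True)

for cmd in (
    ["colmap", "feature_extractor", "--database_path", db, "--image_path", images],
    ["colmap", "exhaustive_matcher", "--database_path", db],
    ["colmap", "mapper", "--database_path", db, "--image_path", images,
     "--output_path", sparse],
):
    subprocess.run(cmd, check=True)  # slow, and wants many well-calibrated images
```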

porphyra | a month ago

I'd really like to see this coupled with some SLAM techniques to essentially allow really accurate, long-range outdoor scene mapping with nothing but a cell phone.

A small panning video of a city street can, right now, generate a pretty damn accurate (for some use cases) point cloud, but the position accuracy falls off as you move any large distance from the start point, due to the dead-reckoning drift that essentially happens here. But if you could pipe real GPS and synthesized heading (from gyros/accelerometers/magnetometers) from the phone the images were captured on into the transformer along with the images, it would instantly and greatly improve the resulting accuracy, since those camera parameters would now be 'ground truth'd'.
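As a sketch of what feeding in those 'ground truth'd' positions could look like, the usual first step is projecting per-frame GPS fixes into a local metric frame (hypothetical pre-processing, not anything from the paper; the equirectangular approximation is fine over a few hundred meters of street):

```python
# Hypothetical pre-processing: turn per-frame GPS fixes into east/north
# meters relative to the first frame, to serve as position priors.
import math

EARTH_RADIUS_M = 6_371_000.0

def gps_to_local_xy(lat_deg, lon_deg, origin_lat_deg, origin_lon_deg):
    """Project a GPS fix to (east, north) meters relative to an origin fix."""
    lat0 = math.radians(origin_lat_deg)
    east = math.radians(lon_deg - origin_lon_deg) * math.cos(lat0) * EARTH_RADIUS_M
    north = math.radians(lat_deg - origin_lat_deg) * EARTH_RADIUS_M
    return east, north

# Dummy fixes; the first frame becomes the origin, mirroring how the model
# treats camera 1 as the frame of reference.
fixes = [(37.7749, -122.4194), (37.7752, -122.4191)]
origin = fixes[0]
priors = [gps_to_local_xy(lat, lon, *origin) for lat, lon in fixes]
```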

I think this technique could then start to rival what you currently need a $3-10k LiDAR camera for. There are a lot of 'archival' and architectural-study fields where absolute precision isn't as important as just getting 'full' scans of an area without missing patches, and where speed is a factor. Walking around with a LiDAR camera can really suck compared to just using a phone, and this technique would have no problem with multiple people using multiple phones to generate the input.

mk_stjames | a month ago

Interesting idea, I applaud it.

However, I just tried it on Hugging Face and the result was... mediocre at best:

The resulting point cloud missed about half the features from the input image.

jdthedisciple | a month ago

Looking at the output, which is impressive, I want to see this pipeline applied to splats. Dense point clouds lose a bunch of the color and directional information needed for high-quality splats, but it seems easy to imagine this method working well for them. I wonder if the architecture could be fine-tuned for this or if you'd need to retrain an entire model.

vessenes | a month ago

I feel AGI will be a patchwork of models melded together. Something like this would constitute a single model in the "perception" area.

ninetyninenine | a month ago

Can it be used to build Google Earth-like 3D scenes?

maelito | a month ago

We need camera poses in dynamic scenes

richard___ | a month ago

video or it didn't happen.

fallingmeat | a month ago

Please stop using keywords from electrical engineering.

amelius | a month ago