Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

GaggiX | 270 points

This guy is a genius; for those who don’t know, he also brought us ControlNet.

This is the first decent video generation model that runs on consumer hardware. It's a big deal, and I expect ControlNet pose support soon too.

Jaxkr | 20 days ago

Funny how it really wants people to dance. Even the guy sitting down for an interview just starts dancing sitting down.

IshKebab | 20 days ago

Wow, the examples are fairly impressive and the resources used to create them are practically trivial. Seems like inference can be run on previous generation consumer hardware. I'd like to see throughput stats for inference on a 5090 too at some point.

ZeroCool2u | 20 days ago

Could you do this spatially as well? E.g., generate the image top-down instead of all at once?

WithinReason | 20 days ago

Could this be used for video interpolation instead of extrapolation?

modeless | 20 days ago

Amazing. If you have more RAM or something, can it go faster? Can you get even more speed on an H100 or H200?

ilaksh | 20 days ago

Looks like the only motion it can do... is to dance.

fregocap | 20 days ago