Show HN: Infinity – Realistic AI characters that can speak

lcolucci | 481 points

As soon as I saw the "Gnome" face option I gnew exactly what I gneeded to do: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

EDIT: looks like the model doesn't like Duke Nukem: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Cropping out his pistol only made it worse lol: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

A different image works a little bit better, though: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

yellowapple | 4 months ago
squarefoot | 4 months ago

Hi Lina, Andrew and Sidney, this is awesome.

My go-to for checking the edges of video and face identification LLMs are Personas right now -- they're rendered faces done in a painterly style, and can be really hard to parse.

Here's some output: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Source image from: https://personacollective.ai/persona/1610

Overall, crazy impressive compared to competing offerings. I don't know if the mouth size problems are related to the race of the portrait, the style, the model, or the positioning of the head, but I'm looking forward to further iterations of the model. This is already good enough for a bunch of creative work, which is rad.

vessenes | 4 months ago

Damn - I took an (AI) image that I "created" a year ago and liked, and then you animated it AND let it sing Amazing Grace. Seeing IS believing. This technology pretty much means video evidence ain't necessarily so.

PerilousD | 4 months ago

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

It’s astounding that 2 sentences generated this. (I used text-to-image and the prompt for a space marine in power armour produced something amazing with no extra tweaks required).

shitloadofbooks | 4 months ago

There is prior art here, e.g. EMO from Alibaba research (https://humanaigc.github.io/emote-portrait-alive/), but this is impressive and also actually has a demo people can try, so that's awesome and great work!

advael | 4 months ago

I tried making this short clip [0] of Baron Vladimir Harkonnen announcing the beginning of the clone war, and it's almost fine, but the last frame somehow completely breaks.

[0]: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Andrew_nenakhov | 4 months ago

Tried to make this meme [1] a reality and the source image was tough for it.

Heads up, little bit of language in the audio.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

[1] https://i.redd.it/uisn2wx2ol0d1.jpeg

zach_miller | 4 months ago

Well, I don't know what to think about this, I don't know where we are going. Maybe I should read some sci-fi from back then about conversational agents?

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

johnchristopher | 4 months ago

Tried my hardest to push this into the uncanny valley. I did, but it was pretty hard. Seems robust.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

marginalia_nu | 4 months ago

> It often inserts hands into the frame.

Looks like too much Italian training data

ardrak | 4 months ago

Have to say, whilst this tech has some creepy aspects, just playing about with this my family have had a whole sequence of laugh-out-loud moments - thank you!

RobinL | 4 months ago

Is it similar to https://loopyavatar.github.io/? I was reading about this today, and even the videos are exactly the same.

I am curious whether you are in any way related to this team.

naveensky | 4 months ago

I am actively working in this area from a wrapper application perspective. In general, tools that generate video are not sufficient on their own. They are likely to be used as part of some larger video-production workflow.

One drawback of tools like runway (and midjourney) is the lack of an API allowing integration into products. I would love to re-sell your service to my clients as part of a larger offering. Is this something you plan to offer?

The examples are very promising by the way.

zoogeny | 4 months ago

For such models, is it possible to fine-tune models with multiple images of the main actor?

Sorry if this question sounds dumb, but I am comparing it with regular image models, where the more images you have of the subject, the better the output images the model generates.

naveensky | 4 months ago

Breathtaking!

First, your (Lina's) intro is perfect in honestly and briefly explaining your work in progress.

Second, the example I tried had a perfect interpretation of the text meaning/sentiment and translated that to vocal and facial emphasis.

It's possible I hit on a pre-trained sentence. With the default manly-man I used the phrase, "Now is the time for all good men to come to the aid of their country."

Third, this is a fantastic niche opportunity - a billion+ memes a year - where each variant could require coming back to you.

Do you have plans to be able to start with an existing one and make variants of it? Is the model such that your service could store the model state for users to work from if they e.g., needed to localize the same phrase or render the same expressivity on different facial phenotypes?

I can also imagine your building different models for niches: faces speaking, faces aging (forward and back); outside of humans: cartoon transformers, cartoon pratfalls.

Finally, I can see both B2C and B2B, and growth/exit strategies for both.

w10-1 | 4 months ago

It's incredibly good - bravo. The only thing missing for this to be immediately useful for content creation is more variety in voices, or ideally some way to specify a template sound clip to imitate.

johnyzee | 4 months ago

oh this made my day: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

!NSFW --lyrics by Biggy$malls

artur_makly | 4 months ago

This is amazing and another moment where I question what the future of humans will look like. So much potential for good and evil! It's insane.

max4c | 4 months ago

Quite impressive - I tried to confuse it with things it would not generally see and it avoided all the obvious confabulations https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

svieira | 4 months ago

It's awesome for very short texts, like a single long sentence. For even slightly longer sequences it seems to lose adherence to the initial photo and ventures into the uncanny valley with exaggerated facial expressions.

A product that might be built on top of this could split the input into reasonable chunks, generate video for each of them separately, and stitch them with another model that can transition from one facial expression into another in a fraction of a second.

An additional improvement might be feeding the system not one image but a few showing different emotional expressions. Then the system could analyze the split input to find out which emotional state each part of the video should start in.

On an unrelated note ... generated expressions seem to be relevant to the content of the input text. So either the text-to-speech understands language a bit, or the video model itself does.
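A minimal sketch of the first step in that idea, splitting the script into chunks on sentence boundaries. This is only an illustration: the function name `split_into_chunks` and the `max_chars` limit are hypothetical, nothing here reflects how Infinity works, and the per-chunk video generation and stitching steps are left out.

```python
import re


def split_into_chunks(text: str, max_chars: int = 120) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Either it fits, or it is a single over-long sentence kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "Hello there. This is a longer script. It has several sentences!"
    for i, chunk in enumerate(split_into_chunks(script, max_chars=40)):
        print(i, chunk)
```

Cutting only at sentence boundaries keeps each clip's speech self-contained, which should make the joins between clips less noticeable.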

scotty79 | 4 months ago

Very cool, thanks for the play.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Managed to get it working with my doggo.

siffin | 4 months ago

Out of curiosity, where are you training all this? In other words, where do you find the money to support such training?

snickmy | 4 months ago

WOW this is very good!!

I have an immediate use case for this. Can you stream via AI to support real time chat this way?

Very very good!

Jonathan

founder@ixcoach.com

We deliver the most exceptional simulated life coaching, counseling and personal development experiences in the world through devotion to the belief that having all the support you need should be a right, not a privilege.

Test our capacity at ixcoach.com for free to see for yourself.

IXCoach | 4 months ago

You need a slider for how animated the facial expressions are.

sharemywin | 4 months ago

I wonder how long it would take for this technology to advance to a point where nice people from /r/freefolk would be able to remake seasons 7 and 8 of Game of Thrones to have a nice proper ending? 5 years, 10?

Andrew_nenakhov | 4 months ago

The website is pretty lightweight and easy to use. The service also holds up pretty well, especially if the source image is high enough resolution. The tendency to "break" at the last frame seems to happen with low-resolution images.

My generation: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

archon1410 | 4 months ago

Max headroom hack x hacker's manifesto! I'm impressed with the head movement dynamism on this one.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

parkaboy | 4 months ago

I need to create a bunch of 5-7 minute talking head videos. What's your timeline for capabilities that would help with this?

nickfromseattle | 4 months ago

Does anybody know about the legality of using Eminem's "Godzilla" as promotional material[1] for this service?

I thought you had to pay artists for a license before using their work in promotional material.

[1] https://infinity.ai/videos/setA_video3.mp4

WaffleIronMaker | 4 months ago

I look forward to dubbed movies that move the face and lips to match the dubbed text, while also using the original actor's voice.

sroussey | 4 months ago

I uploaded an image and then used text-to-image, and in both cases the video was not animated, although the audio was included.

ladidahh | 4 months ago

Putting Drake as a default avatar is just begging to be sued. Please remove pictures of actual people!

LarsDu88 | 4 months ago

The e2e diffusion transformer approach is super cool because it can do crazy emotions which make for great memes (like Joe Biden at Live Aid! https://youtu.be/Duw1COv9NGQ)

Edit: Duke Nukem flubs his line: https://youtu.be/mcLrA6bGOjY

zaptrem | 4 months ago

Oh, this is amazing! I've been having so much fun with it.

One small issue I've encountered is that sometimes images remain completely static. Seems to happen when the audio is short - 3 to 5 seconds long.

SlackingOff123 | 4 months ago

If you had a $500k training budget, why not buy 2 DGX machines?

doctorpangloss | 4 months ago

This is surprisingly very intelligent and awesome, any plan for research paper or full grown project with pricing or open source?

AnnaMere | 4 months ago

So good it feels like I think maybe I can read their lips

dhbradshaw | 4 months ago

It would be amazing to be able to drive this with an API.

ilaksh | 4 months ago

After much user feedback, we removed the Infinity watermark from the generated videos. Thanks for the feedback. Enjoy!

sidneyprimas | 4 months ago

Thank you for no signup, it's very impressive, especially the physics of the head movement relating to vocal intonation.

I feel like I accidentally made an advert for whitening toothpaste:

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

I am sure the service will get abused, but wish you lots of success.

whitehexagon | 4 months ago

Won't be long before it's real time. The first company to launch video calling with good AI avatars is going to take off.

modeless | 4 months ago

I'd love to enable Keltar, the green guy in the ceramic cup, to do this www.molecularReality/QuestionDesk

kemmishtree | 4 months ago

can this achieve real-time performance or how far are we from a real-time model?

billconan | 4 months ago

This is great. Is it open source? Is there an API, and what is the pricing?

android521 | 4 months ago

It completely falls apart on longer videos for me, unusable over 10 seconds.

bufferoverflow | 4 months ago

Hi, there is a mistake in the headline, you wrote "realistic".

dvfjsdhgfv | 4 months ago

Rudimentary, but promising.

lofaszvanitt | 4 months ago
vadiml | 4 months ago

Sadly, it wouldn't animate an image of SHODAN from System Shock 2.

protocolture | 4 months ago

Is it fairly trained?

strogonoff | 4 months ago

Awesome, any plans for an API and, if so, how soon?

jadbox | 4 months ago

Is there any limitation on the video length?

naveensky | 4 months ago

Amazing work! This technology is only going to improve. Soon there will be an infinite library of rich and dynamic games, films, podcasts, etc. - a totally unique and fascinating experience tailored to you that's only a prompt away.

I've been working on something adjacent to this concept with Ragdoll (https://github.com/bennyschmidt/ragdoll-studio), but focused not just on creating characters but producing creative deliverables using them.

bschmidt1 | 4 months ago

Super nice. Why does it degrade the quality of the image so much? It quickly makes it look obviously AI-generated.

fsndz | 4 months ago

Any details yet on pricing or too early?

DevX101 | 4 months ago

This is so impressive. Amazing job.

aagha | 4 months ago

Talking pictures. Talking heads!

barrenko | 4 months ago

Can I get a pricing quote?

siscia | 4 months ago

This is super funny.

atum47 | 4 months ago

accidentally clicked the generate button twice.

sharemywin | 4 months ago

What is the TTS model you are using?

deisteve | 4 months ago

Nice

la64710 | 4 months ago

can we choose our own voices?

toisanji | 4 months ago

great job Andrew and Sidney!

slt2021 | 4 months ago

Dayum

bosky101 | 4 months ago

and now a word from our..

Log_out_ | 4 months ago

quite slow btw

dorianmariefr | 4 months ago

The actor list you have is so... cringe. I don't know what it is about AI startups that they seem to be pulled towards this kind of low brow overly online set of personalities.

I get the benefit of using celebrities because it's possible to tell if you actually hit the mark, whereas if you pick some random person you can't know if it's correct or even stable. But jeez... Andrew Tate in the first row? And it doesn't get better as I scroll down...

I noticed lots of small clips so I tried a longer script, and it seems to reset the scene periodically (every 7ish seconds). It seems hard to do anything serious with only small clips...?

ianbicking | 4 months ago

Given that I don't agree with many of Yann LeCun's stances on AI, I enjoyed making this:

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Hello I'm an AI-generated version of Yann LeCoon. As an unbiased expert, I'm not worried about AI. ... If somehow an AI gets out of control ... it will be my good AI against your bad AI. ... After all, what does history show us about technology-fueled conflicts among petty, self-interested humans?

xpe | 4 months ago

Quick tangent: Does anybody know why many new companies have this exact web design style? Is it some new UI framework or other recent tool? The design looks sleek, but they all appear so similar.

aramndrt | 4 months ago

I tried it with the Drake avatar saying some stuff, and while it's cool, it's still lacking; for example, his teeth partially disappear :S

cchance | 4 months ago

Say I’m a politician who gets caught on camera doing or saying something shady. Will your service do anything to prevent me from claiming the incriminating video was just faked using your technology? Maybe logging perceptual hashes of every output could prove that a video didn’t come from you?
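A rough sketch of what such a check could look like, using a simple average hash as a stand-in for a real perceptual hash. Everything here is an assumption for illustration: the function names are hypothetical, frames are assumed to be already downscaled to a small grayscale grid (e.g. 8x8), and nothing suggests the service logs anything like this.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Return an integer whose bits mark the pixels brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # Append one bit per pixel, most significant bit first.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def matches_log(candidate: int, logged: set[int], max_distance: int = 5) -> bool:
    """True if the candidate hash is within max_distance bits of any logged hash."""
    return any(hamming_distance(candidate, h) <= max_distance for h in logged)


if __name__ == "__main__":
    # Tiny 2x2 "frames" standing in for downscaled grayscale video frames.
    logged_frame = [[10, 200], [220, 30]]
    recompressed = [[12, 198], [225, 28]]
    log = {average_hash(logged_frame)}
    print(matches_log(average_hash(recompressed), log, max_distance=0))
```

A match would suggest a clip came from the service; a miss would only be meaningful if the log is complete and trusted, and if the hash survives whatever editing the clip has gone through.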

jl6 | 4 months ago