I don't know, there have been so many overhyped and faked demos in the humanoid robotics space over the last couple of years that it's difficult to believe what is clearly a demo release for shareholders. Would love to see a demonstration in a less controlled environment.
I always wonder about the safety measures on these things. How much force is in those motors?
This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.
So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work: which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than grasping, releasing, and maintaining the orientation of the object relative to the hand, and I'm curious how much it can do / how much they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one-handed Rubik's Cube demo?
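For what it's worth, the usual answer to the finger-joint question is underactuation: one actuator drives several coupled joints through fixed tendon or linkage ratios, and compliance lets the fingers conform to the object. A toy sketch of the idea below, assuming nothing about Figure's actual hand; the coupling ratios are invented.

```python
# Underactuation sketch: one actuated DoF drives the three joints of a finger
# (MCP, PIP, DIP) through a fixed coupling, the way tendon routing or linkages
# do in hardware. The ratios below are made up for illustration.
import numpy as np

COUPLING = np.array([1.0, 0.8, 0.5])   # rad of joint motion per unit of actuator command

def finger_joint_angles(actuator_cmd: float) -> np.ndarray:
    """Map a single actuator command to three coupled joint angles."""
    return COUPLING * np.clip(actuator_cmd, 0.0, 1.0)

print(finger_joint_angles(0.5))   # [0.5  0.4  0.25] -> finger curled halfway
```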
Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.
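The "tell me where to put things" step is honestly the easy half; the hard part is the vision/AR that produces the item and storage lists. A toy sketch of just the planning part, with made-up items, categories, and storage names:

```python
# Greedy item-to-storage assignment: given detected items and known storage
# (both of which would come from the AR/vision side), emit one instruction at
# a time. Everything here is invented example data.
ITEMS = [("milk", "cold"), ("ice cream", "frozen"), ("pasta", "dry"), ("apples", "cold")]
STORAGE = {"fridge": ("cold", 3), "freezer": ("frozen", 2), "pantry": ("dry", 10)}

def plan(items, storage):
    capacity = {name: cap for name, (_, cap) in storage.items()}
    for item, kind in items:
        for name, (accepts, _) in storage.items():
            if accepts == kind and capacity[name] > 0:
                capacity[name] -= 1
                yield f"put the {item} in the {name}"
                break
        else:
            yield f"no space found for the {item}"

for step in plan(ITEMS, STORAGE):
    print(step)
```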
There’s nothing I want more than a robot that does house chores. That’s the real 10x multiplier for humans to do what they do best.
The demo is quite interesting, but I am mostly intrigued by the claim that it is running entirely locally on each robot. It seems to use some agentic decision making, but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?
The article mentions that the system in each robot uses two AI models.
S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data
and the other S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.
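Reading between the lines, the glue is presumably something like this: the big VLM (S2) refreshes a latent "intent" vector at a few Hz, while the small policy (S1) reads whatever latent is newest and emits joint commands at 200 Hz. A minimal sketch of that pattern, assuming nothing beyond what the article states; the class names, rates, and tensor shapes are my guesses:

```python
# Two-system sketch: a slow VLM (S2 stand-in) updates a shared latent roughly
# every 100 ms; a fast policy (S1 stand-in) runs a 200 Hz loop against the
# latest latent. Shapes, rates, and class names are assumptions.
import threading
import time
import numpy as np

class SlowVLM:                                   # stand-in for S2 (7B VLM)
    def encode(self, image, instruction):
        time.sleep(0.1)                          # pretend inference takes ~100 ms
        return np.random.randn(512)              # latent semantic representation

class FastPolicy:                                # stand-in for S1 (80M policy)
    def act(self, image, latent):
        return np.tanh(np.random.randn(35))      # 35-DoF action in [-1, 1]

latent = np.zeros(512)                           # shared between the two loops

def get_image():
    return np.zeros((224, 224, 3))               # fake camera frame

def s2_loop(stop):
    global latent
    vlm = SlowVLM()
    while not stop.is_set():
        latent = vlm.encode(get_image(), "hand the cookies to the other robot")

def s1_loop(stop):
    policy, period = FastPolicy(), 1 / 200       # 5 ms per tick
    while not stop.is_set():
        t0 = time.monotonic()
        action = policy.act(get_image(), latent) # would be sent to the joint controllers
        time.sleep(max(0.0, period - (time.monotonic() - t0)))

stop = threading.Event()
threading.Thread(target=s2_loop, args=(stop,), daemon=True).start()
threading.Thread(target=s1_loop, args=(stop,), daemon=True).start()
time.sleep(1)
stop.set()
```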
It feels like, although the article is quite openly technical, they are leaving out the secret sauce? So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot. What part of this system understands the 3D space of that kitchen?
How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?
How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?
Figure robots, each equipped with dual low-power-consumption embedded GPUs
Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding? Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which if added in post seems needlessly smoke-and-mirrors for what was a very impressive demonstration (of robots impersonating an extremely stoned human).
This is amazing but it also made me realize I just don’t trust these videos. Is it sped up? How much is preprogrammed?
I know they claim there’s no special coding, but did they practice this task? Special training?
Even if this video is totally legit, I’m burned out by all the hype videos in general.
Interesting timing - same day MSFT releases https://microsoft.github.io/Magma/
Goal 2 has been achieved, at least as a proof of concept (and not by OpenAI): https://openai.com/index/openai-technical-goals/
YouTube link for the video (for whatever reason the video hosted on their site kept buffering for me): https://www.youtube.com/watch?v=Z3yQHYNXPws
Wonder what their vision stack is like. Depth via sensors, or purely visual? And how do they handle distance estimation of objects and inverse kinematics/proprioception? Anyway, it looks impressive.
Imo, the Terminator movies would have been scarier if they moved like these guys - slow, careful, deliberate and measured but unstoppable. There's something uncanny about this.
Does anyone know how long they have been at this? Is this mainly a reimplementation of the physical intelligence paper + the dual size/freq + the cooperative part?
This whole thread is just people who didn’t read the technical details or immediately doubt the video’s honesty.
I’m actually fairly impressed with this because it’s one neural net which is the goal, and the two system paradigm is really cool. I don’t know much about robotics but this seems like the right direction.
Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn.
Stop hosting your videos as MP4s on your web-server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high resolution MP4s.
/rant
When doing robot control, how do you model the control of the robot? Do you have tool_use / function calling at the top-level model, which then gets turned into motion control parameters via inverse kinematic controllers?
What is the interface from the top level to the motors?
I feel like it can't just be neural networks all the way down, right?
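One common answer (no idea if it's what Figure does): the high-level model emits a Cartesian target or a short action chunk, an inverse-kinematics solver turns it into joint angles, and a joint-space PD loop is the actual interface to the motors. A toy sketch for a 2-link planar arm; the link lengths and gains are made up:

```python
# "Tool call" -> IK -> PD torque, for a toy 2-link planar arm.
import numpy as np

LINK1, LINK2 = 0.3, 0.25                       # link lengths (m), arbitrary

def ik_2link(x, y):
    """Analytic inverse kinematics: Cartesian target -> two joint angles."""
    c2 = (x * x + y * y - LINK1**2 - LINK2**2) / (2 * LINK1 * LINK2)
    q2 = np.arccos(np.clip(c2, -1.0, 1.0))     # elbow angle
    q1 = np.arctan2(y, x) - np.arctan2(LINK2 * np.sin(q2), LINK1 + LINK2 * np.cos(q2))
    return np.array([q1, q2])

def pd_torque(q, qd, q_target, kp=40.0, kd=2.0):
    """Joint-space PD controller: the part that actually talks to the motors."""
    return kp * (q_target - q) - kd * qd

# High-level "function call": reach to (x, y) = (0.4, 0.1)
q_target = ik_2link(0.4, 0.1)
q, qd = np.zeros(2), np.zeros(2)               # current joint state from encoders
print(pd_torque(q, qd, q_target))              # torque command for this tick
```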
"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?
At this point, this is enough autonomy to have a set of these guys man a howitzer (read: old stockpiles of weapons we already have). Kind of a scary thought. On one hand, I think the idea of moving real people out of danger in war is a good idea, and as an American I'd want Americans to have an edge... and we can't guarantee our enemies won't take it if we skip it. On the other hand, I have a visceral reaction to machines killing people.
I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.
"Pick up anything: Figure robots equipped with Helix can now pick up virtually any small household object, including thousands of items they have never encountered before, simply by following natural language prompts."
If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?
Why do they make “eye contact” after every hand off? Feels oddly forced.
I get the impression there’s a language model sending high level commands to a control model? I wonder when we can have one multimodal model that controls everything.
The latest models seemed to be fluidly tied in with generating voice; even singing and laughing.
It seems like it would be possible to train a multimodal model that can do that with low-level actuator commands.
Are we at a point now where Asimov’s laws are programmed into these fellas somewhere?
It’s funny… there are a lot of comments here asking “why would anyone pay for this, when you could learn to do the thing, or organise your time/plans yourself?”
That’s how I feel about LLMs and code.
Anyone have a link to their paper?
It's kinda eerie how they look at each other after handover
I don't suppose this is open research and I can read about their model architecture?
They should have made them talk. It’s a little dehumanizing otherwise.
Very impressive
Why make such sinister-looking robots though...?
Wake me when robots can make a peanut butter sandwich
Wow! This is something new.
To focus on something other than the obviously terrifying nature of this and the skepticism it rightfully entails on our part:
A fast reactive visuomotor policy that translates the latent semantic representations produced by S2 into precise continuous robot actions at 200 Hz
Why 200 Hz...? Any robotics experts in here? Because to this layman that seems like a really high rate for updating motor controls. There's no way this is 100% real though. No startup demo ever is.
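200 Hz is actually modest by motor-control standards; joint-level current loops often run in the kHz range. What it really implies is a ~5 ms budget per tick for the policy's forward pass plus communication, which is the demanding part for a neural network. A quick way to sanity-check that budget (the 2 ms inference time below is made up):

```python
# Count deadline misses over one simulated second of a 200 Hz control loop.
import time

RATE_HZ = 200
BUDGET = 1.0 / RATE_HZ                # 5 ms per control tick

def fake_policy_inference():
    time.sleep(0.002)                 # pretend the forward pass takes 2 ms

misses = 0
for _ in range(RATE_HZ):              # one second of control
    t0 = time.monotonic()
    fake_policy_inference()           # compute the next action
    elapsed = time.monotonic() - t0
    if elapsed > BUDGET:
        misses += 1                   # overran the 5 ms deadline
    else:
        time.sleep(BUDGET - elapsed)

print(f"deadline misses in 1 s of control: {misses}/{RATE_HZ}")
```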
Is there a paper? I think I get how they did their training, but I'd like to understand it more.
Does anyone know if this trained model would work on a different robot at all, or would it need retraining?
Is this even reality or CGI? They really should show these things off in less sterile environments, because this video has a very CGI feel to it.
It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?
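Roughly, yes, in a lot of recent work: patchify the camera frames into tokens, run them through a transformer, and either decode discrete action tokens or regress continuous actions from a readout token. A toy sketch of the continuous-action flavor, not any particular paper's architecture:

```python
# Minimal "image in, action out" transformer policy (ViT-style encoder with a
# readout token and a linear action head). Sizes are arbitrary.
import torch
import torch.nn as nn

class ToyVisuomotorPolicy(nn.Module):
    def __init__(self, patch=16, dim=256, action_dim=35):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.readout = nn.Parameter(torch.zeros(1, 1, dim))      # action readout token
        self.pos = nn.Parameter(torch.zeros(1, 197, dim))        # 14*14 patches + readout
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, action_dim)                   # continuous actions

    def forward(self, image):                                    # (B, 3, 224, 224)
        tokens = self.patchify(image).flatten(2).transpose(1, 2) # (B, 196, dim)
        tokens = torch.cat([self.readout.expand(len(image), -1, -1), tokens], dim=1)
        tokens = self.encoder(tokens + self.pos)
        return torch.tanh(self.head(tokens[:, 0]))               # (B, action_dim)

policy = ToyVisuomotorPolicy()
print(policy(torch.randn(1, 3, 224, 224)).shape)                 # torch.Size([1, 35])
```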