I don't know, there have been so many overhyped and faked demos in the humanoid robotics space over the last couple of years that it's difficult to believe what is clearly a demo release for shareholders. Would love to see a demonstration in a less controlled environment.
I always wonder about the safety measures on these things. How much force is in those motors?
This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.
So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work: which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than grasping, releasing, and maintaining the orientation of the object relative to the hand, and I'm curious how much it can do / how much they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one-handed Rubik's Cube demo?
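For what it's worth, the usual answer to the finger-joint question is underactuation: one actuator drives several coupled joints through fixed tendon or linkage ratios, and compliance lets the fingers conform to the object. A toy sketch of the idea below, assuming nothing about Figure's actual hand; the coupling ratios are invented.

```python
# Underactuation sketch: one actuated DoF drives the three joints of a finger
# (MCP, PIP, DIP) through a fixed coupling, the way tendon routing or linkages
# do in hardware. The ratios below are made up for illustration.
import numpy as np

COUPLING = np.array([1.0, 0.8, 0.5])   # rad of joint motion per unit of actuator command

def finger_joint_angles(actuator_cmd: float) -> np.ndarray:
    """Map a single actuator command to three coupled joint angles."""
    return COUPLING * np.clip(actuator_cmd, 0.0, 1.0)

print(finger_joint_angles(0.5))   # [0.5  0.4  0.25] -> finger curled halfway
```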
Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.
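The "tell me where to put things" step is honestly the easy half; the hard part is the vision/AR that produces the item and storage lists. A toy sketch of just the planning part, with made-up items, categories, and storage names:

```python
# Greedy item-to-storage assignment: given detected items and known storage
# (both of which would come from the AR/vision side), emit one instruction at
# a time. Everything here is invented example data.
ITEMS = [("milk", "cold"), ("ice cream", "frozen"), ("pasta", "dry"), ("apples", "cold")]
STORAGE = {"fridge": ("cold", 3), "freezer": ("frozen", 2), "pantry": ("dry", 10)}

def plan(items, storage):
    capacity = {name: cap for name, (_, cap) in storage.items()}
    for item, kind in items:
        for name, (accepts, _) in storage.items():
            if accepts == kind and capacity[name] > 0:
                capacity[name] -= 1
                yield f"put the {item} in the {name}"
                break
        else:
            yield f"no space found for the {item}"

for step in plan(ITEMS, STORAGE):
    print(step)
```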
There’s nothing I want more than a robot that does house chores. That’s the real 10x multiplier for humans to do what they do best.
The demo is quite interesting, but I am mostly intrigued by the claim that it is running entirely locally on each robot. It seems to use some agentic decision making, but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?
The article mentions that the system in each robot uses two AI models.
S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data
and the other S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.
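Reading between the lines, the glue is presumably something like this: the big VLM (S2) refreshes a latent "intent" vector at a few Hz, while the small policy (S1) reads whatever latent is newest and emits joint commands at 200 Hz. A minimal sketch of that pattern, assuming nothing beyond what the article states; the class names, rates, and tensor shapes are my guesses:

```python
# Two-system sketch: a slow VLM (S2 stand-in) updates a shared latent roughly
# every 100 ms; a fast policy (S1 stand-in) runs a 200 Hz loop against the
# latest latent. Shapes, rates, and class names are assumptions.
import threading
import time
import numpy as np

class SlowVLM:                                   # stand-in for S2 (7B VLM)
    def encode(self, image, instruction):
        time.sleep(0.1)                          # pretend inference takes ~100 ms
        return np.random.randn(512)              # latent semantic representation

class FastPolicy:                                # stand-in for S1 (80M policy)
    def act(self, image, latent):
        return np.tanh(np.random.randn(35))      # 35-DoF action in [-1, 1]

latent = np.zeros(512)                           # shared between the two loops

def get_image():
    return np.zeros((224, 224, 3))               # fake camera frame

def s2_loop(stop):
    global latent
    vlm = SlowVLM()
    while not stop.is_set():
        latent = vlm.encode(get_image(), "hand the cookies to the other robot")

def s1_loop(stop):
    policy, period = FastPolicy(), 1 / 200       # 5 ms per tick
    while not stop.is_set():
        t0 = time.monotonic()
        action = policy.act(get_image(), latent) # would be sent to the joint controllers
        time.sleep(max(0.0, period - (time.monotonic() - t0)))

stop = threading.Event()
threading.Thread(target=s2_loop, args=(stop,), daemon=True).start()
threading.Thread(target=s1_loop, args=(stop,), daemon=True).start()
time.sleep(1)
stop.set()
```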
It feels like, although the article is quite openly technical, they are leaving out the secret sauce? So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot. What part of this system understands the 3D space of that kitchen?
How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?
How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?
Figure robots, each equipped with dual low-power-consumption embedded GPUs
Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding? Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which if added in post seems needlessly smoke-and-mirrors for what was a very impressive demonstration (of robots impersonating an extremely stoned human).
This is amazing but it also made me realize I just don’t trust these videos. Is it sped up? How much is preprogrammed?
I know they claim there’s no special coding, but did they practice this task? Special training?
Even if this video is totally legit, I’m burned out by all the hype videos in general.
Interesting timing - same day MSFT releases https://microsoft.github.io/Magma/
Goal 2 has been achieved, at least as a proof of concept (and not by OpenAI): https://openai.com/index/openai-technical-goals/
YouTube link for the video (for whatever reason the video hosted on their site kept buffering for me): https://www.youtube.com/watch?v=Z3yQHYNXPws
Wonder what their vision stack is like. Depth via sensors, or purely visual? And how do they handle distance estimation of objects and inverse kinematics/proprioception? Anyway, it looks impressive.
Imo, the Terminator movies would have been scarier if they moved like these guys - slow, careful, deliberate and measured but unstoppable. There's something uncanny about this.
Does anyone know how long they have been at this? Is this mainly a reimplementation of the physical intelligence paper + the dual size/freq + the cooperative part?
This whole thread is just people who didn’t read the technical details or immediately doubt the video’s honesty.
I’m actually fairly impressed with this because it’s one neural net which is the goal, and the two system paradigm is really cool. I don’t know much about robotics but this seems like the right direction.
Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn.
Stop hosting your videos as MP4s on your web-server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high resolution MP4s.
/rant
When doing robot control, how do you model the control of the robot? Do you have tool_use / function calling at the top-level model, which then gets turned into motion control parameters via inverse kinematic controllers?
What is the interface from the top level to the motors?
I feel like it can't just be neural networks all the way down, right?
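One common answer (no idea if it's what Figure does): the high-level model emits a Cartesian target or a short action chunk, an inverse-kinematics solver turns it into joint angles, and a joint-space PD loop is the actual interface to the motors. A toy sketch for a 2-link planar arm; the link lengths and gains are made up:

```python
# "Tool call" -> IK -> PD torque, for a toy 2-link planar arm.
import numpy as np

LINK1, LINK2 = 0.3, 0.25                       # link lengths (m), arbitrary

def ik_2link(x, y):
    """Analytic inverse kinematics: Cartesian target -> two joint angles."""
    c2 = (x * x + y * y - LINK1**2 - LINK2**2) / (2 * LINK1 * LINK2)
    q2 = np.arccos(np.clip(c2, -1.0, 1.0))     # elbow angle
    q1 = np.arctan2(y, x) - np.arctan2(LINK2 * np.sin(q2), LINK1 + LINK2 * np.cos(q2))
    return np.array([q1, q2])

def pd_torque(q, qd, q_target, kp=40.0, kd=2.0):
    """Joint-space PD controller: the part that actually talks to the motors."""
    return kp * (q_target - q) - kd * qd

# High-level "function call": reach to (x, y) = (0.4, 0.1)
q_target = ik_2link(0.4, 0.1)
q, qd = np.zeros(2), np.zeros(2)               # current joint state from encoders
print(pd_torque(q, qd, q_target))              # torque command for this tick
```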
"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?
At this point, this is enough autonomy to have a set of these guys man a howitzer (read: old stockpiles of weapons we already have). Kind of a scary thought. On one hand, I think the idea of moving real people out of danger in war is a good idea, and as an American I'd want Americans to have an edge... and we can't guarantee our enemies won't take it if we skip it. On the other hand, I have a visceral reaction to machines killing people.
I think we're at an inflection point now where AI and robotics can be used in warfare, and we need to start having that conversation.
"Pick up anything: Figure robots equipped with Helix can now pick up virtually any small household object, including thousands of items they have never encountered before, simply by following natural language prompts."
If they can do that, why aren't they selling picking systems to Amazon by the tens of thousands?
Why do they make “eye contact” after every hand off? Feels oddly forced.
I get the impression there’s a language model sending high level commands to a control model? I wonder when we can have one multimodal model that controls everything.
The latest models seemed to be fluidly tied in with generating voice; even singing and laughing.
It seems like it would be possible to train a multimodal model that can do that with low-level actuator commands.
Are we at a point now where Asimov’s laws are programmed into these fellas somewhere?
It’s funny… there are a lot of comments here asking “why would anyone pay for this, when you could learn to do the thing, or organise your time/plans yourself?”
That’s how I feel about LLMs and code.
Anyone have a link to their paper?
It's kinda eerie how they look at each other after handover
I don't suppose this is open research and I can read about their model architecture?
They should have made them talk. It’s a little dehumanizing otherwise.
Very impressive
Why make such sinister-looking robots though...?
Wake me when robots can make a peanut butter sandwich
Wow! This is something new.
To focus on something other than the obviously terrifying nature of this and the skepticism it rightfully entails on our part:
A fast reactive visuomotor policy that translates the latent semantic representations produced by S2 into precise continuous robot actions at 200 Hz
Why 200 Hz...? Any robotics experts in here? Because to this layman that seems like a really high rate for updating motor controls. There's no way this is 100% real though. No startup demo ever is.
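200 Hz is actually modest by motor-control standards; joint-level current loops often run in the kHz range. What it really implies is a ~5 ms budget per tick for the policy's forward pass plus communication, which is the demanding part for a neural network. A quick way to sanity-check that budget (the 2 ms inference time below is made up):

```python
# Count deadline misses over one simulated second of a 200 Hz control loop.
import time

RATE_HZ = 200
BUDGET = 1.0 / RATE_HZ                # 5 ms per control tick

def fake_policy_inference():
    time.sleep(0.002)                 # pretend the forward pass takes 2 ms

misses = 0
for _ in range(RATE_HZ):              # one second of control
    t0 = time.monotonic()
    fake_policy_inference()           # compute the next action
    elapsed = time.monotonic() - t0
    if elapsed > BUDGET:
        misses += 1                   # overran the 5 ms deadline
    else:
        time.sleep(BUDGET - elapsed)

print(f"deadline misses in 1 s of control: {misses}/{RATE_HZ}")
```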
Is there a paper? I think I get how they did their training, but I'd like to understand it more.
Does anyone know if this trained model would work on a different robot at all, or would it need retraining?
Is this even reality or CGI? They really should show these things off in less sterile environments, because this video has a very CGI feel to it.
It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?
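Roughly, yes, in a lot of recent work: patchify the camera frames into tokens, run them through a transformer, and either decode discrete action tokens or regress continuous actions from a readout token. A toy sketch of the continuous-action flavor, not any particular paper's architecture:

```python
# Minimal "image in, action out" transformer policy (ViT-style encoder with a
# readout token and a linear action head). Sizes are arbitrary.
import torch
import torch.nn as nn

class ToyVisuomotorPolicy(nn.Module):
    def __init__(self, patch=16, dim=256, action_dim=35):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.readout = nn.Parameter(torch.zeros(1, 1, dim))      # action readout token
        self.pos = nn.Parameter(torch.zeros(1, 197, dim))        # 14*14 patches + readout
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, action_dim)                   # continuous actions

    def forward(self, image):                                    # (B, 3, 224, 224)
        tokens = self.patchify(image).flatten(2).transpose(1, 2) # (B, 196, dim)
        tokens = torch.cat([self.readout.expand(len(image), -1, -1), tokens], dim=1)
        tokens = self.encoder(tokens + self.pos)
        return torch.tanh(self.head(tokens[:, 0]))               # (B, action_dim)

policy = ToyVisuomotorPolicy()
print(policy(torch.randn(1, 3, 224, 224)).shape)                 # torch.Size([1, 35])
```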