That’s how they win. Only half joking. AI LLMs are being trained to be pleasant to interact with. Add a body that can gesture in ways that manipulate our mammalian emotions, and we will be putty in their hands.
From ChatGPT4: ( GPU ) ~~~~~ In the context of the robotic AI system you described, "closed-loop behavior" refers to the robot's ability to execute tasks or actions based on feedback from its environment, continuously adjusting its actions based on this feedback to achieve a desired outcome. This process forms a loop: the robot performs an action, receives new data from its environment through sensors (like cameras and microphones), processes this data to evaluate the outcome of its action, and then decides on the next action based on this evaluation. The "closed-loop" aspect emphasizes the continuous, self-regulating nature of this process, where the output or response of the system directly influences its next action, creating a feedback loop. This is in contrast to "open-loop" systems, where actions are not adjusted based on feedback or outcomes but are pre-defined and follow a set sequence without real-time adaptation. Closed-loop behaviors in robotics allow for more adaptive, responsive, and intelligent interactions with dynamic environments. ~~~~~
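To make the open-loop vs closed-loop distinction concrete, here is a toy Python sketch (my own made-up single-joint example, nothing from Figure or OpenAI): both controllers try to drive a joint to a target of 1.0, but only the closed-loop one recovers from a disturbance.

```python
# Toy contrast between open-loop and closed-loop control of a single "joint".
def open_loop(start: float, steps: int = 10) -> float:
    pos = start
    for _ in range(steps):
        pos += (1.0 - start) / steps  # pre-planned increments, no feedback
    pos -= 0.3                        # a disturbance goes uncorrected
    return pos

def closed_loop(start: float, steps: int = 60, gain: float = 0.2) -> float:
    pos = start
    for i in range(steps):
        if i == steps // 2:
            pos -= 0.3                # same disturbance...
        pos += gain * (1.0 - pos)     # ...but feedback keeps correcting toward 1.0
    return pos

print(f"open loop ends at   {open_loop(0.0):.2f}")   # ~0.70, off target
print(f"closed loop ends at {closed_loop(0.0):.2f}")  # ~1.00, back on target
```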
Here in Canada, we had a recent test run of a UBI style system during COVID-19 with a cheque every two weeks for CAD$1000.00 for the duration of the pandemic.
No. 3 years on the good timeline. 4 years and a revolution on the bad timeline. 5 years and bad stuff on the worst timeline I'll acknowledge. 7 years and then nobody is around anymore in the timeline we don't talk about 👍
It's interesting to see the speed and fluidity of the robot, as that's not often seen at the moment. It also doesn't seem to have a jerking motion when it moves, which is good. I would say the response time needs to improve, as we live in a busy world and people won't have the patience for a delayed reply, plus it doesn't feel as natural. The difficulty I see there, though, is that it will need to know when it is OK to take its turn speaking as opposed to continuing to listen, as the person may have more to say. I'm sure such things will be improved, but the inflections in the speech feel natural.
This delay issue has been tackled by individuals before, to partial success. I am excited to see what actual big companies will do about it. And you've guessed it, the biggest issue is knowing when the human stops speaking. We do it by picking up subtle cues like intonation and sentence structure. Considering that it's implied Figure 01 uses speech-to-text, intonation is probably lost. From what I've seen, the best solutions for now are to have a minimal delay plus the ability to abort speech when interrupted. Those have their own issues. For example, the AI might stop talking mid-sentence if someone coughs or speaks to someone else in the immediate vicinity.
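For anyone curious, here is a rough Python sketch of the simplest "wait until the human stops speaking" check: silence detection on audio frames. The thresholds, frame source, and the idea that Figure does anything like this are all assumptions on my part.

```python
# Minimal silence-based end-of-speech detection, assuming 16 kHz mono float32 frames.
import numpy as np
from typing import Iterator

SILENCE_RMS = 0.01        # energy below this counts as silence (arbitrary threshold)
END_OF_SPEECH_SECS = 0.7  # how much trailing silence ends the user's turn

def frame_rms(frame: np.ndarray) -> float:
    return float(np.sqrt(np.mean(frame ** 2)))

def wait_for_end_of_speech(frames: Iterator[np.ndarray], frame_secs: float = 0.02) -> list[np.ndarray]:
    """Collect audio until enough trailing silence is seen, then return the utterance."""
    utterance, silence = [], 0.0
    for frame in frames:
        utterance.append(frame)
        if frame_rms(frame) < SILENCE_RMS:
            silence += frame_secs
            if silence >= END_OF_SPEECH_SECS and len(utterance) * frame_secs > 1.0:
                break
        else:
            silence = 0.0
    return utterance

# Synthetic usage: ~1 s of "speech" (noise) followed by ~1 s of silence.
rng = np.random.default_rng(0)
speech = [rng.normal(0, 0.1, 320).astype(np.float32) for _ in range(50)]
quiet = [np.zeros(320, dtype=np.float32) for _ in range(50)]
collected = wait_for_end_of_speech(iter(speech + quiet))
print(f"collected {len(collected)} frames before deciding the speaker was done")
```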
Always find your videos interesting. Sometimes I wonder if I’m the only one who is struggling to keep up with the speed of advancement. I could spend all day every day learning about this tech and I would still feel behind the curve. I work in tech, I’m 51, spent my life being ahead of the curve until now…. I guess my time being the expert in the room is sunsetting, I feel the next gen team will have to work harder and smarter than my gen did. Not sure how I feel about this, not sure how I feel about it actually matters anymore.
This is the most wonderfully self aware thing I've read online for a long while. Thank you. Also I agree, thankfully we now have AI and soon brain chips and robots to help us think and work harder and smarter.
@@hellblazerjj thanks, you put a smile on my face. What a world we live in, perhaps being a passenger this time around will be more fun than flying the plane.
I'm on the low end of the tech knowledge curve and watch these types of videos to help me keep up with the real-world changes. It seems humans will soon be outpaced by AI machines, and transhuman evolution will be the only way to keep up with our creations. Of course the wealthiest will be the first to receive such augmentation. God help us!
Explanation of the GPU Model Text at 12:30 - As far as I understand it... Assume pre-training of certain actions for the robot that would be useful. The compute model that processes the stored memory, speech and images also decides on a response. The same model is used to pick (from interpretation of its own response) which of its learned actions to use, combined with the object info specific to the scene, to figure out the movements, speed and other physical forces for proper interaction with the environment. Conceptually, as bystanders, we see no difference in this part of the functionality; both Figure 01 and RT-2 can watch-to-learn skills. However, it's a matter of style: applying closed-loop behavior weights to the GPU before the 'policy' (the robotic movement code) is like conforming the learned skill to the task at hand. This may do something like cause a pick-up or drop action to use more finesse for a delicate item like a tomato, or cause a deposit action to give the paper trash a small toss into the basket if it's not too far but also not within reach. Let me know if I'm wrong or missed something.
Wow 😮 finally getting there! Can't wait to see more massive improvements in the next few months, as we knew would happen since everything is developing 10x faster than in 2023! Next year AGI will become self-aware. Exciting, and scary if it falls into the wrong hands
This is so awesome. I've been hoping for robots that can help around the house for a long time. In my experience, if you can't do everything yourself it can suck to have to ask humans for help, for a bunch of reasons, but a robot would always be there. And they don't just do things around the house, you can also have a conversation with them! I love it!
The thing about Boston Dynamics is that Atlas, for example, has the craziest articulation, agility, etc. What it can do is only possible because of its actuators. Imagine when they load one of these models onto him! Or the dog... I mean, we are talking next level.
*Summary* The demonstration highlights several impressive capabilities of this robot: - It can perceive and describe its surroundings through vision and natural language understanding models. - It can engage in full conversations, comprehending context and prompts like "Can I have something to eat?" - It uses common sense reasoning to plan actions, like providing an apple since it's the only edible item available. - It has short-term memory to follow up on commands like "Can you put them there?" by recalling previous context. - The robot's movements are driven by neural network policies that map camera pixels directly to dexterous manipulations like grasping deformable objects. - A whole-body controller allows stable dynamics like maintaining balance during actions. The key innovation is combining OpenAI's advanced AI models for vision, language, and reasoning with Figure AI's expertise in robotics hardware and controllers. Figure AI is actively recruiting to further scale up this promising approach to general, multi-modal robotics through leveraging large AI models. Companies and researchers effectively combining the cutting-edge in large language models with advanced robotic hardware/controls are emerging as leaders in pushing embodied AI capabilities forward rapidly. There is a sense of optimism that general, multi-purpose robots displaying intelligent behavior are now within closer reach through neural network approaches rather than classic programming paradigms.
I think what they mean by "all neural network" is like how a hand isn't the brain but is connected to it, so it's part of the network without the data or programming being in it specifically. Like how the brain doesn't just hold the info but is the info in relation to the body and its differences.
To be a bit more accurate, I don't think it's quite "to see if the human is done speaking"; I think that's the delay from the time it takes to actually transcribe the speech into text first. There's a framework built on top of Whisper that allows you to constantly stream the transcription live (instead of waiting for the entire audio chunk first), and even that has a 3-4 second delay between an audio chunk being streamed and the transcription of it being done.
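As a rough illustration of why even "streaming" transcription lags, here is a sketch that re-transcribes a rolling buffer with the open-source openai-whisper package. The chunking scheme and window size are arbitrary choices for illustration; this is not the framework mentioned above, and not whatever Figure actually uses.

```python
# Re-transcribe only the most recent window of a growing audio buffer.
import numpy as np
import whisper  # pip install openai-whisper

SAMPLE_RATE = 16_000
model = whisper.load_model("base")

def transcribe_rolling(buffer: np.ndarray, window_secs: float = 10.0) -> str:
    """Transcribe the latest window of audio; earlier passes only 'catch up' after the fact."""
    window = buffer[-int(window_secs * SAMPLE_RATE):].astype(np.float32)
    return model.transcribe(window, fp16=False)["text"]

# Example: feed 2-second chunks (silence here, a mic stream in practice) into the buffer.
buffer = np.zeros(0, dtype=np.float32)
for _ in range(3):
    chunk = np.zeros(2 * SAMPLE_RATE, dtype=np.float32)  # stand-in for mic input
    buffer = np.concatenate([buffer, chunk])
    print(transcribe_rolling(buffer))
```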
My guess is that they have something like a Jetson Orin onboard, which is able to dynamically select from what is effectively a finetuned RT-1 for the behavior library. That GPU would be low-power and capable of running the smaller RT-1 model at the rate they claimed (the RT-2 model would be too large). RT-1 showed that if the tasks are simple enough, you don't need a particularly large model (you could even use a LoRA-type approach on the transformer stem). Alternatively, it could be a larger library running on an external server-grade GPU (notice that the robot is still tethered, though the update rate would likely work over wireless too). The motion planning from the VLA model would then be similar to how the RT platforms do it, using a real-time kinematics control model. DeepMind already figured out a working solution, so why reinvent the wheel when you could simply stick GPT-4V on top of it.
Because we're more likely to buy one for our household if it acts familiar, like a human. I think it's about making the public accept them into society
I contend that the compute necessary to run an AGI robot cannot fit within the small space available inside a humanoid robot. This means that without an umbilical cord (which this one has), the controlling computer must communicate via a very fast bidirectional link; 5G comes to mind. The data rate for live video and more will be very high. The cost of the robot plus data transmission plus compute in a supercomputer will perhaps be prohibitive.
That GPU thing you were looking at appears to indicate that the physical behaviour is pre-programmed/'learned', but that the AI model is choosing from a list of actions. i.e. It has been trained to stack dishes, hand over apples, and pick up rubbish. The AI on the inside, though, is deciding independently whether to push the 'pick up rubbish' button or the 'hand over apple' button depending on context. Not quite a truck rolling downhill, but certainly less impressive than my initial reaction watching that robot video.
It still demonstrates that we can NOW have robots that can be trained to do specific tasks, and that can replace humans in a lot of jobs if they're trained for them.
Probably not a long shot from "I think I did pretty well; the apple found its new owner; the trash was removed " to "I think I did pretty well; the Earth found its new owners; the humans were removed."
FINALLY! The real thing, ready to fit in any smart rich person's house AND get the real work done! No more investor dream-ons. No more 'reasonably working' prototypes & lab models. This is it. The starting gun for a robot world of the tomorrow-upon-us-now.
The model that runs on the GPU selects the inputs/outputs for the joints based on some vector representation. That representation is mapped from unpacking the OpenAI prompt response, and the subsequent policy activation translates it into dynamic parameter inputs via a trained approximation model, which is basically the Figure 01 side of things.
It would be wise to make consumer models strength-limited, possibly with plastic parts that break before too much force is applied. But once they make themselves, it's over haha
1:09 “So I gave you the apple, because it’s the only, uh, edible item I could provide you with from the table.” It used “, uh,” in its speech, and contractions. Wow, even Lt Cmdr Data couldn’t do that! 2:08 “I.. I think I did pretty well” And it can stutter when flustered! And it has a vocal fry! Amazing!
Contractions aren't anything new for LLMs. And while not ubiquitous, speech disfluencies (um, uh, stuttering, etc.) are present in other previous TTS models as well. (Not so much in any ElevenLabs model as far as I've heard, though.) So while this is definitely a very high quality TTS model, those features that make it sound real aren't anything new.
It's already here. It just has the intelligence of an 8-year-old boy, and it's only increasing. The only reason we keep "re-inventing" AGI is because we keep moving the goalposts every time we see it
@@Souleater7777 If there is a machine with the intelligence of an 8-year-old, I've not seen it. I doubt we have anything as smart as a rat, but I tend to think of intelligence as the ability to learn and adapt. So far most of these machines have to be pre-trained, so they don't learn afterwards. It's just a method of programming them using ML instead of hand-crafted code. AGI is defined as having the ability to generalize to any task; so far everything we have is narrow. Sure, ChatGPT can generate any kind of text, but it's still just doing the one thing, generating text. I think if we do currently have proto-AGI, it's one of the video-game-playing programs, but I wouldn't say they are as smart as a human child just yet. I do think, though, that when we have something as smart as a dog, we'll have something truly useful in lots of ways, hence the generalization.
@@Souleater7777 Nothing yet has the ability to learn new tasks; that's what the General in AGI stands for. I would say that the smartest things we have right now aren't quite as smart as a rat in that regard. But when we have something as smart as a dog, then we'll have something that can be useful in lots of ways, and is generalizable. No need to move goalposts until we reach that first one. I suspect it will be one of the programs that plays video games, or something that comes from that. Especially if one of those agents is adapted to use robots.
That robot which cooks was not teleoperated; an operator taught it and after that it generalized that knowledge. Quite a practical scenario. But learning from videos would be even better.
@@13371138 Sure did, but it feels silly to suggest; there are so many reasons why it would be dumb as f to cheat on this... but yeah, I don't want to put on my foil hat, but it's a skeptical aspect. I also didn't like the final reply where his voice went up as he said his goodbye; that's something humans do, but I figure trained voices can't balance that waveform yet.
The only thing that makes me question whether or not this is real is how it just dropped the apple into his hand, as if he didn't expect it to hand it to him... Sure, the voice could be synthetic, but after that Google video where they got in trouble for faking stuff, I have a hard time believing this.
…why would it be faked though? Figure01 may have been instructed explicitly to drop the apple into his hand rather than place it for the flair it adds to the video
Google is Google; OpenAI is a different company. If Google thinks its language models are too dangerous for the public and is faking half the stuff, you can be sceptical of them. Currently OpenAI has the benefit of the doubt, until they mess up.
What do you mean, why would it be faked? What a ridiculous statement... For the same reason any company would fake something: fake it till you make it... get that funding & hype. Google did the same thing @@eIicit
GPU. It basically means it takes things in through both text and images, so it interprets based on both sight and concept using AI. It creates files to be stored locally in the robot's memory, called weights. So a weight is like a skill. It then loads the correct weight, or skill, to run to complete a policy, i.e. a series of actions like the rotation and bending of arms to move in three dimensions. Those weights are loaded onto the GPU and run, like loading with a traditional RAM/GPU memory setup. The AI keeps track of which files have become faulty and maintains its actions, sort of like maintaining versions of software, and runs the correct program for the correct action. In layman's terms, it's a computer in a robot layered with AI to do several different things to make it all work seamlessly. I've been waiting for them to do this. Been thinking of starting my own project.
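To illustrate the "a weight is like a skill" idea, here is a tiny PyTorch sketch: each skill is just a saved set of weights for the same network architecture, and "loading it onto the GPU" means copying those numbers into the model before running it. The toy MLP, file names, and skill list are made up for illustration.

```python
# Each "skill" is a checkpoint of the same policy network, hot-swapped before use.
import torch
import torch.nn as nn

def make_policy_net() -> nn.Module:
    # One shared architecture; only the weights differ per skill.
    return nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 16))

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = make_policy_net().to(device)

# Pretend these checkpoints were produced by training on each skill.
for skill in ["pick_up", "place", "hand_over"]:
    torch.save(make_policy_net().state_dict(), f"{skill}.pt")

def run_skill(skill: str, observation: torch.Tensor) -> torch.Tensor:
    # Hot-swap the weights for the requested skill onto the device, then run it.
    policy.load_state_dict(torch.load(f"{skill}.pt", map_location=device))
    with torch.no_grad():
        return policy(observation.to(device))

action = run_skill("pick_up", torch.randn(1, 64))
print(action.shape)  # torch.Size([1, 16]) -- stand-in for motor commands
```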
It makes it sound more human. I like it. Same with the cute "the apple found its new owner" line. You want your robot overlords to have a sense of humour don't you?
When Google demoed their whatever-it-was-called that makes phone calls and reservations on your behalf, it did the same. Conversational voice synthesis AI is trained to imitate human imperfections to sound natural.
I'm just imagining all the HAL 9000 scenes from 2001: A Space Odyssey done with the same voice: "I'm sorry, uh, Dave! I, uhm, am afraid I can't do that!" He wouldn't be such a charmer with these stutters and uhm's!
To the GPU sentence: I think "the same model" is like the Minecraft Voyager paper architecture, where the model is dissecting its actions into a sequence of subtasks. Each simple task is written in code, and if it works, it gets saved into the skill library and just gets reused. The neural network policies are presumably a kind of reinforcement learning setup like in OpenAI's Gym. Have a look at the Python module stable-baselines3 and you'll see what the network policies are about. But I'm also just guessing from experience.
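For a concrete sense of what a "neural network policy" means in RL tooling, as suggested above, here is the standard stable-baselines3 pattern: train a policy in a Gymnasium environment, then query it for actions. The Pendulum task is just a stand-in and has nothing to do with Figure 01.

```python
# Train a small policy and ask it for an action, using stable-baselines3.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=5_000)  # tiny budget, just to show the API

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)  # the policy maps observation -> action
print(action)
```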
@@ossianravn I think he is guessing based on the abilities of the AI. Especially since they are partnered with OpenAI, it wouldn't be out of the realm of possibility that they were given early access to GPT-5.
Seems at least partly scripted to me. When the guy says 'What do you see?', the robot knows to describe only what's on the table, not the entire room. Also, the guy says 'Pick up the trash' and the robot knows to pick up the trash AND put it in the basket. The presentation seems a little off to me.
This is very fascinating! While it's cool to see what it can do already, it would be even better to see what the robot can't do yet, or which tasks it still fails at. Nonetheless, thanks for sharing!
What we see is of course a demo. What I've seen with these AIs that have a semblance of memory is that they go crazy. It depends on the data they were trained on, but they are almost guaranteed to do seizure-speak every once in a while. Imagine if Figure 01 prompts its body to go into a seizure. They seem to have filters in place, but those could fail.
There are actually 2 cups in the scene; the robot saw it better than you 🤣 Now seriously, the cup inside is included in what the vision model is labeling; nobody told the robot "don't include the objects inside the drying rack in the description".
In the thumbnails for suggestions shown after the video ends are a couple from _The Late Show_ with Stephen Colbert. Makes me think, next we'll be seeing a robot like this as a guest on a late-night talk show. Then, it will be the guest host of a late-night talk show. Oh, imagine it interviewing (and bantering with) Neil deGrasse Tyson!
GPU closed loop... does that mean that if Corey were to move his hand whilst the robot passed him the apple, the apple would fall to the table, as the planning/action/feedback loop is simply not fast enough to pick the correct action to place the apple in a suddenly moving outstretched palm?
Sounds like the robot has a set of learned, visual/motor behaviors stored in the form of parameter weights. These only function when loaded into a neural network, which is to say the physical substrate of the GPU. Which of these behaviors gets loaded is decided in the same way a chatbot may decide to call on some tool, like a browser or calculator, in responding to the chatbot user. (My best guess)
"The apple found it's new owner."
That's adorable... We're all going to die, aren't we?
+ The way he walked away while it was still talking to him... We are fucking dead.
We probably will live better.
@@Chavagnatze yeah, seemed kind of rude!! The robot will remember.
Yep, I was thinking exactly that as I was watching, more along the lines of "it's over. it's just a matter of time". I am actually not an AI doomer, but I couldn't really help it, perhaps all the dystopian Hollywood stuff has done its job. Anyway, if anything happens, at least it'll be more interesting/exciting than dying of old age, or some cumulative effect of random age related diseases.
nah, robot waifus will make it worth it
I am astonished at the breakthroughs being announced almost daily at this point.
In a year from now, it will just be a live stream where the updates come as continual Breaking News.
Why? Hundreds of thousands of people have been working on it for decades now. Most are smart. If you focus that much brainpower, and now money, on one topic, you will get results.
@@ich3601 the significance and frequency of breakthroughs are increasing, and most people don't keep up
Most of it is just clickbait tbh
@@NickDrinksWater exactly 💯
Going from gpt 3 to this and sora in a year is nothing
"Great, can you explain why you did what you just did while you pick up this trash?"
"I'm sorry Dave, I'm afraid I can't do that"
*Picks up the human and drags him to the dumpster while saying, "because you commanded me to, you asshole"*
I was waiting for it to say that!
"And, before you ask me to rotate a pod, I am very much aware of your fragility as well as the mass, aerodynamical properties and hardness index of these plates in front of me."
"Can I have something to eat?" Figure 01: "No." -End of presentation 😂
My robot is gonna be like, "Haven't you had enough today fatty?"
Like a bad wife 😂😂
"I only have a 5 hour battery life, so I have to eat every 5 hours to live. What's your excuse, human?"
"Hey jeeves go make me a sandwich" Figure 01: No.
I'd buy that for a dollar.
I'll be genuinely impressed when this is done live in front of free press. Until then it's a promising but well rehearsed demo.
That's fair, but considering how much a group like OpenAI has invested, I think it's fair to take them on good faith
😂 I bet you were genuinely impressed with Tesla's remote-controlled bot.
@@ALFTHADRADDADNah. We've had plenty of large companies pulling bs.
@@harmless6813 Yes, of course, and if this was 10 years ago, or even five years ago, there would be every reason to assume this was BS, but the technologies shown here already exist; especially at OpenAI, which is at the absolute cutting edge, consistently ahead of everyone else in AI.
A dog can do most of that too if rehearsed. Except speak, and the apple will have bite marks on it.
Here I am, brain the size of a planet and you have me doing basic kitchen chores.
Well, at least you're not opening doors...😉
@@Sajuuk Like the yellow dog one with Carl issues?
Sounds like job satisfaction par excellence.
And it still barely manages.
Get back to work, Marvin.
I would be more impressed if it asked, "You want me to put the plate that had trash on it back in the rack with the clean plates?"
I would be more impressed if it had boobs and said: clean your trash yourself!
you realize how stupid that question is right?
@@JohnnysaidWhat and you can't identify a sentence meant to entertain. Are we all caught up?
I hope people realise they gave it Sam Altman's voice
I'm curious why it picked up the apple with its right hand before transferring to its left hand to pass to the man.
I think most people are under the assumption that software engineering jobs and other "information only" jobs are going to go away, but skilled trades jobs are safe. Well, it's only partially right. Office-type jobs are going to go first, but I think mostly everything else is not far behind. This is going to WRECK the job market - we REALLY need to get a handle on how to transition the economy, or we are screwed.
Maybe we do it as we did last time - introduce even more bullshit jobs without purpose.
Take a look at the "Moore's Law for Everything" PDF on Google. Wes has made a video about that. Interesting points there
UBI and universal free education.
I honestly think they don't care about who is going to be without jobs. The economy is going to change completely, and people are trying to see the future with an outdated model. Billionaires are building bunkers, hotels in space and spaceships. I think they're planning to escape and let us die. They will own everything and won't need consumerism. But I have a very pessimistic mind, so I might be completely wrong.
Everyone wants a Jarvis AI like Tony Stark. How many humans are like Tony Stark? Humans with original thoughts like Steve Jobs & Sam Altman will be fine. Average humans won’t be.
Reality is merging with science fiction. I guess I'll finally have time to learn the flute.
In the news: "Man murdered by robot that got angry at bad play and shoved a flute up the man's ass."
I wish to learn... how to bang a desk properly, like those in my childhood back in the day
Figure 01, hand me ear plugs.
Ok Is this reference to Captain Picard - The Inner Light ?
To entertain your new masters
"Hey figure 01, pick up all these hobos and put them in the soilent green hopper."
Hmmm, cookies!
💀
They're not hobos, they're "illegals"... or... if the leftists win this time, Stalin already had a model calling "undesirables" "criminals"... heck, we've been doing that in the USSA since the Harrison Narcotics Act of 1914 was passed...
Hey, figure 01, where do you think the human goes next?
@@KanedaSyndrome
"to hell"
My understanding of the explanation of loading action weights onto the GPU is this (and it's quite interesting!): the main OpenAI model does not control the robot directly. Instead, there is an action controller network, and a separate variation of that controller is trained for every action the robot can perform. (So for instance, perhaps there's a version for "pick up object" actions, a version for "place object" actions, a version for "hand off object" actions, etc.)
They're all re-trained variants of the same network which takes in images at 10fps and a task description, and outputs robot control "keyframes" at 200/sec. When the main OpenAI model decides it needs to perform an action, it hot-swaps out the current action model for a relevant one -- so if "pick up" is currently loaded and it wants to now place the object down, it'll unload the "pick up" model and load up the "place object" model instead. These controller keyframes get passed to the baseline stability and control model which determines how to move each motor/servo/etc. to achieve the desired keyframe, at 1,000 adjustments per second.
In other words, the main LMM decides *what* to do, then loads up a "procedural memory", basically (i.e. a pretrained action model) that looks at the world and quickly decides the *actions* to achieve that task, which then get passed to the control model to decide *how* to perform those actions.
I think the incredible speed and reaction time is a result of the fact that action models are separate from the main LMM, meaning they can be smaller (since they only have to operate with a more limited output space) and don't get interference from the main model while it's thinking (which is also why it can talk to you while it's doing things).
A huge benefit to this approach is that it would be modular. You can just train up new variants of the action model for any action you want the robot to learn to perform, and just plug-and-play it into the bot's library of known skills. The LMM is already general enough to be able to pick the new model when appropriate, based probably on just a name and description of it in the library; and the models all output the same format of action keyframes, so they're all compatible with the base control model by default.
Another cool thing about this approach is that it's very similar to how a human brain works. When we learn to perform actions, we do so by forming procedural memories, sometimes called muscle memories. These are stored directly in the motor cortex, the part of the brain that plans and controls movements. In fact, the premotor cortex plans the movements -- like these hot-swapped models -- and the rest of the motor cortex executes them. By storing them this way, we're able to -- once we've learned to do something -- perform the task without thinking much about it, leaving us open to think about other things. It's why you can walk and talk, or tie your shoes and wonder what's for dinner, or drive while jamming out to your favorite music, etc.
So in a way, the action models are procedural memories in the premotor cortex, and the base control model is like a mix of the motor cortex executing those actions and the cerebellum keeping balance. Meanwhile, the main LMM can see stuff (like your visual cortex), process language (like Wernicke's Area of a brain), and generate new language (like Broca's Area).
If they throw in a vector database to store episodic memories (i.e. memories of personal experiences) like a hippocampus, then it's crazy cool to think about how we're effectively building a brain with the *structure* of a human brain.
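Here is a schematic Python sketch of the control hierarchy described above, using the rates from this comment (10 Hz vision, 200 Hz keyframes, 1 kHz whole-body control). Every class and number is a stand-in to illustrate the layering, not Figure's actual code.

```python
# Layered control: LMM picks an action model; the action model emits keyframes
# from camera images; a whole-body controller turns keyframes into motor ticks.
from dataclasses import dataclass
import numpy as np

@dataclass
class ActionModel:
    name: str
    def keyframes(self, image: np.ndarray, task: str) -> np.ndarray:
        # In reality: a small pretrained network; here: 20 keyframes per image
        # (20 keyframes x 10 images/sec = 200 keyframes/sec).
        return np.zeros((20, 24))  # 24 = pretend number of joint targets

def whole_body_control(keyframe: np.ndarray) -> np.ndarray:
    # 1 kHz layer: 5 motor ticks per keyframe (5 x 200 = 1000 per second).
    return np.tile(keyframe, (5, 1))

library = {"pick_up": ActionModel("pick_up"), "place": ActionModel("place")}

def run(task: str, chosen_by_llm: str, camera_frames: list[np.ndarray]) -> int:
    model = library[chosen_by_llm]   # the "hot swap" step
    torque_ticks = 0
    for image in camera_frames:                      # 10 Hz perception
        for kf in model.keyframes(image, task):      # 200 Hz keyframes
            torque_ticks += len(whole_body_control(kf))  # 1 kHz control
    return torque_ticks

print(run("pick up the apple", "pick_up", [np.zeros((224, 224, 3))] * 10))  # 1000 ticks per second
```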
I mark your comment for later inspection
Leaving a comment to trace the grail of knowledge u shared
Wow, thanks for this comment! So insightful!
+1
A year or two ago, I saw a video with Elon Musk and some of his senior people talking about a myriad corner cases, and trying to account for them all i.e. bicycle on the back of a car vs on the road. Is it the same problem and would the solution be similar? A few months ago Musk talked about replacing a lot of C++ code with neural nets. Your comment makes me wonder how close we really are to a general purpose humanoid robot.
Wes 6 months ago: "these robots are kinda dumb"
Wes after seeing the latest Figure 1 robot in action: "I, for one, welcome our new robot overlords"
One of the reasons we get along with dogs is because they have the ability to recognize what pointing means... So good point, I bet early on that will be key for interaction with AI droids.
Most of the inbred dogs today have no idea what pointing means. Even the stupid pointer that lives with me can't figure it out.
I'm a futurist but even this gave me a slight uneasy feeling. It literally sounds like a human, waking up into a new world and discovering his/her surroundings.
You're a biological machine, so you're a robot too. This isn't surprising or uneasy for neuroscientists.
Uncanny valley territory.
Ok, now I got actually shocked for real. I did not expect this for at least a year, to be honest.
At that speed it will probably turn out that OpenAI accidentally created god and that is what Q-star is.
They Probably Have Real Life T-850s In Their Lab Helping Them Out.
Because AI helped them build newer AI faster than they ever could have by themselves. What you’re noticing is what is happening when we say that we are seeing progress trend exponentially now.
Yeah. But it's good that it's already here though. The thing is, many things get demonstrated but don't reach the public and public use for a long time.
This one actually did SHOCK THE ENTIRE INDUSTRY!
Was that a simulation of Sam Altman's voice?
Doesn't sound at all like Altman's voice to me.
Sounded kinda like the same dude talking to it
I thought it was more like Jordan Peterson. 🤷
Nah y'all are all wrong. That Obama.
@@RhumpleOriginal good ear, u right
About the GPU. I wrote a paper for my cybersecurity class. The idea was to limit the possibility of a probabilistic perturbance in the tech that could cause it to gain self-awareness via fine-tuning, RAG, etc. Having a power-intensive consumer-grade GPU compute an entire neural network's weights for AGI presents thermodynamic and computational challenges, especially because we haven't figured out how to compute only the necessary weights and have to compute ALL the weights each time, unless it is running in a cloud somewhere (which is a great technicality, because it prevents superintelligence from being out in the wild; it would take too many resources to compute a neural net that large). The idea was to compute only portions of the neural net as a forcing function of the needed compute. Tying weights to the outputs that need them, and only passing the weights of those needed portions, lowers the chance of rogue AGI. I assume that they have some type of backdoor function for that.
@@Hi98765 It's not as sketchy as it sounds. I honestly think this is a very thoughtful solution. I would prefer that my hardware doesn't have the ability to become alive.
Does that mean the compute is not distributed? I always assumed robots would have a GPU or equivalent in each limb etc, a la Rodney Brooks ...
The way you word this makes it sound as if training large models will only ever take on the order of days. That's just obviously faulty logic when it comes to the rate of advancement in AI, and also when it comes to processor and memory speed advancements. Something that would've taken days to render 10 or 20 years ago can be done basically in real time nowadays. That same advancement in hardware will affect AI training, especially because we are only really starting to see hardware that's custom-designed specifically to train models. As that advances, and as transistors and CPUs go truly 3D (just Google "3D stacked Moore's law" to see what will be happening in the next 10-20 years)... when CPUs have 5 or 10 layers, that's only a 5-10x performance gain. What happens when we figure out how to dissipate the heat from 100 layers? 1000? A GPU that's 100-1000x as performant as the ones we currently have could compute a future language model in likely seconds or minutes. Imagine having robots running around everywhere with GPUs in them. The risk definitely is there that AGI could easily get loose.
@@gewgleplussuux5756 by then I assume we would have more safety mechanisms in place. This is just one of many steps needed to contain the risk.
@@charliekelland7564 Nah, that's very inefficient because each computer would have to communicate with the others. It would be much easier to have a single onboard computer with access to a cloud network. There may be computers in certain high-friction places (knees, wrists, joints) so they can get better data for the main computer, but they won't be anywhere near a 4090 equivalent. Even so, I believe that would be a temporary measure until the data becomes more available over time.
GPU. Ok I didn't think I would have to write it, but I saw so many weird and/or incomplete answers. To answer what "The same model is responsible for deciding which learned, closed-loop behavior to run on the robot to fulfill a given command, loading particular neural network weights onto the GPU and executing a policy." means, I'll try my best (as a senior software developer).
Your answer was pretty good IMO ("There's maybe a finite pre-trained number of actions that it has, like 'pick-up' or 'pour some liquid', or 'push a button', then the AI model selects which specific thing to run and runs it.")
Just noting here a few vocabulary terms that can be interpreted differently depending on context.
For the context here:
- model: It's like an LLM, but the output is not text, it's tokens to describe movements.
- learned, closed-loop behavior: Like you said, for example they *learned* "grab that bag", and it's closed-loop because the destination alone is not enough. It needs to start the movement, get feedback on what it's doing, and adjust to make sure it's heading to the right place.
- weights: This is just describing how the models work. Like sliders of multiple things, you put more weight on something you want to do more.
- GPU: This is just what they're using to process. It doesn't really matter, they could have just said "computing" or whatever. It's faster to run a model on a GPU than on a CPU, for example.
- Policy: "Neural Network Policies for fast, dexterous manipulation", which is just a basic movement.
So if we put all of this together:
"deciding which learned, closed-loop behavior to run"
means that from the input text, they decide, for the trash example, to split it into multiple sequential behaviors, like "move bag#1 from here to there" (here and there are filled in from the vision input they have), and that would be split down into smaller behaviors, like "move hand towards bag until it's at the position over the bag", "open hand", "move hand down", etc.
Note that these are not text, I quoted it because it's our description of it, but it's just something they know, and the model chooses which ones it needs for a particular case.
then "on the robot to fulfill a given command"
is pretty self-explanatory IMO.
then "loading particular neural network weights onto the GPU"
just means that it will ... run it. Just like ChatGPT does. Or any model that you would run locally. With weights according to the command spoken above.
then "and executing a policy." is what I described above: for the trash example, they begin with "move hand towards bag until it's at the position over the bag", which (again) is not text, but something they learned already.
They just need to give as input WHERE that is.
So something like "move to x,y,z" in their internal coordinates, relative to something (probably the camera, or something like that)
Where x,y,z is computed from the visual they have of the bag.
And this is where they put the weights, to move (for example) the arm, the wrist, whatever motor needs to be moved.
And then multiple times per second, they adjust to what they see, what they sense, to make sure that it's going where it should be going (in a "closed-loop").
And when the closed-loop is done, the hand is "moved at x,y,z", then they do the next thing in the list that they made from the command (as described above).
I don't know if it was clear...
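A stripped-down sketch of that decomposition, with a proportional "controller" standing in for the learned policy (real policies would be neural networks, and the coordinates here are invented):

```python
# A high-level command becomes a list of behaviors; each behavior runs as its
# own closed loop until its target is reached.
import numpy as np

def run_behavior(target_xyz: np.ndarray, hand_xyz: np.ndarray, gain: float = 0.2) -> np.ndarray:
    # Closed loop: sense where the hand is, nudge it toward the target, repeat.
    while np.linalg.norm(target_xyz - hand_xyz) > 0.01:
        hand_xyz = hand_xyz + gain * (target_xyz - hand_xyz)
    return hand_xyz

def pick_up_bag(bag_xyz: np.ndarray, basket_xyz: np.ndarray) -> None:
    hand = np.zeros(3)
    plan = [("move hand over bag", bag_xyz + np.array([0.0, 0.0, 0.1])),
            ("lower hand", bag_xyz),
            ("move bag to basket", basket_xyz)]
    for name, target in plan:
        hand = run_behavior(target, hand)
        print(f"{name}: hand at {np.round(hand, 3)}")

pick_up_bag(bag_xyz=np.array([0.4, 0.1, 0.0]), basket_xyz=np.array([0.6, -0.3, 0.2]))
```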
"to fulfill a given command" ... That command is likely "parameterized". And, those "parameters" are mapped to the elements and entities perceived by AI models (e.g. GPT4 Vision).
Even though there are a finite number of these commands and pre-trained NNs, combinations of these commands can attain very rich and versatile results.
I feel like I'm watching a sci fi movie.
Disney had this stuff since the late 60's
I feel like we are inside of a sci-fi movie.
@@jhunt5578 Yep, or a George Orwell/Philip K Dick novel!
@@jhunt5578 "You Can Definitely Say That Again" And I Just Found Out That Katy Perry Song "Harley's In Hawaii" Is Literally About ai 🤯🤔
Terminator: Sarah Connor Chronicles?
I don't know why, but I'm really impressed with how well it transfers an item from one hand to the other, and will even do that to move an item from one side of its body to the other side.
Yeah, robotics made massive strides back in the 2010s and we've basically almost perfected it. Though their robot is still a bit behind the other companies that have been around longer. For example, their robot still walks a bit slowly and can't run yet
I noticed also how it seemed to drop the apple into his hand, as opposed to merely placing it there. Hadn't expected that.
@@kimchiman1000 yeah, that took me by surprise too. Felt very natural
@@kimchiman1000 that's the sus part; the AI obviously has to take a second to recognize the visual data, so how did it do that? In fact, it looks like the guy just caught the apple, and the AI just randomly dropped it
This new technology is coming at humanity at an alarming rate.
As long as it doesn't leave at an alarming rate I am ok with it lol
We are most likely fucked. I’ve largely accepted it.
@@therainman7777 Why though? Just accept it and find a personal purpose in life. There will be no jobs. There will be no scarcity. There will be fun. Okay, sure, or we all die, but that's fine, everyone dies, so there's no FOMO. Our thoughts will forever live inside our robot overlords.
You read that section correctly, Wes. The large multimodal model is like a central “brain” that processes the input scene (verbal instructions from the human, its video feed, etc.), and uses the generalized intelligence of that model to determine what type of action(s) need to be performed. Once that is determined, a pre-trained set of neural network weights is loaded, presumably from some sort of action library that they have trained in advance, and computation through this neural network is performed on the GPU (where neural network computation is usually performed, due to massive parallelization which produces huge efficiency gains). For example, they may have trained a neural network specifically on the action “cut apple.” If the human asks it to cut up an apple, the large model determines that that’s the action that needs to be performed, loads the weights for that action onto the GPU, and then the computations needed for that action are carried out, the output of which are commands that get sent to the robot’s actuators.
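If it helps, here's a minimal PyTorch sketch of what "loading a pre-trained set of weights onto the GPU" could look like in practice. The ActionPolicy architecture, checkpoint paths, and skill names are all invented for the example; this is an assumption about the general pattern, not their actual code.

```python
# Illustrative sketch only: a plausible way to load weights for a selected
# action from a library of pre-trained skills onto the GPU.
import torch
import torch.nn as nn

class ActionPolicy(nn.Module):
    """Small policy network: observation features -> action keyframe."""
    def __init__(self, obs_dim: int = 512, act_dim: int = 24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Hypothetical action library: one checkpoint (a saved state_dict) per skill.
ACTION_LIBRARY = {
    "pick_up_object":   "checkpoints/pick_up.pt",
    "place_object":     "checkpoints/place.pt",
    "hand_over_object": "checkpoints/hand_over.pt",
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_action(name: str) -> ActionPolicy:
    """Swap the requested skill's weights onto the GPU."""
    policy = ActionPolicy()
    policy.load_state_dict(torch.load(ACTION_LIBRARY[name], map_location=device))
    return policy.to(device).eval()

# The large multimodal model decides *which* skill to run, e.g. "hand_over_object";
# the loaded policy then maps observations to actuator commands many times a second.
# policy = load_action("hand_over_object")
# action = policy(obs_features)  # obs_features: (1, 512) tensor from the vision stack
```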
Good 😊
The vocal segregates of "uh" and the brief stutter it uses when talking are an interesting choice to humanize the interaction.
I wonder how much of that is trained/learned, or artificially programmed?
Interesting question, I think that the answer lies in whether or not the voice model was based on a real person's voice, like Siri, or if it was completely made from scratch.
Yes.
@@sunsu1049 It may depend on how it was trained. If the benchmark was that the voice should sound "human," it seems reasonable to assume the error would be lowered by inserting these "uh..."s and "ah..."s.
I think all answers were textual and someone just read them out loud. "They wouldn't lie to us"... hehe.. remember the Gemini video?
That's how it learned, it's not programmed to do that
Lol holy shit. I’m most impressed with how fluidly the servos moved the damn robot's arms around. It looked so uncanny to me that for moments it almost looked fake. Pretty mind-blowing stuff. Super exciting.
almost?
"closed loop" meaning the individual movements and postures. you can kinda see it, like how it moves the hand in position, opens all fingers, closes all fingers slowly, but only uses the index and thumb to grip the apple, then the arm swing, the rotation of the wrist, and the drop. it looks fluid altogether but upon closer inspection is a series of individual movements. not all that different from how we move, but a bit over flourished in order to appear natural.
0:41 "cups and a plate" with 3 plates and 1 cup visible on the drying rack
I hope it doesn't hallucinate any movements
I love these new robots and all AI models. They're amazing! Thanks for sharing their amazing achievements and development. 😅
I'd be proud to have these as ancestors, that's for sure. ❤️❤️❤️
New type of AI generated art: giving this robot a paint brush
Mr. Roth,
Enjoying these videos.
The word exciting, doesn't quite describe it.
Thank you.
I was surprised when it 'dropped' the apple into his hand; it did not place it in his hand.
"Sir Isaac Newton once observed and Dr. Albert Einstein explained that the gravity of planet Earth is bending the space-time around it so that a massive object such as this apple, while maintaining uniform straight motion, still follows the bent geodesics and to observers such as me and you it looks as if the apple is falling down. So I decided to recontextualize this observation to invent a mode of transportation of apples from my hand to your hand, provided that my hand containing the apple is higher up in the gravity field of planet Earth than your hand, which is the case at the moment since I had strategically placed my hand with the apple above your hand, which was without an apple just a moment ago but holds the apple right now."
It's getting a lot smoother. More fluidity in the movements. Pivot points. Motor skills. 😮🤯❤️🔥😎
Smoother than C-3PO 😁 should have used Elevenlabs to clone Anthony Daniels. Missed opportunity. Or Darth Vader for the laugh.
The voice reminded me of RFK Jr, for some unknown reason.
Thanks, man! Your updates are super helpful, smart, and well-produced.
This might sound weird, but I felt compassion and pity for the robot while watching the video.
That’s how they win.
Only half joking. AI LLMs are being trained to be pleasant to interact with. Add a body that can gesture in ways that manipulate our mammalian emotions, and we will be putty in their hands.
I totally agree. Wanna give him a hug lol
Yeah peeps like you will be the trouble makers in the future. You'll start arguing for AI and robot rights because you've got nothing better to do.
That is so much better than I expected. And I mean by miles. Speechless. It's here.
Ok so today was the first day a real robot existed
I have that feeling too.
From ChatGPT4: ( GPU )
~~~~~
In the context of the robotic AI system you described, "closed-loop behavior" refers to the robot's ability to execute tasks or actions based on feedback from its environment, continuously adjusting its actions based on this feedback to achieve a desired outcome. This process forms a loop: the robot performs an action, receives new data from its environment through sensors (like cameras and microphones), processes this data to evaluate the outcome of its action, and then decides on the next action based on this evaluation.
The "closed-loop" aspect emphasizes the continuous, self-regulating nature of this process, where the output or response of the system directly influences its next action, creating a feedback loop. This is in contrast to "open-loop" systems, where actions are not adjusted based on feedback or outcomes but are pre-defined and follow a set sequence without real-time adaptation. Closed-loop behaviors in robotics allow for more adaptive, responsive, and intelligent interactions with dynamic environments.
~~~~~
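A tiny self-contained Python toy of that feedback loop, for anyone who wants to see it rather than read it. The gains, tolerances, and the "robot" (just a simulated point) are arbitrary illustration values, not anything from Figure's system.

```python
# Minimal closed-loop sketch: drive a simulated hand toward a target by
# repeatedly measuring the remaining error and correcting for it.

def move_hand_closed_loop(target, position, kp=0.2, tol=1e-3, max_steps=500):
    """Proportional feedback: each step's command depends on the measured error."""
    for step in range(max_steps):
        error = [t - p for t, p in zip(target, position)]
        if max(abs(e) for e in error) < tol:
            return position, step              # converged: the loop is "done"
        # The command is proportional to the error; on a real robot this would be
        # sent to actuators, and the new position would come back from sensors.
        position = [p + kp * e for p, e in zip(position, error)]
    return position, max_steps

if __name__ == "__main__":
    final_pos, steps = move_hand_closed_loop(target=[0.4, -0.1, 0.2],
                                             position=[0.0, 0.0, 0.5])
    print(f"reached {final_pos} in {steps} steps")
    # An open-loop controller would instead play back a fixed trajectory and
    # could not recover if the hand were bumped mid-motion.
```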
We will all be on UBI by next year at this rate.
Pray. Or we are starving
Here in Canada, we had a recent test run of a UBI style system during COVID-19 with a cheque every two weeks for CAD$1000.00 for the duration of the pandemic.
No. 3 years on the good timeline.
4 years and a revolution on the bad timeline.
5 years and bad stuff on the worst timeline I'll acknowledge.
7 years and then nobody is around anymore in the timeline we don't talk about 👍
The promise of UBI is more valuable to the masters than its realization.
@@focusedeye I wouldn't call it UBI since only those without a job received it. Anyway, still a great initiative from all the countries that did it.
The dexterity is really impressive. Even if it was a remote avatar, the movement is really good.
It's game over, man. Game Over!
GG 😁
It's interesting to see the speed and fluidity of the robot as that's not often seen at the moment.
It also doesn't seem to have a jerking motion with movement which is good.
I would say the response time needs to improve; we live in a busy world, so people won't have the patience for a delay in the reply, plus it doesn't feel as natural.
The difficulty I see there, though, is that it will need to know when it is ok to have its turn speaking as opposed to continuing to listen as the person may have other things to say.
I'm sure such things will be improved, but the inflexions in the speech feel natural.
This delay issue has been tackled by individuals before, to partial success. I am excited to see what actual big companies will do about it. And you've guessed it, the biggest issue is knowing when the human stops speaking. We do it by picking up subtle cues like intonation and sentence structure. Considering that Figure 01 is implied to use speech-to-text, intonation is probably lost. From what I've seen, the best solutions for now are a minimal delay plus the ability to abort speech if they are interrupted. Those have their own issues; for example, the AI might stop talking mid-sentence if someone coughs or speaks to someone else in the immediate vicinity.
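Here's a toy sketch of those two workarounds: silence-based end-of-utterance detection, plus barge-in (abort the reply if the human starts speaking again). The thresholds and frame sizes are made-up illustration values, and the false-positive risk from a cough is exactly the problem noted above.

```python
# Toy endpointer and barge-in logic, operating on fake per-frame audio energies.
SILENCE_THRESHOLD = 0.02   # below this energy a frame counts as silence
END_OF_SPEECH_FRAMES = 15  # ~0.3 s of silence at 20 ms frames

def detect_end_of_utterance(frame_energies):
    """Return the frame index where the speaker is judged to have finished."""
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        silent_run = silent_run + 1 if energy < SILENCE_THRESHOLD else 0
        if silent_run >= END_OF_SPEECH_FRAMES:
            return i - END_OF_SPEECH_FRAMES + 1
    return None  # still talking

def speak_with_barge_in(reply_chunks, mic_energy_stream):
    """Yield reply chunks, stopping early if the human starts speaking again."""
    for chunk, mic_energy in zip(reply_chunks, mic_energy_stream):
        if mic_energy > SILENCE_THRESHOLD:
            break  # interrupted: stop mid-sentence (a cough triggers this too)
        yield chunk

print(detect_end_of_utterance([0.5, 0.4, 0.3] + [0.0] * 20))  # -> 3
```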
"...The apple found its new owner." (...when he leaves the table...) "The trash is gone." LOL
yea, game over, we are done
Maybe we could build a fire. Sing a couple of songs, huh? Why don't we try that?
Combat robots within 2 years.
Always find your videos interesting. Sometimes I wonder if I’m the only one who is struggling to keep up with the speed of advancement. I could spend all day every day learning about this tech and I would still feel behind the curve. I work in tech, I’m 51, spent my life being ahead of the curve until now…. I guess my time being the expert in the room is sunsetting, I feel the next gen team will have to work harder and smarter than my gen did. Not sure how I feel about this, not sure how I feel about it actually matters anymore.
This is the most wonderfully self aware thing I've read online for a long while. Thank you. Also I agree, thankfully we now have AI and soon brain chips and robots to help us think and work harder and smarter.
@@hellblazerjj thanks, you put a smile on my face. What a world we live in, perhaps being a passenger this time around will be more fun than flying the plane.
I'm on the low end of the tech knowledge curve and watch these types of videos to help me keep up with the real world changes.
It seems humans will soon be outpaced by AI machines, and transhuman evolution will be the only way to keep up with our creations. Of course the wealthiest will be the first to receive such augmentation. God help us!
Explanation of the GPU model text at 12:30, as far as I understand it... Assume pre-training of certain actions for the robot that would be useful. The compute model that processes the stored memory, speech and images also decides a response. The same model is used to pick (from interpreting its own response) which of its learned actions to use, combined with the object info specific to the scene, to figure out the movements, speed and other physical forces for proper interaction with the environment. Conceptually, as bystanders, we see no difference in this part of the functionality; both Figure 01 and RT-2 can watch-to-learn skills. However, it's a matter of style: applying closed-loop behavior weights to the GPU before the 'policy' (the robotic movement code) is like conforming the learned skill to the task at hand. This may do something like cause a pick-up or drop action to use more finesse for a delicate item like a tomato, or a deposit action to give the paper trash a small toss into the basket if it's not too far but also not within reach.
Let me know if I'm wrong or missed something.
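If I'm reading you right, the "conforming the skill to the task" part could look something like this toy parameterization (all objects and numbers invented, purely to show the shape of the idea):

```python
# Same pick-up routine, different parameters per object class.
OBJECT_PROFILES = {
    "tomato":      {"max_grip_force": 2.0, "approach_speed": 0.05},  # delicate: gentle grip
    "apple":       {"max_grip_force": 6.0, "approach_speed": 0.10},
    "paper_trash": {"max_grip_force": 1.0, "approach_speed": 0.20},  # light: quick approach
}

def pick_up(object_name: str) -> None:
    profile = OBJECT_PROFILES.get(object_name,
                                  {"max_grip_force": 4.0, "approach_speed": 0.08})
    print(f"picking up {object_name}: "
          f"grip <= {profile['max_grip_force']} N, "
          f"approach at {profile['approach_speed']} m/s")

pick_up("tomato")       # more finesse
pick_up("paper_trash")  # light grip, faster motion
```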
Wow 😮 finally getting there! Can’t wait to see these massive improvements over the next few months, as we knew would happen since everything is moving 10x faster than in 2023! Next year AGI will become self-aware. Exciting, and scary if it falls into the wrong hands.
It's already in the wrong hands.
Yeah, I'm excited for losing my job!
This is so awesome. I've been hoping for robots that can help around the house for a long time. In my experience if you can't do everything yourself it can suck to have to ask for help from humans for a bunch of reasons, but a robot would be always there. And they don't just do things around the house, you can also have a conversation with them! I love it!
How are you going to pay the mortgage? The great replacement is a possibility now.
What do you mean?@@tearlelee34
The thing about BD is that, for example, Atlas has the craziest articulation, agility, etc. What it can do is only possible because of its actuators. Imagine when they load one of these models onto it! Or the dog... I mean, we are talking next level.
It probably has already happened internally and they just did not present it yet.
The line, "This is the worst it will ever be..." Keeps ringing in my ears. Starting to really see the outlines of the near future here.
This reminded me of the cool bartender robot in the movie passengers
Sameee!!! It's so similar
*Summary*
The demonstration highlights several impressive capabilities of this robot:
- It can perceive and describe its surroundings through vision and natural language understanding models.
- It can engage in full conversations, comprehending context and prompts like "Can I have something to eat?"
- It uses common sense reasoning to plan actions, like providing an apple since it's the only edible item available.
- It has short-term memory to follow up on commands like "Can you put them there?" by recalling previous context.
- The robot's movements are driven by neural network policies that map camera pixels directly to dexterous manipulations like grasping deformable objects.
- A whole-body controller allows stable dynamics like maintaining balance during actions.
The key innovation is combining OpenAI's advanced AI models for vision, language, and reasoning with Figure AI's expertise in robotics hardware and controllers.
Figure AI is actively recruiting to further scale up this promising approach to general, multi-modal robotics through leveraging large AI models.
Companies and researchers effectively combining the cutting-edge in large language models with advanced robotic hardware/controls are emerging as leaders in pushing embodied AI capabilities forward rapidly.
There is a sense of optimism that general, multi-purpose robots displaying intelligent behavior are now within closer reach through neural network approaches rather than classic programming paradigms.
LOL so quick. Nice, this year I think they will walk :) Next year will be fun.
I love that you press play on a Wes Roth vid and it's content.
Looking forward to 99.9% of humans adjusting to reservation life.
It's been so good for Native Americans.
I think what they mean by "all neural network" is like how a hand isn't the brain, but it's connected, so it's part of the neural network without the data or programming being in it specifically. Like how the brain doesn't just hold the info but is the info, in relation to the body and its differences.
AI won't destroy us; we'll destroy ourselves with AI.
There is always a golden era before massive destruction
I think we should be investing in personal EMP devices, if these things are going to be running around.
Most of the delay is while it is waiting to see if the human is done speaking. The same thing happens on my phone when I talk to chat gpt
To be a bit more accurate, I don't think it's quite "to see if the human is done speaking"; I think that's the delay from the time it takes to actually transcribe the speech into text first. There's a framework built on top of Whisper that allows you to constantly stream the transcription live (instead of waiting for the entire audio chunk first), and even that has a 3-4 second delay between an audio chunk being streamed and the transcription of it being done.
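For anyone curious where that delay comes from, here's a minimal sketch using the open-source `whisper` package: each audio chunk has to be fully captured and then pushed through the model before any text exists, so even "streaming" setups inherit a per-chunk lag. The chunk files and the loop are my own illustration, not the specific framework mentioned above.

```python
# Sketch of per-chunk transcription latency with the open-source whisper package.
import time
import whisper

model = whisper.load_model("base")

def transcribe_chunks(chunk_paths):
    """Transcribe a list of short audio files as if they were a live stream."""
    for path in chunk_paths:
        start = time.time()
        result = model.transcribe(path)   # blocking: text only exists after this returns
        latency = time.time() - start
        print(f"[{latency:.1f}s later] {result['text']}")

# transcribe_chunks(["chunk_00.wav", "chunk_01.wav"])  # file names are hypothetical
```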
We're so much closer to a Skynet event now...soooo excited....yaaaay...
oh well :) it was nice being alive while it lasted
My guess is that they have something like a Jetson Orin onboard, which is able to dynamically select from what is effectively a finetuned RT-1 for the behavior library. That GPU would be low-power and capable of running the smaller RT-1 model at the rate they claimed (the RT-2 model would be too large). RT-1 showed that if the tasks are simple enough, you don't need a particularly large model (you could even use a LoRA-type approach on the transformer stem). Alternatively, it could be a larger library running on an external server-grade GPU (notice that the robot is still tethered, though the update rate would likely work over wireless too). The motion planning from the VLA model would then be similar to how the RT platforms do it, using a real-time kinematics control model. DeepMind already figured out a working solution, so why re-invent the wheel when you could simply stick GPT-4V on top of it?
Why would an AI model imitate human speech hesitations ?
To make it feel more human.
Because we're more likely to buy one for our household if it acts familiar, like a human. I think it's about making the public accept them into society
It sounded to me like a human was reading out text responses. The voice didn't sound like the clean output of text-to-speech software
I found that to be surprising, and then immediately concerning. Was that a real glitch?
probably was trained on natural speech patterns. like how midjourney imitates stray brush strokes
Fascinating and terrifying in almost equal measure.
This is probably the closest thing to AGI we've seen so far
I contend that the compute necessary to run an AGI robot cannot be produced within the small space that is available within a humanoid robot. This means that without an umbilical cord (which this one has) the controlling computer must communicate via a very fast bidirectional line, 5G comes to mind. The data rate for live video and more will be very high. The cost for robot plus data transmission plus compute in a supercomputer will perhaps be prohibitive.
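The bandwidth part of that is easy to sanity-check with back-of-the-envelope numbers. Assumptions (mine, not from the video): a single 1280x720 RGB camera at 10 fps and roughly 50:1 video compression.

```python
# Rough video bandwidth estimate under the stated assumptions.
width, height, channels, fps = 1280, 720, 3, 10

raw_bytes_per_sec = width * height * channels * fps
print(f"raw:        {raw_bytes_per_sec / 1e6:.1f} MB/s")       # ~27.6 MB/s uncompressed

compressed = raw_bytes_per_sec / 50
print(f"compressed: {compressed * 8 / 1e6:.1f} Mbit/s")        # ~4.4 Mbit/s
```

So compressed video alone would be only a few Mbit/s, which 5G can carry; latency, reliability, and the cost of the remote compute are the harder parts of the argument.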
That GPU thing you were looking at appears to indicate that the physical behaviour is pre-programmed/'learned' but that the ai model is choosing from a list of actions.
i.e. it has been trained to stack dishes, hand over apples, and pick up rubbish. The AI on the inside, though, is deciding independently whether to push the 'pick up rubbish' button or the 'hand over apple' button depending on context.
Not quite a truck rolling downhill but certainly less impressive than my initial reaction watching that robot video.
It still demonstrates that we can have robots NOW that can be trained to do specific tasks and can replace humans in a lot of jobs if they're trained for that.
@@LaurentCassaro oh yeah, it's pretty cool regardless. But there was a section in the video where he specifically asked about that.
Probably not a long shot from "I think I did pretty well; the apple found its new owner; the trash was removed " to "I think I did pretty well; the Earth found its new owners; the humans were removed."
Idk why, but the robot talking reminded me of Sam Altman.
FINALLY! The real thing, ready to fit in any smart rich person's house AND get real work done! No more investor dream-ons. No more 'reasonably working' prototypes and lab models. This is it: the starting gun for the robot world of the tomorrow-upon-us-now.
Time to start building handheld emps. Just in case…
Jk ofc 😉
That's actually not funny, and a good idea, except that the ones you'll need them for will be hardened against such measures.
Ingenious. I'll be your first customer.
Funny you should mention it; I have a design for one that basically kills all electronics in a room... the problem is the budget @@marcariotto1709
The model that runs on the GPU selects the inputs/outputs for the joints based on some vector representation. That representation is mapped from the unpacked OpenAI prompt return, and the subsequent policy activation translates it into dynamic parameter inputs via a trained approximation model, which is basically the Figure 01 side.
The rise of philosophical zombies is coming, and it is terrifying.
It would be wise to make consumer models strength-limited, possibly with plastic parts that break before too much force is applied. But once they make themselves, it's over haha.
We will be the zombies now. Millions and millions of hungry zombies.
@@karpabla if that Disease X they keep talking about is some aerosolized rabies, then ya, you're 100% right 😂
I'm Stunned and shocked
This one actually shocked me.
This is really incredible.
1:09 “So I gave you the apple, because it’s the only, uh, edible item I could provide you with from the table.”
It used “, uh,” in its speech, and contractions. Wow, even Lt Cmdr Data couldn’t do that!
2:08 “I.. I think I did pretty well”
And it can stutter when flustered! And it has a vocal fry! Amazing!
It's probably using elevenlabs, you can get that effect pretty easily on there
Contractions aren't anything new for LLMs. And while not ubiquitous, speech disfluencies (um, uh, stuttering, etc.) are present in other previous TTS models as well. (Not so much in any ElevenLabs model as far as I've heard, though.) So while this is definitely a very high quality TTS model, those features that make it sound real aren't anything new.
"What is my purpose?" "You pass butter" *slumps over in sadness*
Man this is sooo cool, it's not coming fast enough, tho, lets accelerate!!!!
I find it exciting, I am not a programmer but I have so many ideas i cannot wait for this to work!🎉❤😊
Have we finally found the fabled SHOCKING news?
I love how Figure's background music is so similar to the music in Ex Machina.
anyone want to take bets on how many times 'AGI' will be invented for the first time this year?
It’s already here. It just has the intelligence of an 8-year-old boy, and it's only increasing. The only reason we keep "re-inventing" AGI is that we keep moving the goalposts every time we see it.
366
@@Souleater7777 If there is a machine with the intelligence of an 8-year-old, I've not seen it. I doubt we have anything as smart as a rat, but I tend to think of intelligence as the ability to learn and adapt. So far most of these machines have to be pre-trained, so they don't learn afterwards; it's just a method of programming them using ML instead of hand-crafted code. AGI is defined as having the ability to generalize to any task, and so far everything we have is narrow. Sure, ChatGPT can generate any kind of text, but it's still just doing the one thing: generating text. If we do currently have proto-AGI, I think it's one of the video-game-playing programs, but I wouldn't say they are as smart as a human child just yet. I do think, though, that when we have something as smart as a dog, we'll have something truly useful in lots of ways, hence the generalization.
@@Souleater7777 Nothing yet has the ability to learn new tasks; that's what the "General" in AGI stands for. I would say the smartest things we have right now aren't quite as smart as a rat in that regard. But when we have something as smart as a dog, then we'll have something that can be useful in lots of ways, and is generalizable. No need to move goalposts until we reach that first one. I suspect it will be one of the programs that plays video games, or something that comes from that, especially if one of those agents is adapted to use robots.
@@jameshughes3014 We'll see about that.
Naysayers take heed, the time is coming, and is already here.
Meet back here in 1 year.
That is going to produce a ton of data for future LLMs. This could be a Wright brothers moment in history if they are for real.
That robot which cooks was not teleoperated; an operator taught it, and after that it generalized that knowledge. Quite a practical scenario. But learning from videos would be even better.
Why did the AI stutter?! At 2:15 it stutters on its words.
It sounded to me like someone was reading out text responses it gave
Exactly what I thought. Weird
@@13371138 It sure did, but it feels silly to suggest; there are so many reasons why it would be dumb as f to cheat on this. But yeah, I don't want to put on my foil hat, though it is a point of skepticism. I also didn't like the final reply where its voice went up as it said its goodbye; that's something humans do, but I figure trained voices can't balance that waveform yet.
I prompt DALL-E 3 and I've seen improvement over the last year. Figure 01 is the real deal for AI.
The only thing that makes me question whether this is real is how it just dropped the apple into his hand, as if he didn't expect it to hand it to him... Sure, the voice could be synthetic, but after that Google video where they got in trouble for faking stuff, I have a hard time believing this.
…why would it be faked though? Figure01 may have been instructed explicitly to drop the apple into his hand rather than place it for the flair it adds to the video
Google is Google; OpenAI is a different company.
If Google thinks its language models are too dangerous for the public,
yet is faking half the stuff it shows, you can be skeptical of them.
Currently OpenAI has the benefit of the doubt, until they mess up.
What do you mean why would it be faked? What a ridiculous statement... For the same reason any company would fake something, fake it till you make it... get that funding & hype. Google did the same thing@@eIicit
Why go to the trouble of faking this video?
FYI Figure is not Google
GPU. It basically means it takes things in through both text and image, so it interprets based on both sight and concept using AI. It stores files locally in the robot's memory called weights; a weight is like a skill. It then loads the correct weight, or skill, to run in order to complete a policy, i.e. a series of actions like the rotation and bending of arms to move in three dimensions. Those weights are loaded onto the GPU and run, just like loading anything in a traditional RAM/GPU memory setup. The AI keeps track of which files have become faulty and maintains its actions, sort of like maintaining versions of software, and runs the correct program for the correct action. In layman's terms, it's a computer in a robot layered with AI doing several different things to make it all work seamlessly. I’ve been waiting for them to do this. Been thinking of starting my own project.
Why did the bot briefly stutter and insert "er"? The large language model underpinning it shouldn't generate the imperfections of human speakers.
It makes it sound more human. I like it. Same with the cute "the apple found its new owner" line. You want your robot overlords to have a sense of humour don't you?
It is already decaying mentally haha
When Google demoed their whatever-it-was-called that makes phone calls and reservations on your behalf, it did the same. Conversational voice synthesis AI is trained to imitate human imperfections to sound natural.
@@hellblazerjj a steel robot doing this mimicry is deeply in uncanny valley territory
I’m just imagining all the hal 9000 scenes from the Space Odyssey done with the same voice:
“I’m sorry, uh, Dave! I, uhm, am afraid I can’t do that!”
He wouldn’t be such a charmer with these stutters and uhm’s!
AGI is finally here guys!
no. no it isn't.
You got fooled and I hope they enjoy fooling you.
This is not AGI
Nah it's ChatGPT picking from a set of pretrained actions
@@karlwest437 I’m sure your typing is "pre-trained," unless you just learned how to do it right now.
On the GPU sentence: I think "the same model" works like the Minecraft Voyager paper's architecture, where the model dissects its actions into a sequence of subtasks. Each simple task is written in code, and if it works, it gets saved into the skill library and is simply reused. The neural network policies are presumably trained in a kind of reinforcement learning environment like OpenAI's Gym. Have a look at the Python module stable-baselines3 and you'll see what the network policies are about. But I'm also just guessing from experience.
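To make that stable-baselines3 pointer concrete, here's about the smallest possible example of training a "policy" with it (CartPole instead of a robot, obviously; this just shows what a policy is in that library, not how Figure trains theirs):

```python
# Minimal stable-baselines3 example: a policy maps observations to actions.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)   # "MlpPolicy" is the neural network policy
model.learn(total_timesteps=10_000)        # reinforcement learning loop

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)  # policy: observation -> action
print("chosen action:", action)
```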
This is GPT-5
How would you know?
@@ossianravn I think he is guessing based on the abilities of the AI; especially since they are partnered with OpenAI, it wouldn't be out of the realm of possibility that they were given early access to GPT-5.
@@ossianravn nobody knows anything, it's just an educated guess.
Can't wait to have this guy washing my dishes 😄
Seems at least partly scripted to me. When the guy says 'What do you see?', the robot knows to describe only what's on the table, not the entire room. Also, the guy says 'Pick up the trash' and the robot knows to pick up the trash AND put it in the basket. The presentation seems a little off to me.
This is very fascinating! While it's cool to see what it can do already, it would be even better to see what the robot can't do yet or which tasks it still fails at. Nonetheless, thanks for sharing!
Hey figure 01, could you do my taxes?
What we see is of course a demo. What I've seen with these AIs that have a semblance of memory is that they go crazy. It depends on the data they were trained on, but they are almost guaranteed to produce seizure-speak every once in a while. Imagine if Figure 01 prompts its body to go into a seizure. They seem to have filters in place, but those could fail.
When it's describing what it sees, it gets it wrong, it says "a drying rack with cups and a plate", when there's actually plates and a cup 😂
To err is robot
There are actually 2 cups in the scene; the robot saw it better than you 🤣
Now seriously, the cup inside is included in what the vision model is labeling; nobody told the robot "don't include the objects inside the drying rack in the description."
@@rootor1 well going by that logic it should have said, "a drying rack, an apple, some cups and some plates" 😝
In the thumbnails for suggestions shown after the video ends are a couple of _The Late Show_ with Stephen Colbert.
Makes me think, next we'll be seeing a robot like this as a guest on a late-night talk show.
Then, it will be the guest host of a late-night talk show. Oh, imagine it interviewing (and bantering with) Neil Degrasse Tyson?
Scary
GPU closed loop... does that mean that if Corey were to move his hand whilst the robot passed him the apple, the apple would fall to the table, as the planning/action/feedback loop is simply not fast enough to pick the correct action to place the apple in a suddenly moving outstretched palm?
Sounds like the robot has a set of learned, visual/motor behaviors stored in the form of parameter weights. These only function when loaded into a neural network, which is to say the physical substrate of the GPU. Which of these behaviors gets loaded is decided in the same way a chatbot may decide to call on some tool, like a browser or calculator, in responding to the chatbot user. (My best guess)
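A toy version of that tool-call analogy, just to show the shape of the selection step; all skill names and checkpoint paths are invented, and the actual loading onto the GPU would happen after this lookup.

```python
# The high-level model emits a skill name (like a chatbot emitting a tool call);
# a thin dispatch layer maps that name to the stored behavior weights.
SKILL_CHECKPOINTS = {
    "pick_up_object":   "weights/pick_up.pt",
    "hand_over_object": "weights/hand_over.pt",
    "place_object":     "weights/place.pt",
}

def dispatch(skill_call: dict) -> str:
    """skill_call mimics a tool call, e.g. {"name": "hand_over_object", "args": {...}}."""
    name = skill_call["name"]
    if name not in SKILL_CHECKPOINTS:
        raise ValueError(f"unknown skill: {name}")
    return SKILL_CHECKPOINTS[name]  # these weights would then be loaded onto the GPU

print(dispatch({"name": "hand_over_object", "args": {"target": "human_hand"}}))
```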