Robotics Transformer w/ Visual-LLM explained: RT-2
- Published: 20 Dec 2024
- RT-2, short for Robotics Transformer 2, is a cutting-edge model that harnesses the power of a 55B-parameter Vision-Language Model (VLM) to enhance robotic control. This model represents a significant leap in the field of robotics, demonstrating how web-scale pre-training can be used to improve the generalization performance of robotic systems.
The Vision-Language Model is further fine-tuned on a robotics dataset into a VLA model: a Vision-Language-Action model for advanced robotics.
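For intuition, here is a minimal sketch (not the official RT-2 code) of the action-as-text idea behind that fine-tuning: the paper describes discretizing each action dimension into 256 bins and emitting the bin indices as tokens, so the fine-tuned VLM outputs actions the same way it outputs words. The normalized range and helper names below are illustrative assumptions.

```python
# Hedged sketch of RT-2-style action tokenization; ranges and names are assumptions.
import numpy as np

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> str:
    """Map a continuous action vector to a string of bin indices the VLM can emit."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)      # e.g. "132 114 128 25 156 78 255"

def tokens_to_action(token_str: str) -> np.ndarray:
    """Invert the mapping: decode the model's integer tokens back to a continuous action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return bins / (NUM_BINS - 1) * (HIGH - LOW) + LOW

if __name__ == "__main__":
    a = np.array([0.03, -0.10, 0.25, 0.0, 0.0, 0.1, 1.0])  # toy 7-DoF action
    s = action_to_tokens(a)
    print(s, tokens_to_action(s))
```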
#ai
#robotics
#explained
the next military drone patch looking sick
Really good summarisation of RT. I am a student researcher working on VLAs and a robotic arm currently, it’s quite tedious tbh. Btw, you missed that RT also uses SayCan for grounding and for chunking a high-level command into small action sentences.
SayCan, haha. Ancient (meaning nearly 2 years old!). But that was one of the first models that used LLMs in robotics, aside from just language-conditioned reinforcement learning or behavior cloning.
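For context on the SayCan grounding mentioned above, here is a minimal sketch of the idea (not Google's implementation): the LLM scores how useful each low-level skill is for the instruction, a learned affordance/value function scores whether that skill can succeed in the current state, and the skill with the highest combined score is executed. The skill list and both scoring functions below are placeholders.

```python
# Hedged sketch of SayCan-style grounding; skills and scores are placeholders.
SKILLS = ["pick up the sponge", "go to the table", "open the drawer", "place the sponge"]

def llm_score(instruction: str, skill: str) -> float:
    """Placeholder: how useful the LLM thinks this skill is for the instruction."""
    return 0.25  # in SayCan this comes from LLM token log-likelihoods

def affordance(state, skill: str) -> float:
    """Placeholder: learned value function estimating the skill's success in this state."""
    return 0.5

def choose_skill(instruction: str, state) -> str:
    scores = {s: llm_score(instruction, s) * affordance(state, s) for s in SKILLS}
    return max(scores, key=scores.get)

print(choose_skill("bring me a sponge", state=None))
```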
@@aryangod2003 Yeah I know.
@@rishiktiwari If I wanna start learning LLMs + robotics, what's the recommended learning path?
@@loserc1854 The current state of general embodied AI is very open and experimental. We all know that transformers won’t solve the robotics problems. LLMs do not play a big enough role in robotic control; they are mostly used for natural-language interfacing. Multi-modal models (VLMs, VLAs) are commonly attempted in closed-loop control, but most current implementations are hard-coded rather than fuzzily generated by AI (exceptions: RT, VC-1).
I would say, look into the existing models and how you can combine them at the layer-output level to get something better (most of the present research releases are just this).
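A minimal sketch of what "combining models at the layer-output level" can look like, assuming two frozen pretrained encoders whose feature vectors are concatenated and fed to a small trainable head; the dimensions and the head itself are illustrative assumptions, not any specific published model.

```python
# Hedged sketch of late fusion over encoder outputs; dimensions are arbitrary.
import torch
import torch.nn as nn

VISION_DIM, TEXT_DIM, NUM_ACTIONS = 768, 512, 7

class FusionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(VISION_DIM + TEXT_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, NUM_ACTIONS),   # e.g. regress a 7-DoF action
        )

    def forward(self, vision_feat, text_feat):
        fused = torch.cat([vision_feat, text_feat], dim=-1)  # fuse the layer outputs
        return self.mlp(fused)

# Stand-ins for the outputs of frozen pretrained encoders (e.g. a ViT and a text encoder)
vision_feat = torch.randn(1, VISION_DIM)
text_feat = torch.randn(1, TEXT_DIM)
print(FusionHead()(vision_feat, text_feat).shape)  # torch.Size([1, 7])
```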
Develop an intuition for how AI models work, and for how it happens in the human brain. See if you can correlate something or come up with a new design.
In AI, programming and math are just tools to implement your intuition, remember this! Being good at math is beneficial, but don’t worry if the equations in a paper look scary; half of them are incomplete or BS. This is true for many papers released on arXiv or privately: they are not peer-reviewed and contain a lot of flaws or tell an incomplete story.
The resources on the internet are scarce, especially for someone starting out. Better to be good at reverse-engineering and at asking people directly.
I would highly recommend trying to implement RNN, LSTM, CNN, and TCN on your own and realising their limitations, for a better understanding of why transformers and attention are used.
Remember that transformers never know the meaning of a token; they just transform embeddings.
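A toy illustration of that last point: tokens are only integer ids looked up in an embedding table, and attention just mixes those embedding vectors. Everything below (vocabulary size, dimensions, random weights) is an arbitrary toy setup.

```python
# Toy single-head attention over embeddings; all values are random stand-ins.
import numpy as np

vocab_size, d_model = 100, 16
rng = np.random.default_rng(0)

embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([5, 42, 7, 99])            # "tokens" are just integer ids
x = embedding_table[token_ids]                  # (seq_len, d_model) embeddings

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d_model)             # how strongly each token attends to the others
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V                            # new embeddings: weighted mixes of value vectors

print(output.shape)  # (4, 16) -- still just vectors, no "meaning" attached
```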
Also, there are three distinct groups: one trying to improve the AI architectures (mostly academics), a second improving the base models (mostly researchers and engineers), and a third putting all these models to meaningful use.
Sorry for the long response, but I wanted to give a broad picture.
Does it have a context window? Or are all actions completely independent? Can I give it follow-up instructions? What does the feedback back to the LLM look like? How does it know it worked? How about conditions? Like "open the drawer and pull out all blue cubes, but if there is a yellow ball in there, take only two blue cubes". What would happen?
I tried something like this with GPT-3 & 4. It had no physical body; it was in a turn-based game, and every turn the state of the game would be passed to all agents. It would work fine for like 5 moves, then it fell apart.
What are the specs of the arm they used? Is it something insanely expensive, or can I use a Lego arm? I would really give this a go. Mount it on a Roomba and let it collect socks or something :D
Great comment, I understand that I have to explain Reinforcement Learning (RL) with transformers for robotics. The next video will answer your question.
It makes sense that it would work with a Lego arm, and since it has an LLM it wouldn't really care that the arm is shorter or less accurate or that the camera is low quality. Now, in terms of follow-up instructions, my guess is that it would be some system similar to the agents system: it has the LLM, memory, and tools. So you give it a prompt, and that prompt is passed to the LLM and memory so that it always has context of what its task is, so it knows when it's done and what the next step is.
❤ this ; thank you for making such informative videos!😊
@@bomxacalaka2033 It is an iterative loop until some kind of end token is generated, I think.
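That matches the usual closed-loop picture. A minimal sketch of such a loop is below, with query_model, execute, and observe as placeholder names (this is an assumption about the general pattern, not RT-2's actual control stack):

```python
# Hedged sketch of a closed-loop "act until a terminate signal" control loop.
MAX_STEPS = 50

def query_model(instruction, observation, history):
    """Placeholder for the VLA/LLM call; returns an action string such as 'terminate'."""
    return "terminate"

def execute(action):
    """Placeholder for sending the action to the robot."""
    pass

def observe():
    """Placeholder for reading the camera / robot state."""
    return "image + proprioception"

def run_episode(instruction):
    history = []
    for _ in range(MAX_STEPS):
        obs = observe()
        action = query_model(instruction, obs, history)
        if action == "terminate":        # the "end token" case
            break
        execute(action)
        history.append((obs, action))    # feedback carried into the next step
    return history

run_episode("open the drawer and pull out the blue cubes")
```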
Thank you, this is a great video.
Someone please give me the link to the code for the research, as I am a PhD student from China.