Robotics Transformer w/ Visual-LLM explained: RT-2

  • Published: 20 Oct 2024

Comments • 13

  • @bomxacalaka2033 • a year ago • +3

    the next military drone patch looking sick

  • @rishiktiwari • a year ago

    Really good summarisation of RT. I am a student researcher currently working on VLAs and a robotic arm; it's quite tedious tbh. Btw, you missed that RT also uses SayCan for grounding, chunking a high-level command into small action sentences (see the sketch below).
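
    For context on the SayCan point: in the SayCan approach, candidate low-level skills are ranked by combining the language model's score for how useful a skill is as the next step with a learned affordance (value) estimate of whether that skill can succeed in the current scene. The sketch below illustrates only that scoring idea; `llm_score` and `affordance_score` are hypothetical placeholders, not Google's implementation.

    ```python
    # Minimal sketch of SayCan-style grounding (hypothetical helpers, not a real API).
    from typing import Callable, List

    def saycan_step(instruction: str,
                    history: List[str],
                    skills: List[str],
                    llm_score: Callable[[str, List[str], str], float],
                    affordance_score: Callable[[str], float]) -> str:
        """Pick the next low-level skill for a high-level instruction.

        llm_score(instruction, history, skill) -> usefulness of skill as the next step
        affordance_score(skill)                -> chance the skill succeeds right now
        """
        best_skill, best = skills[0], float("-inf")
        for skill in skills:
            combined = llm_score(instruction, history, skill) * affordance_score(skill)
            if combined > best:
                best_skill, best = skill, combined
        return best_skill

    # Usage: call repeatedly, appending the chosen skill to history,
    # until the selected skill is a terminal one such as "done".
    ```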

    • @aryangod2003 • 5 months ago

      SayCan, haha, ancient (meaning nearly 2 years old!). But it was one of the first models that used LLMs in robotics, aside from just language-conditioned reinforcement learning or behavior cloning.

    • @rishiktiwari • 5 months ago

      @aryangod2003 Yeah, I know.

    • @loserc1854 • 4 months ago • +2

      @rishiktiwari If I wanna start learning LLM + robotics, what's the recommended learning path?

    • @rishiktiwari • 4 months ago

      @loserc1854 The current state of general embodied AI is very open and experimental. We all know that transformers alone won't solve robotics problems. LLMs do not play a big enough role in robotic control; they are mostly used for natural-language interfacing. Multi-modal models (VLMs, VLAs) are commonly attempted for closed-loop control, but most current implementations are hard-coded rather than fuzzily generated by AI (exceptions: RT, VC-1).
      I would say, look into existing models and how you can combine them at the layer-output level to get something better (most present research releases are just this).
      Develop an intuition for how AI models work and how the same thing happens in the human brain. See if you can correlate something or come up with a new design.
      In AI, programming and math are just tools to implement your intuition, remember this! Being good at math is beneficial, but don't worry if the equations in a paper look scary; half of them are incomplete or BS. This is true for many papers released on arXiv or privately: they are not peer-reviewed and contain a lot of flaws or tell an incomplete story.
      Resources on the internet are scarce, especially for someone starting out. It's better to be good at reverse-engineering and asking people directly.
      I would highly recommend trying to implement an RNN, LSTM, CNN, and TCN on your own and realising their limitations, to better understand why transformers and attention are used (see the sketch after this comment).
      Remember that transformers never know the meaning of a token; they just transform embeddings.
      Also, there are three distinct groups: one that is trying to improve AI architectures (mostly academics), a second that is improving the base models (mostly researchers and engineers), and a third that is putting all these models to meaningful use.
      Sorry for the long response, but I wanted to give a broad picture.
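
      To illustrate the "implement an RNN yourself" suggestion above, here is a minimal vanilla RNN forward pass in NumPy; it is a generic sketch, not code from the video or the RT papers. The point to notice is that every hidden state depends on the previous one, so the loop cannot be parallelised over time and long-range information must survive many repeated transforms, which is the limitation attention sidesteps.

      ```python
      import numpy as np

      def rnn_forward(x_seq, W_xh, W_hh, b_h):
          """Vanilla RNN over a sequence of input vectors.

          x_seq : (T, input_dim) array of inputs.
          Each step must wait for the previous hidden state, so the loop is
          inherently sequential; self-attention instead relates all timesteps
          in a single parallel operation.
          """
          h = np.zeros(W_hh.shape[0])
          states = []
          for x_t in x_seq:
              h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # depends on previous h
              states.append(h)
          return np.stack(states)

      # Toy usage with random weights.
      rng = np.random.default_rng(0)
      T, d_in, d_h = 5, 3, 4
      out = rnn_forward(rng.normal(size=(T, d_in)),
                        rng.normal(size=(d_h, d_in)),
                        rng.normal(size=(d_h, d_h)),
                        np.zeros(d_h))
      print(out.shape)  # (5, 4)
      ```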

  • @delegatewu • 7 months ago

    Thank you, this is a great video.

  • @NeuroScientician • a year ago • +2

    Does it have a context window? Or are all actions completely independent? Can I give it follow-up instructions? What does the feedback back to the LLM look like? How does it know it worked? What about conditions, like "open the drawer and pull out all blue cubes, but if there is a yellow ball in there, take only two blue cubes"? What would happen?
    I tried something like this with GPT-3 and GPT-4. It had no physical body; it was in a turn-based game where, every turn, the state of the game would be passed to all agents. It would work fine for like 5 moves, then it fell apart.
    What are the specs for the arm they used? Is it something insanely expensive, or can I use a Lego arm? I would really give this a go. Mount it on a Roomba and let it collect socks or something :D

    • @code4AI • a year ago • +2

      Great comment. I understand that I need to explain reinforcement learning (RL) with transformers for robotics. The next video will answer your questions.

    • @bomxacalaka2033 • a year ago

      It makes sense that it would work with a Lego arm, and since it has an LLM it wouldn't really care that the arm is shorter or less accurate, or that the camera is low quality. In terms of follow-up instructions, my guess is that it would be a system similar to the agents setup: it has the LLM, memory, and tools. You give it a prompt, and that prompt is passed to the LLM along with the memory so that it always has the context of what its task is, so it knows when it is done and what the next step is (rough sketch below).
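
      A rough sketch of the agent-style loop described above: the instruction plus accumulated memory is passed to the model every turn, the returned step is executed, and the loop stops when the model says it is done. Here `call_llm` and `execute_skill` are hypothetical placeholders, not the RT-2 interface.

      ```python
      # Hypothetical agent loop: instruction + memory go to the model every turn.
      def run_task(instruction, call_llm, execute_skill, max_steps=20):
          memory = []  # past steps and their observed outcomes
          for _ in range(max_steps):
              # The model always sees the original task and everything done so far.
              step = call_llm(instruction=instruction, memory=memory)
              if step == "done":
                  break
              outcome = execute_skill(step)   # e.g. "pick up sock" -> "success"
              memory.append((step, outcome))  # feedback flows back into context
          return memory
      ```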

    • @fgh680 • a year ago

      ❤ this; thank you for making such informative videos! 😊

    • @aryangod2003 • 5 months ago

      @bomxacalaka2033 It is an iterative loop until some kind of end token is generated, I think.
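
      That matches how RT-2 is described in its paper: each low-level action is emitted as a short string of discretized tokens (a termination flag, six arm-displacement values, and a gripper value, each binned into 256 levels), and the model produces one such action string per control step until the termination token appears. The decoding below is only a simplified sketch with assumed value ranges, not the actual RT-2 tokenizer.

      ```python
      # Simplified sketch of decoding an RT-2-style discretized action string.
      # 256 bins per dimension as in the paper; the [-1, 1] and [0, 1] ranges
      # here are assumptions for illustration only.
      def decode_action(tokens):
          """tokens: 8 ints -> (terminate, [dx, dy, dz, droll, dpitch, dyaw], gripper)."""
          def unbin(t, n_bins=256, low=-1.0, high=1.0):
              return low + (high - low) * t / (n_bins - 1)
          terminate = bool(tokens[0])
          arm_deltas = [unbin(t) for t in tokens[1:7]]
          gripper = unbin(tokens[7], low=0.0, high=1.0)
          return terminate, arm_deltas, gripper

      # Control loop: query the policy once per timestep and execute the decoded
      # action until it emits terminate = 1.
      ```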

  • @TheAmazonExplorer731 • 10 months ago

    Someone please give me the link to the code for the research, as I am a PhD student from China.