Google DeepMind has announced RT-2, a new vision-language-action (VLA) artificial intelligence model for robots that enables them to respond more intelligently to what they perceive.
RT-2 is similar to RT-1, which was announced at the end of 2022. It also uses a model architecture that allows robots to learn rapidly from experience and share that knowledge with other robots. Built on the Transformer deep learning model, which uses a self-attention mechanism, RT-2 can be trained on text and images from the internet, enabling robots to learn and perform corresponding actions.
For example, if you want a robot to automatically throw empty Coke cans into a trash can, the traditional method is to first teach the robot what a Coke can is and how to determine whether a Coke can is empty. Then, the robot must be trained on how to pick up the empty Coke can and place it correctly in the trash can. However, when the robot is actually operating, it still does not know why the empty Coke can is thrown into the trash can.
Therefore, in the design of the RT-2 model, the robot is trained on a large amount of data from the internet and taught what "trash" is. This removes the tedious steps previously required to train the robot, one at a time, to identify objects, judge their condition, pick them up, and place them correctly in the trash can. Because the robot learns directly under what conditions an object counts as "trash" and what a trash can is actually for, it can learn more quickly to throw the "trash" it sees into the trash can.
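Compared with an execution efficiency of 32% under the RT-1 model, execution efficiency with RT-2 rises to 62%, nearly double. It is therefore expected that the RT-2 model can go on to train robots to understand more operational behaviors.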
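As a quick check on the "nearly double" figure, the two percentages quoted above can be compared directly. This is only an arithmetic sketch using the 32% and 62% values from the article; the variable names are illustrative and not part of any RT-2 code.

```python
# Figures quoted in the article (percent); names are illustrative only.
rt1_rate = 32  # RT-1
rt2_rate = 62  # RT-2

# Relative improvement: how many times higher the RT-2 figure is.
ratio = rt2_rate / rt1_rate

# Absolute improvement, in percentage points.
point_gain = rt2_rate - rt1_rate

print(f"RT-2 / RT-1 ratio: {ratio:.2f}")
print(f"Gain: {point_gain} percentage points")
```

The ratio comes out just under 2, which matches the article's description of an almost twofold improvement, while the absolute gain is 30 percentage points.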

