Google DeepMind's new RT-2 model enables robots to perform novel tasks
These tasks include interpreting new commands and responding to them with rudimentary reasoning, such as reasoning about object categories or high-level descriptions.
The Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control, according to Google DeepMind.
A traditional robot trained to pick up a ball may stumble when asked to pick up a cube. With RT-2's more flexible approach, a robot trained on picking up a ball can figure out how to adjust its extremities to pick up a cube or another toy it has never seen before.
“We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, like deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink),” said the DeepMind team.
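The article does not show what such a chain-of-thought output looks like in practice. Below is a minimal sketch, assuming the model is prompted to emit a short natural-language "Plan:" before an "Action:" line of discretized action tokens; the response format and the `parse_cot_response` helper are illustrative assumptions, not RT-2's actual interface.

```python
# Minimal sketch of parsing a chain-of-thought style robot response.
# Assumes the model emits a short natural-language "Plan:" followed by an
# "Action:" line of discretized action tokens; both the format and this
# helper are illustrative, not details confirmed by the article.

def parse_cot_response(response: str) -> tuple[str, str]:
    """Split a model response into its reasoning step and its action tokens."""
    plan_part, _, action_part = response.partition("Action:")
    plan = plan_part.replace("Plan:", "").strip()
    return plan, action_part.strip()


if __name__ == "__main__":
    response = (
        "Plan: the rock is heavy and hard, so it can serve as an improvised hammer. "
        "Action: 128 77 201 64 12 255 0"
    )
    plan, action_tokens = parse_cot_response(response)
    print("plan:   ", plan)
    print("action: ", action_tokens)
```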
The latest model builds upon Robotic Transformer 1 (RT-1), which was trained on multi-task demonstrations. The team performed a series of qualitative and quantitative experiments on RT-2 models across more than 6,000 robotic trials.
“Across all categories, we observed increased generalisation performance (more than 3x improvement) compared to previous baselines,” the team said.
The RT-2 model shows that, by combining VLM pre-training with robotic data, vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models that directly control a robot.
“RT-2 is not only a simple and effective modification over existing VLM models, but also shows the promise of building a general-purpose physical robot that can reason, problem solve, and interpret information for performing a diverse range of tasks in the real-world,” said Google DeepMind.
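The article does not spell out how a language model can emit robot commands. One common way to think about it, and a minimal sketch of the idea, is to discretize each continuous action dimension into bins so that an action becomes a short string of integer tokens the model can output like ordinary text. The bin count, normalized range, and 7-dimensional example action below are illustrative assumptions, not figures from the article.

```python
import numpy as np

# Sketch of the "actions as text tokens" idea behind a VLA model:
# each continuous action dimension (e.g. gripper x/y/z, rotation, open/close)
# is discretized into bins, so an action becomes a token string a language
# model can emit. NUM_BINS, LOW, and HIGH are assumed values.

NUM_BINS = 256          # assumed number of discretization bins
LOW, HIGH = -1.0, 1.0   # assumed normalized range for each action dimension


def action_to_tokens(action: np.ndarray) -> str:
    """Map a continuous action vector to a space-separated token string."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)


def tokens_to_action(tokens: str) -> np.ndarray:
    """Decode a model-emitted token string back into a continuous action."""
    bins = np.array([int(t) for t in tokens.split()], dtype=float)
    return bins / (NUM_BINS - 1) * (HIGH - LOW) + LOW


if __name__ == "__main__":
    # A hypothetical 7-DoF arm action: position delta, rotation delta, gripper.
    action = np.array([0.12, -0.4, 0.33, 0.0, 0.9, -0.05, 1.0])
    text = action_to_tokens(action)
    print("tokenized action:", text)
    print("decoded action:  ", tokens_to_action(text))
```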