Robotics

Deep Learning Models Enable Robotic Motion


In this collaged image, a robotic hand tries to scoop up red marbles and put them into another bowl while a researcher’s hand frequently disrupts it. The robot eventually succeeds. (Image: Jose-Luis Olivares, MIT. Stills courtesy of the researchers)

MIT engineers are aiming to give robots a bit of common sense when faced with situations that push them off their trained path. They’ve developed a method that connects robot motion data with the “common sense knowledge” of large language models, or LLMs.

Their approach enables a robot to logically parse a wide range of household tasks into subtasks, and to physically adjust to disruptions within a subtask so that the robot can move on without having to go back and start the task from scratch — and without engineers having to explicitly program fixes for every possible failure along the way.

“Imitation learning is a mainstream approach enabling household robots. But if a robot is blindly mimicking a human’s motion trajectories, tiny errors can accumulate and eventually derail the rest of the execution,” said Yanwei Wang. “With our method, a robot can self-correct execution errors and improve overall task success.”

The researchers illustrate their new approach with a simple chore: scooping marbles from one bowl and pouring them into another. To accomplish this task, engineers would typically move a robot through the motions of scooping and pouring — all in one fluid trajectory. They might do this multiple times, to give the robot a number of human demonstrations to mimic.

The team realized that, while a human might demonstrate a single task in one go, that task really consists of a sequence of subtasks, or trajectories: the robot has to reach into the bowl before it can scoop, and it has to scoop up the marbles before it can move to the empty bowl. To let a robot recover mid-task, engineers would normally have to label each of those subtasks and explicitly program or demonstrate a fix for every way each one could fail.

“That level of planning is very tedious,” Wang said.

Instead, he and his colleagues found some of this work could be done automatically by LLMs. These deep learning models process immense libraries of text, which they use to establish connections between words, sentences, and paragraphs. Through these connections, an LLM can then generate new sentences based on what it has learned about the kind of word that is likely to follow the last.
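As a toy illustration of that next-word idea (far simpler than the models the researchers actually use, and not taken from their work), a small bigram counter can predict a plausible continuation one word at a time:

```python
# Toy next-word prediction: count which word tends to follow each word in a
# tiny "corpus," then generate text by repeatedly appending the most likely
# follower. Real LLMs learn these patterns from immense libraries of text.
from collections import Counter

corpus = "scoop the marbles then pour the marbles into the other bowl".split()

# Tally followers for each word in the corpus.
follow_counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts.setdefault(prev, Counter())[nxt] += 1

def next_word(prev):
    """Return the most common follower of `prev` seen in the corpus."""
    counts = follow_counts.get(prev)
    return counts.most_common(1)[0][0] if counts else None

# Generate a short continuation, one predicted word at a time.
word, sentence = "scoop", ["scoop"]
for _ in range(5):
    word = next_word(word)
    if word is None:
        break
    sentence.append(word)

print(" ".join(sentence))  # prints: "scoop the marbles then pour the"
```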

“LLMs have a way to tell you how to do each step of a task, in natural language. A human’s continuous demonstration is the embodiment of those steps, in physical space,” Wang said. “And we wanted to connect the two, so that a robot would automatically know what stage it is in a task, and be able to replan and recover on its own.”
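As a rough sketch of the first half of that connection, the decomposition step might look something like the following; `query_llm` is a hypothetical stand-in for whatever language-model interface is used, and its hard-coded reply simply mimics a plausible answer rather than the researchers' actual prompts or outputs:

```python
# Hypothetical sketch: ask an LLM to break a demonstrated chore into an
# ordered list of subtask labels, in natural language.

PROMPT = (
    "List, in order, the subtasks a robot arm performs to scoop marbles "
    "from one bowl and pour them into another. Reply with one short verb "
    "phrase per line."
)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model service."""
    # A real implementation would send `prompt` to an LLM and return its
    # text reply. Hard-coded here so the sketch runs standalone.
    return ("reach into the full bowl\n"
            "scoop up marbles\n"
            "move to the empty bowl\n"
            "pour the marbles")

subtasks = [line.strip() for line in query_llm(PROMPT).splitlines() if line.strip()]
print(subtasks)
# ['reach into the full bowl', 'scoop up marbles', 'move to the empty bowl', 'pour the marbles']
```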

For their new approach, the team developed an algorithm to automatically connect an LLM’s natural language label for a particular subtask with a robot’s position in physical space or an image that encodes the robot state. Mapping a robot’s physical coordinates, or an image of the robot state, to a natural language label is known as “grounding.” The team’s new algorithm is designed to learn a grounding “classifier,” meaning that it learns to automatically identify what semantic subtask a robot is in — for example, “reach” versus “scoop” — given its physical coordinates or an image view.
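A minimal sketch of such a grounding classifier, assuming for illustration that the robot state is just a 3-D gripper position and that demonstration frames have already been tagged with LLM-produced subtask labels (the real system can also ground camera images), might look like this:

```python
# Illustrative grounding classifier: map a robot state (here, a gripper
# position) to the semantic subtask it belongs to. The data, labels, and
# model choice below are assumptions for the sketch, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labeled states: [x, y, z] gripper positions recorded during demos.
states = np.array([
    [0.10, 0.00, 0.30],   # approaching the full bowl
    [0.12, 0.01, 0.28],
    [0.15, 0.02, 0.05],   # dipped into the bowl, scooping
    [0.16, 0.01, 0.04],
    [0.40, 0.30, 0.25],   # carrying marbles toward the empty bowl
    [0.45, 0.32, 0.24],
])
labels = ["reach", "reach", "scoop", "scoop", "transport", "transport"]

classifier = LogisticRegression(max_iter=1000).fit(states, labels)

# At run time, the classifier grounds the current state in a subtask label,
# telling the rest of the system which step of the plan the robot is in.
current_state = np.array([[0.14, 0.015, 0.06]])
print(classifier.predict(current_state)[0])  # expected: "scoop"
```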

“The grounding classifier facilitates this dialogue between what the robot is doing in the physical space and what the LLM knows about the subtasks, and the constraints you have to pay attention to within each subtask,” Wang said.
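Putting the pieces together, the recovery loop can be imagined roughly as follows; the function names, plan labels, and toy simulation below are illustrative assumptions, not the researchers' code:

```python
# Sketch of recovery driven by a grounding classifier: while executing each
# subtask, keep checking which subtask the current state actually looks like.
# If a disruption throws the robot back to an earlier subtask, resume there
# instead of restarting the whole chore.

PLAN = ["reach", "scoop", "transport", "pour"]

def execute_with_recovery(plan, step_subtask, ground_state):
    """Run `plan` in order, re-entering an earlier subtask after a disruption."""
    i = 0
    while i < len(plan):
        finished = step_subtask(plan[i])   # advance the current subtask
        grounded = ground_state()          # classifier: which subtask does the state look like?
        if grounded != plan[i] and grounded in plan[:i]:
            print(f"disrupted during '{plan[i]}': state looks like '{grounded}', resuming there")
            i = plan.index(grounded)       # recover locally instead of starting over
        elif finished:
            i += 1                         # subtask goal reached; move on

# Toy simulation: a scripted sequence of classifier outputs in which a human
# knocks the marbles back out once during the scoop.
observations = ["reach", "reach", "scoop", "transport", "pour"]
clock = {"t": 0}

def step_subtask(name):
    clock["t"] += 1
    return True  # pretend each call completes one attempt at the subtask

def ground_state():
    return observations[min(clock["t"] - 1, len(observations) - 1)]

execute_with_recovery(PLAN, step_subtask, ground_state)
```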

For more information, contact Abby Abazorius at 617-253-2709.





