Nvidia’s DrEureka outperforms humans in training robotics systems
Large language models (LLMs) can accelerate the training of robotics systems in super-human ways, according to a new study by scientists at Nvidia, the University of Pennsylvania and the University of Texas at Austin.
The study introduces DrEureka, a technique that can automatically create reward functions and randomization distributions for robotics systems. DrEureka stands for Domain Randomization Eureka. DrEureka only requires a high-level description of the target task and is faster and more efficient than human-designed rewards in transferring learned policies from simulated environments to the real world.
The implications could be significant for the fast-moving world of robotics, which has recently gotten a renewed boost from advances in language and vision models.
Sim-to-real transfer
When designing robotics models for new tasks, a policy is usually trained in a simulated environment and deployed to the real world. The difference between simulation and real-world environments, referred to as the “sim-to-real” gap, is one of the big challenges of any robotics system. Configuring and fine-tuning the policy for optimal performance usually requires a bit of back and forth between simulation and real-world environments.
Recent works have shown that LLMs can combine their vast world knowledge and reasoning capabilities with the physics engines of virtual simulators to learn complex low-level skills. For example, LLMs can be used to design reward functions, the components that steer the robotics reinforcement learning (RL) system to find the correct sequences of actions for the desired task.
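To make the idea concrete, here is a minimal, hypothetical example of what a hand-written reward function for a forward-locomotion task might look like; the observation fields, weights and thresholds are illustrative assumptions, not code from the paper.

```python
import numpy as np

def forward_locomotion_reward(obs: dict, action: np.ndarray) -> float:
    """Hypothetical hand-designed reward for a quadruped walking forward.

    Rewards forward velocity, penalizes energy use and falling over.
    All field names and weights are illustrative, not DrEureka's actual code.
    """
    forward_velocity = obs["base_linear_velocity"][0]    # m/s along the robot's heading
    energy_penalty = 0.005 * np.sum(np.square(action))   # discourage large actuator torques
    upright_bonus = 1.0 if obs["base_height"] > 0.25 else 0.0  # stay on its feet
    return forward_velocity - energy_penalty + upright_bonus
```

Designing such functions by hand, and tuning their weights for each new task, is exactly the effort the LLM is meant to take over.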
However, once a policy is learned in simulation, transferring it to the real world requires a lot of manual tweaking of the reward functions and simulation parameters.
DrEureka
The goal of DrEureka is to use LLMs to automate the intensive human efforts required in the sim-to-real transfer process.
DrEureka builds on Eureka, a technique introduced in October 2023. Eureka takes a robotic task description and uses an LLM to generate software implementations of a reward function that measures success on that task. These reward functions are then run in simulation and the results are returned to the LLM, which reflects on the outcome and modifies the reward function accordingly. The advantage of this technique is that it can be run in parallel across hundreds of reward functions, all generated by the LLM. It can then pick the best functions and continue to improve them.
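A rough sketch of this generate-evaluate-reflect loop is below; the LLM client, simulator API and scoring fields are placeholder assumptions standing in for whatever stack is actually used, not the authors' implementation.

```python
def eureka_style_search(llm, simulator, task_description, iterations=5, candidates=16):
    """Illustrative sketch of an Eureka-style reward search loop (not the official code).

    Each round, the LLM proposes a batch of candidate reward functions, each candidate
    is evaluated by training a policy in simulation, and feedback from the best one
    is returned to the LLM so it can refine its proposals in the next round.
    """
    feedback = ""
    best_reward_code, best_score = None, float("-inf")
    for _ in range(iterations):
        # Ask the LLM for several candidate reward functions.
        reward_candidates = [
            llm.generate_reward_code(task_description, feedback)
            for _ in range(candidates)
        ]
        # Train a policy with each candidate and measure task success in simulation.
        results = [simulator.train_and_evaluate(code) for code in reward_candidates]
        top = max(range(len(results)), key=lambda i: results[i].score)
        if results[top].score > best_score:
            best_reward_code, best_score = reward_candidates[top], results[top].score
        # Summarize training statistics so the LLM can reflect and improve.
        feedback = results[top].training_summary
    return best_reward_code
```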
While Eureka's reward functions are great for training RL policies in simulation, they do not account for the messiness of the real world, so the resulting policies still require manual sim-to-real transfer. DrEureka addresses this shortcoming by automatically configuring domain randomization (DR) parameters.
DR techniques randomize the physical parameters of the simulation environment so that the RL policy can generalize to the unpredictable perturbations it meets in the real world. One of the important challenges of DR is choosing the right parameters and ranges of perturbation. Adjusting these parameters requires commonsense physical reasoning and knowledge of the target robot.
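As a minimal sketch of what domain randomization looks like in practice, assuming a simulator that exposes its physics settings: the parameter names and ranges below are illustrative assumptions, not the values chosen in the study.

```python
import random

# Hypothetical randomization ranges; in DrEureka the LLM proposes such ranges
# based on how the trained policy behaves under perturbed physics.
DR_RANGES = {
    "friction":       (0.4, 1.2),     # ground friction coefficient
    "gravity_z":      (-10.5, -9.0),  # vertical gravity in m/s^2
    "motor_strength": (0.8, 1.2),     # multiplier on actuator torque
    "payload_mass":   (0.0, 2.0),     # extra mass (kg) attached to the robot's base
}

def sample_domain_randomization():
    """Draw one set of physics parameters for a training episode."""
    return {name: random.uniform(low, high) for name, (low, high) in DR_RANGES.items()}

# At the start of each simulated episode, the environment is reconfigured with
# freshly sampled parameters so the policy never overfits to a single physics setup:
#   episode_params = sample_domain_randomization()
#   simulator.apply_physics(episode_params)
```

Picking those ranges well is the hard part: too narrow and the policy fails in the real world, too wide and it never learns the task.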
“These characteristics of designing DR parameters make it an ideal problem for LLMs to tackle because of their strong grasp of physical knowledge and effectiveness in generating hypotheses, providing good initializations to complex search and black-box optimization problems in a zero-shot manner,” the researchers wrote.
DrEureka uses a multi-step process to break down the complexity of optimizing reward functions and domain randomization parameters at the same time. First, an LLM is given a task description along with safety instructions about the robot and its environment. From these, DrEureka creates an initial reward function and learns a policy, as in the original Eureka. The system then runs tests with the policy and reward function to determine the suitable range of physics parameters, such as friction and gravity.
The LLM then uses this information to select the optimal domain randomization configurations. Finally, the policy is retrained with the DR configurations to become robust against the noisiness of the real world.
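Put together, the pipeline described above could be sketched roughly as follows; every function name here is a stand-in assumption for components the researchers describe, not their actual API.

```python
def dreureka_pipeline(llm, simulator, task_description, safety_instructions):
    """Rough sketch of the DrEureka sim-to-real pipeline described in the article.

    1. The LLM writes a reward function from the task description and safety instructions.
    2. A policy is trained in simulation with that reward, as in Eureka.
    3. The policy is probed under perturbed physics to find the ranges it tolerates.
    4. The LLM turns those feasible ranges into a domain-randomization configuration.
    5. The policy is retrained under randomization to withstand real-world noise.
    All calls are illustrative placeholders.
    """
    reward_code = llm.generate_reward_code(task_description, safety_instructions)
    policy = simulator.train_policy(reward_code)
    feasible_ranges = simulator.probe_physics_limits(policy, reward_code)
    dr_config = llm.generate_dr_config(task_description, feasible_ranges)
    robust_policy = simulator.train_policy(reward_code, domain_randomization=dr_config)
    return robust_policy
```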
The researchers described DrEureka as a “language-model driven pipeline for sim-to-real transfer with minimal human intervention.”
DrEureka in action
The researchers evaluated DrEureka on quadruped and dexterous manipulator platforms, although the method is general and applicable to diverse robots and tasks. Their findings show that in quadruped locomotion, policies trained with DrEureka outperform the classic human-designed systems by 34% in forward velocity and 20% in distance traveled across various real-world evaluation terrains. They also tested DrEureka on dexterous manipulation with robotic hands. Given a fixed amount of time, the best policy trained by DrEureka performed 300% more cube rotations than human-developed policies.
But the most interesting finding was the application of DrEureka to the novel task of having a robo-dog balance and walk on a yoga ball. The LLM was able to design a reward function and DR configurations that allowed the trained policy to be transferred to the real world with no extra configuration and perform well on diverse indoor and outdoor terrains with minimal safety support.
Interestingly, the study found that the safety instructions included in the task description play an important role in ensuring that the LLM generates sensible reward functions that transfer to the real world.
“We believe that DrEureka demonstrates the potential of accelerating robot learning research by using foundation models to automate the difficult design aspects of low-level skill learning,” the researchers wrote.