Ingredients for robotics research


We’re releasing eight simulated robotics environments and a Baselines implementation of Hindsight Experience Replay, all developed for our research over the past year. We’ve used these environments to train models which work on physical robots. We’re also releasing a set of requests for robotics research.

This release includes four environments using the Fetch research platform and four environments using the ShadowHand robot. The manipulation tasks contained in these environments are significantly more difficult than the MuJoCo continuous control environments currently available in Gym, all of which are now easily solvable using recently released algorithms like PPO. Furthermore, our newly released environments use models of real robots and require the agent to solve realistic tasks.

FetchReach-v0: Fetch has to move its end-effector to the desired goal position.

FetchSlide-v0: Fetch has to hit a puck across a long table such that it slides and comes to rest on the desired goal.

FetchPush-v0: Fetch has to move a box by pushing it until it reaches a desired goal position.

FetchPickAndPlace-v0: Fetch has to pick up a box from a table using its gripper and move it to a desired goal above the table.

HandReach-v0: ShadowHand has to reach with its thumb and a selected finger until they meet at a desired goal position above the palm.

HandManipulateBlock-v0: ShadowHand has to manipulate a block until it achieves a desired goal position and rotation.

HandManipulateEgg-v0: ShadowHand has to manipulate an egg until it achieves a desired goal position and rotation.

HandManipulatePen-v0: ShadowHand has to manipulate a pen until it achieves a desired goal position and rotation.

These eight robotics environments ship with Gym and use the MuJoCo physics simulator.

All of the new tasks have the concept of a “goal”, for example the desired position of the puck in the slide task or the desired orientation of a block in the hand block manipulation task. All environments by default use a sparse reward of -1 if the desired goal was not yet achieved and 0 if it was achieved (within some tolerance). This is in contrast to the shaped rewards used in the old set of Gym continuous control problems, for example Walker2d-v2 with its shaped reward.

We also include a variant with dense rewards for each environment. However, we believe that sparse rewards are more realistic in robotics applications and we encourage everyone to use the sparse reward variant instead.
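As a minimal sketch of the difference between the two reward variants (the tolerance value and Euclidean distance function here are illustrative assumptions, not the exact constants used by the environments), a goal-conditioned sparse reward and its dense counterpart might look like:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, tolerance=0.05):
    # Sparse variant: 0 if the achieved goal is within tolerance of
    # the desired goal, -1 otherwise. (Tolerance is an assumption.)
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return 0.0 if distance < tolerance else -1.0

def dense_reward(achieved_goal, desired_goal):
    # Dense (shaped) variant: negative Euclidean distance to the goal,
    # so the reward grows smoothly as the agent gets closer.
    return -np.linalg.norm(achieved_goal - desired_goal)

goal = np.array([0.5, 0.2, 0.1])
achieved = np.array([0.5, 0.2, 0.4])
print(sparse_reward(achieved, goal))  # -1.0: still 0.3 away from the goal
print(dense_reward(achieved, goal))   # negative distance, shrinks toward 0
```

The sparse variant gives no gradient of progress; the dense variant leaks distance information into every step, which is what makes it easier to learn from but less realistic for physical robots.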

## Hindsight Experience Replay

Alongside these new robotics environments, we’re also releasing code for Hindsight Experience Replay (or HER for short), a reinforcement learning algorithm that can learn from failure. Our results show that HER can learn successful policies on most of the new robotics problems from only sparse rewards. Below, we also show some potential directions for future research that could further improve the performance of the HER algorithm on these tasks.

## Understanding HER

To understand what HER does, let’s look at it in the context of FetchSlide, a task where we need to learn to slide a puck across the table and hit a target. Our first attempt very likely will not be a successful one. Unless we get very lucky, the next few attempts will also likely not succeed. Typical reinforcement learning algorithms would not learn anything from this experience since they just obtain a constant reward (in this case: `-1`) that does not contain any learning signal.

The key insight that HER formalizes is what humans do intuitively: Even though we have not succeeded at a specific goal, we have at least achieved a different one. So why not just pretend that we wanted to achieve this goal to begin with, instead of the one that we set out to achieve originally? By doing this substitution, the reinforcement learning algorithm can obtain a learning signal since it has achieved _some_ goal; even if it wasn’t the one that we meant to achieve originally. If we repeat this process, we will eventually learn how to achieve arbitrary goals, including the goals that we really want to achieve.

This approach lets us learn how to slide a puck across the table even though our reward is fully sparse and even though we may have never actually hit the desired goal early on. We call this technique Hindsight Experience Replay since it replays experience (a technique often used in off-policy RL algorithms like DQN and DDPG) with goals which are chosen in hindsight, after the episode has finished. HER can therefore be combined with any off-policy RL algorithm (for example, HER can be combined with DDPG, which we write as “DDPG + HER”).
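As an illustrative sketch of the relabeling step (this is not the released Baselines implementation; the transition format and the local `compute_reward` helper are assumptions modeled on the goal-based API), hindsight goal substitution over a finished episode can be written as:

```python
import numpy as np

def compute_reward(achieved_goal, desired_goal, tolerance=0.05):
    # Sparse reward: 0 if within tolerance of the goal, -1 otherwise.
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) < tolerance else -1.0

def relabel_with_hindsight(episode):
    """Given a finished episode (a list of transition dicts), return
    copies in which the desired goal is replaced by the goal actually
    achieved at the end of the episode, with rewards recomputed."""
    hindsight_goal = episode[-1]["achieved_goal"]
    relabeled = []
    for t in episode:
        new_t = dict(t)
        new_t["desired_goal"] = hindsight_goal
        new_t["reward"] = compute_reward(t["achieved_goal"], hindsight_goal)
        relabeled.append(new_t)
    return relabeled

# A toy failed episode: the puck never reaches the original goal at x=1.0...
episode = [
    {"achieved_goal": np.array([0.1, 0.0]), "desired_goal": np.array([1.0, 0.0]), "reward": -1.0},
    {"achieved_goal": np.array([0.4, 0.0]), "desired_goal": np.array([1.0, 0.0]), "reward": -1.0},
]
# ...but after relabeling, the final transition succeeds at the substituted goal.
print(relabel_with_hindsight(episode)[-1]["reward"])  # 0.0
```

Both the original and the relabeled transitions would be stored in the replay buffer, so the off-policy learner sees successes even in episodes that failed at the original goal.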

We’ve found HER to work extremely well in goal-based environments with sparse rewards. We compare DDPG + HER and vanilla DDPG on the new tasks. This comparison includes the sparse and the dense reward versions of each environment.

Median test success rate (line) with interquartile range (shaded area) for four different configurations on HandManipulateBlockRotateXYZ-v0. Data is plotted over training epochs and summarized over five different random seeds per configuration.

DDPG + HER with sparse rewards significantly outperforms all other configurations and manages to learn a successful policy on this challenging task only from sparse rewards. Interestingly, DDPG + HER with dense reward is able to learn but achieves worse performance. Vanilla DDPG mostly fails to learn in both cases. We find this trend to be generally true across most environments and we include full results in our accompanying technical report.

### Requests for Research: HER edition

Though HER is a promising way towards learning complex goal-based tasks with sparse rewards like the robotics environments that we propose here, there is still a lot of room for improvement. Similar to our recently published Requests for Research 2.0, we have a few ideas on ways to improve HER specifically, and reinforcement learning in general.

You can find additional information and references on these proposals and on the new Gym environments in our accompanying technical report.

## Using goal-based environments

Introducing the notion of a “goal” requires a few backwards-compatible changes to the existing Gym API: goal-based environments use a dictionary observation space with `observation`, `achieved_goal`, and `desired_goal` keys, and expose an `env.compute_reward` method so that rewards can be recomputed for substituted goals.

Here is a simple example that interacts with one of the new goal-based environments and performs goal substitution:


```python
import numpy as np
import gym

env = gym.make('FetchReach-v0')
obs = env.reset()
done = False

def policy(observation, desired_goal):
    # Here you would implement your smarter policy. In this case,
    # we just sample random actions.
    return env.action_space.sample()

while not done:
    action = policy(obs['observation'], obs['desired_goal'])
    obs, reward, done, info = env.step(action)

    # If we want, we can substitute a goal here and re-compute
    # the reward. For instance, we can just pretend that the desired
    # goal was what we achieved all along.
    substitute_goal = obs['achieved_goal'].copy()
    substitute_reward = env.compute_reward(
        obs['achieved_goal'], substitute_goal, info)
    print('reward is {}, substitute_reward is {}'.format(
        reward, substitute_reward))
```


The new goal-based environments can be used with existing Gym-compatible reinforcement learning algorithms, such as Baselines. Use `gym.wrappers.FlattenDictWrapper` to flatten the dict-based observation space into an array:


```python
import numpy as np
import gym

env = gym.make('FetchReach-v0')

# Simply wrap the goal-based environment using FlattenDictWrapper
# and specify the keys that you would like to use.
env = gym.wrappers.FlattenDictWrapper(
    env, dict_keys=['observation', 'desired_goal'])

# From now on, you can use the wrapped env as per usual:
ob = env.reset()
print(ob.shape)  # is now just an np.array
```


Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, Wojciech Zaremba

Originally published on OpenAI News.