Spatial Action Maps for Mobile Manipulation

Typical end-to-end formulations for learning robotic navigation involve predicting a small set of steering command actions (e.g., step forward, turn left, turn right, etc.) from images of the current state (e.g., a bird's-eye view of a SLAM reconstruction). Instead, we show that it can be advantageous to learn with dense action representations defined in the same domain as the state. In this work, we present "spatial action maps," in which the set of possible actions is represented by a pixel map (aligned with the input image of the current state), where each pixel represents a local navigational endpoint at the corresponding scene location. Using ConvNets to infer spatial action maps from state images, action predictions are thereby spatially anchored on local visual features in the scene, enabling significantly faster learning of complex behaviors for mobile manipulation tasks with reinforcement learning. In our experiments, we task a robot with pushing objects to a goal location, and find that policies learned with spatial action maps achieve much better performance than traditional alternatives.


To appear at Robotics: Science and Systems (RSS), 2020.
Latest version (June 3, 2020): arXiv:2004.09141 [cs.RO] or here.


1 Princeton University             2 Google              3 Columbia University


Code is available on GitHub, including:
  • Simulation environments
  • Training code
  • Pretrained policies


  title = {Spatial Action Maps for Mobile Manipulation},
  author = {Wu, Jimmy and Sun, Xingyuan and Zeng, Andy and Song, Shuran and Lee, Johnny and Rusinkiewicz, Szymon and Funkhouser, Thomas},
  booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
  year = {2020}

Technical Summary Video (with audio)

Real Robot Results

In the following clips, we illustrate some of the interesting emergent behaviors exhibited by our trained agent. Our agents are trained in simulation (on the SmallEmpty environment) and directly executed in the real-world setup.

The most straightforward strategy for the agent is to push well-positioned objects directly into the receptacle.

However, pushing objects in this manner can be difficult, since the agent must keep the object aligned with its end effector throughout the maneuver. In some cases, the object being pushed might slip away before it reaches the receptacle. It's even more difficult if the agent is trying to push a stack of multiple objects — even if the first object is aligned with the end effector, the rest of them may not be.

It turns out though, that through training, the agent also discovers a more reliable way to push objects — by using the wall as a guide. The first step in this strategy is to push objects up against the walls.

Once objects are along the wall, it becomes much easier for the agent to push them into the receptacle, often multiple at a time. Sometimes, the objects are pushed continuously across very long distances. These long-distance multi-object maneuvers would be much less feasible without the wall as a guide.

Some objects need to be pushed past a corner on the way to the receptacle. This poses a challenge to the agent as it is easy for the objects to get stuck in the corners. To address these cases, the agent develops special techniques, such as nudging the objects from the side to get them unstuck, or backing up and adjusting its pushing angle before trying again.

We show examples of full, unedited videos below. All videos of the real robot play at 8x speed. Note that we swap out the robot with a fresh one when its battery level gets low, and we remove objects that are completely inside the boundary of the receptacle. There is no other human intervention in these experiments.

Simulation Results

Our Method

These clips (4x speed) show the qualitative behavior of our trained agent in each of the four simulation environments we use. We train our agent with several shortest path components — movement primitives, input channels, and partial rewards. These features enable the trained agent to navigate amongst obstacles with ease.


The corresponding videos at 1x speed are also available below.


Steering Commands Baseline

These clips show baseline agents trained using a more traditional action representation — an 18-dimensional action space that corresponds to selecting one of 18 possible turn angles and moving 25 cm in that direction. These agents were trained for the same number of steps as our method, but show much worse performance due to low sample efficiency. The agent trained in the LargeDivider environment is the only one of the four that learns to push objects along the wall.


The same videos at 1x speed are also available below.



We would like to thank Naveen Verma, Naomi Leonard, Anirudha Majumdar, Stefan Welker, and Yen-Chen Lin for fruitful technical discussions, as well as Julian Salazar for hardware support. This work was supported in part by the Princeton School of Engineering, as well as the National Science Foundation under IIS-1617236, IIS-1815070, and DGE-1656466.


If you have any questions, please feel free to contact Jimmy Wu.

Last update: August 31, 2020