Spatial Action Maps for Mobile Manipulation

This work proposes a new action representation for learning to perform complex mobile manipulation tasks. In a typical deep Q-learning setup, a convolutional neural network (ConvNet) is trained to map from an image representing the current state (e.g., a bird's-eye view of a SLAM reconstruction of the scene) to predicted Q-values for a small set of steering command actions (step forward, turn right, turn left, etc.). Instead, we propose an action representation in the same domain as the state: "spatial action maps." In our proposal, the set of possible actions is represented by pixels of an image, where each pixel represents a trajectory to the corresponding scene location along a shortest path through obstacles of the partially reconstructed scene. A significant advantage of this approach is that the spatial position of each state-action value prediction represents a local milestone (local end-point) for the agent's policy, which may be easily recognizable in local visual patterns of the state image. A second advantage is that atomic actions can perform long-range plans (follow the shortest path to a point on the other side of the scene), and thus it is simpler to learn complex behaviors with a deep Q-network. A third advantage is that we can use a fully convolutional network (FCN) with skip connections to learn the mapping from state images to pixel-aligned action images efficiently. During experiments with a robot that learns to push objects to a goal location, we find that policies learned with this proposed action representation achieve significantly better performance than traditional alternatives.
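To make the representation concrete, here is a minimal sketch (our own simplification, not the paper's implementation) of how a policy acts on a spatial action map: the FCN (not shown) outputs a dense Q-value map aligned with the state image, and the policy simply picks the highest-valued reachable pixel as the agent's navigation target. The function and variable names are hypothetical.

```python
import numpy as np

def select_action(q_map: np.ndarray, obstacle_mask: np.ndarray):
    """Pick the highest-value traversable pixel as the navigation target.

    q_map: (H, W) predicted Q-values, one per scene location.
    obstacle_mask: (H, W) boolean, True where a pixel is not traversable.
    """
    q = q_map.copy()
    q[obstacle_mask] = -np.inf             # never target obstacle pixels
    idx = np.argmax(q)                     # flat index of the best pixel
    return np.unravel_index(idx, q.shape)  # (row, col) target location

# Toy example: a 3x3 Q-map where the highest raw value is blocked.
q = np.array([[0.1, 0.5, 0.2],
              [0.9, 0.3, 0.0],
              [0.4, 0.8, 0.6]])
blocked = np.zeros((3, 3), dtype=bool)
blocked[1, 0] = True                       # masks out the 0.9 entry
print(select_action(q, blocked))           # -> (2, 1)
```

The selected pixel then serves as the local end-point described above; the agent follows the precomputed shortest path to reach it.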


To appear at Robotics: Science and Systems (RSS), 2020.
Latest version (April 20, 2020): arXiv:2004.09141 [cs.RO].


Princeton University · Google · Columbia University


Code will be released on GitHub, including:
  • Simulation environments
  • Training code
  • Pretrained models


@inproceedings{wu2020spatial,
  title = {Spatial Action Maps for Mobile Manipulation},
  author = {Wu, Jimmy and Sun, Xingyuan and Zeng, Andy and Song, Shuran and Lee, Johnny and Rusinkiewicz, Szymon and Funkhouser, Thomas},
  booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
  year = {2020}
}

Physical Robot Results

In the following clips, we illustrate some of the interesting emergent behaviors exhibited by our trained agent. Our agents are trained in simulation (in the SmallEmpty environment) and executed directly on the real-world setup.

The most straightforward strategy for the agent is to push well-positioned objects directly into the receptacle.

However, pushing objects in this manner can be difficult, since the agent must keep the object aligned with its end effector throughout the maneuver. In some cases, the object being pushed might slip away before it reaches the receptacle. It's even more difficult if the agent is trying to push a stack of multiple objects — even if the first object is aligned with the end effector, the rest of them may not be.

It turns out though, that through training, the agent also discovers a more reliable way to push objects — by using the wall as a guide. The first step in this strategy is to push objects up against the walls.

Once objects are along the wall, it becomes much easier for the agent to push them into the receptacle, often multiple at a time. Sometimes, the objects are pushed continuously across very long distances. These long-distance multi-object maneuvers would be much less feasible without the wall as a guide.

Some objects need to be pushed past a corner on the way to the receptacle. This poses a challenge to the agent as it is easy for the objects to get stuck in the corners. To address these cases, the agent develops special techniques, such as nudging the objects from the side to get them unstuck, or backing up and adjusting its pushing angle before trying again.

We show examples of full, unedited videos below. All videos of the physical robot play at 8x speed. Note that we swap out the robot with a fresh one when its battery level gets low, and we remove objects that are completely inside the boundary of the receptacle. There is no other human intervention in these experiments.

Simulation Results

Our Method

These clips (4x speed) show the qualitative behavior of our trained agent in each of the four simulation environments we use. We train our agent with several shortest-path components: movement primitives, input channels, and partial rewards. These features enable the trained agent to navigate amongst obstacles with ease.


The corresponding videos at 1x speed are also available below.


Steering Commands Baseline

These clips show baseline agents trained using a more traditional action representation: an 18-dimensional action space that corresponds to selecting one of 18 possible turn angles and moving 25 cm in that direction. These agents were trained for the same number of steps as our method, but perform much worse due to the lower sample efficiency of this action representation. The agent trained in the LargeDivider environment is the only one of the four that learns to push objects along the wall.
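For comparison with the spatial action maps above, the baseline's action space can be sketched as follows. This is an illustration based only on the description above (18 evenly spaced turn angles followed by a fixed 25 cm step); the function names and pose convention are our own assumptions.

```python
import math

NUM_ANGLES = 18   # discrete headings, every 20 degrees
STEP_M = 0.25     # fixed 25 cm forward step

def decode_steering_action(index: int, agent_x: float, agent_y: float):
    """Map an action index in [0, NUM_ANGLES) to a target pose.

    Returns the (x, y) position after turning to the chosen heading
    and stepping forward, plus the heading itself in radians.
    """
    angle = 2 * math.pi * index / NUM_ANGLES
    dx = STEP_M * math.cos(angle)
    dy = STEP_M * math.sin(angle)
    return agent_x + dx, agent_y + dy, angle

# Action 9 out of 18 corresponds to a 180-degree turn.
x, y, heading = decode_steering_action(9, 0.0, 0.0)
print(round(x, 3), round(y, 3))  # -> -0.25 0.0
```

Note how each atomic action moves the agent only 25 cm, so reaching a distant object requires a long sequence of correct decisions, whereas a single spatial-action-map action can specify that whole trajectory at once.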


The same videos at 1x speed are also available below.



Acknowledgments

We would like to thank Naveen Verma, Naomi Leonard, Anirudha Majumdar, Stefan Welker, and Yen-Chen Lin for fruitful technical discussions, as well as Julian Salazar for hardware support. This work was supported in part by the Princeton School of Engineering, as well as the National Science Foundation under IIS-1617236, IIS-1815070, and DGE-1656466.


If you have any questions, please feel free to contact Jimmy Wu.

Last update: April 20, 2020