Model-free reinforcement learning is a promising approach for autonomously solving challenging robotics control problems, but it faces exploration difficulty without information about the robot's morphology. The under-exploration of multiple modalities with symmetric states leads to behaviors that are often unnatural and sub-optimal. This issue becomes particularly pronounced in robotic systems with morphological symmetries, such as legged robots, for which the resulting asymmetric and aperiodic behaviors compromise performance, robustness, and transferability to real hardware. To mitigate this challenge, we leverage symmetry to guide and improve exploration in policy learning via equivariance / invariance constraints. We investigate two approaches to incorporating symmetry: modifying the network architectures to be strictly equivariant / invariant, and using data augmentation to approximate equivariant / invariant actor-critics. We evaluate both methods on challenging loco-manipulation and bipedal locomotion tasks and compare them with an unconstrained baseline. We find that the strictly equivariant policy consistently outperforms the other methods in sample efficiency and task performance in simulation. Additionally, the symmetry-incorporated approaches exhibit better gait quality and higher robustness, and can be deployed zero-shot to hardware.
We compare PPOaug, PPOeqic, and a baseline PPO on four different tasks.
Training Curves
Door Pushing
Stand Turning
Slope Walking
Real World Experiments
We deploy the learned Stand Turning policies on the real-world quadrupedal robot CyberDog 2. The policy trained with PPOaug shows strong robustness.
Morphological Symmetry
We study the robot's morphological symmetry using the language of group theory, focusing on the reflection symmetry group \( \mathbb{G} := \mathbb{C}_2 = \{e, g_s \mid g_s^2 = e\} \). For any morphological configuration \( x \) of the quadrupedal robot, \( g_s \triangleright x \) is the reflection of \( x \) about the sagittal plane. The same group action extends analogously to the task MDP's state, action, and observation spaces.
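Concretely, the action of \( g_s \) on a vector of robot quantities can be realized as a signed permutation: left and right limbs swap, and lateral quantities flip sign. Below is a minimal sketch with a hypothetical, simplified configuration layout (the actual indices and signs depend on the robot's joint ordering and sign conventions):

```python
import numpy as np

# Hypothetical toy configuration:
# [hip_abd_FL, hip_abd_FR, hip_abd_RL, hip_abd_RR, base_lat_vel, base_roll]
PERM = np.array([1, 0, 3, 2, 4, 5])        # swap left/right legs
SIGN = np.array([-1, -1, -1, -1, -1, -1])  # lateral quantities change sign under reflection

def g_act(x):
    """Apply gs ▷ x as a signed permutation of the configuration vector."""
    return SIGN * x[PERM]

x = np.array([0.20, -0.10, 0.15, -0.05, 0.30, 0.02])
assert np.allclose(g_act(g_act(x)), x)     # gs is an involution: gs^2 = e
```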
Equivariant / Invariant Functions
A function \( f: \mathcal{X} \rightarrow \mathcal{Y} \) is equivariant with respect to a group \( \mathbb{G} \) acting on \( \mathcal{X} \) and \( \mathcal{Y} \) if \( f(g \triangleright x) = g \triangleright f(x) \) for all \( g \in \mathbb{G} \) and \( x \in \mathcal{X} \). It is invariant if \( f(g \triangleright x) = f(x) \) for all \( g \in \mathbb{G} \) and \( x \in \mathcal{X} \).
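As a toy numerical check of these definitions (hypothetical functions, with \( \mathbb{C}_2 \) acting on \( \mathbb{R}^2 \) by swapping the two coordinates):

```python
import numpy as np

def g_act(x):                   # gs ▷ x: swap the two coordinates
    return x[::-1]

def f_equivariant(x):           # satisfies f(gs ▷ x) = gs ▷ f(x)
    return 2.0 * x + x[::-1]

def f_invariant(x):             # satisfies f(gs ▷ x) = f(x)
    return float(np.sum(x ** 2))

x = np.array([1.0, -3.0])
assert np.allclose(f_equivariant(g_act(x)), g_act(f_equivariant(x)))
assert np.isclose(f_invariant(g_act(x)), f_invariant(x))
```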
Symmetric MDP
We call an MDP \( (\mathcal{S}, \mathcal{A}, r, T, p_0) \) symmetric if there exists a group \( \mathbb{G} \) acting on the state space \( \mathcal{S} \) and the action space \( \mathcal{A} \) such that the reward function \( r \), the transition function \( T \), and the initial-state density \( p_0 \) are invariant under the group action: \[ r(g_s \triangleright s, g_s \triangleright a) = r(s, a), \quad T(g_s \triangleright s, g_s \triangleright a, g_s \triangleright s') = T(s, a, s'), \quad p_0(g_s \triangleright s) = p_0(s). \] Previous work has shown that symmetric MDPs admit \(\mathbb{G}\)-equivariant optimal policies and \(\mathbb{G}\)-invariant optimal value functions: \[ \pi(g_s \triangleright s) = g_s \triangleright \pi(s), \quad V(g_s \triangleright s) = V(s). \] We aim to leverage this property to guide exploration in policy learning using two approaches.
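To see why the property holds, consider one Bellman backup of value iteration, writing \( \gamma \) for the discount factor and assuming inductively that \( V_k \) is \( \mathbb{G} \)-invariant; the invariances of \( r \) and \( T \) then propagate to \( V_{k+1} \), and the equivariance of the optimal policy follows by taking the maximizing action: \[ \begin{aligned} V_{k+1}(g_s \triangleright s) &= \max_{a} \Big[ r(g_s \triangleright s, a) + \gamma \, \mathbb{E}_{s' \sim T(g_s \triangleright s,\, a,\, \cdot)} V_k(s') \Big] \\ &= \max_{\tilde a} \Big[ r(g_s \triangleright s, g_s \triangleright \tilde a) + \gamma \, \mathbb{E}_{\tilde s' \sim T(s,\, \tilde a,\, \cdot)} V_k(g_s \triangleright \tilde s') \Big] \\ &= \max_{\tilde a} \Big[ r(s, \tilde a) + \gamma \, \mathbb{E}_{\tilde s' \sim T(s,\, \tilde a,\, \cdot)} V_k(\tilde s') \Big] = V_{k+1}(s), \end{aligned} \] where the second line substitutes \( a = g_s \triangleright \tilde a \) and \( s' = g_s \triangleright \tilde s' \) using the invariance of \( T \), and the third line uses the invariance of \( r \) and of \( V_k \).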
PPOaug: PPO with data augmentation.
For each online-collected transition tuple \((s, a, r, s')\), we apply the group action \(g_s\) to \((s, a, s')\) and add the augmented transition tuple \((g_s \triangleright s, g_s \triangleright a, r, g_s \triangleright s')\) to the replay buffer. The policy and value networks are trained on both the original and the augmented transitions.
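A minimal sketch of this augmentation step (the helper names and the permutation / sign vectors encoding \( g_s \) on observations and actions are hypothetical and robot-specific):

```python
import numpy as np

def reflect(x, perm, sign):
    """Apply gs ▷ x as a signed permutation of the last dimension."""
    return sign * x[..., perm]

def augment_batch(batch, obs_perm, obs_sign, act_perm, act_sign):
    """Concatenate a rollout batch with its sagittal-mirrored copy.

    batch: dict of arrays 'obs', 'act', 'rew', 'next_obs'. The reward is copied
    unchanged because r(gs ▷ s, gs ▷ a) = r(s, a).
    """
    mirrored = {
        "obs": reflect(batch["obs"], obs_perm, obs_sign),
        "act": reflect(batch["act"], act_perm, act_sign),
        "rew": batch["rew"],
        "next_obs": reflect(batch["next_obs"], obs_perm, obs_sign),
    }
    return {k: np.concatenate([batch[k], v], axis=0) for k, v in mirrored.items()}
```

The doubled batch then feeds the standard PPO policy and value updates unchanged.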
PPOeqic: PPO with hard equivariance / invariance symmetry constraints on network architectures.
Using the escnn and MorphoSymm libraries, we enforce the policy network to be strictly equivariant and the value network to be strictly invariant under the group action \(g_s\).
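As an illustration of the hard-constraint approach, here is a minimal sketch of a strictly \( \mathbb{C}_2 \)-equivariant policy built with escnn; the field types below use plain regular representations with hypothetical dimensions, whereas in practice the robot-specific representations of the observation and action spaces come from MorphoSymm:

```python
import torch
from escnn import group, gspaces, nn as enn

G = group.cyclic_group(2)                      # reflection group C2 = {e, gs}
gspace = gspaces.no_base_space(G)              # plain vector inputs, no spatial base space

# Hypothetical sizes: observations / actions assumed to decompose into C2 regular representations.
obs_type = enn.FieldType(gspace, 24 * [G.regular_representation])
hid_type = enn.FieldType(gspace, 64 * [G.regular_representation])
act_type = enn.FieldType(gspace, 6 * [G.regular_representation])

policy = enn.SequentialModule(
    enn.Linear(obs_type, hid_type),
    enn.ReLU(hid_type),
    enn.Linear(hid_type, act_type),
)

obs = obs_type(torch.randn(8, obs_type.size))  # batch of 8 observations
act = policy(obs)

# Equivariance holds by construction: reflecting the observation reflects the action.
gs = G.elements[1]
assert torch.allclose(policy(obs.transform(gs)).tensor, act.transform(gs).tensor, atol=1e-5)
```

The critic can be built the same way, with its output field using the trivial representation so that the predicted value is \( \mathbb{G} \)-invariant.

Acknowledgements

X. H., Z. L., Q. L., and K. S. acknowledge financial support from The AI Institute, InnoHK of the Government of the Hong Kong Special Administrative Region via the Hong Kong Centre for Logistics Robotics. G. T., M. P., and C. S. acknowledge financial support from PNRR MUR Project PE000013 "Future Artificial Intelligence Research", funded by the European Union - NextGenerationEU. The authors thank Prof. Xue Bin Peng for insightful discussions on this work. The authors also thank Xiaomi Inc. for providing CyberDog 2 for experiments.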
@inproceedings{su2024leveraging,
title={Leveraging Symmetry in RL-based Legged Locomotion Control},
author={Su, Zhi and Huang, Xiaoyu and Ordoñez-Apraez, Daniel and Li, Yunfei and Li, Zhongyu and Liao, Qiayuan and Turrisi, Giulio and Pontil, Massimiliano and Semini, Claudio and Wu, Yi and Sreenath, Koushil},
booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year={2024},
organization={IEEE}
}