Leveraging Symmetry in RL-based Legged Locomotion Control

1 UC Berkeley
2Institute for Interdisciplinary Information Sciences, Tsinghua University
3Istituto Italiano di Tecnologia, Italy
4Shanghai Qi Zhi Institute
IROS 2024

Abstract

Model-free reinforcement learning is a promising approach for autonomously solving challenging robotics control problems, but faces exploration difficulty without information about the robot's morphology. The under-exploration of multiple modalities with symmetric states leads to behaviors that are often unnatural and sub-optimal. This issue becomes particularly pronounced in the context of robotic systems with morphological symmetries, such as legged robots for which the resulting asymmetric and aperiodic behaviors compromise performance, robustness, and transferability to real hardware. To mitigate this challenge, we can leverage symmetry to guide and improve the exploration in policy learning via equivariance / invariance constraints. We investigate the efficacy of two approaches to incorporate symmetry: modifying the network architectures to be strictly equivariant / invariant, and leveraging data augmentation to approximate equivariant / invariant actor-critics. We implement the methods on challenging loco-manipulation and bipedal locomotion tasks and compare with an unconstrained baseline. We find that the strictly equivariant policy consistently outperforms other methods in sample efficiency and task performance in simulation. Additionally, symmetry-incorporated approaches exhibit better gait quality, higher robustness and can be deployed zero-shot to hardware.

Video

Poster

Results

We compare PPOaug, PPOeqic, and a baseline PPO on four different tasks.

The four evaluation tasks.

Training Curves

Comparison of training curves of PPO, PPOaug, and PPOeqic on four tasks, from left to right: Door Pushing, Dribbling, Stand Turning, and Slope Walking. Learning curves show the mean episodic return and standard deviation over three seeds. PPOeqic consistently demonstrates the highest training returns and sample efficiency in all tasks.

Door Pushing

Comparison of success rates (SR) and their symmetry index on the Door Pushing task in training-distribution and out-of-distribution scenarios. Of the three variants, PPOeqic demonstrates both a higher success rate and a better symmetry index in both cases, indicating a more symmetric task-level policy.

Stand Turning

Comparison of command tracking error, cost of transport, and their symmetry indices on the Stand Turning task for the three PPO variants. PPOeqic demonstrates lower tracking error and energy consumption, indicating a more optimal policy.

Slope Walking

Plots of the feet positions along the desired walking direction. Vanilla PPO learns an unstable step pattern with backward steps and foot slipping, resulting in a 50% slower walking speed. PPOaug improves drastically, but asymmetric patterns such as foot dragging still exist. PPOeqic presents the most symmetric interweaving gait pattern and walks at the desired speed.

Real World Experiments

We deploy the learned policies for the Stand Turning task on the real-world quadrupedal robot CyberDog 2. The policy trained with PPOaug shows remarkable robustness.

Background & Method

Morphological Symmetry

We study the robot's morphological symmetry using the tools of group theory, focusing on the reflection symmetry group \( \mathbb{G}:=\mathbb{C}_2=\{e, g_s \mid g_s^2=e\} \). For any morphological configuration \( x \) of the quadrupedal robot, \( g_s \triangleright x \) gives the reflection of \( x \) about the sagittal plane. The same group action can be applied to the state space, action space, and observation space of the task MDP.

Reflection symmetry (animation borrowed from MorphoSymm).
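To make the group action concrete, the following NumPy sketch shows one possible way \( g_s \) could act on a 12-dimensional joint-position vector of a quadruped. The joint ordering and the choice of which angles flip sign are illustrative assumptions, not the exact conventions used in our implementation.

import numpy as np

# Hypothetical joint ordering: [FL, FR, RL, RR] x [hip_abduction, hip_flexion, knee].
# Permutation that swaps the left and right legs (FL <-> FR, RL <-> RR).
perm = np.array([3, 4, 5, 0, 1, 2, 9, 10, 11, 6, 7, 8])

# Assumed sign convention: hip-abduction angles change sign under a
# sagittal-plane reflection, hip-flexion and knee angles do not.
signs = np.array([-1, 1, 1] * 4)

def reflect(q):
    """Apply the sagittal reflection g_s to a joint-position vector q."""
    return signs * q[perm]

q = np.random.randn(12)
# g_s is an involution: applying it twice recovers the original configuration.
assert np.allclose(reflect(reflect(q)), q)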

Equivariant / Invariant Functions

A function \( f: \mathcal{X} \rightarrow \mathcal{Y} \) is equivariant with respect to a group \( \mathbb{G} \) acting on \( \mathcal{X} \) and \( \mathcal{Y} \) if \( f(g \triangleright x) = g \triangleright f(x) \) for all \( x \in \mathcal{X} \) and \( g \in \mathbb{G} \). It is invariant if \( f(g \triangleright x) = f(x) \) for all \( x \in \mathcal{X} \) and \( g \in \mathbb{G} \).
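As a small numerical illustration (not taken from the paper), the snippet below builds an equivariant linear map by symmetrizing an arbitrary weight matrix with respect to a reflection represented by a permutation matrix \( P \) with \( P^2 = I \), and checks the equivariance and invariance definitions.

import numpy as np

rng = np.random.default_rng(0)
n = 6

# Reflection acting by a permutation matrix P (an involution: P @ P = I).
perm = np.array([1, 0, 3, 2, 5, 4])
P = np.eye(n)[perm]

W = rng.standard_normal((n, n))
# Average W over the group {I, P}: W_eq = (W + P^{-1} W P) / 2.
# Because P is an involution, W_eq commutes with P, i.e. W_eq @ P = P @ W_eq.
W_eq = 0.5 * (W + P.T @ W @ P)

f = lambda x: W_eq @ x        # equivariant: f(P x) = P f(x)
h = lambda x: np.sum(x ** 2)  # invariant:   h(P x) = h(x)

x = rng.standard_normal(n)
assert np.allclose(f(P @ x), P @ f(x))
assert np.isclose(h(P @ x), h(x))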

Symmetric MDP

We call an MDP \( (\mathcal{S}, \mathcal{A}, r, T, p_0) \) symmetric if there exists a group \( \mathbb{G} \) acting on the state space \( \mathcal{S} \) and action space \( \mathcal{A} \) such that the reward function \( r \), transition function \( T \) and the density of initial states \( p_0 \) are invariant with respect to the group action. \[ r(g_s \triangleright s, g_s \triangleright a) = r(s, a), \quad T(g_s \triangleright s, g_s \triangleright a, g_s \triangleright s') = T(s, a, s'), \quad p_0(g_s \triangleright s) = p_0(s) \]

Previous studies have shown that symmetric MDPs admit \(\mathbb{G}\)-equivariant optimal policies and \(\mathbb{G}\)-invariant value functions. \[ \pi(g_s \triangleright s) = g_s \triangleright \pi(s), \quad V(g_s \triangleright s) = V(s) \] We aim to leverage this property to guide the exploration in policy learning using two approaches.

PPOaug: PPO with data augmentation.

For each online-collected transition tuple \((s, a, r, s')\), we apply the group action \(g_s\) to \((s, a, s')\) and add the augmented transition tuple \((g_s \triangleright s, g_s \triangleright a, r, g_s \triangleright s')\) to the replay buffer. The policy and value networks are then trained on both the original and the augmented transitions.
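A minimal sketch of this augmentation step is shown below. The mirror_state and mirror_action helpers are hypothetical placeholders for the task-specific implementation of \( g_s \triangleright \); a full PPO buffer would also store log-probabilities and advantages, which are omitted here.

def augment_rollout(transitions, mirror_state, mirror_action):
    """Append the sagittally reflected copy of every collected transition.

    transitions is a list of (s, a, r, s_next) tuples; mirror_state and
    mirror_action apply the group action g_s to states and actions.
    The reward is left unchanged because r is G-invariant in a symmetric MDP.
    """
    augmented = []
    for s, a, r, s_next in transitions:
        augmented.append((s, a, r, s_next))
        augmented.append((mirror_state(s), mirror_action(a), r, mirror_state(s_next)))
    return augmented

Both the original and the mirrored transitions are then used in the PPO actor and critic updates.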

PPOeqic: PPO with hard equivariance / invariance symmetry constraints on network architectures.

Using the escnn and MorphoSymm libraries, we enforce the policy network to be strictly equivariant and the value network to be strictly invariant to the group action \(g_s\).
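As an illustration of hard-wiring this constraint, the sketch below assembles a small equivariant policy network with escnn's zero-dimensional ("no base space") gspace. The field types built from regular representations are placeholders; in practice, MorphoSymm supplies the representations matching the robot's actual observation and action layouts.

import torch
from escnn import group, gspaces, nn as enn

# Reflection group C2 acting on plain feature vectors (no spatial base space).
G = group.cyclic_group(2)
gspace = gspaces.no_base_space(G)

# Placeholder field types; each regular representation of C2 is 2-dimensional.
obs_type = enn.FieldType(gspace, [G.regular_representation] * 24)
hid_type = enn.FieldType(gspace, [G.regular_representation] * 64)
act_type = enn.FieldType(gspace, [G.regular_representation] * 6)

# A strictly equivariant policy: every layer commutes with the group action.
policy = enn.SequentialModule(
    enn.Linear(obs_type, hid_type),
    enn.ReLU(hid_type),
    enn.Linear(hid_type, act_type),
)

obs = obs_type(torch.randn(1, obs_type.size))   # wrap a batch of observations
action = policy(obs)                            # equivariant action prediction
policy.check_equivariance()                     # numerically verifies equivariance

Because every layer is an equivariant module, the composed network satisfies \( \pi(g_s \triangleright s) = g_s \triangleright \pi(s) \) by construction, with no auxiliary loss or data augmentation required.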

Acknowledgements

X. H., Z. L., Q. L., and K. S. acknowledge financial support from The AI Institute, InnoHK of the Government of the Hong Kong Special Administrative Region via the Hong Kong Centre for Logistics Robotics. G. T., M. P., and C. S. acknowledge financial support from PNRR MUR Project PE000013 "Future Artificial Intelligence Research", funded by the European Union - NextGenerationEU. The authors thank Prof. Xue Bin Peng for insightful discussions on this work. The authors also thank Xiaomi Inc. for providing CyberDog 2 for experiments.

BibTeX

@inproceedings{su2024leveraging,
    title={Leveraging Symmetry in RL-based Legged Locomotion Control},
    author={Su, Zhi and Huang, Xiaoyu and Ordoñez-Apraez, Daniel and Li, Yunfei and Li, Zhongyu and Liao, Qiayuan and Turrisi, Giulio and Pontil, Massimiliano and Semini, Claudio and Wu, Yi and Sreenath, Koushil},
    booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    year={2024},
    organization={IEEE}
}