---
title: Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
section: Poster
openreview: i84V7i6KEMd
abstract: 'Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that encoding environment dynamics in the reward function improves the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) encoding environment dynamics in a state-action representation $z^{sa}$ via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from $z^{sa}$, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover $83\%$ and $66\%$ of ground-truth reward policy performance versus only $38\%$ and $21\%$ without environment dynamics. The performance gains demonstrate that explicitly encoding environment dynamics improves preference-learned reward functions.'
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: metcalf23a
month: 0
tex_title: Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
firstpage: 1484
lastpage: 1532
page: 1484-1532
order: 1484
cycles: false
bibtex_author: Metcalf, Katherine and Sarabia, Miguel and Mackraz, Natalie and Theobald, Barry-John
author:
- given: Katherine
  family: Metcalf
- given: Miguel
  family: Sarabia
- given: Natalie
  family: Mackraz
- given: Barry-John
  family: Theobald
date: 2023-12-02
address:
container-title: Proceedings of The 7th Conference on Robot Learning
volume: '229'
genre: inproceedings
issued:
  date-parts:
  - 2023
  - 12
  - 2
pdf:
extras: []
---
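
The two-step loop described in the abstract (a self-supervised temporal consistency task that shapes $z^{sa}$, then a preference-based reward bootstrapped from $z^{sa}$) can be illustrated with a short sketch. The following is a hypothetical PyTorch rendering, not the authors' released code: `StateActionEncoder`, `predictor`, `reward_head`, and all layer sizes are illustrative assumptions, and the Bradley-Terry objective in step (2) is the common PbRL choice rather than something specified in this metadata.

```python
# Hypothetical sketch of the two-step loop from the abstract; all names
# and architectures here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StateActionEncoder(nn.Module):
    """Maps (s_t, a_t) to a dynamics-aware representation z^{sa}."""

    def __init__(self, state_dim, action_dim, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


def temporal_consistency_loss(encoder, predictor, s, a, s_next, a_next):
    """Step (1): self-supervised temporal consistency.

    Predict the next step's embedding from z^{sa}; the target branch is
    detached (stop-gradient) so the encoder is trained to anticipate
    the environment dynamics rather than chase its own gradients.
    """
    z = encoder(s, a)                            # (batch, z_dim)
    z_next = encoder(s_next, a_next).detach()    # target, no gradient
    return F.mse_loss(predictor(z), z_next)


def preference_loss(encoder, reward_head, seg0, seg1, label):
    """Step (2): bootstrap the reward function from z^{sa}.

    seg0/seg1 are (states, actions) tensors of shape (batch, T, dim);
    label in {0, 1} marks the segment the annotator preferred. Summed
    per-step rewards feed a Bradley-Terry (softmax) preference model,
    a standard PbRL objective.
    """
    r0 = reward_head(encoder(*seg0)).squeeze(-1).sum(dim=1)  # (batch,)
    r1 = reward_head(encoder(*seg1)).squeeze(-1).sum(dim=1)
    logits = torch.stack([r0, r1], dim=1)        # (batch, 2)
    return F.cross_entropy(logits, label)
```

In each outer iteration one would minimize `temporal_consistency_loss` on unlabeled environment transitions and `preference_loss` on the labeled segment pairs; `predictor` and `reward_head` can be small MLPs (e.g., ending in `nn.Linear(z_dim, z_dim)` and `nn.Linear(z_dim, 1)` respectively), with `label` given as a `torch.long` tensor.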