2024-04-18-colaco-carr24a.md

File metadata and controls

52 lines (52 loc) · 2.06 KB
---
title: Conditions on Preference Relations that Guarantee the Existence of Optimal Policies
abstract: Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. We show that a decision-making problem can have optimal policies, characterized by recursive optimality equations, even when no reward function can express the learning goal. These findings underline the need to explore preference-based learning strategies which do not assume that preferences are generated by reward.
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: colaco-carr24a
month: 0
tex_title: Conditions on Preference Relations that Guarantee the Existence of Optimal Policies
firstpage: 3916
lastpage: 3924
page: 3916-3924
order: 3916
cycles: false
bibtex_author: Cola\c{c}o Carr, Jonathan and Panangaden, Prakash and Precup, Doina
author:
- given: Jonathan
  family: Colaço Carr
- given: Prakash
  family: Panangaden
- given: Doina
  family: Precup
date: 2024-04-18
address:
container-title: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
volume: '238'
genre: inproceedings
issued:
  date-parts:
  - 2024
  - 4
  - 18
pdf:
extras:
---