---
abstract: 'We propose two Thompson Sampling-like, model-based learning algorithms for episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed algorithms are inspired by Optimistic Thompson Sampling (O-TS), empirically studied in Chapelle and Li [2011], May et al. [2012] for stochastic multi-armed bandits. The key idea of the original O-TS is to clip the posterior distribution in an optimistic way to ensure that the sampled models are always better than the empirical models. Both of our proposed algorithms are easy to implement and only need one posterior sample to construct an episode-dependent model. Our first algorithm, Optimistic Thompson Sampling for MDPs (O-TS-MDP), achieves a $\widetilde{O} \left(\sqrt{AS^2H^4T} \right)$ regret bound, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the number of time-steps per episode and $T$ is the number of episodes. Our second algorithm, Optimistic Thompson Sampling plus for MDPs (O-TS-MDP$^+$), achieves the (near)-optimal $\widetilde{O} \left(\sqrt{ASH^3T} \right)$ regret bound by taking a more aggressive clipping strategy. Since O-TS was previously only studied empirically, we derive regret bounds of O-TS for stochastic bandits. In addition, we propose O-TS-Bandit$^+$, a randomized version of UCB1 [Auer et al., 2002], for stochastic bandits. Both O-TS and O-TS-Bandit$^+$ achieve the optimal $O\left(\frac{A\ln(T)}{\Delta} \right)$ problem-dependent regret bound, where $\Delta$ denotes the sub-optimality gap.'
openreview: Bj8o6Qs-TVL
title: Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: hu23a
month: 0
tex_title: Optimistic {T}hompson Sampling-based algorithms for episodic reinforcement learning
firstpage: 890
lastpage: 899
page: 890-899
order: 890
cycles: false
bibtex_author: Hu, Bingshan and Zhang, Tianyue H. and Hegde, Nidhi and Schmidt, Mark
author:
- given: Bingshan
  family: Hu
- given: Tianyue H.
  family: Zhang
- given: Nidhi
  family: Hegde
- given: Mark
  family: Schmidt
date: 2023-07-02
address:
container-title: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
volume: '216'
genre: inproceedings
issued:
  date-parts:
  - 2023
  - 7
  - 2
pdf:
extras: []
---
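The abstract describes the key mechanism behind O-TS: clip the posterior sample optimistically so the sampled model is never worse than the empirical one. The following is a minimal sketch of that clipping idea on a synthetic Bernoulli bandit, not the paper's O-TS-Bandit$^+$ or O-TS-MDP algorithms; the function name, the Beta(1, 1) priors, the use of the posterior mean as the empirical reference, and the environment are all assumptions made for illustration.

```python
import numpy as np


def optimistic_ts_bernoulli(n_arms: int, horizon: int, seed: int = 0):
    """Thompson Sampling with optimistic clipping on a synthetic Bernoulli bandit.

    Illustrative sketch only: the Beta(1, 1) priors and the simulated
    environment are assumptions, not details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    true_means = rng.uniform(size=n_arms)  # hypothetical environment
    alpha = np.ones(n_arms)                # Beta posterior parameters per arm
    beta = np.ones(n_arms)

    for _ in range(horizon):
        posterior_mean = alpha / (alpha + beta)
        sampled = rng.beta(alpha, beta)
        # Optimistic clipping: never act on a sample below the posterior
        # mean, so the sampled model is at least as good as the empirical one.
        index = np.maximum(sampled, posterior_mean)
        arm = int(np.argmax(index))
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta
```

The `np.maximum(sampled, posterior_mean)` line is the whole trick: plain Thompson Sampling would act on `sampled` directly, while the one-sided clip removes pessimistic draws at negligible extra cost.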