---
abstract: 'We propose two Thompson Sampling-like, model-based learning algorithms for episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed algorithms are inspired by Optimistic Thompson Sampling (O-TS), empirically studied in Chapelle and Li [2011], May et al. [2012] for stochastic multi-armed bandits. The key idea of the original O-TS is to clip the posterior distribution in an optimistic way to ensure that the sampled models are always better than the empirical models. Both of our proposed algorithms are easy to implement and only need one posterior sample to construct an episode-dependent model. Our first algorithm, Optimistic Thompson Sampling for MDPs (O-TS-MDP), achieves a $\widetilde{O} \left(\sqrt{AS^2H^4T} \right)$ regret bound, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the number of time-steps per episode and $T$ is the number of episodes. Our second algorithm, Optimistic Thompson Sampling plus for MDPs (O-TS-MDP$^+$), achieves the (near)-optimal $\widetilde{O} \left(\sqrt{ASH^3T} \right)$ regret bound by taking a more aggressive clipping strategy. Since O-TS was previously only studied empirically, we derive regret bounds of O-TS for stochastic bandits. In addition, we propose O-TS-Bandit$^+$, a randomized version of UCB1 [Auer et al., 2002], for stochastic bandits. Both O-TS and O-TS-Bandit$^+$ achieve the optimal $O\left(\frac{A\ln(T)}{\Delta} \right)$ problem-dependent regret bound, where $\Delta$ denotes the sub-optimality gap.'
openreview: Bj8o6Qs-TVL
title: Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: hu23a
month: 0
tex_title: Optimistic {T}hompson Sampling-based algorithms for episodic reinforcement learning
firstpage: 890
lastpage: 899
page: 890-899
order: 890
cycles: false
bibtex_author: Hu, Bingshan and Zhang, Tianyue H. and Hegde, Nidhi and Schmidt, Mark
author:
- given: Bingshan
  family: Hu
- given: Tianyue H.
  family: Zhang
- given: Nidhi
  family: Hegde
- given: Mark
  family: Schmidt
date: 2023-07-02
address:
container-title: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
volume: '216'
genre: inproceedings
issued:
  date-parts:
  - 2023
  - 7
  - 2
pdf:
extras: []
---
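The abstract describes the key mechanism behind O-TS: clip the posterior sample optimistically so the sampled model is never worse than the empirical one. The following is a minimal sketch of that clipping idea on a synthetic Bernoulli bandit, not the paper's O-TS-Bandit$^+$ or O-TS-MDP algorithms; the function name, the Beta(1, 1) priors, the use of the posterior mean as the empirical reference, and the environment are all assumptions made for illustration.

```python
import numpy as np


def optimistic_ts_bernoulli(n_arms: int, horizon: int, seed: int = 0):
    """Thompson Sampling with optimistic clipping on a synthetic Bernoulli bandit.

    Illustrative sketch only: the Beta(1, 1) priors and the simulated
    environment are assumptions, not details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    true_means = rng.uniform(size=n_arms)  # hypothetical environment
    alpha = np.ones(n_arms)                # Beta posterior parameters per arm
    beta = np.ones(n_arms)

    for _ in range(horizon):
        posterior_mean = alpha / (alpha + beta)
        sampled = rng.beta(alpha, beta)
        # Optimistic clipping: never act on a sample below the posterior
        # mean, so the sampled model is at least as good as the empirical one.
        index = np.maximum(sampled, posterior_mean)
        arm = int(np.argmax(index))
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta
```

The `np.maximum(sampled, posterior_mean)` line is the whole trick: plain Thompson Sampling would act on `sampled` directly, while the one-sided clip removes pessimistic draws at negligible extra cost.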