abstract | openreview | title | layout | series | publisher | issn | id | month | tex_title | firstpage | lastpage | page | order | cycles | bibtex_author | author | date | address | container-title | volume | genre | issued | extras |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
We propose two Thompson Sampling-like, model-based learning algorithms for episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed algorithms are inspired by Optimistic Thompson Sampling (O-TS), empirically studied in Chapelle and Li [2011], May et al. [2012] for stochastic multi-armed bandits. The key idea for the original O-TS is to clip the posterior distribution in an optimistic way to ensure that the sampled models are always better than the empirical models. Both of our proposed algorithms are easy to implement and only need one posterior sample to construct an episode-dependent model. Our first algorithm, Optimistic Thompson Sampling for MDPs (O-TS-MDP), achieves a | Bj8o6Qs-TVL | Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning | inproceedings | Proceedings of Machine Learning Research | PMLR | 2640-3498 | hu23a | 0 | Optimistic {T}hompson Sampling-based algorithms for episodic reinforcement learning | 890 | 899 | 890-899 | 890 | false | Hu, Bingshan and Zhang, Tianyue H. and Hegde, Nidhi and Schmidt, Mark | | 2023-07-02 | | Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence | 216 | inproceedings | | |