Skip to content

Latest commit

 

History

History
58 lines (58 loc) · 2.67 KB

2019-06-25-diakonikolas19b.md

File metadata and controls

58 lines (58 loc) · 2.67 KB
abstract section title layout series id month tex_title firstpage lastpage page order cycles bibtex_author author date address publisher container-title volume genre issued pdf extras
We investigate the problem of identity testing for multidimensional histogram distributions. A distribution $p: D \to \mathbb{R}_+$, where $D \subseteq \mathbb{R}^d$, is called a $k$-histogram if there exists a partition of the domain into $k$ axis-aligned rectangles such that $p$ is constant within each such rectangle. Histograms are one of the most fundamental nonparametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, let $q$ be an unknown $d$-dimensional $k$-histogram distribution in fixed dimension $d$, and $p$ be an explicitly given $d$-dimensional $k$-histogram. We want to correctly distinguish, with probability at least $2/3$, between the case that $p = q$ versus $\|p-q\|_1 \geq \epsilon$. We design an algorithm for this hypothesis testing problem with sample complexity $O((\sqrt{k}/\epsilon^2) 2^{d/2} \log^{2.5 d}(k/\epsilon))$ that runs in sample-polynomial time. Our algorithm is robust to model misspecification, i.e., succeeds even if $q$ is only promised to be {\em close} to a $k$-histogram. Moreover, for $k = 2^{\Omega(d)}$, we show a sample complexity lower bound of $(\sqrt{k}/\epsilon^2) \cdot \Omega(\log(k)/d)^{d-1}$ when $d\geq 2$. That is, for any fixed dimension $d$, our upper and lower bounds are nearly matching. Prior to our work, the sample complexity of the $d=1$ case was well-understood, but no algorithm with sub-learning sample complexity was known, even for $d=2$. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.
contributed
Testing Identity of Multidimensional Histograms
inproceedings
Proceedings of Machine Learning Research
diakonikolas19b
0
Testing Identity of Multidimensional Histograms
1107
1131
1107-1131
1107
false
Diakonikolas, Ilias and Kane, Daniel M. and Peebles, John
given family
Ilias
Diakonikolas
given family
Daniel M.
Kane
given family
John
Peebles
2019-06-25
PMLR
Proceedings of the Thirty-Second Conference on Learning Theory
99
inproceedings
date-parts
2019
6
25