
Add Empirical() distribution? #98

Open
zeileis opened this issue Feb 23, 2023 · 3 comments

@zeileis (Collaborator) commented Feb 23, 2023

Question

For some of our practical meteorological applications we need to deal with probability distributions and probabilistic forecasts that are made via empirical distributions (so-called "ensembles" in weather forecasting). To enable all the nice infrastructure from distributions3 and topmodels, we (= mostly @retostauffer with some input from me) have written a first draft of a distribution class and corresponding d/p/q/r functions.

My feeling is that this distribution class is of general interest and could also be relevant in introductory statistics. So should we prepare a PR for distributions3?

Implementation strategy

The idea is that the empirical observations are handled internally like the parameters of a probability distribution. The moments and quantiles are computed directly from the empirical observations. The cdf() uses the empirical CDF (ECDF), and the pdf() can use either hist() or density().
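
For concreteness, here is a minimal standalone sketch of this idea (not the actual topmodels implementation; all names and details below are just assumptions for illustration):

## Minimal sketch: the observations play the role of the "parameters"
Empirical_sketch <- function(samples) {
  if (!is.list(samples)) samples <- list(samples)
  structure(list(samples = samples), class = "Empirical_sketch")
}

## ECDF: proportion of stored observations <= q
cdf_sketch <- function(d, q) {
  vapply(d$samples, function(s) mean(s <= q), numeric(1))
}

## Approximate density at x, either via hist() or via density()
pdf_sketch <- function(d, x, method = c("hist", "density"), ...) {
  method <- match.arg(method)
  vapply(d$samples, function(s) {
    if (method == "hist") {
      h <- hist(s, plot = FALSE, ...)
      h$density[findInterval(x, h$breaks, all.inside = TRUE)]
    } else {
      approx(density(s, ...), xout = x)$y
    }
  }, numeric(1))
}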

Illustration

Packages: The current code is in topmodels but unexported at the moment.

install.packages("topmodels", repos = "https://R-Forge.R-project.org")
library("distributions3")

Set up a simple empirical sample and a shifted version with some extra observations to obtain another empirical sample:

x <- c(-3, -2, -1, -0.5, 0, 0.5, 1, 2, 3)
y <- c(0, 0, x + 3)
d <- topmodels:::Empirical(list(x, y))
d
## [1] "Empirical distribution (Min. -3, Max.  3, N = 9)" 
## [2] "Empirical distribution (Min.  0, Max.  6, N = 11)"

Moments and quantiles are straightforward:

mean(d)
## [1] 0.000000 2.454545
variance(d)
## [1] 3.562500 4.322727
quantile(d, 0.5)
## [1] 0.0 2.5
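
These are just the usual sample statistics of the stored observations; for example, with the base R defaults of var() and median(), the same numbers can be reproduced directly:

sapply(list(x, y), mean)
## [1] 0.000000 2.454545
sapply(list(x, y), var)
## [1] 3.562500 4.322727
sapply(list(x, y), median)
## [1] 0.0 2.5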

Random sampling is done with replacement (i.e., it corresponds to bootstrapping):

set.seed(0)
random(d, 10)
##      r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8 r_9 r_10
## [1,]   3   1  -2  -2  -1   0 0.5   1   0    3
## [2,]   1   0   3   6   0   2 5.0   4   2    4
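
For a single sample this is the familiar base R bootstrap resample (values differ from above only because the RNG stream is in a different state):

## Base R equivalent for the first empirical sample
sample(x, size = 10, replace = TRUE)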

The ECDF is uniquely defined but the PDF via histogram or kernel density depends on binning/bandwidth:

cdf(d, 0)
## [1] 0.5555556 0.2727273
pdf(d, 0, method = "hist") ## default method
## [1] 0.2222222 0.3636364
pdf(d, 0, method = "hist", breaks = 4)
## [1] 0.1666667 0.2272727
pdf(d, 0, method = "density")
## [1] 0.1978979 0.1271395
pdf(d, 0, method = "density", bw = 1)
## [1] 0.1894239 0.1378353
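
The ECDF values can be double-checked directly with base R's ecdf():

c(ecdf(x)(0), ecdf(y)(0))
## [1] 0.5555556 0.2727273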
@alexpghayes (Owner) commented

Hmmm. This is interesting. Morally I feel like the Empirical distribution is a discrete distribution over the observed data with probability mass 1/n on each point. I'm open to this, but I think the implementation might be starting to blur the line between a distribution as a population and a distributional estimate of that population.
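
Under that discrete reading, evaluating the pmf would just amount to counting matches in the observed sample, e.g. (a throwaway sketch, not part of either package; duplicated observations automatically get mass k/n):

pmf_discrete <- function(s, at) vapply(at, function(a) mean(s == a), numeric(1))
pmf_discrete(x, c(0, 0.25))
## [1] 0.1111111 0.0000000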

@zeileis (Collaborator, Author) commented Feb 25, 2023

That was also my first impulse. Reto convinced me that for the data he was interested in, an approximated density would be more useful in practice. But given your reaction, I would say we should make the discrete behavior the default. The only remaining question is whether we allow histogram/kernel density estimates to be available as alternatives "for convenience". Would you be open to that?

@alexpghayes (Owner) commented

> The only remaining question is whether we allow histogram/kernel density estimates to be available as alternatives "for convenience". Would you be open to that?

I'm fairly opposed as currently implemented, because the class structure does not map onto the structure of the statistical objects. I think the right way to do this is to have distribution objects and distributional_estimate objects, and for these to be distinct. The idea is roughly that a distribution object corresponds to a single CDF F, while a distributional estimate corresponds to a single estimate of F obtained from data X. This is essentially the same complaint I originally had about prodist(). In parametric settings the distinction might otherwise be minor, but in non-parametric settings it becomes clearer. A histogram or a kernel density estimate is very much a distributional estimate, but I don't think there is an obvious corresponding distribution. More precisely, I think it is misleading to pick a semi-arbitrary non-parametric estimate and claim that it corresponds to an Empirical distribution. I understand this is splitting hairs to some degree.

In practice, I don't know of a convenient way to implement a bunch of parametric distribution S3 objects and then also derive a corresponding distributional_estimate S3 class for each distribution, but this kind of dual class system is my preference here. At the very least, the distribution and distributional_estimate classes should have different print methods. Things like KDEs and histograms should then be implemented as standalone distributional estimators.
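
As a rough illustration of that dual class system (all names hypothetical, deliberately not reusing existing distributions3 classes), the two kinds of objects could simply carry distinct classes with their own print methods:

## Hypothetical sketch of a distribution / distributional_estimate split
new_distribution <- function(x, family) {
  structure(x, class = c(family, "distribution_sketch"))
}
new_distributional_estimate <- function(x, family) {
  structure(x, class = c(family, "distributional_estimate_sketch"))
}
print.distribution_sketch <- function(x, ...) {
  cat("Distribution:", class(x)[1], "\n")
  invisible(x)
}
print.distributional_estimate_sketch <- function(x, ...) {
  cat("Distributional estimate:", class(x)[1], "(fitted to data)\n")
  invisible(x)
}
print(new_distributional_estimate(list(data = x, bw = 1), "KDE"))
## Distributional estimate: KDE (fitted to data)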
