Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DiscreteNonParametric distribution #634

Merged
merged 17 commits into from
Nov 4, 2018

Conversation

GordStephen
Copy link
Contributor

This implements a new generic, discrete, sparse distribution type for representing and sampling from an arbitrary PDF, as was discussed on Discourse a while ago. As of right now the type is called Generic, although there's probably a better name for for it...

If there's interest in merging this, I can add documentation, more robust testing, and any outstanding functionality (mfg, cf, etc).

@jmxpearson jmxpearson requested review from lindahua and removed request for lindahua July 9, 2017 22:31
@andreasnoack
Copy link
Member

Thanks for contributing this. This one is extremely similar to Categorical so it would be great if we could think through if they could either be merged or at least how much code they could share.

@GordStephen
Copy link
Contributor Author

GordStephen commented Jul 11, 2017

Sounds good - they already share the sampler and it should be easy to consolidate most of the other functionality as well. They probably can't be merged since the point of Categorical is that you don't need to store a value vector, while the point of Generic is that you want to store a value vector - but maybe they could share an abstract parent type (ArbitraryDistribution <: DiscreteUnivariateDistribution or similar) with most of the methods defined on that?

Edit: Thinking about it more, if Generic.vals took an AbstractVector{T} rather than a Vector{T}, Categorical would just be a special case of Generic, storing the support in a UnitRange (or Base.OneTo?) instead of a Vector. Maybe that's cleaner?

@andreasnoack
Copy link
Member

An option might be just to use a unit range for the value vector. It should be very light weight. In my book, this Generic distribution is the same as a categorical distribution. However, it might be that we want to separate them for efficiency reasons but I'd prefer if it wasn't necessary.

@GordStephen
Copy link
Contributor Author

GordStephen commented Jul 11, 2017

Agreed, I think that's the same approach as my edit. I can't think of any reasons combining them would need to be any less efficient, although I'll run some performance tests to verify that.

Semantically, to me at least, a categorical distribution suggests that the support is non-numeric, i.e the integers in the support are just proxies for some other label, and their ordering/relative values are somewhat arbitrary. Given that connotation I would be hesitant to label a distribution over a potentially-quantitatively-meaningful real-valued support as Categorical... I don't particularly like Generic either, but whatever it's called, I could see switching Categorical to be an alias for an instance of the more general case:

struct Generic{T, P, S <: AbstractVector{T}}  ... end
Categorical{P} = Generic{Int, P, Base.OneTo{Int}}

Generic could also provide single-argument constructors to maintain the existing Categorical API.

@jmxpearson
Copy link
Contributor

Just a bikeshedding note: what about Atomic? In statistics, I've often seen this kind of distribution with support only at fixed values described as a collection of atoms.

Not that "atomic" doesn't have other meanings in software...

@GordStephen
Copy link
Contributor Author

Still more work to be done, but I've converted Categorical{T} into a type alias for Generic{Int,T,UnitRange{Int}}. It's a drop-in replacement - the only test that needed changing to pass was one that accessed the Categorical.K field directly rather than using the ncategories getter method. I used UnitRange{Int} to maximize compatibility with existing code and tests, although eventually I think Base.OneTo{Int} would better encode the fact that the support is always 1:K.

As far as bikeshedding the name, a few other options:

  • NonParametric, although it could cause confusion with Empirical
  • Arbitrary, although it's no more descriptive than Generic

@GordStephen GordStephen force-pushed the gs/generic-distribution branch from 11a576e to 7a56072 Compare July 13, 2017 04:00
@GordStephen GordStephen force-pushed the gs/generic-distribution branch from 7a56072 to a18623f Compare November 4, 2017 17:36
@GordStephen
Copy link
Contributor Author

I think all of the missing methods should be implemented now, but correct me if I've missed anything. As noted in #661 (comment), the support of DiscreteUnivariateDistributions are assumed to contain Ints, so Generic probably shouldn't have that as its supertype....

@leclere
Copy link

leclere commented Mar 2, 2018

Just wanted to state that it would be usefull for me :-)

Also, if I may, Generic might not be understandable enough, Atomic as proposed earlier might be better, or GenericDiscrete...

@matbesancon
Copy link
Member

@GordStephen could you rebase the PR on master before we can move on with this work? There are conflicts

@GordStephen
Copy link
Contributor Author

Sure, will do

@matbesancon
Copy link
Member

Thanks!

@GordStephen GordStephen force-pushed the gs/generic-distribution branch from 9e45f27 to 236d652 Compare September 23, 2018 14:24
@GordStephen
Copy link
Contributor Author

Ok, this is rebased and updated for 0.7.

@matbesancon
Copy link
Member

I like DiscreteAtomic. @GordStephen could you add tests for this.
Also we need to document the whole PR a bit more. Could you add docstrings to all new types and functions, and in the file of the @docs folder which best suits?

@RuiRojo
Copy link

RuiRojo commented Oct 6, 2018

Hi. Thanks for your work of course.

I don't like anything with Atomic here, since I think it's a fancy "discrete", which doesn't say enough. Nor Generic. But I'm not coming up with anything I like either.

Perhaps if it was just an extension of Categorical, with some extra type parameter. Or maybe GeneralizedCategorical?

@matbesancon
Copy link
Member

"fancy" is not always bad, at least it's specific and associated with discrete distributions. We have to keep in mind that the literature does not have clear universal terms for these concepts (discrete over a setthat's not in N), so it's fine to come up with new, specific terms for this

@RuiRojo
Copy link

RuiRojo commented Oct 7, 2018

In many words, this is a "general discrete distrubition with a finite number of atoms", right?

I like the idea of coming up with a new specific term. But it shouldn't already mean something that's in conflict with other existing distributions, such as just Discrete/Atomic. IIUC, an atomic distribution is the same (if not more general) than a discrete one, so it would be bad if it "derived" from Discrete.

@matbesancon
Copy link
Member

@GordStephen while some thoughts are still given on the naming, could you add some documentation in docs and docstring to all new non-trivial functions (it doesn't hurt, even the unexported) :)

@GordStephen
Copy link
Contributor Author

Will do, although I'm traveling a fair bit this week and next so there might be some delay.

@matbesancon
Copy link
Member

bumping this one if you have time to add the doc pages & docstring

@GordStephen
Copy link
Contributor Author

I've added some docstrings and updated the Univariate docs page, but let me know if there should be updates anywhere else.

I also renamed the distribution to DiscreteNonParametric for now, which is at least unambiguous: one thing we can all agree on is that Generic isn't a great option.

@GordStephen GordStephen changed the title [WIP] Add Generic discrete distribution Add DiscreteNonParametric distribution Oct 28, 2018
@matbesancon
Copy link
Member

DiscreteNonParametric is great, thanks !

@matbesancon
Copy link
Member

matbesancon commented Oct 29, 2018 via email

@GordStephen
Copy link
Contributor Author

Ok, non-Vector probability containers should now be supported for DiscreteNonParametric and therefore also Categorical. I also switched the support of Categorical over to be Base.OneTo{Int} instead of just UnitRange{Int} - happily the only test breakage occured in a test that was explicitly looking for a UnitRange type (switching to AbstractUnitRange solved it).

@matbesancon
Copy link
Member

thanks a lot @GordStephen
merging in 24 hours if no additional comments
cc @ararslan @simonbyrne

@GordStephen
Copy link
Contributor Author

Doing some final checks, I realized the tests for this distribution didn't include test_distr. Unfortunately there are several assumptions about uniformly-spaced integer support for discrete distributions in that function that prevent it from working properly on this distribution, but I did make what noninvasive changes I could to allow most of the tests there to run (which yielded a few minor required changes, now addressed).

@matbesancon matbesancon merged commit cd0ae4e into JuliaStats:master Nov 4, 2018
@matbesancon
Copy link
Member

Thanks, merged

@andreasnoack
Copy link
Member

andreasnoack commented Sep 5, 2019

A bit late here but what I suggested in #634 (comment) was that we just used the name Categorical for this distribution unless it had to be a separate distribution for efficiency reasons. In the end, we ended up using the same distribution for both but invented a new name instead of just using Categorical.

I think we should revisit the name since I believe DiscreteNonParametric is a misnomer. What is non-parametric about this distribution? I'll suggest that we rename it to Categorical. We can keep the existing Categorical constructor that doesn't take a sample space argument.

@nalimilan
Copy link
Member

The fact that Categorical is restricted to integers in 1:k is indeed confusing. It would be simpler to allow specifying particular values as an option.

@GordStephen
Copy link
Contributor Author

To me the main distinction between DiscreteNonParametric (happy to revisit the name) and Categorical should be that the support of DiscreteNonParametric is numeric and so has meaningfully defined moments and other quantitative properties, whereas a Categorical distribution can have an arbitrary and unordered support (e.g. ["apples", "bananas", "coconuts"]). So while I'm certainly in favor of allowing Categorical to have non-integer support, I wouldn't want to just rename DiscreteNonParametric to Categorical.

To the point about whether this distribution is non-parametric, I suppose one could argue that it has parameters, just lots of them (a probability and value for each element of the support). That said, a distribution defined by a kernel density estimate has similar properties, and KDEs are considered to be defined non-parametrically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants