Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sampling integers from discrete distributions #155

Merged
merged 2 commits into from
Sep 22, 2024

Conversation

SabrinaJewson
Copy link
Contributor

No description provided.

@henryjac
Copy link
Contributor

henryjac commented Mar 6, 2024

Very reasonable PR, adding to milestone 0.17 and we'll probably get this merged! Preliminary checks looks good to me.

@henryjac henryjac added this to the 0.17 milestone Mar 6, 2024
@vks
Copy link
Contributor

vks commented Mar 10, 2024

I would be careful with returning integers for distributions that use floats internally. For large numbers, this is lossy.

@SabrinaJewson
Copy link
Contributor Author

For most the distributions, their implementations effectively use integers anyway and this PR just exposes that functionality.

The only exception is the geometric distribution:

x.log(1.0 - self.p).ceil() as u64

The largest finite value this expression can be is if x = 1.4e-45 and 1.0 - self.p = 0.9999999999999999, in which case we get 930262250532780300, which when casted to a u64 is 930262250532780288. Importantly it does stay within the range of a u64, so I would say the lossiness is acceptable here.

@SabrinaJewson
Copy link
Contributor Author

Now resolved the merge conflicts (and squashed everything to one commit).

Copy link

codecov bot commented Aug 31, 2024

Codecov Report

Attention: Patch coverage is 0% with 61 lines in your changes missing coverage. Please review.

Project coverage is 93.76%. Comparing base (a514992) to head (3deb714).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/distribution/multinomial.rs 0.00% 19 Missing ⚠️
src/distribution/geometric.rs 0.00% 10 Missing ⚠️
src/distribution/categorical.rs 0.00% 9 Missing ⚠️
src/distribution/binomial.rs 0.00% 6 Missing ⚠️
src/distribution/hypergeometric.rs 0.00% 6 Missing ⚠️
src/distribution/bernoulli.rs 0.00% 4 Missing ⚠️
src/distribution/discrete_uniform.rs 0.00% 4 Missing ⚠️
src/distribution/poisson.rs 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #155      +/-   ##
==========================================
- Coverage   94.03%   93.76%   -0.28%     
==========================================
  Files          53       53              
  Lines       12055    12090      +35     
==========================================
  Hits        11336    11336              
- Misses        719      754      +35     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@YeungOnion
Copy link
Contributor

Thanks for getting back on this despite how long this was left sitting!

Two requests,

  1. Would you reword to use a conventional commit message so I can rely on automatic change log generation tools? Just a feat: prefix on the first line of the commit would be great.

  2. Are you able to write some regression tests against the pre-existing float output?

@YeungOnion
Copy link
Contributor

Hey, would you want me to resolve the conflicts? I imagine most of them are for the "rand" feature and you've already done the first merge conflict, and I think this is a good feature.

@YeungOnion YeungOnion modified the milestones: 0.17, 0.18 Sep 18, 2024
@SabrinaJewson
Copy link
Contributor Author

I’ve rebased on master and used conventional commits. I’m not sure how to do the regression tests: random number generation is hard to test because, well, it’s random – we could fix a seed but I’m sceptical of adding that to the tests, because it’s not tesitng public API (meaning we could make a non-SemVer-breaking change and have to change the tests, which isn’t considered good testing practice).

That said, I’ve refactored some of the implementations to share as much code as possible, so it should be obviously the case that the correctness of the new algorithm is the same as the correctness of the old algorithm.

@YeungOnion
Copy link
Contributor

Thanks! It all looks good, but I suppose it won't hurt to run CI checks again.


An aside regarding regression tests.
Initially I thought perhaps to just sample an integer and cast to float, then compare it to float, but as you mentioned at the start of this PR, most rely on integer internally, so that's just checking float representation accuracy.

@YeungOnion
Copy link
Contributor

Thanks again!

@YeungOnion YeungOnion merged commit ac52c63 into statrs-dev:master Sep 22, 2024
9 of 10 checks passed
@SabrinaJewson SabrinaJewson deleted the sample-integer branch September 23, 2024 06:37
YeungOnion added a commit to YeungOnion/statrs that referenced this pull request Sep 24, 2024
YeungOnion added a commit to YeungOnion/statrs that referenced this pull request Dec 3, 2024
YeungOnion added a commit to YeungOnion/statrs that referenced this pull request Dec 3, 2024
(3 distributions) * (density evalution + sampling)

lint: run clippy

specify required features for random_clap

random_clap can pmf for multinomial

refactoring for multivariate densities

set some default parameters, on distn basis

fix: specify type sampled after merging statrs-dev#155

fix: sampling should use DistIter, but not collect it

chore: lint for reserve-after-new

doc: specify delimiter in help messages
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants