-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The "binedges non-unique" error #33
Comments
Thanks for volunteering! There are multiple ways this could be handled. The method in question discretizes such that each bin has roughly the same number of items within it. However, things get tricky when boundaries happen to fall on repeated samples. Consider trying to discretize the following into two bins with uniform counts: There probably is a nice way to handle this where some initial binning is done and then a process iterates to a better binning by narrowing bins with many elements and widening bins with very few elements. |
Thanks for the explanation! I took a look at ScikitLearn's implementation and it looks like that they simply removes those repeated edges; see https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L226. Also, their main implementation is much easier: https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/preprocessing/_discretization.py#L204 (they refer to uniform count as "quantile" as it just doing uniform width but on the quantile)---have we considered take that implementation? If yes why did we choose the current one (what's the benefit)? |
Another way that I just read to handle this while still achieving uniform count is by random binning (see https://bkamins.github.io/julialang/2020/12/11/binning.html) but it looks like not compatible with out current pipeline as it doesn't get binedges first. |
You are referring to this?
It looks like If you want a method that decides on the number of bins for you, then maybe consider BayesianBlocks. There are a lot of possible discretization methods - feel free to submit a PR for anything new if you want the functionality. |
Yes and no.
This implementation looks neat for me and I wonder what the advantage of the one we have in our code base is as it seems doing something more complex. PS: In my view, when the use the "post-processing" of "remove bins whose width are too small (i.e., <= 1e-8)", users' input is like the maximum number of bins, which is kind of a sensible thing to do as default. |
Using the builtin
If you want to include an option for trimming bins, that's fine. I wouldn't do it by default in order to preserve backwards compatibility, but I am not too opinionated there. |
I can do a PR with the new implementaiton with an option for trimming. |
Right - ideally a uniform count algorithm for 3 bins on the data [1,2,3,4,5,6,6,6,6,6,6,6,6,6] would give something like [1,3], [4,5], [6,6+eps()] and not [1,5] [6+eps] the way the quantile method would. A |
PR #34 created---plz review. |
I met the "binedges non-unique" error from https://github.com/sisl/Discretizers.jl/blob/master/src/disc_uniformcount.jl#L49.
I saw there is a to-do note on making the algorithm properly handle that---can someone explain what needs to be done for that?
I'm happy to give a try if someone gives me some guidence---thanks!
The text was updated successfully, but these errors were encountered: