[ENH] proba regression: reduction to multiclass classification #378

Closed
fkiraly opened this issue Jun 7, 2024 · 7 comments · Fixed by #410
Assignees
Labels
feature request New feature or request good first issue Good for newcomers implementing algorithms Implementing algorithms, estimators, objects native to skpro module:regression probabilistic regression module

Comments

@fkiraly
Collaborator

fkiraly commented Jun 7, 2024

Following today's discussion, here is a short design for the reducer to multiclass classification mentioned in #7.

Parameters are:

  • an sklearn classifier clf capable of multiclass classification
  • a bins arg, default = 10. Possible values are an int (the number of bins) or an ordered list of float (the bin boundaries).

The algorithm works as follows:

  • if bins is an int, internally replaces it with that many bins, whose bins + 1 boundaries are the equally spaced quantiles of the empirical distribution of the training labels.
  • converts each training label into a multiclass label according to the bin it falls in
  • in fit, fits clf to this binned training data
  • in predict_proba, uses clf.predict_proba to obtain class probabilities, and combines these with the bin boundaries from bins to obtain a Histogram distribution
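
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the skpro implementation: the helper names `make_edges`, `to_bin_labels` and `bin_masses` are hypothetical, and the sketch uses only the standard library so that the binning logic is visible on its own.

```python
from bisect import bisect_right
from statistics import quantiles


def make_edges(y, bins=10):
    """Return bin edges: the list itself, or bins + 1 quantile edges for an int."""
    if not isinstance(bins, int):
        return list(bins)
    # "inclusive" treats min(y) and max(y) as the 0% and 100% quantiles,
    # so the bins - 1 inner cut points lie between them
    inner = quantiles(y, n=bins, method="inclusive")
    return [min(y)] + inner + [max(y)]


def to_bin_labels(y, edges):
    """Map each label to the index of the bin it falls in (0 .. len(edges) - 2)."""
    last = len(edges) - 2
    # clamp so that values on or beyond the outer edges land in the outer bins
    return [min(max(bisect_right(edges, v) - 1, 0), last) for v in y]


def bin_masses(proba_row, classes, n_bins):
    """Spread one row of class probabilities over all bins.

    Bins with no training samples are absent from `classes` and get mass 0.
    """
    masses = [0.0] * n_bins
    for c, p in zip(classes, proba_row):
        masses[c] = p
    return masses
```

With an sklearn classifier, the labels from `to_bin_labels` would be the targets passed to `clf.fit`, and `clf.classes_` together with one row of `clf.predict_proba(X)` would be the inputs to `bin_masses`. The resulting masses and the edges are what a histogram distribution needs.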

One could also think about another algorithm where the bins are cumulative, i.e., the i-th class is the event of falling anywhere between the lowest point and the i-th bin. This is also valid, but one needs to be careful that the resulting cdf is monotonic. This could be a choice of strategy.
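
To illustrate the monotonicity concern: if the cumulative probabilities are estimated separately, nothing forces them to be non-decreasing. One simple repair, shown here purely as an assumption and not as the intended strategy, is a running maximum followed by differencing. The function name `cdf_to_bin_masses` is hypothetical.

```python
from itertools import accumulate


def cdf_to_bin_masses(cum_probs):
    """Turn raw estimates of P(y <= upper edge of bin i) into bin masses.

    Assumes the last entry belongs to the top edge, so it is forced to 1.
    """
    # clip to [0, 1], then take a running maximum so the cdf never decreases
    clipped = [min(max(p, 0.0), 1.0) for p in cum_probs]
    cdf = list(accumulate(clipped, max))
    cdf[-1] = 1.0
    # bin mass = difference of consecutive cdf values
    return [hi - lo for lo, hi in zip([0.0] + cdf[:-1], cdf)]
```

For example, the raw estimates `[0.25, 0.125, 0.75, 1.0]` dip at the second entry; after the repair the second bin simply receives zero mass.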

FYI @ShreeshaM07, @SaiRevanth25.

@fkiraly fkiraly added good first issue Good for newcomers module:regression probabilistic regression module implementing algorithms Implementing algorithms, estimators, objects native to skpro feature request New feature or request labels Jun 7, 2024
@ShreeshaM07
Contributor

Yes, I think this would be a good thing to implement once #335 is complete and merged.

@ShreeshaM07
Contributor

I will be making the PR for this today. I had a few doubts that need clarification:

  • Since bins is going to represent the number of classes, wouldn't it make more sense to fetch it from the sklearn classifier using the classes_ attribute?
  • How do we take the other parameters of the different sklearn classifiers as input? They differ for each classifier, so do I take them as a kwargs argument from the user?

@fkiraly
Collaborator Author

fkiraly commented Jun 26, 2024

> since bins is going to represent the number of classes wouldn't it make more sense to fetch it from the sklearn classifier using the classes_ attribute?

But that's available only once you've fitted it, which is later than construction. How would that work, logically?

> How do we take input of the other parameters to the different available classifiers in sklearn as they are going to be different for each do I take it as a kwargs argument from the user?

No, you pass the entire classifier instance. As I'm saying above, the parameters are clf, a classifier instance with its own parameters, and bins. I did not state expressly that clf is an instance, but that follows the common sklearn-like composition pattern: you pass instances, not classes, so the instance's parameters are passed along with it.
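
A minimal sketch of what passing an instance means in practice. All names here are hypothetical: `BinnedRegressorSketch` is not the skpro class, and `ToyClassifier` is a stand-in for an sklearn classifier so that the snippet runs without dependencies.

```python
from copy import deepcopy


class BinnedRegressorSketch:
    """Hypothetical reducer: holds a classifier *instance* plus `bins`."""

    def __init__(self, clf, bins=10):
        # store the arguments unchanged; nothing is fitted at construction
        self.clf = clf
        self.bins = bins

    def fit(self, X, labels):
        # fit a copy so the instance passed in by the user stays unfitted;
        # with sklearn one would typically use sklearn.base.clone instead
        self.clf_ = deepcopy(self.clf)
        self.clf_.fit(X, labels)
        return self


class ToyClassifier:
    """Stand-in for an sklearn classifier that carries its own parameter."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, X, y):
        # like in sklearn, classes_ exists only after fit
        self.classes_ = sorted(set(y))
        return self
```

The classifier's own parameters travel with the instance, so the reducer needs no kwargs of its own: `BinnedRegressorSketch(ToyClassifier(smoothing=0.5), bins=4)`.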

@ShreeshaM07
Contributor

> No, you pass the entire classifier instance. As I'm saying above, parameters are clf - a classifier instance with its own parameters - and bins.

Oh, I thought I had to take the input as strings, like I did in the case of statsmodels. If I take the input as an sklearn classifier instance, then that's not an issue at all.

@ShreeshaM07
Contributor

> But that's available only once you've fitted it, which is later than construction. How would that work, logically?

Since we construct the Histogram distribution only when predict_proba is called, the classifier is already fitted at that point. Is that not how we want it?

@fkiraly
Collaborator Author

fkiraly commented Jun 26, 2024

> Oh I thought I had to take input as strings like I did in case of statsmodels. If I take the input as a sklearn classifier instance then thats not an issue at all.

Yes, string inputs are "bad design" when the composition/strategy patterns are a viable alternative. With strings, you always have to add the encoding manually, whereas with composition you can pass any component that is API compliant.

@fkiraly
Collaborator Author

fkiraly commented Jun 26, 2024

> Since we are constructing the Histogram distribution only when we call predict_proba that would mean it is already fitted. Is that not how we want it ?

I think you still need the exact bins because you need to pass them to bins of the histogram distribution - knowing their number is not enough.
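
A small illustration of why the number of bins alone is not enough, assuming the histogram density is piecewise constant, i.e., bin mass divided by bin width. The function name `histogram_density` is hypothetical and not part of any library.

```python
def histogram_density(masses, edges):
    """Piecewise-constant density: bin mass divided by bin width."""
    return [m / (hi - lo) for m, lo, hi in zip(masses, edges[:-1], edges[1:])]


# the same three class probabilities ...
masses = [0.25, 0.5, 0.25]
# ... give different densities depending on where the bin edges lie
even = histogram_density(masses, [0.0, 1.0, 2.0, 3.0])
uneven = histogram_density(masses, [0.0, 0.5, 2.5, 3.0])
```

Both calls use three bins and identical class probabilities, yet the densities differ, so the edges themselves have to be stored at fit time.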
