[ENH] Multiclass classification reduction using Histograms #410
Conversation
from skpro.utils.sklearn import prep_skl_df

class SklearnProbaClassifier(BaseProbaRegressor):
name should be more descriptive
I wasn't very sure what to name it, so I just named it along the lines of SklearnProbaReg. I'll modify the name to make it more descriptive.
skpro/classification/sklearn_classifiers/_sklearn_proba_class.py
X = prep_skl_df(X)
y = prep_skl_df(y)

if isinstance(y, pd.DataFrame) and len(y.columns) == 1:
some binning should happen here
I'll be looking into that next, but I am not quite sure what we want to bin here. Could you please elaborate a little on your thoughts?
The estimator should implement the algorithm from the issue description: #378 (comment)
In _fit, that means:
- bin the regression target y into bins, according to bins. This results in a multioutput target y_binned.
- fit the classifier to X and y_binned.
Sorry to keep asking again, but I want to be clear on the exact idea.
bin the regression target y into bins, according to bins. This results in a multioutput target y_binned.
What I would infer from this is: if y is a dataframe with the output classes, for example

   y
0  "classA"
1  "classB"
2  "classA"
3  "classC"

and bins = 3, then you want to convert this to y_binned, which will look like

   0  1  2    --> bins/bin number corresponding to "classA", "classB", "classC"
0  1  0  0    --> y after binning
1  0  1  0
2  1  0  0
3  0  0  1

This will then be passed to clf.fit(X, y_binned). Is that what your thought is?
If so, then I do not see why we are splitting it up like this, as we can just do clf.fit(X, y), and upon calling clf.predict_proba(val_X) we get the class probabilities of all classes anyway. So what is the exact reason we want to bin this data in fit, and fit using this y_binned?
You can take a look to see what the current predict_proba is doing.
What I would infer from this is: if y is a dataframe with the output classes

I think there is a bit of a miscommunication, sorry for not explaining clearly.
The y arriving in _fit will contain floats, not categories, for instance

   y
0  5
1  2
2  6
3  3

Let's say we decide to bin with bins = [1, 4, 7]; then internally we convert this to

   y_binned
0  "class2"
1  "class1"
2  "class2"
3  "class1"

and to this we fit clf.
In _predict_proba, using clf.predict_proba will produce sth like

   class1  class2
0  0.6     0.4
1  0.3     0.7
2  0.1     0.9

And this we then convert to a 3-times-1 Histogram distribution.
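The reduction described here could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the PR's actual code: the toy data, the variable names, and the choice of LogisticRegression as the inner classifier are all assumptions for demonstration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy regression target with float values, and explicit bin edges (illustrative)
bins = [1, 4, 7]
y = pd.Series([5.0, 2.0, 6.0, 3.0], name="y")
X = pd.DataFrame({"x0": [5.1, 2.2, 5.9, 2.8]})

# _fit step: bin the float target into categorical labels, then fit the classifier
y_binned = pd.cut(y, bins=bins, labels=["class1", "class2"]).astype(str)
clf = LogisticRegression()
clf.fit(X, y_binned)

# _predict_proba step: the predicted class probabilities become the bin masses
# of a histogram distribution over the bin intervals (1, 4] and (4, 7]
proba = clf.predict_proba(X)  # shape (n_samples, n_bins)
```

Each row of proba then parametrizes one per-sample histogram distribution, with the bin edges taken from bins.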
I think having the int option with min/max would enhance the user experience, especially if it also has a default.
Re min/max: why not directly min/max, but 0.999 times etc.?
Re min/max, why not directly min/max, but 0.999 times etc?

That is to include the min(X) and max(X) values in the bins, and not to have them as the bin boundary. It is done in the same way as pandas.cut in the int case:

int: Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
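The pandas.cut behaviour quoted here can be checked directly on toy data (illustrative values only):

```python
import pandas as pd

# with an integer `bins`, pandas.cut widens the range of x by 0.1% on each
# side, so min and max fall strictly inside the first/last bin
s = pd.Series([1.0, 2.0, 3.0, 4.0])
binned = pd.cut(s, bins=3)

# the leftmost interval starts slightly below the minimum value 1.0,
# so the minimum is binned rather than falling on an open edge
left_edge = binned.cat.categories[0].left
```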
I have rethought it, and since I will be using np.arange to create the bins array in the int case, it would be best to take it as np.arange(min(y), max(y)+step, step), where step = (max(y)-min(y))/bins.
I think this is not a good idea: using np.arange where stop - start is an approximate multiple of step leads to a length of the output which is not numerically stable, see the docs of np.arange, which warn exactly of that.
Instead, I would simply use np.linspace with num=bins (in the integer case).
Another useful function to be used later on is np.digitize.
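A possible sketch of the suggested approach (illustrative only, not the PR's code; note that n equal-width bins need n + 1 edges, so num is taken as n_bins + 1 here):

```python
import numpy as np

# toy target and an integer bin count (assumed values for illustration)
y = np.array([5.0, 2.0, 6.0, 3.0])
n_bins = 3

# equal-width edges: np.linspace has a numerically stable output length,
# unlike np.arange with a fractional step
edges = np.linspace(y.min(), y.max(), num=n_bins + 1)

# np.digitize assigns each value the index of its bin (1..n_bins);
# max(y) equals the last edge and would get index n_bins + 1, so clip it
bin_idx = np.clip(np.digitize(y, edges), 1, n_bins)
```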
Yes, I have used np.linspace now; it makes more sense and is more accurate. I will still have to do start = 0.999*min(y) and stop = 1.001*max(y), as pandas.cut, which I am using to bin the y values, does not include the leftmost point of the bin, so I have to increase the domain in order to consider the min(y) value.
@fkiraly, I have completed the classifier and all checks have passed. Let me know if you want any modifications.
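The left-edge issue being worked around can be demonstrated on toy data (an illustrative sketch under assumed values, not the PR's implementation):

```python
import numpy as np
import pandas as pd

# with explicit edges, pandas.cut intervals are right-closed: (a, b],
# so a value equal to the leftmost edge is left unbinned (NaN)
y = pd.Series([2.0, 4.0, 6.0])
edges = np.linspace(y.min(), y.max(), num=4)
naive = pd.cut(y, bins=edges)  # 2.0 falls on the open left edge -> NaN

# widening the range, as discussed, brings min(y) inside the first bin
widened = np.linspace(0.999 * y.min(), 1.001 * y.max(), num=4)
fixed = pd.cut(y, bins=widened)
```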
@@ -159,6 +159,17 @@ Adapters to other interfaces

SklearnProbaReg

Adapters to other classifiers
I would call it "Reduction to probabilistic classification"
Adapters to other classifiers
-----------------------------

.. currentmodule:: skpro.classification.sklearn_classifiers._sklearn_proba_class
the result is a regressor, even if internally it uses a classifier, so it should be part of the regression module.
Similarly, the name of the estimator should not be "classifier", since it implies the estimator is a classifier.
I will think of a name; perhaps you can also try to come up with sth.
class MulticlassSklearnProbaClassifier(BaseProbaRegressor):
    """A multiclass classifier fitting a histogram distribution.
as said above, the resulting estimator is a probabilistic regressor, not a multiclass classifier. It is a probabilistic regressor that fits a histogram distribution by presenting binned outcomes to a probabilistic classifier from sklearn.
Now I don't know how to make this into a short description the best way, I'm just saying the above needs to be adapted.
Looks great!
My main comment is about naming and folder structure - the result is a regressor etc, see above.
I have made some changes to the name and the folder structure; please let me know if this is suitable.
Thx!
non-blocking:
- I think the name is now very clear
- the docstring is also clear - could be improved in formulations, but not blocking
- class_bin_map_ could be described better, i.e., which bin, and what the values are
blocking: the 1D array case is not covered by tests - can be done in get_test_params or separately
Is it possible to check for the 1D case in
yes, but we know that the inputs for
We should also think of the case where the user specifies bins and some
Sorry, I didn't quite get how this would help in testing for the 1D case in
I have handled this and moved y values outside the bin range to the closest bin. I also placed a note in the docstring to let the user know of this.
I meant, there should be a test case in
Ahh, I see. Now that the y values outside the bin range are also handled, it makes all the more sense to include the array case for bins. I hadn't included it earlier, as I did not know the range of y, and that might break the code when the
Kind reminder: are we waiting on something for this merge?
no, not in particular - just in case any other devs want to give it a review
Reference Issues/PRs
fixes #378
What does this implement/fix? Explain your changes.
It adds an adapter to sklearn multiclass classifiers, expressing their predictions as a Histogram probability distribution.
Does your contribution introduce a new dependency? If yes, which one?
Did you add any tests for the change?
No
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges:
- code - fixing a bug, or adding code logic.
- doc - writing or improving documentation or docstrings.
- bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).
- maintenance - CI, test framework, release.
See here for full badge reference.
For new estimators
- docs/source/api_reference/taskname.rst, follow the pattern.
- Examples section.
- python_dependencies tag and ensured dependency isolation, see the estimator dependencies guide.