
[ENH] Multiclass classification reduction using Histograms #410

Merged: 15 commits, Jul 9, 2024

Conversation

ShreeshaM07 (Contributor):

Reference Issues/PRs

fixes #378

What does this implement/fix? Explain your changes.

It adds an adapter to sklearn multiclass classifiers and expresses them as a Histogram probability distribution.

Does your contribution introduce a new dependency? If yes, which one?

Did you add any tests for the change?

No

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

from skpro.utils.sklearn import prep_skl_df


class SklearnProbaClassifier(BaseProbaRegressor):
Collaborator:

name should be more descriptive

Contributor Author:

I wasn't very sure what to name it, so I just named it along the lines of SklearnProbaReg. I'll modify the name to make it more descriptive.

X = prep_skl_df(X)
y = prep_skl_df(y)

if isinstance(y, pd.DataFrame) and len(y.columns) == 1:
Collaborator:

some binning should happen here

Contributor Author:

I'll be looking into that next, but I am not quite sure what we want to bin here. Could you please elaborate a little on your thinking?

Collaborator:

The estimator should implement the algorithm from the issue description: #378 (comment)

In _fit, that means:

  1. bin the regression target y into bins, according to bins. This results in a multioutput target y_binned.
  2. Fit the classifier to X and y_binned.
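The two steps above can be sketched with pandas and sklearn; the variable names and the use of `pd.cut` here are illustrative only, not necessarily the PR's actual implementation:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy regression target (floats), as y arrives in _fit
X = pd.DataFrame({"x": [0.1, 0.9, 0.2, 0.8]})
y = pd.Series([5.0, 2.0, 6.0, 3.0])

# step 1: bin y according to the bin edges -> categorical y_binned
bins = [1, 4, 7]
y_binned = pd.cut(y, bins=bins, labels=["class1", "class2"])

# step 2: fit a probabilistic sklearn classifier to the binned target
clf = LogisticRegression().fit(X, y_binned)
proba = clf.predict_proba(X)  # one column of class probabilities per bin
```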

Contributor Author:

Sorry to keep asking, but I want to be clear on the exact idea.

bin the regression target y into bins, according to bins. This results in a multioutput target y_binned.

What I would infer from this is: if y is a dataframe with the output classes, for example
y
0 "classA"
1 "classB"
2 "classA"
3 "classC"

and bins=3
Then you want to convert this to y_binned which will look like

   0     1     2   --> bins/bin number corresponding to "classA","classB","classC"
0  1     0     0   --> y after binning 
1  0     1     0
2  1     0     0
3  0     0     1 

This will then be passed to clf.fit(X, y_binned). Is that what you had in mind?

If so, I do not see why we are splitting it up like this, as we can just do clf.fit(X, y), and then upon calling clf.predict_proba(val_X) we get the class probabilities of all classes anyway. So what is the exact reason for binning this data in fit and fitting with y_binned?

You can take a look at what the current predict_proba is doing:

[screenshot: output of the current predict_proba, Jun 27, 2024]

fkiraly (Collaborator) commented Jun 27, 2024:

What I would infer from this is
if y is a dataframe with the output classes

I think there is a bit of a miscommunication, sorry for not explaining clearly.

The y arriving in _fit will contain floats, not categories, for instance

       y
0  5
1  2
2  6
3  3

Let's say we decide to bin with bins = [1,4, 7], then internally we convert this to

       y_binned
0  "class2"
1  "class1"
2  "class2"
3  "class1"

to this we fit clf.

In _predict_proba, using clf.predict_proba will produce sth like

  class1    class2
0   0.6      0.4
1   0.3      0.7
2   0.1      0.9

And this we then convert to a 3 × 1 Histogram distribution.
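The predict_proba output above, together with the bin edges, fully determines one histogram distribution per row. A plain-numpy sketch of the conversion (illustrative only, not skpro's actual Histogram API):

```python
import numpy as np

bins = np.array([1.0, 4.0, 7.0])        # edges: 2 bins, (1, 4] and (4, 7]
proba = np.array([[0.6, 0.4],           # clf.predict_proba output,
                  [0.3, 0.7],           # one row per prediction instance
                  [0.1, 0.9]])

widths = np.diff(bins)                  # bin widths: [3., 3.]
density = proba / widths                # per-row histogram density

# each row i defines one histogram distribution with bin masses proba[i];
# e.g. the expected value of each row's histogram:
midpoints = (bins[:-1] + bins[1:]) / 2  # [2.5, 5.5]
means = proba @ midpoints
```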

Collaborator:

I think having the int option with min/max would enhance user experience, especially if it also has a default.

Re min/max: why not use min/max directly, rather than 0.999 times, etc.?

ShreeshaM07 (Contributor Author) commented Jun 30, 2024:

Re min/max, why not directly min/max, but 0.999 times etc?

That is to include the min and max values inside the bins rather than having them sit on the bin boundary. It is done the same way as pandas.cut in the int case.

int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

Contributor Author:

I have rethought it, and since I will be using np.arange to create the bins array in the int case, it would be best to take it as np.arange(min(y), max(y)+step, step) where step = (max(y)-min(y))/bins.

Collaborator:

I think this is not a good idea: using np.arange where stop - start is an approximate multiple of step leads to an output whose length is not numerically stable; see the docs of np.arange, which warn of exactly that.

Instead, I would simply use np.linspace with num=bins (in the integer case).

Another useful function to be used later on is np.digitize.
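For illustration, a sketch of the integer case using both suggested functions (note that n equal-width bins need n + 1 edges from np.linspace; names here are illustrative):

```python
import numpy as np

y = np.array([5.0, 2.0, 6.0, 3.0])
n_bins = 3

# equal-width edges covering the range of y; np.linspace returns exactly
# n_bins + 1 points, unlike np.arange, whose output length can be unstable
edges = np.linspace(y.min(), y.max(), num=n_bins + 1)

# np.digitize assigns each y value to a bin; passing only the interior
# edges yields 0-based bin indices in 0 .. n_bins - 1
idx = np.digitize(y, edges[1:-1])
```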

Contributor Author:

Yes, I have used np.linspace now; it makes more sense and is more accurate. I will still have to do start = 0.999*min(y) and stop = 1.001*max(y), as pandas.cut, which I am using to bin the y values, does not include the leftmost point of the bin, so I have to widen the range in order to include the min(y) value.
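A sketch of that approach, assuming pandas.cut with its default right-closed intervals (labels and names illustrative):

```python
import numpy as np
import pandas as pd

y = pd.Series([5.0, 2.0, 6.0, 3.0])
n_bins = 3

# widen the range slightly: pandas.cut intervals are open on the left,
# so min(y) would otherwise fall outside the first bin
edges = np.linspace(0.999 * y.min(), 1.001 * y.max(), num=n_bins + 1)
y_binned = pd.cut(y, bins=edges, labels=[0, 1, 2])
```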

@fkiraly fkiraly added labels enhancement and implementing algorithms (Implementing algorithms, estimators, objects native to skpro) Jun 30, 2024
@ShreeshaM07 ShreeshaM07 marked this pull request as ready for review July 1, 2024 11:29
ShreeshaM07 (Contributor Author):

@fkiraly , I have completed the classifier and all checks have passed. Let me know if you want any modifications.

@@ -159,6 +159,17 @@ Adapters to other interfaces

SklearnProbaReg

Adapters to other classifiers
Collaborator:

I would call it "Reduction to probabilistic classification"

Adapters to other classifiers
----------------------------

.. currentmodule:: skpro.classification.sklearn_classifiers._sklearn_proba_class
Collaborator:

the result is a regressor, even if internally it uses a classifier, so it should be part of the regression module.

Similarly, the name of the estimator should not contain "classifier", since that implies the estimator is a classifier.

I will think of a name; perhaps you can also try to come up with something.



class MulticlassSklearnProbaClassifier(BaseProbaRegressor):
"""A multiclass classifier fitting a histogram distribution.
Collaborator:

as said above, the resulting estimator is a probabilistic regressor, not a multiclass classifier. It is a probabilistic regressor that fits a histogram distribution by presenting binned outcomes to a probabilistic classifier from sklearn.

Now I don't know how to make this into a short description the best way, I'm just saying the above needs to be adapted.

fkiraly (Collaborator) left a review comment:

Looks great!

My main comment is about naming and folder structure - the result is a regressor etc, see above.

ShreeshaM07 (Contributor Author):

I have made some changes to the name and the folder structure; please let me know if this is suitable.

fkiraly (Collaborator) left a review comment:

Thx!

non-blocking

  • I think the name is now very clear
  • docstring is also clear - could be improved in formuations but not blocking
  • class_bin_map_ could be described better, i.e., which bin it refers to and what the values are

blocking: the 1D array case is not covered by tests, can be done in get_test_params or separately

ShreeshaM07 (Contributor Author):

blocking: the 1D array case is not covered by tests, can be done in get_test_params or separately

Is it possible to check for the 1D case in get_test_params itself? As I will not be able to modify any inputs for predict and predict_proba other than the classifier instance and the number of bins.

fkiraly (Collaborator) commented Jul 3, 2024:

yes, but we know that the inputs for y are numbers, so we should be able to come up with a reasonable range.

We should also think of the case where the user specifies bins and some y values fall outside; in this case, I think the estimator should not break but do something sensible (move to the closest bin, or discard, perhaps even as an option on what to do).

ShreeshaM07 (Contributor Author) commented Jul 4, 2024:

yes, but we know that the inputs for y are numbers, so we should be able to come up with a reasonable range.

Sorry, I didn't quite get how this would help in testing the 1D case in get_test_params. Just to be clear, are we talking about the single Histogram distribution case, produced when the input X for predict and predict_proba is of shape (1, X.cols), or this, by any chance?

        elif len(y_binned.shape) > 1 and y_binned.shape[1] == 1:
            y_binned = y_binned[:, 0]

We should also think of the case where the user specifies bins and some y values fall outside; in this case, I think the estimator should not break but do something sensible (move to the closest bin, or discard, perhaps even as an option on what to do).

I have handled this and moved y values outside the bin range to the closest bin. I have also placed a note in the docstring to let the user know of this.
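One minimal way to implement "move to the closest bin" is to clip y into the bin range before assigning bins; this is a sketch, not necessarily the PR's exact code:

```python
import numpy as np

bins = np.array([1.0, 4.0, 7.0])          # user-specified bin edges
y = np.array([0.5, 5.0, 9.0])             # 0.5 and 9.0 fall outside the range

# clipping maps out-of-range values onto the nearest edge, so they land
# in the first or last bin instead of breaking the estimator
y_clipped = np.clip(y, bins[0], bins[-1])
idx = np.digitize(y_clipped, bins[1:-1])  # 0-based bin index per value
```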

fkiraly (Collaborator) commented Jul 4, 2024:

Sorry I didn't quite get how this would help in testing for 1D case in get_test_params.

I meant that there should be a test case in get_test_params where the bins passed are a 1D array; currently there isn't one.

ShreeshaM07 (Contributor Author):

Sorry I didn't quite get how this would help in testing for 1D case in get_test_params.

I meant that there should be a test case in get_test_params where the bins passed are a 1D array; currently there isn't one.

Ahh, I see. Now that y values outside the bin range are also handled, it makes all the more sense to include the array case for bins. I hadn't included it earlier as I did not know the range of y, and that might break the code when the bins are predetermined.
Thanks for noticing.
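A hypothetical get_test_params covering both the int and the 1D-array bins case might look like this (parameter names are illustrative, not necessarily the estimator's exact signature):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def get_test_params():
    """Return test parameter sets: one with int bins, one with 1D-array bins."""
    return [
        # integer case: number of equal-width bins
        {"clf": RandomForestClassifier(n_estimators=10), "bins": 5},
        # 1D-array case: explicit bin edges; safe now that out-of-range
        # y values are moved to the closest bin
        {
            "clf": RandomForestClassifier(n_estimators=10),
            "bins": np.array([0.0, 2.5, 5.0, 7.5, 10.0]),
        },
    ]
```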

ShreeshaM07 (Contributor Author) commented Jul 8, 2024:

Kind reminder: are we waiting on anything for this merge?

fkiraly (Collaborator) commented Jul 9, 2024:

no, nothing in particular, just in case any other devs want to give it a review

@fkiraly fkiraly merged commit d3edf9f into sktime:main Jul 9, 2024
36 checks passed
Successfully merging this pull request may close these issues.

[ENH] proba regression: reduction to multiclass classification