
[ENH] Multiclass classification reduction using Histograms #410

Merged: 15 commits, Jul 9, 2024

Conversation

ShreeshaM07 (Contributor):

Reference Issues/PRs

fixes #378

What does this implement/fix? Explain your changes.

It adds an adapter to sklearn multiclass classifiers and expresses them as a Histogram probability distribution.

Does your contribution introduce a new dependency? If yes, which one?

Did you add any tests for the change?

No

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

from skpro.utils.sklearn import prep_skl_df


class SklearnProbaClassifier(BaseProbaRegressor):
Collaborator:

name should be more descriptive

Contributor Author:

I wasn't very sure what to name it, so I just named it along the lines of SklearnProbaReg. I'll modify the name to make it more descriptive.

X = prep_skl_df(X)
y = prep_skl_df(y)

if isinstance(y, pd.DataFrame) and len(y.columns) == 1:
Collaborator:

some binning should happen here

Contributor Author:

I'll be looking into that next, but I am not quite sure what we want to bin here. Could you please elaborate a little on your thinking?

Collaborator:

The estimator should implement the algorithm from the issue description: #378 (comment)

In _fit, that means:

  1. bin the regression target y into bins, according to bins. This results in a multioutput target y_binned.
  2. Fit the classifier to X and y_binned.
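The two steps above can be sketched with pandas and sklearn; the variable names and the use of `pd.cut` here are illustrative only, not necessarily the PR's actual implementation:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy regression target (floats), as y arrives in _fit
X = pd.DataFrame({"x": [0.1, 0.9, 0.2, 0.8]})
y = pd.Series([5.0, 2.0, 6.0, 3.0])

# step 1: bin y according to the bin edges -> categorical y_binned
bins = [1, 4, 7]
y_binned = pd.cut(y, bins=bins, labels=["class1", "class2"])

# step 2: fit a probabilistic sklearn classifier to the binned target
clf = LogisticRegression().fit(X, y_binned)
proba = clf.predict_proba(X)  # one column of class probabilities per bin
```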

Contributor Author:

Sorry to keep asking, but I want to be clear on the exact idea.

bin the regression target y into bins, according to bins. This results in a multioutput target y_binned.

What I would infer from this is: if y is a dataframe with the output classes, for example
y
0 "classA"
1 "classB"
2 "classA"
3 "classC"

and bins=3
Then you want to convert this to y_binned which will look like

   0     1     2   --> bins/bin number corresponding to "classA","classB","classC"
0  1     0     0   --> y after binning 
1  0     1     0
2  1     0     0
3  0     0     1 

This will then be passed to clf.fit(X, y_binned). Is that what you had in mind?

If so, I do not see why we are splitting it up like this, as we can just do clf.fit(X, y), and then upon calling clf.predict_proba(val_X) we get the class probabilities of all classes anyway. So what is the exact reason for binning this data in fit and fitting with y_binned?

You can take a look at what the current predict_proba is doing:

[screenshot: output of the current predict_proba, Jun 27, 2024]

fkiraly (Collaborator) commented Jun 27, 2024:

What I would infer from this is
if y is a dataframe with the output classes

I think there is a bit of a miscommunication, sorry for not explaining clearly.

The y arriving in _fit will contain floats, not categories, for instance

       y
0  5
1  2
2  6
3  3

Let's say we decide to bin with bins = [1,4, 7], then internally we convert this to

       y_binned
0  "class2"
1  "class1"
2  "class2"
3  "class1"

to this we fit clf.

In _predict_proba, using clf.predict_proba will produce sth like

  class1    class2
0   0.6      0.4
1   0.3      0.7
2   0.1      0.9

And this we then convert to a 3 × 1 Histogram distribution.
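The predict_proba output above, together with the bin edges, fully determines one histogram distribution per row. A plain-numpy sketch of the conversion (illustrative only, not skpro's actual Histogram API):

```python
import numpy as np

bins = np.array([1.0, 4.0, 7.0])        # edges: 2 bins, (1, 4] and (4, 7]
proba = np.array([[0.6, 0.4],           # clf.predict_proba output,
                  [0.3, 0.7],           # one row per prediction instance
                  [0.1, 0.9]])

widths = np.diff(bins)                  # bin widths: [3., 3.]
density = proba / widths                # per-row histogram density

# each row i defines one histogram distribution with bin masses proba[i];
# e.g. the expected value of each row's histogram:
midpoints = (bins[:-1] + bins[1:]) / 2  # [2.5, 5.5]
means = proba @ midpoints
```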

Collaborator:

I think having the int option with min/max would enhance user experience, especially if it also has a default.

Re min/max: why not use min/max directly, rather than 0.999 times, etc.?

ShreeshaM07 (Contributor Author) commented Jun 30, 2024:

Re min/max, why not directly min/max, but 0.999 times etc?

That is to include the min and max values inside the bins rather than having them sit on the bin boundary. It is done the same way as pandas.cut in the int case.

int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

Contributor Author:

I have rethought it, and since I will be using np.arange to create the bins array in the int case, it would be best to take it as np.arange(min(y), max(y)+step, step) where step = (max(y)-min(y))/bins.

Collaborator:

I think this is not a good idea: using np.arange where stop - start is an approximate multiple of step leads to an output whose length is not numerically stable; see the docs of np.arange, which warn of exactly that.

Instead, I would simply use np.linspace with num=bins (in the integer case).

Another useful function to be used later on is np.digitize.
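For illustration, a sketch of the integer case using both suggested functions (note that n equal-width bins need n + 1 edges from np.linspace; names here are illustrative):

```python
import numpy as np

y = np.array([5.0, 2.0, 6.0, 3.0])
n_bins = 3

# equal-width edges covering the range of y; np.linspace returns exactly
# n_bins + 1 points, unlike np.arange, whose output length can be unstable
edges = np.linspace(y.min(), y.max(), num=n_bins + 1)

# np.digitize assigns each y value to a bin; passing only the interior
# edges yields 0-based bin indices in 0 .. n_bins - 1
idx = np.digitize(y, edges[1:-1])
```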

Contributor Author:

Yes, I have used np.linspace now; it makes more sense and is more accurate. I will still have to do start = 0.999*min(y) and stop = 1.001*max(y), as pandas.cut, which I am using to bin the y values, does not include the leftmost point of the bin, so I have to widen the range in order to include the min(y) value.
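A sketch of that approach, assuming pandas.cut with its default right-closed intervals (labels and names illustrative):

```python
import numpy as np
import pandas as pd

y = pd.Series([5.0, 2.0, 6.0, 3.0])
n_bins = 3

# widen the range slightly: pandas.cut intervals are open on the left,
# so min(y) would otherwise fall outside the first bin
edges = np.linspace(0.999 * y.min(), 1.001 * y.max(), num=n_bins + 1)
y_binned = pd.cut(y, bins=edges, labels=[0, 1, 2])
```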

@fkiraly fkiraly added labels enhancement and implementing algorithms (Implementing algorithms, estimators, objects native to skpro) Jun 30, 2024
@ShreeshaM07 ShreeshaM07 marked this pull request as ready for review July 1, 2024 11:29
ShreeshaM07 (Contributor Author):

@fkiraly , I have completed the classifier and all checks have passed. Let me know if you want any modifications.

@@ -159,6 +159,17 @@ Adapters to other interfaces

SklearnProbaReg

Adapters to other classifiers
Collaborator:

I would call it "Reduction to probabilistic classification"

Adapters to other classifiers
----------------------------

.. currentmodule:: skpro.classification.sklearn_classifiers._sklearn_proba_class
Collaborator:

the result is a regressor, even if internally it uses a classifier, so it should be part of the regression module.

Similarly, the name of the estimator should not contain "classifier", since that implies the estimator is a classifier.

I will think of a name; perhaps you can also try to come up with something.



class MulticlassSklearnProbaClassifier(BaseProbaRegressor):
"""A multiclass classifier fitting a histogram distribution.
Collaborator:

as said above, the resulting estimator is a probabilistic regressor, not a multiclass classifier. It is a probabilistic regressor that fits a histogram distribution by presenting binned outcomes to a probabilistic classifier from sklearn.

Now I don't know how to make this into a short description the best way, I'm just saying the above needs to be adapted.

fkiraly (Collaborator) left a review comment:

Looks great!

My main comment is about naming and folder structure - the result is a regressor etc, see above.

ShreeshaM07 (Contributor Author):

I have made some changes to the name and the folder structure; please let me know if this is suitable.

fkiraly (Collaborator) left a review comment:

Thx!

non-blocking

  • I think the name is now very clear
  • docstring is also clear - could be improved in formuations but not blocking
  • class_bin_map_ could be described better, i.e., which bin it refers to and what the values are

blocking: the 1D array case is not covered by tests, can be done in get_test_params or separately

ShreeshaM07 (Contributor Author):

blocking: the 1D array case is not covered by tests, can be done in get_test_params or separately

Is it possible to check for the 1D case in get_test_params itself? As I will not be able to modify any inputs for predict and predict_proba other than the classifier instance and the number of bins.

fkiraly (Collaborator) commented Jul 3, 2024:

yes, but we know that the inputs for y are numbers, so we should be able to come up with a reasonable range.

We should also think of the case where the user specifies bins and some y values fall outside; in this case, I think the estimator should not break but do something sensible (move to the closest bin, or discard, perhaps even as an option on what to do).

ShreeshaM07 (Contributor Author) commented Jul 4, 2024:

yes, but we know that the inputs for y are numbers, so we should be able to come up with a reasonable range.

Sorry, I didn't quite get how this would help in testing the 1D case in get_test_params. Just to be clear, are we talking about the single Histogram distribution case, produced when the input X for predict and predict_proba is of shape (1, X.cols), or this, by any chance?

        elif len(y_binned.shape) > 1 and y_binned.shape[1] == 1:
            y_binned = y_binned[:, 0]

We should also think of the case where the user specifies bins and some y values fall outside; in this case, I think the estimator should not break but do something sensible (move to the closest bin, or discard, perhaps even as an option on what to do).

I have handled this and moved y values outside the bin range to the closest bin. I have also placed a note in the docstring to let the user know of this.
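One minimal way to implement "move to the closest bin" is to clip y into the bin range before assigning bins; this is a sketch, not necessarily the PR's exact code:

```python
import numpy as np

bins = np.array([1.0, 4.0, 7.0])          # user-specified bin edges
y = np.array([0.5, 5.0, 9.0])             # 0.5 and 9.0 fall outside the range

# clipping maps out-of-range values onto the nearest edge, so they land
# in the first or last bin instead of breaking the estimator
y_clipped = np.clip(y, bins[0], bins[-1])
idx = np.digitize(y_clipped, bins[1:-1])  # 0-based bin index per value
```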

fkiraly (Collaborator) commented Jul 4, 2024:

Sorry I didn't quite get how this would help in testing for 1D case in get_test_params.

I meant that there should be a test case in get_test_params where the bins passed are a 1D array; currently there isn't one.

ShreeshaM07 (Contributor Author):

Sorry I didn't quite get how this would help in testing for 1D case in get_test_params.

I meant that there should be a test case in get_test_params where the bins passed are a 1D array; currently there isn't one.

Ahh, I see. Now that y values outside the bin range are also handled, it makes all the more sense to include the array case for bins. I hadn't included it earlier as I did not know the range of y, and that might break the code when the bins are predetermined.
Thanks for noticing.
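A hypothetical get_test_params covering both the int and the 1D-array bins case might look like this (parameter names are illustrative, not necessarily the estimator's exact signature):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def get_test_params():
    """Return test parameter sets: one with int bins, one with 1D-array bins."""
    return [
        # integer case: number of equal-width bins
        {"clf": RandomForestClassifier(n_estimators=10), "bins": 5},
        # 1D-array case: explicit bin edges; safe now that out-of-range
        # y values are moved to the closest bin
        {
            "clf": RandomForestClassifier(n_estimators=10),
            "bins": np.array([0.0, 2.5, 5.0, 7.5, 10.0]),
        },
    ]
```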

ShreeshaM07 (Contributor Author) commented Jul 8, 2024:

Kind reminder: are we waiting on anything for this merge?

fkiraly (Collaborator) commented Jul 9, 2024:

no, nothing in particular, just in case any other devs want to give it a review

@fkiraly fkiraly merged commit d3edf9f into sktime:main Jul 9, 2024
36 checks passed
Successfully merging this pull request may close these issues.

[ENH] proba regression: reduction to multiclass classification