
[ENH] Multiclass classification reduction using Histograms #410

Merged
merged 15 commits on Jul 9, 2024
11 changes: 11 additions & 0 deletions docs/source/api_reference/regression.rst
@@ -159,6 +159,17 @@ Adapters to other interfaces

SklearnProbaReg

Adapters to other classifiers
Collaborator:

I would call it "Reduction to probabilistic classification"

----------------------------

.. currentmodule:: skpro.classification.sklearn_classifiers._sklearn_proba_class
Collaborator:

the result is a regressor, even if internally it uses a classifier, so it should be part of the regression module.

Similarly, the name of the estimator should not be "classifier", since it implies the estimator is a classifier.

I will think of a name, perhaps you can also try to come up with something.


.. autosummary::
:toctree: auto_generated/
:template: class.rst

MulticlassSklearnProbaClassifier


Base classes
------------
1 change: 1 addition & 0 deletions skpro/classification/__init__.py
@@ -0,0 +1 @@
"""Probabilistic Multiclass Classification."""
8 changes: 8 additions & 0 deletions skpro/classification/sklearn_classifiers/__init__.py
@@ -0,0 +1,8 @@
"""Adapters for probabilistic classifiers, towards sklearn."""
# copyright: skpro developers, BSD-3-Clause License (see LICENSE file)

from skpro.classification.sklearn_classifiers._sklearn_proba_class import (
MulticlassSklearnProbaClassifier,
)

__all__ = ["MulticlassSklearnProbaClassifier"]
251 changes: 251 additions & 0 deletions skpro/classification/sklearn_classifiers/_sklearn_proba_class.py
@@ -0,0 +1,251 @@
"""Adapter to sklearn multiclass classification using Histograms."""

__author__ = ["ShreeshaM07"]

import numpy as np
import pandas as pd

from skpro.distributions import Histogram
from skpro.regression.base import BaseProbaRegressor
from skpro.utils.sklearn import prep_skl_df


class MulticlassSklearnProbaClassifier(BaseProbaRegressor):
"""A multiclass classifier fitting a histogram distribution.
Collaborator:

as said above, the resulting estimator is a probabilistic regressor, not a multiclass classifier. It is a probabilistic regressor that fits a histogram distribution by presenting binned outcomes to a probabilistic classifier from sklearn.

Now I don't know the best way to make this into a short description; I'm just saying the above needs to be adapted.


Wraps an sklearn classifier and constructs an skpro classifier from it.
It can be used to predict the class that each row of X belongs to, and
``predict_proba`` returns the predicted probabilities of each class for
the respective rows of X in the form of a ``Histogram`` distribution.

``bins`` is used to bin the ``y`` values in ``fit`` into the respective
bins; these bins then serve as the classes for the classifier, which
predicts the probabilities for each class.

Parameters
----------
clf : instance of a sklearn classifier
Classifier to wrap, must have ``predict`` and ``predict_proba``.
bins : int or 1D array of float, default: 10
1. If ``int``, it is interpreted as the number of bins.
2. If an array, it is used as the bin boundaries.
If ``n`` bins are required, ``len(bins)`` must be ``n + 1``.

Attributes
----------
classes_ : np.array
Contains the names of the classes the estimator was fit on.
class_bin_map_ : dict
Maps each class name (key) to its bin boundaries
[bin start, bin end] (value).
classes_proba_ : pd.DataFrame
Contains the class probabilities.
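
Examples
--------
A minimal usage sketch, assuming a standard sklearn classifier
such as ``DecisionTreeClassifier``:

>>> import pandas as pd
>>> from sklearn.tree import DecisionTreeClassifier
>>> from skpro.classification.sklearn_classifiers import (
...     MulticlassSklearnProbaClassifier,
... )
>>> X = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
>>> y = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0]})
>>> clf = MulticlassSklearnProbaClassifier(DecisionTreeClassifier(), bins=2)
>>> clf = clf.fit(X, y)
>>> y_pred = clf.predict(X)  # predicted bin labels, same columns as y
>>> y_proba = clf.predict_proba(X)  # Histogram distribution over the bins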
"""

_tags = {
"authors": ["ShreeshaM07"],
"maintainers": ["ShreeshaM07"],
"capability:multioutput": False,
"capability:missing": True,
}

def __init__(self, clf, bins=10):
self.clf = clf
self.bins = bins

super().__init__()

def _bins_int_arr(self, bins, y):
y = np.array(y).flatten()
# pad the range slightly on both sides so the extreme y values fall
# strictly inside the outer bins in ``pd.cut`` (a multiplicative
# factor would mis-handle negative or zero values)
span = np.max(y) - np.min(y)
pad = 0.001 * span if span > 0 else 0.001
start = np.min(y) - pad
stop = np.max(y) + pad
bins = np.linspace(start=start, stop=stop, num=bins + 1)
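# e.g. (illustrative): bins=4 with y spanning [2.0, 10.0] gives pad
# 0.008 and edges [1.992, 3.996, 6.0, 8.004, 10.008]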
return bins

def _fit(self, X, y):
"""Fit regressor to training data.

Writes to self:
Sets fitted model attributes ending in "_".

Parameters
----------
X : pandas DataFrame
feature instances to fit regressor to
y : pandas DataFrame, must be same length as X
labels to fit regressor to

Returns
-------
self : reference to self
"""
from warnings import warn

from numpy.lib.stride_tricks import sliding_window_view
from sklearn import clone

self.clf_ = clone(self.clf)
bins = self.bins
self._y_cols = y.columns

# create bins array in case of bins being an `int`
if isinstance(bins, (int, np.integer)):
bins = self._bins_int_arr(bins, y)

# in case of int it will be internally replaced in fit
self._bins = bins

# Generate class names based on bins
class_bins = [f"class{i}" for i in range(len(bins) - 1)]
self._class_bins = class_bins

if len(bins) != len(class_bins) + 1:
warn(
f"len of `bins` is {len(bins)} != len of classes {len(class_bins)}+1."
" Ensure the bins has all the bin boundaries resulting in"
" number of bins + 1 elements."
)

bins_hist = sliding_window_view(bins, window_shape=2)
# maps each class name to its bin boundaries [bin start, bin end]
class_bin_map_ = dict(zip(class_bins, bins_hist))
self.class_bin_map_ = class_bin_map_
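# e.g. (illustrative): bins [0., 2.5, 5.] yields
# {"class0": [0., 2.5], "class1": [2.5, 5.]}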

# bins the y values into classes.
class_series = pd.cut(y.iloc[:, 0], bins=bins, labels=class_bins, right=True)
y_binned = pd.DataFrame(class_series, columns=self._y_cols)
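# e.g. (illustrative): with bins [0., 2.5, 5.], y values 1.0 and 3.0
# land in "class0" and "class1" respectively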

X = prep_skl_df(X)
y_binned = prep_skl_df(y_binned)

if isinstance(y_binned, pd.DataFrame) and len(y_binned.columns) == 1:
y_binned = y_binned.iloc[:, 0]
elif len(y_binned.shape) > 1 and y_binned.shape[1] == 1:
y_binned = y_binned[:, 0]

self.clf_.fit(X, y_binned)
self.classes_ = self.clf_.classes_

return self

def _predict(self, X):
"""Predict labels for data from features.

State required:
Requires state to be "fitted" = self.is_fitted=True

Accesses in self:
Fitted model attributes ending in "_"

Parameters
----------
X : pandas DataFrame, must have same columns as X in `fit`
data to predict labels for

Returns
-------
y : pandas DataFrame, same length as `X`, same columns as `y` in `fit`
labels predicted for `X`
"""
X = prep_skl_df(X)
y_pred = self.clf_.predict(X)
y_pred_df = pd.DataFrame(y_pred, index=X.index, columns=self._y_cols)
return y_pred_df

def _predict_proba(self, X):
"""Predict distribution over labels for data from features.

State required:
Requires state to be "fitted".

Accesses in self:
Fitted model attributes ending in "_"

Parameters
----------
X : pandas DataFrame, must have same columns as X in `fit`
data to predict labels for

Returns
-------
y : skpro BaseDistribution, same length as `X`
labels predicted for `X`
"""
from warnings import warn

X = prep_skl_df(X)
bins = self._bins
classes_ = self.classes_
class_bins = self._class_bins

if len(bins) != len(class_bins) + 1:
warn(
f"len of `bins` is {len(bins)} != len of classes {len(class_bins)}+1."
" Ensure the bins has all the bin boundaries resulting in"
" number of bins + 1 elements."
)

y_pred_proba = self.clf_.predict_proba(X)
# map classes probabilities/bin_mass to class names
classes_proba_ = pd.DataFrame(y_pred_proba, columns=classes_)

# Identify missing classes
missing_classes = set(class_bins) - set(classes_)
if missing_classes:
# Add missing classes with 0 values
for missing_class in missing_classes:
classes_proba_[missing_class] = 0
# Sort columns based on the numerical part of the class names
# in order to match with the bins while calling Histogram distribution
classes_proba_ = classes_proba_.reindex(
sorted(classes_proba_.columns, key=lambda x: int(x[5:])), axis=1
)
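# e.g. (illustrative): if fit saw y values only in the first and third
# of three bins, "class1" is absent from classes_ and is added here
# with probability 0 for every row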

self.classes_proba_ = classes_proba_
y_pred_proba = np.array(classes_proba_)

if len(X) == 1:
bin_mass = y_pred_proba[0]
pred_proba = Histogram(bins=bins, bin_mass=bin_mass)
return pred_proba

# convert to 3D: bin_mass passed to Histogram below has shape
# (n_rows, 1, n_bins)
bin_mass = np.array([y_pred_proba])
# swap axes and reshape so each row gets its own (1, n_bins) mass block
bin_mass = bin_mass.swapaxes(0, 1).reshape(-1, 1, bin_mass.shape[-1])

bins = np.array([[bins]] * len(X))
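# shape walkthrough (illustrative): with 3 rows and 4 bins, y_pred_proba
# is (3, 4); np.array([...]) makes it (1, 3, 4), swapaxes(0, 1) gives
# (3, 1, 4), and bins is tiled to (3, 1, 5) to match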

pred_proba = Histogram(
bins=bins, bin_mass=bin_mass, index=X.index, columns=self._y_cols
)
return pred_proba

@classmethod
def get_test_params(cls, parameter_set="default"):
"""Return testing parameter settings for the estimator.

Parameters
----------
parameter_set : str, default="default"
Name of the set of test parameters to return, for use in tests. If no
special parameters are defined for a value, will return `"default"` set.

Returns
-------
params : dict or list of dict, default = {}
Parameters to create testing instances of the class
Each dict contains parameters to construct an "interesting" test instance, i.e.,
`MyClass(**params)` or `MyClass(**params[i])` creates a valid test instance.
`create_test_instance` uses the first (or only) dictionary in `params`
"""
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

param1 = {"clf": DecisionTreeClassifier(), "bins": 4}
param2 = {"clf": GaussianNB()}

return [param1, param2]